stevedore-uploader 1.0.0-java

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+   metadata.gz: ea9762ea02f37f5e56993854cb08ae63f21f7e1e
+   data.tar.gz: f525070b72ea4f56e44096de00ec4789a059b042
+ SHA512:
+   metadata.gz: 1e6cd4cbae534a1bd67091daea696c0cb33daf5dc94c740be2b6f6ce0fbf23195c02d05adfbe00d618e6b36f3bd890e8223fa7c8b7f02a3644aaa4e988b6d1b4
+   data.tar.gz: d4b4971924a013b3bf7a64b375a6e2fcf5246fa29a975099415c1128a0d0ebbb9d0526776d98f8f81e79ffbe6dea21e1f6eeacf713485785c9f8f4b8f62f749b
data/README.md ADDED
@@ -0,0 +1,62 @@
+ stevedore-uploader
+ ==================
+
+ A tool for uploading documents into [Stevedore](https://github.com/newsdev/stevedore), a flexible, extensible search engine for document dumps created by The New York Times.
+
+ Stevedore is essentially an ElasticSearch endpoint with a customizable frontend attached. Since ElasticSearch is Stevedore's primary document store, `stevedore-uploader`'s main task is simply uploading documents to ElasticSearch, along with a few attributes that Stevedore depends on. Getting a new document set ready for search takes a few steps, but this tool handles the hardest one: converting the documents you want to search into a format that ElasticSearch understands. Customizing the search interface is often unnecessary, but if you need to, the [Stevedore](https://github.com/newsdev/stevedore) repository explains how.
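+
+ For illustration, here's roughly the shape of the document the uploader sends to ElasticSearch for each file -- a minimal sketch based on `StevedoreBlob#to_hash` below, with made-up values:
+
+ ```ruby
+ {
+   "sha1"       => "fe5dbbcea5ce7e2988b8c69bcfdfde8904aabc1f", # SHA1 of the download URL; used as the ES document id
+   "title"      => "some-document.pdf",
+   "source_url" => "https://my-bucket.s3.amazonaws.com/my-index/some-document.pdf",
+   "file"       => { "title" => "some-document.pdf", "file" => "the extracted plain text..." },
+   "analyzed"   => {
+     "body"     => "the extracted plain text...",
+     "metadata" => { "Content-Type" => "application/pdf" }
+   },
+   "_updatedAt" => Time.now
+ }
+ ```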
+
+ Every document processing job is different. Some might require OCR, others might require parsing e-mails, still others might call for sophisticated processing of text documents. There's no telling. That being the case, this project tries to make no assumptions about the type of data you'll be uploading -- but by default it tries to convert everything into plaintext with [Apache Tika](https://tika.apache.org/). A `case` statement in `lib/stevedore-uploader.rb` distinguishes between a few default types, like e-mails and text blobs (PRs adding new ones would be appreciated); for specialized types, the `do!` method takes a block that lets you modify each document with just a few lines of Ruby (see the example under Usage below).
+
+ For more details on the entire workflow, see [Stevedore](https://github.com/newsdev/stevedore).
+
+ Installation
+ ------------
+
+ This project is written in JRuby, so we can leverage the transformative enterprise stability features of the JVM Java™ Platform and the truly-American Bruce-Springsteen-Born-to-Run freedom-to-roam of Ruby. (And the fact that Tika's in Java.)
+
+ 1. Install JRuby. If you use rbenv, you'd do this:
+    `rbenv install jruby-9.0.5.0` (later versions are okay too)
+ 2. Be sure you're running Java 8. (Java 7 is deprecated, c'mon c'mon.)
+ 3. `bundle install`
+
+ Usage
+ -----
+
+ **This is a piece of a larger upload workflow, [described here](https://github.com/newsdev/stevedore/blob/master/README.md). You should read that first, then come back here.**
+
+ Upload documents from your local disk:
+ ```
+ bundle exec ruby bin/upload_to_elasticsearch.rb --index=INDEXNAME [--host=localhost:9200] [--s3path=name-of-path-under-bucket] path/to/documents/to/parse
+ ```
+ or from S3:
+ ```
+ bundle exec ruby bin/upload_to_elasticsearch.rb --index=INDEXNAME [--host=localhost:9200] s3://my-bucket/path/to/documents/to/parse
+ ```
+
+ If `--host` isn't specified, we assume `localhost:9200`.
+
+ For example:
+ ```
+ bundle exec ruby bin/upload_to_elasticsearch.rb --index=jrubytest --host=https://stevedore.newsdev.net/es/ ~/code/marco-rubios-emails/emls/
+ ```
+
+ You may also specify an S3 location of documents to parse, instead of a local directory, e.g.
+ ```
+ bundle exec ruby bin/upload_to_elasticsearch.rb --index=jrubytest --host=https://stevedore.newsdev.net/es/ s3://int-data-dumps/marco-rubio-fire-drill
+ ```
+ If you choose to process documents from S3, you should upload those documents first with your tool of choice -- `awscli` is a good one. **Stevedore-Uploader does NOT upload documents to S3 on your behalf.**
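+
+ For example, to push a local folder up to S3 before indexing it, something like this works (the bucket and path names here are illustrative):
+ ```
+ aws s3 sync path/to/documents/to/parse s3://my-bucket/path/to/documents/to/parse
+ ```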
+
+ If you need to process documents in a specialized, customized way, follow this example:
+
+     uploader = Stevedore::ESUploader.new(ES_HOST, ES_INDEX, S3_BUCKET, S3_PATH_PREFIX) # S3_BUCKET, S3_PATH_PREFIX are optional
+     uploader.do! FOLDER do |doc, filename, content, metadata|
+       next if doc.nil?
+       doc["analyzed"]["metadata"]["date"] = Date.parse(File.basename(filename).split("_")[-2])
+       doc["analyzed"]["metadata"]["title"] = my_title_getter_function(File.basename(filename))
+     end
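+
+ The block receives each document as the hash that's about to be sent to ElasticSearch (`doc` is `nil` when a file couldn't be parsed, hence the `next`), so anything you set under `doc["analyzed"]["metadata"]` ends up searchable alongside the Tika-extracted metadata. (`my_title_getter_function` is a stand-in for whatever logic fits your documents.)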
+
+ Questions?
+ ==========
+
+ Hit us up in the [Stevedore](https://github.com/newsdev/stevedore) issues, or however suits your fancy.
data/bin/upload_to_elasticsearch.rb ADDED
@@ -0,0 +1,114 @@
+ #!/usr/bin/env jruby
+ # -*- coding: utf-8 -*-
+
+ raise Exception, "You've gotta use JRuby" unless RUBY_PLATFORM == 'java'
+ raise Exception, "You've gotta use Java 1.8; you're on #{java.lang.System.getProperties["java.runtime.version"]}" unless java.lang.System.getProperties["java.runtime.version"] =~ /1\.8/
+
+ require "#{File.expand_path(File.dirname(__FILE__))}/../lib/stevedore-uploader.rb"
+
+ if __FILE__ == $0
+   require 'optparse'
+   require 'ostruct'
+   options = OpenStruct.new
+   options.ocr = true
+
+   op = OptionParser.new("Usage: upload_to_elasticsearch [options] target_(dir_or_csv)") do |opts|
+     opts.on("-hSERVER:PORT", "--host=SERVER:PORT",
+             "The location of the ElasticSearch server") do |host|
+       options.host = host
+     end
+
+     opts.on("-iNAME", "--index=NAME",
+             "A name to use for the ES index (defaults to using the directory name)") do |index|
+       options.index = index
+     end
+
+     opts.on("-sPATH", "--s3path=PATH",
+             "The path under your bucket where these files have been uploaded. (defaults to ES index)"
+            ) do |s3path|
+       options.s3path = s3path
+     end
+     opts.on("-bPATH", "--s3bucket=PATH",
+             "The S3 bucket where these files have already been uploaded (or will be later)."
+            ) do |s3bucket|
+       options.s3bucket = s3bucket
+     end
+
+     opts.on("--title_column=COLNAME",
+             "If the target file is a CSV, the column that contains the title of each row. Integer index or string column name."
+            ) do |title_column|
+       options.title_column = title_column
+     end
+     opts.on("--text_column=COLNAME",
+             "If the target file is a CSV, the column that contains the main, searchable text of each row. Integer index or string column name."
+            ) do |text_column|
+       options.text_column = text_column
+     end
+
+     opts.on("-o", "--[no-]ocr", "Whether to OCR scanned PDFs (defaults to on)") do |v|
+       options.ocr = v
+     end
+
+     opts.on('-?', '--help', 'Display this screen') do
+       puts opts
+       exit
+     end
+   end
+
+   op.parse!
+
+   # to delete an index: curl -X DELETE localhost:9200/indexname/
+   unless ARGV.length == 1
+     puts op
+     exit
+   end
+ end
+
+ # you can provide either a path to files locally or
+ # an S3 endpoint as s3://int-data-dumps/YOURINDEXNAME
+ FOLDER = ARGV.shift
+
+ ES_INDEX = if options.index.nil? || options.index == ''
+   if FOLDER.downcase.include?('s3://')
+     s3_path_without_bucket = FOLDER.gsub(/s3:\/\//i, '').split("/", 2).last
+     s3_path_without_bucket.gsub(/^.+\//, '').gsub(/[^A-Za-z0-9\-_]/, '')
+   else
+     FOLDER.gsub(/^.+\//, '').gsub(/[^A-Za-z0-9\-_]/, '')
+   end
+ else
+   options.index
+ end
+
+ S3_BUCKET = FOLDER.downcase.include?('s3://') ? FOLDER.gsub(/s3:\/\//i, '').split("/", 2).first : options.s3bucket
+ # only enforce the bucket restriction when a bucket was actually specified
+ raise ArgumentError, 's3 buckets other than int-data-dumps aren\'t supported by the frontend yet' if S3_BUCKET && S3_BUCKET != 'int-data-dumps'
+ ES_HOST = options.host || "localhost:9200"
+ S3_PATH = options.s3path || options.index
+ S3_BASEPATH = "https://#{S3_BUCKET}.s3.amazonaws.com/#{S3_PATH}"
+
+ raise ArgumentError, "specify a destination" unless FOLDER
+ raise ArgumentError, "specify the elasticsearch host" unless ES_HOST
+
+ ###############################
+ # actual stuff
+ ###############################
+
+ if __FILE__ == $0
+   # note: the uploader expects the path under the bucket (S3_PATH), not the full base URL
+   f = Stevedore::ESUploader.new(ES_HOST, ES_INDEX, S3_BUCKET, S3_PATH)
+   f.should_ocr = options.ocr
+   puts "Will not OCR, per --no-ocr option" unless f.should_ocr
+
+   if FOLDER.match(/\.[ct]sv$/)
+     f.do_csv!(FOLDER, File.join(f.s3_basepath, File.basename(FOLDER)), options.title_column, options.text_column)
+   else
+     f.do!(FOLDER)
+   end
+   puts "Finished uploading documents at #{Time.now}"
+
+   puts "Created Stevedore for #{ES_INDEX}; go check out https://stevedore.newsdev.net/search/#{ES_INDEX} or http://stevedore.adm.prd.newsdev.nytimes.com/search/#{ES_INDEX}"
+   if f.errors.size > 0
+     STDERR.puts "#{f.errors.size} failed documents:"
+     STDERR.puts f.errors.inspect
+     puts "Uploading finished, but with #{f.errors.size} errors."
+   end
+ end
data/lib/parsers/stevedore_blob.rb ADDED
@@ -0,0 +1,52 @@
+ require 'json'
+ require 'digest/sha1'
+
+ module Stevedore
+   class StevedoreBlob
+     attr_accessor :title, :text, :download_url, :extra
+
+     def initialize(title, text, download_url=nil, extra={})
+       self.title = title || download_url
+       self.text = text
+       self.download_url = download_url
+       self.extra = extra
+       raise ArgumentError, "StevedoreBlob extra support not yet implemented" if extra.keys.size > 0
+     end
+
+     def clean_text
+       @clean_text ||= text.gsub(/<\/?[^>]+>/, '') # removes all tags
+     end
+
+     def self.new_from_tika(content, metadata, download_url, filename)
+       self.new(metadata["title"], content, download_url)
+     end
+
+     def analyze!
+       # probably does nothing on blobs.
+       # this should do the HTML boilerplate extraction thingy on HTML.
+     end
+
+     def to_hash
+       {
+         "sha1" => Digest::SHA1.hexdigest(download_url),
+         "title" => title.to_s,
+         "source_url" => download_url.to_s,
+         "file" => {
+           "title" => title.to_s,
+           "file" => clean_text.to_s
+         },
+         "analyzed" => {
+           "body" => clean_text.to_s,
+           "metadata" => {
+             "Content-Type" => extra["Content-Type"] || "text/plain"
+           }
+         },
+         "_updatedAt" => Time.now
+       }
+     end
+
+     # N.B. the elasticsearch gem converts your hashes to JSON for you. You don't have to use this at all.
+     # def to_json
+     #   JSON.dump to_hash
+     # end
+   end
+ end
data/lib/parsers/stevedore_csv_row.rb ADDED
@@ -0,0 +1,37 @@
+ require 'date' # for DateTime
+ require 'digest/sha1'
+
+ module Stevedore
+   class StevedoreCsvRow
+     attr_accessor :title, :text, :download_url, :whole_row, :row_num
+
+     def initialize(title, text, row_num, download_url, whole_row={})
+       self.title = title || download_url
+       self.text = text
+       self.download_url = download_url
+       self.whole_row = whole_row
+       self.row_num = row_num
+     end
+
+     def clean_text
+       @clean_text ||= text.gsub(/<\/?[^>]+>/, '') # removes all tags
+     end
+
+     def to_hash
+       {
+         "sha1" => Digest::SHA1.hexdigest(download_url + row_num.to_s),
+         "title" => title.to_s,
+         "source_url" => download_url.to_s,
+         "file" => {
+           "title" => title.to_s,
+           "file" => clean_text.to_s
+         },
+         "analyzed" => {
+           "body" => clean_text.to_s,
+           "metadata" => {
+             "Content-Type" => "text/plain"
+           }.merge( whole_row.to_h )
+         },
+         "_updatedAt" => DateTime.now
+       }
+     end
+   end
+ end
data/lib/parsers/stevedore_email.rb ADDED
@@ -0,0 +1,84 @@
+ require_relative './stevedore_blob'
+ require 'cgi'
+ require 'digest/sha1'
+ require 'manticore'
+
+ module Stevedore
+   class StevedoreEmail < StevedoreBlob
+
+     # TODO: write wrt other fields. where do those go???
+     attr_accessor :creation_date, :message_to, :message_from, :message_cc, :subject, :attachments, :content_type
+
+     def self.new_from_tika(content, metadata, download_url, filepath)
+       t = super
+       t.creation_date = metadata["Creation-Date"]
+       t.message_to = metadata["Message-To"]
+       t.message_from = metadata["Message-From"]
+       t.message_cc = metadata["Message-Cc"]
+       t.subject = metadata["subject"]
+       t.attachments = metadata["X-Attachments"].to_s.split("|").map do |raw_attachment_filename|
+         attachment_filename = CGI::unescape(raw_attachment_filename)
+         possible_filename = File.join(File.dirname(filepath), attachment_filename)
+         eml_filename = File.join(File.dirname(filepath), File.basename(filepath, '.eml') + '-' + attachment_filename)
+         s3_path = S3_BASEPATH + File.dirname(filepath).gsub(::FOLDER, '')
+         possible_s3_url = S3_BASEPATH + '/' + CGI::escape(File.basename(possible_filename))
+         possible_eml_s3_url = S3_BASEPATH + '/' + CGI::escape(File.basename(eml_filename))
+
+         # we might be uploading from the disk, in which case we see if we can find an attachment on disk with the name from X-Attachments
+         # or we might be uploading via S3, in which case we see if an object exists, accessible on S3, with the path from X-Attachments
+         # TODO: support private S3 buckets
+         s3_url = if File.exists? possible_filename
+           possible_s3_url
+         elsif File.exists? eml_filename
+           possible_eml_s3_url
+         else
+           nil
+         end
+         s3_url = begin
+           if Manticore::Client.new.head(possible_s3_url).code == 200
+             puts "found attachment: #{possible_s3_url}"
+             possible_s3_url
+           elsif Manticore::Client.new.head(possible_eml_s3_url).code == 200
+             puts "found attachment: #{possible_eml_s3_url}"
+             possible_eml_s3_url
+           end
+         rescue
+           nil
+         end if s3_url.nil?
+         if s3_url.nil?
+           STDERR.puts "Tika X-Attachments: " + metadata["X-Attachments"].to_s.inspect
+           STDERR.puts "Couldn't find attachment '#{possible_s3_url}' aka '#{possible_eml_s3_url}' from '#{raw_attachment_filename}' from #{download_url}"
+         end
+         s3_url
+       end.compact
+       t
+     end
+
+     def to_hash
+       {
+         "sha1" => Digest::SHA1.hexdigest(download_url),
+         "title" => title.to_s,
+         "source_url" => download_url.to_s,
+         "file" => {
+           "title" => title.to_s,
+           "file" => text.to_s
+         },
+         "analyzed" => {
+           "body" => text.to_s,
+           "metadata" => {
+             "Content-Type" => content_type || "message/rfc822",
+             "Creation-Date" => creation_date,
+             "Message-To" => message_to.is_a?(Enumerable) ? message_to : [ message_to ],
+             "Message-From" => message_from.is_a?(Enumerable) ? message_from : [ message_from ],
+             "Message-Cc" => message_cc.is_a?(Enumerable) ? message_cc : [ message_cc ],
+             "subject" => subject,
+             "attachments" => attachments
+           }
+         },
+         "_updatedAt" => Time.now
+       }
+     end
+
+   end
+ end
data/lib/parsers/stevedore_html.rb ADDED
@@ -0,0 +1,8 @@
+ require_relative './stevedore_blob'
+ require 'nokogiri'
+
+ module Stevedore
+   class StevedoreHTML < StevedoreBlob
+
+   end
+ end
data/lib/split_archive.rb ADDED
@@ -0,0 +1,77 @@
+ # splits zip, mbox and pst files into their constituent documents -- messages and attachments --
+ # and puts them into a tmp folder,
+ # which is then parsed normally
+ require 'mapi/msg'
+ require 'mapi/pst'
+ require 'tmpdir'
+ require 'fileutils'
+ require 'zip'
+
+ # splits PST, Mbox and zip formats
+
+ class ArchiveSplitter
+   def self.split(archive_filename)
+     # if it's a PST use split_pst
+     # if it's an mbox, use split_mbox
+     # yields the path of each extracted file
+     Dir.mktmpdir do |dir|
+       # TODO: should probably do magic-byte sniffing instead of trusting the extension
+       extension = archive_filename.split(".")[-1]
+
+       splitter_method = if extension == "mbox"
+         :split_mbox
+       elsif extension == "pst"
+         :split_pst
+       elsif extension == "zip"
+         :split_zip
+       end
+
+       # each splitter yields a relative filename
+       # and a lambda that will write the file contents to the given filename
+       MailArchiveSplitter.send(splitter_method, archive_filename) do |filename, contents_lambda|
+         destination = File.join(dir, File.basename(archive_filename), filename)
+         FileUtils.mkdir_p(File.dirname(destination))
+         contents_lambda.call(destination)
+         yield destination if block_given?
+       end
+     end
+   end
+ end
+
+ class MailArchiveSplitter
+
+   def self.split_pst(archive_filename)
+     pst = Mapi::Pst.new open(archive_filename)
+     pst.each_with_index do |mail, idx|
+       msg = Mapi::Msg.load mail
+       yield "#{idx}.eml", lambda{|fn| open(fn, 'wb'){|fh| fh << mail } }
+       msg.attachments.each do |attachment|
+         yield attachment.filename, lambda{|fn| open(fn, 'wb'){|fh| attachment.save fh }}
+       end
+     end
+   end
+
+   def self.split_mbox(archive_filename)
+     # stolen shamelessly from the Ruby Enumerable docs, actually
+     # split mails in mbox (slice before Unix From line after an empty line)
+     # attachments stay inside each .eml; Tika pulls them out later
+     open(archive_filename) do |fh|
+       fh.slice_before(empty: true) do |line, h|
+         previous_was_empty = h[:empty]
+         h[:empty] = line == "\n"
+         previous_was_empty && line.start_with?("From ")
+       end.each_with_index do |mail, idx|
+         mail.pop if mail.last == "\n" # remove last line if present
+         yield "#{idx}.eml", lambda{|fn| open(fn, 'wb'){|out| out << mail.join } }
+       end
+     end
+   end
+
+   def self.split_zip(archive_filename)
+     Zip::File.open(archive_filename) do |zip_file|
+       zip_file.each do |entry|
+         yield entry.name, lambda{|fn| entry.extract(fn) }
+       end
+     end
+   end
+
+ end
data/lib/stevedore-uploader.rb ADDED
@@ -0,0 +1,340 @@
+ Dir["#{File.expand_path(File.dirname(__FILE__))}/../lib/*.rb"].each {|f| require f}
+ Dir["#{File.expand_path(File.dirname(__FILE__))}/../lib/parsers/*.rb"].each {|f| require f}
+
+ require 'rika'
+
+ require 'net/https'
+ require 'elasticsearch'
+ require 'elasticsearch/transport/transport/http/manticore'
+
+ require 'manticore'
+ require 'fileutils'
+ require 'csv'
+
+ require 'aws-sdk'
+
+ module Stevedore
+   class ESUploader
+     # creates blobs
+     attr_reader :errors, :s3_basepath # s3_basepath is read by bin/upload_to_elasticsearch.rb when uploading CSVs
+     attr_accessor :should_ocr, :slice_size
+
+     def initialize(es_host, es_index, s3_bucket=nil, s3_path=nil)
+       @errors = []
+       @client = Elasticsearch::Client.new({
+           log: false,
+           url: es_host,
+           transport_class: Elasticsearch::Transport::Transport::HTTP::Manticore,
+           request_timeout: 5*60,
+           socket_timeout: 60
+         },
+       )
+       @es_index = es_index
+       # parentheses matter: only fall back to FOLDER (set by bin/upload_to_elasticsearch.rb) when no bucket was given
+       @s3_bucket = s3_bucket || (FOLDER.downcase.include?('s3://') ? FOLDER.gsub(/s3:\/\//i, '').split("/", 2).first : 'int-data-dumps')
+       @s3_basepath = "https://#{@s3_bucket}.s3.amazonaws.com/#{s3_path || es_index}"
+
+       @slice_size = 100
+
+       @should_ocr = false
+
+       self.create_index!
+       self.create_mappings!
+     end
+
+     def create_index!
+       begin
+         @client.indices.create(
+           index: @es_index,
+           body: {
+             settings: {
+               analysis: {
+                 analyzer: {
+                   email_analyzer: {
+                     type: "custom",
+                     tokenizer: "email_tokenizer",
+                     filter: ["lowercase"]
+                   },
+                   snowball_analyzer: {
+                     type: "snowball",
+                     language: "English"
+                   }
+                 },
+                 tokenizer: {
+                   email_tokenizer: {
+                     type: "pattern",
+                     pattern: "([a-zA-Z0-9_\\.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-\\.]+)",
+                     group: "0"
+                   }
+                 }
+               }
+             },
+           })
+       # don't complain if the index already exists.
+       rescue Elasticsearch::Transport::Transport::Errors::BadRequest => e
+         raise e unless e.message && (e.message.include?("IndexAlreadyExistsException") || e.message.include?("already exists as alias"))
+       end
+     end
+
+     def create_mappings!
+       @client.indices.put_mapping({
+         index: @es_index,
+         type: :doc,
+         body: {
+           "_id" => {
+             path: "sha1"
+           },
+           properties: { # feel free to add more, this is the BARE MINIMUM the UI depends on
+             sha1: {type: :string, index: :not_analyzed},
+             title: { type: :string, analyzer: :keyword },
+             source_url: {type: :string, index: :not_analyzed},
+             modifiedDate: { type: :date, format: "dateOptionalTime" },
+             _updated_at: { type: :date },
+             analyzed: {
+               properties: {
+                 body: {
+                   type: :string,
+                   index_options: :offsets,
+                   term_vector: :with_positions_offsets,
+                   store: true,
+                   fields: {
+                     snowball: {
+                       type: :string,
+                       index: "analyzed",
+                       analyzer: 'snowball_analyzer',
+                       index_options: :offsets,
+                       term_vector: :with_positions_offsets,
+                     }
+                   }
+                 },
+                 metadata: {
+                   properties: {
+                     # "attachments" => {type: :string, index: :not_analyzed}, # might break stuff; intended to keep the index name (which often contains relevant search terms) from being indexed, e.g. if a user wants to search for 'bernie' in the bernie-burlington-emails
+                     "Message-From" => {
+                       type: "string",
+                       fields: {
+                         email: {
+                           type: "string",
+                           analyzer: "email_analyzer"
+                         },
+                         "Message-From" => {
+                           type: "string"
+                         }
+                       }
+                     },
+                     "Message-To" => {
+                       type: "string",
+                       fields: {
+                         email: {
+                           type: "string",
+                           analyzer: "email_analyzer"
+                         },
+                         "Message-To" => {
+                           type: "string"
+                         }
+                       }
+                     }
+                   }
+                 }
+               }
+             }
+           }
+         }
+       }) # was "rescue nil" but that obscured meaningful errors
+     end
+
+     def bulk_upload_to_es!(data)
+       return nil if data.empty?
+       begin
+         resp = @client.bulk body: data.map{|datum| {index: {_index: @es_index, _type: 'doc', data: datum }} }
+         puts resp if resp["errors"]
+       rescue JSON::GeneratorError
+         # if the batch as a whole can't be serialized, retry each document on its own, skipping any that still fail
+         data.each do |datum|
+           begin
+             @client.bulk body: [{index: {_index: @es_index, _type: 'doc', data: datum }}]
+           rescue JSON::GeneratorError
+             next
+           end
+         end
+         resp = nil
+       end
+       resp
+     end
+
+     def process_document(filename, filename_for_s3)
+       begin
+         puts "begin to process #{filename}"
+         # puts "size: #{File.size(filename)}"
+         begin
+           content, metadata = Rika.parse_content_and_metadata(filename)
+         rescue StandardError
+           content = "couldn't be parsed"
+           metadata = {} # so the Content-Type checks below quietly fall through to the generic blob case
+         end
+         puts "parsed: #{content.size}"
+         if content.size > 10 * (10 ** 6)
+           @errors << filename
+           puts "skipping #{filename} for being too big"
+           return nil
+         end
+
+         # TODO: factor these out in favor of the yield/block situation down below.
+         # this should (eventually) be totally generic, but perhaps handle common
+         # document types on its own
+         ret = case # .eml # .msg
+           when metadata["Content-Type"] == "message/rfc822" || metadata["Content-Type"] == "application/vnd.ms-outlook"
+             ::Stevedore::StevedoreEmail.new_from_tika(content, metadata, filename_for_s3, filename).to_hash
+           when metadata["Content-Type"] && ["application/html", "application/xhtml+xml"].include?(metadata["Content-Type"].split(";").first)
+             ::Stevedore::StevedoreHTML.new_from_tika(content, metadata, filename_for_s3, filename).to_hash
+           when @should_ocr && metadata["Content-Type"] == "application/pdf" && (content.match(/\A\s*\z/) || content.size < 50 * metadata["xmpTPg:NPages"].to_i)
+             # this is a scanned PDF.
+             puts "scanned PDF #{File.basename(filename)} detected; OCRing"
+             pdf_basename = filename.gsub(".pdf", '')
+             system("convert", "-monochrome", "-density", "300x300", filename, "-depth", '8', "#{pdf_basename}.png")
+             (Dir["#{pdf_basename}-*.png"] + Dir["#{pdf_basename}.png"]).sort_by{|png| (matchdata = png.match(/-(\d+)\.png/)).nil? ? 0 : matchdata[1].to_i }.each do |png|
+               system('tesseract', png, png, "pdf")
+               File.delete(png)
+               # no need to use a system call when we could use the stdlib!
+               # system("rm", "-f", png) rescue nil
+               File.delete("#{png}.txt") if File.exist?("#{png}.txt")
+             end
+             # e.g. Analysis-Corporation-2.png.pdf or Torture.pdf
+             files = Dir["#{pdf_basename}.png.pdf"] + (Dir["#{pdf_basename}-*.png.pdf"].sort_by{|pdf| Regexp.new("#{pdf_basename}-([0-9]+).png.pdf").match(pdf)[1].to_i })
+             system('pdftk', *files, "cat", "output", "#{pdf_basename}.ocr.pdf")
+             content, _ = Rika.parse_content_and_metadata("#{pdf_basename}.ocr.pdf")
+             puts "OCRed content (#{File.basename(filename)}) length: #{content.length}"
+             ::Stevedore::StevedoreBlob.new_from_tika(content, metadata, filename_for_s3, filename).to_hash
+           else
+             ::Stevedore::StevedoreBlob.new_from_tika(content, metadata, filename_for_s3, filename).to_hash
+         end
+         [ret, content, metadata]
+       rescue StandardError, java.lang.NoClassDefFoundError, org.apache.tika.exception.TikaException => e
+         STDERR.puts e.inspect
+         STDERR.puts "#{e} #{e.message}: #{filename}"
+         STDERR.puts e.backtrace.join("\n") + "\n\n\n"
+         @errors << filename
+         nil
+       end
+     end
+
+     def do_csv!(file, download_url, title_column=0, text_column=nil)
+       docs_so_far = 0
+       CSV.open(file, headers: !title_column.is_a?(Fixnum)).each_slice(@slice_size).each_with_index do |slice, slice_index|
+         slice_of_rows = slice.map.each_with_index do |row, i|
+           doc = ::Stevedore::StevedoreCsvRow.new(
+             row[title_column],
+             (row.respond_to?(:to_hash) ? (text_column.nil? ? row.to_hash.each_pair.map{|k, v| "#{k}: #{v}"}.join(" \n\n ") : row[text_column]) : row.to_a.join(" \n\n ")) + " \n\n csv_source: #{File.basename(file)}",
+             (@slice_size * slice_index) + i,
+             download_url,
+             row).to_hash
+           doc["analyzed"] ||= {}
+           doc["analyzed"]["metadata"] ||= {}
+           yield doc if block_given? && doc
+           doc
+         end
+         begin
+           resp = bulk_upload_to_es!(slice_of_rows.compact)
+           docs_so_far += @slice_size
+         rescue Manticore::Timeout, Manticore::SocketException
+           STDERR.puts("retrying at #{Time.now}")
+           retry
+         end
+         puts "uploaded #{slice_of_rows.size} rows to #{@es_index}; #{docs_so_far} uploaded so far"
+         puts "Errors in bulk upload: #{resp.inspect}" if resp && resp["errors"]
+       end
+     end
+
+     def do!(target_folder_path, output_stream=STDOUT)
+       output_stream.puts "Processing documents from #{target_folder_path}"
+
+       docs_so_far = 0
+
+       if target_folder_path.downcase.include?("s3://")
+         Dir.mktmpdir do |dir|
+           Aws.config.update({
+             region: 'us-east-1', # TODO should be configurable
+           })
+           s3 = Aws::S3::Resource.new
+
+           bucket = s3.bucket(@s3_bucket)
+           s3_path_without_bucket = target_folder_path.gsub(/s3:\/\//i, '').split("/", 2).last
+           bucket.objects(:prefix => s3_path_without_bucket).each_slice(@slice_size) do |slice_of_objs|
+             docs_so_far += slice_of_objs.size
+
+             output_stream.puts "starting a set of #{@slice_size} -- so far #{docs_so_far}"
+             slice_of_objs.map! do |obj|
+               next if obj.key[-1] == "/"
+               FileUtils.mkdir_p(File.join(dir, File.dirname(obj.key)))
+               tmp_filename = File.join(dir, obj.key)
+               begin
+                 body = obj.get.body.read
+                 File.open(tmp_filename, 'wb'){|f| f << body}
+               rescue Aws::S3::Errors::NoSuchKey
+                 @errors << obj.key
+               rescue ArgumentError
+                 # strip invalidly-encoded characters and try again
+                 File.open(tmp_filename, 'wb'){|f| f << (body.nil? ? '' : body.chars.select(&:valid_encoding?).join)}
+               end
+               download_filename = "https://#{@s3_bucket}.s3.amazonaws.com/" + obj.key
+               doc, content, metadata = process_document(tmp_filename, download_filename)
+               begin
+                 FileUtils.rm(tmp_filename)
+               rescue Errno::ENOENT
+                 # try to delete, but no biggie if it doesn't work for some weird reason.
+               end
+               yield doc, obj.key, content, metadata if block_given?
+               doc
+             end
+             begin
+               resp = bulk_upload_to_es!(slice_of_objs.compact)
+             rescue Manticore::Timeout, Manticore::SocketException
+               output_stream.puts("retrying at #{Time.now}")
+               retry
+             end
+             output_stream.puts "uploaded #{slice_of_objs.size} files to #{@es_index}; #{docs_so_far} uploaded so far"
+             output_stream.puts "Errors in bulk upload: #{resp.inspect}" if resp && resp["errors"]
+           end
+         end
+       else
+         Dir[target_folder_path + (target_folder_path.include?('*') ? '' : '/**/*')].each_slice(@slice_size) do |slice_of_files|
+           output_stream.puts "starting a set of #{@slice_size}"
+           docs_so_far += slice_of_files.size
+
+           slice_of_files.map! do |filename|
+             next unless File.file?(filename)
+
+             filename_basepath = filename.gsub(target_folder_path, '')
+             # an S3 bucket is always configured (it defaults to int-data-dumps), so point download links there;
+             # fall back to a local /files/ URL if it somehow isn't
+             if @s3_bucket
+               download_filename = @s3_basepath + filename_basepath
+             else
+               download_filename = "/files/#{@es_index}/#{filename_basepath}"
+             end
+
+             doc, content, metadata = process_document(filename, download_filename)
+             yield doc, filename, content, metadata if block_given?
+             doc
+           end
+           begin
+             puts "uploading"
+             resp = bulk_upload_to_es!(slice_of_files.compact)
+             puts resp.inspect if resp && resp["errors"]
+           rescue Manticore::Timeout, Manticore::SocketException => e
+             output_stream.puts e.inspect
+             output_stream.puts "Upload error: #{e} #{e.message}."
+             output_stream.puts e.backtrace.join("\n") + "\n\n\n"
+             output_stream.puts("retrying at #{Time.now}")
+             retry
+           end
+           output_stream.puts "uploaded #{slice_of_files.size} files to #{@es_index}; #{docs_so_far} uploaded so far"
+           output_stream.puts "Errors in bulk upload: #{resp.inspect}" if resp && resp["errors"]
+         end
+       end
+     end
+   end
+ end
metadata ADDED
@@ -0,0 +1,135 @@
+ --- !ruby/object:Gem::Specification
+ name: stevedore-uploader
+ version: !ruby/object:Gem::Version
+   version: 1.0.0
+ platform: java
+ authors:
+ - Jeremy B. Merrill
+ autorequire:
+ bindir: bin
+ cert_chain: []
+ date: 2016-04-19 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.0'
+   name: elasticsearch
+   prerelease: false
+   type: :runtime
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.0'
+ - !ruby/object:Gem::Dependency
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   name: manticore
+   prerelease: false
+   type: :runtime
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.9'
+   name: jruby-openssl
+   prerelease: false
+   type: :runtime
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.9'
+ - !ruby/object:Gem::Dependency
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '2'
+   name: aws-sdk
+   prerelease: false
+   type: :runtime
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '2'
+ - !ruby/object:Gem::Dependency
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: 1.6.1
+   name: rika-stevedore
+   prerelease: false
+   type: :runtime
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: 1.6.1
+ - !ruby/object:Gem::Dependency
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.6'
+   name: nokogiri
+   prerelease: false
+   type: :runtime
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.6'
+ description: TK
+ email: jeremy.merrill@nytimes.com
+ executables: []
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - README.md
+ - bin/upload_to_elasticsearch.rb
+ - lib/parsers/stevedore_blob.rb
+ - lib/parsers/stevedore_csv_row.rb
+ - lib/parsers/stevedore_email.rb
+ - lib/parsers/stevedore_html.rb
+ - lib/split_archive.rb
+ - lib/stevedore-uploader.rb
+ homepage: https://github.com/newsdev/stevedore-uploader
+ licenses:
+ - MIT
+ metadata: {}
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubyforge_project:
+ rubygems_version: 2.4.8
+ signing_key:
+ specification_version: 4
+ summary: Upload documents to a Stevedore search engine.
+ test_files: []