stevedore-uploader 1.0.0-java

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+   metadata.gz: ea9762ea02f37f5e56993854cb08ae63f21f7e1e
+   data.tar.gz: f525070b72ea4f56e44096de00ec4789a059b042
+ SHA512:
+   metadata.gz: 1e6cd4cbae534a1bd67091daea696c0cb33daf5dc94c740be2b6f6ce0fbf23195c02d05adfbe00d618e6b36f3bd890e8223fa7c8b7f02a3644aaa4e988b6d1b4
+   data.tar.gz: d4b4971924a013b3bf7a64b375a6e2fcf5246fa29a975099415c1128a0d0ebbb9d0526776d98f8f81e79ffbe6dea21e1f6eeacf713485785c9f8f4b8f62f749b
data/README.md ADDED
@@ -0,0 +1,62 @@
+ stevedore-uploader
+ ==================
+
+ A tool for uploading documents into [Stevedore](https://github.com/newsdev/stevedore), a flexible, extensible search engine for document dumps created by The New York Times.
+
+ Stevedore is essentially an ElasticSearch endpoint with a customizable frontend attached. Since ElasticSearch is Stevedore's primary document store, `stevedore-uploader`'s main task is simply uploading documents to ElasticSearch, along with a few attributes that Stevedore depends on. Getting a new document set ready for search takes a few steps, but this tool handles the hardest one: converting the documents you want to search into a format that ElasticSearch understands. Customizing the search interface is often unnecessary, but if you need to, the [Stevedore](https://github.com/newsdev/stevedore) repository explains how.
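+
+ For illustration, here's roughly the shape of the document the uploader sends to ElasticSearch for each file -- a minimal sketch based on `StevedoreBlob#to_hash` below, with made-up values:
+
+ ```ruby
+ {
+   "sha1"       => "fe5dbbcea5ce7e2988b8c69bcfdfde8904aabc1f", # SHA1 of the download URL; used as the ES document id
+   "title"      => "some-document.pdf",
+   "source_url" => "https://my-bucket.s3.amazonaws.com/my-index/some-document.pdf",
+   "file"       => { "title" => "some-document.pdf", "file" => "the extracted plain text..." },
+   "analyzed"   => {
+     "body"     => "the extracted plain text...",
+     "metadata" => { "Content-Type" => "application/pdf" }
+   },
+   "_updatedAt" => Time.now
+ }
+ ```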
+
+ Every document processing job is different. Some might require OCR, others might require parsing e-mails, still others might call for sophisticated processing of text documents. There's no telling. That being the case, this project tries to make no assumptions about the type of data you'll be uploading -- but by default it tries to convert everything into plaintext with [Apache Tika](https://tika.apache.org/). A `case` statement in `lib/stevedore-uploader.rb` distinguishes between a few default types, like e-mails and text blobs (PRs adding new ones would be appreciated); for specialized types, the `do!` method takes a block that lets you modify each document with just a few lines of Ruby (see the example under Usage below).
+
+ For more details on the entire workflow, see [Stevedore](https://github.com/newsdev/stevedore).
+
+ Installation
+ ------------
+
+ This project is written in JRuby, so we can leverage the transformative enterprise stability features of the JVM Java™ Platform and the truly-American Bruce-Springsteen-Born-to-Run freedom-to-roam of Ruby. (And the fact that Tika's in Java.)
+
+ 1. Install JRuby. If you use rbenv, you'd do this:
+    `rbenv install jruby-9.0.5.0` (later versions are okay too)
+ 2. Be sure you're running Java 8. (Java 7 is deprecated, c'mon c'mon.)
+ 3. `bundle install`
+
+ Usage
+ -----
+
+ **This is a piece of a larger upload workflow, [described here](https://github.com/newsdev/stevedore/blob/master/README.md). You should read that first, then come back here.**
+
+ Upload documents from your local disk:
+ ```
+ bundle exec ruby bin/upload_to_elasticsearch.rb --index=INDEXNAME [--host=localhost:9200] [--s3path=name-of-path-under-bucket] path/to/documents/to/parse
+ ```
+ or from S3:
+ ```
+ bundle exec ruby bin/upload_to_elasticsearch.rb --index=INDEXNAME [--host=localhost:9200] s3://my-bucket/path/to/documents/to/parse
+ ```
+
+ If `--host` isn't specified, we assume `localhost:9200`.
+
+ For example:
+ ```
+ bundle exec ruby bin/upload_to_elasticsearch.rb --index=jrubytest --host=https://stevedore.newsdev.net/es/ ~/code/marco-rubios-emails/emls/
+ ```
+
+ You may also specify an S3 location of documents to parse, instead of a local directory, e.g.
+ ```
+ bundle exec ruby bin/upload_to_elasticsearch.rb --index=jrubytest --host=https://stevedore.newsdev.net/es/ s3://int-data-dumps/marco-rubio-fire-drill
+ ```
+ If you choose to process documents from S3, you should upload those documents first with your tool of choice -- `awscli` is a good one. **Stevedore-Uploader does NOT upload documents to S3 on your behalf.**
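+
+ For example, to push a local folder up to S3 before indexing it, something like this works (the bucket and path names here are illustrative):
+ ```
+ aws s3 sync path/to/documents/to/parse s3://my-bucket/path/to/documents/to/parse
+ ```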
+
+ If you need to process documents in a specialized, customized way, follow this example:
+
+     uploader = Stevedore::ESUploader.new(ES_HOST, ES_INDEX, S3_BUCKET, S3_PATH_PREFIX) # S3_BUCKET, S3_PATH_PREFIX are optional
+     uploader.do! FOLDER do |doc, filename, content, metadata|
+       next if doc.nil?
+       doc["analyzed"]["metadata"]["date"] = Date.parse(File.basename(filename).split("_")[-2])
+       doc["analyzed"]["metadata"]["title"] = my_title_getter_function(File.basename(filename))
+     end
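+
+ The block receives each document as the hash that's about to be sent to ElasticSearch (`doc` is `nil` when a file couldn't be parsed, hence the `next`), so anything you set under `doc["analyzed"]["metadata"]` ends up searchable alongside the Tika-extracted metadata. (`my_title_getter_function` is a stand-in for whatever logic fits your documents.)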
+
+ Questions?
+ ==========
+
+ Hit us up in the [Stevedore](https://github.com/newsdev/stevedore) issues, or however suits your fancy.
data/bin/upload_to_elasticsearch.rb ADDED
@@ -0,0 +1,114 @@
+ #!/usr/bin/env jruby
+ # -*- coding: utf-8 -*-
+
+ raise Exception, "You've gotta use JRuby" unless RUBY_PLATFORM == 'java'
+ raise Exception, "You've gotta use Java 1.8; you're on #{java.lang.System.getProperties["java.runtime.version"]}" unless java.lang.System.getProperties["java.runtime.version"] =~ /1\.8/
+
+ require "#{File.expand_path(File.dirname(__FILE__))}/../lib/stevedore-uploader.rb"
+
+ if __FILE__ == $0
+   require 'optparse'
+   require 'ostruct'
+   options = OpenStruct.new
+   options.ocr = true
+
+   op = OptionParser.new("Usage: upload_to_elasticsearch [options] target_(dir_or_csv)") do |opts|
+     opts.on("-hSERVER:PORT", "--host=SERVER:PORT",
+             "The location of the ElasticSearch server") do |host|
+       options.host = host
+     end
+
+     opts.on("-iNAME", "--index=NAME",
+             "A name to use for the ES index (defaults to using the directory name)") do |index|
+       options.index = index
+     end
+
+     opts.on("-sPATH", "--s3path=PATH",
+             "The path under your bucket where these files have been uploaded. (defaults to ES index)"
+            ) do |s3path|
+       options.s3path = s3path
+     end
+     opts.on("-bPATH", "--s3bucket=PATH",
+             "The S3 bucket where these files have already been uploaded (or will be later)."
+            ) do |s3bucket|
+       options.s3bucket = s3bucket
+     end
+
+     opts.on("--title_column=COLNAME",
+             "If the target file is a CSV, the column that contains the title of each row. Integer index or string column name."
+            ) do |title_column|
+       options.title_column = title_column
+     end
+     opts.on("--text_column=COLNAME",
+             "If the target file is a CSV, the column that contains the main, searchable text of each row. Integer index or string column name."
+            ) do |text_column|
+       options.text_column = text_column
+     end
+
+     opts.on("-o", "--[no-]ocr", "Whether to OCR scanned PDFs (defaults to on)") do |v|
+       options.ocr = v
+     end
+
+     opts.on('-?', '--help', 'Display this screen') do
+       puts opts
+       exit
+     end
+   end
+
+   op.parse!
+
+   # to delete an index: curl -X DELETE localhost:9200/indexname/
+   unless ARGV.length == 1
+     puts op
+     exit
+   end
+ end
+
+ # you can provide either a path to files locally or
+ # an S3 endpoint as s3://int-data-dumps/YOURINDEXNAME
+ FOLDER = ARGV.shift
+
+ ES_INDEX = if options.index.nil? || options.index == ''
+   if FOLDER.downcase.include?('s3://')
+     s3_path_without_bucket = FOLDER.gsub(/s3:\/\//i, '').split("/", 2).last
+     s3_path_without_bucket.gsub(/^.+\//, '').gsub(/[^A-Za-z0-9\-_]/, '')
+   else
+     FOLDER.gsub(/^.+\//, '').gsub(/[^A-Za-z0-9\-_]/, '')
+   end
+ else
+   options.index
+ end
+
+ S3_BUCKET = FOLDER.downcase.include?('s3://') ? FOLDER.gsub(/s3:\/\//i, '').split("/", 2).first : options.s3bucket
+ # only enforce the bucket restriction when a bucket was actually specified
+ raise ArgumentError, 's3 buckets other than int-data-dumps aren\'t supported by the frontend yet' if S3_BUCKET && S3_BUCKET != 'int-data-dumps'
+ ES_HOST = options.host || "localhost:9200"
+ S3_PATH = options.s3path || options.index
+ S3_BASEPATH = "https://#{S3_BUCKET}.s3.amazonaws.com/#{S3_PATH}"
+
+ raise ArgumentError, "specify a destination" unless FOLDER
+ raise ArgumentError, "specify the elasticsearch host" unless ES_HOST
+
+ ###############################
+ # actual stuff
+ ###############################
+
+ if __FILE__ == $0
+   # note: the uploader expects the path under the bucket (S3_PATH), not the full base URL
+   f = Stevedore::ESUploader.new(ES_HOST, ES_INDEX, S3_BUCKET, S3_PATH)
+   f.should_ocr = options.ocr
+   puts "Will not OCR, per --no-ocr option" unless f.should_ocr
+
+   if FOLDER.match(/\.[ct]sv$/)
+     f.do_csv!(FOLDER, File.join(f.s3_basepath, File.basename(FOLDER)), options.title_column, options.text_column)
+   else
+     f.do!(FOLDER)
+   end
+   puts "Finished uploading documents at #{Time.now}"
+
+   puts "Created Stevedore for #{ES_INDEX}; go check out https://stevedore.newsdev.net/search/#{ES_INDEX} or http://stevedore.adm.prd.newsdev.nytimes.com/search/#{ES_INDEX}"
+   if f.errors.size > 0
+     STDERR.puts "#{f.errors.size} failed documents:"
+     STDERR.puts f.errors.inspect
+     puts "Uploading finished, but with #{f.errors.size} errors."
+   end
+ end
data/lib/parsers/stevedore_blob.rb ADDED
@@ -0,0 +1,52 @@
+ require 'json'
+ require 'digest/sha1'
+
+ module Stevedore
+   class StevedoreBlob
+     attr_accessor :title, :text, :download_url, :extra
+
+     def initialize(title, text, download_url=nil, extra={})
+       self.title = title || download_url
+       self.text = text
+       self.download_url = download_url
+       self.extra = extra
+       raise ArgumentError, "StevedoreBlob extra support not yet implemented" if extra.keys.size > 0
+     end
+
+     def clean_text
+       @clean_text ||= text.gsub(/<\/?[^>]+>/, '') # removes all tags
+     end
+
+     def self.new_from_tika(content, metadata, download_url, filename)
+       self.new(metadata["title"], content, download_url)
+     end
+
+     def analyze!
+       # probably does nothing on blobs.
+       # this should do the HTML boilerplate extraction thingy on HTML.
+     end
+
+     def to_hash
+       {
+         "sha1" => Digest::SHA1.hexdigest(download_url),
+         "title" => title.to_s,
+         "source_url" => download_url.to_s,
+         "file" => {
+           "title" => title.to_s,
+           "file" => clean_text.to_s
+         },
+         "analyzed" => {
+           "body" => clean_text.to_s,
+           "metadata" => {
+             "Content-Type" => extra["Content-Type"] || "text/plain"
+           }
+         },
+         "_updatedAt" => Time.now
+       }
+     end
+
+     # N.B. the elasticsearch gem converts your hashes to JSON for you. You don't have to use this at all.
+     # def to_json
+     #   JSON.dump to_hash
+     # end
+   end
+ end
data/lib/parsers/stevedore_csv_row.rb ADDED
@@ -0,0 +1,37 @@
+ require 'date' # for DateTime
+ require 'digest/sha1'
+
+ module Stevedore
+   class StevedoreCsvRow
+     attr_accessor :title, :text, :download_url, :whole_row, :row_num
+
+     def initialize(title, text, row_num, download_url, whole_row={})
+       self.title = title || download_url
+       self.text = text
+       self.download_url = download_url
+       self.whole_row = whole_row
+       self.row_num = row_num
+     end
+
+     def clean_text
+       @clean_text ||= text.gsub(/<\/?[^>]+>/, '') # removes all tags
+     end
+
+     def to_hash
+       {
+         "sha1" => Digest::SHA1.hexdigest(download_url + row_num.to_s),
+         "title" => title.to_s,
+         "source_url" => download_url.to_s,
+         "file" => {
+           "title" => title.to_s,
+           "file" => clean_text.to_s
+         },
+         "analyzed" => {
+           "body" => clean_text.to_s,
+           "metadata" => {
+             "Content-Type" => "text/plain"
+           }.merge( whole_row.to_h )
+         },
+         "_updatedAt" => DateTime.now
+       }
+     end
+   end
+ end
data/lib/parsers/stevedore_email.rb ADDED
@@ -0,0 +1,84 @@
+ require_relative './stevedore_blob'
+ require 'cgi'
+ require 'digest/sha1'
+ require 'manticore'
+
+ module Stevedore
+   class StevedoreEmail < StevedoreBlob
+
+     # TODO: write wrt other fields. where do those go???
+     attr_accessor :creation_date, :message_to, :message_from, :message_cc, :subject, :attachments, :content_type
+
+     def self.new_from_tika(content, metadata, download_url, filepath)
+       t = super
+       t.creation_date = metadata["Creation-Date"]
+       t.message_to = metadata["Message-To"]
+       t.message_from = metadata["Message-From"]
+       t.message_cc = metadata["Message-Cc"]
+       t.subject = metadata["subject"]
+       t.attachments = metadata["X-Attachments"].to_s.split("|").map do |raw_attachment_filename|
+         attachment_filename = CGI::unescape(raw_attachment_filename)
+         possible_filename = File.join(File.dirname(filepath), attachment_filename)
+         eml_filename = File.join(File.dirname(filepath), File.basename(filepath, '.eml') + '-' + attachment_filename)
+         s3_path = S3_BASEPATH + File.dirname(filepath).gsub(::FOLDER, '')
+         possible_s3_url = S3_BASEPATH + '/' + CGI::escape(File.basename(possible_filename))
+         possible_eml_s3_url = S3_BASEPATH + '/' + CGI::escape(File.basename(eml_filename))
+
+         # we might be uploading from the disk, in which case we see if we can find an attachment on disk with the name from X-Attachments
+         # or we might be uploading via S3, in which case we see if an object exists, accessible on S3, with the path from X-Attachments
+         # TODO: support private S3 buckets
+         s3_url = if File.exists? possible_filename
+           possible_s3_url
+         elsif File.exists? eml_filename
+           possible_eml_s3_url
+         else
+           nil
+         end
+         s3_url = begin
+           if Manticore::Client.new.head(possible_s3_url).code == 200
+             puts "found attachment: #{possible_s3_url}"
+             possible_s3_url
+           elsif Manticore::Client.new.head(possible_eml_s3_url).code == 200
+             puts "found attachment: #{possible_eml_s3_url}"
+             possible_eml_s3_url
+           end
+         rescue
+           nil
+         end if s3_url.nil?
+         if s3_url.nil?
+           STDERR.puts "Tika X-Attachments: " + metadata["X-Attachments"].to_s.inspect
+           STDERR.puts "Couldn't find attachment '#{possible_s3_url}' aka '#{possible_eml_s3_url}' from '#{raw_attachment_filename}' from #{download_url}"
+         end
+         s3_url
+       end.compact
+       t
+     end
+
+     def to_hash
+       {
+         "sha1" => Digest::SHA1.hexdigest(download_url),
+         "title" => title.to_s,
+         "source_url" => download_url.to_s,
+         "file" => {
+           "title" => title.to_s,
+           "file" => text.to_s
+         },
+         "analyzed" => {
+           "body" => text.to_s,
+           "metadata" => {
+             "Content-Type" => content_type || "message/rfc822",
+             "Creation-Date" => creation_date,
+             "Message-To" => message_to.is_a?(Enumerable) ? message_to : [ message_to ],
+             "Message-From" => message_from.is_a?(Enumerable) ? message_from : [ message_from ],
+             "Message-Cc" => message_cc.is_a?(Enumerable) ? message_cc : [ message_cc ],
+             "subject" => subject,
+             "attachments" => attachments
+           }
+         },
+         "_updatedAt" => Time.now
+       }
+     end
+
+   end
+ end
data/lib/parsers/stevedore_html.rb ADDED
@@ -0,0 +1,8 @@
+ require_relative './stevedore_blob'
+ require 'nokogiri'
+
+ module Stevedore
+   class StevedoreHTML < StevedoreBlob
+
+   end
+ end
data/lib/split_archive.rb ADDED
@@ -0,0 +1,77 @@
+ # splits zip, mbox and pst files into their constituent documents -- messages and attachments --
+ # and puts them into a tmp folder,
+ # which is then parsed normally
+ require 'mapi/msg'
+ require 'mapi/pst'
+ require 'tmpdir'
+ require 'fileutils'
+ require 'zip'
+
+ # splits PST, Mbox and zip formats
+
+ class ArchiveSplitter
+   def self.split(archive_filename)
+     # if it's a PST use split_pst
+     # if it's an mbox, use split_mbox
+     # yields the path of each extracted file
+     Dir.mktmpdir do |dir|
+       # TODO: should probably do magic-byte sniffing instead of trusting the extension
+       extension = archive_filename.split(".")[-1]
+
+       splitter_method = if extension == "mbox"
+         :split_mbox
+       elsif extension == "pst"
+         :split_pst
+       elsif extension == "zip"
+         :split_zip
+       end
+
+       # each splitter yields a relative filename
+       # and a lambda that will write the file contents to the given filename
+       MailArchiveSplitter.send(splitter_method, archive_filename) do |filename, contents_lambda|
+         destination = File.join(dir, File.basename(archive_filename), filename)
+         FileUtils.mkdir_p(File.dirname(destination))
+         contents_lambda.call(destination)
+         yield destination if block_given?
+       end
+     end
+   end
+ end
+
+ class MailArchiveSplitter
+
+   def self.split_pst(archive_filename)
+     pst = Mapi::Pst.new open(archive_filename)
+     pst.each_with_index do |mail, idx|
+       msg = Mapi::Msg.load mail
+       yield "#{idx}.eml", lambda{|fn| open(fn, 'wb'){|fh| fh << mail } }
+       msg.attachments.each do |attachment|
+         yield attachment.filename, lambda{|fn| open(fn, 'wb'){|fh| attachment.save fh }}
+       end
+     end
+   end
+
+   def self.split_mbox(archive_filename)
+     # stolen shamelessly from the Ruby Enumerable docs, actually
+     # split mails in mbox (slice before Unix From line after an empty line)
+     # attachments stay inside each .eml; Tika pulls them out later
+     open(archive_filename) do |fh|
+       fh.slice_before(empty: true) do |line, h|
+         previous_was_empty = h[:empty]
+         h[:empty] = line == "\n"
+         previous_was_empty && line.start_with?("From ")
+       end.each_with_index do |mail, idx|
+         mail.pop if mail.last == "\n" # remove last line if present
+         yield "#{idx}.eml", lambda{|fn| open(fn, 'wb'){|out| out << mail.join } }
+       end
+     end
+   end
+
+   def self.split_zip(archive_filename)
+     Zip::File.open(archive_filename) do |zip_file|
+       zip_file.each do |entry|
+         yield entry.name, lambda{|fn| entry.extract(fn) }
+       end
+     end
+   end
+
+ end
data/lib/stevedore-uploader.rb ADDED
@@ -0,0 +1,340 @@
+ Dir["#{File.expand_path(File.dirname(__FILE__))}/../lib/*.rb"].each {|f| require f}
+ Dir["#{File.expand_path(File.dirname(__FILE__))}/../lib/parsers/*.rb"].each {|f| require f}
+
+ require 'rika'
+
+ require 'net/https'
+ require 'elasticsearch'
+ require 'elasticsearch/transport/transport/http/manticore'
+
+ require 'manticore'
+ require 'fileutils'
+ require 'csv'
+
+ require 'aws-sdk'
+
+ module Stevedore
+   class ESUploader
+     # creates blobs
+     attr_reader :errors, :s3_basepath # s3_basepath is read by bin/upload_to_elasticsearch.rb when uploading CSVs
+     attr_accessor :should_ocr, :slice_size
+
+     def initialize(es_host, es_index, s3_bucket=nil, s3_path=nil)
+       @errors = []
+       @client = Elasticsearch::Client.new({
+           log: false,
+           url: es_host,
+           transport_class: Elasticsearch::Transport::Transport::HTTP::Manticore,
+           request_timeout: 5*60,
+           socket_timeout: 60
+         },
+       )
+       @es_index = es_index
+       # parentheses matter: only fall back to FOLDER (set by bin/upload_to_elasticsearch.rb) when no bucket was given
+       @s3_bucket = s3_bucket || (FOLDER.downcase.include?('s3://') ? FOLDER.gsub(/s3:\/\//i, '').split("/", 2).first : 'int-data-dumps')
+       @s3_basepath = "https://#{@s3_bucket}.s3.amazonaws.com/#{s3_path || es_index}"
+
+       @slice_size = 100
+
+       @should_ocr = false
+
+       self.create_index!
+       self.create_mappings!
+     end
+
+     def create_index!
+       begin
+         @client.indices.create(
+           index: @es_index,
+           body: {
+             settings: {
+               analysis: {
+                 analyzer: {
+                   email_analyzer: {
+                     type: "custom",
+                     tokenizer: "email_tokenizer",
+                     filter: ["lowercase"]
+                   },
+                   snowball_analyzer: {
+                     type: "snowball",
+                     language: "English"
+                   }
+                 },
+                 tokenizer: {
+                   email_tokenizer: {
+                     type: "pattern",
+                     pattern: "([a-zA-Z0-9_\\.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-\\.]+)",
+                     group: "0"
+                   }
+                 }
+               }
+             },
+           })
+       # don't complain if the index already exists.
+       rescue Elasticsearch::Transport::Transport::Errors::BadRequest => e
+         raise e unless e.message && (e.message.include?("IndexAlreadyExistsException") || e.message.include?("already exists as alias"))
+       end
+     end
+
+     def create_mappings!
+       @client.indices.put_mapping({
+         index: @es_index,
+         type: :doc,
+         body: {
+           "_id" => {
+             path: "sha1"
+           },
+           properties: { # feel free to add more, this is the BARE MINIMUM the UI depends on
+             sha1: {type: :string, index: :not_analyzed},
+             title: { type: :string, analyzer: :keyword },
+             source_url: {type: :string, index: :not_analyzed},
+             modifiedDate: { type: :date, format: "dateOptionalTime" },
+             _updated_at: { type: :date },
+             analyzed: {
+               properties: {
+                 body: {
+                   type: :string,
+                   index_options: :offsets,
+                   term_vector: :with_positions_offsets,
+                   store: true,
+                   fields: {
+                     snowball: {
+                       type: :string,
+                       index: "analyzed",
+                       analyzer: 'snowball_analyzer',
+                       index_options: :offsets,
+                       term_vector: :with_positions_offsets,
+                     }
+                   }
+                 },
+                 metadata: {
+                   properties: {
+                     # "attachments" => {type: :string, index: :not_analyzed}, # might break stuff; intended to keep the index name (which often contains relevant search terms) from being indexed, e.g. if a user wants to search for 'bernie' in the bernie-burlington-emails
+                     "Message-From" => {
+                       type: "string",
+                       fields: {
+                         email: {
+                           type: "string",
+                           analyzer: "email_analyzer"
+                         },
+                         "Message-From" => {
+                           type: "string"
+                         }
+                       }
+                     },
+                     "Message-To" => {
+                       type: "string",
+                       fields: {
+                         email: {
+                           type: "string",
+                           analyzer: "email_analyzer"
+                         },
+                         "Message-To" => {
+                           type: "string"
+                         }
+                       }
+                     }
+                   }
+                 }
+               }
+             }
+           }
+         }
+       }) # was "rescue nil" but that obscured meaningful errors
+     end
+
+     def bulk_upload_to_es!(data)
+       return nil if data.empty?
+       begin
+         resp = @client.bulk body: data.map{|datum| {index: {_index: @es_index, _type: 'doc', data: datum }} }
+         puts resp if resp["errors"]
+       rescue JSON::GeneratorError
+         # if the batch as a whole can't be serialized, retry each document on its own, skipping any that still fail
+         data.each do |datum|
+           begin
+             @client.bulk body: [{index: {_index: @es_index, _type: 'doc', data: datum }}]
+           rescue JSON::GeneratorError
+             next
+           end
+         end
+         resp = nil
+       end
+       resp
+     end
+
+     def process_document(filename, filename_for_s3)
+       begin
+         puts "begin to process #{filename}"
+         # puts "size: #{File.size(filename)}"
+         begin
+           content, metadata = Rika.parse_content_and_metadata(filename)
+         rescue StandardError
+           content = "couldn't be parsed"
+           metadata = {} # so the Content-Type checks below quietly fall through to the generic blob case
+         end
+         puts "parsed: #{content.size}"
+         if content.size > 10 * (10 ** 6)
+           @errors << filename
+           puts "skipping #{filename} for being too big"
+           return nil
+         end
+
+         # TODO: factor these out in favor of the yield/block situation down below.
+         # this should (eventually) be totally generic, but perhaps handle common
+         # document types on its own
+         ret = case # .eml # .msg
+           when metadata["Content-Type"] == "message/rfc822" || metadata["Content-Type"] == "application/vnd.ms-outlook"
+             ::Stevedore::StevedoreEmail.new_from_tika(content, metadata, filename_for_s3, filename).to_hash
+           when metadata["Content-Type"] && ["application/html", "application/xhtml+xml"].include?(metadata["Content-Type"].split(";").first)
+             ::Stevedore::StevedoreHTML.new_from_tika(content, metadata, filename_for_s3, filename).to_hash
+           when @should_ocr && metadata["Content-Type"] == "application/pdf" && (content.match(/\A\s*\z/) || content.size < 50 * metadata["xmpTPg:NPages"].to_i)
+             # this is a scanned PDF.
+             puts "scanned PDF #{File.basename(filename)} detected; OCRing"
+             pdf_basename = filename.gsub(".pdf", '')
+             system("convert", "-monochrome", "-density", "300x300", filename, "-depth", '8', "#{pdf_basename}.png")
+             (Dir["#{pdf_basename}-*.png"] + Dir["#{pdf_basename}.png"]).sort_by{|png| (matchdata = png.match(/-(\d+)\.png/)).nil? ? 0 : matchdata[1].to_i }.each do |png|
+               system('tesseract', png, png, "pdf")
+               File.delete(png)
+               # no need to use a system call when we could use the stdlib!
+               # system("rm", "-f", png) rescue nil
+               File.delete("#{png}.txt") if File.exist?("#{png}.txt")
+             end
+             # e.g. Analysis-Corporation-2.png.pdf or Torture.pdf
+             files = Dir["#{pdf_basename}.png.pdf"] + (Dir["#{pdf_basename}-*.png.pdf"].sort_by{|pdf| Regexp.new("#{pdf_basename}-([0-9]+).png.pdf").match(pdf)[1].to_i })
+             system('pdftk', *files, "cat", "output", "#{pdf_basename}.ocr.pdf")
+             content, _ = Rika.parse_content_and_metadata("#{pdf_basename}.ocr.pdf")
+             puts "OCRed content (#{File.basename(filename)}) length: #{content.length}"
+             ::Stevedore::StevedoreBlob.new_from_tika(content, metadata, filename_for_s3, filename).to_hash
+           else
+             ::Stevedore::StevedoreBlob.new_from_tika(content, metadata, filename_for_s3, filename).to_hash
+         end
+         [ret, content, metadata]
+       rescue StandardError, java.lang.NoClassDefFoundError, org.apache.tika.exception.TikaException => e
+         STDERR.puts e.inspect
+         STDERR.puts "#{e} #{e.message}: #{filename}"
+         STDERR.puts e.backtrace.join("\n") + "\n\n\n"
+         @errors << filename
+         nil
+       end
+     end
+
+     def do_csv!(file, download_url, title_column=0, text_column=nil)
+       docs_so_far = 0
+       CSV.open(file, headers: !title_column.is_a?(Fixnum)).each_slice(@slice_size).each_with_index do |slice, slice_index|
+         slice_of_rows = slice.map.each_with_index do |row, i|
+           doc = ::Stevedore::StevedoreCsvRow.new(
+             row[title_column],
+             (row.respond_to?(:to_hash) ? (text_column.nil? ? row.to_hash.each_pair.map{|k, v| "#{k}: #{v}"}.join(" \n\n ") : row[text_column]) : row.to_a.join(" \n\n ")) + " \n\n csv_source: #{File.basename(file)}",
+             (@slice_size * slice_index) + i,
+             download_url,
+             row).to_hash
+           doc["analyzed"] ||= {}
+           doc["analyzed"]["metadata"] ||= {}
+           yield doc if block_given? && doc
+           doc
+         end
+         begin
+           resp = bulk_upload_to_es!(slice_of_rows.compact)
+           docs_so_far += @slice_size
+         rescue Manticore::Timeout, Manticore::SocketException
+           STDERR.puts("retrying at #{Time.now}")
+           retry
+         end
+         puts "uploaded #{slice_of_rows.size} rows to #{@es_index}; #{docs_so_far} uploaded so far"
+         puts "Errors in bulk upload: #{resp.inspect}" if resp && resp["errors"]
+       end
+     end
+
+     def do!(target_folder_path, output_stream=STDOUT)
+       output_stream.puts "Processing documents from #{target_folder_path}"
+
+       docs_so_far = 0
+
+       if target_folder_path.downcase.include?("s3://")
+         Dir.mktmpdir do |dir|
+           Aws.config.update({
+             region: 'us-east-1', # TODO should be configurable
+           })
+           s3 = Aws::S3::Resource.new
+
+           bucket = s3.bucket(@s3_bucket)
+           s3_path_without_bucket = target_folder_path.gsub(/s3:\/\//i, '').split("/", 2).last
+           bucket.objects(:prefix => s3_path_without_bucket).each_slice(@slice_size) do |slice_of_objs|
+             docs_so_far += slice_of_objs.size
+
+             output_stream.puts "starting a set of #{@slice_size} -- so far #{docs_so_far}"
+             slice_of_objs.map! do |obj|
+               next if obj.key[-1] == "/"
+               FileUtils.mkdir_p(File.join(dir, File.dirname(obj.key)))
+               tmp_filename = File.join(dir, obj.key)
+               begin
+                 body = obj.get.body.read
+                 File.open(tmp_filename, 'wb'){|f| f << body}
+               rescue Aws::S3::Errors::NoSuchKey
+                 @errors << obj.key
+               rescue ArgumentError
+                 # strip invalidly-encoded characters and try again
+                 File.open(tmp_filename, 'wb'){|f| f << (body.nil? ? '' : body.chars.select(&:valid_encoding?).join)}
+               end
+               download_filename = "https://#{@s3_bucket}.s3.amazonaws.com/" + obj.key
+               doc, content, metadata = process_document(tmp_filename, download_filename)
+               begin
+                 FileUtils.rm(tmp_filename)
+               rescue Errno::ENOENT
+                 # try to delete, but no biggie if it doesn't work for some weird reason.
+               end
+               yield doc, obj.key, content, metadata if block_given?
+               doc
+             end
+             begin
+               resp = bulk_upload_to_es!(slice_of_objs.compact)
+             rescue Manticore::Timeout, Manticore::SocketException
+               output_stream.puts("retrying at #{Time.now}")
+               retry
+             end
+             output_stream.puts "uploaded #{slice_of_objs.size} files to #{@es_index}; #{docs_so_far} uploaded so far"
+             output_stream.puts "Errors in bulk upload: #{resp.inspect}" if resp && resp["errors"]
+           end
+         end
+       else
+         Dir[target_folder_path + (target_folder_path.include?('*') ? '' : '/**/*')].each_slice(@slice_size) do |slice_of_files|
+           output_stream.puts "starting a set of #{@slice_size}"
+           docs_so_far += slice_of_files.size
+
+           slice_of_files.map! do |filename|
+             next unless File.file?(filename)
+
+             filename_basepath = filename.gsub(target_folder_path, '')
+             # an S3 bucket is always configured (it defaults to int-data-dumps), so point download links there;
+             # fall back to a local /files/ URL if it somehow isn't
+             if @s3_bucket
+               download_filename = @s3_basepath + filename_basepath
+             else
+               download_filename = "/files/#{@es_index}/#{filename_basepath}"
+             end
+
+             doc, content, metadata = process_document(filename, download_filename)
+             yield doc, filename, content, metadata if block_given?
+             doc
+           end
+           begin
+             puts "uploading"
+             resp = bulk_upload_to_es!(slice_of_files.compact)
+             puts resp.inspect if resp && resp["errors"]
+           rescue Manticore::Timeout, Manticore::SocketException => e
+             output_stream.puts e.inspect
+             output_stream.puts "Upload error: #{e} #{e.message}."
+             output_stream.puts e.backtrace.join("\n") + "\n\n\n"
+             output_stream.puts("retrying at #{Time.now}")
+             retry
+           end
+           output_stream.puts "uploaded #{slice_of_files.size} files to #{@es_index}; #{docs_so_far} uploaded so far"
+           output_stream.puts "Errors in bulk upload: #{resp.inspect}" if resp && resp["errors"]
+         end
+       end
+     end
+   end
+ end
metadata ADDED
@@ -0,0 +1,135 @@
+ --- !ruby/object:Gem::Specification
+ name: stevedore-uploader
+ version: !ruby/object:Gem::Version
+   version: 1.0.0
+ platform: java
+ authors:
+ - Jeremy B. Merrill
+ autorequire:
+ bindir: bin
+ cert_chain: []
+ date: 2016-04-19 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.0'
+   name: elasticsearch
+   prerelease: false
+   type: :runtime
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.0'
+ - !ruby/object:Gem::Dependency
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   name: manticore
+   prerelease: false
+   type: :runtime
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.9'
+   name: jruby-openssl
+   prerelease: false
+   type: :runtime
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.9'
+ - !ruby/object:Gem::Dependency
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '2'
+   name: aws-sdk
+   prerelease: false
+   type: :runtime
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '2'
+ - !ruby/object:Gem::Dependency
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: 1.6.1
+   name: rika-stevedore
+   prerelease: false
+   type: :runtime
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: 1.6.1
+ - !ruby/object:Gem::Dependency
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.6'
+   name: nokogiri
+   prerelease: false
+   type: :runtime
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '1.6'
+ description: TK
+ email: jeremy.merrill@nytimes.com
+ executables: []
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - README.md
+ - bin/upload_to_elasticsearch.rb
+ - lib/parsers/stevedore_blob.rb
+ - lib/parsers/stevedore_csv_row.rb
+ - lib/parsers/stevedore_email.rb
+ - lib/parsers/stevedore_html.rb
+ - lib/split_archive.rb
+ - lib/stevedore-uploader.rb
+ homepage: https://github.com/newsdev/stevedore-uploader
+ licenses:
+ - MIT
+ metadata: {}
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubyforge_project:
+ rubygems_version: 2.4.8
+ signing_key:
+ specification_version: 4
+ summary: Upload documents to a Stevedore search engine.
+ test_files: []