stevedore-uploader 1.0.0-java → 1.0.1-java
- checksums.yaml +4 -4
- data/README.md +3 -3
- data/bin/upload_to_elasticsearch.rb +1 -2
- data/lib/parsers/stevedore_email.rb +1 -1
- data/lib/split_archive.rb +81 -54
- data/lib/stevedore-uploader.rb +120 -91
- metadata +44 -2
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 3370d3545b000618352d42bf5437a536f9f9ce5e
+  data.tar.gz: 537acba2cff72af43b16e762f4082848cb548281
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: a5dbc4618ab5517701aa57c6ef09a5cb45dcdfe5bb495640e092e3796915be796313bc47f831fdc8ae3e6c2ece5af8dc0281f9ccc8595839e89bec8565e903a0
+  data.tar.gz: 5e9c8df53922d30e2748720f16ffa8c782ba4e978fbbade650da30e4da7833559494aa0b77c084ed230c96eba316d5412a485163d07b07c5ac6c108e03d0a180
data/README.md CHANGED
@@ -5,7 +5,7 @@ A tool for uploading documents into [Stevedore](https://github.com/newsdev/steve
 
 Stevedore is essentially an ElasticSearch endpoint with a customizable frontend attached to it. Stevedore's primary document store is ElasticSearch, so `stevedore-uploader`'s primary task is merely uploading documents to ElasticSearch, with a few attributes that Stevedore depends on. Getting a new document set ready for search requires a few steps, but this tool helps with the hardest one: Converting the documents you want to search into a format that ElasticSearch understands. Customizing the search interface is often not necessary, but if it is, information on how to do that is in the [Stevedore](https://github.com/newsdev/stevedore) repository.
 
-Every document processing job is different. Some might require OCR, others might require parsing e-mails, still others might call for sophisticated processing of text documents. There's no telling. That being the case, this project tries to make no assumptions about the type of data you'll be uploading -- but by default tries to convert everything into plaintext with [Apache Tika](https://tika.apache.org/).
+Every document processing job is different. Some might require OCR, others might require parsing e-mails, still others might call for sophisticated processing of text documents. There's no telling. That being the case, this project tries to make no assumptions about the type of data you'll be uploading -- but by default tries to convert everything into plaintext with [Apache Tika](https://tika.apache.org/). Stevedore distinguishes between a few default types, like emails and text blobs (and PRs adding new ones would be appreciated); for specialized types, the `do!` method takes a block that lets you modify the documents with just a few lines of Ruby.
 
 For more details on the entire workflow, see [Stevedore](https://github.com/newsdev/stevedore)
 
@@ -47,14 +47,14 @@ bundle exec ruby bin/upload_to_elasticsearch.rb --index=jrubytest --host=https:/
 If you choose to process documents from S3, you should upload those documents using your choice of tool -- but `awscli` is a good choice. *Stevedore-Uploader does NOT upload documents to S3 on your behalf.*
 
 If you need to process documents in a specialized, customized way, follow this example:
-
+````
 uploader = Stevedore::ESUploader.new(ES_HOST, ES_INDEX, S3_BUCKET, S3_PATH_PREFIX) # S3_BUCKET, S3_PATH_PREFIX are optional
 uploader.do! FOLDER do |doc, filename, content, metadata|
   next if doc.nil?
   doc["analyzed"]["metadata"]["date"] = Date.parse(File.basename(filename).split("_")[-2])
   doc["analyzed"]["metadata"]["title"] = my_title_getter_function(File.basename(filename))
 end
-
+````
 
 Questions?
 ==========
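For readers who want to try the customization hook above outside the repo, a self-contained sketch follows; the host, index, folder, and filename scheme are illustrative assumptions, not values the gem ships with:

```ruby
require 'date'
require 'stevedore-uploader'

# hypothetical host and index -- substitute your own
uploader = Stevedore::ESUploader.new("localhost:9200", "my-documents")

uploader.do!("/tmp/my-documents") do |doc, filename, content, metadata|
  next if doc.nil? # documents that failed to parse arrive as nil
  # assumes filenames like "report_2016-05-26_final.pdf"
  doc["analyzed"]["metadata"]["date"]  = Date.parse(File.basename(filename).split("_")[-2])
  doc["analyzed"]["metadata"]["title"] = File.basename(filename, ".*")
end
```

The block runs once per document, after Tika extraction but before upload, so whatever it writes into `doc` is what gets indexed.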
data/bin/upload_to_elasticsearch.rb CHANGED
@@ -45,7 +45,7 @@ if __FILE__ == $0
       options.text_column = text_column
     end
 
-    opts.on("-o", "--[no-]ocr", "
+    opts.on("-o", "--[no-]ocr", "don't attempt to OCR any PDFs, even if they contain no text") do |v|
       options.ocr = v
     end
 
@@ -81,7 +81,6 @@ ES_INDEX = if options.index.nil? || options.index == ''
 end
 
 S3_BUCKET = FOLDER.downcase.include?('s3://') ? FOLDER.gsub(/s3:\/\//i, '').split("/", 2).first : options.s3bucket
-raise ArgumentError, 's3 buckets other than int-data-dumps aren\'t supported by the frontend yet' if S3_BUCKET != 'int-data-dumps'
 ES_HOST = options.host || "localhost:9200"
 S3_PATH = options.s3path || options.index
 S3_BASEPATH = "https://#{S3_BUCKET}.s3.amazonaws.com/#{S3_PATH}"
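The restored help text rides on OptionParser's negatable-switch form: declaring `--[no-]ocr` makes the parser accept both `--ocr` (the block receives `true`) and `--no-ocr` (the block receives `false`). A minimal sketch of that stdlib behavior, independent of the uploader:

```ruby
require 'optparse'
require 'ostruct'

options = OpenStruct.new
OptionParser.new do |opts|
  opts.on("-o", "--[no-]ocr", "don't attempt to OCR any PDFs, even if they contain no text") do |v|
    options.ocr = v # true for -o/--ocr, false for --no-ocr
  end
end.parse!(["--no-ocr"])

p options.ocr # => false
```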
data/lib/parsers/stevedore_email.rb CHANGED
@@ -15,7 +15,7 @@ module Stevedore
       t.message_to = metadata["Message-To"]
       t.message_from = metadata["Message-From"]
       t.message_cc = metadata["Message-Cc"]
-      t.subject = metadata["subject"]
+      t.title = t.subject = metadata["subject"]
       t.attachments = metadata["X-Attachments"].to_s.split("|").map do |raw_attachment_filename|
         attachment_filename = CGI::unescape(raw_attachment_filename)
         possible_filename = File.join(File.dirname(filepath), attachment_filename)
data/lib/split_archive.rb CHANGED
@@ -1,77 +1,104 @@
-# splits zip, mbox and pst files into their constituent documents -- mesages and attachments
+# splits zip, mbox (and eventually pst) files into their constituent documents -- messages and attachments
 # and puts them into a tmp folder
 # which is then parsed normally
-
+
 require 'tmpdir'
-require '
+require 'mail'
 require 'zip'
+require 'pst' # for PST files
+
 
 # splits PST and Mbox formats
+module Stevedore
+  class ArchiveSplitter
+    HANDLED_FORMATS = ["zip", "mbox", "pst"]
 
-
-
-
-
-
-
-
-
+    def self.split(archive_filename)
+      # if it's a PST use split_pst
+      # if it's an mbox, use split_mbox, etc.
+      # return a list of files
+      Enumerator.new do |yielder|
+        Dir.mktmpdir do |tmpdir|
+          #TODO should probably do magic byte searching etc.
+          extension = archive_filename.split(".")[-1]
+          puts "splitting #{archive_filename}"
+          constituent_files = if extension == "mbox"
+            self.split_mbox(archive_filename)
+          elsif extension == "pst"
+            self.split_pst(archive_filename)
+          elsif extension == "zip"
+            self.split_zip(archive_filename)
+          end
+          # should yield a relative filename
+          # and a lambda that will write the file contents to the given filename
+          FileUtils.mkdir_p(File.join(tmpdir, File.basename(archive_filename)))
 
-
-
-
-
-
-
-
-
-
+          constituent_files.each_with_index do |basename_contents_lambda, idx|
+            basename, contents_lambda = *basename_contents_lambda
+            tmp_filename = File.join(tmpdir, File.basename(archive_filename), basename.gsub("/", ""))
+            contents_lambda.call(tmp_filename)
+            yielder.yield tmp_filename, File.join(File.basename(archive_filename), basename)
+          end
+        end
+      end
+    end
 
-
-
+    def self.split_pst(archive_filename)
+      pstfile = Java::ComPFF::PSTFile.new(archive_filename)
+      idx = 0
+      folders = pstfile.root.sub_folders.inject({}) do |memo, f|
+        memo[f.name] = f
+        memo
       end
-
-
-
+      Enumerator.new do |yielder|
+        folders.each do |folder_name, folder|
+          while mail = folder.getNextChild
 
-
+            eml_str = mail.get_transport_message_headers + mail.get_body
 
-
-
-
-
-
-
-
+            yielder << ["#{idx}.eml", lambda{|fn| open(fn, 'wb'){|fh| fh << eml_str } }]
+            attachment_count = mail.get_number_of_attachments
+            attachment_count.times do |attachment_idx|
+              attachment = mail.get_attachment(attachment_idx)
+              attachment_filename = attachment.get_filename
+              yielder << ["#{idx}-#{attachment_filename}", lambda {|fn| open(fn, 'wb'){ |fh| fh << attachment.get_file_input_stream.to_io.read }}]
+            end
+            idx += 1
+          end
+        end
       end
     end
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+    def self.split_mbox(archive_filename)
+      # stolen shamelessly from the Ruby Enumerable docs, actually
+      # split mails in mbox (slice before Unix From line after an empty line)
+      Enumerator.new do |yielder|
+        open(archive_filename) do |fh|
+          fh.slice_before(empty: true) do |line, h|
+            previous_was_empty = h[:empty]
+            h[:empty] = line == "\n" || line == "\r\n" || line == "\r"
+            previous_was_empty && line.start_with?("From ")
+          end.each_with_index do |mail_str, idx|
+            mail_str.pop if mail_str.last == "\n" # remove last line if present
+            yielder << ["#{idx}.eml", lambda{|fn| open(fn, 'wb'){|fh| fh << mail_str.join("") } }]
+            mail = Mail.new mail_str.join("")
+            mail.attachments.each do |attachment|
+              yielder << [attachment.filename, lambda{|fn| open(fn, 'wb'){|fh| attachment.save fh }}]
+            end
+          end
+        end
+      end
+    end
-end
 
-
-
-
-
+    def self.split_zip(archive_filename)
+      Zip::File.open(archive_filename) do |zip_file|
+        Enumerator.new do |yielder|
+          zip_file.each do |entry|
+            yielder << [entry.name, lambda{|fn| entry.extract(fn) }]
+          end
+        end
      end
    end
-end
 
+  end
 end
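Each `split_*` method returns a lazy `Enumerator` of `[basename, lambda]` pairs, and `split` materializes those into temp files, yielding `[tmp_filename, relative_name]` pairs to the caller. A sketch of how a caller might consume it; the mbox path is hypothetical:

```ruby
require_relative 'split_archive'

# Iteration drives the Enumerator: each message or attachment is written to a
# temp file, then yielded along with a name scoped to the archive, e.g.
# "mail.mbox/0.eml". The temp directory is cleaned up when iteration finishes,
# so process each file inside the block.
Stevedore::ArchiveSplitter.split("/tmp/mail.mbox").each do |tmp_filename, relative_name|
  puts "extracted #{relative_name} -> #{tmp_filename}"
end
```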
data/lib/stevedore-uploader.rb CHANGED
@@ -20,7 +20,7 @@ module Stevedore
   class ESUploader
     #creates blobs
     attr_reader :errors
-    attr_accessor :should_ocr, :slice_size
+    attr_accessor :should_ocr, :slice_size, :client
 
     def initialize(es_host, es_index, s3_bucket=nil, s3_path=nil)
       @errors = []
@@ -33,16 +33,15 @@ module Stevedore
         },
       )
       @es_index = es_index
-      @s3_bucket = s3_bucket
+      @s3_bucket = s3_bucket #|| (Stevedore::ESUploader.const_defined?(FOLDER) && FOLDER.downcase.include?('s3://') ? FOLDER.gsub(/s3:\/\//i, '').split("/", 2).first : nil)
       @s3_basepath = "https://#{s3_bucket}.s3.amazonaws.com/#{s3_path || es_index}"
 
-
       @slice_size = 100
 
       @should_ocr = false
 
       self.create_index!
-      self.
+      self.add_mapping(:doc, MAPPING)
     end
 
     def create_index!
@@ -80,82 +79,28 @@ module Stevedore
       end
     end
 
-    def
+    def add_mapping(type, mapping)
       @client.indices.put_mapping({
         index: @es_index,
-        type:
+        type: type,
         body: {
           "_id" => {
-            path: "
+            path: "id"
           },
-          properties:
-            sha1: {type: :string, index: :not_analyzed},
-            title: { type: :string, analyzer: :keyword },
-            source_url: {type: :string, index: :not_analyzed},
-            modifiedDate: { type: :date, format: "dateOptionalTime" },
-            _updated_at: { type: :date },
-            analyzed: {
-              properties: {
-                body: {
-                  type: :string,
-                  index_options: :offsets,
-                  term_vector: :with_positions_offsets,
-                  store: true,
-                  fields: {
-                    snowball: {
-                      type: :string,
-                      index: "analyzed",
-                      analyzer: 'snowball_analyzer' ,
-                      index_options: :offsets,
-                      term_vector: :with_positions_offsets,
-                    }
-                  }
-                },
-                metadata: {
-                  properties: {
-                    # "attachments" => {type: :string, index: :not_analyzed}, # might break stuff; intended to keep the index name (which often contains relevant search terms) from being indexed, e.g. if a user wants to search for 'bernie' in the bernie-burlington-emails
-                    "Message-From" => {
-                      type: "string",
-                      fields: {
-                        email: {
-                          type: "string",
-                          analyzer: "email_analyzer"
-                        },
-                        "Message-From" => {
-                          type: "string"
-                        }
-                      }
-                    },
-                    "Message-To" => {
-                      type: "string",
-                      fields: {
-                        email: {
-                          type: "string",
-                          analyzer: "email_analyzer"
-                        },
-                        "Message-To" => {
-                          type: "string"
-                        }
-                      }
-                    }
-                  }
-                }
-              }
-            }
-          }
+          properties: mapping
         }
       }) # was "rescue nil" but that obscured meaningful errors
     end
 
-    def bulk_upload_to_es!(data)
+    def bulk_upload_to_es!(data, type=nil)
       return nil if data.empty?
       begin
-        resp = @client.bulk body: data.map{|datum| {index: {_index: @es_index, _type: 'doc', data: datum }} }
+        resp = @client.bulk body: data.map{|datum| {index: {_index: @es_index, _type: type || 'doc', data: datum }} }
         puts resp if resp[:errors]
       rescue JSON::GeneratorError
         data.each do |datum|
           begin
-            @client.bulk body: [datum].map{|datum| {index: {_index: @es_index, _type: 'doc', data: datum }} }
+            @client.bulk body: [datum].map{|datum| {index: {_index: @es_index, _type: type || 'doc', data: datum }} }
           rescue JSON::GeneratorError
             next
           end
@@ -166,8 +111,6 @@ module Stevedore
     end
 
     def process_document(filename, filename_for_s3)
-
-
       begin
         puts "begin to process #{filename}"
         # puts "size: #{File.size(filename)}"
@@ -183,6 +126,7 @@ module Stevedore
           puts "skipping #{filename} for being too big"
           return nil
         end
+        puts metadata["Content-Type"].inspect
 
        # TODO: factor these out in favor of the yield/block situation down below.
        # this should (eventually) be totally generic, but perhaps handle common
@@ -206,6 +150,7 @@ module Stevedore
          end.join("\n\n")
          # e.g. Analysis-Corporation-2.png.pdf or Torture.pdf
          files = Dir["#{pdf_basename}.png.pdf"] + (Dir["#{pdf_basename}-*.png.pdf"].sort_by{|pdf| Regexp.new("#{pdf_basename}-([0-9]+).png.pdf").match(pdf)[1].to_i })
+          return nil if files.empty?
          system('pdftk', *files, "cat", "output", "#{pdf_basename}.ocr.pdf")
          content, _ = Rika.parse_content_and_metadata("#{pdf_basename}.ocr.pdf")
          puts "OCRed content (#{File.basename(filename)}) length: #{content.length}"
@@ -251,12 +196,14 @@ module Stevedore
       end
     end
 
-    def do!(
-      output_stream.puts "Processing documents from #{
+    def do!(target_path, output_stream=STDOUT)
+      output_stream.puts "Processing documents from #{target_path}"
 
       docs_so_far = 0
+      # use_s3 = false # option to set this (an option to set document URLs to be relative to the search engine root) is TK
+      @s3_bucket = target_path.gsub(/s3:\/\//i, '').split("/", 2).first if @s3_bucket.nil? && target_path.downcase.include?('s3://')
 
-      if
+      if target_path.downcase.include?("s3://")
        Dir.mktmpdir do |dir|
          Aws.config.update({
            region: 'us-east-1', # TODO should be configurable
@@ -264,7 +211,7 @@ module Stevedore
          s3 = Aws::S3::Resource.new
 
          bucket = s3.bucket(@s3_bucket)
-          s3_path_without_bucket =
+          s3_path_without_bucket = target_path.gsub(/s3:\/\//i, '').split("/", 2).last
          bucket.objects(:prefix => s3_path_without_bucket).each_slice(@slice_size) do |slice_of_objs|
            docs_so_far += slice_of_objs.size
 
@@ -282,17 +229,30 @@ module Stevedore
              File.open(tmp_filename, 'wb'){|f| f << body.nil? ? '' : body.chars.select(&:valid_encoding?).join}
            end
            download_filename = "https://#{@s3_bucket}.s3.amazonaws.com/" + obj.key
-
-
-
-
-
+
+            # is this file an archive that contains a bunch of documents we should index separately?
+            # obviously, there is not a strict definition here.
+            # emails in mailboxes are split into an email and attachments
+            # but, for now, standalone emails are treated as one document
+            # PDFs can (theoretically) contain documents as "attachments" -- those aren't handled here either.
+            if ArchiveSplitter::HANDLED_FORMATS.include?(tmp_filename.split(".")[-1])
+              ArchiveSplitter.split(tmp_filename).map do |constituent_file, constituent_basename|
+                doc, content, metadata = process_document(constituent_file, download_filename)
+                doc["sha1"] = Digest::SHA1.hexdigest(download_filename + File.basename(constituent_basename)) # since these files all share a download URL (that of the archive), we need to come up with a custom sha1
+                yield doc, obj.key, content, metadata if block_given?
+                FileUtils.rm(constituent_file) rescue Errno::ENOENT # try to delete, but no biggie if it doesn't work for some weird reason.
+                doc
+              end
+            else
+              doc, content, metadata = process_document(tmp_filename, download_filename)
+              yield doc, obj.key, content, metadata if block_given?
+              FileUtils.rm(tmp_filename) rescue Errno::ENOENT # try to delete, but no biggie if it doesn't work for some weird reason.
+              [doc]
            end
-            yield doc, obj.key, content, metadata if block_given?
-            doc
          end
          begin
-            resp = bulk_upload_to_es!(slice_of_objs.compact)
+            resp = bulk_upload_to_es!(slice_of_objs.compact.flatten(1)) # flatten, in case there's an archive
+            puts resp.inspect if resp && resp["errors"]
          rescue Manticore::Timeout, Manticore::SocketException
            output_stream.puts("retrying at #{Time.now}")
            retry
@@ -302,28 +262,42 @@ module Stevedore
          end
        end
      else
-
+        list_of_files = File.file?(target_path) ? [target_path] : Dir[File.join(target_path, target_path.include?('*') ? '' : '**/*')]
+        list_of_files.each_slice(@slice_size) do |slice_of_files|
          output_stream.puts "starting a set of #{@slice_size}"
          docs_so_far += slice_of_files.size
 
          slice_of_files.map! do |filename|
            next unless File.file?(filename)
-
-
-            if use_s3
+            filename_basepath = filename.gsub(target_path, '')
+            # if use_s3 # turning this on TK
            download_filename = @s3_basepath + filename_basepath
+            # else
+            #   download_filename = "/files/#{@es_index}/#{filename_basepath}"
+            # end
+
+            # is this file an archive that contains a bunch of documents we should index separately?
+            # obviously, there is not a strict definition here.
+            # emails in mailboxes are split into an email and attachments
+            # but, for now, standalone emails are treated as one document
+            # PDFs can (theoretically) contain documents as "attachments" -- those aren't handled here either.
+            if ArchiveSplitter::HANDLED_FORMATS.include?(filename.split(".")[-1])
+              ArchiveSplitter.split(filename).map do |constituent_file, constituent_basename|
+                doc, content, metadata = process_document(constituent_file, download_filename)
+                doc["sha1"] = Digest::SHA1.hexdigest(download_filename + File.basename(constituent_basename)) # since these files all share a download URL (that of the archive), we need to come up with a custom sha1
+                yield doc, filename, content, metadata if block_given?
+                # FileUtils.rm(constituent_file) rescue Errno::ENOENT # try to delete, but no biggie if it doesn't work for some weird reason.
+                doc
+              end
            else
-
+              doc, content, metadata = process_document(filename, download_filename )
+              yield doc, filename, content, metadata if block_given?
+              [doc]
            end
-
-            doc, content, metadata = process_document(filename, download_filename )
-            yield doc, filename, content, metadata if block_given?
-            doc
          end
          begin
-
-            resp
-            puts resp.inspect if JSON.parse(resp)["errors"]
+            resp = bulk_upload_to_es!(slice_of_files.compact.flatten(1)) # flatten, in case there's an archive
+            puts resp.inspect if resp && resp["errors"]
          rescue Manticore::Timeout, Manticore::SocketException => e
            output_stream.puts e.inspect
            output_stream.puts "Upload error: #{e} #{e.message}."
@@ -337,4 +311,59 @@ module Stevedore
      end
    end
  end
+  MAPPING = {
+    sha1: {type: :string, index: :not_analyzed},
+    title: { type: :string, analyzer: :keyword },
+    source_url: {type: :string, index: :not_analyzed},
+    modifiedDate: { type: :date, format: "dateOptionalTime" },
+    _updated_at: { type: :date },
+    analyzed: {
+      properties: {
+        body: {
+          type: :string,
+          index_options: :offsets,
+          term_vector: :with_positions_offsets,
+          store: true,
+          fields: {
+            snowball: {
+              type: :string,
+              index: "analyzed",
+              analyzer: 'snowball_analyzer',
+              index_options: :offsets,
+              term_vector: :with_positions_offsets,
+            }
+          }
+        },
+        metadata: {
+          properties: {
+            # "attachments" => {type: :string, index: :not_analyzed}, # might break stuff; intended to keep the index name (which often contains relevant search terms) from being indexed, e.g. if a user wants to search for 'bernie' in the bernie-burlington-emails
+            "Message-From" => {
+              type: "string",
+              fields: {
+                email: {
+                  type: "string",
+                  analyzer: "email_analyzer"
+                },
+                "Message-From" => {
+                  type: "string"
+                }
+              }
+            },
+            "Message-To" => {
+              type: "string",
+              fields: {
+                email: {
+                  type: "string",
+                  analyzer: "email_analyzer"
+                },
+                "Message-To" => {
+                  type: "string"
+                }
+              }
+            }
+          }
+        }
+      }
+    }
+  }
 end
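With the mapping hoisted into the `MAPPING` constant and a `type` parameter threaded through `add_mapping` and `bulk_upload_to_es!`, callers can register and upload a second document type beside the default `doc`. A hedged sketch; the host, index, `memo` type, and its fields are made-up examples:

```ruby
require 'stevedore-uploader'

uploader = Stevedore::ESUploader.new("localhost:9200", "my-index")

# register a hypothetical "memo" type with a minimal mapping
uploader.add_mapping(:memo, {
  title: { type: :string, analyzer: :keyword },
  sha1:  { type: :string, index: :not_analyzed }
})

# the new type argument routes these documents to _type "memo" instead of "doc"
uploader.bulk_upload_to_es!(
  [{ "id" => "memo-1", "title" => "An example memo", "sha1" => "da39a3ee" }],
  :memo
)
```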
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: stevedore-uploader
 version: !ruby/object:Gem::Version
-  version: 1.0.0
+  version: 1.0.1
 platform: java
 authors:
 - Jeremy B. Merrill
 autorequire: 
 bindir: bin
 cert_chain: []
-date: 2016-
+date: 2016-05-26 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   requirement: !ruby/object:Gem::Requirement
@@ -94,6 +94,48 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '1.6'
+- !ruby/object:Gem::Dependency
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 0.0.2
+  name: pst
+  prerelease: false
+  type: :runtime
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 0.0.2
+- !ruby/object:Gem::Dependency
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.6'
+  name: mail
+  prerelease: false
+  type: :runtime
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.6'
+- !ruby/object:Gem::Dependency
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.2'
+  name: rubyzip
+  prerelease: false
+  type: :runtime
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.2'
 description: TK
 email: jeremy.merrill@nytimes.com
 executables: []
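The three new runtime dependencies correspond to the `require`s added in `split_archive.rb`. For local development, the equivalent Gemfile lines would look like this (a sketch; the gemspec above is authoritative):

```ruby
# Gemfile (JRuby) -- mirrors the new runtime dependencies declared above
gem 'pst',     '~> 0.0.2' # reads Outlook PST archives
gem 'mail',    '~> 2.6'   # parses mbox messages and their attachments
gem 'rubyzip', '~> 1.2'   # extracts zip entries
```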