RubyGems - word-to-markdown - Versions diffs - 1.1.7 → 1.1.8 - Mend

word-to-markdown 1.1.7 → 1.1.8

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (11)

checksums.yaml +5 -5
data/LICENSE.md +21 -0
data/README.md +78 -0
data/bin/w2m +4 -3
data/lib/cliver/dependency_ext.rb +6 -4
data/lib/nokogiri/xml/element.rb +4 -3
data/lib/word-to-markdown.rb +65 -56
data/lib/word-to-markdown/converter.rb +45 -29
data/lib/word-to-markdown/document.rb +44 -38
data/lib/word-to-markdown/version.rb +3 -1
metadata +49 -33

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
-SHA1:
-  metadata.gz: 9d852a4ac53478b17c5336883ec861ea52bfac83
-  data.tar.gz: 4505d3165e6e167a5908cf277d9c28d42b8b66e5
+SHA256:
+  metadata.gz: 3febb4398acdc4eacedcc62e09f4beeaee625858043a27e6df7e597fee1e0d17
+  data.tar.gz: 7a76057aeca2db8f321282bc309a835f282ea921343b28c8bce6d83cd0fc4582
 SHA512:
-  metadata.gz: b14cf7b9341a1f779b0943c050822bd8c2c358851e99c0e712722f6090f8d7d8ca3f2e1c79ded4663b51fca3706c8f490e1bcdda99d77d85e7aa0c0895615597
-  data.tar.gz: d3ea8bb028083f5e9752f5989700c98de481e654e2a8188efc7d9741049605036c36b26fd17df6ef3d9aca2ef7ef358c736e589af980dea98b12be8f08025868
+  metadata.gz: ee2340688c2d5f3f21c7e47f85220bebc88201e28200a682146041fa7f5a47e89c56d687b4f6565a95d7d3e7f4c70fb1551894ad297620e7bd49cef520975e18
+  data.tar.gz: 44d405387990ee9a09cb33572a1d7843461c0817789c92a08df6d3542082ffc837bbd39d544a2e6a734a0772c67da3afed77049dc96b0ace0d562473515816e1
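The checksum block now records SHA256 digests where 1.1.7 recorded SHA1. As a minimal sketch of how such a digest could be checked locally, the helper below hashes a file with Ruby's standard `Digest::SHA256` and compares it to an expected hex string. The helper name `sha256_matches?` and the temporary file are hypothetical and only illustrate the comparison; they are not part of the gem or of RubyGems.

```ruby
require 'digest'
require 'tempfile'

# Hypothetical helper: true when the file's SHA256 hex digest equals `expected`.
def sha256_matches?(path, expected)
  Digest::SHA256.file(path).hexdigest == expected.downcase
end

# Demonstration on a throwaway file rather than a real gem archive.
DEMO_RESULT = Tempfile.create('checksum-demo') do |file|
  file.write('hello')
  file.flush
  sha256_matches?(file.path, Digest::SHA256.hexdigest('hello'))
end

puts DEMO_RESULT
```

For a real check, the path would point at an unpacked `data.tar.gz` and `expected` would be the value listed under `SHA256:` above.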

data/LICENSE.md ADDED

@@ -0,0 +1,21 @@
+The MIT License (MIT)
+Copyright (c) 2014, Ben Balter
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,78 @@
+# Word to Markdown converter
+A Ruby gem to liberate content from [the jail that is Word documents](http://ben.balter.com/2012/10/19/we-ve-been-trained-to-make-paper/#jailbreaking-content)
+[![Build Status](https://travis-ci.org/benbalter/word-to-markdown.svg?branch=master)](https://travis-ci.org/benbalter/word-to-markdown) [![Gem Version](https://badge.fury.io/rb/word-to-markdown.png)](http://badge.fury.io/rb/word-to-markdown) [![Inline docs](http://inch-ci.org/github/benbalter/word-to-markdown.png)](http://inch-ci.org/github/benbalter/word-to-markdown) [![Build status](https://ci.appveyor.com/api/projects/status/x2gnsfvli3q47a2e/branch/master?svg=true)](https://ci.appveyor.com/project/benbalter/word-to-markdown/branch/master)
+## The problem
+> Our default content publishing workflow is terribly broken. [We've all been trained to make paper](http://ben.balter.com/2012/10/19/we-ve-been-trained-to-make-paper/), yet today, content authored once is more commonly consumed in multiple formats, and rarely, if ever, does it embody physical form. Put another way, our go-to content authoring workflow remains relatively unchanged since it was conceived in the early 80s.
+>
+> I'm asked regularly by government employees — knowledge workers who fire up a desktop word processor as the first step to any project — for an automated pipeline to convert Microsoft Word documents to [Markdown](http://guides.github.com/overviews/mastering-markdown/), the *lingua franca* of the internet, but as my recent foray into building [just such a converter](http://word-to-markdown.herokuapp.com/) proves, it's not that simple.
+>
+> Markdown isn't just an alternative format. Markdown forces you to write for the web.
+**[Read more](http://ben.balter.com/2014/03/31/word-versus-markdown-more-than-mere-semantics/)**
+**[Demo](http://word-to-markdown.herokuapp.com/)**
+## Install
+You'll need to install [LibreOffice](http://www.libreoffice.org/). Then:
+```bash
+gem install word-to-markdown
+```
+## Usage
+```ruby
+file = WordToMarkdown.new("/path/to/document.docx")
+=> <WordToMarkdown path="/path/to/document.docx">
+file.to_s
+=> "# Test\n\n This is a test"
+file.document.tree
+=> <Nokogiri Document>
+```
+### Command line usage
+Once you've installed the gem, it's just:
+```
+$ w2m path/to/document.docx
+```
+*Outputs the resulting markdown to stdout*
+## Supports
+* Paragraphs
+* Numbered lists
+* Unnumbered lists
+* Nested lists
+* Italic
+* Bold
+* Explicit headings (e.g., selected as "Heading 1" or "Heading 2")
+* Implicit headings (e.g., text with a larger font size relative to paragraph text)
+* Images
+* Tables
+* Hyperlinks
+## Requirements and configuration
+Word-to-markdown requires `soffice`, a command line interface to LibreOffice that works on Linux, Mac, and Windows. To install soffice, see [the LibreOffice documentation](https://www.libreoffice.org/get-help/install-howto/).
+## Testing
+```
+script/cibuild
+```
+## Server
+[Word-to-markdown-demo](https://github.com/benbalter/word-to-markdown-demo) contains a lightweight server for converting Word Documents as a service.
+A live version runs at [word-to-markdown.herokuapp.com](http://word-to-markdown.herokuapp.com).

data/bin/w2m CHANGED

@@ -1,13 +1,14 @@
 #!/usr/bin/env ruby
+# frozen_string_literal: true
 require 'word-to-markdown'
-if ARGV.size != 1 || ARGV[0] == "--help"
-  puts "Usage: bundle exec w2m path/to/document.docx"
+if ARGV.size != 1 || ARGV[0] == '--help'
+  puts 'Usage: bundle exec w2m path/to/document.docx'
   exit 1
 end
-if ARGV[0] == "--version"
+if ARGV[0] == '--version'
   puts "WordToMarkdown v#{WordToMarkdown::VERSION}"
   puts "LibreOffice v#{WordToMarkdown.soffice.version}" unless Gem.win_platform?
 else

data/lib/cliver/dependency_ext.rb CHANGED

@@ -1,16 +1,18 @@
+# frozen_string_literal: true
 require 'sys/proctable'
 module Cliver
   class Dependency
     include Sys
     # Memoized shortcut for detect
     # Returns the path to the detected dependency
     # Raises an error if the dependency was not satisfied
-    def path
+    def detected_path
       @detected_path ||= detect!
     end
+    alias path detected_path
     # Is the detected dependency currently open?
     def open?
@@ -24,12 +26,12 @@ module Cliver
     def version
       return @detected_version if defined? @detected_version
       return if Gem.win_platform?
-      version = installed_versions.find { |p, v| p == path }
+      version = installed_versions.find { |p, _v| p == path }
       @detected_version = version.nil? ? nil : version[1]
     end
     def major_version
-      version.split(".").first if version
+      version.split('.').first if version
     end
   end
 end
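The hunk above renames `path` to `detected_path` and keeps `path` as an alias, so existing callers such as `soffice.path` continue to work. A minimal sketch of that pattern follows, using a stand-in class; `FakeDependency`, its counter, and the returned path are hypothetical and only illustrate the memoization and the alias.

```ruby
# Stand-in class (hypothetical); the real method lives in Cliver::Dependency.
class FakeDependency
  attr_reader :detect_calls

  def initialize
    @detect_calls = 0
  end

  # Pretend detection step that counts how often it runs.
  def detect!
    @detect_calls += 1
    '/usr/bin/soffice'
  end

  # Memoized lookup, named after the instance variable it fills.
  def detected_path
    @detected_path ||= detect!
  end
  alias path detected_path # keep the old name working for existing callers
end

dep = FakeDependency.new
puts dep.path          # resolved through the alias
puts dep.detected_path # same memoized value
puts dep.detect_calls  # detect! ran only once
```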

data/lib/nokogiri/xml/element.rb CHANGED

@@ -1,7 +1,8 @@
+# frozen_string_literal: true
 module Nokogiri
   module XML
     class Element
       DEFAULT_FONT_SIZE = 12.to_f
       # The node's font size
@@ -13,11 +14,11 @@ module Nokogiri
       end
       def bold?
-        styles['font-weight'] && styles['font-weight'] == "bold"
+        styles['font-weight'] && styles['font-weight'] == 'bold'
       end
       def italic?
-        styles['font-style'] && styles['font-style'] == "italic"
+        styles['font-style'] && styles['font-style'] == 'italic'
       end
     end
   end

data/lib/word-to-markdown.rb CHANGED

@@ -1,3 +1,5 @@
+# frozen_string_literal: true
 require 'descriptive_statistics'
 require 'reverse_markdown'
 require 'nokogiri-styles'
@@ -16,86 +18,93 @@ require_relative 'nokogiri/xml/element'
 require_relative 'cliver/dependency_ext'
 class WordToMarkdown
   attr_reader :document, :converter
+  # Options to be passed to Reverse Markdown
   REVERSE_MARKDOWN_OPTIONS = {
-    unknown_tags: :bypass,
+    unknown_tags:    :bypass,
     github_flavored: true
-  }
+  }.freeze
-  SOFFICE_VERSION_REQUIREMENT = '> 4.0'
+  # Minimum version of LibreOffice Required
+  SOFFICE_VERSION_REQUIREMENT = '> 4.0'.freeze
+  # Paths to look for LibreOffice, in order of preference
   PATHS = [
-    "*", # Sub'd for ENV["PATH"]
-    "~/Applications/LibreOffice.app/Contents/MacOS",
-    "/Applications/LibreOffice.app/Contents/MacOS",
-    "/Program Files/LibreOffice 5/program",
-    "/Program Files (x86)/LibreOffice 4/program"
-  ]
+    '*', # Sub'd for ENV["PATH"]
+    '~/Applications/LibreOffice.app/Contents/MacOS',
+    '/Applications/LibreOffice.app/Contents/MacOS',
+    '/Program Files/LibreOffice 5/program',
+    '/Program Files (x86)/LibreOffice 4/program'
+  ].freeze
   # Create a new WordToMarkdown object
   #
-  # input - a HTML string or path to an HTML file
-  #
-  # Returns the WordToMarkdown object
+  # @param path [string] Path to the Word document
+  # @param tmpdir [string] Path to a working directory to use
+  # @return [WordToMarkdown] WordToMarkdown object with the converted document
   def initialize(path, tmpdir = nil)
     @document = WordToMarkdown::Document.new path, tmpdir
     @converter = WordToMarkdown::Converter.new @document
     converter.convert!
   end
-  def self.run_command(*args)
-    raise "LibreOffice already running" if soffice.open?
-    output, status = Open3.capture2e(soffice.path, *args)
-    logger.debug output
-    raise "Command `#{soffice.path} #{args.join(" ")}` failed: #{output}" if status.exitstatus != 0
-    output
+  # Helper method to return the document body, as markdown
+  # @return [string] the document body, as markdown
+  def to_s
+    document.to_s
   end
-  # Returns a Cliver::Dependency object representing our soffice dependency
-  #
-  # Attempts to resolve by looking at PATH followed by paths in the PATHS constant
-  #
-  # Methods used internally:
-  #   path    - returns the resolved path. Raises an error if not satisfied
-  #   version - returns the resolved version
-  #   open    - is the dependency currently open/running?
-  def self.soffice
-    @@soffice_dependency ||= Cliver::Dependency.new("soffice", *soffice_dependency_args)
-  end
+  class << self
+    # Run an soffice command
+    #
+    # @param args [string] one or more arguments to pass to the soffice command
+    # @return [string] the command output
+    def run_command(*args)
+      raise 'LibreOffice already running' if soffice.open?
-  def self.logger
-    @@logger ||= begin
-      logger = Logger.new(STDOUT)
-      logger.level = Logger::ERROR unless ENV["DEBUG"]
-      logger
+      output, status = Open3.capture2e(soffice.path, *args)
+      logger.debug output
+      raise "Command `#{soffice.path} #{args.join(' ')}` failed: #{output}" if status.exitstatus != 0
+      output
     end
-  end
-  # Pretty print the class in console
-  def inspect
-    "<WordToMarkdown path=\"#{@document.path}\">"
-  end
+    # Returns a Cliver::Dependency object representing our soffice dependency
+    #
+    # Attempts to resolve by looking at PATH followed by paths in the PATHS constant
+    #
+    # Methods used internally:
+    #   path    - returns the resolved path. Raises an error if not satisfied
+    #   version - returns the resolved version
+    #   open    - is the dependency currently open/running?
+    # @return Cliver::Dependency instance
+    def soffice
+      @soffice ||= Cliver::Dependency.new('soffice', *soffice_dependency_args)
+    end
-  def to_s
-    document.to_s
-  end
+    # @return Logger instance
+    def logger
+      @logger ||= begin
+        logger = Logger.new(STDOUT)
+        logger.level = Logger::ERROR unless ENV['DEBUG']
+        logger
+      end
+    end
-  private
+    private
-  # Workaround for two upstream bugs:
-  # 1. `soffice.exe --version` on Windows opens a popup and returns a null string when manually closed
-  # 2. Even if the second argument to Cliver is nil, Cliver thinks there's a requirement
-  #    and will shell out to `soffice.exe --version`
-  # In order to support Windows, don't pass *any* version requirement to Cliver
-  def self.soffice_dependency_args
-    args = [:path => PATHS.join(File::PATH_SEPARATOR)]
-    if Gem.win_platform?
-      args
-    else
-      args.unshift SOFFICE_VERSION_REQUIREMENT
+    # Workaround for two upstream bugs:
+    # 1. `soffice.exe --version` on Windows opens a popup and returns a null string when manually closed
+    # 2. Even if the second argument to Cliver is nil, Cliver thinks there's a requirement
+    #    and will shell out to `soffice.exe --version`
+    # In order to support Windows, don't pass *any* version requirement to Cliver
+    def soffice_dependency_args
+      args = [path: PATHS.join(File::PATH_SEPARATOR)]
+      if Gem.win_platform?
+        args
+      else
+        args.unshift SOFFICE_VERSION_REQUIREMENT
+      end
     end
   end
 end
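One behavioral effect of moving the class methods into `class << self` is worth spelling out: in the 1.1.7 layout, the bare `private` keyword did not apply to `def self.soffice_dependency_args`, because `private` only affects instance methods defined after it. Inside `class << self`, it does apply. The sketch below shows the difference with two hypothetical classes (`OldStyle` and `NewStyle`) that are not part of the gem.

```ruby
# Old layout: `private` does not apply to methods defined with `def self.`
class OldStyle
  private

  def self.helper
    :reachable
  end
end

# New layout: inside `class << self`, `private` hides the singleton method
class NewStyle
  class << self
    def call_helper
      helper
    end

    private

    def helper
      :hidden
    end
  end
end

puts OldStyle.helper.inspect      # still publicly callable
puts NewStyle.call_helper.inspect # reachable only from inside the class
begin
  NewStyle.helper
rescue NoMethodError => e
  puts "NoMethodError: #{e.message}"
end
```

The same move also replaces the `@@soffice_dependency` and `@@logger` class variables with instance variables on the singleton (`@soffice`, `@logger`), which are not shared with subclasses.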

data/lib/word-to-markdown/converter.rb CHANGED

@@ -1,18 +1,29 @@
-# encoding: utf-8
+# frozen_string_literal: true
 class WordToMarkdown
   class Converter
     attr_reader :document
-    HEADING_DEPTH = 6 # Number of headings to guess, e.g., h6
-    HEADING_STEP = 100/HEADING_DEPTH
+    # Number of headings to guess, e.g., h6
+    HEADING_DEPTH = 6
+    # Percentile step for each heading
+    HEADING_STEP = 100 / HEADING_DEPTH
+    # Minimum heading size
     MIN_HEADING_SIZE = 20
-    UNICODE_BULLETS = ["○", "o", "●", "\u2022", "\\p{C}"]
+    # Unicode bullets to strip when processing
+    UNICODE_BULLETS = ['○', 'o', '●', "\u2022", '\\p{C}'].freeze
+    # @param document [WordToMarkdown::Document] The document to convert
     def initialize(document)
       @document = document
     end
+    # Convert the document
+    #
+    # Note: this action is destructive!
     def convert!
       # Fonts and headings
       semanticize_font_styles!
@@ -29,22 +40,22 @@ class WordToMarkdown
       remove_numbering_from_list_items!
     end
-    # Returns an array of Nokogiri nodes that are implicit headings
+    # @return [Array<Nokogiri::Node>] Return an array of Nokogiri Nodes that are implicit headings
     def implicit_headings
       @implicit_headings ||= begin
         headings = []
-        @document.tree.css("[style]").each do |element|
+        @document.tree.css('[style]').each do |element|
           headings.push element unless element.font_size.nil? || element.font_size < MIN_HEADING_SIZE
         end
         headings
       end
     end
-    # Returns an array of font-sizes for implicit headings in the document
+    # @return [Array<Integer>] An array of font-sizes for implicit headings in the document
     def font_sizes
       @font_sizes ||= begin
         sizes = []
-        @document.tree.css("[style]").each do |element|
+        @document.tree.css('[style]').each do |element|
           sizes.push element.font_size.round(-1) unless element.font_size.nil?
         end
         sizes.uniq.sort
@@ -53,11 +64,10 @@ class WordToMarkdown
     # Given a Nokogiri node, guess what heading it represents, if any
     #
-    # node - the nokogiri node
-    #
-    # returns the heading tag (e.g., H1), or nil
+    # @param node [Nokogiri::Node] the nokogiri node
+    # @return [String, nil] the heading tag (e.g., H1), or nil
     def guess_heading(node)
-      return nil if node.font_size == nil
+      return nil if node.font_size.nil?
       [*1...HEADING_DEPTH].each do |heading|
         return "h#{heading}" if node.font_size >= h(heading)
       end
@@ -67,51 +77,58 @@ class WordToMarkdown
     # Minimum font size required for a given heading
     # e.g., H(2) would represent the minimum font size of an implicit h2
     #
-    # n - the heading number, e.g., 1, 2
+    # @param num [Integer] the heading number, e.g., 1, 2
     #
-    # returns the minimum font size as an integer
-    def h(n)
-      font_sizes.percentile ((HEADING_DEPTH-1)-n) * HEADING_STEP
+    # @return [Integer] the minimum font size
+    def h(num)
+      font_sizes.percentile(((HEADING_DEPTH - 1) - num) * HEADING_STEP)
     end
+    # Convert span-based font styles to `strong`s and `em`s
     def semanticize_font_styles!
-      @document.tree.css("span").each do |node|
+      @document.tree.css('span').each do |node|
         if node.bold?
-          node.node_name = "strong"
+          node.node_name = 'strong'
         elsif node.italic?
-          node.node_name = "em"
+          node.node_name = 'em'
         end
       end
     end
+    # Remove top-level paragraphs from table cells
     def remove_paragraphs_from_tables!
-      @document.tree.search("td p").each { |node| node.node_name = "span" }
+      @document.tree.search('td p').each { |node| node.node_name = 'span' }
     end
+    # Remove top-level paragraphs from list items
     def remove_paragraphs_from_list_items!
-      @document.tree.search("li p").each { |node| node.node_name = "span" }
+      @document.tree.search('li p').each { |node| node.node_name = 'span' }
     end
+    # Remove prepended unicode bullets from list items
     def remove_unicode_bullets_from_list_items!
-      path = WordToMarkdown.soffice.major_version == "5" ? "li span span" : "li span"
+      path = WordToMarkdown.soffice.major_version == '5' ? 'li span span' : 'li span'
       @document.tree.search(path).each do |span|
-        span.inner_html = span.inner_html.gsub /^([#{UNICODE_BULLETS.join("")}]+)/, ""
+        span.inner_html = span.inner_html.gsub(/^([#{UNICODE_BULLETS.join("")}]+)/, '')
       end
     end
+    # Remove prepended numbers from list items
     def remove_numbering_from_list_items!
-      path = WordToMarkdown.soffice.major_version == "5" ? "li span span" : "li span"
+      path = WordToMarkdown.soffice.major_version == '5' ? 'li span span' : 'li span'
       @document.tree.search(path).each do |span|
-        span.inner_html = span.inner_html.gsub /^[a-zA-Z0-9]+\./m, ""
+        span.inner_html = span.inner_html.gsub(/^[a-zA-Z0-9]+\./m, '')
       end
     end
+    # Remove whitespace from list items
     def remove_whitespace_from_list_items!
-      @document.tree.search("li span").each { |span| span.inner_html.strip! }
+      @document.tree.search('li span').each { |span| span.inner_html.strip! }
     end
+    # Convert table headers to `th`s
     def semanticize_table_headers!
-      @document.tree.search("table tr:first td").each { |node| node.node_name = "th" }
+      @document.tree.search('table tr:first td').each { |node| node.node_name = 'th' }
     end
     # Try to guess heading where implicit based on font size
@@ -121,6 +138,5 @@ class WordToMarkdown
         element.node_name = heading unless heading.nil?
       end
     end
   end
 end
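The heading thresholds in `h(num)` are easier to follow with the numbers written out. The sketch below reuses the two constants from the diff and computes only the percentile rank passed to `font_sizes.percentile`; it does not reimplement the `descriptive_statistics` percentile itself. The method name `heading_percentile` and the constant `LEVELS` are hypothetical.

```ruby
HEADING_DEPTH = 6
HEADING_STEP = 100 / HEADING_DEPTH # integer division, so 16 rather than 16.67

# Percentile rank used as the font-size threshold for heading level `num`
def heading_percentile(num)
  ((HEADING_DEPTH - 1) - num) * HEADING_STEP
end

# Levels tried by guess_heading: the exclusive range 1...HEADING_DEPTH stops at 5
LEVELS = [*1...HEADING_DEPTH]

LEVELS.each do |num|
  puts "h#{num}: font size >= percentile #{heading_percentile(num)} of the document's font sizes"
end
```

Under these constants the ranks run from 64 for `h1` down to 0 for `h5`, so an `h6` is never guessed by this loop.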

data/lib/word-to-markdown/document.rb CHANGED

@@ -1,50 +1,54 @@
-# encoding: utf-8
+# frozen_string_literal: true
 class WordToMarkdown
   class Document
     class NotFoundError < StandardError; end
-    class ConverstionError < StandardError; end
+    class ConversionError < StandardError; end
-    attr_reader :path, :raw_html, :tmpdir
+    attr_reader :path, :tmpdir
+    # @param path [string] Path to the Word document
+    # @param tmpdir [string] Path to a working directory to use
     def initialize(path, tmpdir = nil)
       @path = File.expand_path path, Dir.pwd
       @tmpdir = tmpdir || Dir.mktmpdir
       raise NotFoundError, "File #{@path} does not exist" unless File.exist?(@path)
     end
+    # @return [String] the document's extension
     def extension
       File.extname path
     end
+    # @return [Nokogiri::Document]
     def tree
       @tree ||= begin
         tree = Nokogiri::HTML(normalized_html)
-        tree.css("title").remove
+        tree.css('title').remove
         tree
       end
     end
-    # Returns the html representation of the document
+    # @return [String] the html representation of the document
     def html
-      tree.to_html.gsub("</li>\n", "</li>")
+      tree.to_html.gsub("</li>\n", '</li>')
     end
-    # Returns the markdown representation of the document
-    def to_s
+    # @return [String] the markdown representation of the document
+    def markdown
       @markdown ||= scrub_whitespace(ReverseMarkdown.convert(html, WordToMarkdown::REVERSE_MARKDOWN_OPTIONS))
     end
+    alias to_s markdown
     # Determine the document encoding
     #
-    # html - the raw html export
-    #
-    # Returns the encoding, defaulting to "UTF-8"
+    # @return [String] the encoding, defaulting to "UTF-8"
     def encoding
-      match = raw_html.encode("UTF-8", :invalid => :replace, :replace => "").match(/charset=([^\"]+)/)
+      match = raw_html.encode('UTF-8', invalid: :replace, replace: '').match(/charset=([^\"]+)/)
       if match
-        match[1].sub("macintosh", "MacRoman")
+        match[1].sub('macintosh', 'MacRoman')
       else
-        "UTF-8"
+        'UTF-8'
       end
     end
@@ -52,55 +56,57 @@ class WordToMarkdown
     # Perform pre-processing normalization
     #
-    # html - the raw html input from the export
-    #
-    # Returns the normalized html
+    # @return [String] the normalized html
     def normalized_html
-      html = raw_html.force_encoding(encoding)
-      html = html.encode("UTF-8", :invalid => :replace, :replace => "")
-      html = Premailer.new(html, :with_html_string => true, :input_encoding => "UTF-8").to_inline_css
-      html.gsub! /\n|\r/," "         # Remove linebreaks
-      html.gsub! /“|”/, '"'          # Straighten curly double quotes
-      html.gsub! /‘|’/, "'"          # Straighten curly single quotes
-      html.gsub! />\s+</, "><"       # Remove extra whitespace between tags
+      html = raw_html.dup.force_encoding(encoding)
+      html = html.encode('UTF-8', invalid: :replace, replace: '')
+      html = Premailer.new(html, with_html_string: true, input_encoding: 'UTF-8').to_inline_css
+      html.gsub!(/\n|\r/, ' ')  # Remove linebreaks
+      html.gsub!(/“|”/, '"')    # Straighten curly double quotes
+      html.gsub!(/‘|’/, "'")    # Straighten curly single quotes
+      html.gsub!(/>\s+</, '><') # Remove extra whitespace between tags
       html
     end
     # Perform post-processing normalization of certain Word quirks
     #
-    # string - the markdown representation of the document
+    # @param string [String] the markdown representation of the document
     #
-    # Returns the normalized markdown
+    # @return [String] the normalized markdown
     def scrub_whitespace(string)
-      string.gsub!("&nbsp;", " ")                     # HTML encoded spaces
-      string.sub!(/\A[[:space:]]+/,'')                # document leading whitespace
-      string.sub!(/[[:space:]]+\z/,'')                # document trailing whitespace
-      string.gsub!(/([ ]+)$/, '')                     # line trailing whitespace
-      string.gsub!(/\n\n\n\n/,"\n\n")                 # Quadruple line breaks
-      string.gsub!(/\u00A0/, "")                      # Unicode non-breaking spaces, injected as tabs
+      string = string.dup
+      string.gsub!('&nbsp;', ' ')       # HTML encoded spaces
+      string.sub!(/\A[[:space:]]+/, '') # document leading whitespace
+      string.sub!(/[[:space:]]+\z/, '') # document trailing whitespace
+      string.gsub!(/([ ]+)$/, '')       # line trailing whitespace
+      string.gsub!(/\n\n\n\n/, "\n\n")  # Quadruple line breaks
+      string.delete!("\u00A0")          # Unicode non-breaking spaces, injected as tabs
       string
     end
+    # @return [String] the path to the intermediary HTML document
     def dest_path
-      dest_filename = File.basename(path).gsub(/#{Regexp.escape(extension)}$/, ".html")
+      dest_filename = File.basename(path).gsub(/#{Regexp.escape(extension)}$/, '.html')
       File.expand_path(dest_filename, tmpdir)
     end
+    # @return [String] the unnormalized HTML representation
     def raw_html
       @raw_html ||= begin
-        WordToMarkdown::run_command '--headless', '--convert-to', filter, path, '--outdir', tmpdir
-        raise ConverstionError, "Failed to convert #{path}" unless File.exists?(dest_path)
+        WordToMarkdown.run_command '--headless', '--convert-to', filter, path, '--outdir', tmpdir
+        raise ConversionError, "Failed to convert #{path}" unless File.exist?(dest_path)
         html = File.read dest_path
         File.delete dest_path
         html
       end
     end
+    # @return [String] the LibreOffice filter to use for conversion
     def filter
-      if WordToMarkdown.soffice.major_version == "5"
-        "html:XHTML Writer File:UTF8"
+      if WordToMarkdown.soffice.major_version == '5'
+        'html:XHTML Writer File:UTF8'
       else
-        "html"
+        'html'
       end
     end
   end
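The added `string = string.dup` and `raw_html.dup` calls matter once string literals are frozen: bang methods such as `gsub!` raise on a frozen receiver, and they would otherwise mutate the caller's object. The sketch below copies the body of `scrub_whitespace` into a standalone method to show that a frozen input is handled and left untouched. It assumes the character passed to `delete!` is a non-breaking space (U+00A0), written here as an escape; the sample input is made up.

```ruby
# Standalone copy of the post-processing steps shown in the diff.
def scrub_whitespace(string)
  string = string.dup               # work on an unfrozen copy
  string.gsub!('&nbsp;', ' ')       # HTML encoded spaces
  string.sub!(/\A[[:space:]]+/, '') # document leading whitespace
  string.sub!(/[[:space:]]+\z/, '') # document trailing whitespace
  string.gsub!(/([ ]+)$/, '')       # line trailing whitespace
  string.gsub!(/\n\n\n\n/, "\n\n")  # quadruple line breaks
  string.delete!("\u00A0")          # assumed: Unicode non-breaking spaces
  string
end

SAMPLE = "  # Title&nbsp;here  \n\n\n\nBody\u00A0text \n".freeze

puts scrub_whitespace(SAMPLE).inspect
puts SAMPLE.frozen? # the caller's string is still frozen and unchanged
```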

data/lib/word-to-markdown/version.rb CHANGED

@@ -1,3 +1,5 @@
+# frozen_string_literal: true
 class WordToMarkdown
-  VERSION = "1.1.7"
+  VERSION = '1.1.8'.freeze
 end

metadata CHANGED

@@ -1,29 +1,29 @@
 --- !ruby/object:Gem::Specification
 name: word-to-markdown
 version: !ruby/object:Gem::Version
-  version: 1.1.7
+  version: 1.1.8
 platform: ruby
 authors:
 - Ben Balter
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2016-01-04 00:00:00.000000000 Z
+date: 2018-08-01 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
-  name: reverse_markdown
+  name: cliver
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '0.6'
+        version: '0.3'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '0.6'
+        version: '0.3'
 - !ruby/object:Gem::Dependency
   name: descriptive_statistics
   requirement: !ruby/object:Gem::Requirement
@@ -39,103 +39,103 @@ dependencies:
       - !ruby/object:Gem::Version
         version: '2.5'
 - !ruby/object:Gem::Dependency
-  name: premailer
+  name: nokogiri-styles
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.8'
+        version: '0.1'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.8'
+        version: '0.1'
 - !ruby/object:Gem::Dependency
-  name: nokogiri-styles
+  name: premailer
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '0.1'
+        version: '1.8'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '0.1'
+        version: '1.8'
 - !ruby/object:Gem::Dependency
-  name: sys-proctable
+  name: reverse_markdown
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '0.9'
+        version: '1.0'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '0.9'
+        version: '1.0'
 - !ruby/object:Gem::Dependency
-  name: cliver
+  name: sys-proctable
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '0.3'
+        version: '1.0'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '0.3'
+        version: '1.0'
 - !ruby/object:Gem::Dependency
-  name: rake
+  name: bundler
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '10.4'
+        version: '1.6'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '10.4'
+        version: '1.6'
 - !ruby/object:Gem::Dependency
-  name: shoulda
+  name: minitest
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '3.5'
+        version: '5.0'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '3.5'
+        version: '5.0'
 - !ruby/object:Gem::Dependency
-  name: bundler
+  name: mocha
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.6'
+        version: '1.1'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.6'
+        version: '1.1'
 - !ruby/object:Gem::Dependency
   name: pry
   requirement: !ruby/object:Gem::Requirement
@@ -151,33 +151,47 @@ dependencies:
       - !ruby/object:Gem::Version
         version: '0.10'
 - !ruby/object:Gem::Dependency
-  name: mocha
+  name: rake
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.1'
+        version: '10.4'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.1'
+        version: '10.4'
 - !ruby/object:Gem::Dependency
-  name: minitest
+  name: rubocop
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '5.0'
+        version: '0.49'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '5.0'
+        version: '0.49'
+- !ruby/object:Gem::Dependency
+  name: shoulda
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.5'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.5'
 description: Ruby Gem to convert Word documents to markdown.
 email: ben.balter@github.com
 executables:
@@ -185,6 +199,8 @@ executables:
 extensions: []
 extra_rdoc_files: []
 files:
+- LICENSE.md
+- README.md
 - bin/w2m
 - lib/cliver/dependency_ext.rb
 - lib/nokogiri/xml/element.rb
@@ -212,7 +228,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 2.5.1
+rubygems_version: 2.7.6
 signing_key:
 specification_version: 4
 summary: Ruby Gem to convert Word documents to markdown
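Because the dependency list is re-sorted alphabetically in 1.1.8, the hunks above pair unrelated names. Read by name, the runtime constraints that actually change are `reverse_markdown` (`~> 0.6` to `~> 1.0`) and `sys-proctable` (`~> 0.9` to `~> 1.0`), and `rubocop` (`~> 0.49`) is new as a development dependency. As a small sketch of what the pessimistic operator accepts, the snippet below uses RubyGems' own `Gem::Requirement`; the helper `accepts?` and the sample version numbers are illustrative and not taken from any release list.

```ruby
require 'rubygems'

OLD_REQ = Gem::Requirement.new('~> 0.9') # means >= 0.9 and < 1.0
NEW_REQ = Gem::Requirement.new('~> 1.0') # means >= 1.0 and < 2.0

# Hypothetical helper: does `version` satisfy `requirement`?
def accepts?(requirement, version)
  requirement.satisfied_by?(Gem::Version.new(version))
end

%w[0.9.13 1.0.0 1.4.2 2.0.0].each do |version|
  puts "#{version}: old=#{accepts?(OLD_REQ, version)} new=#{accepts?(NEW_REQ, version)}"
end
```

The two ranges do not overlap, so a lockfile pinned to a 0.9.x release would need to be updated when moving to 1.1.8.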