word-to-markdown 1.1.7 → 1.1.8

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
- SHA1:
3
- metadata.gz: 9d852a4ac53478b17c5336883ec861ea52bfac83
4
- data.tar.gz: 4505d3165e6e167a5908cf277d9c28d42b8b66e5
2
+ SHA256:
3
+ metadata.gz: 3febb4398acdc4eacedcc62e09f4beeaee625858043a27e6df7e597fee1e0d17
4
+ data.tar.gz: 7a76057aeca2db8f321282bc309a835f282ea921343b28c8bce6d83cd0fc4582
5
5
  SHA512:
6
- metadata.gz: b14cf7b9341a1f779b0943c050822bd8c2c358851e99c0e712722f6090f8d7d8ca3f2e1c79ded4663b51fca3706c8f490e1bcdda99d77d85e7aa0c0895615597
7
- data.tar.gz: d3ea8bb028083f5e9752f5989700c98de481e654e2a8188efc7d9741049605036c36b26fd17df6ef3d9aca2ef7ef358c736e589af980dea98b12be8f08025868
6
+ metadata.gz: ee2340688c2d5f3f21c7e47f85220bebc88201e28200a682146041fa7f5a47e89c56d687b4f6565a95d7d3e7f4c70fb1551894ad297620e7bd49cef520975e18
7
+ data.tar.gz: 44d405387990ee9a09cb33572a1d7843461c0817789c92a08df6d3542082ffc837bbd39d544a2e6a734a0772c67da3afed77049dc96b0ace0d562473515816e1
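The checksum change above replaces the SHA1 entries with SHA256 and keeps SHA512. As a minimal sketch of how such digests could be checked with Ruby's standard `Digest` library: the helper name `checksums_match?` and the throwaway payload are assumptions for illustration, not part of the gem or of RubyGems.

```ruby
require 'digest'
require 'tempfile'

# Supported algorithms, keyed by the names used in checksums.yaml.
DIGESTS = { 'SHA256' => Digest::SHA256, 'SHA512' => Digest::SHA512 }.freeze

# Hypothetical helper: true when every expected hex digest matches the file.
# `expected` maps an algorithm name to a hex digest string.
def checksums_match?(path, expected)
  expected.all? do |algorithm, hex|
    DIGESTS.fetch(algorithm).file(path).hexdigest == hex
  end
end

# Demonstrated on a temporary file rather than a real gem archive.
Tempfile.create('payload') do |file|
  file.write('example payload')
  file.flush
  expected = {
    'SHA256' => Digest::SHA256.hexdigest('example payload'),
    'SHA512' => Digest::SHA512.hexdigest('example payload')
  }
  puts checksums_match?(file.path, expected)
end
```

To check a real archive, the expected values would be the hex strings listed for `metadata.gz` and `data.tar.gz`.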
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2014, Ben Balter
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,78 @@
1
+ # Word to Markdown converter
2
+
3
+ A Ruby gem to liberate content from [the jail that is Word documents](http://ben.balter.com/2012/10/19/we-ve-been-trained-to-make-paper/#jailbreaking-content)
4
+
5
+ [![Build Status](https://travis-ci.org/benbalter/word-to-markdown.svg?branch=master)](https://travis-ci.org/benbalter/word-to-markdown) [![Gem Version](https://badge.fury.io/rb/word-to-markdown.png)](http://badge.fury.io/rb/word-to-markdown) [![Inline docs](http://inch-ci.org/github/benbalter/word-to-markdown.png)](http://inch-ci.org/github/benbalter/word-to-markdown) [![Build status](https://ci.appveyor.com/api/projects/status/x2gnsfvli3q47a2e/branch/master?svg=true)](https://ci.appveyor.com/project/benbalter/word-to-markdown/branch/master)
6
+
7
+ ## The problem
8
+
9
+ > Our default content publishing workflow is terribly broken. [We've all been trained to make paper](http://ben.balter.com/2012/10/19/we-ve-been-trained-to-make-paper/), yet today, content authored once is more commonly consumed in multiple formats, and rarely, if ever, does it embody physical form. Put another way, our go-to content authoring workflow remains relatively unchanged since it was conceived in the early 80s.
10
+ >
11
+ > I'm asked regularly by government employees — knowledge workers who fire up a desktop word processor as the first step to any project — for an automated pipeline to convert Microsoft Word documents to [Markdown](http://guides.github.com/overviews/mastering-markdown/), the *lingua franca* of the internet, but as my recent foray into building [just such a converter](http://word-to-markdown.herokuapp.com/) proves, it's not that simple.
12
+ >
13
+ > Markdown isn't just an alternative format. Markdown forces you to write for the web.
14
+
15
+ **[Read more](http://ben.balter.com/2014/03/31/word-versus-markdown-more-than-mere-semantics/)**
16
+
17
+ **[Demo](http://word-to-markdown.herokuapp.com/)**
18
+
19
+ ## Install
20
+
21
+ You'll need to install [LibreOffice](http://www.libreoffice.org/). Then:
22
+
23
+ ```bash
24
+ gem install word-to-markdown
25
+ ```
26
+
27
+ ## Usage
28
+
29
+ ```ruby
30
+ file = WordToMarkdown.new("/path/to/document.docx")
31
+ => <WordToMarkdown path="/path/to/document.docx">
32
+
33
+ file.to_s
34
+ => "# Test\n\n This is a test"
35
+
36
+ file.document.tree
37
+ => <Nokogiri Document>
38
+ ```
39
+
40
+ ### Command line usage
41
+
42
+ Once you've installed the gem, it's just:
43
+
44
+ ```
45
+ $ w2m path/to/document.docx
46
+ ```
47
+
48
+ *Outputs the resulting markdown to stdout*
49
+
50
+ ## Supports
51
+
52
+ * Paragraphs
53
+ * Numbered lists
54
+ * Unnumbered lists
55
+ * Nested lists
56
+ * Italic
57
+ * Bold
58
+ * Explicit headings (e.g., selected as "Heading 1" or "Heading 2")
59
+ * Implicit headings (e.g., text with a larger font size relative to paragraph text)
60
+ * Images
61
+ * Tables
62
+ * Hyperlinks
63
+
64
+ ## Requirements and configuration
65
+
66
+ Word-to-markdown requires `soffice`, a command line interface to LibreOffice that works on Linux, Mac, and Windows. To install soffice, see [the LibreOffice documentation](https://www.libreoffice.org/get-help/install-howto/).
67
+
68
+ ## Testing
69
+
70
+ ```
71
+ script/cibuild
72
+ ```
73
+
74
+ ## Server
75
+
76
+ [Word-to-markdown-demo](https://github.com/benbalter/word-to-markdown-demo) contains a lightweight server for converting Word Documents as a service.
77
+
78
+ A live version runs at [word-to-markdown.herokuapp.com](http://word-to-markdown.herokuapp.com).
data/bin/w2m CHANGED
@@ -1,13 +1,14 @@
1
1
  #!/usr/bin/env ruby
2
+ # frozen_string_literal: true
2
3
 
3
4
  require 'word-to-markdown'
4
5
 
5
- if ARGV.size != 1 || ARGV[0] == "--help"
6
- puts "Usage: bundle exec w2m path/to/document.docx"
6
+ if ARGV.size != 1 || ARGV[0] == '--help'
7
+ puts 'Usage: bundle exec w2m path/to/document.docx'
7
8
  exit 1
8
9
  end
9
10
 
10
- if ARGV[0] == "--version"
11
+ if ARGV[0] == '--version'
11
12
  puts "WordToMarkdown v#{WordToMarkdown::VERSION}"
12
13
  puts "LibreOffice v#{WordToMarkdown.soffice.version}" unless Gem.win_platform?
13
14
  else
@@ -1,16 +1,18 @@
1
+ # frozen_string_literal: true
2
+
1
3
  require 'sys/proctable'
2
4
 
3
5
  module Cliver
4
6
  class Dependency
5
-
6
7
  include Sys
7
8
 
8
9
  # Memoized shortcut for detect
9
10
  # Returns the path to the detected dependency
10
11
  # Raises an error if the dependency was not satisfied
11
- def path
12
+ def detected_path
12
13
  @detected_path ||= detect!
13
14
  end
15
+ alias path detected_path
14
16
 
15
17
  # Is the detected dependency currently open?
16
18
  def open?
@@ -24,12 +26,12 @@ module Cliver
24
26
  def version
25
27
  return @detected_version if defined? @detected_version
26
28
  return if Gem.win_platform?
27
- version = installed_versions.find { |p, v| p == path }
29
+ version = installed_versions.find { |p, _v| p == path }
28
30
  @detected_version = version.nil? ? nil : version[1]
29
31
  end
30
32
 
31
33
  def major_version
32
- version.split(".").first if version
34
+ version.split('.').first if version
33
35
  end
34
36
  end
35
37
  end
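The `version` method above memoizes with `defined?` rather than `||=`, which matters when the cached result can legitimately be `nil` (no matching install found). A rough sketch of the difference follows; the `VersionProbe` class, its `lookups` counter, and the sample data are all hypothetical and do not come from Cliver.

```ruby
# Hypothetical stand-in for a dependency whose version lookup may return nil.
class VersionProbe
  attr_reader :lookups

  # installed: array of [path, version] pairs (made-up data for illustration)
  def initialize(path, installed)
    @path = path
    @installed = installed
    @lookups = 0
  end

  # Memoize even when the answer is nil: `@version ||= ...` would redo the
  # lookup on every call, because nil is falsy.
  def version
    return @version if defined?(@version)

    @lookups += 1
    pair = @installed.find { |p, _v| p == @path }
    @version = pair.nil? ? nil : pair[1]
  end
end

probe = VersionProbe.new('/opt/soffice', [['/usr/bin/soffice', '5.1.0']])
2.times { probe.version }
puts probe.lookups
```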
@@ -1,7 +1,8 @@
1
+ # frozen_string_literal: true
2
+
1
3
  module Nokogiri
2
4
  module XML
3
5
  class Element
4
-
5
6
  DEFAULT_FONT_SIZE = 12.to_f
6
7
 
7
8
  # The node's font size
@@ -13,11 +14,11 @@ module Nokogiri
13
14
  end
14
15
 
15
16
  def bold?
16
- styles['font-weight'] && styles['font-weight'] == "bold"
17
+ styles['font-weight'] && styles['font-weight'] == 'bold'
17
18
  end
18
19
 
19
20
  def italic?
20
- styles['font-style'] && styles['font-style'] == "italic"
21
+ styles['font-style'] && styles['font-style'] == 'italic'
21
22
  end
22
23
  end
23
24
  end
@@ -1,3 +1,5 @@
1
+ # frozen_string_literal: true
2
+
1
3
  require 'descriptive_statistics'
2
4
  require 'reverse_markdown'
3
5
  require 'nokogiri-styles'
@@ -16,86 +18,93 @@ require_relative 'nokogiri/xml/element'
16
18
  require_relative 'cliver/dependency_ext'
17
19
 
18
20
  class WordToMarkdown
19
-
20
21
  attr_reader :document, :converter
21
22
 
23
+ # Options to be passed to Reverse Markdown
22
24
  REVERSE_MARKDOWN_OPTIONS = {
23
- unknown_tags: :bypass,
25
+ unknown_tags: :bypass,
24
26
  github_flavored: true
25
- }
27
+ }.freeze
26
28
 
27
- SOFFICE_VERSION_REQUIREMENT = '> 4.0'
29
+ # Minimum version of LibreOffice Required
30
+ SOFFICE_VERSION_REQUIREMENT = '> 4.0'.freeze
28
31
 
32
+ # Paths to look for LibreOffice, in order of preference
29
33
  PATHS = [
30
- "*", # Sub'd for ENV["PATH"]
31
- "~/Applications/LibreOffice.app/Contents/MacOS",
32
- "/Applications/LibreOffice.app/Contents/MacOS",
33
- "/Program Files/LibreOffice 5/program",
34
- "/Program Files (x86)/LibreOffice 4/program"
35
- ]
34
+ '*', # Sub'd for ENV["PATH"]
35
+ '~/Applications/LibreOffice.app/Contents/MacOS',
36
+ '/Applications/LibreOffice.app/Contents/MacOS',
37
+ '/Program Files/LibreOffice 5/program',
38
+ '/Program Files (x86)/LibreOffice 4/program'
39
+ ].freeze
36
40
 
37
41
  # Create a new WordToMarkdown object
38
42
  #
39
- # input - a HTML string or path to an HTML file
40
- #
41
- # Returns the WordToMarkdown object
43
+ # @param path [string] Path to the Word document
44
+ # @param tmpdir [string] Path to a working directory to use
45
+ # @return [WordToMarkdown] WordToMarkdown object with the converted document
42
46
  def initialize(path, tmpdir = nil)
43
47
  @document = WordToMarkdown::Document.new path, tmpdir
44
48
  @converter = WordToMarkdown::Converter.new @document
45
49
  converter.convert!
46
50
  end
47
51
 
48
- def self.run_command(*args)
49
- raise "LibreOffice already running" if soffice.open?
50
-
51
- output, status = Open3.capture2e(soffice.path, *args)
52
- logger.debug output
53
- raise "Command `#{soffice.path} #{args.join(" ")}` failed: #{output}" if status.exitstatus != 0
54
- output
52
+ # Helper method to return the document body, as markdown
53
+ # @return [string] the document body, as markdown
54
+ def to_s
55
+ document.to_s
55
56
  end
56
57
 
57
- # Returns a Cliver::Dependency object representing our soffice dependency
58
- #
59
- # Attempts to resolve by looking at PATH followed by paths in the PATHS constant
60
- #
61
- # Methods used internally:
62
- # path - returns the resolved path. Raises an error if not satisfied
63
- # version - returns the resolved version
64
- # open - is the dependency currently open/running?
65
- def self.soffice
66
- @@soffice_dependency ||= Cliver::Dependency.new("soffice", *soffice_dependency_args)
67
- end
58
+ class << self
59
+ # Run an soffice command
60
+ #
61
+ # @param args [string] one or more arguments to pass to the soffice command
62
+ # @return [string] the command output
63
+ def run_command(*args)
64
+ raise 'LibreOffice already running' if soffice.open?
68
65
 
69
- def self.logger
70
- @@logger ||= begin
71
- logger = Logger.new(STDOUT)
72
- logger.level = Logger::ERROR unless ENV["DEBUG"]
73
- logger
66
+ output, status = Open3.capture2e(soffice.path, *args)
67
+ logger.debug output
68
+ raise "Command `#{soffice.path} #{args.join(' ')}` failed: #{output}" if status.exitstatus != 0
69
+ output
74
70
  end
75
- end
76
71
 
77
- # Pretty print the class in console
78
- def inspect
79
- "<WordToMarkdown path=\"#{@document.path}\">"
80
- end
72
+ # Returns a Cliver::Dependency object representing our soffice dependency
73
+ #
74
+ # Attempts to resolve by looking at PATH followed by paths in the PATHS constant
75
+ #
76
+ # Methods used internally:
77
+ # path - returns the resolved path. Raises an error if not satisfied
78
+ # version - returns the resolved version
79
+ # open - is the dependency currently open/running?
80
+ # @return Cliver::Dependency instance
81
+ def soffice
82
+ @soffice ||= Cliver::Dependency.new('soffice', *soffice_dependency_args)
83
+ end
81
84
 
82
- def to_s
83
- document.to_s
84
- end
85
+ # @return Logger instance
86
+ def logger
87
+ @logger ||= begin
88
+ logger = Logger.new(STDOUT)
89
+ logger.level = Logger::ERROR unless ENV['DEBUG']
90
+ logger
91
+ end
92
+ end
85
93
 
86
- private
94
+ private
87
95
 
88
- # Workaround for two upstream bugs:
89
- # 1. `soffice.exe --version` on windows opens a popup and retuns a null string when manually closed
90
- # 2. Even if the second argument to Cliver is nil, Cliver thinks there's a requirement
91
- # and will shell out to `soffice.exe --version`
92
- # In order to support Windows, don't pass *any* version requirement to Cliver
93
- def self.soffice_dependency_args
94
- args = [:path => PATHS.join(File::PATH_SEPARATOR)]
95
- if Gem.win_platform?
96
- args
97
- else
98
- args.unshift SOFFICE_VERSION_REQUIREMENT
96
+ # Workaround for two upstream bugs:
97
+ # 1. `soffice.exe --version` on Windows opens a popup and returns a null string when manually closed
98
+ # 2. Even if the second argument to Cliver is nil, Cliver thinks there's a requirement
99
+ # and will shell out to `soffice.exe --version`
100
+ # In order to support Windows, don't pass *any* version requirement to Cliver
101
+ def soffice_dependency_args
102
+ args = [path: PATHS.join(File::PATH_SEPARATOR)]
103
+ if Gem.win_platform?
104
+ args
105
+ else
106
+ args.unshift SOFFICE_VERSION_REQUIREMENT
107
+ end
99
108
  end
100
109
  end
101
110
  end
@@ -1,18 +1,29 @@
1
- # encoding: utf-8
1
+ # frozen_string_literal: true
2
+
2
3
  class WordToMarkdown
3
4
  class Converter
4
-
5
5
  attr_reader :document
6
6
 
7
- HEADING_DEPTH = 6 # Number of headings to guess, e.g., h6
8
- HEADING_STEP = 100/HEADING_DEPTH
7
+ # Number of headings to guess, e.g., h6
8
+ HEADING_DEPTH = 6
9
+
10
+ # Percentile step for each heading
11
+ HEADING_STEP = 100 / HEADING_DEPTH
12
+
13
+ # Minimum heading size
9
14
  MIN_HEADING_SIZE = 20
10
- UNICODE_BULLETS = ["○", "o", "●", "\u2022", "\\p{C}"]
11
15
 
16
+ # Unicode bullets to strip when processing
17
+ UNICODE_BULLETS = ['○', 'o', '●', "\u2022", '\\p{C}'].freeze
18
+
19
+ # @param document [WordToMarkdown::Document] The document to convert
12
20
  def initialize(document)
13
21
  @document = document
14
22
  end
15
23
 
24
+ # Convert the document
25
+ #
26
+ # Note: this action is destructive!
16
27
  def convert!
17
28
  # Fonts and headings
18
29
  semanticize_font_styles!
@@ -29,22 +40,22 @@ class WordToMarkdown
29
40
  remove_numbering_from_list_items!
30
41
  end
31
42
 
32
- # Returns an array of Nokogiri nodes that are implicit headings
43
+ # @return [Array<Nokogiri::Node>] Return an array of Nokogiri Nodes that are implicit headings
33
44
  def implicit_headings
34
45
  @implicit_headings ||= begin
35
46
  headings = []
36
- @document.tree.css("[style]").each do |element|
47
+ @document.tree.css('[style]').each do |element|
37
48
  headings.push element unless element.font_size.nil? || element.font_size < MIN_HEADING_SIZE
38
49
  end
39
50
  headings
40
51
  end
41
52
  end
42
53
 
43
- # Returns an array of font-sizes for implicit headings in the document
54
+ # @return [Array<Integer>] An array of font-sizes for implicit headings in the document
44
55
  def font_sizes
45
56
  @font_sizes ||= begin
46
57
  sizes = []
47
- @document.tree.css("[style]").each do |element|
58
+ @document.tree.css('[style]').each do |element|
48
59
  sizes.push element.font_size.round(-1) unless element.font_size.nil?
49
60
  end
50
61
  sizes.uniq.sort
@@ -53,11 +64,10 @@ class WordToMarkdown
53
64
 
54
65
  # Given a Nokogiri node, guess what heading it represents, if any
55
66
  #
56
- # node - the nokigiri node
57
- #
58
- # retuns the heading tag (e.g., H1), or nil
67
+ # @param node [Nokogiri::Node] the Nokogiri node
68
+ # @return [String, nil] the heading tag (e.g., H1), or nil
59
69
  def guess_heading(node)
60
- return nil if node.font_size == nil
70
+ return nil if node.font_size.nil?
61
71
  [*1...HEADING_DEPTH].each do |heading|
62
72
  return "h#{heading}" if node.font_size >= h(heading)
63
73
  end
@@ -67,51 +77,58 @@ class WordToMarkdown
67
77
  # Minimum font size required for a given heading
68
78
  # e.g., H(2) would represent the minimum font size of an implicit h2
69
79
  #
70
- # n - the heading number, e.g., 1, 2
80
+ # @param num [Integer] the heading number, e.g., 1, 2
71
81
  #
72
- # returns the minimum font size as an integer
73
- def h(n)
74
- font_sizes.percentile ((HEADING_DEPTH-1)-n) * HEADING_STEP
82
+ # @return [Integer] the minimum font size
83
+ def h(num)
84
+ font_sizes.percentile(((HEADING_DEPTH - 1) - num) * HEADING_STEP)
75
85
  end
76
86
 
87
+ # Convert span-based font styles to `strong`s and `em`s
77
88
  def semanticize_font_styles!
78
- @document.tree.css("span").each do |node|
89
+ @document.tree.css('span').each do |node|
79
90
  if node.bold?
80
- node.node_name = "strong"
91
+ node.node_name = 'strong'
81
92
  elsif node.italic?
82
- node.node_name = "em"
93
+ node.node_name = 'em'
83
94
  end
84
95
  end
85
96
  end
86
97
 
98
+ # Remove top-level paragraphs from table cells
87
99
  def remove_paragraphs_from_tables!
88
- @document.tree.search("td p").each { |node| node.node_name = "span" }
100
+ @document.tree.search('td p').each { |node| node.node_name = 'span' }
89
101
  end
90
102
 
103
+ # Remove top-level paragraphs from list items
91
104
  def remove_paragraphs_from_list_items!
92
- @document.tree.search("li p").each { |node| node.node_name = "span" }
105
+ @document.tree.search('li p').each { |node| node.node_name = 'span' }
93
106
  end
94
107
 
108
+ # Remove prepended unicode bullets from list items
95
109
  def remove_unicode_bullets_from_list_items!
96
- path = WordToMarkdown.soffice.major_version == "5" ? "li span span" : "li span"
110
+ path = WordToMarkdown.soffice.major_version == '5' ? 'li span span' : 'li span'
97
111
  @document.tree.search(path).each do |span|
98
- span.inner_html = span.inner_html.gsub /^([#{UNICODE_BULLETS.join("")}]+)/, ""
112
+ span.inner_html = span.inner_html.gsub(/^([#{UNICODE_BULLETS.join("")}]+)/, '')
99
113
  end
100
114
  end
101
115
 
116
+ # Remove prepended numbers from list items
102
117
  def remove_numbering_from_list_items!
103
- path = WordToMarkdown.soffice.major_version == "5" ? "li span span" : "li span"
118
+ path = WordToMarkdown.soffice.major_version == '5' ? 'li span span' : 'li span'
104
119
  @document.tree.search(path).each do |span|
105
- span.inner_html = span.inner_html.gsub /^[a-zA-Z0-9]+\./m, ""
120
+ span.inner_html = span.inner_html.gsub(/^[a-zA-Z0-9]+\./m, '')
106
121
  end
107
122
  end
108
123
 
124
+ # Remove whitespace from list items
109
125
  def remove_whitespace_from_list_items!
110
- @document.tree.search("li span").each { |span| span.inner_html.strip! }
126
+ @document.tree.search('li span').each { |span| span.inner_html.strip! }
111
127
  end
112
128
 
129
+ # Convert table headers to `th`s
113
130
  def semanticize_table_headers!
114
- @document.tree.search("table tr:first td").each { |node| node.node_name = "th" }
131
+ @document.tree.search('table tr:first td').each { |node| node.node_name = 'th' }
115
132
  end
116
133
 
117
134
  # Try to guess heading where implicit, based on font size
@@ -121,6 +138,5 @@ class WordToMarkdown
121
138
  element.node_name = heading unless heading.nil?
122
139
  end
123
140
  end
124
-
125
141
  end
126
142
  end
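The converter above derives heading thresholds from percentiles of the document's font sizes: `HEADING_STEP` is `100 / 6`, which integer division reduces to 16, so `h(1)` asks for the 64th percentile, `h(2)` the 48th, and so on down to the 0th for `h(5)`. The sketch below mirrors that arithmetic in plain Ruby. Its `percentile` helper is an assumed nearest-rank stand-in, not the `descriptive_statistics` implementation, so exact thresholds may differ from the gem's.

```ruby
HEADING_DEPTH = 6
HEADING_STEP = 100 / HEADING_DEPTH # integer division, so 16

# Assumed stand-in for the gem's percentile call: nearest rank on a sorted array.
def percentile(sorted, pct)
  return nil if sorted.empty?

  rank = (pct / 100.0 * (sorted.length - 1)).round
  sorted[rank]
end

# Minimum font size for heading level `num`, following the formula in `h(num)`.
def threshold(sizes, num)
  percentile(sizes, ((HEADING_DEPTH - 1) - num) * HEADING_STEP)
end

# Returns "h1".."h5" for the first threshold the size meets, else nil.
def guess_heading(sizes, font_size)
  (1...HEADING_DEPTH).each do |heading|
    return "h#{heading}" if font_size >= threshold(sizes, heading)
  end
  nil
end

sizes = [20, 30, 40] # unique, sorted font sizes (made-up sample)
puts guess_heading(sizes, 40).inspect
puts guess_heading(sizes, 20).inspect
puts guess_heading(sizes, 10).inspect
```

In the gem itself, sizes are rounded to the nearest 10 first, and only elements at or above `MIN_HEADING_SIZE` are treated as implicit headings.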
@@ -1,50 +1,54 @@
1
- # encoding: utf-8
1
+ # frozen_string_literal: true
2
+
2
3
  class WordToMarkdown
3
4
  class Document
4
5
  class NotFoundError < StandardError; end
5
- class ConverstionError < StandardError; end
6
+ class ConversionError < StandardError; end
6
7
 
7
- attr_reader :path, :raw_html, :tmpdir
8
+ attr_reader :path, :tmpdir
8
9
 
10
+ # @param path [string] Path to the Word document
11
+ # @param tmpdir [string] Path to a working directory to use
9
12
  def initialize(path, tmpdir = nil)
10
13
  @path = File.expand_path path, Dir.pwd
11
14
  @tmpdir = tmpdir || Dir.mktmpdir
12
15
  raise NotFoundError, "File #{@path} does not exist" unless File.exist?(@path)
13
16
  end
14
17
 
18
+ # @return [String] the document's extension
15
19
  def extension
16
20
  File.extname path
17
21
  end
18
22
 
23
+ # @return [Nokogiri::Document]
19
24
  def tree
20
25
  @tree ||= begin
21
26
  tree = Nokogiri::HTML(normalized_html)
22
- tree.css("title").remove
27
+ tree.css('title').remove
23
28
  tree
24
29
  end
25
30
  end
26
31
 
27
- # Returns the html representation of the document
32
+ # @return [String] the html representation of the document
28
33
  def html
29
- tree.to_html.gsub("</li>\n", "</li>")
34
+ tree.to_html.gsub("</li>\n", '</li>')
30
35
  end
31
36
 
32
- # Returns the markdown representation of the document
33
- def to_s
37
+ # @return [String] the markdown representation of the document
38
+ def markdown
34
39
  @markdown ||= scrub_whitespace(ReverseMarkdown.convert(html, WordToMarkdown::REVERSE_MARKDOWN_OPTIONS))
35
40
  end
41
+ alias to_s markdown
36
42
 
37
43
  # Determine the document encoding
38
44
  #
39
- # html - the raw html export
40
- #
41
- # Returns the encoding, defaulting to "UTF-8"
45
+ # @return [String] the encoding, defaulting to "UTF-8"
42
46
  def encoding
43
- match = raw_html.encode("UTF-8", :invalid => :replace, :replace => "").match(/charset=([^\"]+)/)
47
+ match = raw_html.encode('UTF-8', invalid: :replace, replace: '').match(/charset=([^\"]+)/)
44
48
  if match
45
- match[1].sub("macintosh", "MacRoman")
49
+ match[1].sub('macintosh', 'MacRoman')
46
50
  else
47
- "UTF-8"
51
+ 'UTF-8'
48
52
  end
49
53
  end
50
54
 
@@ -52,55 +56,57 @@ class WordToMarkdown
52
56
 
53
57
  # Perform pre-processing normalization
54
58
  #
55
- # html - the raw html input from the export
56
- #
57
- # Returns the normalized html
59
+ # @return [String] the normalized html
58
60
  def normalized_html
59
- html = raw_html.force_encoding(encoding)
60
- html = html.encode("UTF-8", :invalid => :replace, :replace => "")
61
- html = Premailer.new(html, :with_html_string => true, :input_encoding => "UTF-8").to_inline_css
62
- html.gsub! /\n|\r/," " # Remove linebreaks
63
- html.gsub! /“|”/, '"' # Straighten curly double quotes
64
- html.gsub! /‘|’/, "'" # Straighten curly single quotes
65
- html.gsub! />\s+</, "><" # Remove extra whitespace between tags
61
+ html = raw_html.dup.force_encoding(encoding)
62
+ html = html.encode('UTF-8', invalid: :replace, replace: '')
63
+ html = Premailer.new(html, with_html_string: true, input_encoding: 'UTF-8').to_inline_css
64
+ html.gsub!(/\n|\r/, ' ') # Remove linebreaks
65
+ html.gsub!(/“|”/, '"') # Straighten curly double quotes
66
+ html.gsub!(/‘|’/, "'") # Straighten curly single quotes
67
+ html.gsub!(/>\s+</, '><') # Remove extra whitespace between tags
66
68
  html
67
69
  end
68
70
 
69
71
  # Perform post-processing normalization of certain Word quirks
70
72
  #
71
- # string - the markdown representation of the document
73
+ # @param string [String] the markdown representation of the document
72
74
  #
73
- # Returns the normalized markdown
75
+ # @return [String] the normalized markdown
74
76
  def scrub_whitespace(string)
75
- string.gsub!("&nbsp;", " ") # HTML encoded spaces
76
- string.sub!(/\A[[:space:]]+/,'') # document leading whitespace
77
- string.sub!(/[[:space:]]+\z/,'') # document trailing whitespace
78
- string.gsub!(/([ ]+)$/, '') # line trailing whitespace
79
- string.gsub!(/\n\n\n\n/,"\n\n") # Quadruple line breaks
80
- string.gsub!(/\u00A0/, "") # Unicode non-breaking spaces, injected as tabs
77
+ string = string.dup
78
+ string.gsub!('&nbsp;', ' ') # HTML encoded spaces
79
+ string.sub!(/\A[[:space:]]+/, '') # document leading whitespace
80
+ string.sub!(/[[:space:]]+\z/, '') # document trailing whitespace
81
+ string.gsub!(/([ ]+)$/, '') # line trailing whitespace
82
+ string.gsub!(/\n\n\n\n/, "\n\n") # Quadruple line breaks
83
+ string.delete!("\u00A0") # Unicode non-breaking spaces, injected as tabs
81
84
  string
82
85
  end
83
86
 
87
+ # @return [String] the path to the intermediary HTML document
84
88
  def dest_path
85
- dest_filename = File.basename(path).gsub(/#{Regexp.escape(extension)}$/, ".html")
89
+ dest_filename = File.basename(path).gsub(/#{Regexp.escape(extension)}$/, '.html')
86
90
  File.expand_path(dest_filename, tmpdir)
87
91
  end
88
92
 
93
+ # @return [String] the unnormalized HTML representation
89
94
  def raw_html
90
95
  @raw_html ||= begin
91
- WordToMarkdown::run_command '--headless', '--convert-to', filter, path, '--outdir', tmpdir
92
- raise ConverstionError, "Failed to convert #{path}" unless File.exists?(dest_path)
96
+ WordToMarkdown.run_command '--headless', '--convert-to', filter, path, '--outdir', tmpdir
97
+ raise ConversionError, "Failed to convert #{path}" unless File.exist?(dest_path)
93
98
  html = File.read dest_path
94
99
  File.delete dest_path
95
100
  html
96
101
  end
97
102
  end
98
103
 
104
+ # @return [String] the LibreOffice filter to use for conversion
99
105
  def filter
100
- if WordToMarkdown.soffice.major_version == "5"
101
- "html:XHTML Writer File:UTF8"
106
+ if WordToMarkdown.soffice.major_version == '5'
107
+ 'html:XHTML Writer File:UTF8'
102
108
  else
103
- "html"
109
+ 'html'
104
110
  end
105
111
  end
106
112
  end
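With `frozen_string_literal: true` now set, `scrub_whitespace` works on a `dup` of its argument, so the bang methods no longer mutate (or fail on) a frozen input. A standalone sketch of the same cleanup steps, applied to a made-up sample string, might look like this:

```ruby
# frozen_string_literal: true

# Standalone copy of the post-processing steps, for illustration only.
def scrub_whitespace(markdown)
  text = markdown.dup                  # never mutate the caller's string
  text.gsub!('&nbsp;', ' ')            # HTML encoded spaces
  text.sub!(/\A[[:space:]]+/, '')      # document leading whitespace
  text.sub!(/[[:space:]]+\z/, '')      # document trailing whitespace
  text.gsub!(/[ ]+$/, '')              # line trailing whitespace
  text.gsub!(/\n\n\n\n/, "\n\n")       # quadruple line breaks
  text.delete!("\u00A0")               # Unicode non-breaking spaces
  text
end

sample = "  # Title  \n\n\n\nBody&nbsp;text \n" # frozen literal, made-up input
puts scrub_whitespace(sample).inspect
```

Because the bang methods return `nil` when nothing changes, the method returns `text` explicitly instead of relying on the last call's return value.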
@@ -1,3 +1,5 @@
1
+ # frozen_string_literal: true
2
+
1
3
  class WordToMarkdown
2
- VERSION = "1.1.7"
4
+ VERSION = '1.1.8'.freeze
3
5
  end
metadata CHANGED
@@ -1,29 +1,29 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: word-to-markdown
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.1.7
4
+ version: 1.1.8
5
5
  platform: ruby
6
6
  authors:
7
7
  - Ben Balter
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2016-01-04 00:00:00.000000000 Z
11
+ date: 2018-08-01 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
- name: reverse_markdown
14
+ name: cliver
15
15
  requirement: !ruby/object:Gem::Requirement
16
16
  requirements:
17
17
  - - "~>"
18
18
  - !ruby/object:Gem::Version
19
- version: '0.6'
19
+ version: '0.3'
20
20
  type: :runtime
21
21
  prerelease: false
22
22
  version_requirements: !ruby/object:Gem::Requirement
23
23
  requirements:
24
24
  - - "~>"
25
25
  - !ruby/object:Gem::Version
26
- version: '0.6'
26
+ version: '0.3'
27
27
  - !ruby/object:Gem::Dependency
28
28
  name: descriptive_statistics
29
29
  requirement: !ruby/object:Gem::Requirement
@@ -39,103 +39,103 @@ dependencies:
39
39
  - !ruby/object:Gem::Version
40
40
  version: '2.5'
41
41
  - !ruby/object:Gem::Dependency
42
- name: premailer
42
+ name: nokogiri-styles
43
43
  requirement: !ruby/object:Gem::Requirement
44
44
  requirements:
45
45
  - - "~>"
46
46
  - !ruby/object:Gem::Version
47
- version: '1.8'
47
+ version: '0.1'
48
48
  type: :runtime
49
49
  prerelease: false
50
50
  version_requirements: !ruby/object:Gem::Requirement
51
51
  requirements:
52
52
  - - "~>"
53
53
  - !ruby/object:Gem::Version
54
- version: '1.8'
54
+ version: '0.1'
55
55
  - !ruby/object:Gem::Dependency
56
- name: nokogiri-styles
56
+ name: premailer
57
57
  requirement: !ruby/object:Gem::Requirement
58
58
  requirements:
59
59
  - - "~>"
60
60
  - !ruby/object:Gem::Version
61
- version: '0.1'
61
+ version: '1.8'
62
62
  type: :runtime
63
63
  prerelease: false
64
64
  version_requirements: !ruby/object:Gem::Requirement
65
65
  requirements:
66
66
  - - "~>"
67
67
  - !ruby/object:Gem::Version
68
- version: '0.1'
68
+ version: '1.8'
69
69
  - !ruby/object:Gem::Dependency
70
- name: sys-proctable
70
+ name: reverse_markdown
71
71
  requirement: !ruby/object:Gem::Requirement
72
72
  requirements:
73
73
  - - "~>"
74
74
  - !ruby/object:Gem::Version
75
- version: '0.9'
75
+ version: '1.0'
76
76
  type: :runtime
77
77
  prerelease: false
78
78
  version_requirements: !ruby/object:Gem::Requirement
79
79
  requirements:
80
80
  - - "~>"
81
81
  - !ruby/object:Gem::Version
82
- version: '0.9'
82
+ version: '1.0'
83
83
  - !ruby/object:Gem::Dependency
84
- name: cliver
84
+ name: sys-proctable
85
85
  requirement: !ruby/object:Gem::Requirement
86
86
  requirements:
87
87
  - - "~>"
88
88
  - !ruby/object:Gem::Version
89
- version: '0.3'
89
+ version: '1.0'
90
90
  type: :runtime
91
91
  prerelease: false
92
92
  version_requirements: !ruby/object:Gem::Requirement
93
93
  requirements:
94
94
  - - "~>"
95
95
  - !ruby/object:Gem::Version
96
- version: '0.3'
96
+ version: '1.0'
97
97
  - !ruby/object:Gem::Dependency
98
- name: rake
98
+ name: bundler
99
99
  requirement: !ruby/object:Gem::Requirement
100
100
  requirements:
101
101
  - - "~>"
102
102
  - !ruby/object:Gem::Version
103
- version: '10.4'
103
+ version: '1.6'
104
104
  type: :development
105
105
  prerelease: false
106
106
  version_requirements: !ruby/object:Gem::Requirement
107
107
  requirements:
108
108
  - - "~>"
109
109
  - !ruby/object:Gem::Version
110
- version: '10.4'
110
+ version: '1.6'
111
111
  - !ruby/object:Gem::Dependency
112
- name: shoulda
112
+ name: minitest
113
113
  requirement: !ruby/object:Gem::Requirement
114
114
  requirements:
115
115
  - - "~>"
116
116
  - !ruby/object:Gem::Version
117
- version: '3.5'
117
+ version: '5.0'
118
118
  type: :development
119
119
  prerelease: false
120
120
  version_requirements: !ruby/object:Gem::Requirement
121
121
  requirements:
122
122
  - - "~>"
123
123
  - !ruby/object:Gem::Version
124
- version: '3.5'
124
+ version: '5.0'
125
125
  - !ruby/object:Gem::Dependency
126
- name: bundler
126
+ name: mocha
127
127
  requirement: !ruby/object:Gem::Requirement
128
128
  requirements:
129
129
  - - "~>"
130
130
  - !ruby/object:Gem::Version
131
- version: '1.6'
131
+ version: '1.1'
132
132
  type: :development
133
133
  prerelease: false
134
134
  version_requirements: !ruby/object:Gem::Requirement
135
135
  requirements:
136
136
  - - "~>"
137
137
  - !ruby/object:Gem::Version
138
- version: '1.6'
138
+ version: '1.1'
139
139
  - !ruby/object:Gem::Dependency
140
140
  name: pry
141
141
  requirement: !ruby/object:Gem::Requirement
@@ -151,33 +151,47 @@ dependencies:
151
151
  - !ruby/object:Gem::Version
152
152
  version: '0.10'
153
153
  - !ruby/object:Gem::Dependency
154
- name: mocha
154
+ name: rake
155
155
  requirement: !ruby/object:Gem::Requirement
156
156
  requirements:
157
157
  - - "~>"
158
158
  - !ruby/object:Gem::Version
159
- version: '1.1'
159
+ version: '10.4'
160
160
  type: :development
161
161
  prerelease: false
162
162
  version_requirements: !ruby/object:Gem::Requirement
163
163
  requirements:
164
164
  - - "~>"
165
165
  - !ruby/object:Gem::Version
166
- version: '1.1'
166
+ version: '10.4'
167
167
  - !ruby/object:Gem::Dependency
168
- name: minitest
168
+ name: rubocop
169
169
  requirement: !ruby/object:Gem::Requirement
170
170
  requirements:
171
171
  - - "~>"
172
172
  - !ruby/object:Gem::Version
173
- version: '5.0'
173
+ version: '0.49'
174
174
  type: :development
175
175
  prerelease: false
176
176
  version_requirements: !ruby/object:Gem::Requirement
177
177
  requirements:
178
178
  - - "~>"
179
179
  - !ruby/object:Gem::Version
180
- version: '5.0'
180
+ version: '0.49'
181
+ - !ruby/object:Gem::Dependency
182
+ name: shoulda
183
+ requirement: !ruby/object:Gem::Requirement
184
+ requirements:
185
+ - - "~>"
186
+ - !ruby/object:Gem::Version
187
+ version: '3.5'
188
+ type: :development
189
+ prerelease: false
190
+ version_requirements: !ruby/object:Gem::Requirement
191
+ requirements:
192
+ - - "~>"
193
+ - !ruby/object:Gem::Version
194
+ version: '3.5'
181
195
  description: Ruby Gem to convert Word documents to markdown.
182
196
  email: ben.balter@github.com
183
197
  executables:
@@ -185,6 +199,8 @@ executables:
185
199
  extensions: []
186
200
  extra_rdoc_files: []
187
201
  files:
202
+ - LICENSE.md
203
+ - README.md
188
204
  - bin/w2m
189
205
  - lib/cliver/dependency_ext.rb
190
206
  - lib/nokogiri/xml/element.rb
@@ -212,7 +228,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
212
228
  version: '0'
213
229
  requirements: []
214
230
  rubyforge_project:
215
- rubygems_version: 2.5.1
231
+ rubygems_version: 2.7.6
216
232
  signing_key:
217
233
  specification_version: 4
218
234
  summary: Ruby Gem to convert Word documents to markdown