word-to-markdown 1.1.7 → 1.1.8

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
- SHA1:
-   metadata.gz: 9d852a4ac53478b17c5336883ec861ea52bfac83
-   data.tar.gz: 4505d3165e6e167a5908cf277d9c28d42b8b66e5
+ SHA256:
+   metadata.gz: 3febb4398acdc4eacedcc62e09f4beeaee625858043a27e6df7e597fee1e0d17
+   data.tar.gz: 7a76057aeca2db8f321282bc309a835f282ea921343b28c8bce6d83cd0fc4582
  SHA512:
-   metadata.gz: b14cf7b9341a1f779b0943c050822bd8c2c358851e99c0e712722f6090f8d7d8ca3f2e1c79ded4663b51fca3706c8f490e1bcdda99d77d85e7aa0c0895615597
-   data.tar.gz: d3ea8bb028083f5e9752f5989700c98de481e654e2a8188efc7d9741049605036c36b26fd17df6ef3d9aca2ef7ef358c736e589af980dea98b12be8f08025868
+   metadata.gz: ee2340688c2d5f3f21c7e47f85220bebc88201e28200a682146041fa7f5a47e89c56d687b4f6565a95d7d3e7f4c70fb1551894ad297620e7bd49cef520975e18
+   data.tar.gz: 44d405387990ee9a09cb33572a1d7843461c0817789c92a08df6d3542082ffc837bbd39d544a2e6a734a0772c67da3afed77049dc96b0ace0d562473515816e1
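The checksums hunk replaces the SHA1 digests with SHA256 and keeps SHA512. As a minimal sketch of how a digest like these could be checked with Ruby's standard `digest` library: the helper name `checksum_matches?` and the sample input are assumptions for illustration only, not part of the gem or of RubyGems.

```ruby
require 'digest'

# Hypothetical helper: compare the SHA256 hex digest of some bytes
# against an expected value such as one listed in checksums.yaml.
def checksum_matches?(data, expected_hex)
  Digest::SHA256.hexdigest(data) == expected_hex.downcase
end

# Illustrative input only; a real check would read the bytes of
# metadata.gz or data.tar.gz from the unpacked .gem archive.
sample = 'abc'
expected = 'ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad'
puts checksum_matches?(sample, expected)
```

The same shape works for the SHA512 entries by swapping in `Digest::SHA512`.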
data/LICENSE.md ADDED
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2014, Ben Balter
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,78 @@
1
+ # Word to Markdown converter
2
+
3
+ A Ruby gem to liberate content from [the jail that is Word documents](http://ben.balter.com/2012/10/19/we-ve-been-trained-to-make-paper/#jailbreaking-content)
4
+
5
+ [![Build Status](https://travis-ci.org/benbalter/word-to-markdown.svg?branch=master)](https://travis-ci.org/benbalter/word-to-markdown) [![Gem Version](https://badge.fury.io/rb/word-to-markdown.png)](http://badge.fury.io/rb/word-to-markdown) [![Inline docs](http://inch-ci.org/github/benbalter/word-to-markdown.png)](http://inch-ci.org/github/benbalter/word-to-markdown) [![Build status](https://ci.appveyor.com/api/projects/status/x2gnsfvli3q47a2e/branch/master?svg=true)](https://ci.appveyor.com/project/benbalter/word-to-markdown/branch/master)
6
+
7
+ ## The problem
8
+
9
+ > Our default content publishing workflow is terribly broken. [We've all been trained to make paper](http://ben.balter.com/2012/10/19/we-ve-been-trained-to-make-paper/), yet today, content authored once is more commonly consumed in multiple formats, and rarely, if ever, does it embody physical form. Put another way, our go-to content authoring workflow remains relatively unchanged since it was conceived in the early 80s.
10
+ >
11
+ > I'm asked regularly by government employees — knowledge workers who fire up a desktop word processor as the first step to any project — for an automated pipeline to convert Microsoft Word documents to [Markdown](http://guides.github.com/overviews/mastering-markdown/), the *lingua franca* of the internet, but as my recent foray into building [just such a converter](http://word-to-markdown.herokuapp.com/) proves, it's not that simple.
12
+ >
13
+ > Markdown isn't just an alternative format. Markdown forces you to write for the web.
14
+
15
+ **[Read more](http://ben.balter.com/2014/03/31/word-versus-markdown-more-than-mere-semantics/)**
16
+
17
+ **[Demo](http://word-to-markdown.herokuapp.com/)**
18
+
19
+ ## Install
20
+
21
+ You'll need to install [LibreOffice](http://www.libreoffice.org/). Then:
22
+
23
+ ```bash
24
+ gem install word-to-markdown
25
+ ```
26
+
27
+ ## Usage
28
+
29
+ ```ruby
30
+ file = WordToMarkdown.new("/path/to/document.docx")
31
+ => <WordToMarkdown path="/path/to/document.docx">
32
+
33
+ file.to_s
34
+ => "# Test\n\n This is a test"
35
+
36
+ file.document.tree
37
+ => <Nokogiri Document>
38
+ ```
39
+
40
+ ### Command line usage
41
+
42
+ Once you've installed the gem, it's just:
43
+
44
+ ```
45
+ $ w2m path/to/document.docx
46
+ ```
47
+
48
+ *Outputs the resulting markdown to stdout*
49
+
50
+ ## Supports
51
+
52
+ * Paragraphs
53
+ * Numbered lists
54
+ * Unnumbered lists
55
+ * Nested lists
56
+ * Italic
57
+ * Bold
58
+ * Explicit headings (e.g., selected as "Heading 1" or "Heading 2")
59
+ * Implicit headings (e.g., text with a larger font size relative to paragraph text)
60
+ * Images
61
+ * Tables
62
+ * Hyperlinks
63
+
64
+ ## Requirements and configuration
65
+
66
+ Word-to-markdown requires `soffice`, a command-line interface to LibreOffice that works on Linux, Mac, and Windows. To install `soffice`, see [the LibreOffice documentation](https://www.libreoffice.org/get-help/install-howto/).
67
+
68
+ ## Testing
69
+
70
+ ```
71
+ script/cibuild
72
+ ```
73
+
74
+ ## Server
75
+
76
+ [Word-to-markdown-demo](https://github.com/benbalter/word-to-markdown-demo) contains a lightweight server for converting Word Documents as a service.
77
+
78
+ A live version runs at [word-to-markdown.herokuapp.com](http://word-to-markdown.herokuapp.com).
data/bin/w2m CHANGED
@@ -1,13 +1,14 @@
  #!/usr/bin/env ruby
+ # frozen_string_literal: true

  require 'word-to-markdown'

- if ARGV.size != 1 || ARGV[0] == "--help"
-   puts "Usage: bundle exec w2m path/to/document.docx"
+ if ARGV.size != 1 || ARGV[0] == '--help'
+   puts 'Usage: bundle exec w2m path/to/document.docx'
    exit 1
  end

- if ARGV[0] == "--version"
+ if ARGV[0] == '--version'
    puts "WordToMarkdown v#{WordToMarkdown::VERSION}"
    puts "LibreOffice v#{WordToMarkdown.soffice.version}" unless Gem.win_platform?
  else
data/lib/cliver/dependency_ext.rb CHANGED
@@ -1,16 +1,18 @@
1
+ # frozen_string_literal: true
2
+
1
3
  require 'sys/proctable'
2
4
 
3
5
  module Cliver
4
6
  class Dependency
5
-
6
7
  include Sys
7
8
 
8
9
  # Memoized shortcut for detect
9
10
  # Returns the path to the detected dependency
10
11
  # Raises an error if the dependency was not satisfied
11
- def path
12
+ def detected_path
12
13
  @detected_path ||= detect!
13
14
  end
15
+ alias path detected_path
14
16
 
15
17
  # Is the detected dependency currently open?
16
18
  def open?
@@ -24,12 +26,12 @@ module Cliver
24
26
  def version
25
27
  return @detected_version if defined? @detected_version
26
28
  return if Gem.win_platform?
27
- version = installed_versions.find { |p, v| p == path }
29
+ version = installed_versions.find { |p, _v| p == path }
28
30
  @detected_version = version.nil? ? nil : version[1]
29
31
  end
30
32
 
31
33
  def major_version
32
- version.split(".").first if version
34
+ version.split('.').first if version
33
35
  end
34
36
  end
35
37
  end
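The `Cliver::Dependency` change renames the memoized `path` method to `detected_path` and keeps `path` working through an alias. A minimal sketch of that pattern follows; `FakeDependency` and its `detect!` counter are hypothetical stand-ins so the example runs without Cliver or LibreOffice.

```ruby
# Hypothetical stand-in for Cliver::Dependency, used only to show the pattern.
class FakeDependency
  attr_reader :detect_calls, :version

  def initialize(found_path, version)
    @found_path = found_path
    @version = version
    @detect_calls = 0
  end

  # Pretend detection step; counts calls so memoization is observable.
  def detect!
    @detect_calls += 1
    @found_path
  end

  # Memoized lookup, mirroring the renamed method in the diff.
  def detected_path
    @detected_path ||= detect!
  end
  alias path detected_path # the old name keeps working

  # First dot-separated component of the version, or nil if unknown.
  def major_version
    version.split('.').first if version
  end
end

dep = FakeDependency.new('/usr/bin/soffice', '5.1.2')
dep.path
dep.detected_path
puts dep.detect_calls
puts dep.major_version
```

Naming the method after the instance variable it caches (`@detected_path`) is a common style-checker preference; the alias preserves the public name for existing callers.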
data/lib/nokogiri/xml/element.rb CHANGED
@@ -1,7 +1,8 @@
1
+ # frozen_string_literal: true
2
+
1
3
  module Nokogiri
2
4
  module XML
3
5
  class Element
4
-
5
6
  DEFAULT_FONT_SIZE = 12.to_f
6
7
 
7
8
  # The node's font size
@@ -13,11 +14,11 @@ module Nokogiri
13
14
  end
14
15
 
15
16
  def bold?
16
- styles['font-weight'] && styles['font-weight'] == "bold"
17
+ styles['font-weight'] && styles['font-weight'] == 'bold'
17
18
  end
18
19
 
19
20
  def italic?
20
- styles['font-style'] && styles['font-style'] == "italic"
21
+ styles['font-style'] && styles['font-style'] == 'italic'
21
22
  end
22
23
  end
23
24
  end
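The `bold?` and `italic?` predicates read inline CSS properties from a node's `styles`. The sketch below assumes a plain Hash in place of the object that nokogiri-styles provides, and `StyledNode` and `semantic_tag` are hypothetical names. It also drops the `styles[key] &&` guard, which only changes the missing-key result from `nil` to `false`.

```ruby
# Hypothetical stand-in for a node whose inline styles are exposed as a Hash.
class StyledNode
  attr_reader :styles

  def initialize(styles)
    @styles = styles
  end

  def bold?
    styles['font-weight'] == 'bold'
  end

  def italic?
    styles['font-style'] == 'italic'
  end

  # Tag name a span would be renamed to; bold takes precedence over italic,
  # matching the if/elsif order used by the converter.
  def semantic_tag
    if bold?
      'strong'
    elsif italic?
      'em'
    else
      'span'
    end
  end
end

puts StyledNode.new('font-weight' => 'bold', 'font-style' => 'italic').semantic_tag
puts StyledNode.new('font-style' => 'italic').semantic_tag
puts StyledNode.new({}).semantic_tag
```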
data/lib/word-to-markdown.rb CHANGED
@@ -1,3 +1,5 @@
1
+ # frozen_string_literal: true
2
+
1
3
  require 'descriptive_statistics'
2
4
  require 'reverse_markdown'
3
5
  require 'nokogiri-styles'
@@ -16,86 +18,93 @@ require_relative 'nokogiri/xml/element'
16
18
  require_relative 'cliver/dependency_ext'
17
19
 
18
20
  class WordToMarkdown
19
-
20
21
  attr_reader :document, :converter
21
22
 
23
+ # Options to be passed to Reverse Markdown
22
24
  REVERSE_MARKDOWN_OPTIONS = {
23
- unknown_tags: :bypass,
25
+ unknown_tags: :bypass,
24
26
  github_flavored: true
25
- }
27
+ }.freeze
26
28
 
27
- SOFFICE_VERSION_REQUIREMENT = '> 4.0'
29
+ # Minimum version of LibreOffice Required
30
+ SOFFICE_VERSION_REQUIREMENT = '> 4.0'.freeze
28
31
 
32
+ # Paths to look for LibreOffice, in order of preference
29
33
  PATHS = [
30
- "*", # Sub'd for ENV["PATH"]
31
- "~/Applications/LibreOffice.app/Contents/MacOS",
32
- "/Applications/LibreOffice.app/Contents/MacOS",
33
- "/Program Files/LibreOffice 5/program",
34
- "/Program Files (x86)/LibreOffice 4/program"
35
- ]
34
+ '*', # Sub'd for ENV["PATH"]
35
+ '~/Applications/LibreOffice.app/Contents/MacOS',
36
+ '/Applications/LibreOffice.app/Contents/MacOS',
37
+ '/Program Files/LibreOffice 5/program',
38
+ '/Program Files (x86)/LibreOffice 4/program'
39
+ ].freeze
36
40
 
37
41
  # Create a new WordToMarkdown object
38
42
  #
39
- # input - a HTML string or path to an HTML file
40
- #
41
- # Returns the WordToMarkdown object
43
+ # @param path [string] Path to the Word document
44
+ # @param tmpdir [string] Path to a working directory to use
45
+ # @return [WordToMarkdown] WordToMarkdown object with the converted document
42
46
  def initialize(path, tmpdir = nil)
43
47
  @document = WordToMarkdown::Document.new path, tmpdir
44
48
  @converter = WordToMarkdown::Converter.new @document
45
49
  converter.convert!
46
50
  end
47
51
 
48
- def self.run_command(*args)
49
- raise "LibreOffice already running" if soffice.open?
50
-
51
- output, status = Open3.capture2e(soffice.path, *args)
52
- logger.debug output
53
- raise "Command `#{soffice.path} #{args.join(" ")}` failed: #{output}" if status.exitstatus != 0
54
- output
52
+ # Helper method to return the document body, as markdown
53
+ # @return [string] the document body, as markdown
54
+ def to_s
55
+ document.to_s
55
56
  end
56
57
 
57
- # Returns a Cliver::Dependency object representing our soffice dependency
58
- #
59
- # Attempts to resolve by looking at PATH followed by paths in the PATHS constant
60
- #
61
- # Methods used internally:
62
- # path - returns the resolved path. Raises an error if not satisfied
63
- # version - returns the resolved version
64
- # open - is the dependency currently open/running?
65
- def self.soffice
66
- @@soffice_dependency ||= Cliver::Dependency.new("soffice", *soffice_dependency_args)
67
- end
58
+ class << self
59
+ # Run an soffice command
60
+ #
61
+ # @param args [string] one or more arguments to pass to the soffice command
62
+ # @return [string] the command output
63
+ def run_command(*args)
64
+ raise 'LibreOffice already running' if soffice.open?
68
65
 
69
- def self.logger
70
- @@logger ||= begin
71
- logger = Logger.new(STDOUT)
72
- logger.level = Logger::ERROR unless ENV["DEBUG"]
73
- logger
66
+ output, status = Open3.capture2e(soffice.path, *args)
67
+ logger.debug output
68
+ raise "Command `#{soffice.path} #{args.join(' ')}` failed: #{output}" if status.exitstatus != 0
69
+ output
74
70
  end
75
- end
76
71
 
77
- # Pretty print the class in console
78
- def inspect
79
- "<WordToMarkdown path=\"#{@document.path}\">"
80
- end
72
+ # Returns a Cliver::Dependency object representing our soffice dependency
73
+ #
74
+ # Attempts to resolve by looking at PATH followed by paths in the PATHS constant
75
+ #
76
+ # Methods used internally:
77
+ # path - returns the resolved path. Raises an error if not satisfied
78
+ # version - returns the resolved version
79
+ # open - is the dependency currently open/running?
80
+ # @return Cliver::Dependency instance
81
+ def soffice
82
+ @soffice ||= Cliver::Dependency.new('soffice', *soffice_dependency_args)
83
+ end
81
84
 
82
- def to_s
83
- document.to_s
84
- end
85
+ # @return Logger instance
86
+ def logger
87
+ @logger ||= begin
88
+ logger = Logger.new(STDOUT)
89
+ logger.level = Logger::ERROR unless ENV['DEBUG']
90
+ logger
91
+ end
92
+ end
85
93
 
86
- private
94
+ private
87
95
 
88
- # Workaround for two upstream bugs:
89
- # 1. `soffice.exe --version` on windows opens a popup and retuns a null string when manually closed
90
- # 2. Even if the second argument to Cliver is nil, Cliver thinks there's a requirement
91
- # and will shell out to `soffice.exe --version`
92
- # In order to support Windows, don't pass *any* version requirement to Cliver
93
- def self.soffice_dependency_args
94
- args = [:path => PATHS.join(File::PATH_SEPARATOR)]
95
- if Gem.win_platform?
96
- args
97
- else
98
- args.unshift SOFFICE_VERSION_REQUIREMENT
96
+ # Workaround for two upstream bugs:
97
+ # 1. `soffice.exe --version` on windows opens a popup and retuns a null string when manually closed
98
+ # 2. Even if the second argument to Cliver is nil, Cliver thinks there's a requirement
99
+ # and will shell out to `soffice.exe --version`
100
+ # In order to support Windows, don't pass *any* version requirement to Cliver
101
+ def soffice_dependency_args
102
+ args = [path: PATHS.join(File::PATH_SEPARATOR)]
103
+ if Gem.win_platform?
104
+ args
105
+ else
106
+ args.unshift SOFFICE_VERSION_REQUIREMENT
107
+ end
99
108
  end
100
109
  end
101
110
  end
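This file moves the class-level helpers into a `class << self` block, swaps the `@@` class variables for class-level instance variables, and freezes the constants. One effect worth noting: in 1.1.7 the `private` keyword preceded `def self.soffice_dependency_args`, and a bare `private` does not apply to methods defined with `def self.`; inside `class << self` it does. A minimal sketch of the pattern, where `Tool`, `dependency_args`, and `build_args` are hypothetical names rather than the gem's API:

```ruby
require 'logger'

# Hypothetical class illustrating the refactor; it is not the gem's code.
class Tool
  OPTIONS = { unknown_tags: :bypass, github_flavored: true }.freeze

  class << self
    # Memoized in a class-level instance variable rather than @@logger.
    def logger
      @logger ||= begin
        logger = Logger.new($stdout)
        logger.level = Logger::ERROR unless ENV['DEBUG']
        logger
      end
    end

    def dependency_args
      build_args
    end

    private

    # `private` inside `class << self` hides this singleton method.
    def build_args
      [{ path: '/usr/bin' }]
    end
  end
end

puts Tool.logger.equal?(Tool.logger)
puts Tool.respond_to?(:build_args)
puts Tool::OPTIONS.frozen?
```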
@@ -1,18 +1,29 @@
1
- # encoding: utf-8
1
+ # frozen_string_literal: true
2
+
2
3
  class WordToMarkdown
3
4
  class Converter
4
-
5
5
  attr_reader :document
6
6
 
7
- HEADING_DEPTH = 6 # Number of headings to guess, e.g., h6
8
- HEADING_STEP = 100/HEADING_DEPTH
7
+ # Number of headings to guess, e.g., h6
8
+ HEADING_DEPTH = 6
9
+
10
+ # Percentile step for each heading
11
+ HEADING_STEP = 100 / HEADING_DEPTH
12
+
13
+ # Minimum heading size
9
14
  MIN_HEADING_SIZE = 20
10
- UNICODE_BULLETS = ["○", "o", "●", "\u2022", "\\p{C}"]
11
15
 
16
+ # Unicode bullets to strip when processing
17
+ UNICODE_BULLETS = ['○', 'o', '●', "\u2022", '\\p{C}'].freeze
18
+
19
+ # @param document [WordToMarkdown::Document] The document to convert
12
20
  def initialize(document)
13
21
  @document = document
14
22
  end
15
23
 
24
+ # Convert the document
25
+ #
26
+ # Note: this action is destructive!
16
27
  def convert!
17
28
  # Fonts and headings
18
29
  semanticize_font_styles!
@@ -29,22 +40,22 @@ class WordToMarkdown
29
40
  remove_numbering_from_list_items!
30
41
  end
31
42
 
32
- # Returns an array of Nokogiri nodes that are implicit headings
43
+ # @return [Array<Nokogiri::Node>] Return an array of Nokogiri Nodes that are implicit headings
33
44
  def implicit_headings
34
45
  @implicit_headings ||= begin
35
46
  headings = []
36
- @document.tree.css("[style]").each do |element|
47
+ @document.tree.css('[style]').each do |element|
37
48
  headings.push element unless element.font_size.nil? || element.font_size < MIN_HEADING_SIZE
38
49
  end
39
50
  headings
40
51
  end
41
52
  end
42
53
 
43
- # Returns an array of font-sizes for implicit headings in the document
54
+ # @return [Array<Integer>] An array of font-sizes for implicit headings in the document
44
55
  def font_sizes
45
56
  @font_sizes ||= begin
46
57
  sizes = []
47
- @document.tree.css("[style]").each do |element|
58
+ @document.tree.css('[style]').each do |element|
48
59
  sizes.push element.font_size.round(-1) unless element.font_size.nil?
49
60
  end
50
61
  sizes.uniq.sort
@@ -53,11 +64,10 @@ class WordToMarkdown
53
64
 
54
65
  # Given a Nokogiri node, guess what heading it represents, if any
55
66
  #
56
- # node - the nokigiri node
57
- #
58
- # retuns the heading tag (e.g., H1), or nil
67
+ # @param node [Nokogiri::Node] the Nokogiri node
68
+ # @return [String, nil] the heading tag (e.g., H1), or nil
59
69
  def guess_heading(node)
60
- return nil if node.font_size == nil
70
+ return nil if node.font_size.nil?
61
71
  [*1...HEADING_DEPTH].each do |heading|
62
72
  return "h#{heading}" if node.font_size >= h(heading)
63
73
  end
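`guess_heading` compares a node's font size with per-heading thresholds taken from percentiles of the document's font sizes (the `h` method in the next hunk). The sketch below reproduces that arithmetic in isolation. The gem relies on `percentile` from the descriptive_statistics gem; the linear-interpolation `percentile` here is an assumption and may not match that gem's definition exactly, and the `HeadingSketch` module and `min_size_for` name are hypothetical.

```ruby
module HeadingSketch
  HEADING_DEPTH = 6
  HEADING_STEP = 100 / HEADING_DEPTH # integer division, so 16

  module_function

  # Assumed percentile definition: linear interpolation between ranks
  # of an already sorted array.
  def percentile(sorted, pct)
    return nil if sorted.empty?

    rank = pct / 100.0 * (sorted.size - 1)
    lower = sorted[rank.floor]
    upper = sorted[rank.ceil]
    lower + (upper - lower) * (rank - rank.floor)
  end

  # Minimum font size for heading level `num`, as in the diff's `h(num)`.
  def min_size_for(num, font_sizes)
    percentile(font_sizes, ((HEADING_DEPTH - 1) - num) * HEADING_STEP)
  end

  # Returns "h1".."h5" for the first threshold the size meets, else nil.
  def guess_heading(font_size, font_sizes)
    return nil if font_size.nil?

    (1...HEADING_DEPTH).each do |heading|
      return "h#{heading}" if font_size >= min_size_for(heading, font_sizes)
    end
    nil
  end
end

sizes = [20, 30, 40]
[40, 30, 20, 10].each do |size|
  puts "#{size} -> #{HeadingSketch.guess_heading(size, sizes).inspect}"
end
```

Because the range `1...HEADING_DEPTH` excludes its end, only levels 1 through 5 are ever guessed, and the level 5 threshold sits at percentile 0, the smallest recorded size.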
@@ -67,51 +77,58 @@ class WordToMarkdown
67
77
  # Minimum font size required for a given heading
68
78
  # e.g., H(2) would represent the minimum font size of an implicit h2
69
79
  #
70
- # n - the heading number, e.g., 1, 2
80
+ # @param num [Integer] the heading number, e.g., 1, 2
71
81
  #
72
- # returns the minimum font size as an integer
73
- def h(n)
74
- font_sizes.percentile ((HEADING_DEPTH-1)-n) * HEADING_STEP
82
+ # @return [Integer] the minimum font size
83
+ def h(num)
84
+ font_sizes.percentile(((HEADING_DEPTH - 1) - num) * HEADING_STEP)
75
85
  end
76
86
 
87
+ # Convert span-based font styles to `strong`s and `em`s
77
88
  def semanticize_font_styles!
78
- @document.tree.css("span").each do |node|
89
+ @document.tree.css('span').each do |node|
79
90
  if node.bold?
80
- node.node_name = "strong"
91
+ node.node_name = 'strong'
81
92
  elsif node.italic?
82
- node.node_name = "em"
93
+ node.node_name = 'em'
83
94
  end
84
95
  end
85
96
  end
86
97
 
98
+ # Remove top-level paragraphs from table cells
87
99
  def remove_paragraphs_from_tables!
88
- @document.tree.search("td p").each { |node| node.node_name = "span" }
100
+ @document.tree.search('td p').each { |node| node.node_name = 'span' }
89
101
  end
90
102
 
103
+ # Remove top-level paragraphs from list items
91
104
  def remove_paragraphs_from_list_items!
92
- @document.tree.search("li p").each { |node| node.node_name = "span" }
105
+ @document.tree.search('li p').each { |node| node.node_name = 'span' }
93
106
  end
94
107
 
108
+ # Remove prepended unicode bullets from list items
95
109
  def remove_unicode_bullets_from_list_items!
96
- path = WordToMarkdown.soffice.major_version == "5" ? "li span span" : "li span"
110
+ path = WordToMarkdown.soffice.major_version == '5' ? 'li span span' : 'li span'
97
111
  @document.tree.search(path).each do |span|
98
- span.inner_html = span.inner_html.gsub /^([#{UNICODE_BULLETS.join("")}]+)/, ""
112
+ span.inner_html = span.inner_html.gsub(/^([#{UNICODE_BULLETS.join("")}]+)/, '')
99
113
  end
100
114
  end
101
115
 
116
+ # Remove prepended numbers from list items
102
117
  def remove_numbering_from_list_items!
103
- path = WordToMarkdown.soffice.major_version == "5" ? "li span span" : "li span"
118
+ path = WordToMarkdown.soffice.major_version == '5' ? 'li span span' : 'li span'
104
119
  @document.tree.search(path).each do |span|
105
- span.inner_html = span.inner_html.gsub /^[a-zA-Z0-9]+\./m, ""
120
+ span.inner_html = span.inner_html.gsub(/^[a-zA-Z0-9]+\./m, '')
106
121
  end
107
122
  end
108
123
 
124
+ # Remove whitespace from list items
109
125
  def remove_whitespace_from_list_items!
110
- @document.tree.search("li span").each { |span| span.inner_html.strip! }
126
+ @document.tree.search('li span').each { |span| span.inner_html.strip! }
111
127
  end
112
128
 
129
+ # Convert table headers to `th`s
113
130
  def semanticize_table_headers!
114
- @document.tree.search("table tr:first td").each { |node| node.node_name = "th" }
131
+ @document.tree.search('table tr:first td').each { |node| node.node_name = 'th' }
115
132
  end
116
133
 
117
134
  # Try to guess heading where implicit, based on font size
@@ -121,6 +138,5 @@ class WordToMarkdown
121
138
  element.node_name = heading unless heading.nil?
122
139
  end
123
140
  end
124
-
125
141
  end
126
142
  end
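The list clean-up methods in this file strip leading bullet characters and leading numbering with two regular expressions. The sketch below applies the same patterns to plain strings rather than Nokogiri nodes; the `ListCleanupSketch` module and the method names `strip_bullets` and `strip_numbering` are hypothetical.

```ruby
module ListCleanupSketch
  # Same character list as the converter's constant; \p{C} covers
  # control and other non-printing characters.
  UNICODE_BULLETS = ['○', 'o', '●', "\u2022", '\\p{C}'].freeze

  module_function

  # Remove a run of bullet characters at the start of each line.
  def strip_bullets(text)
    text.gsub(/^([#{UNICODE_BULLETS.join}]+)/, '')
  end

  # Remove a leading "1." or "a." style marker at the start of each line.
  def strip_numbering(text)
    text.gsub(/^[a-zA-Z0-9]+\./m, '')
  end
end

puts ListCleanupSketch.strip_bullets("\u2022 Item").inspect
puts ListCleanupSketch.strip_numbering('1. First').inspect
puts ListCleanupSketch.strip_bullets('orange').inspect
```

Since a plain `o` is in the bullet list, text that starts with that letter loses it, and the numbering pattern removes any leading alphanumeric run that ends in a period. Both are worth keeping in mind when reusing these patterns outside list items.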
@@ -1,50 +1,54 @@
1
- # encoding: utf-8
1
+ # frozen_string_literal: true
2
+
2
3
  class WordToMarkdown
3
4
  class Document
4
5
  class NotFoundError < StandardError; end
5
- class ConverstionError < StandardError; end
6
+ class ConversionError < StandardError; end
6
7
 
7
- attr_reader :path, :raw_html, :tmpdir
8
+ attr_reader :path, :tmpdir
8
9
 
10
+ # @param path [string] Path to the Word document
11
+ # @param tmpdir [string] Path to a working directory to use
9
12
  def initialize(path, tmpdir = nil)
10
13
  @path = File.expand_path path, Dir.pwd
11
14
  @tmpdir = tmpdir || Dir.mktmpdir
12
15
  raise NotFoundError, "File #{@path} does not exist" unless File.exist?(@path)
13
16
  end
14
17
 
18
+ # @return [String] the document's extension
15
19
  def extension
16
20
  File.extname path
17
21
  end
18
22
 
23
+ # @return [Nokogiri::Document]
19
24
  def tree
20
25
  @tree ||= begin
21
26
  tree = Nokogiri::HTML(normalized_html)
22
- tree.css("title").remove
27
+ tree.css('title').remove
23
28
  tree
24
29
  end
25
30
  end
26
31
 
27
- # Returns the html representation of the document
32
+ # @return [String] the html representation of the document
28
33
  def html
29
- tree.to_html.gsub("</li>\n", "</li>")
34
+ tree.to_html.gsub("</li>\n", '</li>')
30
35
  end
31
36
 
32
- # Returns the markdown representation of the document
33
- def to_s
37
+ # @return [String] the markdown representation of the document
38
+ def markdown
34
39
  @markdown ||= scrub_whitespace(ReverseMarkdown.convert(html, WordToMarkdown::REVERSE_MARKDOWN_OPTIONS))
35
40
  end
41
+ alias to_s markdown
36
42
 
37
43
  # Determine the document encoding
38
44
  #
39
- # html - the raw html export
40
- #
41
- # Returns the encoding, defaulting to "UTF-8"
45
+ # @return [String] the encoding, defaulting to "UTF-8"
42
46
  def encoding
43
- match = raw_html.encode("UTF-8", :invalid => :replace, :replace => "").match(/charset=([^\"]+)/)
47
+ match = raw_html.encode('UTF-8', invalid: :replace, replace: '').match(/charset=([^\"]+)/)
44
48
  if match
45
- match[1].sub("macintosh", "MacRoman")
49
+ match[1].sub('macintosh', 'MacRoman')
46
50
  else
47
- "UTF-8"
51
+ 'UTF-8'
48
52
  end
49
53
  end
50
54
 
@@ -52,55 +56,57 @@ class WordToMarkdown
52
56
 
53
57
  # Perform pre-processing normalization
54
58
  #
55
- # html - the raw html input from the export
56
- #
57
- # Returns the normalized html
59
+ # @return [String] the normalized html
58
60
  def normalized_html
59
- html = raw_html.force_encoding(encoding)
60
- html = html.encode("UTF-8", :invalid => :replace, :replace => "")
61
- html = Premailer.new(html, :with_html_string => true, :input_encoding => "UTF-8").to_inline_css
62
- html.gsub! /\n|\r/," " # Remove linebreaks
63
- html.gsub! /“|”/, '"' # Straighten curly double quotes
64
- html.gsub! /‘|’/, "'" # Straighten curly single quotes
65
- html.gsub! />\s+</, "><" # Remove extra whitespace between tags
61
+ html = raw_html.dup.force_encoding(encoding)
62
+ html = html.encode('UTF-8', invalid: :replace, replace: '')
63
+ html = Premailer.new(html, with_html_string: true, input_encoding: 'UTF-8').to_inline_css
64
+ html.gsub!(/\n|\r/, ' ') # Remove linebreaks
65
+ html.gsub!(/“|”/, '"') # Straighten curly double quotes
66
+ html.gsub!(/‘|’/, "'") # Straighten curly single quotes
67
+ html.gsub!(/>\s+</, '><') # Remove extra whitespace between tags
66
68
  html
67
69
  end
68
70
 
69
71
  # Perform post-processing normalization of certain Word quirks
70
72
  #
71
- # string - the markdown representation of the document
73
+ # @param string [String] the markdown representation of the document
72
74
  #
73
- # Returns the normalized markdown
75
+ # @return [String] the normalized markdown
74
76
  def scrub_whitespace(string)
75
- string.gsub!("&nbsp;", " ") # HTML encoded spaces
76
- string.sub!(/\A[[:space:]]+/,'') # document leading whitespace
77
- string.sub!(/[[:space:]]+\z/,'') # document trailing whitespace
78
- string.gsub!(/([ ]+)$/, '') # line trailing whitespace
79
- string.gsub!(/\n\n\n\n/,"\n\n") # Quadruple line breaks
80
- string.gsub!(/\u00A0/, "") # Unicode non-breaking spaces, injected as tabs
77
+ string = string.dup
78
+ string.gsub!('&nbsp;', ' ') # HTML encoded spaces
79
+ string.sub!(/\A[[:space:]]+/, '') # document leading whitespace
80
+ string.sub!(/[[:space:]]+\z/, '') # document trailing whitespace
81
+ string.gsub!(/([ ]+)$/, '') # line trailing whitespace
82
+ string.gsub!(/\n\n\n\n/, "\n\n") # Quadruple line breaks
83
+ string.delete!("\u00A0") # Unicode non-breaking spaces, injected as tabs
81
84
  string
82
85
  end
83
86
 
87
+ # @return [String] the path to the intermediary HTML document
84
88
  def dest_path
85
- dest_filename = File.basename(path).gsub(/#{Regexp.escape(extension)}$/, ".html")
89
+ dest_filename = File.basename(path).gsub(/#{Regexp.escape(extension)}$/, '.html')
86
90
  File.expand_path(dest_filename, tmpdir)
87
91
  end
88
92
 
93
+ # @return [String] the unnormalized HTML representation
89
94
  def raw_html
90
95
  @raw_html ||= begin
91
- WordToMarkdown::run_command '--headless', '--convert-to', filter, path, '--outdir', tmpdir
92
- raise ConverstionError, "Failed to convert #{path}" unless File.exists?(dest_path)
96
+ WordToMarkdown.run_command '--headless', '--convert-to', filter, path, '--outdir', tmpdir
97
+ raise ConversionError, "Failed to convert #{path}" unless File.exist?(dest_path)
93
98
  html = File.read dest_path
94
99
  File.delete dest_path
95
100
  html
96
101
  end
97
102
  end
98
103
 
104
+ # @return [String] the LibreOffice filter to use for conversion
99
105
  def filter
100
- if WordToMarkdown.soffice.major_version == "5"
101
- "html:XHTML Writer File:UTF8"
106
+ if WordToMarkdown.soffice.major_version == '5'
107
+ 'html:XHTML Writer File:UTF8'
102
108
  else
103
- "html"
109
+ 'html'
104
110
  end
105
111
  end
106
112
  end
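With `# frozen_string_literal: true` at the top of the file, in-place methods such as `gsub!` raise on frozen strings, which is presumably why `scrub_whitespace` and `normalized_html` now call `dup` first. A standalone sketch of the whitespace scrubbing follows, written as a top-level method for illustration (in the gem it is a method on `WordToMarkdown::Document`); the sample input is made up.

```ruby
# Standalone sketch of the post-processing step shown in the diff.
def scrub_whitespace(string)
  string = string.dup                  # work on an unfrozen copy
  string.gsub!('&nbsp;', ' ')          # HTML encoded spaces
  string.sub!(/\A[[:space:]]+/, '')    # document leading whitespace
  string.sub!(/[[:space:]]+\z/, '')    # document trailing whitespace
  string.gsub!(/([ ]+)$/, '')          # line trailing whitespace
  string.gsub!(/\n\n\n\n/, "\n\n")     # quadruple line breaks
  string.delete!("\u00A0")             # Unicode non-breaking spaces
  string
end

# Illustrative input; frozen explicitly to show that the caller's
# string is left untouched.
input = "  # Title   \n\n\n\nBody&nbsp;text\u00A0here\n\n".freeze
puts scrub_whitespace(input).inspect
puts input.frozen?
```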
@@ -1,3 +1,5 @@
+ # frozen_string_literal: true
+
  class WordToMarkdown
-   VERSION = "1.1.7"
+   VERSION = '1.1.8'.freeze
  end
metadata CHANGED
@@ -1,29 +1,29 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: word-to-markdown
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.1.7
4
+ version: 1.1.8
5
5
  platform: ruby
6
6
  authors:
7
7
  - Ben Balter
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2016-01-04 00:00:00.000000000 Z
11
+ date: 2018-08-01 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
- name: reverse_markdown
14
+ name: cliver
15
15
  requirement: !ruby/object:Gem::Requirement
16
16
  requirements:
17
17
  - - "~>"
18
18
  - !ruby/object:Gem::Version
19
- version: '0.6'
19
+ version: '0.3'
20
20
  type: :runtime
21
21
  prerelease: false
22
22
  version_requirements: !ruby/object:Gem::Requirement
23
23
  requirements:
24
24
  - - "~>"
25
25
  - !ruby/object:Gem::Version
26
- version: '0.6'
26
+ version: '0.3'
27
27
  - !ruby/object:Gem::Dependency
28
28
  name: descriptive_statistics
29
29
  requirement: !ruby/object:Gem::Requirement
@@ -39,103 +39,103 @@ dependencies:
39
39
  - !ruby/object:Gem::Version
40
40
  version: '2.5'
41
41
  - !ruby/object:Gem::Dependency
42
- name: premailer
42
+ name: nokogiri-styles
43
43
  requirement: !ruby/object:Gem::Requirement
44
44
  requirements:
45
45
  - - "~>"
46
46
  - !ruby/object:Gem::Version
47
- version: '1.8'
47
+ version: '0.1'
48
48
  type: :runtime
49
49
  prerelease: false
50
50
  version_requirements: !ruby/object:Gem::Requirement
51
51
  requirements:
52
52
  - - "~>"
53
53
  - !ruby/object:Gem::Version
54
- version: '1.8'
54
+ version: '0.1'
55
55
  - !ruby/object:Gem::Dependency
56
- name: nokogiri-styles
56
+ name: premailer
57
57
  requirement: !ruby/object:Gem::Requirement
58
58
  requirements:
59
59
  - - "~>"
60
60
  - !ruby/object:Gem::Version
61
- version: '0.1'
61
+ version: '1.8'
62
62
  type: :runtime
63
63
  prerelease: false
64
64
  version_requirements: !ruby/object:Gem::Requirement
65
65
  requirements:
66
66
  - - "~>"
67
67
  - !ruby/object:Gem::Version
68
- version: '0.1'
68
+ version: '1.8'
69
69
  - !ruby/object:Gem::Dependency
70
- name: sys-proctable
70
+ name: reverse_markdown
71
71
  requirement: !ruby/object:Gem::Requirement
72
72
  requirements:
73
73
  - - "~>"
74
74
  - !ruby/object:Gem::Version
75
- version: '0.9'
75
+ version: '1.0'
76
76
  type: :runtime
77
77
  prerelease: false
78
78
  version_requirements: !ruby/object:Gem::Requirement
79
79
  requirements:
80
80
  - - "~>"
81
81
  - !ruby/object:Gem::Version
82
- version: '0.9'
82
+ version: '1.0'
83
83
  - !ruby/object:Gem::Dependency
84
- name: cliver
84
+ name: sys-proctable
85
85
  requirement: !ruby/object:Gem::Requirement
86
86
  requirements:
87
87
  - - "~>"
88
88
  - !ruby/object:Gem::Version
89
- version: '0.3'
89
+ version: '1.0'
90
90
  type: :runtime
91
91
  prerelease: false
92
92
  version_requirements: !ruby/object:Gem::Requirement
93
93
  requirements:
94
94
  - - "~>"
95
95
  - !ruby/object:Gem::Version
96
- version: '0.3'
96
+ version: '1.0'
97
97
  - !ruby/object:Gem::Dependency
98
- name: rake
98
+ name: bundler
99
99
  requirement: !ruby/object:Gem::Requirement
100
100
  requirements:
101
101
  - - "~>"
102
102
  - !ruby/object:Gem::Version
103
- version: '10.4'
103
+ version: '1.6'
104
104
  type: :development
105
105
  prerelease: false
106
106
  version_requirements: !ruby/object:Gem::Requirement
107
107
  requirements:
108
108
  - - "~>"
109
109
  - !ruby/object:Gem::Version
110
- version: '10.4'
110
+ version: '1.6'
111
111
  - !ruby/object:Gem::Dependency
112
- name: shoulda
112
+ name: minitest
113
113
  requirement: !ruby/object:Gem::Requirement
114
114
  requirements:
115
115
  - - "~>"
116
116
  - !ruby/object:Gem::Version
117
- version: '3.5'
117
+ version: '5.0'
118
118
  type: :development
119
119
  prerelease: false
120
120
  version_requirements: !ruby/object:Gem::Requirement
121
121
  requirements:
122
122
  - - "~>"
123
123
  - !ruby/object:Gem::Version
124
- version: '3.5'
124
+ version: '5.0'
125
125
  - !ruby/object:Gem::Dependency
126
- name: bundler
126
+ name: mocha
127
127
  requirement: !ruby/object:Gem::Requirement
128
128
  requirements:
129
129
  - - "~>"
130
130
  - !ruby/object:Gem::Version
131
- version: '1.6'
131
+ version: '1.1'
132
132
  type: :development
133
133
  prerelease: false
134
134
  version_requirements: !ruby/object:Gem::Requirement
135
135
  requirements:
136
136
  - - "~>"
137
137
  - !ruby/object:Gem::Version
138
- version: '1.6'
138
+ version: '1.1'
139
139
  - !ruby/object:Gem::Dependency
140
140
  name: pry
141
141
  requirement: !ruby/object:Gem::Requirement
@@ -151,33 +151,47 @@ dependencies:
151
151
  - !ruby/object:Gem::Version
152
152
  version: '0.10'
153
153
  - !ruby/object:Gem::Dependency
154
- name: mocha
154
+ name: rake
155
155
  requirement: !ruby/object:Gem::Requirement
156
156
  requirements:
157
157
  - - "~>"
158
158
  - !ruby/object:Gem::Version
159
- version: '1.1'
159
+ version: '10.4'
160
160
  type: :development
161
161
  prerelease: false
162
162
  version_requirements: !ruby/object:Gem::Requirement
163
163
  requirements:
164
164
  - - "~>"
165
165
  - !ruby/object:Gem::Version
166
- version: '1.1'
166
+ version: '10.4'
167
167
  - !ruby/object:Gem::Dependency
168
- name: minitest
168
+ name: rubocop
169
169
  requirement: !ruby/object:Gem::Requirement
170
170
  requirements:
171
171
  - - "~>"
172
172
  - !ruby/object:Gem::Version
173
- version: '5.0'
173
+ version: '0.49'
174
174
  type: :development
175
175
  prerelease: false
176
176
  version_requirements: !ruby/object:Gem::Requirement
177
177
  requirements:
178
178
  - - "~>"
179
179
  - !ruby/object:Gem::Version
180
- version: '5.0'
180
+ version: '0.49'
181
+ - !ruby/object:Gem::Dependency
182
+ name: shoulda
183
+ requirement: !ruby/object:Gem::Requirement
184
+ requirements:
185
+ - - "~>"
186
+ - !ruby/object:Gem::Version
187
+ version: '3.5'
188
+ type: :development
189
+ prerelease: false
190
+ version_requirements: !ruby/object:Gem::Requirement
191
+ requirements:
192
+ - - "~>"
193
+ - !ruby/object:Gem::Version
194
+ version: '3.5'
181
195
  description: Ruby Gem to convert Word documents to markdown.
182
196
  email: ben.balter@github.com
183
197
  executables:
@@ -185,6 +199,8 @@ executables:
185
199
  extensions: []
186
200
  extra_rdoc_files: []
187
201
  files:
202
+ - LICENSE.md
203
+ - README.md
188
204
  - bin/w2m
189
205
  - lib/cliver/dependency_ext.rb
190
206
  - lib/nokogiri/xml/element.rb
@@ -212,7 +228,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
212
228
  version: '0'
213
229
  requirements: []
214
230
  rubyforge_project:
215
- rubygems_version: 2.5.1
231
+ rubygems_version: 2.7.6
216
232
  signing_key:
217
233
  specification_version: 4
218
234
  summary: Ruby Gem to convert Word documents to markdown
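Every dependency in the metadata uses the pessimistic `~>` operator. Because the list was re-sorted alphabetically, the hunks pair unrelated gem names; read by name, the runtime constraint changes are reverse_markdown (`~> 0.6` to `~> 1.0`) and sys-proctable (`~> 0.9` to `~> 1.0`), plus a new rubocop development dependency. A short sketch of what those constraints admit, using RubyGems' own classes; the `allowed?` helper and the sample version numbers are hypothetical.

```ruby
require 'rubygems'

# Hypothetical helper: does `version` satisfy the given constraint string?
def allowed?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

puts allowed?('~> 1.0', '1.4.0')   # within the 1.x series
puts allowed?('~> 1.0', '2.0.0')   # next major version
puts allowed?('~> 0.9', '1.0.0')   # the old sys-proctable constraint
puts allowed?('~> 0.49', '0.58.0') # the rubocop constraint
```

With a two-segment version, `~> X.Y` means at least `X.Y` and below the next major version, so moving from `~> 0.9` to `~> 1.0` is what permits the 1.x releases.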