word-to-markdown 1.1.7 → 1.1.8
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +5 -5
- data/LICENSE.md +21 -0
- data/README.md +78 -0
- data/bin/w2m +4 -3
- data/lib/cliver/dependency_ext.rb +6 -4
- data/lib/nokogiri/xml/element.rb +4 -3
- data/lib/word-to-markdown.rb +65 -56
- data/lib/word-to-markdown/converter.rb +45 -29
- data/lib/word-to-markdown/document.rb +44 -38
- data/lib/word-to-markdown/version.rb +3 -1
- metadata +49 -33
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
|
-
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
2
|
+
SHA256:
|
3
|
+
metadata.gz: 3febb4398acdc4eacedcc62e09f4beeaee625858043a27e6df7e597fee1e0d17
|
4
|
+
data.tar.gz: 7a76057aeca2db8f321282bc309a835f282ea921343b28c8bce6d83cd0fc4582
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: ee2340688c2d5f3f21c7e47f85220bebc88201e28200a682146041fa7f5a47e89c56d687b4f6565a95d7d3e7f4c70fb1551894ad297620e7bd49cef520975e18
|
7
|
+
data.tar.gz: 44d405387990ee9a09cb33572a1d7843461c0817789c92a08df6d3542082ffc837bbd39d544a2e6a734a0772c67da3afed77049dc96b0ace0d562473515816e1
|
data/LICENSE.md
ADDED
@@ -0,0 +1,21 @@
|
|
1
|
+
The MIT License (MIT)
|
2
|
+
|
3
|
+
Copyright (c) 2014, Ben Balter
|
4
|
+
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
7
|
+
in the Software without restriction, including without limitation the rights
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
10
|
+
furnished to do so, subject to the following conditions:
|
11
|
+
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
13
|
+
copies or substantial portions of the Software.
|
14
|
+
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
21
|
+
SOFTWARE.
|
data/README.md
ADDED
@@ -0,0 +1,78 @@
|
|
1
|
+
# Word to Markdown converter
|
2
|
+
|
3
|
+
A Ruby gem to liberate content from [the jail that is Word documents](http://ben.balter.com/2012/10/19/we-ve-been-trained-to-make-paper/#jailbreaking-content)
|
4
|
+
|
5
|
+
[](https://travis-ci.org/benbalter/word-to-markdown) [](http://badge.fury.io/rb/word-to-markdown) [](http://inch-ci.org/github/benbalter/word-to-markdown) [](https://ci.appveyor.com/project/benbalter/word-to-markdown/branch/master)
|
6
|
+
|
7
|
+
## The problem
|
8
|
+
|
9
|
+
> Our default content publishing workflow is terribly broken. [We've all been trained to make paper](http://ben.balter.com/2012/10/19/we-ve-been-trained-to-make-paper/), yet today, content authored once is more commonly consumed in multiple formats, and rarely, if ever, does it embody physical form. Put another way, our go-to content authoring workflow remains relatively unchanged since it was conceived in the early 80s.
|
10
|
+
>
|
11
|
+
> I'm asked regularly by government employees — knowledge workers who fire up a desktop word processor as the first step to any project — for an automated pipeline to convert Microsoft Word documents to [Markdown](http://guides.github.com/overviews/mastering-markdown/), the *lingua franca* of the internet, but as my recent foray into building [just such a converter](http://word-to-markdown.herokuapp.com/) proves, it's not that simple.
|
12
|
+
>
|
13
|
+
> Markdown isn't just an alternative format. Markdown forces you to write for the web.
|
14
|
+
|
15
|
+
**[Read more](http://ben.balter.com/2014/03/31/word-versus-markdown-more-than-mere-semantics/)**
|
16
|
+
|
17
|
+
**[Demo](http://word-to-markdown.herokuapp.com/)**
|
18
|
+
|
19
|
+
## Install
|
20
|
+
|
21
|
+
You'll need to install [LibreOffice](http://www.libreoffice.org/). Then:
|
22
|
+
|
23
|
+
```bash
|
24
|
+
gem install word-to-markdown
|
25
|
+
```
|
26
|
+
|
27
|
+
## Usage
|
28
|
+
|
29
|
+
```ruby
|
30
|
+
file = WordToMarkdown.new("/path/to/document.docx")
|
31
|
+
=> <WordToMarkdown path="/path/to/document.docx">
|
32
|
+
|
33
|
+
file.to_s
|
34
|
+
=> "# Test\n\n This is a test"
|
35
|
+
|
36
|
+
file.document.tree
|
37
|
+
=> <Nokogiri Document>
|
38
|
+
```
|
39
|
+
|
40
|
+
### Command line usage
|
41
|
+
|
42
|
+
Once you've installed the gem, it's just:
|
43
|
+
|
44
|
+
```
|
45
|
+
$ w2m path/to/document.docx
|
46
|
+
```
|
47
|
+
|
48
|
+
*Outputs the resulting markdown to stdout*
|
49
|
+
|
50
|
+
## Supports
|
51
|
+
|
52
|
+
* Paragraphs
|
53
|
+
* Numbered lists
|
54
|
+
* Unnumbered lists
|
55
|
+
* Nested lists
|
56
|
+
* Italic
|
57
|
+
* Bold
|
58
|
+
* Explicit headings (e.g., selected as "Heading 1" or "Heading 2")
|
59
|
+
* Implicit headings (e.g., text with a larger font size relative to paragraph text)
|
60
|
+
* Images
|
61
|
+
* Tables
|
62
|
+
* Hyperlinks
|
63
|
+
|
64
|
+
## Requirements and configuration
|
65
|
+
|
66
|
+
Word-to-markdown requires `soffice` a command line interface to LibreOffice that works on Linux, Mac, and Windows. To install soffice, see [the LibreOffice documentation](https://www.libreoffice.org/get-help/install-howto/).
|
67
|
+
|
68
|
+
## Testing
|
69
|
+
|
70
|
+
```
|
71
|
+
script/cibuild
|
72
|
+
```
|
73
|
+
|
74
|
+
## Server
|
75
|
+
|
76
|
+
[Word-to-markdown-demo](https://github.com/benbalter/word-to-markdown-demo) contains a lightweight server for converting Word Documents as a service.
|
77
|
+
|
78
|
+
A live version runs at [word-to-markdown.herokuapp.com](http://word-to-markdown.herokuapp.com).
|
data/bin/w2m
CHANGED
@@ -1,13 +1,14 @@
|
|
1
1
|
#!/usr/bin/env ruby
|
2
|
+
# frozen_string_literal: true
|
2
3
|
|
3
4
|
require 'word-to-markdown'
|
4
5
|
|
5
|
-
if ARGV.size != 1 || ARGV[0] ==
|
6
|
-
puts
|
6
|
+
if ARGV.size != 1 || ARGV[0] == '--help'
|
7
|
+
puts 'Usage: bundle exec w2m path/to/document.docx'
|
7
8
|
exit 1
|
8
9
|
end
|
9
10
|
|
10
|
-
if ARGV[0] ==
|
11
|
+
if ARGV[0] == '--version'
|
11
12
|
puts "WordToMarkdown v#{WordToMarkdown::VERSION}"
|
12
13
|
puts "LibreOffice v#{WordToMarkdown.soffice.version}" unless Gem.win_platform?
|
13
14
|
else
|
data/lib/cliver/dependency_ext.rb
CHANGED
@@ -1,16 +1,18 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
1
3
|
require 'sys/proctable'
|
2
4
|
|
3
5
|
module Cliver
|
4
6
|
class Dependency
|
5
|
-
|
6
7
|
include Sys
|
7
8
|
|
8
9
|
# Memoized shortcut for detect
|
9
10
|
# Returns the path to the detected dependency
|
10
11
|
# Raises an error if the dependency was not satisfied
|
11
|
-
def
|
12
|
+
def detected_path
|
12
13
|
@detected_path ||= detect!
|
13
14
|
end
|
15
|
+
alias path detected_path
|
14
16
|
|
15
17
|
# Is the detected dependency currently open?
|
16
18
|
def open?
|
@@ -24,12 +26,12 @@ module Cliver
|
|
24
26
|
def version
|
25
27
|
return @detected_version if defined? @detected_version
|
26
28
|
return if Gem.win_platform?
|
27
|
-
version = installed_versions.find { |p,
|
29
|
+
version = installed_versions.find { |p, _v| p == path }
|
28
30
|
@detected_version = version.nil? ? nil : version[1]
|
29
31
|
end
|
30
32
|
|
31
33
|
def major_version
|
32
|
-
version.split(
|
34
|
+
version.split('.').first if version
|
33
35
|
end
|
34
36
|
end
|
35
37
|
end
|
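Note on the `major_version` change: the added line ends in `if version`, and `version` itself returns early on Windows (`return if Gem.win_platform?`), so it can be nil. The sketch below isolates that guard in a free-standing method. `major_version_of` is a hypothetical name used only for illustration; it is not part of Cliver or of this gem.

```ruby
# Hypothetical free-standing version of the guarded lookup (illustration only).
# Returns the leading component of a dotted version string, or nil when no
# version could be detected.
def major_version_of(version)
  version.split('.').first if version
end

major_version_of('5.4.7.2') # => "5"
major_version_of(nil)       # => nil
```

The callers in this release compare the result against the string `'5'`, so a nil result simply selects the non-5 branch instead of raising.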
data/lib/nokogiri/xml/element.rb
CHANGED
@@ -1,7 +1,8 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
1
3
|
module Nokogiri
|
2
4
|
module XML
|
3
5
|
class Element
|
4
|
-
|
5
6
|
DEFAULT_FONT_SIZE = 12.to_f
|
6
7
|
|
7
8
|
# The node's font size
|
@@ -13,11 +14,11 @@ module Nokogiri
|
|
13
14
|
end
|
14
15
|
|
15
16
|
def bold?
|
16
|
-
styles['font-weight'] && styles['font-weight'] ==
|
17
|
+
styles['font-weight'] && styles['font-weight'] == 'bold'
|
17
18
|
end
|
18
19
|
|
19
20
|
def italic?
|
20
|
-
styles['font-style'] && styles['font-style'] ==
|
21
|
+
styles['font-style'] && styles['font-style'] == 'italic'
|
21
22
|
end
|
22
23
|
end
|
23
24
|
end
|
data/lib/word-to-markdown.rb
CHANGED
@@ -1,3 +1,5 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
1
3
|
require 'descriptive_statistics'
|
2
4
|
require 'reverse_markdown'
|
3
5
|
require 'nokogiri-styles'
|
@@ -16,86 +18,93 @@ require_relative 'nokogiri/xml/element'
|
|
16
18
|
require_relative 'cliver/dependency_ext'
|
17
19
|
|
18
20
|
class WordToMarkdown
|
19
|
-
|
20
21
|
attr_reader :document, :converter
|
21
22
|
|
23
|
+
# Options to be passed to Reverse Markdown
|
22
24
|
REVERSE_MARKDOWN_OPTIONS = {
|
23
|
-
unknown_tags:
|
25
|
+
unknown_tags: :bypass,
|
24
26
|
github_flavored: true
|
25
|
-
}
|
27
|
+
}.freeze
|
26
28
|
|
27
|
-
|
29
|
+
# Minimum version of LibreOffice Required
|
30
|
+
SOFFICE_VERSION_REQUIREMENT = '> 4.0'.freeze
|
28
31
|
|
32
|
+
# Paths to look for LibreOffice, in order of preference
|
29
33
|
PATHS = [
|
30
|
-
|
31
|
-
|
32
|
-
|
33
|
-
|
34
|
-
|
35
|
-
]
|
34
|
+
'*', # Sub'd for ENV["PATH"]
|
35
|
+
'~/Applications/LibreOffice.app/Contents/MacOS',
|
36
|
+
'/Applications/LibreOffice.app/Contents/MacOS',
|
37
|
+
'/Program Files/LibreOffice 5/program',
|
38
|
+
'/Program Files (x86)/LibreOffice 4/program'
|
39
|
+
].freeze
|
36
40
|
|
37
41
|
# Create a new WordToMarkdown object
|
38
42
|
#
|
39
|
-
#
|
40
|
-
#
|
41
|
-
#
|
43
|
+
# @param path [string] Path to the Word document
|
44
|
+
# @param tmpdir [string] Path to a working directory to use
|
45
|
+
# @return [WordToMarkdown] WordToMarkdown object with the converted document
|
42
46
|
def initialize(path, tmpdir = nil)
|
43
47
|
@document = WordToMarkdown::Document.new path, tmpdir
|
44
48
|
@converter = WordToMarkdown::Converter.new @document
|
45
49
|
converter.convert!
|
46
50
|
end
|
47
51
|
|
48
|
-
|
49
|
-
|
50
|
-
|
51
|
-
|
52
|
-
logger.debug output
|
53
|
-
raise "Command `#{soffice.path} #{args.join(" ")}` failed: #{output}" if status.exitstatus != 0
|
54
|
-
output
|
52
|
+
# Helper method to return the document body, as markdown
|
53
|
+
# @return [string] the document body, as markdown
|
54
|
+
def to_s
|
55
|
+
document.to_s
|
55
56
|
end
|
56
57
|
|
57
|
-
|
58
|
-
|
59
|
-
|
60
|
-
|
61
|
-
|
62
|
-
|
63
|
-
|
64
|
-
# open - is the dependency currently open/running?
|
65
|
-
def self.soffice
|
66
|
-
@@soffice_dependency ||= Cliver::Dependency.new("soffice", *soffice_dependency_args)
|
67
|
-
end
|
58
|
+
class << self
|
59
|
+
# Run an soffice command
|
60
|
+
#
|
61
|
+
# @param args [string] one or more arguments to pass to the sofice command
|
62
|
+
# @return [string] the command output
|
63
|
+
def run_command(*args)
|
64
|
+
raise 'LibreOffice already running' if soffice.open?
|
68
65
|
|
69
|
-
|
70
|
-
|
71
|
-
|
72
|
-
|
73
|
-
logger
|
66
|
+
output, status = Open3.capture2e(soffice.path, *args)
|
67
|
+
logger.debug output
|
68
|
+
raise "Command `#{soffice.path} #{args.join(' ')}` failed: #{output}" if status.exitstatus != 0
|
69
|
+
output
|
74
70
|
end
|
75
|
-
end
|
76
71
|
|
77
|
-
|
78
|
-
|
79
|
-
|
80
|
-
|
72
|
+
# Returns a Cliver::Dependency object representing our soffice dependency
|
73
|
+
#
|
74
|
+
# Attempts to resolve by looking at PATH followed by paths in the PATHS constant
|
75
|
+
#
|
76
|
+
# Methods used internally:
|
77
|
+
# path - returns the resolved path. Raises an error if not satisfied
|
78
|
+
# version - returns the resolved version
|
79
|
+
# open - is the dependency currently open/running?
|
80
|
+
# @return Cliver::Dependency instance
|
81
|
+
def soffice
|
82
|
+
@soffice ||= Cliver::Dependency.new('soffice', *soffice_dependency_args)
|
83
|
+
end
|
81
84
|
|
82
|
-
|
83
|
-
|
84
|
-
|
85
|
+
# @return Logger instance
|
86
|
+
def logger
|
87
|
+
@logger ||= begin
|
88
|
+
logger = Logger.new(STDOUT)
|
89
|
+
logger.level = Logger::ERROR unless ENV['DEBUG']
|
90
|
+
logger
|
91
|
+
end
|
92
|
+
end
|
85
93
|
|
86
|
-
|
94
|
+
private
|
87
95
|
|
88
|
-
|
89
|
-
|
90
|
-
|
91
|
-
|
92
|
-
|
93
|
-
|
94
|
-
|
95
|
-
|
96
|
-
|
97
|
-
|
98
|
-
|
96
|
+
# Workaround for two upstream bugs:
|
97
|
+
# 1. `soffice.exe --version` on windows opens a popup and retuns a null string when manually closed
|
98
|
+
# 2. Even if the second argument to Cliver is nil, Cliver thinks there's a requirement
|
99
|
+
# and will shell out to `soffice.exe --version`
|
100
|
+
# In order to support Windows, don't pass *any* version requirement to Cliver
|
101
|
+
def soffice_dependency_args
|
102
|
+
args = [path: PATHS.join(File::PATH_SEPARATOR)]
|
103
|
+
if Gem.win_platform?
|
104
|
+
args
|
105
|
+
else
|
106
|
+
args.unshift SOFFICE_VERSION_REQUIREMENT
|
107
|
+
end
|
99
108
|
end
|
100
109
|
end
|
101
110
|
end
|
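Note on `run_command`: the added method shells out with `Open3.capture2e`, which merges stdout and stderr into one string and returns it together with the process status. The sketch below shows the same capture-then-check shape with the executable passed in as a parameter, so it can be tried without LibreOffice installed. `run_checked` is a hypothetical helper, not part of the gem, and it omits the logger call and the "already running" check.

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical helper (illustration only): run a command, return its combined
# stdout/stderr, and raise if the exit status is not zero.
def run_checked(executable, *args)
  output, status = Open3.capture2e(executable, *args)
  raise "Command `#{executable} #{args.join(' ')}` failed: #{output}" if status.exitstatus != 0

  output
end

# Uses the running Ruby interpreter as a stand-in for soffice.
run_checked(RbConfig.ruby, '-e', 'print "ok"') # => "ok"
```

Passing the arguments as separate strings, as the diff does, avoids going through a shell, so paths containing spaces need no quoting.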
data/lib/word-to-markdown/converter.rb
CHANGED
@@ -1,18 +1,29 @@
|
|
1
|
-
#
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
2
3
|
class WordToMarkdown
|
3
4
|
class Converter
|
4
|
-
|
5
5
|
attr_reader :document
|
6
6
|
|
7
|
-
|
8
|
-
|
7
|
+
# Number of headings to guess, e.g., h6
|
8
|
+
HEADING_DEPTH = 6
|
9
|
+
|
10
|
+
# Percentile step for eaceh eheading
|
11
|
+
HEADING_STEP = 100 / HEADING_DEPTH
|
12
|
+
|
13
|
+
# Minimum heading size
|
9
14
|
MIN_HEADING_SIZE = 20
|
10
|
-
UNICODE_BULLETS = ["○", "o", "●", "\u2022", "\\p{C}"]
|
11
15
|
|
16
|
+
# Unicode bullets to strip when processing
|
17
|
+
UNICODE_BULLETS = ['○', 'o', '●', "\u2022", '\\p{C}'].freeze
|
18
|
+
|
19
|
+
# @param document [WordToMarkdown::Document] The document to convert
|
12
20
|
def initialize(document)
|
13
21
|
@document = document
|
14
22
|
end
|
15
23
|
|
24
|
+
# Convert the document
|
25
|
+
#
|
26
|
+
# Note: this action is destructive!
|
16
27
|
def convert!
|
17
28
|
# Fonts and headings
|
18
29
|
semanticize_font_styles!
|
@@ -29,22 +40,22 @@ class WordToMarkdown
|
|
29
40
|
remove_numbering_from_list_items!
|
30
41
|
end
|
31
42
|
|
32
|
-
#
|
43
|
+
# @return [Array<Nokogiri::Node>] Return an array of Nokogiri Nodes that are implicit headings
|
33
44
|
def implicit_headings
|
34
45
|
@implicit_headings ||= begin
|
35
46
|
headings = []
|
36
|
-
@document.tree.css(
|
47
|
+
@document.tree.css('[style]').each do |element|
|
37
48
|
headings.push element unless element.font_size.nil? || element.font_size < MIN_HEADING_SIZE
|
38
49
|
end
|
39
50
|
headings
|
40
51
|
end
|
41
52
|
end
|
42
53
|
|
43
|
-
#
|
54
|
+
# @return [Array<Integer>] An array of font-sizes for implicit headings in the document
|
44
55
|
def font_sizes
|
45
56
|
@font_sizes ||= begin
|
46
57
|
sizes = []
|
47
|
-
@document.tree.css(
|
58
|
+
@document.tree.css('[style]').each do |element|
|
48
59
|
sizes.push element.font_size.round(-1) unless element.font_size.nil?
|
49
60
|
end
|
50
61
|
sizes.uniq.sort
|
@@ -53,11 +64,10 @@ class WordToMarkdown
|
|
53
64
|
|
54
65
|
# Given a Nokogiri node, guess what heading it represents, if any
|
55
66
|
#
|
56
|
-
# node
|
57
|
-
#
|
58
|
-
# retuns the heading tag (e.g., H1), or nil
|
67
|
+
# @param node [Nokigiri::Node] the nokigiri node
|
68
|
+
# @return [String, nil] the heading tag (e.g., H1), or nil
|
59
69
|
def guess_heading(node)
|
60
|
-
return nil if node.font_size
|
70
|
+
return nil if node.font_size.nil?
|
61
71
|
[*1...HEADING_DEPTH].each do |heading|
|
62
72
|
return "h#{heading}" if node.font_size >= h(heading)
|
63
73
|
end
|
@@ -67,51 +77,58 @@ class WordToMarkdown
|
|
67
77
|
# Minimum font size required for a given heading
|
68
78
|
# e.g., H(2) would represent the minimum font size of an implicit h2
|
69
79
|
#
|
70
|
-
#
|
80
|
+
# @param num [Integer] the heading number, e.g., 1, 2
|
71
81
|
#
|
72
|
-
#
|
73
|
-
def h(
|
74
|
-
font_sizes.percentile
|
82
|
+
# @return [Integer] the minimum font size
|
83
|
+
def h(num)
|
84
|
+
font_sizes.percentile(((HEADING_DEPTH - 1) - num) * HEADING_STEP)
|
75
85
|
end
|
76
86
|
|
87
|
+
# Convert span-based font styles to `strong`s and `em`s
|
77
88
|
def semanticize_font_styles!
|
78
|
-
@document.tree.css(
|
89
|
+
@document.tree.css('span').each do |node|
|
79
90
|
if node.bold?
|
80
|
-
node.node_name =
|
91
|
+
node.node_name = 'strong'
|
81
92
|
elsif node.italic?
|
82
|
-
node.node_name =
|
93
|
+
node.node_name = 'em'
|
83
94
|
end
|
84
95
|
end
|
85
96
|
end
|
86
97
|
|
98
|
+
# Remove top-level paragraphs from table cells
|
87
99
|
def remove_paragraphs_from_tables!
|
88
|
-
@document.tree.search(
|
100
|
+
@document.tree.search('td p').each { |node| node.node_name = 'span' }
|
89
101
|
end
|
90
102
|
|
103
|
+
# Remove top-level paragraphs from list items
|
91
104
|
def remove_paragraphs_from_list_items!
|
92
|
-
@document.tree.search(
|
105
|
+
@document.tree.search('li p').each { |node| node.node_name = 'span' }
|
93
106
|
end
|
94
107
|
|
108
|
+
# Remove prepended unicode bullets from list items
|
95
109
|
def remove_unicode_bullets_from_list_items!
|
96
|
-
path = WordToMarkdown.soffice.major_version ==
|
110
|
+
path = WordToMarkdown.soffice.major_version == '5' ? 'li span span' : 'li span'
|
97
111
|
@document.tree.search(path).each do |span|
|
98
|
-
span.inner_html = span.inner_html.gsub
|
112
|
+
span.inner_html = span.inner_html.gsub(/^([#{UNICODE_BULLETS.join("")}]+)/, '')
|
99
113
|
end
|
100
114
|
end
|
101
115
|
|
116
|
+
# Remove prepended numbers from list items
|
102
117
|
def remove_numbering_from_list_items!
|
103
|
-
path = WordToMarkdown.soffice.major_version ==
|
118
|
+
path = WordToMarkdown.soffice.major_version == '5' ? 'li span span' : 'li span'
|
104
119
|
@document.tree.search(path).each do |span|
|
105
|
-
span.inner_html = span.inner_html.gsub
|
120
|
+
span.inner_html = span.inner_html.gsub(/^[a-zA-Z0-9]+\./m, '')
|
106
121
|
end
|
107
122
|
end
|
108
123
|
|
124
|
+
# Remvoe whitespace from list items
|
109
125
|
def remove_whitespace_from_list_items!
|
110
|
-
@document.tree.search(
|
126
|
+
@document.tree.search('li span').each { |span| span.inner_html.strip! }
|
111
127
|
end
|
112
128
|
|
129
|
+
# Convert table headers to `th`s2
|
113
130
|
def semanticize_table_headers!
|
114
|
-
@document.tree.search(
|
131
|
+
@document.tree.search('table tr:first td').each { |node| node.node_name = 'th' }
|
115
132
|
end
|
116
133
|
|
117
134
|
# Try to guess heading where implicit bassed on font size
|
@@ -121,6 +138,5 @@ class WordToMarkdown
|
|
121
138
|
element.node_name = heading unless heading.nil?
|
122
139
|
end
|
123
140
|
end
|
124
|
-
|
125
141
|
end
|
126
142
|
end
|
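Note on the heading thresholds: `h(num)` passes `((HEADING_DEPTH - 1) - num) * HEADING_STEP` to `percentile`, and `HEADING_STEP` is computed with integer division. The sketch below only evaluates that arithmetic to show which percentile each heading level maps to. It does not reproduce the `percentile` call itself, which comes from the descriptive_statistics gem; `heading_percentile` is a hypothetical name.

```ruby
HEADING_DEPTH = 6
HEADING_STEP = 100 / HEADING_DEPTH # Integer division: 16, not 16.67

# Hypothetical helper (illustration only): the percentile argument used for
# a given heading level.
def heading_percentile(num)
  ((HEADING_DEPTH - 1) - num) * HEADING_STEP
end

# guess_heading iterates over 1...HEADING_DEPTH (exclusive), i.e. h1 to h5.
(1...HEADING_DEPTH).map { |n| [n, heading_percentile(n)] }
# => [[1, 64], [2, 48], [3, 32], [4, 16], [5, 0]]
```

Under this arithmetic an implicit h1 needs a font size at or above the 64th percentile of the document's distinct font sizes, and h5 accepts anything at or above the smallest.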
data/lib/word-to-markdown/document.rb
CHANGED
@@ -1,50 +1,54 @@
|
|
1
|
-
#
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
2
3
|
class WordToMarkdown
|
3
4
|
class Document
|
4
5
|
class NotFoundError < StandardError; end
|
5
|
-
class
|
6
|
+
class ConversionError < StandardError; end
|
6
7
|
|
7
|
-
attr_reader :path, :
|
8
|
+
attr_reader :path, :tmpdir
|
8
9
|
|
10
|
+
# @param path [string] Path to the Word document
|
11
|
+
# @param tmpdir [string] Path to a working directory to use
|
9
12
|
def initialize(path, tmpdir = nil)
|
10
13
|
@path = File.expand_path path, Dir.pwd
|
11
14
|
@tmpdir = tmpdir || Dir.mktmpdir
|
12
15
|
raise NotFoundError, "File #{@path} does not exist" unless File.exist?(@path)
|
13
16
|
end
|
14
17
|
|
18
|
+
# @return [String] the document's extension
|
15
19
|
def extension
|
16
20
|
File.extname path
|
17
21
|
end
|
18
22
|
|
23
|
+
# @return [Nokigiri::Document]
|
19
24
|
def tree
|
20
25
|
@tree ||= begin
|
21
26
|
tree = Nokogiri::HTML(normalized_html)
|
22
|
-
tree.css(
|
27
|
+
tree.css('title').remove
|
23
28
|
tree
|
24
29
|
end
|
25
30
|
end
|
26
31
|
|
27
|
-
#
|
32
|
+
# @return [String] the html representation of the document
|
28
33
|
def html
|
29
|
-
tree.to_html.gsub("</li>\n",
|
34
|
+
tree.to_html.gsub("</li>\n", '</li>')
|
30
35
|
end
|
31
36
|
|
32
|
-
#
|
33
|
-
def
|
37
|
+
# @return [String] the markdown representation of the document
|
38
|
+
def markdown
|
34
39
|
@markdown ||= scrub_whitespace(ReverseMarkdown.convert(html, WordToMarkdown::REVERSE_MARKDOWN_OPTIONS))
|
35
40
|
end
|
41
|
+
alias to_s markdown
|
36
42
|
|
37
43
|
# Determine the document encoding
|
38
44
|
#
|
39
|
-
#
|
40
|
-
#
|
41
|
-
# Returns the encoding, defaulting to "UTF-8"
|
45
|
+
# @return [String] the encoding, defaulting to "UTF-8"
|
42
46
|
def encoding
|
43
|
-
match = raw_html.encode(
|
47
|
+
match = raw_html.encode('UTF-8', invalid: :replace, replace: '').match(/charset=([^\"]+)/)
|
44
48
|
if match
|
45
|
-
match[1].sub(
|
49
|
+
match[1].sub('macintosh', 'MacRoman')
|
46
50
|
else
|
47
|
-
|
51
|
+
'UTF-8'
|
48
52
|
end
|
49
53
|
end
|
50
54
|
|
@@ -52,55 +56,57 @@ class WordToMarkdown
|
|
52
56
|
|
53
57
|
# Perform pre-processing normalization
|
54
58
|
#
|
55
|
-
#
|
56
|
-
#
|
57
|
-
# Returns the normalized html
|
59
|
+
# @return [String] the normalized html
|
58
60
|
def normalized_html
|
59
|
-
html = raw_html.force_encoding(encoding)
|
60
|
-
html = html.encode(
|
61
|
-
html = Premailer.new(html, :
|
62
|
-
html.gsub!
|
63
|
-
html.gsub!
|
64
|
-
html.gsub!
|
65
|
-
html.gsub!
|
61
|
+
html = raw_html.dup.force_encoding(encoding)
|
62
|
+
html = html.encode('UTF-8', invalid: :replace, replace: '')
|
63
|
+
html = Premailer.new(html, with_html_string: true, input_encoding: 'UTF-8').to_inline_css
|
64
|
+
html.gsub!(/\n|\r/, ' ') # Remove linebreaks
|
65
|
+
html.gsub!(/“|”/, '"') # Straighten curly double quotes
|
66
|
+
html.gsub!(/‘|’/, "'") # Straighten curly single quotes
|
67
|
+
html.gsub!(/>\s+</, '><') # Remove extra whitespace between tags
|
66
68
|
html
|
67
69
|
end
|
68
70
|
|
69
71
|
# Perform post-processing normalization of certain Word quirks
|
70
72
|
#
|
71
|
-
# string
|
73
|
+
# @param string [String] the markdown representation of the document
|
72
74
|
#
|
73
|
-
#
|
75
|
+
# @return [String] the normalized markdown
|
74
76
|
def scrub_whitespace(string)
|
75
|
-
string
|
76
|
-
string.
|
77
|
-
string.sub!(
|
78
|
-
string.
|
79
|
-
string.gsub!(
|
80
|
-
string.gsub!(/\
|
77
|
+
string = string.dup
|
78
|
+
string.gsub!('&nbsp;', ' ') # HTML encoded spaces
|
79
|
+
string.sub!(/\A[[:space:]]+/, '') # document leading whitespace
|
80
|
+
string.sub!(/[[:space:]]+\z/, '') # document trailing whitespace
|
81
|
+
string.gsub!(/([ ]+)$/, '') # line trailing whitespace
|
82
|
+
string.gsub!(/\n\n\n\n/, "\n\n") # Quadruple line breaks
|
83
|
+
string.delete!(' ') # Unicode non-breaking spaces, injected as tabs
|
81
84
|
string
|
82
85
|
end
|
83
86
|
|
87
|
+
# @return [String] the path to the intermediary HTML document
|
84
88
|
def dest_path
|
85
|
-
dest_filename = File.basename(path).gsub(/#{Regexp.escape(extension)}$/,
|
89
|
+
dest_filename = File.basename(path).gsub(/#{Regexp.escape(extension)}$/, '.html')
|
86
90
|
File.expand_path(dest_filename, tmpdir)
|
87
91
|
end
|
88
92
|
|
93
|
+
# @return [String] the unnormalized HTML representation
|
89
94
|
def raw_html
|
90
95
|
@raw_html ||= begin
|
91
|
-
WordToMarkdown
|
92
|
-
raise
|
96
|
+
WordToMarkdown.run_command '--headless', '--convert-to', filter, path, '--outdir', tmpdir
|
97
|
+
raise ConversionError, "Failed to convert #{path}" unless File.exist?(dest_path)
|
93
98
|
html = File.read dest_path
|
94
99
|
File.delete dest_path
|
95
100
|
html
|
96
101
|
end
|
97
102
|
end
|
98
103
|
|
104
|
+
# @return [String] the LibreOffice filter to use for conversion
|
99
105
|
def filter
|
100
|
-
if WordToMarkdown.soffice.major_version ==
|
101
|
-
|
106
|
+
if WordToMarkdown.soffice.major_version == '5'
|
107
|
+
'html:XHTML Writer File:UTF8'
|
102
108
|
else
|
103
|
-
|
109
|
+
'html'
|
104
110
|
end
|
105
111
|
end
|
106
112
|
end
|
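Note on `scrub_whitespace`: the new version starts with `string = string.dup` and then applies only in-place (bang) methods to the copy. The sketch below is adapted from the added lines as a free-standing method to show the effect on a small input. Where the listing's string literals are ambiguous, the sketch assumes the `&nbsp;` entity and a U+00A0 non-breaking space, following the inline comments; treat both as assumptions.

```ruby
# Free-standing adaptation of the added lines (illustration only).
def scrub_whitespace(string)
  string = string.dup                      # work on a copy; the argument may be frozen
  string.gsub!('&nbsp;', ' ')              # assumed literal: HTML encoded spaces
  string.sub!(/\A[[:space:]]+/, '')        # document leading whitespace
  string.sub!(/[[:space:]]+\z/, '')        # document trailing whitespace
  string.gsub!(/([ ]+)$/, '')              # line trailing whitespace
  string.gsub!(/\n\n\n\n/, "\n\n")         # quadruple line breaks
  string.delete!("\u00A0")                 # assumed literal: non-breaking space
  string
end

input = "\n\n# Title  \n\n\n\nBody&nbsp;text\u00A0\n".freeze
scrub_whitespace(input) # => "# Title\n\nBody text"
```

Because every step mutates the copy, a frozen argument is left untouched and no `FrozenError` is raised.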
metadata
CHANGED
@@ -1,29 +1,29 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: word-to-markdown
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.1.
|
4
|
+
version: 1.1.8
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Ben Balter
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date:
|
11
|
+
date: 2018-08-01 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
|
-
name:
|
14
|
+
name: cliver
|
15
15
|
requirement: !ruby/object:Gem::Requirement
|
16
16
|
requirements:
|
17
17
|
- - "~>"
|
18
18
|
- !ruby/object:Gem::Version
|
19
|
-
version: '0.
|
19
|
+
version: '0.3'
|
20
20
|
type: :runtime
|
21
21
|
prerelease: false
|
22
22
|
version_requirements: !ruby/object:Gem::Requirement
|
23
23
|
requirements:
|
24
24
|
- - "~>"
|
25
25
|
- !ruby/object:Gem::Version
|
26
|
-
version: '0.
|
26
|
+
version: '0.3'
|
27
27
|
- !ruby/object:Gem::Dependency
|
28
28
|
name: descriptive_statistics
|
29
29
|
requirement: !ruby/object:Gem::Requirement
|
@@ -39,103 +39,103 @@ dependencies:
|
|
39
39
|
- !ruby/object:Gem::Version
|
40
40
|
version: '2.5'
|
41
41
|
- !ruby/object:Gem::Dependency
|
42
|
-
name:
|
42
|
+
name: nokogiri-styles
|
43
43
|
requirement: !ruby/object:Gem::Requirement
|
44
44
|
requirements:
|
45
45
|
- - "~>"
|
46
46
|
- !ruby/object:Gem::Version
|
47
|
-
version: '1
|
47
|
+
version: '0.1'
|
48
48
|
type: :runtime
|
49
49
|
prerelease: false
|
50
50
|
version_requirements: !ruby/object:Gem::Requirement
|
51
51
|
requirements:
|
52
52
|
- - "~>"
|
53
53
|
- !ruby/object:Gem::Version
|
54
|
-
version: '1
|
54
|
+
version: '0.1'
|
55
55
|
- !ruby/object:Gem::Dependency
|
56
|
-
name:
|
56
|
+
name: premailer
|
57
57
|
requirement: !ruby/object:Gem::Requirement
|
58
58
|
requirements:
|
59
59
|
- - "~>"
|
60
60
|
- !ruby/object:Gem::Version
|
61
|
-
version: '
|
61
|
+
version: '1.8'
|
62
62
|
type: :runtime
|
63
63
|
prerelease: false
|
64
64
|
version_requirements: !ruby/object:Gem::Requirement
|
65
65
|
requirements:
|
66
66
|
- - "~>"
|
67
67
|
- !ruby/object:Gem::Version
|
68
|
-
version: '
|
68
|
+
version: '1.8'
|
69
69
|
- !ruby/object:Gem::Dependency
|
70
|
-
name:
|
70
|
+
name: reverse_markdown
|
71
71
|
requirement: !ruby/object:Gem::Requirement
|
72
72
|
requirements:
|
73
73
|
- - "~>"
|
74
74
|
- !ruby/object:Gem::Version
|
75
|
-
version: '0
|
75
|
+
version: '1.0'
|
76
76
|
type: :runtime
|
77
77
|
prerelease: false
|
78
78
|
version_requirements: !ruby/object:Gem::Requirement
|
79
79
|
requirements:
|
80
80
|
- - "~>"
|
81
81
|
- !ruby/object:Gem::Version
|
82
|
-
version: '0
|
82
|
+
version: '1.0'
|
83
83
|
- !ruby/object:Gem::Dependency
|
84
|
-
name:
|
84
|
+
name: sys-proctable
|
85
85
|
requirement: !ruby/object:Gem::Requirement
|
86
86
|
requirements:
|
87
87
|
- - "~>"
|
88
88
|
- !ruby/object:Gem::Version
|
89
|
-
version: '0
|
89
|
+
version: '1.0'
|
90
90
|
type: :runtime
|
91
91
|
prerelease: false
|
92
92
|
version_requirements: !ruby/object:Gem::Requirement
|
93
93
|
requirements:
|
94
94
|
- - "~>"
|
95
95
|
- !ruby/object:Gem::Version
|
96
|
-
version: '0
|
96
|
+
version: '1.0'
|
97
97
|
- !ruby/object:Gem::Dependency
|
98
|
-
name:
|
98
|
+
name: bundler
|
99
99
|
requirement: !ruby/object:Gem::Requirement
|
100
100
|
requirements:
|
101
101
|
- - "~>"
|
102
102
|
- !ruby/object:Gem::Version
|
103
|
-
version: '
|
103
|
+
version: '1.6'
|
104
104
|
type: :development
|
105
105
|
prerelease: false
|
106
106
|
version_requirements: !ruby/object:Gem::Requirement
|
107
107
|
requirements:
|
108
108
|
- - "~>"
|
109
109
|
- !ruby/object:Gem::Version
|
110
|
-
version: '
|
110
|
+
version: '1.6'
|
111
111
|
- !ruby/object:Gem::Dependency
|
112
|
-
name:
|
112
|
+
name: minitest
|
113
113
|
requirement: !ruby/object:Gem::Requirement
|
114
114
|
requirements:
|
115
115
|
- - "~>"
|
116
116
|
- !ruby/object:Gem::Version
|
117
|
-
version: '
|
117
|
+
version: '5.0'
|
118
118
|
type: :development
|
119
119
|
prerelease: false
|
120
120
|
version_requirements: !ruby/object:Gem::Requirement
|
121
121
|
requirements:
|
122
122
|
- - "~>"
|
123
123
|
- !ruby/object:Gem::Version
|
124
|
-
version: '
|
124
|
+
version: '5.0'
|
125
125
|
- !ruby/object:Gem::Dependency
|
126
|
-
name:
|
126
|
+
name: mocha
|
127
127
|
requirement: !ruby/object:Gem::Requirement
|
128
128
|
requirements:
|
129
129
|
- - "~>"
|
130
130
|
- !ruby/object:Gem::Version
|
131
|
-
version: '1.
|
131
|
+
version: '1.1'
|
132
132
|
type: :development
|
133
133
|
prerelease: false
|
134
134
|
version_requirements: !ruby/object:Gem::Requirement
|
135
135
|
requirements:
|
136
136
|
- - "~>"
|
137
137
|
- !ruby/object:Gem::Version
|
138
|
-
version: '1.
|
138
|
+
version: '1.1'
|
139
139
|
- !ruby/object:Gem::Dependency
|
140
140
|
name: pry
|
141
141
|
requirement: !ruby/object:Gem::Requirement
|
@@ -151,33 +151,47 @@ dependencies:
|
|
151
151
|
- !ruby/object:Gem::Version
|
152
152
|
version: '0.10'
|
153
153
|
- !ruby/object:Gem::Dependency
|
154
|
-
name:
|
154
|
+
name: rake
|
155
155
|
requirement: !ruby/object:Gem::Requirement
|
156
156
|
requirements:
|
157
157
|
- - "~>"
|
158
158
|
- !ruby/object:Gem::Version
|
159
|
-
version: '
|
159
|
+
version: '10.4'
|
160
160
|
type: :development
|
161
161
|
prerelease: false
|
162
162
|
version_requirements: !ruby/object:Gem::Requirement
|
163
163
|
requirements:
|
164
164
|
- - "~>"
|
165
165
|
- !ruby/object:Gem::Version
|
166
|
-
version: '
|
166
|
+
version: '10.4'
|
167
167
|
- !ruby/object:Gem::Dependency
|
168
|
-
name:
|
168
|
+
name: rubocop
|
169
169
|
requirement: !ruby/object:Gem::Requirement
|
170
170
|
requirements:
|
171
171
|
- - "~>"
|
172
172
|
- !ruby/object:Gem::Version
|
173
|
-
version: '
|
173
|
+
version: '0.49'
|
174
174
|
type: :development
|
175
175
|
prerelease: false
|
176
176
|
version_requirements: !ruby/object:Gem::Requirement
|
177
177
|
requirements:
|
178
178
|
- - "~>"
|
179
179
|
- !ruby/object:Gem::Version
|
180
|
-
version: '
|
180
|
+
version: '0.49'
|
181
|
+
- !ruby/object:Gem::Dependency
|
182
|
+
name: shoulda
|
183
|
+
requirement: !ruby/object:Gem::Requirement
|
184
|
+
requirements:
|
185
|
+
- - "~>"
|
186
|
+
- !ruby/object:Gem::Version
|
187
|
+
version: '3.5'
|
188
|
+
type: :development
|
189
|
+
prerelease: false
|
190
|
+
version_requirements: !ruby/object:Gem::Requirement
|
191
|
+
requirements:
|
192
|
+
- - "~>"
|
193
|
+
- !ruby/object:Gem::Version
|
194
|
+
version: '3.5'
|
181
195
|
description: Ruby Gem to convert Word documents to markdown.
|
182
196
|
email: ben.balter@github.com
|
183
197
|
executables:
|
@@ -185,6 +199,8 @@ executables:
|
|
185
199
|
extensions: []
|
186
200
|
extra_rdoc_files: []
|
187
201
|
files:
|
202
|
+
- LICENSE.md
|
203
|
+
- README.md
|
188
204
|
- bin/w2m
|
189
205
|
- lib/cliver/dependency_ext.rb
|
190
206
|
- lib/nokogiri/xml/element.rb
|
@@ -212,7 +228,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
212
228
|
version: '0'
|
213
229
|
requirements: []
|
214
230
|
rubyforge_project:
|
215
|
-
rubygems_version: 2.
|
231
|
+
rubygems_version: 2.7.6
|
216
232
|
signing_key:
|
217
233
|
specification_version: 4
|
218
234
|
summary: Ruby Gem to convert Word documents to markdown
|
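Note on the dependency constraints: every requirement in the metadata uses the pessimistic operator `~>`. The sketch below uses RubyGems' own `Gem::Requirement` and `Gem::Version` classes to show what two of the listed constraints accept. The candidate version numbers are arbitrary test values, not versions taken from this diff.

```ruby
require 'rubygems' # normally loaded already; explicit here for clarity

cliver = Gem::Requirement.new('~> 0.3')
cliver.satisfied_by?(Gem::Version.new('0.3.7')) # => true
cliver.satisfied_by?(Gem::Version.new('1.0'))   # => false

rake = Gem::Requirement.new('~> 10.4')
rake.satisfied_by?(Gem::Version.new('10.5.0'))  # => true
rake.satisfied_by?(Gem::Version.new('11.0'))    # => false
```

With two segments, `~> X.Y` allows any release from X.Y up to, but not including, the next major version.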