word-to-markdown 1.1.7 → 1.1.9
- checksums.yaml +5 -5
- data/LICENSE.md +21 -0
- data/README.md +94 -0
- data/bin/w2m +4 -3
- data/lib/cliver/dependency_ext.rb +9 -6
- data/lib/nokogiri/xml/element.rb +4 -3
- data/lib/word-to-markdown/converter.rb +47 -30
- data/lib/word-to-markdown/document.rb +47 -38
- data/lib/word-to-markdown/version.rb +3 -1
- data/lib/word-to-markdown.rb +65 -55
- metadata +76 -40
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
|
-
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
2
|
+
SHA256:
|
3
|
+
metadata.gz: f2b816a2ad9402eb1c45f74f806482502608337ac9127d072647bd4f1f97cd39
|
4
|
+
data.tar.gz: ff35f5c9f2e89c0e781ea864552f6f20bafd23083301c4d72c5c95c7aae4f38c
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: e74c055913709cd0fa871ba95cf22b22a86089bb18fd9e88cd27d1dec3ec3b4927708fc9a591ba278b3e5cdc98150178cd17547ba76579702893645554370c83
|
7
|
+
data.tar.gz: 9b01d816e8f95d43fb19dc828213e3db4f7b0897957c6e3eba42b4f4fe888d3564db3b3730b416f9eb10837ca4911fd7083c73b3556c3313dd12cefe2f8ae597
|
data/LICENSE.md
ADDED
@@ -0,0 +1,21 @@
|
|
1
|
+
The MIT License (MIT)
|
2
|
+
|
3
|
+
Copyright (c) 2014, Ben Balter
|
4
|
+
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
7
|
+
in the Software without restriction, including without limitation the rights
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
10
|
+
furnished to do so, subject to the following conditions:
|
11
|
+
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
13
|
+
copies or substantial portions of the Software.
|
14
|
+
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
21
|
+
SOFTWARE.
|
data/README.md
ADDED
@@ -0,0 +1,94 @@
|
|
1
|
+
# Word to Markdown converter
|
2
|
+
|
3
|
+
A Ruby gem to liberate content from [the jail that is Word documents](http://ben.balter.com/2012/10/19/we-ve-been-trained-to-make-paper/#jailbreaking-content)
|
4
|
+
|
5
|
+
[](https://github.com/benbalter/word-to-markdown/actions/workflows/ci.yml) [](http://badge.fury.io/rb/word-to-markdown) [](http://inch-ci.org/github/benbalter/word-to-markdown) [](https://ci.appveyor.com/project/benbalter/word-to-markdown/branch/master) [](https://codeclimate.com/github/benbalter/word-to-markdown/maintainability) [](https://codeclimate.com/github/benbalter/word-to-markdown/test_coverage)
|
6
|
+
|
7
|
+
## The problem
|
8
|
+
|
9
|
+
> Our default content publishing workflow is terribly broken. [We've all been trained to make paper](http://ben.balter.com/2012/10/19/we-ve-been-trained-to-make-paper/), yet today, content authored once is more commonly consumed in multiple formats, and rarely, if ever, does it embody physical form. Put another way, our go-to content authoring workflow remains relatively unchanged since it was conceived in the early 80s.
|
10
|
+
>
|
11
|
+
> I'm asked regularly by government employees — knowledge workers who fire up a desktop word processor as the first step to any project — for an automated pipeline to convert Microsoft Word documents to [Markdown](http://guides.github.com/overviews/mastering-markdown/), the *lingua franca* of the internet, but as my recent foray into building [just such a converter](http://word-to-markdown.herokuapp.com/) proves, it's not that simple.
|
12
|
+
>
|
13
|
+
> Markdown isn't just an alternative format. Markdown forces you to write for the web.
|
14
|
+
|
15
|
+
**[Read more](http://ben.balter.com/2014/03/31/word-versus-markdown-more-than-mere-semantics/)**
|
16
|
+
|
17
|
+
## Just want to convert a Microsoft Word (or Google) document to Markdown?
|
18
|
+
|
19
|
+
You can use this **[hosted service](https://word2md.com/)** (or check out [its source](https://github.com/benbalter/word-to-markdown-server)).
|
20
|
+
|
21
|
+
## Install
|
22
|
+
|
23
|
+
You'll need to install [LibreOffice](http://www.libreoffice.org/). Then:
|
24
|
+
|
25
|
+
```bash
|
26
|
+
gem install word-to-markdown
|
27
|
+
```
|
28
|
+
|
29
|
+
## Usage
|
30
|
+
|
31
|
+
```ruby
|
32
|
+
file = WordToMarkdown.new("/path/to/document.docx")
|
33
|
+
=> <WordToMarkdown path="/path/to/document.docx">
|
34
|
+
|
35
|
+
file.to_s
|
36
|
+
=> "# Test\n\n This is a test"
|
37
|
+
|
38
|
+
file.document.tree
|
39
|
+
=> <Nokogiri Document>
|
40
|
+
```
|
41
|
+
|
42
|
+
### Command line usage
|
43
|
+
|
44
|
+
Once you've installed the gem, it's just:
|
45
|
+
|
46
|
+
```
|
47
|
+
$ w2m path/to/document.docx
|
48
|
+
```
|
49
|
+
|
50
|
+
*Outputs the resulting markdown to stdout*
|
51
|
+
|
52
|
+
## Supports
|
53
|
+
|
54
|
+
* Paragraphs
|
55
|
+
* Numbered lists
|
56
|
+
* Unnumbered lists
|
57
|
+
* Nested lists
|
58
|
+
* Italic
|
59
|
+
* Bold
|
60
|
+
* Explicit headings (e.g., selected as "Heading 1" or "Heading 2")
|
61
|
+
* Implicit headings (e.g., text with a larger font size relative to paragraph text)
|
62
|
+
* Images
|
63
|
+
* Tables
|
64
|
+
* Hyperlinks
|
65
|
+
|
66
|
+
## Requirements and configuration
|
67
|
+
|
68
|
+
Word-to-markdown requires `soffice`, a command line interface to LibreOffice that works on Linux, Mac, and Windows. To install soffice, see [the LibreOffice documentation](https://www.libreoffice.org/get-help/install-howto/).
|
69
|
+
|
70
|
+
## Testing
|
71
|
+
|
72
|
+
```
|
73
|
+
script/cibuild
|
74
|
+
```
|
75
|
+
|
76
|
+
## Docker
|
77
|
+
|
78
|
+
First, create the `Gemfile.lock` by installing the dependencies:
|
79
|
+
|
80
|
+
```
|
81
|
+
bundle install
|
82
|
+
```
|
83
|
+
|
84
|
+
Everything you need to run the executable locally:
|
85
|
+
|
86
|
+
```
|
87
|
+
docker-compose build
|
88
|
+
docker-compose run --rm app bundle exec w2m --help
|
89
|
+
docker-compose run --rm app bundle exec w2m test/fixtures/em.docx
|
90
|
+
```
|
91
|
+
|
92
|
+
## Hosted service
|
93
|
+
|
94
|
+
[Word-to-markdown-server](https://github.com/benbalter/word-to-markdown-server) contains a lightweight server for converting Word Documents as a service. A live version runs at [word2md.com](https://word2md.com).
|
data/bin/w2m
CHANGED
@@ -1,13 +1,14 @@
|
|
1
1
|
#!/usr/bin/env ruby
|
2
|
+
# frozen_string_literal: true
|
2
3
|
|
3
4
|
require 'word-to-markdown'
|
4
5
|
|
5
|
-
if ARGV.size != 1 || ARGV[0] ==
|
6
|
-
puts
|
6
|
+
if ARGV.size != 1 || ARGV[0] == '--help'
|
7
|
+
puts 'Usage: bundle exec w2m path/to/document.docx'
|
7
8
|
exit 1
|
8
9
|
end
|
9
10
|
|
10
|
-
if ARGV[0] ==
|
11
|
+
if ARGV[0] == '--version'
|
11
12
|
puts "WordToMarkdown v#{WordToMarkdown::VERSION}"
|
12
13
|
puts "LibreOffice v#{WordToMarkdown.soffice.version}" unless Gem.win_platform?
|
13
14
|
else
|
data/lib/cliver/dependency_ext.rb
CHANGED
@@ -1,16 +1,18 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
1
3
|
require 'sys/proctable'
|
2
4
|
|
3
5
|
module Cliver
|
4
6
|
class Dependency
|
5
|
-
|
6
7
|
include Sys
|
7
8
|
|
8
9
|
# Memoized shortcut for detect
|
9
10
|
# Returns the path to the detected dependency
|
10
11
|
# Raises an error if the dependency was not satisfied
|
11
|
-
def
|
12
|
+
def detected_path
|
12
13
|
@detected_path ||= detect!
|
13
14
|
end
|
15
|
+
alias path detected_path
|
14
16
|
|
15
17
|
# Is the detected dependency currently open?
|
16
18
|
def open?
|
@@ -22,14 +24,15 @@ module Cliver
|
|
22
24
|
|
23
25
|
# Returns the version of the resolved dependency
|
24
26
|
def version
|
25
|
-
return @
|
27
|
+
return @version if defined? @version
|
26
28
|
return if Gem.win_platform?
|
27
|
-
|
28
|
-
|
29
|
+
|
30
|
+
version = installed_versions.find { |p, _v| p == path }
|
31
|
+
@version = version.nil? ? nil : version[1]
|
29
32
|
end
|
30
33
|
|
31
34
|
def major_version
|
32
|
-
version
|
35
|
+
version&.split('.')&.first
|
33
36
|
end
|
34
37
|
end
|
35
38
|
end
|
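Note on the `version` change above: the guard `return @version if defined? @version` caches a `nil` result as well, which `||=` would recompute on every call. The sketch below illustrates that pattern only. `VersionProbe` and its path/version pairs are hypothetical stand-ins, not part of Cliver or this gem.

```ruby
# Hypothetical stand-in for a dependency wrapper; not part of Cliver.
class VersionProbe
  attr_reader :lookups

  # installed: array of [path, version] pairs (illustrative data only)
  def initialize(path, installed)
    @path = path
    @installed = installed
    @lookups = 0
  end

  # Memoizes even when the lookup finds nothing, because the guard checks
  # whether the instance variable was ever assigned, not whether it is truthy.
  def version
    return @version if defined? @version

    @lookups += 1
    pair = @installed.find { |candidate, _version| candidate == @path }
    @version = pair.nil? ? nil : pair[1]
  end

  # Safe navigation keeps a missing version from raising NoMethodError.
  def major_version
    version&.split('.')&.first
  end
end

missing = VersionProbe.new('/usr/bin/soffice', [])
2.times { missing.version }
puts missing.lookups # prints 1: the nil result was cached after the first call
```

With `@version ||= ...` in place of the `defined?` guard, the same two calls would perform the lookup twice.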
data/lib/nokogiri/xml/element.rb
CHANGED
@@ -1,7 +1,8 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
1
3
|
module Nokogiri
|
2
4
|
module XML
|
3
5
|
class Element
|
4
|
-
|
5
6
|
DEFAULT_FONT_SIZE = 12.to_f
|
6
7
|
|
7
8
|
# The node's font size
|
@@ -13,11 +14,11 @@ module Nokogiri
|
|
13
14
|
end
|
14
15
|
|
15
16
|
def bold?
|
16
|
-
styles['font-weight'] && styles['font-weight'] ==
|
17
|
+
styles['font-weight'] && styles['font-weight'] == 'bold'
|
17
18
|
end
|
18
19
|
|
19
20
|
def italic?
|
20
|
-
styles['font-style'] && styles['font-style'] ==
|
21
|
+
styles['font-style'] && styles['font-style'] == 'italic'
|
21
22
|
end
|
22
23
|
end
|
23
24
|
end
|
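Note on `bold?` and `italic?` above: both compare a parsed inline style against one exact keyword. The sketch below assumes a simple hand-rolled parser (`parse_styles`, hypothetical) in place of the `nokogiri-styles` gem the real code relies on, so it only approximates how the `styles` hash is produced.

```ruby
# Hypothetical parser for an inline style attribute; the gem itself
# gets this hash from nokogiri-styles rather than from code like this.
def parse_styles(style_attr)
  style_attr.to_s.split(';').each_with_object({}) do |declaration, styles|
    property, value = declaration.split(':', 2)
    next if property.nil? || value.nil?

    styles[property.strip.downcase] = value.strip
  end
end

# Mirrors the comparisons shown in the diff: only the literal keywords match.
def bold?(styles)
  styles['font-weight'] == 'bold'
end

def italic?(styles)
  styles['font-style'] == 'italic'
end

styles = parse_styles('font-weight: bold; font-size: 14pt')
puts bold?(styles)
puts italic?(styles)
```

Under this comparison a numeric weight such as `font-weight: 700` would not count as bold; whether LibreOffice ever emits that form is not something the diff shows.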
data/lib/word-to-markdown/converter.rb
CHANGED
@@ -1,18 +1,29 @@
|
|
1
|
-
#
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
2
3
|
class WordToMarkdown
|
3
4
|
class Converter
|
4
|
-
|
5
5
|
attr_reader :document
|
6
6
|
|
7
|
-
|
8
|
-
|
7
|
+
# Number of headings to guess, e.g., h6
|
8
|
+
HEADING_DEPTH = 6
|
9
|
+
|
10
|
+
# Percentile step for each heading
|
11
|
+
HEADING_STEP = 100 / HEADING_DEPTH
|
12
|
+
|
13
|
+
# Minimum heading size
|
9
14
|
MIN_HEADING_SIZE = 20
|
10
|
-
UNICODE_BULLETS = ["○", "o", "●", "\u2022", "\\p{C}"]
|
11
15
|
|
16
|
+
# Unicode bullets to strip when processing
|
17
|
+
UNICODE_BULLETS = ['○', 'o', '●', "\u2022", '\\p{C}'].freeze
|
18
|
+
|
19
|
+
# @param document [WordToMarkdown::Document] The document to convert
|
12
20
|
def initialize(document)
|
13
21
|
@document = document
|
14
22
|
end
|
15
23
|
|
24
|
+
# Convert the document
|
25
|
+
#
|
26
|
+
# Note: this action is destructive!
|
16
27
|
def convert!
|
17
28
|
# Fonts and headings
|
18
29
|
semanticize_font_styles!
|
@@ -29,35 +40,35 @@ class WordToMarkdown
|
|
29
40
|
remove_numbering_from_list_items!
|
30
41
|
end
|
31
42
|
|
32
|
-
#
|
43
|
+
# @return [Array<Nokogiri::Node>] Return an array of Nokogiri Nodes that are implicit headings
|
33
44
|
def implicit_headings
|
34
45
|
@implicit_headings ||= begin
|
35
46
|
headings = []
|
36
|
-
@document.tree.css(
|
47
|
+
@document.tree.css('[style]').each do |element|
|
37
48
|
headings.push element unless element.font_size.nil? || element.font_size < MIN_HEADING_SIZE
|
38
49
|
end
|
39
50
|
headings
|
40
51
|
end
|
41
52
|
end
|
42
53
|
|
43
|
-
#
|
54
|
+
# @return [Array<Integer>] An array of font-sizes for implicit headings in the document
|
44
55
|
def font_sizes
|
45
56
|
@font_sizes ||= begin
|
46
57
|
sizes = []
|
47
|
-
@document.tree.css(
|
58
|
+
@document.tree.css('[style]').each do |element|
|
48
59
|
sizes.push element.font_size.round(-1) unless element.font_size.nil?
|
49
60
|
end
|
50
|
-
sizes.uniq.sort
|
61
|
+
sizes.uniq.sort.extend(DescriptiveStatistics)
|
51
62
|
end
|
52
63
|
end
|
53
64
|
|
54
65
|
# Given a Nokogiri node, guess what heading it represents, if any
|
55
66
|
#
|
56
|
-
# node
|
57
|
-
#
|
58
|
-
# retuns the heading tag (e.g., H1), or nil
|
67
|
+
# @param node [Nokogiri::Node] the Nokogiri node
|
68
|
+
# @return [String, nil] the heading tag (e.g., H1), or nil
|
59
69
|
def guess_heading(node)
|
60
|
-
return nil if node.font_size
|
70
|
+
return nil if node.font_size.nil?
|
71
|
+
|
61
72
|
[*1...HEADING_DEPTH].each do |heading|
|
62
73
|
return "h#{heading}" if node.font_size >= h(heading)
|
63
74
|
end
|
@@ -67,51 +78,58 @@ class WordToMarkdown
|
|
67
78
|
# Minimum font size required for a given heading
|
68
79
|
# e.g., H(2) would represent the minimum font size of an implicit h2
|
69
80
|
#
|
70
|
-
#
|
81
|
+
# @param num [Integer] the heading number, e.g., 1, 2
|
71
82
|
#
|
72
|
-
#
|
73
|
-
def h(
|
74
|
-
font_sizes.percentile
|
83
|
+
# @return [Integer] the minimum font size
|
84
|
+
def h(num)
|
85
|
+
font_sizes.percentile(((HEADING_DEPTH - 1) - num) * HEADING_STEP)
|
75
86
|
end
|
76
87
|
|
88
|
+
# Convert span-based font styles to `strong`s and `em`s
|
77
89
|
def semanticize_font_styles!
|
78
|
-
@document.tree.css(
|
90
|
+
@document.tree.css('span').each do |node|
|
79
91
|
if node.bold?
|
80
|
-
node.node_name =
|
92
|
+
node.node_name = 'strong'
|
81
93
|
elsif node.italic?
|
82
|
-
node.node_name =
|
94
|
+
node.node_name = 'em'
|
83
95
|
end
|
84
96
|
end
|
85
97
|
end
|
86
98
|
|
99
|
+
# Remove top-level paragraphs from table cells
|
87
100
|
def remove_paragraphs_from_tables!
|
88
|
-
@document.tree.search(
|
101
|
+
@document.tree.search('td p').each { |node| node.node_name = 'span' }
|
89
102
|
end
|
90
103
|
|
104
|
+
# Remove top-level paragraphs from list items
|
91
105
|
def remove_paragraphs_from_list_items!
|
92
|
-
@document.tree.search(
|
106
|
+
@document.tree.search('li p').each { |node| node.node_name = 'span' }
|
93
107
|
end
|
94
108
|
|
109
|
+
# Remove prepended unicode bullets from list items
|
95
110
|
def remove_unicode_bullets_from_list_items!
|
96
|
-
path = WordToMarkdown.soffice.major_version ==
|
111
|
+
path = WordToMarkdown.soffice.major_version == '5' ? 'li span span' : 'li span'
|
97
112
|
@document.tree.search(path).each do |span|
|
98
|
-
span.inner_html = span.inner_html.gsub
|
113
|
+
span.inner_html = span.inner_html.gsub(/^([#{UNICODE_BULLETS.join}]+)/, '')
|
99
114
|
end
|
100
115
|
end
|
101
116
|
|
117
|
+
# Remove prepended numbers from list items
|
102
118
|
def remove_numbering_from_list_items!
|
103
|
-
path = WordToMarkdown.soffice.major_version ==
|
119
|
+
path = WordToMarkdown.soffice.major_version == '5' ? 'li span span' : 'li span'
|
104
120
|
@document.tree.search(path).each do |span|
|
105
|
-
span.inner_html = span.inner_html.gsub
|
121
|
+
span.inner_html = span.inner_html.gsub(/^[a-zA-Z0-9]+\./m, '')
|
106
122
|
end
|
107
123
|
end
|
108
124
|
|
125
|
+
# Remove whitespace from list items
|
109
126
|
def remove_whitespace_from_list_items!
|
110
|
-
@document.tree.search(
|
127
|
+
@document.tree.search('li span').each { |span| span.inner_html.strip! }
|
111
128
|
end
|
112
129
|
|
130
|
+
# Convert table headers to `th`s
|
113
131
|
def semanticize_table_headers!
|
114
|
-
@document.tree.search(
|
132
|
+
@document.tree.search('table tr:first td').each { |node| node.node_name = 'th' }
|
115
133
|
end
|
116
134
|
|
117
135
|
# Try to guess heading where implicit based on font size
|
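Note on `remove_unicode_bullets_from_list_items!` and `remove_numbering_from_list_items!` in the hunk above: both strip a prefix with a regular expression. The sketch below applies the same two patterns to plain strings instead of Nokogiri spans; the helper names `strip_bullets` and `strip_numbering` are hypothetical.

```ruby
# Same constant as in the diff; joined into a single character class below.
UNICODE_BULLETS = ['○', 'o', '●', "\u2022", '\\p{C}'].freeze

# Leading run of any listed bullet character (or any \p{C} "other" character).
BULLET_PREFIX = /^([#{UNICODE_BULLETS.join}]+)/

# Leading "1.", "a.", "iv." style markers.
NUMBER_PREFIX = /^[a-zA-Z0-9]+\./m

def strip_bullets(text)
  text.gsub(BULLET_PREFIX, '')
end

def strip_numbering(text)
  text.gsub(NUMBER_PREFIX, '')
end

puts strip_bullets('● First item').inspect
puts strip_numbering('1. First item').inspect
```

Because the character class lists a plain letter `o`, a string that starts with that letter loses it under this sketch (`'orange'` becomes `'range'`). Neither pattern removes the space that follows the marker; the gem has a separate whitespace step for list items.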
@@ -121,6 +139,5 @@ class WordToMarkdown
|
|
121
139
|
element.node_name = heading unless heading.nil?
|
122
140
|
end
|
123
141
|
end
|
124
|
-
|
125
142
|
end
|
126
143
|
end
|
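Note on `guess_heading` and `h(num)` above: a node is assigned the first heading level whose percentile threshold its font size meets. The sketch below reproduces that walk without the `descriptive_statistics` gem. The `percentile` helper here uses linear interpolation between neighbouring values, which is an assumption and may differ in detail from the gem's method; `heading_threshold` is a hypothetical name for `h`.

```ruby
HEADING_DEPTH = 6
HEADING_STEP = 100 / HEADING_DEPTH # integer division, so 16

# Assumed percentile definition: linear interpolation over a sorted array.
def percentile(sorted, pct)
  return nil if sorted.empty?
  return sorted.last if sorted.size == 1 || pct >= 100

  rank = pct / 100.0 * (sorted.size - 1)
  lower = sorted[rank.floor]
  upper = sorted[rank.floor + 1]
  lower + (upper - lower) * (rank - rank.floor)
end

# Minimum font size for heading level num, as in h(num) from the diff.
def heading_threshold(num, sizes)
  percentile(sizes, ((HEADING_DEPTH - 1) - num) * HEADING_STEP)
end

# Returns "h1".."h5" for the first threshold the size meets, or nil.
def guess_heading(font_size, sizes)
  return nil if font_size.nil?

  (1...HEADING_DEPTH).each do |level|
    return "h#{level}" if font_size >= heading_threshold(level, sizes)
  end
  nil
end

sizes = [10, 20, 30, 40] # unique, sorted font sizes found in a document
puts guess_heading(40, sizes)
puts guess_heading(20, sizes)
```

With these constants the thresholds for h1 through h5 sit at the 64th, 48th, 32nd, 16th and 0th percentiles, and the exclusive range means an h6 is never guessed.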
data/lib/word-to-markdown/document.rb
CHANGED
@@ -1,50 +1,55 @@
|
|
1
|
-
#
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
2
3
|
class WordToMarkdown
|
3
4
|
class Document
|
4
5
|
class NotFoundError < StandardError; end
|
5
|
-
class ConverstionError < StandardError; end
|
6
6
|
|
7
|
-
|
7
|
+
class ConversionError < StandardError; end
|
8
|
+
|
9
|
+
attr_reader :path, :tmpdir
|
8
10
|
|
11
|
+
# @param path [string] Path to the Word document
|
12
|
+
# @param tmpdir [string] Path to a working directory to use
|
9
13
|
def initialize(path, tmpdir = nil)
|
10
14
|
@path = File.expand_path path, Dir.pwd
|
11
15
|
@tmpdir = tmpdir || Dir.mktmpdir
|
12
16
|
raise NotFoundError, "File #{@path} does not exist" unless File.exist?(@path)
|
13
17
|
end
|
14
18
|
|
19
|
+
# @return [String] the document's extension
|
15
20
|
def extension
|
16
21
|
File.extname path
|
17
22
|
end
|
18
23
|
|
24
|
+
# @return [Nokogiri::Document]
|
19
25
|
def tree
|
20
26
|
@tree ||= begin
|
21
27
|
tree = Nokogiri::HTML(normalized_html)
|
22
|
-
tree.css(
|
28
|
+
tree.css('title').remove
|
23
29
|
tree
|
24
30
|
end
|
25
31
|
end
|
26
32
|
|
27
|
-
#
|
33
|
+
# @return [String] the html representation of the document
|
28
34
|
def html
|
29
|
-
tree.to_html.gsub("</li>\n",
|
35
|
+
tree.to_html.gsub("</li>\n", '</li>')
|
30
36
|
end
|
31
37
|
|
32
|
-
#
|
33
|
-
def
|
38
|
+
# @return [String] the markdown representation of the document
|
39
|
+
def markdown
|
34
40
|
@markdown ||= scrub_whitespace(ReverseMarkdown.convert(html, WordToMarkdown::REVERSE_MARKDOWN_OPTIONS))
|
35
41
|
end
|
42
|
+
alias to_s markdown
|
36
43
|
|
37
44
|
# Determine the document encoding
|
38
45
|
#
|
39
|
-
#
|
40
|
-
#
|
41
|
-
# Returns the encoding, defaulting to "UTF-8"
|
46
|
+
# @return [String] the encoding, defaulting to "UTF-8"
|
42
47
|
def encoding
|
43
|
-
match = raw_html.encode(
|
48
|
+
match = raw_html.encode('UTF-8', invalid: :replace, replace: '').match(/charset=([^"]+)/)
|
44
49
|
if match
|
45
|
-
match[1].sub(
|
50
|
+
match[1].sub('macintosh', 'MacRoman')
|
46
51
|
else
|
47
|
-
|
52
|
+
'UTF-8'
|
48
53
|
end
|
49
54
|
end
|
50
55
|
|
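Note on the `encoding` method above: it reads the charset declared in the intermediate HTML and renames `macintosh` to the name Ruby uses. The sketch below is a standalone version of that lookup; `detect_encoding` is a hypothetical helper name and the sample markup is made up.

```ruby
# Returns the declared charset, or "UTF-8" when none is declared.
def detect_encoding(raw_html)
  # Drop bytes that are invalid in UTF-8 so the regexp match cannot raise.
  safe = raw_html.encode('UTF-8', invalid: :replace, replace: '')
  match = safe.match(/charset=([^"]+)/)
  return 'UTF-8' unless match

  # LibreOffice on macOS may declare "macintosh"; Ruby calls it MacRoman.
  match[1].sub('macintosh', 'MacRoman')
end

mac_html = '<meta http-equiv="content-type" content="text/html; charset=macintosh">'
puts detect_encoding(mac_html)
puts detect_encoding('<p>No charset declared</p>')
```

The result is what `normalized_html` passes to `force_encoding` before transcoding the document to UTF-8.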
@@ -52,55 +57,59 @@ class WordToMarkdown
|
|
52
57
|
|
53
58
|
# Perform pre-processing normalization
|
54
59
|
#
|
55
|
-
#
|
56
|
-
#
|
57
|
-
# Returns the normalized html
|
60
|
+
# @return [String] the normalized html
|
58
61
|
def normalized_html
|
59
|
-
html = raw_html.force_encoding(encoding)
|
60
|
-
html = html.encode(
|
61
|
-
html = Premailer.new(html, :
|
62
|
-
html.gsub!
|
63
|
-
html.gsub!
|
64
|
-
html.gsub!
|
65
|
-
html.gsub!
|
62
|
+
html = raw_html.dup.force_encoding(encoding)
|
63
|
+
html = html.encode('UTF-8', invalid: :replace, replace: '')
|
64
|
+
html = Premailer.new(html, with_html_string: true, input_encoding: 'UTF-8').to_inline_css
|
65
|
+
html.gsub!(/\n|\r/, ' ') # Remove linebreaks
|
66
|
+
html.gsub!(/“|”/, '"') # Straighten curly double quotes
|
67
|
+
html.gsub!(/‘|’/, "'") # Straighten curly single quotes
|
68
|
+
html.gsub!(/>\s+</, '><') # Remove extra whitespace between tags
|
66
69
|
html
|
67
70
|
end
|
68
71
|
|
69
72
|
# Perform post-processing normalization of certain Word quirks
|
70
73
|
#
|
71
|
-
# string
|
74
|
+
# @param string [String] the markdown representation of the document
|
72
75
|
#
|
73
|
-
#
|
76
|
+
# @return [String] the normalized markdown
|
74
77
|
def scrub_whitespace(string)
|
75
|
-
string
|
76
|
-
string.
|
77
|
-
string.sub!(
|
78
|
-
string.
|
79
|
-
string.gsub!(
|
80
|
-
string.gsub!(/\
|
78
|
+
string = string.dup
|
79
|
+
string.gsub!('&nbsp;', ' ') # HTML encoded spaces
|
80
|
+
string.sub!(/\A[[:space:]]+/, '') # document leading whitespace
|
81
|
+
string.sub!(/[[:space:]]+\z/, '') # document trailing whitespace
|
82
|
+
string.gsub!(/([ ]+)$/, '') # line trailing whitespace
|
83
|
+
string.gsub!(/\n\n\n\n/, "\n\n") # Quadruple line breaks
|
84
|
+
string.delete!(' ') # Unicode non-breaking spaces, injected as tabs
|
85
|
+
string.gsub!(/\*\*\ +(?!\*|_)([[:punct:]])/, '**\1') # Remove extra space after bold
|
81
86
|
string
|
82
87
|
end
|
83
88
|
|
89
|
+
# @return [String] the path to the intermediary HTML document
|
84
90
|
def dest_path
|
85
|
-
dest_filename = File.basename(path).gsub(/#{Regexp.escape(extension)}$/,
|
91
|
+
dest_filename = File.basename(path).gsub(/#{Regexp.escape(extension)}$/, '.html')
|
86
92
|
File.expand_path(dest_filename, tmpdir)
|
87
93
|
end
|
88
94
|
|
95
|
+
# @return [String] the unnormalized HTML representation
|
89
96
|
def raw_html
|
90
97
|
@raw_html ||= begin
|
91
|
-
WordToMarkdown
|
92
|
-
raise
|
98
|
+
WordToMarkdown.run_command '--headless', '--convert-to', filter, path, '--outdir', tmpdir
|
99
|
+
raise ConversionError, "Failed to convert #{path}" unless File.exist?(dest_path)
|
100
|
+
|
93
101
|
html = File.read dest_path
|
94
102
|
File.delete dest_path
|
95
103
|
html
|
96
104
|
end
|
97
105
|
end
|
98
106
|
|
107
|
+
# @return [String] the LibreOffice filter to use for conversion
|
99
108
|
def filter
|
100
|
-
if WordToMarkdown.soffice.major_version ==
|
101
|
-
|
109
|
+
if WordToMarkdown.soffice.major_version == '5'
|
110
|
+
'html:XHTML Writer File:UTF8'
|
102
111
|
else
|
103
|
-
|
112
|
+
'html'
|
104
113
|
end
|
105
114
|
end
|
106
115
|
end
|
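Note on `scrub_whitespace` above: the order of its substitutions matters, since trailing whitespace is trimmed before runs of blank lines are collapsed. The sketch below is a standalone copy of those steps for experimentation, assuming that the first step targets the `&nbsp;` entity and the `delete!` step targets U+00A0; the sample input is made up.

```ruby
# Standalone copy of the clean-up steps; returns a new string.
def scrub_whitespace(markdown)
  text = markdown.dup
  text.gsub!('&nbsp;', ' ')                               # HTML encoded spaces
  text.sub!(/\A[[:space:]]+/, '')                         # document leading whitespace
  text.sub!(/[[:space:]]+\z/, '')                         # document trailing whitespace
  text.gsub!(/([ ]+)$/, '')                               # line trailing whitespace
  text.gsub!(/\n\n\n\n/, "\n\n")                          # quadruple line breaks
  text.delete!("\u00A0")                                  # non-breaking spaces
  text.gsub!(/\*\* +(?!\*|_)([[:punct:]])/, '**\1')       # space between bold and punctuation
  text
end

sample = "\n# Title  \n\n\n\nIt is **bold** , really&nbsp;bold.\u00A0Done.\n\n"
puts scrub_whitespace(sample).inspect
```

In this sample the non-breaking space between the two sentences is deleted outright rather than replaced, so `bold.` and `Done.` end up adjacent.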
data/lib/word-to-markdown.rb
CHANGED
@@ -1,4 +1,6 @@
|
|
1
|
-
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
require 'descriptive_statistics/safe'
|
2
4
|
require 'reverse_markdown'
|
3
5
|
require 'nokogiri-styles'
|
4
6
|
require 'premailer'
|
@@ -16,86 +18,94 @@ require_relative 'nokogiri/xml/element'
|
|
16
18
|
require_relative 'cliver/dependency_ext'
|
17
19
|
|
18
20
|
class WordToMarkdown
|
19
|
-
|
20
21
|
attr_reader :document, :converter
|
21
22
|
|
23
|
+
# Options to be passed to Reverse Markdown
|
22
24
|
REVERSE_MARKDOWN_OPTIONS = {
|
23
25
|
unknown_tags: :bypass,
|
24
26
|
github_flavored: true
|
25
|
-
}
|
27
|
+
}.freeze
|
26
28
|
|
29
|
+
# Minimum version of LibreOffice Required
|
27
30
|
SOFFICE_VERSION_REQUIREMENT = '> 4.0'
|
28
31
|
|
32
|
+
# Paths to look for LibreOffice, in order of preference
|
29
33
|
PATHS = [
|
30
|
-
|
31
|
-
|
32
|
-
|
33
|
-
|
34
|
-
|
35
|
-
]
|
34
|
+
'*', # Sub'd for ENV["PATH"]
|
35
|
+
'~/Applications/LibreOffice.app/Contents/MacOS',
|
36
|
+
'/Applications/LibreOffice.app/Contents/MacOS',
|
37
|
+
'/Program Files/LibreOffice 5/program',
|
38
|
+
'/Program Files (x86)/LibreOffice 4/program'
|
39
|
+
].freeze
|
36
40
|
|
37
41
|
# Create a new WordToMarkdown object
|
38
42
|
#
|
39
|
-
#
|
40
|
-
#
|
41
|
-
#
|
43
|
+
# @param path [string] Path to the Word document
|
44
|
+
# @param tmpdir [string] Path to a working directory to use
|
45
|
+
# @return [WordToMarkdown] WordToMarkdown object with the converted document
|
42
46
|
def initialize(path, tmpdir = nil)
|
43
47
|
@document = WordToMarkdown::Document.new path, tmpdir
|
44
48
|
@converter = WordToMarkdown::Converter.new @document
|
45
49
|
converter.convert!
|
46
50
|
end
|
47
51
|
|
48
|
-
|
49
|
-
|
50
|
-
|
51
|
-
|
52
|
-
logger.debug output
|
53
|
-
raise "Command `#{soffice.path} #{args.join(" ")}` failed: #{output}" if status.exitstatus != 0
|
54
|
-
output
|
52
|
+
# Helper method to return the document body, as markdown
|
53
|
+
# @return [string] the document body, as markdown
|
54
|
+
def to_s
|
55
|
+
document.to_s
|
55
56
|
end
|
56
57
|
|
57
|
-
|
58
|
-
|
59
|
-
|
60
|
-
|
61
|
-
|
62
|
-
|
63
|
-
|
64
|
-
# open - is the dependency currently open/running?
|
65
|
-
def self.soffice
|
66
|
-
@@soffice_dependency ||= Cliver::Dependency.new("soffice", *soffice_dependency_args)
|
67
|
-
end
|
58
|
+
class << self
|
59
|
+
# Run an soffice command
|
60
|
+
#
|
61
|
+
# @param args [string] one or more arguments to pass to the soffice command
|
62
|
+
# @return [string] the command output
|
63
|
+
def run_command(*args)
|
64
|
+
raise 'LibreOffice already running' if soffice.open?
|
68
65
|
|
69
|
-
|
70
|
-
|
71
|
-
|
72
|
-
|
73
|
-
|
66
|
+
output, status = Open3.capture2e(soffice.path, *args)
|
67
|
+
logger.debug output
|
68
|
+
raise "Command `#{soffice.path} #{args.join(' ')}` failed: #{output}" if status.exitstatus != 0
|
69
|
+
|
70
|
+
output
|
74
71
|
end
|
75
|
-
end
|
76
72
|
|
77
|
-
|
78
|
-
|
79
|
-
|
80
|
-
|
73
|
+
# Returns a Cliver::Dependency object representing our soffice dependency
|
74
|
+
#
|
75
|
+
# Attempts to resolve by looking at PATH followed by paths in the PATHS constant
|
76
|
+
#
|
77
|
+
# Methods used internally:
|
78
|
+
# path - returns the resolved path. Raises an error if not satisfied
|
79
|
+
# version - returns the resolved version
|
80
|
+
# open - is the dependency currently open/running?
|
81
|
+
# @return Cliver::Dependency instance
|
82
|
+
def soffice
|
83
|
+
@soffice ||= Cliver::Dependency.new('soffice', *soffice_dependency_args)
|
84
|
+
end
|
81
85
|
|
82
|
-
|
83
|
-
|
84
|
-
|
86
|
+
# @return Logger instance
|
87
|
+
def logger
|
88
|
+
@logger ||= begin
|
89
|
+
logger = Logger.new($stdout)
|
90
|
+
logger.level = Logger::ERROR unless ENV['DEBUG']
|
91
|
+
logger
|
92
|
+
end
|
93
|
+
end
|
85
94
|
|
86
|
-
|
95
|
+
private
|
87
96
|
|
88
|
-
|
89
|
-
|
90
|
-
|
91
|
-
|
92
|
-
|
93
|
-
|
94
|
-
|
95
|
-
|
96
|
-
|
97
|
-
|
98
|
-
|
97
|
+
# Workaround for two upstream bugs:
|
98
|
+
# 1. `soffice.exe --version` on windows opens a popup and returns a null string when manually closed
|
99
|
+
# 2. Even if the second argument to Cliver is nil, Cliver thinks there's a requirement
|
100
|
+
# and will shell out to `soffice.exe --version`
|
101
|
+
# In order to support Windows, don't pass *any* version requirement to Cliver
|
102
|
+
def soffice_dependency_args
|
103
|
+
args = [path: PATHS.join(File::PATH_SEPARATOR)]
|
104
|
+
if Gem.win_platform?
|
105
|
+
args
|
106
|
+
else
|
107
|
+
args.unshift SOFFICE_VERSION_REQUIREMENT
|
108
|
+
end
|
99
109
|
end
|
100
110
|
end
|
101
111
|
end
|
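Note on `run_command` above: it captures stdout and stderr together and raises when the exit status is non-zero. The sketch below shows that pattern with Ruby's own interpreter as the child process, so it runs without LibreOffice. It leaves out the gem's "already running" check and its logger, and the `executable` parameter is an addition for illustration.

```ruby
require 'open3'
require 'rbconfig'

# Runs a command, returning its combined stdout/stderr, or raises on failure.
def run_command(executable, *args)
  output, status = Open3.capture2e(executable, *args)
  raise "Command `#{executable} #{args.join(' ')}` failed: #{output}" unless status.success?

  output
end

# RbConfig.ruby is the path of the running interpreter, used here as a
# stand-in for the soffice binary.
puts run_command(RbConfig.ruby, '-e', 'print "converted"')
```

Passing the arguments as separate strings, as the gem does, avoids shell interpolation of document paths that contain spaces.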
metadata
CHANGED
@@ -1,29 +1,29 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: word-to-markdown
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.1.
|
4
|
+
version: 1.1.9
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Ben Balter
|
8
|
-
autorequire:
|
8
|
+
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date:
|
11
|
+
date: 2025-01-08 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
|
-
name:
|
14
|
+
name: cliver
|
15
15
|
requirement: !ruby/object:Gem::Requirement
|
16
16
|
requirements:
|
17
17
|
- - "~>"
|
18
18
|
- !ruby/object:Gem::Version
|
19
|
-
version: '0.
|
19
|
+
version: '0.3'
|
20
20
|
type: :runtime
|
21
21
|
prerelease: false
|
22
22
|
version_requirements: !ruby/object:Gem::Requirement
|
23
23
|
requirements:
|
24
24
|
- - "~>"
|
25
25
|
- !ruby/object:Gem::Version
|
26
|
-
version: '0.
|
26
|
+
version: '0.3'
|
27
27
|
- !ruby/object:Gem::Dependency
|
28
28
|
name: descriptive_statistics
|
29
29
|
requirement: !ruby/object:Gem::Requirement
|
@@ -38,6 +38,20 @@ dependencies:
|
|
38
38
|
- - "~>"
|
39
39
|
- !ruby/object:Gem::Version
|
40
40
|
version: '2.5'
|
41
|
+
- !ruby/object:Gem::Dependency
|
42
|
+
name: nokogiri-styles
|
43
|
+
requirement: !ruby/object:Gem::Requirement
|
44
|
+
requirements:
|
45
|
+
- - "~>"
|
46
|
+
- !ruby/object:Gem::Version
|
47
|
+
version: '0.1'
|
48
|
+
type: :runtime
|
49
|
+
prerelease: false
|
50
|
+
version_requirements: !ruby/object:Gem::Requirement
|
51
|
+
requirements:
|
52
|
+
- - "~>"
|
53
|
+
- !ruby/object:Gem::Version
|
54
|
+
version: '0.1'
|
41
55
|
- !ruby/object:Gem::Dependency
|
42
56
|
name: premailer
|
43
57
|
requirement: !ruby/object:Gem::Requirement
|
@@ -53,131 +67,151 @@ dependencies:
|
|
53
67
|
- !ruby/object:Gem::Version
|
54
68
|
version: '1.8'
|
55
69
|
- !ruby/object:Gem::Dependency
|
56
|
-
name:
|
70
|
+
name: reverse_markdown
|
57
71
|
requirement: !ruby/object:Gem::Requirement
|
58
72
|
requirements:
|
59
|
-
- - "
|
73
|
+
- - ">="
|
60
74
|
- !ruby/object:Gem::Version
|
61
|
-
version: '
|
75
|
+
version: '1'
|
76
|
+
- - "<"
|
77
|
+
- !ruby/object:Gem::Version
|
78
|
+
version: '3'
|
62
79
|
type: :runtime
|
63
80
|
prerelease: false
|
64
81
|
version_requirements: !ruby/object:Gem::Requirement
|
65
82
|
requirements:
|
66
|
-
- - "
|
83
|
+
- - ">="
|
67
84
|
- !ruby/object:Gem::Version
|
68
|
-
version: '
|
85
|
+
version: '1'
|
86
|
+
- - "<"
|
87
|
+
- !ruby/object:Gem::Version
|
88
|
+
version: '3'
|
69
89
|
- !ruby/object:Gem::Dependency
|
70
90
|
name: sys-proctable
|
71
91
|
requirement: !ruby/object:Gem::Requirement
|
72
92
|
requirements:
|
73
93
|
- - "~>"
|
74
94
|
- !ruby/object:Gem::Version
|
75
|
-
version: '0
|
95
|
+
version: '1.0'
|
76
96
|
type: :runtime
|
77
97
|
prerelease: false
|
78
98
|
version_requirements: !ruby/object:Gem::Requirement
|
79
99
|
requirements:
|
80
100
|
- - "~>"
|
81
101
|
- !ruby/object:Gem::Version
|
82
|
-
version: '0
|
102
|
+
version: '1.0'
|
83
103
|
- !ruby/object:Gem::Dependency
|
84
|
-
name:
|
104
|
+
name: minitest
|
85
105
|
requirement: !ruby/object:Gem::Requirement
|
86
106
|
requirements:
|
87
107
|
- - "~>"
|
88
108
|
- !ruby/object:Gem::Version
|
89
|
-
version: '0
|
90
|
-
type: :
|
109
|
+
version: '5.0'
|
110
|
+
type: :development
|
91
111
|
prerelease: false
|
92
112
|
version_requirements: !ruby/object:Gem::Requirement
|
93
113
|
requirements:
|
94
114
|
- - "~>"
|
95
115
|
- !ruby/object:Gem::Version
|
96
|
-
version: '0
|
116
|
+
version: '5.0'
|
97
117
|
- !ruby/object:Gem::Dependency
|
98
|
-
name:
|
118
|
+
name: mocha
|
99
119
|
requirement: !ruby/object:Gem::Requirement
|
100
120
|
requirements:
|
101
121
|
- - "~>"
|
102
122
|
- !ruby/object:Gem::Version
|
103
|
-
version: '
|
123
|
+
version: '1.1'
|
104
124
|
type: :development
|
105
125
|
prerelease: false
|
106
126
|
version_requirements: !ruby/object:Gem::Requirement
|
107
127
|
requirements:
|
108
128
|
- - "~>"
|
109
129
|
- !ruby/object:Gem::Version
|
110
|
-
version: '
|
130
|
+
version: '1.1'
|
111
131
|
- !ruby/object:Gem::Dependency
|
112
|
-
name:
|
132
|
+
name: pry
|
113
133
|
requirement: !ruby/object:Gem::Requirement
|
114
134
|
requirements:
|
115
135
|
- - "~>"
|
116
136
|
- !ruby/object:Gem::Version
|
117
|
-
version: '
|
137
|
+
version: '0.10'
|
118
138
|
type: :development
|
119
139
|
prerelease: false
|
120
140
|
version_requirements: !ruby/object:Gem::Requirement
|
121
141
|
requirements:
|
122
142
|
- - "~>"
|
123
143
|
- !ruby/object:Gem::Version
|
124
|
-
version: '
|
144
|
+
version: '0.10'
|
125
145
|
- !ruby/object:Gem::Dependency
|
126
|
-
name:
|
146
|
+
name: rake
|
127
147
|
requirement: !ruby/object:Gem::Requirement
|
128
148
|
requirements:
|
129
149
|
- - "~>"
|
130
150
|
- !ruby/object:Gem::Version
|
131
|
-
version: '
|
151
|
+
version: '13.0'
|
132
152
|
type: :development
|
133
153
|
prerelease: false
|
134
154
|
version_requirements: !ruby/object:Gem::Requirement
|
135
155
|
requirements:
|
136
156
|
- - "~>"
|
137
157
|
- !ruby/object:Gem::Version
|
138
|
-
version: '
|
158
|
+
version: '13.0'
|
139
159
|
- !ruby/object:Gem::Dependency
|
140
|
-
name:
|
160
|
+
name: rubocop
|
141
161
|
requirement: !ruby/object:Gem::Requirement
|
142
162
|
requirements:
|
143
163
|
- - "~>"
|
144
164
|
- !ruby/object:Gem::Version
|
145
|
-
version: '0
|
165
|
+
version: '1.0'
|
146
166
|
type: :development
|
147
167
|
prerelease: false
|
148
168
|
version_requirements: !ruby/object:Gem::Requirement
|
149
169
|
requirements:
|
150
170
|
- - "~>"
|
151
171
|
- !ruby/object:Gem::Version
|
152
|
-
version: '0
|
172
|
+
version: '1.0'
|
153
173
|
- !ruby/object:Gem::Dependency
|
154
|
-
name:
|
174
|
+
name: rubocop-minitest
|
155
175
|
requirement: !ruby/object:Gem::Requirement
|
156
176
|
requirements:
|
157
177
|
- - "~>"
|
158
178
|
- !ruby/object:Gem::Version
|
159
|
-
version: '
|
179
|
+
version: '0.3'
|
160
180
|
type: :development
|
161
181
|
prerelease: false
|
162
182
|
version_requirements: !ruby/object:Gem::Requirement
|
163
183
|
requirements:
|
164
184
|
- - "~>"
|
165
185
|
- !ruby/object:Gem::Version
|
166
|
-
version: '
|
186
|
+
version: '0.3'
|
167
187
|
- !ruby/object:Gem::Dependency
|
168
|
-
name:
|
188
|
+
name: rubocop-performance
|
169
189
|
requirement: !ruby/object:Gem::Requirement
|
170
190
|
requirements:
|
171
191
|
- - "~>"
|
172
192
|
- !ruby/object:Gem::Version
|
173
|
-
version: '5
|
193
|
+
version: '1.5'
|
174
194
|
type: :development
|
175
195
|
prerelease: false
|
176
196
|
version_requirements: !ruby/object:Gem::Requirement
|
177
197
|
requirements:
|
178
198
|
- - "~>"
|
179
199
|
- !ruby/object:Gem::Version
|
180
|
-
version: '5
|
200
|
+
version: '1.5'
|
201
|
+
- !ruby/object:Gem::Dependency
|
202
|
+
name: shoulda
|
203
|
+
requirement: !ruby/object:Gem::Requirement
|
204
|
+
requirements:
|
205
|
+
- - "~>"
|
206
|
+
- !ruby/object:Gem::Version
|
207
|
+
version: '4.0'
|
208
|
+
type: :development
|
209
|
+
prerelease: false
|
210
|
+
version_requirements: !ruby/object:Gem::Requirement
|
211
|
+
requirements:
|
212
|
+
- - "~>"
|
213
|
+
- !ruby/object:Gem::Version
|
214
|
+
version: '4.0'
|
181
215
|
description: Ruby Gem to convert Word documents to markdown.
|
182
216
|
email: ben.balter@github.com
|
183
217
|
executables:
|
@@ -185,6 +219,8 @@ executables:
|
|
185
219
|
extensions: []
|
186
220
|
extra_rdoc_files: []
|
187
221
|
files:
|
222
|
+
- LICENSE.md
|
223
|
+
- README.md
|
188
224
|
- bin/w2m
|
189
225
|
- lib/cliver/dependency_ext.rb
|
190
226
|
- lib/nokogiri/xml/element.rb
|
@@ -195,8 +231,9 @@ files:
|
|
195
231
|
homepage: https://github.com/benbalter/word-to-markdown
|
196
232
|
licenses:
|
197
233
|
- MIT
|
198
|
-
metadata:
|
199
|
-
|
234
|
+
metadata:
|
235
|
+
rubygems_mfa_required: 'true'
|
236
|
+
post_install_message:
|
200
237
|
rdoc_options: []
|
201
238
|
require_paths:
|
202
239
|
- lib
|
@@ -211,9 +248,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
211
248
|
- !ruby/object:Gem::Version
|
212
249
|
version: '0'
|
213
250
|
requirements: []
|
214
|
-
|
215
|
-
|
216
|
-
signing_key:
|
251
|
+
rubygems_version: 3.5.16
|
252
|
+
signing_key:
|
217
253
|
specification_version: 4
|
218
254
|
summary: Ruby Gem to convert Word documents to markdown
|
219
255
|
test_files: []
|
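Note on the dependency constraints in the metadata above: `reverse_markdown` now carries a two-part range (`>= 1`, `< 3`), while most other entries use the pessimistic operator. The sketch below checks a few version numbers against those constraints with RubyGems' own classes; `allowed?` is a hypothetical helper and the version numbers are examples, not releases confirmed by this diff.

```ruby
require 'rubygems'

# True when the version satisfies every part of the requirement.
def allowed?(requirement_parts, version)
  Gem::Requirement.new(*requirement_parts).satisfied_by?(Gem::Version.new(version))
end

reverse_markdown = ['>= 1', '< 3']
cliver = ['~> 0.3']

puts allowed?(reverse_markdown, '2.1.1')
puts allowed?(reverse_markdown, '3.0.0')
puts allowed?(cliver, '0.3.2')
puts allowed?(cliver, '1.0.0')
```

A two-segment pessimistic constraint such as `~> 0.3` permits any release from 0.3 up to, but not including, 1.0.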