docx_converter 1.0.0
- data/.gitignore +7 -0
- data/Gemfile +3 -0
- data/README.md +57 -0
- data/Rakefile +1 -0
- data/bin/docx_converter +51 -0
- data/docx_converter.gemspec +26 -0
- data/lib/docx_converter.rb +28 -0
- data/lib/docx_converter/parser.rb +241 -0
- data/lib/docx_converter/postprocessor.rb +75 -0
- data/lib/docx_converter/render.rb +119 -0
- data/lib/docx_converter/version.rb +20 -0
- metadata +141 -0
data/.gitignore
ADDED
data/Gemfile
ADDED
data/README.md
ADDED
@@ -0,0 +1,57 @@
+docx_converter
+=================
+
+This Ruby library (gem) parses `.docx` Word documents and translates them into kramdown syntax, which allows for easy subsequent translation into `html` or `TeX` code via the excellent `kramdown` library. `kramdown` is a superset of `Markdown`. See http://kramdown.gettalong.org/ for more details.
+
+A `.docx` file as written by modern versions of Microsoft Office is just a `.zip` file in disguise. It contains a directory tree of XML files. Parsing these compressed XML trees is rather straightforward, thanks to the `rubyzip` and `nokogiri` Ruby libraries.
+
+`docx_converter` contains a parser which translates all common Word document entities into the corresponding `kramdown` syntax. It extracts embedded JPEG and PNG images and writes them as `.jpg` files with a maximum width or height of 800 pixels.
+
+Output files and directories are created according to the `webgen` conventions. This is useful when you want to generate a static website with the `webgen` gem after you have converted your `.docx` file into `html`. File names have the format `ss.nnnn.ll.page`, where `ss` is a 2-digit sort number, `nnnn` is the main file name, and `ll` is the language code. For more information on `webgen` see http://webgen.gettalong.org/
+
+`docx_converter` was written for our project `publishr_web`, see http://documentation.red-e.eu/publishr/index.html
+
+Supported Word elements:
+
+* Paragraph
+* Line break
+* Page break
+* Bold
+* Italic
+* Paragraph styles like Heading1, Heading2 and Title
+* Character styles like Strong and Quote
+* Footnotes
+* Images including captions
+* Non-breaking spaces
+
+Installation
+----------
+
+`gem install docx_converter`
+
+Usage
+----------
+
+From the command line:
+
+`docx_converter` `inputfile` `format` `output_directory`
+
+`format` can be one of `kramdown`, `html` or `latex`. For example:
+
+`docx_converter` `~/Downloads/testdoc1.docx` `latex` `/tmp/docxoutput`
+
+`output_directory` will be created if it doesn't exist. A subdirectory `src` is created inside it by default, a convention that matches the `webgen` file system standard.
+
+If you want to use `docx_converter` from a Ruby script, you can use the API like this:
+
+    r = DocxConverter::Render.new(options)
+    rendered_filepaths = r.render(:html)
+
+`options` is a hash with the following keys:
+
+* `:output_dir`: The directory into which the output files are written. It is created if it doesn't exist. The command-line tool passes `output_directory/src` here to match the `webgen` file system standard.
+* `:inputfile`: The path to the `.docx` file to be parsed.
+* `:image_subdir_filesystem`: The name of the subdirectory into which images are put. It is created below `:output_dir`.
+* `:image_subdir_kramdown`: Usually this is identical to `:image_subdir_filesystem` and should only differ when you do further manual postprocessing of the kramdown output. This string is added as a prefix for images in the final kramdown output. An example: `![image description](/image_subdir_kramdown/imagename.jpg)`.
+* `:language`: The language code to be used in the generated file names. See the `webgen` conventions above.
+* `:split_chapters`: When `true`, the output is split into a new file at each heading which has the Word paragraph style "Heading1". This is useful for large documents. When `false`, no splitting is done and all content is written to the file `01.chapter01.ll.page`. Footnotes are distributed correctly among the chapters.
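The options hash and the `ss.nnnn.ll.page` naming described in the README can be illustrated with a minimal sketch. The helper `webgen_page_name` and the sample paths are hypothetical names chosen for illustration; they are not part of the gem's API.

```ruby
# Hypothetical helper (not part of docx_converter): builds a webgen-style
# page name "ss.nnnn.ll.page" from a sort number, a base name and a language.
def webgen_page_name(sort_number, name, language)
  format("%02d.%s.%s.page", sort_number, name, language)
end

# An options hash with the keys listed in the README; the paths are examples.
options = {
  :inputfile => "testdoc1.docx",
  :output_dir => "/tmp/docxoutput/src",
  :image_subdir_filesystem => "images",
  :image_subdir_kramdown => "images",
  :language => "en",
  :split_chapters => false
}

puts webgen_page_name(1, "chapter01", options[:language])
```

With `:split_chapters` set to `false`, this is the single file name the README mentions, with `ll` replaced by the configured language code.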
data/Rakefile
ADDED
@@ -0,0 +1 @@
+require "bundler/gem_tasks"
data/bin/docx_converter
ADDED
@@ -0,0 +1,51 @@
+#!/usr/bin/env ruby
+
+# docx_converter -- Converts docx files into html or LaTeX via the kramdown syntax
+# Copyright (C) 2013 Red (E) Tools Ltd. (www.thebigrede.net)
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU Affero General Public License as
+# published by the Free Software Foundation, either version 3 of the
+# License, or (at your option) any later version.
+#
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU Affero General Public License for more details.
+#
+# You should have received a copy of the GNU Affero General Public License
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+$LOAD_PATH.unshift File.expand_path('../../lib', __FILE__)
+
+require 'docx_converter'
+
+if ARGV[2].nil?
+  puts 'Error: Arguments missing! See https://github.com/michaelfranzl/docx_converter or README.md for documentation.'
+  Process.exit!
+end
+
+options = {
+  :inputfile => ARGV[0],
+  :image_subdir_filesystem => "images",
+  :image_subdir_kramdown => "images",
+  :output_dir => File.join(ARGV[2], "src"),
+  :language => "en",
+  :split_chapters => true
+}
+
+output_format = ARGV[1].to_sym
+
+supported_output_formats = [:kramdown, :html, :latex]
+unless supported_output_formats.include? output_format
+  puts "Output format must be one of #{ supported_output_formats.join(", ") }. Exiting."
+  Process.exit!
+end
+
+r = DocxConverter::Render.new(options)
+rendered_filepaths = r.render(output_format)
+
+if rendered_filepaths.any?
+  puts "Rendered files #{ rendered_filepaths.join(", ") } in directory #{ File.join(ARGV[2], "src") }"
+end
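The argument handling in the script above could also be collected in one function, which makes it easier to test. The following is only a sketch under that assumption: `parse_cli_args` and `SUPPORTED_FORMATS` are hypothetical names, and raising `ArgumentError` is a substitute for the script's `puts` followed by `Process.exit!`.

```ruby
SUPPORTED_FORMATS = [:kramdown, :html, :latex]

# Hypothetical helper: turns [inputfile, format, output_directory] into the
# options hash and output format used by the script, or raises ArgumentError.
def parse_cli_args(argv)
  raise ArgumentError, "expected: inputfile format output_directory" if argv.length < 3

  output_format = argv[1].to_sym
  unless SUPPORTED_FORMATS.include?(output_format)
    raise ArgumentError, "format must be one of #{SUPPORTED_FORMATS.join(', ')}"
  end

  options = {
    :inputfile => argv[0],
    :image_subdir_filesystem => "images",
    :image_subdir_kramdown => "images",
    :output_dir => File.join(argv[2], "src"),
    :language => "en",
    :split_chapters => true
  }
  [options, output_format]
end

options, output_format = parse_cli_args(["testdoc1.docx", "latex", "/tmp/docxoutput"])
puts options[:output_dir]
puts output_format
```

A caller could rescue `ArgumentError`, print the message and exit with a non-zero status. Unlike `Process.exit!`, a plain `exit` would still run `at_exit` handlers.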
data/docx_converter.gemspec
ADDED
@@ -0,0 +1,26 @@
+# -*- encoding: utf-8 -*-
+$:.push File.expand_path("../lib", __FILE__)
+require "docx_converter/version"
+
+Gem::Specification.new do |s|
+  s.name = "docx_converter"
+  s.version = DocxConverter::VERSION
+  s.authors = ["Michael Franzl"]
+  s.email = ["office@michaelfranzl.com"]
+  s.homepage = "https://github.com/michaelfranzl/docx_converter"
+  s.summary = %q{Converts Word docx files into html or LaTeX via the kramdown syntax}
+  s.description = %q{Converts Word docx files into html or LaTeX via the kramdown syntax. It supports Word's most common paragraph, character and mixed styles (Title, Heading, Strong, Quote), footnotes, line breaks, page breaks, non-breaking spaces and images with captions. The output is in kramdown syntax (see http://kramdown.gettalong.org/) which can be converted into beautiful html and LaTeX code.}
+  s.rubyforge_project = "docx_converter"
+
+  s.files = `git ls-files`.split("\n")
+  s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
+  s.executables = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
+  s.require_paths = ["lib"]
+
+  # specify any dependencies here; for example:
+  s.add_runtime_dependency 'kramdown'
+  s.add_runtime_dependency 'nokogiri'
+  s.add_runtime_dependency 'rubyzip'
+  s.add_runtime_dependency 'rmagick'
+  s.add_runtime_dependency 'ruby-filemagic'
+end
data/lib/docx_converter.rb
ADDED
@@ -0,0 +1,28 @@
+# docx_converter -- Converts Word docx files into html or LaTeX via the kramdown syntax
+# Copyright (C) 2013 Red (E) Tools Ltd. (www.thebigrede.net)
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU Affero General Public License as
+# published by the Free Software Foundation, either version 3 of the
+# License, or (at your option) any later version.
+#
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU Affero General Public License for more details.
+#
+# You should have received a copy of the GNU Affero General Public License
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+require 'publishr'
+require 'kramdown'
+require 'nokogiri'
+require 'zip/zipfilesystem'
+require 'filemagic'
+require 'RMagick'
+
+dir = File.dirname(__FILE__)
+Dir[File.expand_path("#{dir}/docx_converter/*.rb")].uniq.each do |file|
+  require file
+end
data/lib/docx_converter/parser.rb
ADDED
@@ -0,0 +1,241 @@
+# encoding: UTF-8
+
+# docx_converter -- Converts Word docx files into html or LaTeX via the kramdown syntax
+# Copyright (C) 2013 Red (E) Tools Ltd. (www.thebigrede.net)
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU Affero General Public License as
+# published by the Free Software Foundation, either version 3 of the
+# License, or (at your option) any later version.
+#
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU Affero General Public License for more details.
+#
+# You should have received a copy of the GNU Affero General Public License
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+module DocxConverter
+  class Parser
+    def initialize(options)
+      @output_dir = options[:output_dir]
+      @docx_filepath = options[:inputfile]
+
+      @image_subdir_filesystem = options[:image_subdir_filesystem]
+      @image_subdir_kramdown = options[:image_subdir_kramdown]
+
+      @relationships_hash = {}
+
+      @zipfile = Zip::ZipFile.new(@docx_filepath)
+    end
+
+    def parse
+      document_xml = unzip_read("word/document.xml")
+      footnotes_xml = unzip_read("word/footnotes.xml")
+      relationships_xml = unzip_read("word/_rels/document.xml.rels")
+
+      content = Nokogiri::XML(document_xml)
+      footnotes = Nokogiri::XML(footnotes_xml)
+      relationships = Nokogiri::XML(relationships_xml)
+
+      @relationships_hash = parse_relationships(relationships)
+
+      footnote_definitions = parse_footnotes(footnotes)
+      output_content = parse_content(content.elements.first,0)
+
+      return {
+        :content => output_content,
+        :footnote_definitions => footnote_definitions
+      }
+    end
+
+    private
+
+    def unzip_read(zip_path)
+      file = @zipfile.find_entry(zip_path)
+      contents = ""
+      file.get_input_stream do |f|
+        contents = f.read
+      end
+      return contents
+    end
+
+    # this is only needed for embedded images
+    def extract_image(zip_path)
+      file_contents = unzip_read(zip_path)
+      extract_basename = File.basename(zip_path, ".*") + ".jpg"
+      extract_path = File.join(@output_dir, @image_subdir_filesystem, extract_basename)
+
+      fm = FileMagic.new
+      filetype = fm.buffer(file_contents)
+      case filetype
+      when /^JPEG image data/, /^PNG image data/
+        img = Magick::Image.from_blob(file_contents)[0]
+        if img.columns > 800 || img.rows > 800
+          img.resize_to_fit!(800)
+        end
+        ret = img.write(extract_path) {
+          self.format = "JPEG"
+          self.quality = 80
+        }
+      end
+      if @image_subdir_kramdown.empty?
+        kramdown_path = extract_basename
+      else
+        kramdown_path = File.join(@image_subdir_kramdown, extract_basename)
+      end
+      return kramdown_path
+    end
+
+    def parse_relationships(relationships)
+      output = {}
+      relationships.children.first.children.each do |rel|
+        rel_id = rel.attributes["Id"].value
+        rel_target = rel.attributes["Target"].value
+        output[rel_id] = rel_target
+      end
+      return output
+    end
+
+    def parse_footnotes(node)
+      output = {}
+      node.xpath("//w:footnote").each do |fnode|
+        footnote_number = fnode.attributes["id"].value
+        if ["-1", "0"].include?(footnote_number)
+          # Word outputs -1 and 0 as 'magic' footnotes
+          next
+        end
+        output[footnote_number] = parse_content(fnode,0).strip
+      end
+      return output
+    end
+
+    def parse_content(node,depth)
+      output = ""
+      depth += 1
+      children_count = node.children.length
+      i = 0
+
+      while i < children_count
+        add = ""
+        nd = node.children[i]
+
+        case nd.name
+        when "body"
+          # This is just a container element.
+          add = parse_content(nd,depth)
+
+        when "document"
+          # This is just a container element.
+          add = parse_content(nd,depth)
+
+        when "p"
+          # This is a paragraph. In kramdown, paragraphs are separated by an empty line.
+          add = parse_content(nd,depth) + "\n\n"
+
+        when "pPr"
+          # This is Word's paragraph-level preset
+          add = parse_content(nd,depth)
+
+        when "pStyle"
+          # This is a reference to one of Word's paragraph-level styles
+          case nd["w:val"]
+          when "Title"
+            add = "{: .title }\n"
+          when "Heading1"
+            add = "# "
+          when "Heading2"
+            add = "## "
+          when "Quote"
+            add = "> "
+          end
+
+        when "r"
+          # This corresponds to Word's character/inline node. Word's XML is not nested for formatting, so we cannot descend recursively and 'close' kramdown's formatting in the recursion. Rather, we have to look ahead to see whether this node is formatted, and if so, set the formatting prefix and postfix which kramdown requires (e.g. **bold**).
+          prefix = postfix = ""
+          first_child = nd.children.first
+
+          case first_child.name
+          when "rPr"
+            # This inline node is formatted. The first child always specifies the formatting of the subsequent 't' (text) node.
+            format_node = first_child.children.first
+            case format_node.name
+            when "b"
+              # This is regular (non-style) bold
+              prefix = postfix = "**"
+            when "i"
+              # This is regular (non-style) italic
+              prefix = postfix = "*"
+            when "rStyle"
+              # This is a reference to one of Word's style names
+              case format_node.attributes["val"].value
+              when "Strong"
+                # "Strong" is a predefined Word style
+                # This node is missing the xml:space="preserve" attribute, so we need to set the spaces ourselves.
+                prefix = " **"
+                postfix = "** "
+              end
+            end
+            add = prefix + parse_content(nd,depth) + postfix
+          when "br"
+            if first_child.attributes.empty?
+              # This is a line break. In kramdown, this corresponds to two spaces followed by a newline.
+              add = "  \n"
+            elsif (type_attr = first_child.attributes["type"]) && type_attr.value == "page"
+              # this is a Word page break
+              add = "<br style='page-break-before:always;'>"
+            end
+
+          else
+            add = parse_content(nd,depth)
+          end
+
+
+        when "t"
+          # this is a regular text node
+          add = nd.text
+
+        when "footnoteReference"
+          # output the Kramdown footnote syntax
+          footnote_number = nd.attributes["id"].value
+          add = "[^#{ footnote_number }]"
+
+        when "tbl"
+          # parse the table recursively
+          add = parse_content(nd,depth)
+
+        when "tr"
+          # select all paragraph nodes below the table row and render them into Kramdown syntax
+          table_paragraphs = nd.xpath(".//w:p")
+          td = []
+          table_paragraphs.each do |tp|
+            td << parse_content(tp,depth)
+          end
+          add = "|" + td.join("|") + "|\n"
+
+        when "drawing"
+          image_nodes = nd.xpath(".//a:blip", :a => 'http://schemas.openxmlformats.org/drawingml/2006/main')
+          image_node = image_nodes.first
+          image_id = image_node.attributes["embed"].value
+          image_path_zip = File.join("word", @relationships_hash[image_id])
+
+          extracted_imagename = extract_image(image_path_zip)
+
+          add = "![](#{ extracted_imagename })\n"
+        else
+          # ignore those nodes
+          # puts ' ' * depth + "ELSE: #{ nd.name }"
+        end
+
+        output += add
+        i += 1
+      end
+
+      depth -= 1
+      return output
+    end
+
+  end
+end
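The comment on the `r` node explains that Word runs are flat, so the parser decides on a kramdown prefix and postfix per run instead of nesting. A minimal sketch of that idea, without any XML parsing, might look as follows. `Run`, `MARKERS`, `run_to_kramdown` and `paragraph_to_kramdown` are hypothetical names; the `Run` struct stands in for a Nokogiri node and is an assumption made for illustration only.

```ruby
# Simplified stand-in for a Word run: its text plus the name of the first
# formatting element found in its properties (nil when unformatted).
Run = Struct.new(:text, :format)

# Maps a formatting name to the kramdown prefix and postfix.
MARKERS = {
  "b" => ["**", "**"],
  "i" => ["*", "*"],
  "Strong" => [" **", "** "]
}

# Wraps the run's text in the markers that belong to its formatting.
def run_to_kramdown(run)
  prefix, postfix = MARKERS.fetch(run.format, ["", ""])
  prefix + run.text + postfix
end

# Joins all runs of a paragraph; kramdown paragraphs end with an empty line.
def paragraph_to_kramdown(runs)
  runs.map { |run| run_to_kramdown(run) }.join + "\n\n"
end

runs = [
  Run.new("This is ", nil),
  Run.new("bold", "b"),
  Run.new(" and ", nil),
  Run.new("italic", "i"),
  Run.new(".", nil)
]
puts paragraph_to_kramdown(runs)
```

As in the parser, only one formatting marker per run is considered, so a run that is both bold and italic would receive just one of the two.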
data/lib/docx_converter/postprocessor.rb
ADDED
@@ -0,0 +1,75 @@
+# encoding: UTF-8
+
+# docx_converter -- Converts Word docx files into html or LaTeX via the kramdown syntax
+# Copyright (C) 2013 Red (E) Tools Ltd. (www.thebigrede.net)
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU Affero General Public License as
+# published by the Free Software Foundation, either version 3 of the
+# License, or (at your option) any later version.
+#
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU Affero General Public License for more details.
+#
+# You should have received a copy of the GNU Affero General Public License
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+module DocxConverter
+  class PostProcessor
+
+    def initialize(content, footnote_definitions)
+      @content = content
+      @footnote_definitions = footnote_definitions
+
+      @chapters = []
+    end
+
+    # getter method
+    def chapters
+      return @chapters
+    end
+
+    def join_blockquotes
+      lines = @content.split("\n")
+      processed_lines = []
+      lines.size.times do |i|
+        if i > 0 && /^> /.match(lines[i-1]) && /^> /.match(lines[i+1])
+          processed_lines << ">" + lines[i]
+        else
+          processed_lines << lines[i]
+        end
+      end
+      @content = processed_lines.join("\n")
+      @chapters[0] = @content
+      return @content
+    end
+
+    def split_into_chapters
+      chapter_number = 0
+      @chapters[chapter_number] = ""
+      @content.split("\n").each do |line|
+        if /^# /.match(line)
+          # this is the style Heading1. A new chapter begins here.
+          chapter_number += 1
+          @chapters[chapter_number] = ""
+        end
+        @chapters[chapter_number] += line + "\n"
+      end
+      return @chapters
+    end
+
+    def add_foonote_definitions
+      @chapters.size.times do |n|
+        footnote_ids = @chapters[n].scan(/\[\^(.+?)\]/).flatten
+        @chapters[n] += "\n\n"
+        footnote_ids.each do |i|
+          @chapters[n] += "[^#{ i }]: #{ @footnote_definitions[i] }\n\n"
+        end
+      end
+    end
+
+  end
+end
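The chapter splitting and per-chapter footnote handling of the post-processor can be tried out in isolation with a small sketch. The two functions below are standalone approximations written for illustration, not the class's methods. In particular, the `uniq` call is an addition that the original code does not make, and the sample text is invented.

```ruby
# Splits kramdown text into chapters; a new chapter starts at every
# level-1 heading ("# "). Text before the first heading becomes chapter 0.
def split_into_chapters(content)
  chapters = [""]
  content.split("\n").each do |line|
    chapters << "" if line.start_with?("# ")
    chapters[-1] += line + "\n"
  end
  chapters
end

# Appends the definitions of the footnotes referenced in one chapter.
# Unlike the original, repeated references are listed only once (uniq).
def add_footnote_definitions(chapter, definitions)
  ids = chapter.scan(/\[\^(.+?)\]/).flatten.uniq
  chapter + "\n" + ids.map { |id| "[^#{id}]: #{definitions[id]}\n" }.join("\n")
end

content = "Preface\n\n# One\n\nText[^1]\n\n# Two\n\nMore[^2]\n"
definitions = { "1" => "First note", "2" => "Second note" }

chapters = split_into_chapters(content)
puts add_footnote_definitions(chapters[1], definitions)
```

Because the definitions are looked up per chapter, each output file carries only the footnotes it references, which is what allows the chapters to be rendered independently.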
data/lib/docx_converter/render.rb
ADDED
@@ -0,0 +1,119 @@
+# encoding: UTF-8
+
+# docx_converter -- Converts Word docx files into html or LaTeX via the kramdown syntax
+# Copyright (C) 2013 Red (E) Tools Ltd. (www.thebigrede.net)
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU Affero General Public License as
+# published by the Free Software Foundation, either version 3 of the
+# License, or (at your option) any later version.
+#
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU Affero General Public License for more details.
+#
+# You should have received a copy of the GNU Affero General Public License
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+module DocxConverter
+  class Render
+
+    def initialize(options)
+      @options = options
+      @output_dir = options[:output_dir]
+
+      @chapters = nil
+    end
+
+    def render(output_format)
+      FileUtils.mkdir_p(File.join(@output_dir, @options[:image_subdir_filesystem]))
+
+      output = DocxConverter::Parser.new(@options).parse
+
+      content = output[:content]
+      footnote_definitions = output[:footnote_definitions]
+
+      p = DocxConverter::PostProcessor.new(content, footnote_definitions)
+      p.join_blockquotes
+
+      if @options[:split_chapters] == true
+        p.split_into_chapters
+      end
+
+      p.add_foonote_definitions
+      @chapters = p.chapters
+
+      case output_format
+      when :kramdown
+        filepaths = render_kramdown
+      when :html
+        filepaths = render_html
+      when :latex
+        filepaths = render_latex
+      else
+        filepaths = []
+      end
+      return filepaths
+    end
+
+    private
+
+    def render_kramdown
+      # output uses the .page file extension. This is merely a convention taken from the webgen file system structure
+      rendered_kramdown_file_paths = []
+      if @options[:split_chapters] == true
+        @chapters.size.times do |n|
+          filename = "%02i.chapter%02i.%s.page" % [n, n, @options[:language]]
+          file_path = File.join(@output_dir, filename)
+          File.write(file_path, @chapters[n])
+          rendered_kramdown_file_paths << filename
+        end
+      else
+        filename = "01.chapter01.%s.page" % [@options[:language]]
+        file_path = File.join(@output_dir, filename)
+        File.write(file_path, @chapters[0])
+        rendered_kramdown_file_paths << filename
+      end
+      return rendered_kramdown_file_paths
+    end
+
+    def render_html
+      rendered_kramdown_file_paths = render_kramdown
+      rendered_html_file_paths = []
+      rendered_kramdown_file_paths.each do |kfp|
+        filename = kfp.gsub("page", "html")
+        file_path = File.join(@output_dir, filename)
+        kramdown = File.read(File.join(@output_dir, kfp))
+        html = Kramdown::Document.new(
+          kramdown,
+          :input => 'kramdown',
+          :line_width => 100000
+        ).to_html
+        File.write(file_path, html)
+        rendered_html_file_paths << filename
+      end
+      return rendered_html_file_paths
+    end
+
+    def render_latex
+      rendered_kramdown_file_paths = render_kramdown
+      rendered_latex_file_paths = []
+      rendered_kramdown_file_paths.each do |kfp|
+        filename = kfp.gsub("page", "tex")
+        file_path = File.join(@output_dir, filename)
+        kramdown = File.read(File.join(@output_dir, kfp))
+        latex = Kramdown::Document.new(
+          kramdown,
+          :input => 'kramdown',
+          :line_width => 100000
+        ).to_latex
+        File.write(file_path, latex)
+        rendered_latex_file_paths << filename
+      end
+      return rendered_latex_file_paths
+    end
+
+  end
+end
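The renderer derives the `.html` and `.tex` file names by replacing the string `page` anywhere in the kramdown file name. With the generated names (`NN.chapterNN.ll.page`) this works, but it would misfire if the base name ever contained `page`. A more defensive variant, sketched here with a hypothetical helper `swap_page_extension` that is not part of the gem, replaces only the trailing extension.

```ruby
# Hypothetical helper: swaps only the trailing ".page" extension and leaves
# any other occurrence of "page" in the file name untouched.
def swap_page_extension(filename, new_extension)
  filename.sub(/\.page\z/, ".#{new_extension}")
end

# A file name whose base name happens to contain "page" (invented example).
name = "01.homepage.en.page"

puts swap_page_extension(name, "html")
puts name.gsub("page", "html")
```

The first call yields `01.homepage.en.html`, whereas the plain `gsub` also rewrites the base name and yields `01.homehtml.en.html`.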
data/lib/docx_converter/version.rb
ADDED
@@ -0,0 +1,20 @@
+# docx_converter -- Converts Word docx files into html or LaTeX via the kramdown syntax
+# Copyright (C) 2013 Red (E) Tools Ltd. (www.thebigrede.net)
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU Affero General Public License as
+# published by the Free Software Foundation, either version 3 of the
+# License, or (at your option) any later version.
+#
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU Affero General Public License for more details.
+#
+# You should have received a copy of the GNU Affero General Public License
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+module DocxConverter
+  VERSION = "1.0.0"
+end
metadata
ADDED
@@ -0,0 +1,141 @@
+--- !ruby/object:Gem::Specification
+name: docx_converter
+version: !ruby/object:Gem::Version
+  version: 1.0.0
+prerelease:
+platform: ruby
+authors:
+- Michael Franzl
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2013-12-30 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: kramdown
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: nokogiri
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: rubyzip
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: rmagick
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: ruby-filemagic
+  requirement: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+description: Converts Word docx files into html or LaTeX via the kramdown syntax.
+  It supports Word's most common paragraph, character and mixed styles (Title, Heading,
+  Strong, Quote), footnotes, line breaks, page breaks, non-breaking spaces and images
+  with captions. The output is in kramdown syntax (see http://kramdown.gettalong.org/)
+  which can be converted into beautiful html and LaTeX code.
+email:
+- office@michaelfranzl.com
+executables:
+- docx_converter
+extensions: []
+extra_rdoc_files: []
+files:
+- .gitignore
+- Gemfile
+- README.md
+- Rakefile
+- bin/docx_converter
+- docx_converter.gemspec
+- lib/docx_converter.rb
+- lib/docx_converter/parser.rb
+- lib/docx_converter/postprocessor.rb
+- lib/docx_converter/render.rb
+- lib/docx_converter/version.rb
+homepage: https://github.com/michaelfranzl/docx_converter
+licenses: []
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project: docx_converter
+rubygems_version: 1.8.23
+signing_key:
+specification_version: 3
+summary: Converts Word docx files into html or LaTeX via the kramdown syntax
+test_files: []