html2doc 0.8.4 → 0.8.5
- checksums.yaml +4 -4
- data/Gemfile.lock +1 -1
- data/README.adoc +6 -3
- data/lib/html2doc/lists.rb +17 -4
- data/lib/html2doc/mime.rb +18 -12
- data/lib/html2doc/version.rb +1 -1
- data/spec/html2doc_spec.rb +11 -6
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 36da12f189ac6654c4024d328fe2ceca7c171e5b8aa834c291f2ebef568a44b6
+  data.tar.gz: 38f446130c868c6368752f39a6f74c632ceab4c6e3016b48f63a454f3b1a7d22
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: f7d6ba90a533613f18e7edbb1bf1287109b750e97faf2d05800b9da7c7058e5be1f8ce0765a7546369ae528d332cf60aa64ea91781b45c902f74eccec87f682b
+  data.tar.gz: 73623c29bfbd99f6c089dd86d2bd25ba78e6096145f17bcc5a6efd9a28f2c17ff0b9a7baed55574decfb05f068f28257dedeab8c89b610e5b2df0389a40ba714
data/Gemfile.lock
CHANGED
data/README.adoc
CHANGED
@@ -23,6 +23,7 @@ The gem currently does the following:
 * Identify any footnotes in the document (defined as hyperlinks with attributes `class = "Footnote"` or `epub:type = "footnote"`), and render them as Microsoft Word footnotes.
 * Resize any local images in the HTML file to fit within the maximum page size. (Word will otherwise crash on reading the document.)
 * Optionally apply list styles with predefined bullet and numbering from a Word CSS to the unordered and ordered lists in the document, restarting numbering for each ordered list.
+* Convert all lists to native Word HTML rendering (using paragraphs with `MsoListParagraphCxSpFirst, MsoListParagraphCxSpMiddle, MsoListParagraphCxSpLast` styles)
 * Convert any internal `@id` anchors to `a@name` anchors; Word only hyperlinks to the latter.
 * Generate a filelist.xml listing of all files to be bundled into the Word document.
 * Assign the class `MsoNormal` to any paragraphs that do not have a class, so that they can be treated as Normal Style when editing the Word document.
@@ -33,7 +34,9 @@ For a representative generator of HTML that uses this gem in postprocessing, see
 
 == Constraints
 
-This generates `.doc` documents. Future versions may upgrade the output to `docx`.
+This gem generates `.doc` documents. Future versions may upgrade the output to `docx`.
+
+Because `.doc` is the format of an older version of Microsoft Word, the output of this gem do *not* support SVG graphics. (Word itself converts SVG into PNG when it saves documents as Word HTML, which is the input to this gem.)
 
 There there are two other Microsoft Word vendors in the Ruby ecosystem.
 
@@ -99,7 +102,7 @@ Here are the steps to convert our output into native-`docx`.
 
 The good news is that Word understands HTML.
 
-The bad news is that Word's understanding of HTML is HTML 4. In order for bookmarks to work, for example, this gem has to translate `<p id="">` back down into `<p><a name="">`. Word (and this gem) will not do much with HTML 5-specific elements, and if you're generating HTML for automated generation of Word documents, keep your HTML old-fashioned.
+The bad news is that Word's understanding of HTML is HTML 4. In order for bookmarks to work, for example, this gem has to translate `<p id="">` back down into `<p><a name="">`. Word (and this gem) will not do much with HTML 5-specific elements (or SVG graphics), and if you're generating HTML for automated generation of Word documents, you need to keep your HTML old-fashioned.
 
 === CSS
 
@@ -116,7 +119,7 @@ The good news is that the stylesheet is not identical to the stylesheet `mathml2
 The bad news is that the stylesheet is not identical to the stylesheet `mathml2omml.xsl` that is published with Microsoft Word, so it isn't guaranteed to have identical output. If you want to make sure that your MathML import is identical to what Word currently uses, replace `mml2omml.xsl` with `mathml2omml.xsl`, and edit the gem accordingly for your local installation. On Windows, you will find the stylesheet in the same directory as the `winword.exe` executable. On Mac, right-click on the Word application, and select "Show Package Contents"; you will find the stylesheet under `Contents/Resources`.
 
 === Lists
-Natively, Word does not use `<ol>`, `<ul>`, or `<dl>` lists in its HTML exports at all: it uses paragraphs styled with list styles. If you save a Word document as HTML in order to use its CSS for Word documents generated by HTML, those styles will still work (with the caveat that you will need to extract the `@list` style specific to ordered and unordered lists, and pass it as a `liststyles` parameter to the conversion).
+Natively, Word does not use `<ol>`, `<ul>`, or `<dl>` lists in its HTML exports at all: it uses paragraphs styled with list styles. If you save a Word document as HTML in order to use its CSS for Word documents generated by HTML, those styles will still work (with the caveat that you will need to extract the `@list` style specific to ordered and unordered lists, and pass it as a `liststyles` parameter to the conversion). Word HTML understands `<ol>, <ul>, <li>`, but its rendering is fragile: in particular, any instance of `<p>` within a `<li>` is treated as a new list item (so Word HTML will not let you have multi-paragraph list items if you use native HTML.) This gem now exports lists as Word HTML prefers to see them, with `MsoListParagraphCxSpFirst, MsoListParagraphCxSpMiddle, MsoListParagraphCxSpLast` styles. You will need to include these in the CSS stylesheet you supply, in order to get the right indentation for lists.
 
 === Math Positioning
 By default, mathematical formulas that are the only content of their paragraph are rendered as centered in Word. If you want your AsciiMath or MathML to be left-aligned or right-aligned, add `style="text-align:left"` or `style="text-align:right"` to its ancestor `div`, `p` or `td` node in HTML.
data/lib/html2doc/lists.rb
CHANGED
@@ -27,13 +27,26 @@ module Html2Doc
     end
   end
 
+  def self.list2para(u)
+    return if u.xpath("./li").empty?
+    u.xpath("./li").last["class"] = "MsoListParagraphCxSpLast"
+    u.xpath("./li").first["class"] = "MsoListParagraphCxSpFirst"
+    u.xpath("./li/p").each { |p| p["class"] ||= "MsoListParagraphCxSpMiddle" }
+    u.xpath("./li").each do |l|
+      l.name = "p"
+      l["class"] ||= "MsoListParagraphCxSpMiddle"
+      l&.first_element_child&.name == "p" and l.first_element_child.replace(l.first_element_child.children)
+    end
+    u.replace(u.children)
+  end
+
   def self.lists(docxml, liststyles)
     return if liststyles.nil?
-
+    liststyles.has_key?(:ul) and
       list_add(docxml.xpath("//ul[not(ancestor::ul) and not(ancestor::ol)]"), liststyles, :ul, 1, nil)
-
-    if liststyles.has_key?(:ol)
+    liststyles.has_key?(:ol) and
       list_add(docxml.xpath("//ol[not(ancestor::ul) and not(ancestor::ol)]"), liststyles, :ol, 1, nil)
-
+    liststyles.has_key?(:ul) and docxml.xpath("//ul").each { |u| list2para(u) }
+    liststyles.has_key?(:ol) and docxml.xpath("//ol").each { |u| list2para(u) }
   end
 end
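The class assignment in `list2para` can be tried without Nokogiri. The following is a minimal sketch in plain Ruby, assuming list items are given as strings; `list_paragraph_classes` and `items_to_paragraphs` are hypothetical names, not part of html2doc. It mirrors only the ordering of the assignments in the hunk above (Last first, then First, then Middle as the default), not the DOM rewriting.

```ruby
# Hypothetical helper (not part of html2doc): returns the Word list-paragraph
# class for each of `count` list items, in the order list2para assigns them.
def list_paragraph_classes(count)
  classes = Array.new(count, "MsoListParagraphCxSpMiddle")
  return classes if classes.empty?

  classes[-1] = "MsoListParagraphCxSpLast"
  # Assigned after Last, so a one-item list ends up as First.
  classes[0] = "MsoListParagraphCxSpFirst"
  classes
end

# Hypothetical helper: renders plain-text items as classed <p> elements.
# The item text is inserted as-is, without HTML escaping.
def items_to_paragraphs(items)
  list_paragraph_classes(items.size).zip(items).map do |klass, text|
    %(<p class="#{klass}">#{text}</p>)
  end
end

puts items_to_paragraphs(%w[A B C])
```

A single-item list receives `MsoListParagraphCxSpFirst`, which matches the expected output in the updated spec, where the lone item `A` carries that class.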
data/lib/html2doc/mime.rb
CHANGED
@@ -73,13 +73,21 @@ module Html2Doc
 
   IMAGE_PATH = "//*[local-name() = 'img' or local-name() = 'imagedata']".freeze
 
+  def self.mkuuid
+    UUIDTools::UUID.random_create.to_s
+  end
+
+  def self.warnsvg(src)
+    warn "#{src}: SVG not supported" if /\.svg$/i.match(src)
+  end
+
   # only processes locally stored images
   def self.image_cleanup(docxml, dir)
     docxml.xpath(IMAGE_PATH).each do |i|
-      next if /^http/.match i["src"]
       matched = /\.(?<suffix>\S+)$/.match i["src"]
-
-
+      warnsvg(i["src"])
+      next if /^http/.match i["src"]
+      new_full_filename = File.join(dir, "#{mkuuid}.#{matched[:suffix]}")
       FileUtils.cp i["src"], new_full_filename
       i["width"], i["height"] = image_resize(i, 680, 400)
       i["src"] = new_full_filename
@@ -96,15 +104,13 @@ module Html2Doc
   end
 
   def self.header_image_cleanup1(a, dir, filename)
-    if a.size == 2
-
-
-
-
-
-
-      FileUtils.cp matched[:src], dest_filename
-      a[1].sub!(%r{ src=['"](?<src>[^"']+)['"]}, " src='#{new_full_filename}'")
+    if a.size == 2 && !(/ src="https?:/.match a[1])
+      m = / src=['"](?<src>[^"']+)['"]/.match a[1]
+      warnsvg(m[:src])
+      m2 = /\.(?<suffix>\S+)$/.match m[:src]
+      new_filename = "file:///C:/Doc/#{filename}_files/#{mkuuid}.#{m2[:suffix]}"
+      FileUtils.cp m[:src], File.join(dir, "#{mkuuid}.#{m2[:suffix]}")
+      a[1].sub!(%r{ src=['"](?<src>[^"']+)['"]}, " src='#{new_filename}'")
     end
     a.join
   end
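As the hunk above reads, `header_image_cleanup1` calls `mkuuid` once for `new_filename` and again for the `FileUtils.cp` destination, so the name written into the header and the name of the copied file would differ. The following is a minimal sketch of generating the name once and deriving both paths from it. It is an illustration, not the gem's code: `header_image_names` is a hypothetical helper, `SecureRandom.uuid` from the standard library stands in for `UUIDTools`, and `File.extname` stands in for the gem's suffix regex (the two disagree on names with more than one dot).

```ruby
require "securerandom"

# Hypothetical helper (not part of html2doc): builds the on-disk copy target
# and the src URL for the Word header from one generated file name.
def header_image_names(src, dir, filename)
  suffix = File.extname(src).delete_prefix(".")
  basename = "#{SecureRandom.uuid}.#{suffix}" # generated once, used twice
  {
    copy_to: File.join(dir, basename),
    word_src: "file:///C:/Doc/#{filename}_files/#{basename}",
  }
end

names = header_image_names("logo.png", "test_files", "test")
puts names[:copy_to]
puts names[:word_src]
```

Both values end in the same generated file name, so the reference in the header points at the file that was actually copied.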
data/lib/html2doc/version.rb
CHANGED
data/spec/html2doc_spec.rb
CHANGED
@@ -383,7 +383,7 @@ RSpec.describe Html2Doc do
   end
 
   it "processes AsciiMath" do
-    Html2Doc.process(html_input("<div>{{sum_(i=1)^n i^3=((n(n+1))/2)^2}}</div>"), filename: "test", asciimathdelims: ["{{", "}}"])
+    Html2Doc.process(html_input("<div>{{sum_(i=1)^n i^3=((n(n+1))/2)^2)}}</div>"), filename: "test", asciimathdelims: ["{{", "}}"])
     expect(guid_clean(File.read("test.doc", encoding: "utf-8"))).
       to match_fuzzy(<<~OUTPUT)
     #{WORD_HDR} #{DEFAULT_STYLESHEET} #{WORD_HDR_END}
@@ -588,6 +588,11 @@ RSpec.describe Html2Doc do
     OUTPUT
   end
 
+  it "warns about SVG" do
+    simple_body = '<img src="https://example.com/19160-6.svg">'
+    expect{ Html2Doc.process(html_input(simple_body), filename: "test") }.to output("https://example.com/19160-6.svg: SVG not supported\n").to_stderr
+  end
+
   it "processes epub:type footnotes" do
     simple_body = '<div>This is a very simple
     document<a epub:type="footnote" href="#a1">1</a> allegedly<a epub:type="footnote" href="#a2">2</a></div>
@@ -651,14 +656,14 @@ RSpec.describe Html2Doc do
   it "labels lists with list styles" do
     simple_body = <<~BODY
     <div><ul>
-    <li><div><p><ol><li><ul><li><p><ol><li><ol><li>A</li></ol></li></ol></p></li></ul></li></ol></p></div></li></ul></div>
+    <li><div><p><ol><li><ul><li><p><ol><li><ol><li>A</li><li><p>B</p><p>B2</p></li><li>C</li></ol></li></ol></p></li></ul></li></ol></p></div></li></ul></div>
     BODY
     Html2Doc.process(html_input(simple_body), filename: "test", liststyles: {ul: "l1", ol: "l2"})
     expect(guid_clean(File.read("test.doc", encoding: "utf-8"))).
       to match_fuzzy(<<~OUTPUT)
     #{WORD_HDR} #{DEFAULT_STYLESHEET} #{WORD_HDR_END}
-    #{word_body('<div
-
+    #{word_body('<div>
+    <p style="mso-list:l1 level1 lfo1;" class="MsoListParagraphCxSpFirst"><div><p class="MsoNormal"><p style="mso-list:l2 level2 lfo1;" class="MsoListParagraphCxSpFirst"><p style="mso-list:l2 level4 lfo1;" class="MsoListParagraphCxSpFirst"><p style="mso-list:l2 level5 lfo1;" class="MsoListParagraphCxSpFirst">A</p><p style="mso-list:l2 level5 lfo1;" class="MsoListParagraphCxSpMiddle">B<p class="MsoListParagraphCxSpMiddle">B2</p></p><p style="mso-list:l2 level5 lfo1;" class="MsoListParagraphCxSpLast">C</p></p></p></p></div></p></div>',
     '<div style="mso-element:footnote-list"/>')}
     #{WORD_FTR1}
     OUTPUT
@@ -676,8 +681,8 @@ RSpec.describe Html2Doc do
       to match_fuzzy(<<~OUTPUT)
     #{WORD_HDR} #{DEFAULT_STYLESHEET} #{WORD_HDR_END}
     #{word_body('<div>
-    <
-    <
+    <p style="mso-list:l2 level1 lfo1;" class="MsoListParagraphCxSpFirst"><div><p class="MsoNormal"><p style="mso-list:l2 level2 lfo1;" class="MsoListParagraphCxSpFirst"><p style="mso-list:l2 level4 lfo1;" class="MsoListParagraphCxSpFirst"><p style="mso-list:l2 level5 lfo1;" class="MsoListParagraphCxSpFirst">A</p></p></p></p></div></p>
+    <p style="mso-list:l2 level1 lfo2;" class="MsoListParagraphCxSpFirst"><div><p class="MsoNormal"><p style="mso-list:l2 level2 lfo2;" class="MsoListParagraphCxSpFirst"><p style="mso-list:l2 level4 lfo2;" class="MsoListParagraphCxSpFirst"><p style="mso-list:l2 level5 lfo2;" class="MsoListParagraphCxSpFirst">A</p></p></p></p></div></p></div>',
     '<div style="mso-element:footnote-list"/>')}
     #{WORD_FTR1}
     OUTPUT
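The "warns about SVG" spec uses a remote `https://` image, and it still expects the warning. That is consistent with `image_cleanup` in `mime.rb` calling `warnsvg` before the `next if /^http/` guard. The following is a minimal sketch for experimenting with the check outside the gem, assuming the same regular expression as `warnsvg`; `svg_warning` is a hypothetical name that returns the message instead of writing it to stderr.

```ruby
# Hypothetical re-statement of the warnsvg check (not part of html2doc):
# returns the warning text for an SVG source, or nil for anything else.
def svg_warning(src)
  "#{src}: SVG not supported" if /\.svg$/i.match(src)
end

# Sample sources are made up for illustration, apart from the spec's URL.
%w[https://example.com/19160-6.svg diagram.SVG photo.png drawing.svg?v=2].each do |src|
  puts(svg_warning(src) || "#{src}: ok")
end
```

The pattern is anchored at the end of the string and is case-insensitive, so `diagram.SVG` is flagged while a source with a trailing query string such as `drawing.svg?v=2` is not.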
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: html2doc
 version: !ruby/object:Gem::Version
-  version: 0.8.
+  version: 0.8.5
 platform: ruby
 authors:
 - Ribose Inc.
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2018-10-
+date: 2018-10-08 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: htmlentities