html2doc 0.9.0 → 0.9.1
- checksums.yaml +4 -4
- data/README.adoc +5 -8
- data/lib/html2doc/lists.rb +45 -10
- data/lib/html2doc/math.rb +18 -14
- data/lib/html2doc/version.rb +1 -1
- data/spec/html2doc_spec.rb +24 -0
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: f1c601730137c578b807990f177d8ded7b35c644d9239b57f22207d61d0a080d
+  data.tar.gz: 93f31b6831c3b9a32eb26165e0d702e7d0a92335fffc8660e3fd8215fbfc1d85
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: fcbc14ccff1c8ac0e11796def1bc0b15a64425ed09a5a1960ec5a26b1b486f7d7cf924ef19906c654729d6f4040a903006c1028189b6bbc3c76af78e0e612f11
+  data.tar.gz: 07f2968ad98ee9607c5b2c3b07c978c763cd276c83415399501f09a59a649b6a30a05b2bcbbeaa99e7dd50a148a6997c0f6d2f990406aefd40a5067a2bdb7fa1
data/README.adoc
CHANGED
@@ -3,14 +3,11 @@
 https://github.com/metanorma/html2doc/workflows/main/badge.svg

 image:https://img.shields.io/gem/v/html2doc.svg["Gem Version", link="https://rubygems.org/gems/html2doc"]
-image:https://
-image:https://
-image:https://github.com/metanorma/html2doc/workflows/windows/badge.svg["Build Status", link="https://github.com/metanorma/html2doc/actions?workflow=windows"]
+image:https://travis-ci.com/metanorma/html2doc.svg["Build Status", link="https://travis-ci.com/metanorma/html2doc"]
+image:https://ci.appveyor.com/api/projects/status/aspj42o70q3dnkf1?svg=true["Appveyor Build Status", link="https://ci.appveyor.com/project/metanorma/html2doc"]
 image:https://codeclimate.com/github/metanorma/html2doc/badges/gpa.svg["Code Climate", link="https://codeclimate.com/github/metanorma/html2doc"]
-
-
-image:https://ci.appveyor.com/api/projects/status/reqae7y99cfd0yod?svg=true["Appveyor Build Status", link="https://ci.appveyor.com/project/ribose/html2doc"]
-////
+image:https://img.shields.io/github/issues-pr-raw/metanorma/html2doc.svg["Pull Requests", link="https://github.com/metanorma/html2doc/pulls"]
+image:https://img.shields.io/github/commits-since/metanorma/html2doc/latest.svg["Commits since latest",link="https://github.com/metanorma/html2doc/releases"]

 == Purpose

@@ -68,7 +65,7 @@ stylesheet:: is the full path filename of the CSS stylesheet for Microsoft Word-
 header_filename:: is the filename of the HTML document containing header and footer for the document, as well as footnote/endnote separators; if there is none, use nil. To generate your own such document, save a Word document with headers/footers and/or footnote/endnote separators as an HTML document; the `header.html` will be in the `{filename}.fld` folder generated along with the HTML. A sample file is available at https://github.com/metanorma/metanorma-iso/blob/master/lib/asciidoctor/iso/word/header.html
 dir:: is the folder that any ancillary files (images, headers, filelist) are to be saved to. If not provided, it will be created as `{filename}_files`. Anything in the directory will be attached to the Word document; so this folder should only contain the images that accompany the document. (If the images are elsewhere on the local drive, the gem will move them into the folder. External URL images are left alone, and are not downloaded.)
 asciimathdelims:: are the AsciiMath delimiters used in the text (an array of an opening and a closing delimiter). If none are provided, no AsciiMath conversion is attempted.
-liststyles:: a hash of list style labels in Word CSS, which are used to define the behaviour of list item labels (e.g. _i)_ vs _i._). The gem recognises the hash keys `ul`, `ol`. So if the appearance of an ordered list's item labels in the supplied stylesheet is governed by style `@list l1` (e.g. `@list l1:level1 {mso-level-text:"%1\)";}` appears in the stylesheet), call the method with `liststyles:{ol: "l1"}`.
+liststyles:: a hash of list style labels in Word CSS, which are used to define the behaviour of list item labels (e.g. _i)_ vs _i._). The gem recognises the hash keys `ul`, `ol`. So if the appearance of an ordered list's item labels in the supplied stylesheet is governed by style `@list l1` (e.g. `@list l1:level1 {mso-level-text:"%1\)";}` appears in the stylesheet), call the method with `liststyles:{ol: "l1"}`. The lists that the `ul` and `ol` list styles are applied to are assumed not to have any CSS class. If there any additional hash keys, they are assumed to be classes applied to the topmost ordered or unordered list; e.g. `liststyles:{steps: "l5"}` means that any list with class `steps` at the topmost level has the list style `l5` recursively applied to it. Any top-level lists without a class named in liststyles will be treated like lists with no CSS class.

 Note that the local CSS stylesheet file contains a variable `FILENAME` for the location of footnote/endnote separators and headers/footers, which are provided in the header HTML file. The gem replaces `FILENAME` with the file name that the document will be saved as. If you supply your own stylesheet and also wish to use separators or headers/footers, you will likewise need to replace the document name mentioned in your stylesheet with a `FILENAME` string.

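The `liststyles` rules added to the README in this release can be illustrated with a small stand-alone sketch. `liststyle_for` is a hypothetical helper written for this illustration and is not part of html2doc; it only mirrors the README's description of how a top-level list's element name and CSS class select a style label, and it does not walk nested lists or touch any document.

```ruby
# Hypothetical helper (not part of html2doc): choose the Word list style
# label for a top-level list, following the README's description of the
# liststyles option.
def liststyle_for(liststyles, tag, css_class = nil)
  class_key = css_class&.to_sym
  # A class that is named in liststyles takes precedence over the tag.
  return liststyles[class_key] if class_key && liststyles.key?(class_key)

  # Otherwise the list is treated as if it had no class: use :ul or :ol.
  liststyles[tag.to_sym]
end

styles = { ul: "l1", ol: "l2", steps: "l3" }

puts liststyle_for(styles, "ul", "steps") # class named in liststyles
puts liststyle_for(styles, "ul")          # no class
puts liststyle_for(styles, "ul", "other") # class not named in liststyles
```

With the hash used in the spec added by this release, the three top-level `ul` elements resolve to `l3`, `l1`, and `l1`, which agrees with the `mso-list` labels at `level1` in that spec's expected output.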
data/lib/html2doc/lists.rb
CHANGED
@@ -15,13 +15,23 @@ module Html2Doc
     li["style"] += "mso-list:#{liststyle} level#{level} lfo#{listnumber};"
   end

-  def self.list_add(xpath, liststyles, listtype, level
+  def self.list_add(xpath, liststyles, listtype, level)
     xpath.each_with_index do |list, i|
-      listnumber
+      @listnumber += 1 if level == 1
+      list["seen"] = true if level == 1
       (list.xpath(".//li") - list.xpath(".//ol//li | .//ul//li")).each do |li|
-        style_list(li, level, liststyles[listtype], listnumber)
-
-
+        style_list(li, level, liststyles[listtype], @listnumber)
+        if [:ul, :ol].include? listtype
+          list_add(li.xpath(".//ul") - li.xpath(".//ul//ul | .//ol//ul"),
+                   liststyles, :ul, level + 1)
+          list_add(li.xpath(".//ol") - li.xpath(".//ul//ol | .//ol//ol"),
+                   liststyles, :ol, level + 1)
+        else
+          list_add(li.xpath(".//ul") - li.xpath(".//ul//ul | .//ol//ul"),
+                   liststyles, listtype, level + 1)
+          list_add(li.xpath(".//ol") - li.xpath(".//ul//ol | .//ol//ol"),
+                   liststyles, listtype, level + 1)
+        end
       end
     end
   end
@@ -34,17 +44,42 @@ module Html2Doc
     u.xpath("./li").each do |l|
       l.name = "p"
       l["class"] ||= "MsoListParagraphCxSpMiddle"
-      l&.first_element_child&.name == "p" and
+      l&.first_element_child&.name == "p" and
+        l.first_element_child.replace(l.first_element_child.children)
     end
     u.replace(u.children)
   end

+  TOPLIST = "[not(ancestor::ul) and not(ancestor::ol)]".freeze
+
+  def self.lists1(docxml, liststyles, k)
+    case k
+    when :ul then list_add(docxml.xpath("//ul[not(@class)]#{TOPLIST}"),
+                           liststyles, :ul, 1)
+    when :ol then list_add(docxml.xpath("//ol[not(@class)]#{TOPLIST}"),
+                           liststyles, :ol, 1)
+    else
+      list_add(docxml.xpath("//ol[@class = '#{k.to_s}']#{TOPLIST} | "\
+                            "//ul[@class = '#{k.to_s}']#{TOPLIST}"),
+               liststyles, k, 1)
+    end
+  end
+
+  def self.lists_unstyled(docxml, liststyles)
+    list_add(docxml.xpath("//ul#{TOPLIST}[not(@seen)]"),
+             liststyles, :ul, 1) if liststyles.has_key?(:ul)
+    list_add(docxml.xpath("//ol#{TOPLIST}[not(@seen)]"),
+             liststyles, :ul, 1) if liststyles.has_key?(:ol)
+    docxml.xpath("//ul[@seen] | //ol[@seen]").each do |l|
+      l.delete("seen")
+    end
+  end
+
   def self.lists(docxml, liststyles)
     return if liststyles.nil?
-
-
-    liststyles
-    list_add(docxml.xpath("//ol[not(ancestor::ul) and not(ancestor::ol)]"), liststyles, :ol, 1, nil)
+    @listnumber = 0
+    liststyles.each_key { |k| lists1(docxml, liststyles, k) }
+    lists_unstyled(docxml, liststyles)
     liststyles.has_key?(:ul) and docxml.xpath("//ul").each { |u| list2para(u) }
     liststyles.has_key?(:ol) and docxml.xpath("//ol").each { |u| list2para(u) }
   end
data/lib/html2doc/math.rb
CHANGED
@@ -4,7 +4,9 @@ require "htmlentities"
 require "nokogiri"

 module Html2Doc
-  @xsltemplate =
+  @xsltemplate =
+    Nokogiri::XSLT(File.read(File.join(File.dirname(__FILE__), "mml2omml.xsl"),
+                             encoding: "utf-8"))

   def self.asciimath_to_mathml1(x)
     AsciiMath.parse(HTMLEntities.new.decode(x)).to_mathml.
@@ -15,7 +17,8 @@ module Html2Doc
     return doc if delims.nil? || delims.size < 2
     m = doc.split(/(#{Regexp.escape(delims[0])}|#{Regexp.escape(delims[1])})/)
     m.each_slice(4).map.with_index do |(*a), i|
-
+      i % 500 == 0 && m.size > 1000 && i > 0 and
+        warn "MathML #{i} of #{(m.size / 4).floor}"
       a[2].nil? || a[2] = asciimath_to_mathml1(a[2])
       a.size > 1 ? a[0] + a[2] : a[0]
     end.join
@@ -23,10 +26,10 @@ module Html2Doc

   # random fixes to MathML input that OOXML needs to render properly
   def self.ooxml_cleanup(m, docnamespaces)
-    m.xpath(
-            docnamespaces).each do |x|
-
-
+    m.xpath(%w(msup msub msubsup munder mover munderover).
+            map { |m| ".//xmlns:#{m}" }.join(" | "), docnamespaces).each do |x|
+      next unless x.next_element && x.next_element != "mrow"
+      x.next_element.wrap("<mrow/>")
     end
     m.add_namespace(nil, "http://www.w3.org/1998/Math/MathML")
     m
@@ -36,22 +39,22 @@ module Html2Doc
     docnamespaces = docxml.collect_namespaces
     m = docxml.xpath("//*[local-name() = 'math']")
     m.each_with_index do |x, i|
-
+      i % 100 == 0 && m.size > 500 && i > 0 and
+        warn "Math OOXML #{i} of #{m.size}"
       element = ooxml_cleanup(x, docnamespaces)
-
       doc = Nokogiri::XML::Document::new()
       doc.root = element
-
-      ooxml = @xsltemplate.transform(doc).to_s.
+      ooxml = (esc_space(@xsltemplate.transform(doc))).to_s.
         gsub(/<\?[^>]+>\s*/, "").
         gsub(/ xmlns(:[^=]+)?="[^"]+"/, "").
         gsub(%r{<(/)?([a-z])}, "<\\1m:\\2")
-      ooxml = uncenter(
+      ooxml = uncenter(x, ooxml)
       x.swap(ooxml)
     end
   end

-  # escape space as 2; we are removing any spaces generated by
+  # escape space as 2; we are removing any spaces generated by
+  # XML indentation
   def self.esc_space(xml)
     xml.traverse do |n|
       next unless n.text?
@@ -64,8 +67,9 @@ module Html2Doc
   # left/right if parent is so tagged
   def self.uncenter(m, ooxml)
     if m.next == nil && m.previous == nil
-      alignnode = m.at(".//ancestor::*[@style][local-name() = 'p' or
-                       "'div' or local-name() = 'td']/@style")
+      alignnode = m.at(".//ancestor::*[@style][local-name() = 'p' or "\
+                       "local-name() = 'div' or local-name() = 'td']/@style")
+      return ooxml unless alignnode
       if alignnode.text.include? ("text-align:left")
         ooxml = "<m:oMathPara><m:oMathParaPr><m:jc "\
           "m:val='left'/></m:oMathParaPr>#{ooxml}</m:oMathPara>"
data/lib/html2doc/version.rb
CHANGED
data/spec/html2doc_spec.rb
CHANGED
@@ -641,6 +641,30 @@ RSpec.describe Html2Doc do
     OUTPUT
   end

+  it "labels lists with multiple list styles" do
+    simple_body = <<~BODY
+    <div><ul class="steps">
+    <li><div><p><ol><li><ul><li><p><ol><li><ol><li>A</li><li><p>B</p><p>B2</p></li><li>C</li></ol></li></ol></p></li></ul></li></ol></p></div></li></ul></div>
+    <div><ul>
+    <li><div><p><ol><li><ul><li><p><ol><li><ol><li>A</li><li><p>B</p><p>B2</p></li><li>C</li></ol></li></ol></p></li></ul></li></ol></p></div></li></ul></div>
+    <div><ul class="other">
+    <li><div><p><ol><li><ul><li><p><ol><li><ol><li>A</li><li><p>B</p><p>B2</p></li><li>C</li></ol></li></ol></p></li></ul></li></ol></p></div></li></ul></div>
+    BODY
+    Html2Doc.process(html_input(simple_body), filename: "test", liststyles: {ul: "l1", ol: "l2", steps: "l3"})
+    expect(guid_clean(File.read("test.doc", encoding: "utf-8"))).
+      to match_fuzzy(<<~OUTPUT)
+    #{WORD_HDR} #{DEFAULT_STYLESHEET} #{WORD_HDR_END}
+    #{word_body('<div>
+    <p style="mso-list:l3 level1 lfo2;" class="MsoListParagraphCxSpFirst"><div><p class="MsoNormal"><p style="mso-list:l3 level2 lfo2;" class="MsoListParagraphCxSpFirst"><p style="mso-list:l3 level4 lfo2;" class="MsoListParagraphCxSpFirst"><p style="mso-list:l3 level5 lfo2;" class="MsoListParagraphCxSpFirst">A</p><p style="mso-list:l3 level5 lfo2;" class="MsoListParagraphCxSpMiddle">B<p class="MsoListParagraphCxSpMiddle">B2</p></p><p style="mso-list:l3 level5 lfo2;" class="MsoListParagraphCxSpLast">C</p></p></p></p></div></p></div>
+    <div>
+    <p style="mso-list:l1 level1 lfo1;" class="MsoListParagraphCxSpFirst"><div><p class="MsoNormal"><p style="mso-list:l2 level2 lfo1;" class="MsoListParagraphCxSpFirst"><p style="mso-list:l2 level4 lfo1;" class="MsoListParagraphCxSpFirst"><p style="mso-list:l2 level5 lfo1;" class="MsoListParagraphCxSpFirst">A</p><p style="mso-list:l2 level5 lfo1;" class="MsoListParagraphCxSpMiddle">B<p class="MsoListParagraphCxSpMiddle">B2</p></p><p style="mso-list:l2 level5 lfo1;" class="MsoListParagraphCxSpLast">C</p></p></p></p></div></p></div>
+    <div>
+    <p style="mso-list:l1 level1 lfo3;" class="MsoListParagraphCxSpFirst"><div><p class="MsoNormal"><p style="mso-list:l2 level2 lfo3;" class="MsoListParagraphCxSpFirst"><p style="mso-list:l2 level4 lfo3;" class="MsoListParagraphCxSpFirst"><p style="mso-list:l2 level5 lfo3;" class="MsoListParagraphCxSpFirst">A</p><p style="mso-list:l2 level5 lfo3;" class="MsoListParagraphCxSpMiddle">B<p class="MsoListParagraphCxSpMiddle">B2</p></p><p style="mso-list:l2 level5 lfo3;" class="MsoListParagraphCxSpLast">C</p></p></p></p></div></p></div>',
+    '<div style="mso-element:footnote-list"/>')}
+    #{WORD_FTR1}
+    OUTPUT
+  end
+
   it "replaces id attributes with explicit a@name bookmarks" do
     simple_body = <<~BODY
     <div>
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: html2doc
 version: !ruby/object:Gem::Version
-  version: 0.9.
+  version: 0.9.1
 platform: ruby
 authors:
 - Ribose Inc.
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2019-
+date: 2019-11-08 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: htmlentities