html2doc 0.8.4 → 0.8.5

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: ebe0e5db8814308a2e3d8ade848d2a7a0323d2caf2e87f471d333edcf271ab8b
- data.tar.gz: 6a0b6944d46e3a05aece3a52f445d00425ac78fa6d4e66a08adc9697618a7df4
+ metadata.gz: 36da12f189ac6654c4024d328fe2ceca7c171e5b8aa834c291f2ebef568a44b6
+ data.tar.gz: 38f446130c868c6368752f39a6f74c632ceab4c6e3016b48f63a454f3b1a7d22
  SHA512:
- metadata.gz: 78b8a0b62a7420145b032a40004dd97cb05851d700c182ca9fc6bc94884c1ae4f5eb917fae3e0084ef189db3d41d1338afe6a6dc6415e8445848e72db7642b85
- data.tar.gz: 5245dc6ae3dbb947f47b1fb21f775b6021385bba13449014fd5dde1d55cd572e52abc2019b7430ad42ead2f55c5ccb165c495d359ed6d9bb8010875b14988322
+ metadata.gz: f7d6ba90a533613f18e7edbb1bf1287109b750e97faf2d05800b9da7c7058e5be1f8ce0765a7546369ae528d332cf60aa64ea91781b45c902f74eccec87f682b
+ data.tar.gz: 73623c29bfbd99f6c089dd86d2bd25ba78e6096145f17bcc5a6efd9a28f2c17ff0b9a7baed55574decfb05f068f28257dedeab8c89b610e5b2df0389a40ba714
@@ -1,7 +1,7 @@
  PATH
  remote: .
  specs:
- html2doc (0.8.4)
+ html2doc (0.8.5)
  asciimath
  htmlentities (~> 4.3.4)
  image_size
@@ -23,6 +23,7 @@ The gem currently does the following:
  * Identify any footnotes in the document (defined as hyperlinks with attributes `class = "Footnote"` or `epub:type = "footnote"`), and render them as Microsoft Word footnotes.
  * Resize any local images in the HTML file to fit within the maximum page size. (Word will otherwise crash on reading the document.)
  * Optionally apply list styles with predefined bullet and numbering from a Word CSS to the unordered and ordered lists in the document, restarting numbering for each ordered list.
+ * Convert all lists to native Word HTML rendering (using paragraphs with `MsoListParagraphCxSpFirst, MsoListParagraphCxSpMiddle, MsoListParagraphCxSpLast` styles)
  * Convert any internal `@id` anchors to `a@name` anchors; Word only hyperlinks to the latter.
  * Generate a filelist.xml listing of all files to be bundled into the Word document.
  * Assign the class `MsoNormal` to any paragraphs that do not have a class, so that they can be treated as Normal Style when editing the Word document.
@@ -33,7 +34,9 @@ For a representative generator of HTML that uses this gem in postprocessing, see
 
  == Constraints
 
- This generates `.doc` documents. Future versions may upgrade the output to `docx`.
+ This gem generates `.doc` documents. Future versions may upgrade the output to `docx`.
+
+ Because `.doc` is the format of an older version of Microsoft Word, the output of this gem do *not* support SVG graphics. (Word itself converts SVG into PNG when it saves documents as Word HTML, which is the input to this gem.)
 
  There there are two other Microsoft Word vendors in the Ruby ecosystem.
 
@@ -99,7 +102,7 @@ Here are the steps to convert our output into native-`docx`.
 
  The good news is that Word understands HTML.
 
- The bad news is that Word's understanding of HTML is HTML 4. In order for bookmarks to work, for example, this gem has to translate `<p id="">` back down into `<p><a name="">`. Word (and this gem) will not do much with HTML 5-specific elements, and if you're generating HTML for automated generation of Word documents, keep your HTML old-fashioned.
+ The bad news is that Word's understanding of HTML is HTML 4. In order for bookmarks to work, for example, this gem has to translate `<p id="">` back down into `<p><a name="">`. Word (and this gem) will not do much with HTML 5-specific elements (or SVG graphics), and if you're generating HTML for automated generation of Word documents, you need to keep your HTML old-fashioned.
 
  === CSS
 
@@ -116,7 +119,7 @@ The good news is that the stylesheet is not identical to the stylesheet `mathml2
  The bad news is that the stylesheet is not identical to the stylesheet `mathml2omml.xsl` that is published with Microsoft Word, so it isn't guaranteed to have identical output. If you want to make sure that your MathML import is identical to what Word currently uses, replace `mml2omml.xsl` with `mathml2omml.xsl`, and edit the gem accordingly for your local installation. On Windows, you will find the stylesheet in the same directory as the `winword.exe` executable. On Mac, right-click on the Word application, and select "Show Package Contents"; you will find the stylesheet under `Contents/Resources`.
 
  === Lists
- Natively, Word does not use `<ol>`, `<ul>`, or `<dl>` lists in its HTML exports at all: it uses paragraphs styled with list styles. If you save a Word document as HTML in order to use its CSS for Word documents generated by HTML, those styles will still work (with the caveat that you will need to extract the `@list` style specific to ordered and unordered lists, and pass it as a `liststyles` parameter to the conversion). *However*, Word applies a default indentation to all instances of `<ol>`, `<ul>` and `<dl>`, which the CSS stylesheet of a Word HTML will not have accounted for (because the Word HTML does not use lists at all.) If you are going to reuse that CSS for generating new documents using lists, you will need to add the rule `margin-left:0pt` to `ul`, `ol`, `dl` in the CSS stylesheet you supply, so that the margins in the Word-exported CSS remain correct.
+ Natively, Word does not use `<ol>`, `<ul>`, or `<dl>` lists in its HTML exports at all: it uses paragraphs styled with list styles. If you save a Word document as HTML in order to use its CSS for Word documents generated by HTML, those styles will still work (with the caveat that you will need to extract the `@list` style specific to ordered and unordered lists, and pass it as a `liststyles` parameter to the conversion). Word HTML understands `<ol>, <ul>, <li>`, but its rendering is fragile: in particular, any instance of `<p>` within a `<li>` is treated as a new list item (so Word HTML will not let you have multi-paragraph list items if you use native HTML.) This gem now exports lists as Word HTML prefers to see them, with `MsoListParagraphCxSpFirst, MsoListParagraphCxSpMiddle, MsoListParagraphCxSpLast` styles. You will need to include these in the CSS stylesheet you supply, in order to get the right indentation for lists.
 
  === Math Positioning
  By default, mathematical formulas that are the only content of their paragraph are rendered as centered in Word. If you want your AsciiMath or MathML to be left-aligned or right-aligned, add `style="text-align:left"` or `style="text-align:right"` to its ancestor `div`, `p` or `td` node in HTML.
@@ -27,13 +27,26 @@ module Html2Doc
  end
  end
 
+ def self.list2para(u)
+ return if u.xpath("./li").empty?
+ u.xpath("./li").last["class"] = "MsoListParagraphCxSpLast"
+ u.xpath("./li").first["class"] = "MsoListParagraphCxSpFirst"
+ u.xpath("./li/p").each { |p| p["class"] ||= "MsoListParagraphCxSpMiddle" }
+ u.xpath("./li").each do |l|
+ l.name = "p"
+ l["class"] ||= "MsoListParagraphCxSpMiddle"
+ l&.first_element_child&.name == "p" and l.first_element_child.replace(l.first_element_child.children)
+ end
+ u.replace(u.children)
+ end
+
  def self.lists(docxml, liststyles)
  return if liststyles.nil?
- if liststyles.has_key?(:ul)
+ liststyles.has_key?(:ul) and
  list_add(docxml.xpath("//ul[not(ancestor::ul) and not(ancestor::ol)]"), liststyles, :ul, 1, nil)
- end
- if liststyles.has_key?(:ol)
+ liststyles.has_key?(:ol) and
  list_add(docxml.xpath("//ol[not(ancestor::ul) and not(ancestor::ol)]"), liststyles, :ol, 1, nil)
- end
+ liststyles.has_key?(:ul) and docxml.xpath("//ul").each { |u| list2para(u) }
+ liststyles.has_key?(:ol) and docxml.xpath("//ol").each { |u| list2para(u) }
  end
  end
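Note on the hunk above: `list2para` assigns the `Last` class before the `First` class, so a list with a single item ends up as `MsoListParagraphCxSpFirst`, which matches the single-item lists in the updated spec expectations. The following is a minimal sketch of that ordering on a plain array rather than on Nokogiri nodes; `list_paragraph_classes` is a hypothetical helper for illustration and is not part of the gem.

```ruby
# Hypothetical illustration only: mirrors the order in which list2para
# assigns classes, using an array of class names instead of XML nodes.
def list_paragraph_classes(count)
  return [] if count.zero?

  # Items with no class default to Middle (the `||=` in list2para).
  classes = Array.new(count, "MsoListParagraphCxSpMiddle")
  # Last is assigned first, then First, so First wins for a one-item list.
  classes[-1] = "MsoListParagraphCxSpLast"
  classes[0] = "MsoListParagraphCxSpFirst"
  classes
end

puts list_paragraph_classes(3).inspect
```

The same order-of-assignment reasoning applies to a two-item list, which gets `First` and `Last` with no `Middle`.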
@@ -73,13 +73,21 @@ module Html2Doc
 
  IMAGE_PATH = "//*[local-name() = 'img' or local-name() = 'imagedata']".freeze
 
+ def self.mkuuid
+ UUIDTools::UUID.random_create.to_s
+ end
+
+ def self.warnsvg(src)
+ warn "#{src}: SVG not supported" if /\.svg$/i.match(src)
+ end
+
  # only processes locally stored images
  def self.image_cleanup(docxml, dir)
  docxml.xpath(IMAGE_PATH).each do |i|
- next if /^http/.match i["src"]
  matched = /\.(?<suffix>\S+)$/.match i["src"]
- uuid = UUIDTools::UUID.random_create.to_s
- new_full_filename = File.join(dir, "#{uuid}.#{matched[:suffix]}")
+ warnsvg(i["src"])
+ next if /^http/.match i["src"]
+ new_full_filename = File.join(dir, "#{mkuuid}.#{matched[:suffix]}")
  FileUtils.cp i["src"], new_full_filename
  i["width"], i["height"] = image_resize(i, 680, 400)
  i["src"] = new_full_filename
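Note on the hunk above: the `^http` skip now runs after `warnsvg`, so remote SVG sources are reported as well as local ones (the added spec "warns about SVG" uses an `https` URL). A rough sketch of that ordering follows; `svg_source?`, `remote_source?` and `classify` are hypothetical names used only for illustration, and the regular expressions are the ones shown in the hunk.

```ruby
# Hypothetical stand-ins for the two checks in image_cleanup; not the gem's API.
def svg_source?(src)
  !!(/\.svg$/i.match(src))
end

def remote_source?(src)
  !!(/^http/.match(src))
end

# Returns what would happen to an image source: the warning check comes
# first, then remote sources are skipped and local ones are copied.
def classify(src)
  warnings = []
  warnings << "#{src}: SVG not supported" if svg_source?(src)
  action = remote_source?(src) ? :skip : :copy
  [action, warnings]
end

puts classify("https://example.com/19160-6.svg").inspect
```

With the pre-0.8.5 ordering, the remote check would have returned before any warning could be produced for a remote source.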
@@ -96,15 +104,13 @@ module Html2Doc
  end
 
  def self.header_image_cleanup1(a, dir, filename)
- if a.size == 2
- matched = / src=['"](?<src>[^"']+)['"]/.match a[1]
- matched2 = /\.(?<suffix>\S+)$/.match matched[:src]
- uuid = UUIDTools::UUID.random_create.to_s
- new_full_filename = "file:///C:/Doc/#{filename}_files/#{uuid}.#{matched2[:suffix]}"
- dest_filename = File.join(dir, "#{uuid}.#{matched2[:suffix]}")
- #system "cp #{matched[:src]} #{dest_filename}"
- FileUtils.cp matched[:src], dest_filename
- a[1].sub!(%r{ src=['"](?<src>[^"']+)['"]}, " src='#{new_full_filename}'")
+ if a.size == 2 && !(/ src="https?:/.match a[1])
+ m = / src=['"](?<src>[^"']+)['"]/.match a[1]
+ warnsvg(m[:src])
+ m2 = /\.(?<suffix>\S+)$/.match m[:src]
+ new_filename = "file:///C:/Doc/#{filename}_files/#{mkuuid}.#{m2[:suffix]}"
+ FileUtils.cp m[:src], File.join(dir, "#{mkuuid}.#{m2[:suffix]}")
+ a[1].sub!(%r{ src=['"](?<src>[^"']+)['"]}, " src='#{new_filename}'")
  end
  a.join
  end
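Note on the hunk above: the added lines call `mkuuid` twice, once for the `src` reference and once for the copy destination, whereas the removed lines shared a single `uuid`. As written, the rewritten reference and the copied file would therefore appear to receive different random names. Below is a minimal sketch of computing the basename once; `header_image_names` is a hypothetical helper, and `SecureRandom.uuid` from the standard library stands in for `UUIDTools::UUID.random_create.to_s` so the example has no gem dependency.

```ruby
require "securerandom"

# Stand-in for the gem's mkuuid (assumption: any random UUID string will do).
def mkuuid
  SecureRandom.uuid
end

# Hypothetical helper: derive both the Word-side reference and the copy
# destination from a single UUID so that they name the same file.
def header_image_names(dir, filename, suffix)
  basename = "#{mkuuid}.#{suffix}"
  {
    href: "file:///C:/Doc/#{filename}_files/#{basename}",
    dest: File.join(dir, basename),
  }
end

names = header_image_names("test_files", "test", "png")
puts names[:href]
puts names[:dest]
```

Whether the mismatch matters in practice depends on how the header file is bundled later, which this diff does not show.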
@@ -1,3 +1,3 @@
  module Html2Doc
- VERSION = "0.8.4".freeze
+ VERSION = "0.8.5".freeze
  end
@@ -383,7 +383,7 @@ RSpec.describe Html2Doc do
  end
 
  it "processes AsciiMath" do
- Html2Doc.process(html_input("<div>{{sum_(i=1)^n i^3=((n(n+1))/2)^2}}</div>"), filename: "test", asciimathdelims: ["{{", "}}"])
+ Html2Doc.process(html_input("<div>{{sum_(i=1)^n i^3=((n(n+1))/2)^2)}}</div>"), filename: "test", asciimathdelims: ["{{", "}}"])
  expect(guid_clean(File.read("test.doc", encoding: "utf-8"))).
  to match_fuzzy(<<~OUTPUT)
  #{WORD_HDR} #{DEFAULT_STYLESHEET} #{WORD_HDR_END}
@@ -588,6 +588,11 @@ RSpec.describe Html2Doc do
  OUTPUT
  end
 
+ it "warns about SVG" do
+ simple_body = '<img src="https://example.com/19160-6.svg">'
+ expect{ Html2Doc.process(html_input(simple_body), filename: "test") }.to output("https://example.com/19160-6.svg: SVG not supported\n").to_stderr
+ end
+
  it "processes epub:type footnotes" do
  simple_body = '<div>This is a very simple
  document<a epub:type="footnote" href="#a1">1</a> allegedly<a epub:type="footnote" href="#a2">2</a></div>
@@ -651,14 +656,14 @@ RSpec.describe Html2Doc do
  it "labels lists with list styles" do
  simple_body = <<~BODY
  <div><ul>
- <li><div><p><ol><li><ul><li><p><ol><li><ol><li>A</li></ol></li></ol></p></li></ul></li></ol></p></div></li></ul></div>
+ <li><div><p><ol><li><ul><li><p><ol><li><ol><li>A</li><li><p>B</p><p>B2</p></li><li>C</li></ol></li></ol></p></li></ul></li></ol></p></div></li></ul></div>
  BODY
  Html2Doc.process(html_input(simple_body), filename: "test", liststyles: {ul: "l1", ol: "l2"})
  expect(guid_clean(File.read("test.doc", encoding: "utf-8"))).
  to match_fuzzy(<<~OUTPUT)
  #{WORD_HDR} #{DEFAULT_STYLESHEET} #{WORD_HDR_END}
- #{word_body('<div><ul>
- <li style="mso-list:l1 level1 lfo1;" class="MsoNormal"><div><p class="MsoNormal"><ol><li style="mso-list:l2 level2 lfo1;" class="MsoNormal"><ul><li style="mso-list:l1 level3 lfo1;" class="MsoNormal"><p class="MsoNormal"><ol><li style="mso-list:l2 level4 lfo1;" class="MsoNormal"><ol><li style="mso-list:l2 level5 lfo1;" class="MsoNormal">A</li></ol></li></ol></p></li></ul></li></ol></p></div></li></ul></div>',
+ #{word_body('<div>
+ <p style="mso-list:l1 level1 lfo1;" class="MsoListParagraphCxSpFirst"><div><p class="MsoNormal"><p style="mso-list:l2 level2 lfo1;" class="MsoListParagraphCxSpFirst"><p style="mso-list:l2 level4 lfo1;" class="MsoListParagraphCxSpFirst"><p style="mso-list:l2 level5 lfo1;" class="MsoListParagraphCxSpFirst">A</p><p style="mso-list:l2 level5 lfo1;" class="MsoListParagraphCxSpMiddle">B<p class="MsoListParagraphCxSpMiddle">B2</p></p><p style="mso-list:l2 level5 lfo1;" class="MsoListParagraphCxSpLast">C</p></p></p></p></div></p></div>',
  '<div style="mso-element:footnote-list"/>')}
  #{WORD_FTR1}
  OUTPUT
@@ -676,8 +681,8 @@ RSpec.describe Html2Doc do
  to match_fuzzy(<<~OUTPUT)
  #{WORD_HDR} #{DEFAULT_STYLESHEET} #{WORD_HDR_END}
  #{word_body('<div>
- <ol><li style="mso-list:l2 level1 lfo1;" class="MsoNormal"><div><p class="MsoNormal"><ol><li style="mso-list:l2 level2 lfo1;" class="MsoNormal"><ul><li style="mso-list:l1 level3 lfo1;" class="MsoNormal"><p class="MsoNormal"><ol><li style="mso-list:l2 level4 lfo1;" class="MsoNormal"><ol><li style="mso-list:l2 level5 lfo1;" class="MsoNormal">A</li></ol></li></ol></p></li></ul></li></ol></p></div></li></ol>
- <ol><li style="mso-list:l2 level1 lfo2;" class="MsoNormal"><div><p class="MsoNormal"><ol><li style="mso-list:l2 level2 lfo2;" class="MsoNormal"><ul><li style="mso-list:l1 level3 lfo2;" class="MsoNormal"><p class="MsoNormal"><ol><li style="mso-list:l2 level4 lfo2;" class="MsoNormal"><ol><li style="mso-list:l2 level5 lfo2;" class="MsoNormal">A</li></ol></li></ol></p></li></ul></li></ol></p></div></li></ol></div>',
+ <p style="mso-list:l2 level1 lfo1;" class="MsoListParagraphCxSpFirst"><div><p class="MsoNormal"><p style="mso-list:l2 level2 lfo1;" class="MsoListParagraphCxSpFirst"><p style="mso-list:l2 level4 lfo1;" class="MsoListParagraphCxSpFirst"><p style="mso-list:l2 level5 lfo1;" class="MsoListParagraphCxSpFirst">A</p></p></p></p></div></p>
+ <p style="mso-list:l2 level1 lfo2;" class="MsoListParagraphCxSpFirst"><div><p class="MsoNormal"><p style="mso-list:l2 level2 lfo2;" class="MsoListParagraphCxSpFirst"><p style="mso-list:l2 level4 lfo2;" class="MsoListParagraphCxSpFirst"><p style="mso-list:l2 level5 lfo2;" class="MsoListParagraphCxSpFirst">A</p></p></p></p></div></p></div>',
  '<div style="mso-element:footnote-list"/>')}
  #{WORD_FTR1}
  OUTPUT
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: html2doc
  version: !ruby/object:Gem::Version
- version: 0.8.4
+ version: 0.8.5
  platform: ruby
  authors:
  - Ribose Inc.
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2018-10-05 00:00:00.000000000 Z
+ date: 2018-10-08 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: htmlentities