RubyGems - html2doc - Versions diffs - 0.8.4 → 0.8.5 - Mend

html2doc 0.8.4 → 0.8.5

Files changed (8)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: ebe0e5db8814308a2e3d8ade848d2a7a0323d2caf2e87f471d333edcf271ab8b
-  data.tar.gz: 6a0b6944d46e3a05aece3a52f445d00425ac78fa6d4e66a08adc9697618a7df4
+  metadata.gz: 36da12f189ac6654c4024d328fe2ceca7c171e5b8aa834c291f2ebef568a44b6
+  data.tar.gz: 38f446130c868c6368752f39a6f74c632ceab4c6e3016b48f63a454f3b1a7d22
 SHA512:
-  metadata.gz: 78b8a0b62a7420145b032a40004dd97cb05851d700c182ca9fc6bc94884c1ae4f5eb917fae3e0084ef189db3d41d1338afe6a6dc6415e8445848e72db7642b85
-  data.tar.gz: 5245dc6ae3dbb947f47b1fb21f775b6021385bba13449014fd5dde1d55cd572e52abc2019b7430ad42ead2f55c5ccb165c495d359ed6d9bb8010875b14988322
+  metadata.gz: f7d6ba90a533613f18e7edbb1bf1287109b750e97faf2d05800b9da7c7058e5be1f8ce0765a7546369ae528d332cf60aa64ea91781b45c902f74eccec87f682b
+  data.tar.gz: 73623c29bfbd99f6c089dd86d2bd25ba78e6096145f17bcc5a6efd9a28f2c17ff0b9a7baed55574decfb05f068f28257dedeab8c89b610e5b2df0389a40ba714

data/Gemfile.lock CHANGED

@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    html2doc (0.8.4)
+    html2doc (0.8.5)
       asciimath
       htmlentities (~> 4.3.4)
       image_size

data/README.adoc CHANGED

@@ -23,6 +23,7 @@ The gem currently does the following:
 * Identify any footnotes in the document (defined as hyperlinks with attributes `class = "Footnote"` or `epub:type = "footnote"`), and render them as Microsoft Word footnotes.
 * Resize any local images in the HTML file to fit within the maximum page size. (Word will otherwise crash on reading the document.)
 * Optionally apply list styles with predefined bullet and numbering from a Word CSS to the unordered and ordered lists in the document, restarting numbering for each ordered list.
+* Convert all lists to native Word HTML rendering (using paragraphs with `MsoListParagraphCxSpFirst, MsoListParagraphCxSpMiddle, MsoListParagraphCxSpLast` styles)
 * Convert any internal `@id` anchors to `a@name` anchors; Word only hyperlinks to the latter.
 * Generate a filelist.xml listing of all files to be bundled into the Word document.
 * Assign the class `MsoNormal` to any paragraphs that do not have a class, so that they can be treated as Normal Style when editing the Word document.
@@ -33,7 +34,9 @@ For a representative generator of HTML that uses this gem in postprocessing, see
 == Constraints
-This generates `.doc` documents. Future versions may upgrade the output to `docx`.
+This gem generates `.doc` documents. Future versions may upgrade the output to `docx`.
+Because `.doc` is the format of an older version of Microsoft Word, the output of this gem do *not* support SVG graphics. (Word itself converts SVG into PNG when it saves documents as Word HTML, which is the input to this gem.)
 There there are two other Microsoft Word vendors in the Ruby ecosystem.
@@ -99,7 +102,7 @@ Here are the steps to convert our output into native-`docx`.
 The good news is that Word understands HTML.
-The bad news is that Word's understanding of HTML is HTML 4. In order for bookmarks to work, for example, this gem has to translate `<p id="">` back down into `<p><a name="">`. Word (and this gem) will not do much with HTML 5-specific elements, and if you're generating HTML for automated generation of Word documents, keep your HTML old-fashioned.
+The bad news is that Word's understanding of HTML is HTML 4. In order for bookmarks to work, for example, this gem has to translate `<p id="">` back down into `<p><a name="">`. Word (and this gem) will not do much with HTML 5-specific elements (or SVG graphics), and if you're generating HTML for automated generation of Word documents, you need to keep your HTML old-fashioned.
 === CSS
@@ -116,7 +119,7 @@ The good news is that the stylesheet is not identical to the stylesheet `mathml2
 The bad news is that the stylesheet is not identical to the stylesheet `mathml2omml.xsl` that is published with Microsoft Word, so it isn't guaranteed to have identical output. If you want to make sure that your MathML import is identical to what Word currently uses, replace `mml2omml.xsl` with `mathml2omml.xsl`, and edit the gem accordingly for your local installation. On Windows, you will find the stylesheet in the same directory as the `winword.exe` executable. On Mac, right-click on the Word application, and select "Show Package Contents"; you will find the stylesheet under `Contents/Resources`.
 === Lists
-Natively, Word does not use `<ol>`, `<ul>`, or `<dl>` lists in its HTML exports at all: it uses paragraphs styled with list styles. If you save a Word document as HTML in order to use its CSS for Word documents generated by HTML, those styles will still work (with the caveat that you will need to extract the `@list` style specific to ordered and unordered lists, and pass it as a `liststyles` parameter to the conversion). *However*, Word applies a default indentation to all instances of `<ol>`, `<ul>` and `<dl>`, which the CSS stylesheet of a Word HTML will not have accounted for (because the Word HTML does not use lists at all.) If you are going to reuse that CSS for generating new documents using lists, you will need to add the rule `margin-left:0pt` to `ul`, `ol`, `dl` in the CSS stylesheet you supply, so that the margins in the Word-exported CSS remain correct.
+Natively, Word does not use `<ol>`, `<ul>`, or `<dl>` lists in its HTML exports at all: it uses paragraphs styled with list styles. If you save a Word document as HTML in order to use its CSS for Word documents generated by HTML, those styles will still work (with the caveat that you will need to extract the `@list` style specific to ordered and unordered lists, and pass it as a `liststyles` parameter to the conversion). Word HTML understands `<ol>, <ul>, <li>`, but its rendering is fragile: in particular, any instance of `<p>` within a `<li>` is treated as a new list item (so Word HTML will not let you have multi-paragraph list items if you use native HTML.) This gem now exports lists as Word HTML prefers to see them, with `MsoListParagraphCxSpFirst, MsoListParagraphCxSpMiddle, MsoListParagraphCxSpLast` styles. You will need to include these in the CSS stylesheet you supply, in order to get the right indentation for lists.
 === Math Positioning
 By default, mathematical formulas that are the only content of their paragraph are rendered as centered in Word. If you want your AsciiMath or MathML to be left-aligned or right-aligned, add `style="text-align:left"` or `style="text-align:right"` to its ancestor `div`, `p` or `td` node in HTML.

data/lib/html2doc/lists.rb CHANGED

@@ -27,13 +27,26 @@ module Html2Doc
     end
   end
+  def self.list2para(u)
+    return if u.xpath("./li").empty?
+    u.xpath("./li").last["class"] = "MsoListParagraphCxSpLast"
+    u.xpath("./li").first["class"] = "MsoListParagraphCxSpFirst"
+    u.xpath("./li/p").each { |p| p["class"] ||= "MsoListParagraphCxSpMiddle" }
+    u.xpath("./li").each do |l|
+      l.name = "p"
+      l["class"] ||= "MsoListParagraphCxSpMiddle"
+      l&.first_element_child&.name == "p" and l.first_element_child.replace(l.first_element_child.children)
+    end
+    u.replace(u.children)
+  end
   def self.lists(docxml, liststyles)
     return if liststyles.nil?
-    if liststyles.has_key?(:ul)
+    liststyles.has_key?(:ul) and
       list_add(docxml.xpath("//ul[not(ancestor::ul) and not(ancestor::ol)]"), liststyles, :ul, 1, nil)
-    end
-    if liststyles.has_key?(:ol)
+    liststyles.has_key?(:ol) and
       list_add(docxml.xpath("//ol[not(ancestor::ul) and not(ancestor::ol)]"), liststyles, :ol, 1, nil)
-    end
+    liststyles.has_key?(:ul) and docxml.xpath("//ul").each { |u| list2para(u) }
+    liststyles.has_key?(:ol) and docxml.xpath("//ol").each { |u| list2para(u) }
   end
 end
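
The class assignment in `list2para` above can be traced without an XML parser. The following is a minimal sketch, not the gem's code: `list_paragraph_classes` is a hypothetical helper, and it assumes that none of the `<li>` elements carries a `class` attribute beforehand. It shows why a one-item list ends up as `MsoListParagraphCxSpFirst`, as the updated spec expects: the last item is labelled before the first, so on a single item the second assignment wins.

```ruby
# Hypothetical helper (not part of html2doc): returns the class that the
# list2para logic would give each of `count` sibling <li> items, assuming
# no item has a class attribute to begin with.
def list_paragraph_classes(count)
  classes = Array.new(count) # nil means "no class assigned yet"
  return classes if count.zero?

  # Same order as the diff: label the last item, then the first.
  classes[-1] = "MsoListParagraphCxSpLast"
  classes[0]  = "MsoListParagraphCxSpFirst"

  # Mirrors `l["class"] ||= ...`: only unlabelled items become Middle.
  classes.map { |c| c || "MsoListParagraphCxSpMiddle" }
end

puts list_paragraph_classes(1).inspect
puts list_paragraph_classes(3).inspect
```

With three items the result is First, Middle, Last in that order; with two items there is no Middle entry at all.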

data/lib/html2doc/mime.rb CHANGED

@@ -73,13 +73,21 @@ module Html2Doc
   IMAGE_PATH = "//*[local-name() = 'img' or local-name() = 'imagedata']".freeze
+  def self.mkuuid
+    UUIDTools::UUID.random_create.to_s
+  end
+  def self.warnsvg(src)
+    warn "#{src}: SVG not supported" if /\.svg$/i.match(src)
+  end
   # only processes locally stored images
   def self.image_cleanup(docxml, dir)
     docxml.xpath(IMAGE_PATH).each do |i|
-      next if /^http/.match i["src"]
       matched = /\.(?<suffix>\S+)$/.match i["src"]
-      uuid = UUIDTools::UUID.random_create.to_s
-      new_full_filename = File.join(dir, "#{uuid}.#{matched[:suffix]}")
+      warnsvg(i["src"])
+      next if /^http/.match i["src"]
+      new_full_filename = File.join(dir, "#{mkuuid}.#{matched[:suffix]}")
       FileUtils.cp i["src"], new_full_filename
       i["width"], i["height"] = image_resize(i, 680, 400)
       i["src"] = new_full_filename
@@ -96,15 +104,13 @@ module Html2Doc
   end
   def self.header_image_cleanup1(a, dir, filename)
-    if a.size == 2
-      matched = / src=['"](?<src>[^"']+)['"]/.match a[1]
-      matched2 = /\.(?<suffix>\S+)$/.match matched[:src]
-      uuid = UUIDTools::UUID.random_create.to_s
-      new_full_filename = "file:///C:/Doc/#{filename}_files/#{uuid}.#{matched2[:suffix]}"
-      dest_filename = File.join(dir, "#{uuid}.#{matched2[:suffix]}")
-      #system "cp #{matched[:src]} #{dest_filename}"
-      FileUtils.cp matched[:src], dest_filename
-      a[1].sub!(%r{ src=['"](?<src>[^"']+)['"]}, " src='#{new_full_filename}'")
+    if a.size == 2 && !(/ src="https?:/.match a[1])
+      m = / src=['"](?<src>[^"']+)['"]/.match a[1]
+      warnsvg(m[:src])
+      m2 = /\.(?<suffix>\S+)$/.match m[:src]
+      new_filename = "file:///C:/Doc/#{filename}_files/#{mkuuid}.#{m2[:suffix]}"
+      FileUtils.cp m[:src], File.join(dir, "#{mkuuid}.#{m2[:suffix]}")
+      a[1].sub!(%r{ src=['"](?<src>[^"']+)['"]}, " src='#{new_filename}'")
     end
     a.join
   end
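
In the new `header_image_cleanup1`, `mkuuid` is interpolated twice, once for the `file:///` reference and once for the copy destination. Each call to `UUIDTools::UUID.random_create` returns a fresh value, so the two basenames will differ. If the intent is for both to name the same file, as in 0.8.4 where a single `uuid` variable was used, one option is to compute the basename once. The sketch below is an illustration under stated assumptions, not the gem's code: `header_image_names` is a hypothetical helper, and `SecureRandom.uuid` from the standard library stands in for UUIDTools.

```ruby
require "securerandom"

# Stand-in for Html2Doc.mkuuid; the gem itself uses UUIDTools (assumption:
# any random UUID generator behaves the same for this purpose).
def mkuuid
  SecureRandom.uuid
end

# Hypothetical helper: derive the Word-side reference and the copy
# destination from one shared basename, so both point at the same file.
def header_image_names(dir, filename, suffix)
  basename = "#{mkuuid}.#{suffix}"
  reference = "file:///C:/Doc/#{filename}_files/#{basename}"
  destination = File.join(dir, basename)
  [reference, destination]
end

reference, destination = header_image_names("test_files", "test", "png")
puts File.basename(reference) == File.basename(destination)
```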

data/lib/html2doc/version.rb CHANGED

@@ -1,3 +1,3 @@
 module Html2Doc
-  VERSION = "0.8.4".freeze
+  VERSION = "0.8.5".freeze
 end

data/spec/html2doc_spec.rb CHANGED

@@ -383,7 +383,7 @@ RSpec.describe Html2Doc do
   end
   it "processes AsciiMath" do
-    Html2Doc.process(html_input("<div>{{sum_(i=1)^n i^3=((n(n+1))/2)^2}}</div>"), filename: "test", asciimathdelims: ["{{", "}}"])
+    Html2Doc.process(html_input("<div>{{sum_(i=1)^n i^3=((n(n+1))/2)^2)}}</div>"), filename: "test", asciimathdelims: ["{{", "}}"])
     expect(guid_clean(File.read("test.doc", encoding: "utf-8"))).
       to match_fuzzy(<<~OUTPUT)
     #{WORD_HDR} #{DEFAULT_STYLESHEET} #{WORD_HDR_END}
@@ -588,6 +588,11 @@ RSpec.describe Html2Doc do
     OUTPUT
   end
+  it "warns about SVG" do
+    simple_body = '<img src="https://example.com/19160-6.svg">'
+    expect{ Html2Doc.process(html_input(simple_body), filename: "test") }.to output("https://example.com/19160-6.svg: SVG not supported\n").to_stderr
+  end
   it "processes epub:type footnotes" do
     simple_body = '<div>This is a very simple
      document<a epub:type="footnote" href="#a1">1</a> allegedly<a epub:type="footnote" href="#a2">2</a></div>
@@ -651,14 +656,14 @@ RSpec.describe Html2Doc do
   it "labels lists with list styles" do
     simple_body = <<~BODY
       <div><ul>
-      <li><div><p><ol><li><ul><li><p><ol><li><ol><li>A</li></ol></li></ol></p></li></ul></li></ol></p></div></li></ul></div>
+      <li><div><p><ol><li><ul><li><p><ol><li><ol><li>A</li><li><p>B</p><p>B2</p></li><li>C</li></ol></li></ol></p></li></ul></li></ol></p></div></li></ul></div>
     BODY
     Html2Doc.process(html_input(simple_body), filename: "test", liststyles: {ul: "l1", ol: "l2"})
     expect(guid_clean(File.read("test.doc", encoding: "utf-8"))).
       to match_fuzzy(<<~OUTPUT)
     #{WORD_HDR} #{DEFAULT_STYLESHEET} #{WORD_HDR_END}
-    #{word_body('<div><ul>
-       <li style="mso-list:l1 level1 lfo1;" class="MsoNormal"><div><p class="MsoNormal"><ol><li style="mso-list:l2 level2 lfo1;" class="MsoNormal"><ul><li style="mso-list:l1 level3 lfo1;" class="MsoNormal"><p class="MsoNormal"><ol><li style="mso-list:l2 level4 lfo1;" class="MsoNormal"><ol><li style="mso-list:l2 level5 lfo1;" class="MsoNormal">A</li></ol></li></ol></p></li></ul></li></ol></p></div></li></ul></div>',
+    #{word_body('<div>
+    <p style="mso-list:l1 level1 lfo1;" class="MsoListParagraphCxSpFirst"><div><p class="MsoNormal"><p style="mso-list:l2 level2 lfo1;" class="MsoListParagraphCxSpFirst"><p style="mso-list:l2 level4 lfo1;" class="MsoListParagraphCxSpFirst"><p style="mso-list:l2 level5 lfo1;" class="MsoListParagraphCxSpFirst">A</p><p style="mso-list:l2 level5 lfo1;" class="MsoListParagraphCxSpMiddle">B<p class="MsoListParagraphCxSpMiddle">B2</p></p><p style="mso-list:l2 level5 lfo1;" class="MsoListParagraphCxSpLast">C</p></p></p></p></div></p></div>',
     '<div style="mso-element:footnote-list"/>')}
     #{WORD_FTR1}
     OUTPUT
@@ -676,8 +681,8 @@ RSpec.describe Html2Doc do
       to match_fuzzy(<<~OUTPUT)
     #{WORD_HDR} #{DEFAULT_STYLESHEET} #{WORD_HDR_END}
     #{word_body('<div>
-      <ol><li style="mso-list:l2 level1 lfo1;" class="MsoNormal"><div><p class="MsoNormal"><ol><li style="mso-list:l2 level2 lfo1;" class="MsoNormal"><ul><li style="mso-list:l1 level3 lfo1;" class="MsoNormal"><p class="MsoNormal"><ol><li style="mso-list:l2 level4 lfo1;" class="MsoNormal"><ol><li style="mso-list:l2 level5 lfo1;" class="MsoNormal">A</li></ol></li></ol></p></li></ul></li></ol></p></div></li></ol>
-      <ol><li style="mso-list:l2 level1 lfo2;" class="MsoNormal"><div><p class="MsoNormal"><ol><li style="mso-list:l2 level2 lfo2;" class="MsoNormal"><ul><li style="mso-list:l1 level3 lfo2;" class="MsoNormal"><p class="MsoNormal"><ol><li style="mso-list:l2 level4 lfo2;" class="MsoNormal"><ol><li style="mso-list:l2 level5 lfo2;" class="MsoNormal">A</li></ol></li></ol></p></li></ul></li></ol></p></div></li></ol></div>',
+      <p style="mso-list:l2 level1 lfo1;" class="MsoListParagraphCxSpFirst"><div><p class="MsoNormal"><p style="mso-list:l2 level2 lfo1;" class="MsoListParagraphCxSpFirst"><p style="mso-list:l2 level4 lfo1;" class="MsoListParagraphCxSpFirst"><p style="mso-list:l2 level5 lfo1;" class="MsoListParagraphCxSpFirst">A</p></p></p></p></div></p>
+      <p style="mso-list:l2 level1 lfo2;" class="MsoListParagraphCxSpFirst"><div><p class="MsoNormal"><p style="mso-list:l2 level2 lfo2;" class="MsoListParagraphCxSpFirst"><p style="mso-list:l2 level4 lfo2;" class="MsoListParagraphCxSpFirst"><p style="mso-list:l2 level5 lfo2;" class="MsoListParagraphCxSpFirst">A</p></p></p></p></div></p></div>',
     '<div style="mso-element:footnote-list"/>')}
     #{WORD_FTR1}
     OUTPUT
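
The "warns about SVG" example in this spec passes a remote URL, which works because `image_cleanup` in mime.rb now calls `warnsvg` before skipping `http` sources. The suffix test itself can be tried in isolation. In this sketch `svg_warning` is a hypothetical stand-in that returns the message rather than writing it to standard error; the regular expression is the one from the diff.

```ruby
# Hypothetical stand-in for the gem's warnsvg: returns the warning text
# (or nil) instead of calling `warn`, so the result can be inspected.
def svg_warning(src)
  "#{src}: SVG not supported" if /\.svg$/i.match(src)
end

# Remote and local sources are treated alike; the match is case-insensitive.
puts svg_warning("https://example.com/19160-6.svg").inspect
puts svg_warning("logo.SVG").inspect

# The pattern is anchored at the end, so a trailing query string or a
# different extension produces no warning.
puts svg_warning("diagram.svg?v=2").inspect
puts svg_warning("photo.png").inspect
```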

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: html2doc
 version: !ruby/object:Gem::Version
-  version: 0.8.4
+  version: 0.8.5
 platform: ruby
 authors:
 - Ribose Inc.
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2018-10-05 00:00:00.000000000 Z
+date: 2018-10-08 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: htmlentities