RubyGems - html2text - Versions diffs - 0.1.1 - Mend

html2text 0.1.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (25)

checksums.yaml +7 -0
data/MIT-LICENSE +20 -0
data/README.md +70 -0
data/lib/html2text.rb +138 -0
data/lib/html2text/version.rb +3 -0
data/spec/examples/anchors.html +12 -0
data/spec/examples/anchors.txt +5 -0
data/spec/examples/basic.html +21 -0
data/spec/examples/basic.txt +13 -0
data/spec/examples/lists.html +24 -0
data/spec/examples/lists.txt +17 -0
data/spec/examples/more-anchors.html +14 -0
data/spec/examples/more-anchors.txt +7 -0
data/spec/examples/nbsp.html +1 -0
data/spec/examples/nbsp.txt +1 -0
data/spec/examples/table.html +53 -0
data/spec/examples/table.txt +7 -0
data/spec/examples/test3.html +1 -0
data/spec/examples/test3.txt +2 -0
data/spec/examples/test4.html +1 -0
data/spec/examples/test4.txt +5 -0
data/spec/examples_spec.rb +25 -0
data/spec/html2text_spec.rb +37 -0
data/spec/spec_helper.rb +4 -0
metadata +156 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 7c84c460e75e64099fa12a010871f9859ab48b9f
+  data.tar.gz: ea56a52568f22804cdcbc44b5f35e6b99164ea6c
+SHA512:
+  metadata.gz: a3833c4546b86912872d777fc57be15cc0fac89e273e5ad65b6714a0b723f4815a81a3865e9ee0b05746ef7dee356baf5824ace242ab914d26eb79bf3aa6bf65
+  data.tar.gz: 737d869f81c782f93d520e935bb5b26a0a88798f940b60856519a084eabd1dfca84171d673f3abd5e73ecf0f84917909573cd6d92a67510fcdfcc075c4a676ed
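
The checksums above are hex digests of the two archives inside the `.gem` file. A minimal sketch of how such digests could be recomputed and compared with Ruby's standard `digest` library follows; the helper names `digests_for` and `matches_checksum?` are hypothetical and are not part of this package or of RubyGems.

```ruby
require "digest/sha1"
require "digest/sha2"

# Hypothetical helper: hex digests for a blob of bytes, keyed the same
# way as the top-level keys in checksums.yaml.
def digests_for(bytes)
  {
    "SHA1"   => Digest::SHA1.hexdigest(bytes),
    "SHA512" => Digest::SHA512.hexdigest(bytes)
  }
end

# Hypothetical helper: true when the recomputed digest equals the recorded one.
def matches_checksum?(bytes, algorithm, expected_hex)
  digests_for(bytes).fetch(algorithm) == expected_hex
end

# Usage, assuming data.tar.gz has been extracted from the .gem archive:
#   bytes = File.binread("data.tar.gz")
#   matches_checksum?(bytes, "SHA1", "ea56a52568f22804cdcbc44b5f35e6b99164ea6c")
```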

data/MIT-LICENSE ADDED

@@ -0,0 +1,20 @@
+Copyright 2015 Jevon Wright
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,70 @@
+html2text [![Build Status](https://travis-ci.org/soundasleep/html2text_ruby.svg?branch=master)](https://travis-ci.org/soundasleep/html2text_ruby)
+==============
+`html2text` is a very simple script that uses Ruby's DOM methods to load HTML from a string, and then iterates over the resulting DOM to correctly output plain text. For example:
+```html
+<html>
+<title>Ignored Title</title>
+<body>
+  <h1>Hello, World!</h1>
+  <p>This is some e-mail content.
+  Even though it has whitespace and newlines, the e-mail converter
+  will handle it correctly.
+  <p>Even mismatched tags.</p>
+  <div>A div</div>
+  <div>Another div</div>
+  <div>A div<div>within a div</div></div>
+  <a href="http://foo.com">A link</a>
+</body>
+</html>
+```
+Will be converted into:
+```text
+Hello, World!
+This is some e-mail content. Even though it has whitespace and newlines, the e-mail converter will handle it correctly.
+Even mismatched tags.
+A div
+Another div
+A div
+within a div
+[A link](http://foo.com)
+```
+See the [original blog post](http://journals.jevon.org/users/jevon-phd/entry/19818) or the related [StackOverflow answer](http://stackoverflow.com/a/2564472/39531).
+## Installing
+TODO Install the gem, then you can:
+```ruby
+require 'html2text'
+text = Html2Text.convert(html)
+```
+## Tests
+See all of the test cases defined in [spec/examples/](spec/examples/). These can be run with:
+```
+bundle install
+rspec
+```
+## License
+`html2text` is licensed under MIT.
+## Other versions
+Also see [html2text](https://github.com/soundasleep/html2text), the original PHP implementation.

data/lib/html2text.rb ADDED

@@ -0,0 +1,138 @@
+require 'nokogiri'
+class Html2Text
+  attr_reader :doc
+  def initialize(doc)
+    @doc = doc
+  end
+  def self.convert(html)
+    html = fix_newlines(replace_entities(html))
+    doc = Nokogiri::HTML(html)
+    Html2Text.new(doc).convert
+  end
+  def self.fix_newlines(text)
+    text.gsub("\r\n", "\n").gsub("\r", "\n")
+  end
+  def self.replace_entities(text)
+    text.gsub("&nbsp;", " ")
+  end
+  def convert
+    output = iterate_over(doc)
+    output = remove_leading_and_trailing_whitespace(output)
+    output.strip
+  end
+  def remove_leading_and_trailing_whitespace(text)
+    text.gsub(/[ \t]*\n[ \t]*/im, "\n")
+  end
+  def trimmed_whitespace(text)
+    # Replace whitespace characters with a space (equivalent to \s)
+    text.gsub(/[\t\n\f\r ]+/im, " ")
+  end
+  def next_node_name(node)
+    next_node = node.next_sibling
+    while next_node != nil
+      break if next_node.element?
+      next_node = next_node.next_sibling
+    end
+    if next_node && next_node.element?
+      next_node.name.downcase
+    end
+  end
+  def iterate_over(node)
+    return trimmed_whitespace(node.text) if node.text?
+    if ["style", "head", "title", "meta", "script"].include?(node.name.downcase)
+      return ""
+    end
+    output = []
+    output << prefix_whitespace(node)
+    output += node.children.map do |child|
+      iterate_over(child)
+    end
+    output << suffix_whitespace(node)
+    output = output.compact.join("") || ""
+    if node.name.downcase == "a"
+      output = wrap_link(node, output)
+    end
+    output
+  end
+  def prefix_whitespace(node)
+    case node.name.downcase
+      when "hr"
+        "------\n"
+      when "h1", "h2", "h3", "h4", "h5", "h6", "ol", "ul"
+        "\n"
+      when "tr", "p", "div"
+        "\n"
+      when "td", "th"
+        "\t"
+      when "li"
+        "- "
+    end
+  end
+  def suffix_whitespace(node)
+    case node.name.downcase
+      when "h1", "h2", "h3", "h4", "h5", "h6"
+        # add another line
+        "\n"
+      when "p", "br"
+        "\n" if next_node_name(node) != "div"
+      when "li"
+        "\n"
+      when "div"
+        # add one line only if the next child isn't a div
+        "\n" if next_node_name(node) != "div" && next_node_name(node) != nil
+    end
+  end
+  # links are returned in [text](link) format
+  def wrap_link(node, output)
+    href = node.attribute("href")
+    name = node.attribute("name")
+    if href.nil?
+      if !name.nil?
+        output = "[#{output}]"
+      end
+    else
+      href = href.to_s
+      if href != output && href != "mailto:#{output}" &&
+          href != "http://#{output}" && href != "https://#{output}"
+        output = "[#{output}](#{href})"
+      end
+    end
+    case next_node_name(node)
+      when "h1", "h2", "h3", "h4", "h5", "h6"
+        output += "\n"
+    end
+    output
+  end
+end
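
The whitespace handling in `lib/html2text.rb` is plain string substitution, so it can be tried without Nokogiri. The sketch below copies the three `gsub` calls into standalone methods; `strip_line_edges` and `collapse_whitespace` are hypothetical names for what the gem calls `remove_leading_and_trailing_whitespace` and `trimmed_whitespace`, and the redundant `/im` regex flags are dropped, which should not change the result for these patterns.

```ruby
# Normalise Windows (\r\n) and old Mac (\r) line endings to \n.
def fix_newlines(text)
  text.gsub("\r\n", "\n").gsub("\r", "\n")
end

# Drop spaces and tabs on either side of each newline.
def strip_line_edges(text)
  text.gsub(/[ \t]*\n[ \t]*/, "\n")
end

# Collapse each run of whitespace inside a text node into one space.
def collapse_whitespace(text)
  text.gsub(/[\t\n\f\r ]+/, " ")
end

p fix_newlines("a\r\nb\rc")                      # => "a\nb\nc"
p strip_line_edges("hello\n  world \n yes")      # => "hello\nworld\nyes"
p collapse_whitespace("some  e-mail\n  content") # => "some e-mail content"
```

The second call uses the same input as the "many new lines" case in `spec/html2text_spec.rb`.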

data/lib/html2text/version.rb ADDED

@@ -0,0 +1,3 @@
+class Html2Text
+  VERSION = "0.1.1"
+end

data/spec/examples/anchors.html ADDED

@@ -0,0 +1,12 @@
+A document without any HTML open/closing tags.
+<hr>
+We try and use the representation given by common browsers of the
+HTML document, so that it looks similar when converted to plain text.
+<a href="http://foo.com">visit foo.com</a> - or <a href="http://www.foo.com">http://www.foo.com</a>
+<a href="http://foo.com" title="a link with a title">link</a>
+<h2><a name="anchor">An anchor which will not appear</a></h2>

data/spec/examples/anchors.txt ADDED

@@ -0,0 +1,5 @@
+A document without any HTML open/closing tags.
+------
+We try and use the representation given by common browsers of the HTML document, so that it looks similar when converted to plain text. [visit foo.com](http://foo.com) - or http://www.foo.com [link](http://foo.com)
+[An anchor which will not appear]

data/spec/examples/basic.html ADDED

@@ -0,0 +1,21 @@
+<html>
+<title>Ignored Title</title>
+<body>
+  <h1>Hello, World!</h1>
+  <p>This is some e-mail content.
+  Even though it has whitespace and newlines, the e-mail converter
+  will handle it correctly.
+  <p>Even mismatched tags.</p>
+  <div>A div</div>
+  <div>Another div</div>
+  <div>A div<div>within a div</div></div>
+  <p>Another line<br />Yet another line</p>
+  <a href="http://foo.com">A link</a>
+</body>
+</html>

data/spec/examples/basic.txt ADDED

@@ -0,0 +1,13 @@
+Hello, World!
+This is some e-mail content. Even though it has whitespace and newlines, the e-mail converter will handle it correctly.
+Even mismatched tags.
+A div
+Another div
+A div
+within a div
+Another line
+Yet another line
+[A link](http://foo.com)

data/spec/examples/lists.html ADDED

@@ -0,0 +1,24 @@
+<h1>List tests</h1>
+<p>
+Add some lists.
+</p>
+<ol>
+	<li>one</li>
+	<li>two
+	<li>three
+</ol>
+<h2>An unordered list</h2>
+<ul>
+	<li>one
+	<li>two</li>
+	<li>three</li>
+</ul>
+<ul>
+	<li>one
+	<li>two</li>
+	<li>three</li>
+</ul>

data/spec/examples/lists.txt ADDED

@@ -0,0 +1,17 @@
+List tests
+Add some lists.
+- one
+- two
+- three
+An unordered list
+- one
+- two
+- three
+- one
+- two
+- three

data/spec/examples/more-anchors.html ADDED

@@ -0,0 +1,14 @@
+<h1>Anchor tests</h1>
+<p>
+Visit http://openiaml.org or <a href="http://openiaml.org">openiaml.org</a> or <a href="http://openiaml.org">http://openiaml.org</a>.
+</p>
+<p>
+To visit with SSL, visit https://openiaml.org or <a href="https://openiaml.org">openiaml.org</a> or <a href="https://openiaml.org">https://openiaml.org</a>.
+</p>
+<p>
+To mail, email support@openiaml.org or mailto:support@openiaml.org
+or <a href="mailto:support@openiaml.org">support@openiaml.org</a> or <a href="mailto:support@openiaml.org">mailto:support@openiaml.org</a>.
+</p>

data/spec/examples/more-anchors.txt ADDED

@@ -0,0 +1,7 @@
+Anchor tests
+Visit http://openiaml.org or openiaml.org or http://openiaml.org.
+To visit with SSL, visit https://openiaml.org or openiaml.org or https://openiaml.org.
+To mail, email support@openiaml.org or mailto:support@openiaml.org or support@openiaml.org or mailto:support@openiaml.org.
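
The expected output above leaves a link unwrapped when its text already equals the `href`, or equals it once a `mailto:`, `http://` or `https://` prefix is added. A minimal sketch of that comparison on plain strings follows; `format_link` is a hypothetical helper, and it ignores the `name` attribute branch that `wrap_link` in the gem also handles.

```ruby
# Hypothetical standalone version of the href comparison in wrap_link.
# text: the visible link text; href: the link target, or nil if absent.
def format_link(text, href)
  return text if href.nil?

  # Targets that add nothing beyond what the visible text already says.
  redundant = [text, "mailto:#{text}", "http://#{text}", "https://#{text}"]
  redundant.include?(href) ? text : "[#{text}](#{href})"
end

puts format_link("openiaml.org", "http://openiaml.org")
# prints: openiaml.org
puts format_link("support@openiaml.org", "mailto:support@openiaml.org")
# prints: support@openiaml.org
puts format_link("visit foo.com", "http://foo.com")
# prints: [visit foo.com](http://foo.com)
```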

data/spec/examples/nbsp.html ADDED

@@ -0,0 +1 @@
+hello   world & people < > &NBSP;

data/spec/examples/nbsp.txt ADDED

@@ -0,0 +1 @@
+hello world & people < > &NBSP;

data/spec/examples/table.html ADDED

@@ -0,0 +1,53 @@
+<html>
+<title>Ignored Title</title>
+<body>
+  <h1>Hello, World!</h1>
+  <table>
+    <thead>
+      <tr>
+        <th>Col A</th>
+        <th>Col B</th>
+      </tr>
+    </thead>
+    <tbody>
+      <tr>
+        <td>
+          Data A1
+        </td>
+        <td>
+          Data B1
+        </td>
+      </tr>
+      <tr>
+          <td>
+            Data A2
+          </td>
+          <td>
+            Data B2
+          </td>
+      </tr>
+      <tr>
+        <td>
+          Data A3
+        </td>
+        <td>
+          Data B4
+        </td>
+      </tr>
+    </tbody>
+    <tfoot>
+      <tr>
+          <td>
+            Total A
+          </td>
+          <td>
+            Total B
+          </td>
+       </tr>
+    </tfoot>
+  </table>
+</body>
+</html>

data/spec/examples/table.txt ADDED

@@ -0,0 +1,7 @@
+Hello, World!
+Col A 	Col B
+Data A1  	 Data B1
+Data A2  	 Data B2
+Data A3  	 Data B4
+Total A  	 Total B

data/spec/examples/test3.html ADDED

@@ -0,0 +1 @@
+test one<br />test two

data/spec/examples/test3.txt ADDED

@@ -0,0 +1,2 @@
+test one
+test two

data/spec/examples/test4.html ADDED

@@ -0,0 +1 @@
+1<br />2<br />3<br />4<br />5 6

data/spec/examples/test4.txt ADDED

@@ -0,0 +1,5 @@
+1
+2
+3
+4
+5 6

data/spec/examples_spec.rb ADDED

@@ -0,0 +1,25 @@
+require "spec_helper"
+describe Html2Text do
+  describe "#convert" do
+    let(:text) { Html2Text.convert(html) }
+    examples = Dir[File.dirname(__FILE__) + "/examples/*.html"]
+    examples.each do |filename|
+      context "#{filename}" do
+        let(:html) { File.read(filename) }
+        let(:text_file) { filename.sub(".html", ".txt") }
+        let(:expected) { Html2Text.fix_newlines(File.read(text_file)) }
+        it "converts to text" do
+          expect(text).to eq(expected)
+        end
+      end
+    end
+    it "has examples to test" do
+      expect(examples.size).to_not eq(0)
+    end
+  end
+end

data/spec/html2text_spec.rb ADDED

@@ -0,0 +1,37 @@
+require "spec_helper"
+describe Html2Text do
+  describe "#convert" do
+    let(:text) { Html2Text.convert(html) }
+    context "an empty line" do
+      let(:html) { "" }
+      it "is an empty line" do
+        expect(text).to eq("")
+      end
+    end
+    context "a simple string" do
+      let(:html) { "hello world" }
+      it "is an empty line" do
+        expect(text).to eq("hello world")
+      end
+    end
+  end
+  describe "#remove_leading_and_trailing_whitespace" do
+    let(:subject) { Html2Text.new(nil).remove_leading_and_trailing_whitespace(input) }
+    context "an empty string" do
+      let(:input) { "" }
+      it { is_expected.to eq("") }
+    end
+    context "many new lines" do
+      let(:input) { "hello\n  world \n yes" }
+      it { is_expected.to eq("hello\nworld\nyes") }
+    end
+  end
+end

data/spec/spec_helper.rb ADDED

@@ -0,0 +1,4 @@
+require "rspec"
+require "rspec/collection_matchers"
+require File.join(File.dirname(__FILE__), "..", "lib", "html2text")

metadata ADDED

@@ -0,0 +1,156 @@
+--- !ruby/object:Gem::Specification
+name: html2text
+version: !ruby/object:Gem::Version
+  version: 0.1.1
+platform: ruby
+authors:
+- Jevon Wright
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2015-12-17 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: nokogiri
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.6'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.6'
+- !ruby/object:Gem::Dependency
+  name: rspec
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: rspec-collection_matchers
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: colorize
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+description: A Ruby component to convert HTML into a plain text format.
+email:
+- jevon@powershop.co.nz
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- MIT-LICENSE
+- README.md
+- lib/html2text.rb
+- lib/html2text/version.rb
+- spec/examples/anchors.html
+- spec/examples/anchors.txt
+- spec/examples/basic.html
+- spec/examples/basic.txt
+- spec/examples/lists.html
+- spec/examples/lists.txt
+- spec/examples/more-anchors.html
+- spec/examples/more-anchors.txt
+- spec/examples/nbsp.html
+- spec/examples/nbsp.txt
+- spec/examples/table.html
+- spec/examples/table.txt
+- spec/examples/test3.html
+- spec/examples/test3.txt
+- spec/examples/test4.html
+- spec/examples/test4.txt
+- spec/examples_spec.rb
+- spec/html2text_spec.rb
+- spec/spec_helper.rb
+homepage: https://github.com/soundasleep/html2text_ruby
+licenses:
+- MIT
+metadata: {}
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 2.4.5
+signing_key:
+specification_version: 4
+summary: Convert HTML into plain text.
+test_files:
+- spec/examples/anchors.html
+- spec/examples/anchors.txt
+- spec/examples/basic.html
+- spec/examples/basic.txt
+- spec/examples/lists.html
+- spec/examples/lists.txt
+- spec/examples/more-anchors.html
+- spec/examples/more-anchors.txt
+- spec/examples/nbsp.html
+- spec/examples/nbsp.txt
+- spec/examples/table.html
+- spec/examples/table.txt
+- spec/examples/test3.html
+- spec/examples/test3.txt
+- spec/examples/test4.html
+- spec/examples/test4.txt
+- spec/examples_spec.rb
+- spec/html2text_spec.rb
+- spec/spec_helper.rb
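
The only runtime dependency in the metadata above is `nokogiri` with the pessimistic constraint `~> 1.6`, which RubyGems reads as "at least 1.6 and below 2.0". A short sketch using RubyGems' own `Gem::Requirement` and `Gem::Version` classes shows which versions would satisfy it; `nokogiri_allowed?` is a hypothetical helper name and the version strings are arbitrary examples.

```ruby
require "rubygems"

# The constraint recorded for nokogiri in the gem metadata.
NOKOGIRI_REQUIREMENT = Gem::Requirement.new("~> 1.6")

# Hypothetical helper: does this version string satisfy the constraint?
def nokogiri_allowed?(version)
  NOKOGIRI_REQUIREMENT.satisfied_by?(Gem::Version.new(version))
end

%w[1.5.11 1.6.0 1.10.4 2.0.0].each do |version|
  puts "#{version}: #{nokogiri_allowed?(version)}"
end
# prints:
# 1.5.11: false
# 1.6.0: true
# 1.10.4: true
# 2.0.0: false
```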