RubyGems - slasher - Versions diffs - 0.5.0 - Mend

slasher 0.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (19)

checksums.yaml +7 -0
data/.gitignore +38 -0
data/.rspec +3 -0
data/Gemfile +8 -0
data/Gemfile.lock +55 -0
data/README.md +30 -0
data/doc/website_coverage.txt +21 -0
data/lib/slasher/content.rb +23 -0
data/lib/slasher/dom.rb +43 -0
data/lib/slasher.rb +37 -0
data/spec/fixtures/test.html +22 -0
data/spec/fixtures/test_doc.html +21 -0
data/spec/fixtures/test_paragraph.html +16 -0
data/spec/fixtures/test_text.html +20 -0
data/spec/slasher/content_spec.rb +38 -0
data/spec/slasher/dom_spec.rb +70 -0
data/spec/slasher_spec.rb +33 -0
data/spec/spec_helper.rb +96 -0
metadata +60 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: b8fd01b12d6c5944f17ea1c2b6ee3a6c0cf59fee
+  data.tar.gz: a91be77c10b7467986f0bedefb09ea96c5fef00a
+SHA512:
+  metadata.gz: 439798d0ff97d07ed81519e7db4e9680b4edc9412cd8ab36b3b6340c143d09bbbdaeebd01c4f477be5c9a12e91b6ae8898b7c61fcf6b1f666f8cca4bfe2a7e8e
+  data.tar.gz: 1bc395d88baf44337bdfcbe767a961f8122acd1cceb877ac699650c95b4dd2fe871c7b5888400bcce382a893f76326b625b905b64215c6893343b06a7200d986
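
The digests listed above can be compared against a downloaded archive. What follows is a minimal sketch using Ruby's standard `digest` library; the helper name `sha512_matches?` and any file path passed to it are hypothetical and not part of the gem.

```ruby
require 'digest'

# Hypothetical helper: returns true when the SHA512 hex digest of the file at
# `path` equals `expected_hex` (a value listed under SHA512 in checksums.yaml).
def sha512_matches?(path, expected_hex)
  Digest::SHA512.file(path).hexdigest == expected_hex.strip.downcase
end
```

Assuming the archive has been unpacked locally, calling `sha512_matches?` with the path to `data.tar.gz` and the `data.tar.gz` value listed above should return `true` for an unmodified file.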

data/.gitignore ADDED

@@ -0,0 +1,38 @@
+*.gem
+*.rbc
+/.config
+/coverage/
+/InstalledFiles
+/pkg/
+/spec/reports/
+/test/tmp/
+/test/version_tmp/
+/tmp/
+## Specific to RubyMotion:
+.dat*
+.repl_history
+build/
+## Documentation cache and generated files:
+/.yardoc/
+/_yardoc/
+/rdoc/
+## Environment normalisation:
+/.bundle/
+/vendor/bundle
+/lib/bundler/man/
+# for a library or gem, you might want to ignore these files since the code is
+# intended to run in multiple environments; otherwise, check them in:
+# Gemfile.lock
+# .ruby-version
+# .ruby-gemset
+# unless supporting rvm < 1.11.0 or doing something fancy, ignore this:
+.rvmrc
+*.gemspec
+/spec/cases/
+/spec/cases_spec.rb

data/.rspec ADDED

@@ -0,0 +1,3 @@
+--color
+--require spec_helper
+--format documentation

data/Gemfile ADDED

@@ -0,0 +1,8 @@
+source 'https://rubygems.org'
+gem 'rspec'
+gem 'rspec-collection_matchers'
+gem 'capybara'
+gem 'pry'
+gem 'faker'
+gem 'nokogiri'

data/Gemfile.lock ADDED

@@ -0,0 +1,55 @@
+GEM
+  remote: https://rubygems.org/
+  specs:
+    capybara (2.4.4)
+      mime-types (>= 1.16)
+      nokogiri (>= 1.3.3)
+      rack (>= 1.0.0)
+      rack-test (>= 0.5.4)
+      xpath (~> 2.0)
+    coderay (1.1.0)
+    diff-lcs (1.2.5)
+    faker (1.4.3)
+      i18n (~> 0.5)
+    i18n (0.7.0)
+    method_source (0.8.2)
+    mime-types (2.6.1)
+    mini_portile (0.6.0)
+    nokogiri (1.6.5)
+      mini_portile (~> 0.6.0)
+    pry (0.10.1)
+      coderay (~> 1.1.0)
+      method_source (~> 0.8.1)
+      slop (~> 3.4)
+    rack (1.5.3)
+    rack-test (0.6.3)
+      rack (>= 1.0)
+    rspec (3.2.0)
+      rspec-core (~> 3.2.0)
+      rspec-expectations (~> 3.2.0)
+      rspec-mocks (~> 3.2.0)
+    rspec-collection_matchers (1.1.2)
+      rspec-expectations (>= 2.99.0.beta1)
+    rspec-core (3.2.3)
+      rspec-support (~> 3.2.0)
+    rspec-expectations (3.2.1)
+      diff-lcs (>= 1.2.0, < 2.0)
+      rspec-support (~> 3.2.0)
+    rspec-mocks (3.2.1)
+      diff-lcs (>= 1.2.0, < 2.0)
+      rspec-support (~> 3.2.0)
+    rspec-support (3.2.2)
+    slop (3.6.0)
+    xpath (2.0.0)
+      nokogiri (~> 1.3)
+PLATFORMS
+  ruby
+DEPENDENCIES
+  capybara
+  faker
+  nokogiri
+  pry
+  rspec
+  rspec-collection_matchers
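
The `~>` ("pessimistic") constraints in the lockfile above can be checked with RubyGems' own `Gem::Requirement` class. This is a small sketch; the helper name `satisfies?` is made up for illustration.

```ruby
require 'rubygems'

# Hypothetical helper: does `version` satisfy the requirement string?
def satisfies?(requirement, version)
  Gem::Requirement.new(requirement).satisfied_by?(Gem::Version.new(version))
end

# nokogiri 1.6.5 declares `mini_portile (~> 0.6.0)`: 0.6.x releases only.
puts satisfies?('~> 0.6.0', '0.6.0')  # prints true (the locked mini_portile)
puts satisfies?('~> 0.6.0', '0.7.0')  # prints false

# pry 0.10.1 declares `slop (~> 3.4)`: any 3.x release from 3.4 upwards.
puts satisfies?('~> 3.4', '3.6.0')    # prints true (the locked slop)
```

The number of version segments in the constraint decides which segment is allowed to float, which is why `~> 0.6.0` is stricter than `~> 3.4`.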

data/README.md ADDED

@@ -0,0 +1,30 @@
+# slasherrb
+[![Build Status](https://semaphoreci.com/api/v1/projects/58c6aef2-91c2-428e-a803-37a8e6ffac2d/445101/badge.svg)](https://semaphoreci.com/hafizbadrie/slasherrb)
+This project is the Ruby version of [slasherjs](https://github.com/hafizbadrie/slasherjs). Slasher is a library that extracts the main content of an HTML article document.
+The extraction result depends on assumptions about the structure of the HTML document. The result may therefore be flawed if the document doesn't match a structure that the library recognises.
+For this reason, the library will be improved over time.
+## How To Use
+To use the library, you need to have an HTML document first.
+```ruby
+require 'net/http'
+require 'slasher'
+uri = URI("http://sea-games-2015.liputan6.com/read/2252937/all-indonesia-finals-ganda-putra-sumbang-emas")
+html = Net::HTTP.get(uri)
+slasher = Slasher.new(html)
+content = slasher.slash
+# The content variable now holds the main content of the HTML document (article).
+```
+## Website Coverage
+This library has been tested against a number of websites; the complete list is in this [document](https://github.com/hafizbadrie/slasherrb/blob/master/doc/website_coverage.txt).
+## TODO
+1. Add more test cases: international websites
+2. Allow slashing a new site without re-initializing the object.

data/doc/website_coverage.txt ADDED

@@ -0,0 +1,21 @@
+1. liputan6.com
+2. kompas.com
+3. detik.com
+4. thejakartapost.com
+5. thejakartaglobe.beritasatu.com
+6. tribunnews.com
+7. merdeka.com
+8. okezone.com
+9. suara.com
+10. viva.co.id
+11. tempo.co
+12. republika.co.id
+13. metrotvnews.com
+14. bola.net
+15. bisnis.com
+16. cnnindonesia.com
+17. sindonews.com
+18. ttwigo.com
+19. jakpost.travel
+20. dailysocial.net
+21. teknojurnal.com

data/lib/slasher/content.rb ADDED

@@ -0,0 +1,23 @@
+class Slasher
+  class Content
+    attr_accessor :collection
+    def initialize
+      @collection = []
+    end
+    def push_content(content)
+      stored_content = {
+        length: content.gsub(/\s/, '').size,
+        content: content
+      }
+      @collection << stored_content
+    end
+    def get_longest_length
+      collection.sort_by do |content|
+        content[:length]
+      end.last
+    end
+  end
+end
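
As the class above shows, each candidate is stored with a length that ignores whitespace, and the candidate with the greatest length wins. Here is a standalone sketch of the same idea in plain Ruby, using `max_by` in place of `sort_by ... last`; the method names `measured` and `longest` are illustrative and not part of the gem.

```ruby
# Pair a string with its length counted without whitespace,
# mirroring the hash shape used by Slasher::Content#push_content.
def measured(text)
  { length: text.gsub(/\s/, '').size, content: text }
end

# Pick the entry with the greatest whitespace-free length.
# Returns nil for an empty list.
def longest(entries)
  entries.max_by { |entry| entry[:length] }
end

entries = [
  "This is the first content",
  "This should have the highest length among all",
  "Sortest"
].map { |text| measured(text) }

puts longest(entries)[:content]
# prints "This should have the highest length among all"
```

When all lengths are distinct, `max_by` selects the same entry as sorting and taking the last element, without ordering the whole collection.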

data/lib/slasher/dom.rb ADDED

@@ -0,0 +1,43 @@
+class Slasher
+  class DOM
+    REMOVED_ELEMENTS  = ['iframe', 'script', 'style', 'noscript', 'header', 'footer', 'br', 'img']
+    STRIPPED_ELEMENTS = ['blockquote', 'strong', 'a', 'em', 'b']
+    attr_accessor :document
+    def initialize(document)
+      @document = Nokogiri::HTML(document)
+    end
+    def remove_elements
+      REMOVED_ELEMENTS.each do |element|
+        @document.xpath("//#{element}").remove
+      end
+    end
+    def strip_elements
+      STRIPPED_ELEMENTS.each do |element|
+        @document.search("//#{element}").each do |node|
+          node.replace(Nokogiri::XML::Text.new(node.text, node.document))
+        end
+      end
+    end
+    def get_paragraphs_content(node)
+      content = ""
+      node.send(:>, "p").each do |p|
+        content += p.text
+        p.remove
+      end
+      content
+    end
+    def get_texts(node)
+      content = ""
+      node.children.each do |child|
+        content += child.text.delete("\n").strip if child.text?
+      end
+      content
+    end
+  end
+end
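
`get_texts` above concatenates only the direct text children of a node, dropping newlines and surrounding whitespace from each one. The string handling can be sketched without Nokogiri by modelling the text children as plain strings; the array below is an assumed approximation of the text nodes in `spec/fixtures/test_text.html`, and `join_texts` is a hypothetical name.

```ruby
# Assumed stand-ins for the text children of <div class="content">;
# whitespace-only entries model the gaps between <br> elements.
text_children = [
  "\n    This is first paragraph.\n    ",
  "\n    ",
  "\n    This is second paragraph.\n    ",
  "\n    ",
  "\n    This is third paragraph.\n    ",
  "\n    ",
  "\n  "
]

# Same per-node normalisation as Slasher::DOM#get_texts:
# remove newlines, trim both ends, then concatenate.
def join_texts(chunks)
  chunks.map { |chunk| chunk.delete("\n").strip }.join
end

puts join_texts(text_children)
# prints "This is first paragraph.This is second paragraph.This is third paragraph."
```

Because each chunk is stripped before joining, sentences run together without a separating space, which is the form the spec for `#get_texts` expects.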

data/lib/slasher.rb ADDED

@@ -0,0 +1,37 @@
+require 'slasher/content'
+require 'slasher/dom'
+class Slasher
+  attr_accessor :dom, :content
+  def initialize(html)
+    @dom      = Slasher::DOM.new(html)
+    @content  = Slasher::Content.new
+  end
+  def recursive_slash(doc)
+    content.push_content(dom.get_texts(doc))
+    doc.children.each do |child|
+      if child.send(:>, "p").count > 0
+        p_content = dom.get_paragraphs_content(child)
+        content.push_content(p_content)
+      end
+      if child.children.count > 0
+        recursive_slash(child)
+      else
+        if child.text != '' && !child.text.nil?
+          content.push_content(child.text)
+        end
+      end
+    end
+  end
+  def slash
+    dom.remove_elements
+    dom.strip_elements
+    recursive_slash(dom.document)
+    content.get_longest_length[:content]
+  end
+end
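
The traversal above gathers candidate strings while walking the document and then returns the longest one. Below is a simplified model of that idea on a toy tree, without Nokogiri. The hash-based nodes and the `collect` method are hypothetical, and the sketch deliberately omits the paragraph-merging step (`get_paragraphs_content`) and the element removal and stripping.

```ruby
# Toy node: :text is the node's own direct text, :children its child nodes.
def node(text, children = [])
  { text: text, children: children }
end

# Depth-first walk that records every non-empty direct text as a candidate.
def collect(current, candidates = [])
  candidates << current[:text] unless current[:text].empty?
  current[:children].each { |child| collect(child, candidates) }
  candidates
end

tree = node("", [
  node("This is just a content header"),
  node("This is first paragraph.This is second paragraph."),
  node("", [node("This is paragraph")])
])

# Longest candidate by whitespace-free length, as in Slasher::Content.
best = collect(tree).max_by { |text| text.gsub(/\s/, '').size }
puts best
# prints "This is first paragraph.This is second paragraph."
```

Unlike this sketch, the gem also pushes empty strings into its collection; that affects the collection size but not which candidate is longest.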

data/spec/fixtures/test.html ADDED

@@ -0,0 +1,22 @@
+<html>
+<head>
+  <title>Slasher.rb Test</title>
+</head>
+<body>
+  <style>h1 { font-size: 36px; }</style>
+  <script type="text/javascript">console.log("Hello World");</script>
+  <iframe src="http://facebook.com"></iframe>
+  <iframe src="http://twitter.com"></iframe>
+  <noscript>Hello</noscript>
+  <header>This is header</header>
+  <br>
+  <img src="https://avatars0.githubusercontent.com/u/494642?v=3&s=460">
+  <footer>This is footer</footer>
+  <div class="content">
+    <blockquote><h2>This is quote</h2></blockquote>
+    <strong>This is strong</strong>
+    <a href='#'>This is a link</a>
+    <em>This is italic sentence</em>
+  </div>
+</body>
+</html>

data/spec/fixtures/test_doc.html ADDED

@@ -0,0 +1,21 @@
+<html>
+<head>
+  <title>Slasher.rb Test</title>
+</head>
+<body>
+  <div class="content">
+    <div class="content-header">
+      This is just a content header
+    </div>
+    <div class="content-body">
+      <p>This is first paragraph.</p>
+      <p>This is second paragraph.</p>
+      <p>This is third paragraph.</p>
+    </div>
+  </div>
+  <div class="sidebar">
+    <p>This is paragraph</p>
+  </div>
+</body>
+</html>

data/spec/fixtures/test_paragraph.html ADDED

@@ -0,0 +1,16 @@
+<html>
+<head>
+  <title>Slasher.rb Test</title>
+</head>
+<body>
+  <div class="content">
+    <p>This is first paragraph.</p>
+    <p>This is second paragraph.</p>
+    <p>This is third paragraph.</p>
+  </div>
+  <div class="sidebar">
+    <p>This is paragraph</p>
+  </div>
+</body>
+</html>

data/spec/fixtures/test_text.html ADDED

@@ -0,0 +1,20 @@
+<html>
+<head>
+  <title>Slasher.rb Test</title>
+</head>
+<body>
+  <div class="content">
+    This is first paragraph.
+    <br>
+    <br>
+    This is second paragraph.
+    <br>
+    <br>
+    This is third paragraph.
+    <br>
+    <br>
+  </div>
+</body>
+</html>

data/spec/slasher/content_spec.rb ADDED

@@ -0,0 +1,38 @@
+describe Slasher::Content do
+  describe "#initialize" do
+    let(:content) { Slasher::Content.new }
+    it "will start with an empty collection" do
+      expect(content.collection).to be_empty
+    end
+  end
+  describe "#push_content" do
+    let(:content_1) { "This is just a content that needs to be stored in a collection" }
+    let(:content_2) { "This is just a content" }
+    let(:content) { Slasher::Content.new }
+    it "will store content in an array of hash" do
+      content.push_content(content_1)
+      content.push_content(content_2)
+      expect(content.collection).to have(2).items
+      expect(content.collection.first[:length]).to eq content_1.gsub(/\s/, '').size
+      expect(content.collection.first[:content]).to eq content_1
+    end
+  end
+  describe "#get_longest_length" do
+    let(:content) { Slasher::Content.new }
+    let(:content_1) { "This is the first content" }
+    let(:content_2) { "This should have the highest length among all"}
+    let(:content_3) { "Sortest" }
+    it "will return the content with the highest length" do
+      content.push_content(content_1)
+      content.push_content(content_2)
+      content.push_content(content_3)
+      expect(content.get_longest_length[:content]).to eq content_2
+    end
+  end
+end

data/spec/slasher/dom_spec.rb ADDED

@@ -0,0 +1,70 @@
+describe Slasher::DOM do
+  describe "#initialize" do
+    let(:html) { "<html><head><title>Hello World</title></head><body><h1>Hello World</h1></body></html>" }
+    it "will assign document based on provided data in initialisation" do
+      dom = Slasher::DOM.new(html)
+      document = Nokogiri::HTML(html)
+      expect(dom.document).to be_a Nokogiri::HTML::Document
+    end
+  end
+  describe "#remove_elements" do
+    let(:html) { File.open("spec/fixtures/test.html").read }
+    let(:dom) { Slasher::DOM.new(html) }
+    it "will remove elements like script, iframe, style, noscript, header, footer, br, and img" do
+      dom.remove_elements
+      document = Capybara.string(dom.document)
+      expect(document).not_to have_css "script"
+      expect(document).not_to have_css "iframe"
+      expect(document).not_to have_css "style"
+      expect(document).not_to have_css "noscript"
+      expect(document).not_to have_css "header"
+      expect(document).not_to have_css "footer"
+      expect(document).not_to have_css "br"
+      expect(document).not_to have_css "img"
+    end
+  end
+  describe "#strip_elements" do
+    let(:html) { File.open("spec/fixtures/test.html").read }
+    let(:dom) { Slasher::DOM.new(html) }
+    it "will remove the element but keep its content" do
+      dom.strip_elements
+      document = Capybara.string(dom.document)
+      expect(document).not_to have_css "blockquote"
+      expect(document).to have_content "This is quote"
+      expect(document).not_to have_css "strong"
+      expect(document).to have_content "This is strong"
+      expect(document).not_to have_css "a"
+      expect(document).to have_content "This is a link"
+      expect(document).not_to have_css "em"
+      expect(document).to have_content "This is italic sentence"
+    end
+  end
+  describe "#get_paragraphs_content" do
+    let(:html) { File.open("spec/fixtures/test_paragraph.html").read }
+    let(:dom) { Slasher::DOM.new(html) }
+    it "will get all the content inside tag p from specific parent" do
+      content = dom.get_paragraphs_content(dom.document.xpath("//div[@class='content']"))
+      expect(content).to eq "This is first paragraph.This is second paragraph.This is third paragraph."
+      content = dom.get_paragraphs_content(dom.document.xpath("//div[@class='sidebar']"))
+      expect(content).to eq "This is paragraph"
+    end
+  end
+  describe "#get_texts" do
+    let(:html) { File.open("spec/fixtures/test_text.html").read }
+    let(:dom) { Slasher::DOM.new(html) }
+    it "will concat all Text children into 1 content" do
+      content = dom.get_texts(dom.document.xpath("//div[@class='content']"))
+      expect(content).to eq "This is first paragraph.This is second paragraph.This is third paragraph."
+    end
+  end
+end

data/spec/slasher_spec.rb ADDED

@@ -0,0 +1,33 @@
+describe Slasher do
+  describe "#initialize" do
+    let(:html) { "<html><head><title>Hello World</title></head><body><h1>Hello World</h1></body></html>" }
+    it "will assign document based on provided data in initialisation" do
+      slasher = Slasher.new(html)
+      expect(slasher.dom).to be_a Slasher::DOM
+      expect(slasher.content).to be_a Slasher::Content
+    end
+  end
+  describe "#recursive_slash" do
+    let(:html) { File.open("spec/fixtures/test_doc.html") }
+    let(:slasher) { Slasher.new(html) }
+    it "will recursively turn document into array of hash" do
+      slasher.recursive_slash(slasher.dom.document)
+      content = slasher.content
+      expect(content.collection.size).to eq 30
+    end
+  end
+  describe "#slash" do
+    let(:html) { File.open("spec/fixtures/test_doc.html") }
+    let(:slasher) { Slasher.new(html) }
+    it "will return the longest/highest content" do
+      content = slasher.slash
+      expect(content).to eq "This is first paragraph.This is second paragraph.This is third paragraph."
+    end
+  end
+end

data/spec/spec_helper.rb ADDED

@@ -0,0 +1,96 @@
+# This file was generated by the `rspec --init` command. Conventionally, all
+# specs live under a `spec` directory, which RSpec adds to the `$LOAD_PATH`.
+# The generated `.rspec` file contains `--require spec_helper` which will cause
+# this file to always be loaded, without a need to explicitly require it in any
+# files.
+#
+# Given that it is always loaded, you are encouraged to keep this file as
+# light-weight as possible. Requiring heavyweight dependencies from this file
+# will add to the boot time of your test suite on EVERY test run, even for an
+# individual file that may not need all of that loaded. Instead, consider making
+# a separate helper file that requires the additional dependencies and performs
+# the additional setup, and require it from the spec files that actually need
+# it.
+#
+# The `.rspec` file also contains a few flags that are not defaults but that
+# users commonly want.
+#
+# See http://rubydoc.info/gems/rspec-core/RSpec/Core/Configuration
+require 'bundler'
+Bundler.require(:default)
+Dir.glob("./lib/**/*.rb") {|f| require f }
+RSpec.configure do |config|
+  # rspec-expectations config goes here. You can use an alternate
+  # assertion/expectation library such as wrong or the stdlib/minitest
+  # assertions if you prefer.
+  config.expect_with :rspec do |expectations|
+    # This option will default to `true` in RSpec 4. It makes the `description`
+    # and `failure_message` of custom matchers include text for helper methods
+    # defined using `chain`, e.g.:
+    #     be_bigger_than(2).and_smaller_than(4).description
+    #     # => "be bigger than 2 and smaller than 4"
+    # ...rather than:
+    #     # => "be bigger than 2"
+    expectations.include_chain_clauses_in_custom_matcher_descriptions = true
+  end
+  # rspec-mocks config goes here. You can use an alternate test double
+  # library (such as bogus or mocha) by changing the `mock_with` option here.
+  config.mock_with :rspec do |mocks|
+    # Prevents you from mocking or stubbing a method that does not exist on
+    # a real object. This is generally recommended, and will default to
+    # `true` in RSpec 4.
+    mocks.verify_partial_doubles = true
+  end
+# The settings below are suggested to provide a good initial experience
+# with RSpec, but feel free to customize to your heart's content.
+=begin
+  # These two settings work together to allow you to limit a spec run
+  # to individual examples or groups you care about by tagging them with
+  # `:focus` metadata. When nothing is tagged with `:focus`, all examples
+  # get run.
+  config.filter_run :focus
+  config.run_all_when_everything_filtered = true
+  # Limits the available syntax to the non-monkey patched syntax that is
+  # recommended. For more details, see:
+  #   - http://myronmars.to/n/dev-blog/2012/06/rspecs-new-expectation-syntax
+  #   - http://teaisaweso.me/blog/2013/05/27/rspecs-new-message-expectation-syntax/
+  #   - http://myronmars.to/n/dev-blog/2014/05/notable-changes-in-rspec-3#new__config_option_to_disable_rspeccore_monkey_patching
+  config.disable_monkey_patching!
+  # This setting enables warnings. It's recommended, but in some cases may
+  # be too noisy due to issues in dependencies.
+  config.warnings = true
+  # Many RSpec users commonly either run the entire suite or an individual
+  # file, and it's useful to allow more verbose output when running an
+  # individual spec file.
+  if config.files_to_run.one?
+    # Use the documentation formatter for detailed output,
+    # unless a formatter has already been configured
+    # (e.g. via a command-line flag).
+    config.default_formatter = 'doc'
+  end
+  # Print the 10 slowest examples and example groups at the
+  # end of the spec run, to help surface which specs are running
+  # particularly slow.
+  config.profile_examples = 10
+  # Run specs in random order to surface order dependencies. If you find an
+  # order dependency and want to debug it, you can fix the order by providing
+  # the seed, which is printed after each run.
+  #     --seed 1234
+  config.order = :random
+  # Seed global randomization in this process using the `--seed` CLI option.
+  # Setting this allows you to use `--seed` to deterministically reproduce
+  # test failures related to randomization by passing the same `--seed` value
+  # as the one that triggered the failure.
+  Kernel.srand config.seed
+=end
+end

metadata ADDED

@@ -0,0 +1,60 @@
+--- !ruby/object:Gem::Specification
+name: slasher
+version: !ruby/object:Gem::Version
+  version: 0.5.0
+platform: ruby
+authors:
+- Hafiz Badrie Lubis
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2015-06-16 00:00:00.000000000 Z
+dependencies: []
+description: Extract the content of an HTML article
+email: hafizbadrie@gmail.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- ".gitignore"
+- ".rspec"
+- Gemfile
+- Gemfile.lock
+- README.md
+- doc/website_coverage.txt
+- lib/slasher.rb
+- lib/slasher/content.rb
+- lib/slasher/dom.rb
+- spec/fixtures/test.html
+- spec/fixtures/test_doc.html
+- spec/fixtures/test_paragraph.html
+- spec/fixtures/test_text.html
+- spec/slasher/content_spec.rb
+- spec/slasher/dom_spec.rb
+- spec/slasher_spec.rb
+- spec/spec_helper.rb
+homepage: http://github.com/hafizbadrie/slasherrb
+licenses:
+- MIT
+metadata: {}
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 2.4.5
+signing_key:
+specification_version: 4
+summary: Extract the content of an HTML article
+test_files: []