RubyGems - html_scraper - Versions diffs - 0.1.1 → 0.2.0 - Mend

html_scraper 0.1.1 → 0.2.0

Files changed (8)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: f5e3bd25867c47af6faa4f802cb9f73ed8d67523
-  data.tar.gz: 0a889ccd19d4fe9f309abdf5f6b7f10116fa3924
+  metadata.gz: c21eb31e2e7f7b4b0a50b2beda393ae9326e64d3
+  data.tar.gz: 5e1213905ec2d05376a0d9bd9ad86eae90d2f29e
 SHA512:
-  metadata.gz: dc7894c04442dd5a091185b674edad519d44688b46c87cf31bc6666294548b690b570d0c20d0b6b24bf47e8b3dbf0e1a045c85634c487a623beeb53098f57e59
-  data.tar.gz: 81bfcccfa4a9fad75e4ea560f4e6c9509cfbe169f514b59657c77f5454579330923b66bcd8af2f05298c842428ddb1c8336eb3738a97c63358296a9cbb1ef810
+  metadata.gz: ead2bc0b28d5bb9ccf14f3931cdba4077edc0fcc92390d26598a1560942df5182fafc280c75b063d97de8e64b2f148d94ee8e1e8903d2adfff521fb59cc57aa9
+  data.tar.gz: cc778c4a4613213f2583ec4d7ae9d30a0d300ea8ba6a37e8ac52e90acf39bcd7af5dce812e788bbbc9d736e83fab6e452c44138459d40eb50ec3e3087e44365f
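
The two entries in `checksums.yaml` name the members of the `.gem` archive, so the new digests can be compared against a downloaded copy. A minimal sketch, assuming the archive members have already been read into memory; `checksum_matches?` and `DIGESTS` are hypothetical names, not part of RubyGems:

```ruby
require 'digest/sha1'
require 'digest/sha2'
require 'yaml'

# Map the section names used in checksums.yaml to stdlib digest classes.
DIGESTS = { 'SHA1' => Digest::SHA1, 'SHA512' => Digest::SHA512 }.freeze

# Hypothetical helper: true when `bytes` hashes to the value recorded
# for `entry_name` under `algorithm` in the checksums.yaml text.
def checksum_matches?(checksums_yaml, algorithm, entry_name, bytes)
  expected = YAML.safe_load(checksums_yaml).dig(algorithm, entry_name)
  DIGESTS.fetch(algorithm).hexdigest(bytes) == expected
end

# Self-check with an empty payload, whose SHA1 digest is well known.
sample = "---\nSHA1:\n  metadata.gz: da39a3ee5e6b4b0d3255bfef95601890afd80709\n"
puts checksum_matches?(sample, 'SHA1', 'metadata.gz', '')
```

For a real check, `bytes` would be the raw contents of `metadata.gz` or `data.tar.gz` extracted from the `.gem` file.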

data/.gitignore CHANGED

@@ -7,3 +7,4 @@
 /pkg/
 /spec/reports/
 /tmp/
+.rvmrc

data/CHANGELOG.md ADDED

@@ -0,0 +1,13 @@
+### 0.2.0
+* Allow attribute value parsing
+### 0.1.1
+* Allow higher versions of activesupport than 4
+### 0.1.0
+* HtmlParser templated html parsing

data/README.md CHANGED

@@ -1,6 +1,6 @@
 # HtmlScraper
-HtmlScraper is a ruby gem that transforms the content from a html web page to a json document following a defined html template
+HtmlScraper is a ruby gem that allows parsing an html document to a json structure following a template
 ## Installation
@@ -20,22 +20,24 @@ Or install it yourself as:
 ## Usage
-### Simple html parsing
-Expressions sourrounded by `{{ }}` will be parsed as simple json attributes:
+Define an html template matching the html document that will be parsed. On the blocks where data needs to be extracted define the json attribute surrounded by `{{ }}` and the data for that block will be assigned to that json attribute:
 ```ruby
 template = '
-     <div class="person">
-        <h5>{{ surname }}</h5>
-        <p>{{ name }}</p>
+  <div id="people-list">
+    <div class="person" hs-repeat="people">
+      <a href="{{ link }}">{{ surname }}</a>
+      <p>{{ name }}</p>
     </div>
+  </div>
 '
 html = '
     <html>
       <body>
+          <div id="people-list">
           <div class="person">
-            <h5>Eastwood</h5>
+            <a href="/clint-eastwood">Eastwood</a>
             <p>Clint</p>
           </div>
       </body>
@@ -45,13 +47,14 @@ html = '
 ```
 The json result:
 ```
-{:surname=>"Eastwood", :name=>"Clint"}
+{:surname=>"Eastwood", :name=>"Clint", :link=>"/clint-eastwood"}
 ```
 ### Iterative data
-To parse iterative structures define the attribute `hs-repeat` to the html node containing the iteration:
+To parse iterative structures define the attribute `hs-repeat` to the html node containing the iteration. The value of `hs-repeat` will be the name of the json attribute containing an array of the parsed subelements:
 ```ruby
 template = '
@@ -88,13 +91,114 @@ json = HtmlScraper::Scraper.new(template: template).parse(html)
 The json result:
-````ruby
+```ruby
 {:people=>
   [{:surname=>"Eastwood", :name=>"Clint"},
    {:surname=>"Woods", :name=>"James"},
    {:surname=>"Kinski", :name=>"Klaus"}]}
 ```
+### Regular expressions
+Regular expressions can be used next to the attribute name (surrounded by `//`) to filter the parsed string that will be assigned to the attribute. The attribute value will be the first string matching the regular expression:
+```ruby
+  template = '<div id="people-list">
+    <div class="person">
+      <h5>{{ surname }}</h5>
+      <p>{{ name }}</p>
+      <span>{{ birthday/\d+\.\d+\.\d+/ }}</span>
+    </div>
+  </div>
+  '
+  html = '
+    <html>
+      <body>
+          <div id="people-list">
+          <div class="person">
+            <h5>Eastwood</h5>
+            <p>Clint</p>
+            <span>Born on 31.05.1930</span>
+          </div>
+      </body>
+    </html>
+ '
+ json = HtmlScraper::Scraper.new(template: template).parse(html)
+```
+will result in:
+```
+{:surname=>"Eastwood", :name=>"Clint", :birthday=>"31.05.1930"}
+```
+### Ruby code evaluation
+For more complex attribute evaluations, ruby code can be used to manipulate the parsed expression. After the attribute name and `=` a ruby block can follow and the result will be assigned to the corresponding json attribute. Use the symbol `$` to reference the evaluated expression within the ruby block:
+```ruby
+  template = '
+  <div id="people-list">
+    <div class="person">
+      <h5>{{ surname = $.upcase }}</h5>
+  </div>
+  '
+  html = '
+    <html>
+      <body>
+          <div id="people-list">
+          <div class="person">
+            <h5>Eastwood</h5>
+            <p>Clint</p>
+            <span>Born on 31.05.1930</span>
+          </div>
+      </body>
+    </html>
+ '
+ json = HtmlScraper::Scraper.new(template: template).parse(html)
+```
+will result in:
+```
+{:surname=>"EASTWOOD" }
+```
+Regular expressions and ruby code can be combined:
+```ruby
+  template = '<div id="people-list">
+    <div class="person">
+      <h5>{{ surname/\w{4}/ = $.upcase }}</h5>
+  </div>
+  '
+  html = '
+    <html>
+      <body>
+          <div id="people-list">
+          <div class="person">
+            <h5>Eastwood</h5>
+            <p>Clint</p>
+            <span>Born on 31.05.1930</span>
+          </div>
+      </body>
+    </html>
+ '
+ json = HtmlScraper::Scraper.new(template: template).parse(html)
+```
+will result in:
+```
+{:surname=>"EAST" }
+```
 ## Development
 After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake test` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
@@ -103,7 +207,7 @@ To install this gem onto your local machine, run `bundle exec rake install`. To
 ## Contributing
-Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/html_scraper.
+Bug reports and pull requests are welcome on GitHub at https://github.com/bduran82/html_scraper.
 ## License
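
The README additions describe an expression syntax of the form `name`, `name/regex/`, `name = code`, or `name/regex/ = code`. The gem's `Expression` class is not part of this diff, so the following is only a sketch of those semantics written from the README text; `EXPRESSION` and `evaluate_expression` are hypothetical names and the real implementation may differ:

```ruby
# Hypothetical evaluator based on the README text; NOT the gem's Expression class.
EXPRESSION = %r{\A(?<name>\w+)\s*(?:/(?<regex>.*)/)?\s*(?:=\s*(?<code>.+))?\z}

def evaluate_expression(expr, text)
  parts = EXPRESSION.match(expr.strip)
  return {} if parts.nil?

  value = text
  # Optional /regex/: keep the first substring that matches.
  value = value[Regexp.new(parts[:regex])] if parts[:regex]
  # Optional "= code": `$` stands for the value parsed so far.
  value = eval(parts[:code].gsub('$', 'value'), binding) if parts[:code]
  { parts[:name].to_sym => value }
end

p evaluate_expression(' name ', 'Clint')
p evaluate_expression('birthday/\d+\.\d+\.\d+/', 'Born on 31.05.1930')
p evaluate_expression('surname/\w{4}/ = $.upcase', 'Eastwood')
```

Because the code part is passed to `eval`, a template written this way should be treated as trusted input, never as something taken from the page being scraped.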

data/html_scraper.gemspec CHANGED

@@ -8,8 +8,8 @@ Gem::Specification.new do |spec|
   spec.version       = HtmlScraper::VERSION
   spec.authors       = ['Bernat Duran']
-  spec.summary       = 'Parses the data contained in a html web page to json using a user-defined'
-  spec.homepage      = 'https://github.com/bduran82/html_scraper.git'
+  spec.summary       = 'Parses an html document to a json structure following a template'
+  spec.homepage      = 'https://github.com/bduran82/html_scraper'
   spec.license       = 'MIT'
   spec.files         = `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
@@ -24,4 +24,5 @@ Gem::Specification.new do |spec|
   spec.add_development_dependency 'rake', '~> 10.0'
   spec.add_development_dependency 'minitest', '~> 5.0'
   spec.add_development_dependency 'pry'
+  spec.add_development_dependency 'pry-nav'
 end

data/lib/html_scraper/scraper.rb CHANGED

@@ -12,9 +12,7 @@ module HtmlScraper
     def parse(html)
       html_template = Nokogiri::HTML(@template)
       return {} if html_template.root.nil?
-      template_root = html_template.root.children.first
-      html_root = Nokogiri::HTML(html).root
-      return inspect(template_root, html_root)
+      return inspect(html_template.root, Nokogiri::HTML(html))
     end
     def inspect(template_node, html_node)
@@ -36,15 +34,28 @@ module HtmlScraper
     end
     def parse_node(template_node, html_node)
-      expression = template_node.xpath('./text()').text
-      result = evaluate_expressions(expression, html_node.text)
-      children_results = template_node.children.map { |t_node| inspect(t_node, html_node) }
-      return result.merge(children_results.reduce(&:merge))
+      return [
+        evaluate_attributes(template_node, html_node),
+        evaluate_text(template_node, html_node),
+        template_node.children.map { |t_node| inspect(t_node, html_node) }.reduce({}, &:merge)
+      ].reduce(&:merge)
     end
     private :parse_node
+    def evaluate_attributes(template_node, html_node)
+      return template_node.attributes.map do |name, attr|
+        evaluate_expressions(attr.value, html_node.attributes[name]&.value)
+      end.reduce({}, &:merge)
+    end
+    private :evaluate_attributes
+    def evaluate_text(template_node, html_node)
+      return evaluate_expressions(template_node.xpath('./text()').text, html_node.text)
+    end
+    private :evaluate_text
     def evaluate_expressions(expression, text)
-      result = expression.scan(/^\s*{{(.*)}}\s*$/).flatten.reduce({}) do |res, expr|
+      result = expression.scan(expr_regexp).flatten.reduce({}) do |res, expr|
         res.merge(Expression.new(expr).evaluate(text))
       end
@@ -54,7 +65,9 @@ module HtmlScraper
     def build_xpath(template_node)
       xpath = ".//#{template_node.name}"
-      attributes = template_node.attributes.reject { |k, _| k.start_with?('hs-') }
+      attributes = template_node.attributes.reject do |name, attr|
+        name.start_with?('hs-') || attr.value =~ expr_regexp
+      end
       if !attributes.blank?
         selector = attributes.map { |k, v| attribute_selector(k, v) }.join
         xpath = "#{xpath}#{selector}"
@@ -73,6 +86,11 @@ module HtmlScraper
     end
     private :attribute_selector
+    def expr_regexp
+      /^\s*{{(.*)}}\s*$/
+    end
+    private :expr_regexp
     def log(text)
       puts "#{'   ' * @depth}#{text}" if @verbose
     end
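
Two details of the `scraper.rb` change can be illustrated without Nokogiri. In this sketch a plain hash stands in for a template node's attributes (an assumption for illustration; the gem works on Nokogiri attribute objects), and the pattern is the one from `expr_regexp` with the braces escaped:

```ruby
# Same pattern as expr_regexp in the diff, with the braces escaped.
EXPR_REGEXP = /^\s*\{\{(.*)\}\}\s*$/

# Hypothetical stand-in for a template node's attributes (name => value).
template_attrs = {
  'class'     => 'person',
  'hs-repeat' => 'people',
  'href'      => '{{ link }}'
}

# As in build_xpath: hs-* directives and {{ }} expressions are not selectors.
selector_attrs = template_attrs.reject do |name, value|
  name.start_with?('hs-') || value =~ EXPR_REGEXP
end

# As in evaluate_expressions: pull out whatever sits between the braces.
expressions = template_attrs.values.flat_map { |value| value.scan(EXPR_REGEXP).flatten }

# Why a seeded reduce is safer: without a seed, an empty list reduces
# to nil, and passing nil to Hash#merge raises TypeError.
no_children = []
unseeded = no_children.reduce(&:merge)
seeded   = no_children.reduce({}, &:merge)

p selector_attrs
p expressions
p unseeded
p seeded
```

Only `class` survives as an XPath selector, while `href` is treated as data to extract. The seeded `reduce({}, &:merge)` presumably guards `parse_node` against template nodes that have no children.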

data/lib/html_scraper/version.rb CHANGED

@@ -1,3 +1,3 @@
 module HtmlScraper
-  VERSION = '0.1.1'.freeze
+  VERSION = '0.2.0'.freeze
 end

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: html_scraper
 version: !ruby/object:Gem::Version
-  version: 0.1.1
+  version: 0.2.0
 platform: ruby
 authors:
 - Bernat Duran
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2016-08-15 00:00:00.000000000 Z
+date: 2016-08-16 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -94,6 +94,20 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: '0'
+- !ruby/object:Gem::Dependency
+  name: pry-nav
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
 description:
 email:
 executables: []
@@ -103,6 +117,7 @@ files:
 - ".gitignore"
 - ".rubocop.yml"
 - ".travis.yml"
+- CHANGELOG.md
 - Gemfile
 - LICENSE.txt
 - README.md
@@ -114,7 +129,7 @@ files:
 - lib/html_scraper/expression.rb
 - lib/html_scraper/scraper.rb
 - lib/html_scraper/version.rb
-homepage: https://github.com/bduran82/html_scraper.git
+homepage: https://github.com/bduran82/html_scraper
 licenses:
 - MIT
 metadata: {}
@@ -137,5 +152,5 @@ rubyforge_project:
 rubygems_version: 2.5.1
 signing_key:
 specification_version: 4
-summary: Parses the data contained in a html web page to json using a user-defined
+summary: Parses an html document to a json structure following a template
 test_files: []