algolia_html_extractor 2.0.0 → 2.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (7)

checksums.yaml +4 -4
data/CONTRIBUTING.md +19 -0
data/LICENSE.txt +20 -0
data/README.md +217 -0
data/lib/algolia_html_extractor.rb +144 -0
data/lib/version.rb +5 -0
metadata +7 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: c4ba20d4606345086dc7e931cf49ba1ae0e8336a
-  data.tar.gz: be54441ad8f6d880aa34b654187204e26083533f
+  metadata.gz: fb67fbcfbfb26740f9d97027a7f2258a52730792
+  data.tar.gz: 83bf786d6369805a8e737d1264ef8f4ad198dade
 SHA512:
-  metadata.gz: 1c7194623efe86e9c5e963de77df37e5ff6281d30b6deacaa56cd40dd842098ff6f43ac22e9c40323490c35f0b05e818a6d9df3a2f947ea85249765bafd7ab20
-  data.tar.gz: c1fc426c29fe7566506c8fc8af03e4a269f4dc8ae9eec2e7e9f91a6606648ac9c68c632e78037bcd57dfb8ddf010340844e2fc20e1ec685f1421ba8c3af28ea2
+  metadata.gz: 19d71cda82dae127c2a3603fdc975a443248eff8be2acb9724a86ef0c669a509937bcb5c63e859175cbb42e767409e5d30df8d7fbd3ecc6e3f5cc899c4457d57
+  data.tar.gz: abb6f3a34e2818049ad04901997700834a6e2fcd6e5b7c974fd0fd2e558741c957f71d2a649cdc4e17c4f18d7202427c7df2ed461c8457b2a4f8dc0429c27f11
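
As a rough illustration of what these values are: the entries in `checksums.yaml` are hex digests of the `metadata.gz` and `data.tar.gz` members of the `.gem` archive. The sketch below recomputes both digest kinds for a file in Ruby. The helper name `file_digests` and its `path` argument are hypothetical, and it assumes the archive member has already been extracted to disk.

```ruby
require 'digest/sha1'
require 'digest/sha2'

# Hypothetical helper: returns the SHA1 and SHA512 hex digests of the file
# at `path`, keyed the same way as the sections of checksums.yaml.
def file_digests(path)
  bytes = File.binread(path) # read raw bytes, no encoding conversion
  {
    'SHA1' => Digest::SHA1.hexdigest(bytes),
    'SHA512' => Digest::SHA512.hexdigest(bytes)
  }
end
```

Comparing the returned strings with the `+` lines above is one way to confirm that a downloaded archive matches the published 2.0.1 release.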

data/CONTRIBUTING.md ADDED

@@ -0,0 +1,19 @@
+## Releasing
+`rake build` will build
+# Tagging and releasing
+If you need to release a new version of the gem to RubyGems, follow
+these steps:
+```
+# Bump the version (in develop)
+./scripts/bump_version minor
+# Update master and release
+./scripts/release
+# Install the gem locally (optional)
+rake install
+```

data/LICENSE.txt ADDED

@@ -0,0 +1,20 @@
+Copyright (c) 2016 Pixelastic
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,217 @@
+# algolia_html_extractor
+This gem can convert HTML content into JSON records ready to be pushed to
+Algolia.
+Each HTML page will yield an array of records (one for each `<p>` by default).
+Each record will contain its hierarchy in the page as well as other metadata
+that can be used to configure relevance.
+## Installation
+```ruby
+# Gemfile
+source 'http://rubygems.org'
+gem 'algolia_html_extractor', '~> 2.0'
+```
+## How to use
+```ruby
+require 'algolia_html_extractor'
+content = File.read('./index.html')
+page = AlgoliaHTMLExtractor.new(content)
+records = page.extract
+puts records
+```
+## Records
+`extract` will return an array of records. Each record will represent a `<p>`
+paragraph of the initial text, along with its textual version (HTML removed),
+heading hierarchy, and other interesting bits.
+## Example
+Let's take the following HTML as input and see what records we get as output:
+```html
+<!doctype html>
+<html>
+<body>
+  <h1 name="journey">The Hero's Journey</h1>
+  <p>Most stories always follow the same pattern.</p>
+  <h2 name="departure">Part One: Departure</h2>
+  <p>A story starts in a mundane world, and helps identify the hero. It helps put all the achievements of the story into perspective.</p>
+  <h3 name="calladventure">The call to Adventure</h3>
+  <p>Some out-of-the-ordinary event pushes the hero to start his journey.</p>
+  <h3 name="threshold">Crossing the Threshold</h3>
+  <p>The hero quits his job, hits the road, or whatever cuts him from his previous life.</p>
+  <h2 name="initiations">Part Two: Initiation</h2>
+  <h3 name="trials">The Road of Trials</h3>
+  <p>The road is filled with dangers. The hero has to find his inner strength to overcome them.</p>
+  <h3 name="ultimate">The Ultimate Boon</h3>
+  <p>The hero has found something, either physical or metaphorical that changes him.</p>
+  <h2 name="return">Part Three: Return</h2>
+  <h3 name="refusal">Refusal to Return</h3>
+  <p>The hero does not want to go back to his previous life at first. But then, an event will make him change his mind.</p>
+  <h3 name="master">Master of Two Worlds</h3>
+  <p>Armed with his new power/weapon, the hero can go back to its initial world and fix all the issues he had there.</p>
+</body>
+</html>
+```
+Here is one of the records extracted:
+```ruby
+{
+  :uuid => "1f5923d5a60e998704f201bbe9964811",
+  :tag_name => "p",
+  :html => "<p>The hero quits his job, hits the road, or whatever cuts him from his previous life.</p>",
+  :text => "The hero quits his job, hits the road, or whatever cuts him from his previous life.",
+  :node => #<Nokogiri::XML::Element:0x11a5850 name="p">,
+  :anchor => 'threshold',
+  :hierarchy => {
+    :lvl0 => "The Hero's Journey",
+    :lvl1 => "Part One: Departure",
+    :lvl2 => "Crossing the Threshold",
+    :lvl3 => nil,
+    :lvl4 => nil,
+    :lvl5 => nil,
+    :lvl6 => nil
+  },
+  :weight => {
+    :heading => 70,
+    :position => 3
+  }
+}
+```
+Each record has a `uuid` that uniquely identifies it (computed by a hash of all
+the other values).
+It also contains the HTML tag name in `tag_name` (by default `<p>`
+paragraphs are extracted, but see the [settings][3] on how to change it).
+`html` contains the whole `outerContent` of the element, including the wrapping
+tags and inner children. The `text` attribute contains the textual content,
+stripping out all HTML.
+`node` contains the [Nokogiri node][4] instance. The lib uses it internally to
+extract all the relevant information, but it is also exposed if you want to
+process the node further.
+The `anchor` attribute contains the HTML anchor closest to the element. Here it
+is `threshold` because this is the closest anchor in the hierarchy above.
+Anchors are searched in `name` and `id` attributes of headings.
+`hierarchy` then contains a snapshot of the current heading hierarchy of the
+paragraph. The `lvlX` syntax is used to be compatible with the records
+[DocSearch][5] is using.
+The `weight` attribute is used to provide an easy way to rank two records
+relative to each other.
+- `heading` gives the depth level in the hierarchy where the record is. Records
+  on top level will have a value of 100, those under a `h1` will have 90, and so
+  on. Because our record is under a `h3`, it has 70.
+- `position` is the position of the paragraph in the page. Here our paragraph is
+  the fourth paragraph of the page, so it will have a `position` of 3. It can
+  help you give more weight to the first items in the page.
+## Settings
+When instantiating `AlgoliaHTMLExtractor`, you can pass an `options` keyword
+argument. This hash accepts one key, `css_selector`.
+```ruby
+page = AlgoliaHTMLExtractor.new(content, options: { css_selector: 'p,li' })
+```
+This lets you change the default selector. Here, in addition to `<p>` paragraphs,
+the library will extract `<li>` list elements as well.
+# CONTRIBUTING
+I'm happy you'd like to contribute. All contributions are welcome, ranging from
+feature requests to pull requests, but also including typo fixing, documentation
+and generic bug reports.
+## Bug Reports and feature requests
+For any bug or ideas of new features, please start by checking in the
+[issues](https://github.com/pixelastic/html-hierarchy-extractor/issues) tab if
+it hasn't been discussed already. If not, feel free to open a new issue.
+## Pull Requests
+All PRs are welcome, from small typo fixes to large codebase changes. If you
+think you'll need to change a lot of code in a lot of files, I suggest you
+open an issue first so we can discuss before you start working on something.
+All PRs should be based on the `develop` branch (`master` only ever contains the
+last released change).
+## Git Hooks
+If you start working on the actual code, you should install the git hooks.
+```
+cp ./scripts/git_hooks/* ./.git/hooks
+```
+This will add `pre-commit` and `pre-push` scripts that will respectively check
+that all files are lint-free before committing, and pass all tests before
+pushing. If either of those two hooks gives you errors, you should fix the code
+before committing or pushing.
+Having those steps helps keep the codebase as clean as possible, and
+avoids polluting PR discussions with style issues.
+## Development
+First thing you should do to get all your dependencies up to date is run `bundle
+install` before running any other command.
+## Lint
+`rake lint` will check all the files for potential linting issues. It uses
+Rubocop, and the configuration can be found in `.rubocop.yml`.
+## Test
+`rake test` will run all the tests.
+`rake coverage` will do the same, but will also add the code coverage files to
+`./coverage`. This should be useful in a CI environment.
+`rake watch` will run Guard that will do a live run of all your tests. Every
+update to a file (code or test) will re-run all the bound tests. This is highly
+recommended for TDD.
+## Using a local version of the gem
+If you want to test a local version of the gem in your local project, I suggest
+updating your project `Gemfile` to point to the correct local directory
+```ruby
+gem "html-hierarchy-extractor", :path => "/path/to/local/gem/folder"
+```
+You should also run `rake gemspec` from the `html-hierarchy-extractor`
+repository the first time and if you added/deleted any file or dependency.
+## History
+This gem was previously named `html-hierarchy-extractor` but has been renamed to
+`algolia_html_extractor` to both make its intent clearer and follow gem naming
+convention. That's also why this gem directly starts at v2.0.
+[1]: https://www.algolia.com/
+[2]: https://community.algolia.com/docsearch/
+[3]: #Settings
+[4]: http://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/Node
+[5]: https://community.algolia.com/docsearch/
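
The `hierarchy` and `weight` rules described in the README above can be sketched without Nokogiri. The block below is a simplified illustration, not the gem's code: `NODES` is a hypothetical stand-in for parsed DOM nodes (plain `[tag, text]` pairs in document order), and `sketch_records` is an assumed helper name. Only the heading bookkeeping and the weight arithmetic follow the README's description.

```ruby
# Hypothetical stand-in for parsed DOM nodes: [tag_name, text] in page order.
NODES = [
  ['h1', "The Hero's Journey"],
  ['p',  'Most stories always follow the same pattern.'],
  ['h2', 'Part One: Departure'],
  ['p',  'A story starts in a mundane world.'],
  ['h3', 'The call to Adventure'],
  ['p',  'Some out-of-the-ordinary event pushes the hero to start his journey.'],
  ['h3', 'Crossing the Threshold'],
  ['p',  'The hero quits his job.']
].freeze

# 100 at top level, then 10 less per heading depth (h1 => 90, h2 => 80, ...).
def heading_weight(heading_level)
  return 100 if heading_level.nil?
  100 - ((heading_level + 1) * 10)
end

# Walks the nodes in order, tracking the current heading hierarchy and
# emitting one simplified record per non-heading node.
def sketch_records(nodes)
  hierarchy = {}
  (0..5).each { |lvl| hierarchy[:"lvl#{lvl}"] = nil }
  current_lvl = nil
  position = 0
  records = []
  nodes.each do |tag, text|
    if tag =~ /\Ah([1-6])\z/
      # h1 maps to lvl0, h2 to lvl1, and so on
      current_lvl = Regexp.last_match(1).to_i - 1
      hierarchy[:"lvl#{current_lvl}"] = text
      # A new heading resets every deeper level
      (current_lvl + 1..5).each { |lvl| hierarchy[:"lvl#{lvl}"] = nil }
      next
    end
    records << {
      text: text,
      hierarchy: hierarchy.dup, # snapshot, so later headings do not mutate it
      weight: { heading: heading_weight(current_lvl), position: position }
    }
    position += 1
  end
  records
end
```

Under these assumptions, the last record sits under an `h3` and is the fourth paragraph, which is how the README example arrives at a `heading` of 70 and a `position` of 3.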

data/lib/algolia_html_extractor.rb ADDED

@@ -0,0 +1,144 @@
+require 'nokogiri'
+require 'digest/md5'
+# Extract content from an HTML page in the form of items with associated
+# hierarchy data
+class AlgoliaHTMLExtractor
+  def initialize(input, options: {})
+    @dom = Nokogiri::HTML(input)
+    default_options = {
+      css_selector: 'p'
+    }
+    @options = default_options.merge(options)
+  end
+  # Returns the outer HTML of a given node
+  #
+  # eg.
+  # <p>foo</p> => <p>foo</p>
+  def extract_html(node)
+    node.to_s.strip
+  end
+  # Returns the text content of a given node (HTML stripped)
+  #
+  # eg.
+  # <p>foo</p> => foo
+  def extract_text(node)
+    node.content
+  end
+  # Returns the tag name of a given node
+  #
+  # eg
+  # <p>foo</p> => p
+  def extract_tag_name(node)
+    node.name.downcase
+  end
+  # Returns the anchor to the node
+  #
+  # eg.
+  # <h1 name="anchor">Foo</h1> => anchor
+  # <h1 id="anchor">Foo</h1> => anchor
+  # <h1><a name="anchor">Foo</a></h1> => anchor
+  def extract_anchor(node)
+    anchor = node.attr('name') || node.attr('id') || nil
+    return anchor unless anchor.nil?
+    # No anchor found directly in the header, search on children
+    subelement = node.css('[name],[id]')
+    return extract_anchor(subelement[0]) unless subelement.empty?
+    nil
+  end
+  ##
+  # Generate a unique identifier for the item
+  def uuid(item)
+    # We first get all the keys of the object, sorted alphabetically...
+    ordered_keys = item.keys.sort
+    # ...then we build a huge array of "key=value" pairs...
+    ordered_array = ordered_keys.map do |key|
+      value = item[key]
+      # We apply the method recursively on other hashes
+      value = uuid(value) if value.is_a?(Hash)
+      "#{key}=#{value}"
+    end
+    # ...then we build a unique md5 hash of it
+    Digest::MD5.hexdigest(ordered_array.join(','))
+  end
+  ##
+  # Get a relative numeric value of the importance of the heading
+  # 100 for top level, then -10 per heading
+  def heading_weight(heading_level)
+    weight = 100
+    return weight if heading_level.nil?
+    weight - ((heading_level + 1) * 10)
+  end
+  def extract
+    heading_selector = 'h1,h2,h3,h4,h5,h6'
+    # We select all nodes that match either the headings or the elements to
+    # extract. This will allow us to loop over it in order it appears in the DOM
+    all_selector = "#{heading_selector},#{@options[:css_selector]}"
+    items = []
+    current_hierarchy = {
+      lvl0: nil,
+      lvl1: nil,
+      lvl2: nil,
+      lvl3: nil,
+      lvl4: nil,
+      lvl5: nil
+    }
+    current_position = 0 # Position of the DOM node in the tree
+    current_lvl = nil # Current closest hierarchy level
+    current_anchor = nil # Current closest anchor
+    @dom.css(all_selector).each do |node|
+      # If it's a heading, we update our current hierarchy
+      if node.matches?(heading_selector)
+        # Which level heading is it?
+        current_lvl = extract_tag_name(node).gsub(/^h/, '').to_i - 1
+        # Update this level, and set all the following ones to nil
+        current_hierarchy["lvl#{current_lvl}".to_sym] = extract_text(node)
+        (current_lvl + 1..6).each do |lvl|
+          current_hierarchy["lvl#{lvl}".to_sym] = nil
+        end
+        # Update the anchor, if the new heading has one
+        new_anchor = extract_anchor(node)
+        current_anchor = new_anchor if new_anchor
+      end
+      # Stop if node is not to be extracted
+      next unless node.matches?(@options[:css_selector])
+      # Stop if node is empty
+      text = extract_text(node)
+      next if text.empty?
+      item = {
+        html: extract_html(node),
+        text: text,
+        tag_name: extract_tag_name(node),
+        hierarchy: current_hierarchy.clone,
+        anchor: current_anchor,
+        node: node,
+        weight: {
+          position: current_position,
+          heading: heading_weight(current_lvl)
+        }
+      }
+      item[:uuid] = uuid(item)
+      items << item
+      current_position += 1
+    end
+    items
+  end
+end
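
The `uuid` method in the file above hashes a sorted list of `key=value` pairs, recursing into nested hashes. The standalone sketch below reproduces that idea outside the class to show one consequence: key order does not affect the result, while any value change does. The name `sketch_uuid` and the sample hashes are hypothetical and exist only for illustration.

```ruby
require 'digest/md5'

# Standalone copy of the hashing idea: sort keys, build "key=value" pairs,
# recurse into nested hashes, then MD5 the joined string.
def sketch_uuid(item)
  pairs = item.keys.sort.map do |key|
    value = item[key]
    value = sketch_uuid(value) if value.is_a?(Hash)
    "#{key}=#{value}"
  end
  Digest::MD5.hexdigest(pairs.join(','))
end

# Hypothetical sample items: same content in a different key order, and
# one item whose text differs.
same_a = { text: 'foo', weight: { position: 0, heading: 90 } }
same_b = { weight: { heading: 90, position: 0 }, text: 'foo' }
other  = { text: 'bar', weight: { position: 0, heading: 90 } }

puts sketch_uuid(same_a) == sketch_uuid(same_b)
puts sketch_uuid(same_a) == sketch_uuid(other)
```

Because every value is interpolated into the string, an identifier built this way changes whenever any field of the record changes, including `position`.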

data/lib/version.rb ADDED

@@ -0,0 +1,5 @@
+# Expose gem version
+# rubocop:disable Style/SingleLineMethods
+class AlgoliaHTMLExtractorVersion
+  def self.to_s; '2.0.1' end
+end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: algolia_html_extractor
 version: !ruby/object:Gem::Version
-  version: 2.0.0
+  version: 2.0.1
 platform: ruby
 authors:
 - Tim Carry
@@ -198,7 +198,12 @@ email: tim@pixelastic.com
 executables: []
 extensions: []
 extra_rdoc_files: []
-files: []
+files:
+- CONTRIBUTING.md
+- LICENSE.txt
+- README.md
+- lib/algolia_html_extractor.rb
+- lib/version.rb
 homepage: https://github.com/algolia/html-extractor
 licenses:
 - MIT