html-hierarchy-extractor 1.0.0 → 1.0.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 9c0a516a852828d433ba0495206acc9febbd1670
4
- data.tar.gz: 444cedeb76c06fd048526cb02c7fcac294927540
3
+ metadata.gz: d78e25992e261f127a8eced889d1e69299ca1e25
4
+ data.tar.gz: 17f14590068badca5030a219be8b24b3066bebf2
5
5
  SHA512:
6
- metadata.gz: 7e6505db7a21b42db30d4afffa496358642c1eb6332174f5ada9418f973056c0b0f9762b6458f68c02a1eb035700fe9746d6dbc92a613b4a5797a4b54512f2cc
7
- data.tar.gz: bcba7859c0e37030d6a209bef9b3980f35ea9dc08283f6d86445400cb115aa92f11f6ef1331ed0b7d353b077bf9e29542e730bde118e7479f96a3dbe96164833
6
+ metadata.gz: 4aa0dce7497b5523450646a46ed02239311170f413fb1ba464e70a316a9533138032585ce70f527819ce20c238a9249d70e1ff6c8fe23b044b20687a43ed3831
7
+ data.tar.gz: 7dc1676005759231f4934e5302aeb1ccf5dc7074e1d4d7201cf2de100bb731845cb43a78bb26574e4bc97429a7622a639641e065efef5f86728deed413459719
data/README.md CHANGED
@@ -1,17 +1,141 @@
1
1
  # html-hierarchy-extractor
2
2
 
3
3
  This gem lets you extract the hierarchy of headings and content from any HTML
4
- page into and array of elements.
4
+ page into an array of elements.
5
5
 
6
- It is intended to be used with Algolia to improve relevance of search results
7
- inside large HTML pages.
6
+ Intended to be used with [Algolia][1] to improve relevance of search
7
+ results inside large HTML pages. The records created are compatible with the
8
+ [DocSearch][2] format.
8
9
 
9
- Note: This repo is still a work in progress, and follows the RDD (Readme Driven
10
- Development) principle. All you see in the Readme might not be implemented yet.
10
+ ## Installation
11
+
12
+ ```ruby
13
+ # Gemfile
14
+ source 'https://rubygems.org'
15
+
16
+ gem 'html-hierarchy-extractor', '~> 1.0'
17
+ ```
11
18
 
12
19
  ## How to use
13
20
 
14
21
  ```ruby
15
- page = HTMLHierarchyExtractor(html) # Or filepath
16
- page.extract
22
+ require 'html-hierarchy-extractor'
23
+
24
+ content = File.read('./index.html')
25
+ page = HTMLHierarchyExtractor.new(content)
26
+ records = page.extract
27
+ puts records
28
+ ```
29
+
30
+ ## Records
31
+
32
+ `extract` will return an array of records. Each record will represent a `<p>`
33
+ paragraph of the initial text, along with its textual version (HTML removed),
34
+ heading hierarchy, and other interesting bits.
35
+
36
+ ## Example
37
+
38
+ Let's take the following HTML as input and see what records we get as output:
39
+
40
+ ```html
41
+ <!doctype html>
42
+ <html>
43
+ <body>
44
+ <h1 name="journey">The Hero's Journey</h1>
45
+ <p>Most stories always follow the same pattern.</p>
46
+ <h2 name="departure">Part One: Departure</h2>
47
+ <p>A story starts in a mundane world, and helps identify the hero. It helps put all the achievements of the story into perspective.</p>
48
+ <h3 name="calladventure">The call to Adventure</h3>
49
+ <p>Some out-of-the-ordinary event pushes the hero to start his journey.</p>
50
+ <h3 name="threshold">Crossing the Threshold</h3>
51
+ <p>The hero quits his job, hit the road, or whatever cuts him from his previous life.</p>
52
+ <h2 name="initiations">Part Two: Initiation</h2>
53
+ <h3 name="trials">The Road of Trials</h3>
54
+ <p>The road is filled with dangers. The hero has to find his inner strength to overcome them.</p>
55
+ <h3 name="ultimate">The Ultimate Boon</h3>
56
+ <p>The hero has found something, either physical or metaphorical that changes him.</p>
57
+ <h2 name="return">Part Three: Return</h2>
58
+ <h3 name="refusal">Refusal to Return</h3>
59
+ <p>The hero does not want to go back to his previous life at first. But then, an event will make him change his mind.</p>
60
+ <h3 name="master">Master of Two Worlds</h3>
61
+ <p>Armed with his new power/weapon, the hero can go back to his initial world and fix all the issues he had there.</p>
62
+ </body>
63
+ </html>
64
+ ```
65
+
66
+ Here is one of the records extracted:
67
+
68
+ ```ruby
69
+ {
70
+ :uuid => "1f5923d5a60e998704f201bbe9964811",
71
+ :tag_name => "p",
72
+ :html => "<p>The hero quits his job, hit the road, or whatever cuts him from his previous life.</p>",
73
+ :text => "The hero quits his job, hit the road, or whatever cuts him from his previous life.",
74
+ :node => #<Nokogiri::XML::Element:0x11a5850 name="p">,
75
+ :anchor => "threshold",
76
+ :hierarchy => {
77
+ :lvl0 => "The Hero's Journey",
78
+ :lvl1 => "Part One: Departure",
79
+ :lvl2 => "Crossing the Threshold",
80
+ :lvl3 => nil,
81
+ :lvl4 => nil,
82
+ :lvl5 => nil,
83
+ :lvl6 => nil
84
+ },
85
+ :weight => {
86
+ :heading => 70,
87
+ :position => 3
88
+ }
89
+ }
17
90
  ```
91
+
92
+ Each record has a `uuid` that uniquely identifies it (computed by a hash of all
93
+ the other values).
94
+
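As a rough illustration of the `uuid` idea described above, the sketch below hashes the other values of a record. The helper name `record_uuid`, the choice of MD5, and the serialization through `inspect` are assumptions made for this example, not the gem's confirmed implementation.

```ruby
require 'digest/md5'

# Hypothetical helper: derive a stable identifier from every value of the
# record except :uuid itself and :node (which is not a plain value).
def record_uuid(record)
  payload = record.reject { |key, _| %i[uuid node].include?(key) }
  # Sort the top-level keys so the digest does not depend on insertion order
  ordered = payload.sort_by { |key, _| key.to_s }
  Digest::MD5.hexdigest(ordered.inspect)
end

record = {
  tag_name: 'p',
  text: 'Most stories always follow the same pattern.',
  weight: { heading: 90, position: 0 }
}
puts record_uuid(record)
```

Two records with the same values get the same identifier under this scheme, and changing any value changes it.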
95
+ It also contains the HTML tag name in `tag_name` (by default `<p>`
96
+ paragraphs are extracted, but see the [settings][3] on how to change it).
97
+
98
+ `html` contains the whole `outerHTML` of the element, including the wrapping
99
+ tags and inner children. The `text` attribute contains the textual content,
100
+ stripping out all HTML.
101
+
102
+ `node` contains the [Nokogiri node][4] instance. The lib uses it internally to
103
+ extract all the relevant information, but it is also exposed if you want to process
104
+ the node further.
105
+
106
+ The `anchor` attribute contains the HTML anchor closest to the element. Here it
107
+ is `threshold` because this is the closest anchor in the hierarchy above.
108
+ Anchors are searched in `name` and `id` attributes of headings.
109
+
110
+ `hierarchy` then contains a snapshot of the current heading hierarchy of the
111
+ paragraph. The `lvlX` syntax is used to be compatible with the records
112
+ [DocSearch][5] is using.
113
+
114
+ The `weight` attribute is used to provide an easy way to rank two records
115
+ relative to each other.
116
+
117
+ - `heading` gives the depth level in the hierarchy where the record is. Records
118
+ on top level will have a value of 100, those under a `h1` will have 90, and so
119
+ on. Because our record is under a `h3`, it has 70.
120
+ - `position` is the position of the paragraph in the page. Here our paragraph is
121
+ the fourth paragraph of the page, so it will have a `position` of 3. It can
122
+ help you give more weight to the first items in the page.
123
+
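The weight rules above can be sketched as follows. The helper names `heading_weight` and `rank` are hypothetical and only mirror the values described in this README (100 at the top level, minus 10 per heading level). They are not part of the gem's API, and the ordering shown is just one possible way to use the weights.

```ruby
# Hypothetical helper: 100 for records outside any heading, 90 under an h1,
# 80 under an h2, and so on (level 0 means "no heading above").
def heading_weight(level)
  100 - (level * 10)
end

# Hypothetical ranking: highest heading weight first, then earliest position.
def rank(records)
  records.sort_by { |r| [-r[:weight][:heading], r[:weight][:position]] }
end

records = [
  { text: 'Crossing the Threshold paragraph',
    weight: { heading: heading_weight(3), position: 3 } },
  { text: 'Intro paragraph',
    weight: { heading: heading_weight(1), position: 0 } }
]
rank(records).each { |r| puts "#{r[:weight][:heading]} #{r[:text]}" }
```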
124
+ ## Settings
125
+
126
+ When instantiating `HTMLHierarchyExtractor`, you can pass a second `options`
127
+ argument. This hash accepts one key, `css_selector`.
128
+
129
+ ```ruby
130
+ page = HTMLHierarchyExtractor.new(content, { css_selector: 'p,li' })
131
+ ```
132
+
133
+ This lets you change the default selector. Here, in addition to `<p>` paragraphs,
134
+ the library will also extract `<li>` list elements.
135
+
136
+
137
+ [1]: https://www.algolia.com/
138
+ [2]: https://community.algolia.com/docsearch/
139
+ [3]: #settings
140
+ [4]: http://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/Node
141
+ [5]: https://community.algolia.com/docsearch/
@@ -2,11 +2,11 @@
2
2
  # DO NOT EDIT THIS FILE DIRECTLY
3
3
  # Instead, edit Jeweler::Tasks in Rakefile, and run 'rake gemspec'
4
4
  # -*- encoding: utf-8 -*-
5
- # stub: html-hierarchy-extractor 1.0.0 ruby lib
5
+ # stub: html-hierarchy-extractor 1.0.1 ruby lib
6
6
 
7
7
  Gem::Specification.new do |s|
8
8
  s.name = "html-hierarchy-extractor"
9
- s.version = "1.0.0"
9
+ s.version = "1.0.1"
10
10
 
11
11
  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
12
12
  s.require_paths = ["lib"]
@@ -1,6 +1,6 @@
1
1
  # Expose gem version
2
2
  class HTMLHierarchyExtractorVersion
3
3
  def self.to_s
4
- '1.0.0'
4
+ '1.0.1'
5
5
  end
6
6
  end
@@ -4,13 +4,10 @@ set -e
4
4
 
5
5
  git checkout master
6
6
  git pull
7
- bundle install
8
7
 
9
8
  git rebase develop
10
9
  bundle install
11
10
  rake release
12
11
 
13
12
  git checkout develop
14
- bundle install
15
13
  git rebase master
16
- bundle install
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: html-hierarchy-extractor
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.0.0
4
+ version: 1.0.1
5
5
  platform: ruby
6
6
  authors:
7
7
  - Tim Carry