html-hierarchy-extractor 1.0.0 → 1.0.1
- checksums.yaml +4 -4
- data/README.md +131 -7
- data/html-hierarchy-extractor.gemspec +2 -2
- data/lib/version.rb +1 -1
- data/scripts/release +0 -3
- metadata +1 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: d78e25992e261f127a8eced889d1e69299ca1e25
+  data.tar.gz: 17f14590068badca5030a219be8b24b3066bebf2
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 4aa0dce7497b5523450646a46ed02239311170f413fb1ba464e70a316a9533138032585ce70f527819ce20c238a9249d70e1ff6c8fe23b044b20687a43ed3831
+  data.tar.gz: 7dc1676005759231f4934e5302aeb1ccf5dc7074e1d4d7201cf2de100bb731845cb43a78bb26574e4bc97429a7622a639641e065efef5f86728deed413459719
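The digests in `checksums.yaml` can be checked locally against an unpacked `.gem` archive. The following is a minimal sketch, assuming `metadata.gz` and `data.tar.gz` have already been extracted into a directory; the `mismatched_files` helper and the `EXPECTED` constant are hypothetical names used for illustration and are not part of RubyGems.

```ruby
require 'digest/sha1'
require 'digest/sha2'

# Expected SHA1 digests copied from the new checksums.yaml in this diff.
EXPECTED = {
  'SHA1' => {
    'metadata.gz' => 'd78e25992e261f127a8eced889d1e69299ca1e25',
    'data.tar.gz' => '17f14590068badca5030a219be8b24b3066bebf2'
  }
}.freeze

# Hypothetical helper: returns the names of the files in `dir` whose
# digest differs from the expected hex string.
def mismatched_files(expected, dir = '.')
  algorithms = { 'SHA1' => Digest::SHA1, 'SHA512' => Digest::SHA512 }
  expected.flat_map do |algorithm, files|
    digest_class = algorithms.fetch(algorithm)
    files.reject do |name, hex|
      digest_class.file(File.join(dir, name)).hexdigest == hex
    end.keys
  end.uniq
end
```

Calling `mismatched_files(EXPECTED, 'path/to/unpacked')` returns an empty array when every file matches. The SHA512 entries can be added to the same hash in the same shape.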
data/README.md
CHANGED
@@ -1,17 +1,141 @@
 # html-hierarchy-extractor
 
 This gem lets you extract the hierarchy of headings and content from any HTML
-page into
+page into an array of elements.
 
-
-inside large HTML pages.
+Intended to be used with [Algolia][1] to improve relevance of search
+results inside large HTML pages. The records created are compatible with the
+[DocSearch][2] format.
 
-
-
+## Installation
+
+```ruby
+# Gemfile
+source 'http://rubygems.org'
+
+gem 'html-hierarchy-extractor', '~> 1.0'
+```
 
 ## How to use
 
 ```ruby
-
-
+require 'html-hierarchy-extractor'
+
+content = File.read('./index.html')
+page = HTMLHierarchyExtractor.new(content)
+records = page.extract
+puts records
+```
+
+## Records
+
+`extract` will return an array of records. Each record will represent a `<p>`
+paragraph of the initial text, along with its textual version (HTML removed),
+heading hierarchy, and other interesting bits.
+
+## Example
+
+Let's take the following HTML as input and see what records we get as output:
+
+```html
+<!doctype html>
+<html>
+<body>
+<h1 name="journey">The Hero's Journey</h1>
+<p>Most stories always follow the same pattern.</p>
+<h2 name="departure">Part One: Departure</h2>
+<p>A story starts in a mundane world, and helps identify the hero. It helps put all the achievements of the story into perspective.</p>
+<h3 name="calladventure">The call to Adventure</h3>
+<p>Some out-of-the-ordinary event pushes the hero to start his journey.</p>
+<h3 name="threshold">Crossing the Threshold</h3>
+<p>The hero quits his job, hits the road, or does whatever cuts him off from his previous life.</p>
+<h2 name="initiations">Part Two: Initiation</h2>
+<h3 name="trials">The Road of Trials</h3>
+<p>The road is filled with dangers. The hero has to find his inner strength to overcome them.</p>
+<h3 name="ultimate">The Ultimate Boon</h3>
+<p>The hero has found something, either physical or metaphorical that changes him.</p>
+<h2 name="return">Part Three: Return</h2>
+<h3 name="refusal">Refusal to Return</h3>
+<p>The hero does not want to go back to his previous life at first. But then, an event will make him change his mind.</p>
+<h3 name="master">Master of Two Worlds</h3>
+<p>Armed with his new power/weapon, the hero can go back to his initial world and fix all the issues he had there.</p>
+</body>
+</html>
+```
+
+Here is one of the records extracted:
+
+```ruby
+{
+  :uuid => "1f5923d5a60e998704f201bbe9964811",
+  :tag_name => "p",
+  :html => "<p>The hero quits his job, hits the road, or does whatever cuts him off from his previous life.</p>",
+  :text => "The hero quits his job, hits the road, or does whatever cuts him off from his previous life.",
+  :node => #<Nokogiri::XML::Element:0x11a5850 name="p">,
+  :anchor => "threshold",
+  :hierarchy => {
+    :lvl0 => "The Hero's Journey",
+    :lvl1 => "Part One: Departure",
+    :lvl2 => "Crossing the Threshold",
+    :lvl3 => nil,
+    :lvl4 => nil,
+    :lvl5 => nil,
+    :lvl6 => nil
+  },
+  :weight => {
+    :heading => 70,
+    :position => 3
+  }
+}
 ```
+
+Each record has a `uuid` that uniquely identifies it (computed by a hash of all
+the other values).
+
+It also contains the HTML tag name in `tag_name` (by default `<p>`
+paragraphs are extracted, but see the [settings][3] on how to change it).
+
+`html` contains the whole `outerContent` of the element, including the wrapping
+tags and inner children. The `text` attribute contains the textual content,
+with all HTML stripped out.
+
+`node` contains the [Nokogiri node][4] instance. The lib uses it internally to
+extract all the relevant information, but it is also exposed if you want to process
+the node further.
+
+The `anchor` attribute contains the HTML anchor closest to the element. Here it
+is `threshold` because this is the closest anchor in the hierarchy above.
+Anchors are searched in `name` and `id` attributes of headings.
+
+`hierarchy` then contains a snapshot of the current heading hierarchy of the
+paragraph. The `lvlX` syntax is used to be compatible with the records
+[DocSearch][5] is using.
+
+The `weight` attribute is used to provide an easy way to rank two records
+relative to each other.
+
+- `heading` gives the depth level in the hierarchy where the record is. Records
+  on top level will have a value of 100, those under an `h1` will have 90, and so
+  on. Because our record is under an `h3`, it has 70.
+- `position` is the position of the paragraph in the page. Here our paragraph is
+  the fourth paragraph of the page, so it will have a `position` of 3. It can
+  help you give more weight to the first items in the page.
+
+## Settings
+
+When instantiating `HTMLHierarchyExtractor`, you can pass a second `options`
+argument. It accepts one key, `css_selector`.
+
+```ruby
+page = HTMLHierarchyExtractor.new(content, { css_selector: 'p,li' })
+```
+
+This lets you change the default selector. Here, in addition to `<p>` paragraphs,
+the library will extract `<li>` list elements as well.
+
+
+[1]: https://www.algolia.com/
+[2]: https://community.algolia.com/docsearch/
+[3]: #Settings
+[4]: http://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/Node
+[5]: https://community.algolia.com/docsearch/
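To make the `hierarchy`, `weight.heading`, and `weight.position` fields described in the README concrete, here is a rough pure-Ruby sketch that walks a flat, document-ordered list of `[tag, text]` pairs and builds records of a similar shape. It is not the gem's implementation, which works on Nokogiri nodes. The `build_records` name, the input format, and the rule that the heading weight follows the deepest heading currently set are all assumptions made for illustration. The `uuid`, `html`, `node`, and `anchor` fields are left out.

```ruby
# Illustrative only: mimics the record shape from the README, not the gem's code.
HEADINGS = %w[h1 h2 h3 h4 h5 h6].freeze

# elements: array of [tag_name, text] pairs in document order.
# selector_tags: tags to turn into records (loosely mirrors `css_selector`).
def build_records(elements, selector_tags = ['p'])
  hierarchy = Array.new(7) # slots for lvl0..lvl6, as in the README example
  position = 0
  records = []

  elements.each do |tag, text|
    if (level = HEADINGS.index(tag))
      hierarchy[level] = text
      # A new heading clears every deeper level.
      ((level + 1)..6).each { |i| hierarchy[i] = nil }
    elsif selector_tags.include?(tag)
      # Assumption: 100 at top level, minus 10 per heading level above.
      deepest = hierarchy.rindex { |heading| !heading.nil? }
      heading_weight = deepest.nil? ? 100 : 100 - 10 * (deepest + 1)
      records << {
        tag_name: tag,
        text: text,
        hierarchy: hierarchy.each_with_index.to_h { |h, i| [:"lvl#{i}", h] },
        weight: { heading: heading_weight, position: position }
      }
      position += 1
    end
  end
  records
end
```

Passing `%w[p li]` as the second argument plays the role of the `'p,li'` selector in the Settings section. Records built this way can be ranked with `records.sort_by { |r| [-r[:weight][:heading], r[:weight][:position]] }`, which puts shallower and earlier content first.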
data/html-hierarchy-extractor.gemspec
CHANGED
@@ -2,11 +2,11 @@
 # DO NOT EDIT THIS FILE DIRECTLY
 # Instead, edit Jeweler::Tasks in Rakefile, and run 'rake gemspec'
 # -*- encoding: utf-8 -*-
-# stub: html-hierarchy-extractor 1.0.
+# stub: html-hierarchy-extractor 1.0.1 ruby lib
 
 Gem::Specification.new do |s|
   s.name = "html-hierarchy-extractor"
-  s.version = "1.0.
+  s.version = "1.0.1"
 
   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
   s.require_paths = ["lib"]
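The Gemfile line in the README uses the constraint `'~> 1.0'`, so a project following it should pick up the `1.0.1` version set in this gemspec without a Gemfile change. A quick way to check a constraint against a version is with RubyGems' own `Gem::Requirement` and `Gem::Version` classes, which ship with Ruby; the `satisfies?` helper below is a hypothetical wrapper added only for illustration.

```ruby
require 'rubygems'

# Hypothetical helper: true when `version` satisfies the `constraint` string.
def satisfies?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

# '~> 1.0' is the pessimistic constraint: at least 1.0 and below 2.0.
puts satisfies?('~> 1.0', '1.0.1') # true
puts satisfies?('~> 1.0', '2.0.0') # false
```

A tighter constraint such as `'~> 1.0.0'` would only allow versions below `1.1`.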
data/lib/version.rb
CHANGED
data/scripts/release
CHANGED