html-hierarchy-extractor 1.0.0 → 1.0.1
- checksums.yaml +4 -4
- data/README.md +131 -7
- data/html-hierarchy-extractor.gemspec +2 -2
- data/lib/version.rb +1 -1
- data/scripts/release +0 -3
- metadata +1 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: d78e25992e261f127a8eced889d1e69299ca1e25
|
4
|
+
data.tar.gz: 17f14590068badca5030a219be8b24b3066bebf2
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 4aa0dce7497b5523450646a46ed02239311170f413fb1ba464e70a316a9533138032585ce70f527819ce20c238a9249d70e1ff6c8fe23b044b20687a43ed3831
|
7
|
+
data.tar.gz: 7dc1676005759231f4934e5302aeb1ccf5dc7074e1d4d7201cf2de100bb731845cb43a78bb26574e4bc97429a7622a639641e065efef5f86728deed413459719
|
data/README.md
CHANGED
@@ -1,17 +1,141 @@
|
|
1
1
|
# html-hierarchy-extractor
|
2
2
|
|
3
3
|
This gem lets you extract the hierarchy of headings and content from any HTML
|
4
|
-
page into
|
4
|
+
page into an array of elements.
|
5
5
|
|
6
|
-
|
7
|
-
inside large HTML pages.
|
6
|
+
Intended to be used with [Algolia][1] to improve relevance of search
|
7
|
+
results inside large HTML pages. The records created are compatible with the
|
8
|
+
[DocSearch][2] format.
|
8
9
|
|
9
|
-
|
10
|
-
|
10
|
+
## Installation
|
11
|
+
|
12
|
+
```ruby
|
13
|
+
# Gemfile
|
14
|
+
source 'https://rubygems.org'
|
15
|
+
|
16
|
+
gem 'html-hierarchy-extractor', '~> 1.0'
|
17
|
+
```
|
11
18
|
|
12
19
|
## How to use
|
13
20
|
|
14
21
|
```ruby
|
15
|
-
|
16
|
-
|
22
|
+
require 'html-hierarchy-extractor'
|
23
|
+
|
24
|
+
content = File.read('./index.html')
|
25
|
+
page = HTMLHierarchyExtractor.new(content)
|
26
|
+
records = page.extract
|
27
|
+
puts records
|
28
|
+
```
|
29
|
+
|
30
|
+
## Records
|
31
|
+
|
32
|
+
`extract` will return an array of records. Each record will represent a `<p>`
|
33
|
+
paragraph of the initial text, along with its textual version (HTML removed),
|
34
|
+
heading hierarchy, and other interesting bits.
|
35
|
+
|
36
|
+
## Example
|
37
|
+
|
38
|
+
Let's take the following HTML as input and see which records we get as output:
|
39
|
+
|
40
|
+
```html
|
41
|
+
<!doctype html>
|
42
|
+
<html>
|
43
|
+
<body>
|
44
|
+
<h1 name="journey">The Hero's Journey</h1>
|
45
|
+
<p>Most stories always follow the same pattern.</p>
|
46
|
+
<h2 name="departure">Part One: Departure</h2>
|
47
|
+
<p>A story starts in a mundane world, and helps identify the hero. It helps put all the achievements of the story into perspective.</p>
|
48
|
+
<h3 name="calladventure">The call to Adventure</h3>
|
49
|
+
<p>Some out-of-the-ordinary event pushes the hero to start his journey.</p>
|
50
|
+
<h3 name="threshold">Crossing the Threshold</h3>
|
51
|
+
<p>The hero quits his job, hit the road, or whatever cuts him from his previous life.</p>
|
52
|
+
<h2 name="initiations">Part Two: Initiation</h2>
|
53
|
+
<h3 name="trials">The Road of Trials</h3>
|
54
|
+
<p>The road is filled with dangers. The hero has to find his inner strength to overcome them.</p>
|
55
|
+
<h3 name="ultimate">The Ultimate Boon</h3>
|
56
|
+
<p>The hero has found something, either physical or metaphorical that changes him.</p>
|
57
|
+
<h2 name="return">Part Three: Return</h2>
|
58
|
+
<h3 name="refusal">Refusal to Return</h3>
|
59
|
+
<p>The hero does not want to go back to his previous life at first. But then, an event will make him change his mind.</p>
|
60
|
+
<h3 name="master">Master of Two Worlds</h3>
|
61
|
+
<p>Armed with his new power/weapon, the hero can go back to his initial world and fix all the issues he had there.</p>
|
62
|
+
</body>
|
63
|
+
</html>
|
64
|
+
```
|
65
|
+
|
66
|
+
Here is one of the records extracted:
|
67
|
+
|
68
|
+
```ruby
|
69
|
+
{
|
70
|
+
:uuid => "1f5923d5a60e998704f201bbe9964811",
|
71
|
+
:tag_name => "p",
|
72
|
+
:html => "<p>The hero quits his job, hit the road, or whatever cuts him from his previous life.</p>",
|
73
|
+
:text => "The hero quits his job, hit the road, or whatever cuts him from his previous life.",
|
74
|
+
:node => #<Nokogiri::XML::Element:0x11a5850 name="p">,
|
75
|
+
:anchor => "threshold",
|
76
|
+
:hierarchy => {
|
77
|
+
:lvl0 => "The Hero's Journey",
|
78
|
+
:lvl1 => "Part One: Departure",
|
79
|
+
:lvl2 => "Crossing the Threshold",
|
80
|
+
:lvl3 => nil,
|
81
|
+
:lvl4 => nil,
|
82
|
+
:lvl5 => nil,
|
83
|
+
:lvl6 => nil
|
84
|
+
},
|
85
|
+
:weight => {
|
86
|
+
:heading => 70,
|
87
|
+
:position => 3
|
88
|
+
}
|
89
|
+
}
|
17
90
|
```
|
91
|
+
|
92
|
+
Each record has a `uuid` that uniquely identifies it (computed by a hash of all
|
93
|
+
the other values).
|
94
|
+
|
95
|
+
It also contains the HTML tag name in `tag_name` (by default `<p>`
|
96
|
+
paragraphs are extracted, but see the [settings][3] on how to change it).
|
97
|
+
|
98
|
+
`html` contains the whole outer HTML of the element, including the wrapping
|
99
|
+
tags and inner children. The `text` attribute contains the textual content,
|
100
|
+
stripping out all HTML.
|
101
|
+
|
102
|
+
`node` contains the [Nokogiri node][4] instance. The lib uses it internally to
|
103
|
+
extract all the relevant information, but it is also exposed if you want to process
|
104
|
+
the node further.
|
105
|
+
|
106
|
+
The `anchor` attribute contains the HTML anchor closest to the element. Here it
|
107
|
+
is `threshold` because this is the closest anchor in the hierarchy above.
|
108
|
+
Anchors are searched in `name` and `id` attributes of headings.
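
As a minimal sketch of how the `anchor` value could be used, the snippet below builds a deep link to the matching section. The record is hand-written to mirror the one described above, and both `page_url` and the `record_url` helper are hypothetical names introduced for illustration; they are not part of the gem.

```ruby
# Hand-written sample record, shaped like the one described above.
record = {
  text: 'The hero quits his job, hit the road, or whatever cuts him from his previous life.',
  anchor: 'threshold'
}

# Hypothetical URL where the HTML page is served.
page_url = 'https://example.com/heros-journey.html'

# Hypothetical helper: appends the anchor as a URL fragment when one exists,
# and falls back to the bare page URL otherwise.
def record_url(page_url, record)
  anchor = record[:anchor]
  return page_url if anchor.nil? || anchor.empty?

  "#{page_url}##{anchor}"
end

puts record_url(page_url, record)
```

Records without any heading anchor above them simply link to the page itself.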
|
109
|
+
|
110
|
+
`hierarchy` then contains a snapshot of the current heading hierarchy of the
|
111
|
+
paragraph. The `lvlX` syntax is used to be compatible with the records
|
112
|
+
[DocSearch][5] is using.
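
One possible use of the `hierarchy` hash is to render a breadcrumb for each search result. The sketch below assumes the hash shape shown in the example record; the `breadcrumb` helper and its separator are hypothetical and not provided by the gem.

```ruby
# Hierarchy copied from the example record above.
hierarchy = {
  lvl0: "The Hero's Journey",
  lvl1: 'Part One: Departure',
  lvl2: 'Crossing the Threshold',
  lvl3: nil,
  lvl4: nil,
  lvl5: nil,
  lvl6: nil
}

# Hypothetical helper: keeps the levels that are set and joins them.
# Ruby hashes preserve insertion order, so lvl0 comes first.
def breadcrumb(hierarchy, separator = ' > ')
  hierarchy.values.compact.join(separator)
end

puts breadcrumb(hierarchy)
```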
|
113
|
+
|
114
|
+
The `weight` attribute is used to provide an easy way to rank two records
|
115
|
+
relative to each other.
|
116
|
+
|
117
|
+
- `heading` gives the depth level in the hierarchy where the record is. Records
|
118
|
+
at the top level will have a value of 100, those under an `h1` will have 90, and so
|
119
|
+
on. Because our record is under an `h3`, it has 70.
|
120
|
+
- `position` is the position of the paragraph in the page. Here our paragraph is
|
121
|
+
the fourth paragraph of the page, so it will have a `position` of 3. It can
|
122
|
+
help you give more weight to the first items in the page.
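
To illustrate how the two `weight` values could be combined, here is a small sketch that ranks records by `heading` (higher first) and breaks ties with `position` (earlier first). The records are hand-written, with weights derived from the rules described above for the first four paragraphs of the example page. The ordering itself is only one reasonable choice, not something the gem prescribes.

```ruby
# Hand-written sample records, deliberately out of order.
records = [
  { text: 'The hero quits his job, hit the road, or whatever cuts him from his previous life.',
    weight: { heading: 70, position: 3 } },
  { text: 'Most stories always follow the same pattern.',
    weight: { heading: 90, position: 0 } },
  { text: 'Some out-of-the-ordinary event pushes the hero to start his journey.',
    weight: { heading: 70, position: 2 } },
  { text: 'A story starts in a mundane world, and helps identify the hero.',
    weight: { heading: 80, position: 1 } }
]

# Sort by descending heading weight, then by ascending position.
ranked = records.sort_by do |record|
  [-record[:weight][:heading], record[:weight][:position]]
end

ranked.each do |record|
  weight = record[:weight]
  puts "#{weight[:heading]}/#{weight[:position]}: #{record[:text]}"
end
```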
|
123
|
+
|
124
|
+
## Settings
|
125
|
+
|
126
|
+
When instantiating `HTMLHierarchyExtractor`, you can pass a second `options`
|
127
|
+
argument. This hash accepts one key, `css_selector`.
|
128
|
+
|
129
|
+
```ruby
|
130
|
+
page = HTMLHierarchyExtractor.new(content, { css_selector: 'p,li' })
|
131
|
+
```
|
132
|
+
|
133
|
+
This lets you change the default selector. Here, in addition to `<p>` paragraphs,
|
134
|
+
the library will extract `<li>` list elements as well.
|
135
|
+
|
136
|
+
|
137
|
+
[1]: https://www.algolia.com/
|
138
|
+
[2]: https://community.algolia.com/docsearch/
|
139
|
+
[3]: #settings
|
140
|
+
[4]: http://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/Node
|
141
|
+
[5]: https://community.algolia.com/docsearch/
|
@@ -2,11 +2,11 @@
|
|
2
2
|
# DO NOT EDIT THIS FILE DIRECTLY
|
3
3
|
# Instead, edit Jeweler::Tasks in Rakefile, and run 'rake gemspec'
|
4
4
|
# -*- encoding: utf-8 -*-
|
5
|
-
# stub: html-hierarchy-extractor 1.0.
|
5
|
+
# stub: html-hierarchy-extractor 1.0.1 ruby lib
|
6
6
|
|
7
7
|
Gem::Specification.new do |s|
|
8
8
|
s.name = "html-hierarchy-extractor"
|
9
|
-
s.version = "1.0.
|
9
|
+
s.version = "1.0.1"
|
10
10
|
|
11
11
|
s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
|
12
12
|
s.require_paths = ["lib"]
|
data/lib/version.rb
CHANGED
data/scripts/release
CHANGED