algolia_html_extractor 2.0.0 → 2.0.1
- checksums.yaml +4 -4
- data/CONTRIBUTING.md +19 -0
- data/LICENSE.txt +20 -0
- data/README.md +217 -0
- data/lib/algolia_html_extractor.rb +144 -0
- data/lib/version.rb +5 -0
- metadata +7 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: fb67fbcfbfb26740f9d97027a7f2258a52730792
|
4
|
+
data.tar.gz: 83bf786d6369805a8e737d1264ef8f4ad198dade
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 19d71cda82dae127c2a3603fdc975a443248eff8be2acb9724a86ef0c669a509937bcb5c63e859175cbb42e767409e5d30df8d7fbd3ecc6e3f5cc899c4457d57
|
7
|
+
data.tar.gz: abb6f3a34e2818049ad04901997700834a6e2fcd6e5b7c974fd0fd2e558741c957f71d2a649cdc4e17c4f18d7202427c7df2ed461c8457b2a4f8dc0429c27f11
|
data/CONTRIBUTING.md
ADDED
@@ -0,0 +1,19 @@
|
|
1
|
+
## Releasing
|
2
|
+
|
3
|
+
`rake build` will build
|
4
|
+
|
5
|
+
# Tagging and releasing
|
6
|
+
|
7
|
+
If you need to release a new version of the gem to RubyGems, you have to follow
|
8
|
+
these steps:
|
9
|
+
|
10
|
+
```
|
11
|
+
# Bump the version (in develop)
|
12
|
+
./scripts/bump_version minor
|
13
|
+
|
14
|
+
# Update master and release
|
15
|
+
./scripts/release
|
16
|
+
|
17
|
+
# Install the gem locally (optional)
|
18
|
+
rake install
|
19
|
+
```
|
data/LICENSE.txt
ADDED
@@ -0,0 +1,20 @@
|
|
1
|
+
Copyright (c) 2016 Pixelastic
|
2
|
+
|
3
|
+
Permission is hereby granted, free of charge, to any person obtaining
|
4
|
+
a copy of this software and associated documentation files (the
|
5
|
+
"Software"), to deal in the Software without restriction, including
|
6
|
+
without limitation the rights to use, copy, modify, merge, publish,
|
7
|
+
distribute, sublicense, and/or sell copies of the Software, and to
|
8
|
+
permit persons to whom the Software is furnished to do so, subject to
|
9
|
+
the following conditions:
|
10
|
+
|
11
|
+
The above copyright notice and this permission notice shall be
|
12
|
+
included in all copies or substantial portions of the Software.
|
13
|
+
|
14
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
|
15
|
+
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
|
16
|
+
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
|
17
|
+
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
|
18
|
+
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
|
19
|
+
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
|
20
|
+
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
data/README.md
ADDED
@@ -0,0 +1,217 @@
|
|
1
|
+
# algolia_html_extractor
|
2
|
+
|
3
|
+
This gem can convert HTML content into JSON records ready to be pushed to
|
4
|
+
Algolia.
|
5
|
+
|
6
|
+
Each HTML page will yield an array of records (one for each `<p>` by default).
|
7
|
+
Each record will contain its hierarchy in the page as well as other metadata
|
8
|
+
that can be used to configure relevance.
|
9
|
+
|
10
|
+
## Installation
|
11
|
+
|
12
|
+
```ruby
|
13
|
+
# Gemfile
|
14
|
+
source 'https://rubygems.org'
|
15
|
+
|
16
|
+
gem 'algolia_html_extractor', '~> 2.0'
|
17
|
+
```
|
18
|
+
|
19
|
+
## How to use
|
20
|
+
|
21
|
+
```ruby
|
22
|
+
require 'algolia_html_extractor'
|
23
|
+
|
24
|
+
content = File.read('./index.html')
|
25
|
+
page = AlgoliaHTMLExtractor.new(content)
|
26
|
+
records = page.extract
|
27
|
+
puts records
|
28
|
+
```
|
29
|
+
|
30
|
+
## Records
|
31
|
+
|
32
|
+
`extract` will return an array of records. Each record will represent a `<p>`
|
33
|
+
paragraph of the initial text, along with its textual version (HTML removed),
|
34
|
+
heading hierarchy, and other interesting bits.
|
35
|
+
|
36
|
+
## Example
|
37
|
+
|
38
|
+
Let's take the following HTML as input and see what records we get as output:
|
39
|
+
|
40
|
+
```html
|
41
|
+
<!doctype html>
|
42
|
+
<html>
|
43
|
+
<body>
|
44
|
+
<h1 name="journey">The Hero's Journey</h1>
|
45
|
+
<p>Most stories always follow the same pattern.</p>
|
46
|
+
<h2 name="departure">Part One: Departure</h2>
|
47
|
+
<p>A story starts in a mundane world, and helps identify the hero. It helps put all the achievements of the story into perspective.</p>
|
48
|
+
<h3 name="calladventure">The call to Adventure</h3>
|
49
|
+
<p>Some out-of-the-ordinary event pushes the hero to start his journey.</p>
|
50
|
+
<h3 name="threshold">Crossing the Threshold</h3>
|
51
|
+
<p>The hero quits his job, hits the road, or whatever cuts him from his previous life.</p>
|
52
|
+
<h2 name="initiations">Part Two: Initiation</h2>
|
53
|
+
<h3 name="trials">The Road of Trials</h3>
|
54
|
+
<p>The road is filled with dangers. The hero has to find his inner strength to overcome them.</p>
|
55
|
+
<h3 name="ultimate">The Ultimate Boon</h3>
|
56
|
+
<p>The hero has found something, either physical or metaphorical that changes him.</p>
|
57
|
+
<h2 name="return">Part Three: Return</h2>
|
58
|
+
<h3 name="refusal">Refusal to Return</h3>
|
59
|
+
<p>The hero does not want to go back to his previous life at first. But then, an event will make him change his mind.</p>
|
60
|
+
<h3 name="master">Master of Two Worlds</h3>
|
61
|
+
<p>Armed with his new power/weapon, the hero can go back to his initial world and fix all the issues he had there.</p>
|
62
|
+
</body>
|
63
|
+
</html>
|
64
|
+
```
|
65
|
+
|
66
|
+
Here is one of the records extracted:
|
67
|
+
|
68
|
+
```ruby
|
69
|
+
{
|
70
|
+
:uuid => "1f5923d5a60e998704f201bbe9964811",
|
71
|
+
:tag_name => "p",
|
72
|
+
:html => "<p>The hero quits his job, hits the road, or whatever cuts him from his previous life.</p>",
|
73
|
+
:text => "The hero quits his job, hits the road, or whatever cuts him from his previous life.",
|
74
|
+
:node => #<Nokogiri::XML::Element:0x11a5850 name="p">,
|
75
|
+
:anchor => 'threshold',
|
76
|
+
:hierarchy => {
|
77
|
+
:lvl0 => "The Hero's Journey",
|
78
|
+
:lvl1 => "Part One: Departure",
|
79
|
+
:lvl2 => "Crossing the Threshold",
|
80
|
+
:lvl3 => nil,
|
81
|
+
:lvl4 => nil,
|
82
|
+
:lvl5 => nil,
|
83
|
+
:lvl6 => nil
|
84
|
+
},
|
85
|
+
:weight => {
|
86
|
+
:heading => 70,
|
87
|
+
:position => 3
|
88
|
+
}
|
89
|
+
}
|
90
|
+
```
|
91
|
+
|
92
|
+
Each record has a `uuid` that uniquely identifies it (computed by a hash of all
|
93
|
+
the other values).
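
As a rough illustration of how such an identifier can be derived, the sketch below follows the same idea as the `uuid` method in `lib/algolia_html_extractor.rb`: keys are sorted, nested hashes are reduced to their own digest, and the joined `key=value` pairs are hashed with MD5. The `record_uuid` name and the sample records are made up for this example.

```ruby
require 'digest/md5'

# Illustrative helper (hypothetical name), modelled on the gem's `uuid` method
def record_uuid(item)
  pairs = item.keys.sort.map do |key|
    value = item[key]
    # Nested hashes (hierarchy, weight) are reduced to their own digest first
    value = record_uuid(value) if value.is_a?(Hash)
    "#{key}=#{value}"
  end
  Digest::MD5.hexdigest(pairs.join(','))
end

# Two made-up records with the same content but a different key order
sample = {
  text: 'The hero quits his job.',
  tag_name: 'p',
  weight: { heading: 70, position: 3 }
}
reordered = {
  weight: { position: 3, heading: 70 },
  tag_name: 'p',
  text: 'The hero quits his job.'
}

puts record_uuid(sample)
# Key order does not matter because keys are sorted before hashing
puts record_uuid(sample) == record_uuid(reordered)
```

Because every value takes part in the hash, changing any field of a record (its text, its position, its hierarchy) yields a different identifier.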
|
94
|
+
|
95
|
+
It also contains the HTML tag name in `tag_name` (by default `<p>`
|
96
|
+
paragraphs are extracted, but see the [settings][3] on how to change it).
|
97
|
+
|
98
|
+
`html` contains the whole outer HTML of the element, including the wrapping
|
99
|
+
tags and inner children. The `text` attribute contains the textual content,
|
100
|
+
stripping out all HTML.
|
101
|
+
|
102
|
+
`node` contains the [Nokogiri node][4] instance. The lib uses it internally to
|
103
|
+
extract all the relevant information, but it is also exposed if you want to process
|
104
|
+
the node further.
|
105
|
+
|
106
|
+
The `anchor` attribute contains the HTML anchor closest to the element. Here it
|
107
|
+
is `threshold` because this is the closest anchor in the hierarchy above.
|
108
|
+
Anchors are searched in `name` and `id` attributes of headings.
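
The lookup order can be sketched without any HTML parser. In the example below, nodes are modelled as plain hashes (an assumption made for this illustration only; the gem itself works on Nokogiri nodes), and `find_anchor` is a hypothetical name. It checks `name` first, then `id`, then falls back to the first descendant carrying either attribute.

```ruby
# Nodes are modelled as plain hashes here; the gem itself works on Nokogiri nodes.
# Each node looks like: { attrs: { 'id' => '...' }, children: [...] }
def find_anchor(node)
  # `name` takes precedence over `id` on the same element
  anchor = node[:attrs]['name'] || node[:attrs]['id']
  return anchor if anchor

  # No anchor on the element itself: search descendants in document order
  node[:children].each do |child|
    found = find_anchor(child)
    return found if found
  end

  nil
end

direct = { attrs: { 'name' => 'journey' }, children: [] }
nested = { attrs: {}, children: [{ attrs: { 'id' => 'departure' }, children: [] }] }
bare   = { attrs: {}, children: [] }

puts find_anchor(direct).inspect
puts find_anchor(nested).inspect
puts find_anchor(bare).inspect
```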
|
109
|
+
|
110
|
+
`hierarchy` then contains a snapshot of the current heading hierarchy of the
|
111
|
+
paragraph. The `lvlX` syntax is used to be compatible with the records
|
112
|
+
[DocSearch][5] is using.
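
To make the snapshot behaviour concrete, here is a minimal sketch of how the hierarchy evolves as headings are met. `update_hierarchy` is a hypothetical helper written for this example; levels are zero-based (`h1` is 0), and unlike the loop in the gem, which also resets `lvl6`, this sketch stops at `lvl5`.

```ruby
# Hypothetical helper: returns a new hierarchy after meeting a heading.
# `level` is zero-based (h1 => 0 ... h6 => 5).
def update_hierarchy(hierarchy, level, title)
  updated = hierarchy.dup
  updated["lvl#{level}".to_sym] = title
  # Anything deeper than the new heading no longer applies
  ((level + 1)..5).each { |lvl| updated["lvl#{lvl}".to_sym] = nil }
  updated
end

hierarchy = { lvl0: nil, lvl1: nil, lvl2: nil, lvl3: nil, lvl4: nil, lvl5: nil }
hierarchy = update_hierarchy(hierarchy, 0, "The Hero's Journey")
hierarchy = update_hierarchy(hierarchy, 1, 'Part One: Departure')
hierarchy = update_hierarchy(hierarchy, 2, 'Crossing the Threshold')
# A new h2 replaces lvl1 and clears the h3 stored in lvl2
hierarchy = update_hierarchy(hierarchy, 1, 'Part Two: Initiation')
puts hierarchy.inspect
```

Each extracted record stores a copy of this hash, so later headings do not alter records that were already emitted.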
|
113
|
+
|
114
|
+
The `weight` attribute is used to provide an easy way to rank two records
|
115
|
+
relative to each other.
|
116
|
+
|
117
|
+
- `heading` gives the depth level in the hierarchy where the record is. Records
|
118
|
+
at the top level will have a value of 100, those under a `h1` will have 90, and so
|
119
|
+
on. Because our record is under a `h3`, it has 70.
|
120
|
+
- `position` is the position of the paragraph in the page. Here our paragraph is
|
121
|
+
the fourth paragraph of the page, so it will have a zero-based `position` of 3. It can
|
122
|
+
help you give more weight to the first items in the page.
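
The two values can be combined to order records. The sketch below reproduces the arithmetic of the gem's `heading_weight` method and then sorts a few made-up records, highest heading weight first and earliest position second. The sample records and the sort order are assumptions for illustration, not something the gem does for you.

```ruby
# Same arithmetic as the gem's `heading_weight` method:
# nil (no heading met yet) gives 100, then 10 less per heading depth.
def heading_weight(heading_level)
  return 100 if heading_level.nil?

  100 - ((heading_level + 1) * 10)
end

# Heading levels are zero-based: h1 => 0, h2 => 1, h3 => 2, ...
puts heading_weight(nil) # 100
puts heading_weight(0)   # 90
puts heading_weight(2)   # 70

# Hypothetical records, reduced to the fields needed for ranking
records = [
  { text: 'deep',  weight: { heading: 70, position: 3 } },
  { text: 'intro', weight: { heading: 90, position: 0 } },
  { text: 'early', weight: { heading: 70, position: 2 } }
]

# Highest heading weight first, then earliest position
ranked = records.sort_by { |r| [-r[:weight][:heading], r[:weight][:position]] }
puts ranked.map { |r| r[:text] }.inspect # ["intro", "early", "deep"]
```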
|
123
|
+
|
124
|
+
## Settings
|
125
|
+
|
126
|
+
When instantiating `AlgoliaHTMLExtractor`, you can pass a secondary `options`
|
127
|
+
argument. This argument accepts one key, `css_selector`.
|
128
|
+
|
129
|
+
```ruby
|
130
|
+
page = AlgoliaHTMLExtractor.new(content, options: { css_selector: 'p,li' })
|
131
|
+
```
|
132
|
+
|
133
|
+
This lets you change the default selector. Here, in addition to `<p>` paragraphs,
|
134
|
+
the library will extract `<li>` list elements as well.
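
For readers curious about how the option ends up being used, the following sketch mimics the default merging done in `initialize` and the combined selector built in `extract`. The helper names (`default_options`, `resolve_options`, `all_selector`) are invented for this example and are not part of the gem's public API.

```ruby
# Stand-ins for the option handling in the gem (names are illustrative)
def default_options
  { css_selector: 'p' }
end

# User-supplied options override the defaults
def resolve_options(options = {})
  default_options.merge(options)
end

# Headings are always selected too, so the hierarchy can be tracked
def all_selector(options)
  heading_selector = 'h1,h2,h3,h4,h5,h6'
  "#{heading_selector},#{options[:css_selector]}"
end

puts all_selector(resolve_options)
puts all_selector(resolve_options(css_selector: 'p,li'))
```

Since the selector is passed straight to the CSS query, any selector the HTML parser understands should work in principle, though only simple tag lists are shown in this README.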
|
135
|
+
|
136
|
+
# CONTRIBUTING
|
137
|
+
|
138
|
+
I'm happy you'd like to contribute. All contributions are welcome, ranging from
|
139
|
+
feature requests to pull requests, but also including typo fixing, documentation
|
140
|
+
and generic bug reports.
|
141
|
+
|
142
|
+
## Bug Reports and feature requests
|
143
|
+
|
144
|
+
For any bug or idea for a new feature, please start by checking the
|
145
|
+
[issues](https://github.com/pixelastic/html-hierarchy-extractor/issues) tab if
|
146
|
+
it hasn't been discussed already. If not, feel free to open a new issue.
|
147
|
+
|
148
|
+
## Pull Requests
|
149
|
+
|
150
|
+
All PRs are welcome, from small typo fixes to large codebase changes. If you
|
151
|
+
think you'll need to change a lot of code in a lot of files, I suggest you
|
152
|
+
open an issue first so we can discuss before you start working on something.
|
153
|
+
|
154
|
+
All PRs should be based on the `develop` branch (`master` only ever contains the
|
155
|
+
last released change).
|
156
|
+
|
157
|
+
## Git Hooks
|
158
|
+
|
159
|
+
If you start working on the actual code, you should install the git hooks.
|
160
|
+
|
161
|
+
```
|
162
|
+
cp ./scripts/git_hooks/* ./.git/hooks
|
163
|
+
```
|
164
|
+
|
165
|
+
This will add `pre-commit` and `pre-push` scripts that will respectively check
|
166
|
+
that all files are lint-free before committing, and pass all tests before
|
167
|
+
pushing. If either of those two hooks gives you errors, you should fix the code
|
168
|
+
before committing or pushing.
|
169
|
+
|
170
|
+
These steps help keep the codebase as clean as possible, and
|
171
|
+
avoid polluting PR discussions with style comments.
|
172
|
+
|
173
|
+
## Development
|
174
|
+
|
175
|
+
The first thing you should do to get all your dependencies up to date is run `bundle
|
176
|
+
install` before running any other command.
|
177
|
+
|
178
|
+
## Lint
|
179
|
+
|
180
|
+
`rake lint` will check all the files for potential linting issues. It uses
|
181
|
+
Rubocop, and the configuration can be found in `.rubocop.yml`.
|
182
|
+
|
183
|
+
## Test
|
184
|
+
|
185
|
+
`rake test` will run all the tests.
|
186
|
+
|
187
|
+
`rake coverage` will do the same, but will also add the code coverage files to
|
188
|
+
`./coverage`. This should be useful in a CI environment.
|
189
|
+
|
190
|
+
`rake watch` will run Guard, which will do a live run of all your tests. Every
|
191
|
+
update to a file (code or test) will re-run all the bound tests. This is highly
|
192
|
+
recommended for TDD.
|
193
|
+
|
194
|
+
## Using a local version of the gem
|
195
|
+
|
196
|
+
If you want to test a local version of the gem in your local project, I suggest
|
197
|
+
updating your project `Gemfile` to point to the correct local directory
|
198
|
+
|
199
|
+
```ruby
|
200
|
+
gem "algolia_html_extractor", :path => "/path/to/local/gem/folder"
|
201
|
+
```
|
202
|
+
|
203
|
+
You should also run `rake gemspec` from the `html-hierarchy-extractor`
|
204
|
+
repository the first time and if you added/deleted any file or dependency.
|
205
|
+
|
206
|
+
## History
|
207
|
+
|
208
|
+
This gem was previously named `html-hierarchy-extractor` but has been renamed to
|
209
|
+
`algolia_html_extractor` to both make its intent clearer and follow gem naming
|
210
|
+
convention. That's also why this gem directly starts at v2.0.
|
211
|
+
|
212
|
+
|
213
|
+
[1]: https://www.algolia.com/
|
214
|
+
[2]: https://community.algolia.com/docsearch/
|
215
|
+
[3]: #settings
|
216
|
+
[4]: http://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/Node
|
217
|
+
[5]: https://community.algolia.com/docsearch/
|
data/lib/algolia_html_extractor.rb
ADDED
@@ -0,0 +1,144 @@
|
|
1
|
+
require 'nokogiri'
|
2
|
+
require 'digest/md5'
|
3
|
+
|
4
|
+
# Extract content from an HTML page in the form of items with associated
|
5
|
+
# hierarchy data
|
6
|
+
class AlgoliaHTMLExtractor
|
7
|
+
def initialize(input, options: {})
|
8
|
+
@dom = Nokogiri::HTML(input)
|
9
|
+
default_options = {
|
10
|
+
css_selector: 'p'
|
11
|
+
}
|
12
|
+
@options = default_options.merge(options)
|
13
|
+
end
|
14
|
+
|
15
|
+
# Returns the outer HTML of a given node
|
16
|
+
#
|
17
|
+
# eg.
|
18
|
+
# <p>foo</p> => <p>foo</p>
|
19
|
+
def extract_html(node)
|
20
|
+
node.to_s.strip
|
21
|
+
end
|
22
|
+
|
23
|
+
# Returns the text content of a given node (all HTML tags stripped)
|
24
|
+
#
|
25
|
+
# eg.
|
26
|
+
# <p>foo</p> => foo
|
27
|
+
def extract_text(node)
|
28
|
+
node.content
|
29
|
+
end
|
30
|
+
|
31
|
+
# Returns the tag name of a given node
|
32
|
+
#
|
33
|
+
# eg
|
34
|
+
# <p>foo</p> => p
|
35
|
+
def extract_tag_name(node)
|
36
|
+
node.name.downcase
|
37
|
+
end
|
38
|
+
|
39
|
+
# Returns the anchor to the node
|
40
|
+
#
|
41
|
+
# eg.
|
42
|
+
# <h1 name="anchor">Foo</h1> => anchor
|
43
|
+
# <h1 id="anchor">Foo</h1> => anchor
|
44
|
+
# <h1><a name="anchor">Foo</a></h1> => anchor
|
45
|
+
def extract_anchor(node)
|
46
|
+
anchor = node.attr('name') || node.attr('id') || nil
|
47
|
+
return anchor unless anchor.nil?
|
48
|
+
|
49
|
+
# No anchor found directly in the header, search on children
|
50
|
+
subelement = node.css('[name],[id]')
|
51
|
+
return extract_anchor(subelement[0]) unless subelement.empty?
|
52
|
+
|
53
|
+
nil
|
54
|
+
end
|
55
|
+
|
56
|
+
##
|
57
|
+
# Generate a unique identifier for the item
|
58
|
+
def uuid(item)
|
59
|
+
# We first get all the keys of the object, sorted alphabetically...
|
60
|
+
ordered_keys = item.keys.sort
|
61
|
+
|
62
|
+
# ...then we build a huge array of "key=value" pairs...
|
63
|
+
ordered_array = ordered_keys.map do |key|
|
64
|
+
value = item[key]
|
65
|
+
# We apply the method recursively on other hashes
|
66
|
+
value = uuid(value) if value.is_a?(Hash)
|
67
|
+
"#{key}=#{value}"
|
68
|
+
end
|
69
|
+
|
70
|
+
# ...then we build a unique md5 hash of it
|
71
|
+
Digest::MD5.hexdigest(ordered_array.join(','))
|
72
|
+
end
|
73
|
+
|
74
|
+
##
|
75
|
+
# Get a relative numeric value of the importance of the heading
|
76
|
+
# 100 for top level, then -10 per heading
|
77
|
+
def heading_weight(heading_level)
|
78
|
+
weight = 100
|
79
|
+
return weight if heading_level.nil?
|
80
|
+
weight - ((heading_level + 1) * 10)
|
81
|
+
end
|
82
|
+
|
83
|
+
def extract
|
84
|
+
heading_selector = 'h1,h2,h3,h4,h5,h6'
|
85
|
+
# We select all nodes that match either the headings or the elements to
|
86
|
+
# extract. This will allow us to loop over them in the order they appear in the DOM
|
87
|
+
all_selector = "#{heading_selector},#{@options[:css_selector]}"
|
88
|
+
|
89
|
+
items = []
|
90
|
+
current_hierarchy = {
|
91
|
+
lvl0: nil,
|
92
|
+
lvl1: nil,
|
93
|
+
lvl2: nil,
|
94
|
+
lvl3: nil,
|
95
|
+
lvl4: nil,
|
96
|
+
lvl5: nil
|
97
|
+
}
|
98
|
+
current_position = 0 # Position of the DOM node in the tree
|
99
|
+
current_lvl = nil # Current closest hierarchy level
|
100
|
+
current_anchor = nil # Current closest anchor
|
101
|
+
|
102
|
+
@dom.css(all_selector).each do |node|
|
103
|
+
# If it's a heading, we update our current hierarchy
|
104
|
+
if node.matches?(heading_selector)
|
105
|
+
# Which level heading is it?
|
106
|
+
current_lvl = extract_tag_name(node).gsub(/^h/, '').to_i - 1
|
107
|
+
# Update this level, and set all the following ones to nil
|
108
|
+
current_hierarchy["lvl#{current_lvl}".to_sym] = extract_text(node)
|
109
|
+
(current_lvl + 1..6).each do |lvl|
|
110
|
+
current_hierarchy["lvl#{lvl}".to_sym] = nil
|
111
|
+
end
|
112
|
+
# Update the anchor, if the new heading has one
|
113
|
+
new_anchor = extract_anchor(node)
|
114
|
+
current_anchor = new_anchor if new_anchor
|
115
|
+
end
|
116
|
+
|
117
|
+
# Skip the node if it is not to be extracted
|
118
|
+
next unless node.matches?(@options[:css_selector])
|
119
|
+
|
120
|
+
# Skip the node if it is empty
|
121
|
+
text = extract_text(node)
|
122
|
+
next if text.empty?
|
123
|
+
|
124
|
+
item = {
|
125
|
+
html: extract_html(node),
|
126
|
+
text: text,
|
127
|
+
tag_name: extract_tag_name(node),
|
128
|
+
hierarchy: current_hierarchy.clone,
|
129
|
+
anchor: current_anchor,
|
130
|
+
node: node,
|
131
|
+
weight: {
|
132
|
+
position: current_position,
|
133
|
+
heading: heading_weight(current_lvl)
|
134
|
+
}
|
135
|
+
}
|
136
|
+
item[:uuid] = uuid(item)
|
137
|
+
items << item
|
138
|
+
|
139
|
+
current_position += 1
|
140
|
+
end
|
141
|
+
|
142
|
+
items
|
143
|
+
end
|
144
|
+
end
|
data/lib/version.rb
ADDED
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: algolia_html_extractor
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 2.0.
|
4
|
+
version: 2.0.1
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Tim Carry
|
@@ -198,7 +198,12 @@ email: tim@pixelastic.com
|
|
198
198
|
executables: []
|
199
199
|
extensions: []
|
200
200
|
extra_rdoc_files: []
|
201
|
-
files:
|
201
|
+
files:
|
202
|
+
- CONTRIBUTING.md
|
203
|
+
- LICENSE.txt
|
204
|
+
- README.md
|
205
|
+
- lib/algolia_html_extractor.rb
|
206
|
+
- lib/version.rb
|
202
207
|
homepage: https://github.com/algolia/html-extractor
|
203
208
|
licenses:
|
204
209
|
- MIT
|