algolia_html_extractor 2.0.0 → 2.0.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: c4ba20d4606345086dc7e931cf49ba1ae0e8336a
4
- data.tar.gz: be54441ad8f6d880aa34b654187204e26083533f
3
+ metadata.gz: fb67fbcfbfb26740f9d97027a7f2258a52730792
4
+ data.tar.gz: 83bf786d6369805a8e737d1264ef8f4ad198dade
5
5
  SHA512:
6
- metadata.gz: 1c7194623efe86e9c5e963de77df37e5ff6281d30b6deacaa56cd40dd842098ff6f43ac22e9c40323490c35f0b05e818a6d9df3a2f947ea85249765bafd7ab20
7
- data.tar.gz: c1fc426c29fe7566506c8fc8af03e4a269f4dc8ae9eec2e7e9f91a6606648ac9c68c632e78037bcd57dfb8ddf010340844e2fc20e1ec685f1421ba8c3af28ea2
6
+ metadata.gz: 19d71cda82dae127c2a3603fdc975a443248eff8be2acb9724a86ef0c669a509937bcb5c63e859175cbb42e767409e5d30df8d7fbd3ecc6e3f5cc899c4457d57
7
+ data.tar.gz: abb6f3a34e2818049ad04901997700834a6e2fcd6e5b7c974fd0fd2e558741c957f71d2a649cdc4e17c4f18d7202427c7df2ed461c8457b2a4f8dc0429c27f11
@@ -0,0 +1,19 @@
1
+ ## Releasing
2
+
3
+ `rake build` will build the gem.
4
+
5
+ # Tagging and releasing
6
+
7
+ If you need to release a new version of the gem to RubyGems, you have to follow
8
+ these steps:
9
+
10
+ ```
11
+ # Bump the version (in develop)
12
+ ./scripts/bump_version minor
13
+
14
+ # Update master and release
15
+ ./scripts/release
16
+
17
+ # Install the gem locally (optional)
18
+ rake install
19
+ ```
@@ -0,0 +1,20 @@
1
+ Copyright (c) 2016 Pixelastic
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,217 @@
1
+ # algolia_html_extractor
2
+
3
+ This gem can convert HTML content into JSON records ready to be pushed to
4
+ Algolia.
5
+
6
+ Each HTML page will yield an array of records (one for each `<p>` by default).
7
+ Each record will contain its hierarchy in the page as well as other metadata
8
+ that can be used to configure relevance.
9
+
10
+ ## Installation
11
+
12
+ ```ruby
13
+ # Gemfile
14
+ source 'https://rubygems.org'
15
+
16
+ gem 'algolia_html_extractor', '~> 2.0'
17
+ ```
18
+
19
+ ## How to use
20
+
21
+ ```ruby
22
+ require 'algolia_html_extractor'
23
+
24
+ content = File.read('./index.html')
25
+ page = AlgoliaHTMLExtractor.new(content)
26
+ records = page.extract
27
+ puts records
28
+ ```
29
+
30
+ ## Records
31
+
32
+ `extract` will return an array of records. Each record will represent a `<p>`
33
+ paragraph of the initial text, along with its textual version (HTML removed),
34
+ heading hierarchy, and other interesting bits.
35
+
36
+ ## Example
37
+
38
+ Let's take the following HTML as input and see what records we get as output:
39
+
40
+ ```html
41
+ <!doctype html>
42
+ <html>
43
+ <body>
44
+ <h1 name="journey">The Hero's Journey</h1>
45
+ <p>Most stories always follow the same pattern.</p>
46
+ <h2 name="departure">Part One: Departure</h2>
47
+ <p>A story starts in a mundane world, and helps identify the hero. It helps put all the achievements of the story into perspective.</p>
48
+ <h3 name="calladventure">The call to Adventure</h3>
49
+ <p>Some out-of-the-ordinary event pushes the hero to start his journey.</p>
50
+ <h3 name="threshold">Crossing the Threshold</h3>
51
+ <p>The hero quits his job, hits the road, or whatever cuts him from his previous life.</p>
52
+ <h2 name="initiations">Part Two: Initiation</h2>
53
+ <h3 name="trials">The Road of Trials</h3>
54
+ <p>The road is filled with dangers. The hero has to find his inner strength to overcome them.</p>
55
+ <h3 name="ultimate">The Ultimate Boon</h3>
56
+ <p>The hero has found something, either physical or metaphorical that changes him.</p>
57
+ <h2 name="return">Part Three: Return</h2>
58
+ <h3 name="refusal">Refusal to Return</h3>
59
+ <p>The hero does not want to go back to his previous life at first. But then, an event will make him change his mind.</p>
60
+ <h3 name="master">Master of Two Worlds</h3>
61
+ <p>Armed with his new power/weapon, the hero can go back to his initial world and fix all the issues he had there.</p>
62
+ </body>
63
+ </html>
64
+ ```
65
+
66
+ Here is one of the records extracted:
67
+
68
+ ```ruby
69
+ {
70
+ :uuid => "1f5923d5a60e998704f201bbe9964811",
71
+ :tag_name => "p",
72
+ :html => "<p>The hero quits his job, hits the road, or whatever cuts him from his previous life.</p>",
73
+ :text => "The hero quits his job, hits the road, or whatever cuts him from his previous life.",
74
+ :node => #<Nokogiri::XML::Element:0x11a5850 name="p">,
75
+ :anchor => 'threshold',
76
+ :hierarchy => {
77
+ :lvl0 => "The Hero's Journey",
78
+ :lvl1 => "Part One: Departure",
79
+ :lvl2 => "Crossing the Threshold",
80
+ :lvl3 => nil,
81
+ :lvl4 => nil,
82
+ :lvl5 => nil,
83
+ :lvl6 => nil
84
+ },
85
+ :weight => {
86
+ :heading => 70,
87
+ :position => 3
88
+ }
89
+ }
90
+ ```
91
+
92
+ Each record has a `uuid` that uniquely identifies it (computed as a hash of all
93
+ the other values).
94
+
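The hashing described above can be sketched in plain Ruby. This is a minimal illustration modeled on the `uuid` method in `lib/algolia_html_extractor.rb`. The standalone name `record_uuid` and the sample hashes are assumptions made for this example and are not part of the gem's API.

```ruby
require 'digest/md5'

# Build a deterministic identifier from a record's content.
# `record_uuid` is a hypothetical standalone name used for this sketch.
def record_uuid(item)
  # Sort the keys so that insertion order does not affect the result
  pairs = item.keys.sort.map do |key|
    value = item[key]
    # Nested hashes are reduced to their own digest first
    value = record_uuid(value) if value.is_a?(Hash)
    "#{key}=#{value}"
  end
  # Hash the joined "key=value" pairs into a 32-character hex string
  Digest::MD5.hexdigest(pairs.join(','))
end

a = { text: 'foo', tag_name: 'p', weight: { heading: 90, position: 0 } }
b = { weight: { position: 0, heading: 90 }, tag_name: 'p', text: 'foo' }
puts record_uuid(a) == record_uuid(b) # prints true
```

Because the keys are sorted before hashing, two records with the same content get the same identifier regardless of the order in which their keys were set.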
95
+ It also contains the HTML tag name in `tag_name` (by default `<p>`
96
+ paragraphs are extracted, but see the [settings][3] on how to change it).
97
+
98
+ `html` contains the whole outer HTML of the element, including the wrapping
99
+ tags and inner children. The `text` attribute contains the textual content,
100
+ stripping out all HTML.
101
+
102
+ `node` contains the [Nokogiri node][4] instance. The lib uses it internally to
103
+ extract all the relevant information, and it is also exposed if you want to process
104
+ the node further.
105
+
106
+ The `anchor` attribute contains the HTML anchor closest to the element. Here it
107
+ is `threshold` because this is the closest anchor in the hierarchy above.
108
+ Anchors are searched in `name` and `id` attributes of headings.
109
+
110
+ `hierarchy` then contains a snapshot of the current heading hierarchy of the
111
+ paragraph. The `lvlX` syntax is used to be compatible with the records
112
+ [DocSearch][5] is using.
113
+
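The way the hierarchy snapshot evolves while walking the page can be sketched without any HTML parsing. In this simplified illustration, `[tag_name, text]` pairs stand in for DOM nodes, and the names `NODES`, `HIERARCHIES` and `snapshot_hierarchies` are hypothetical. The sketch keeps six levels (`lvl0` to `lvl5`, one per `h1` to `h6`) and shortens the sample paragraph texts.

```ruby
# Simplified stand-in for DOM nodes: [tag_name, text] pairs in document order.
NODES = [
  ['h1', "The Hero's Journey"],
  ['p',  'Most stories always follow the same pattern.'],
  ['h2', 'Part One: Departure'],
  ['p',  'A story starts in a mundane world.'],
  ['h3', 'The call to Adventure'],
  ['p',  'Some out-of-the-ordinary event pushes the hero.'],
  ['h3', 'Crossing the Threshold'],
  ['p',  'The hero quits his job.']
].freeze

# Returns one hierarchy snapshot per non-heading node.
def snapshot_hierarchies(nodes)
  hierarchy = (0..5).map { |lvl| [:"lvl#{lvl}", nil] }.to_h
  snapshots = []
  nodes.each do |tag, text|
    match = tag.match(/\Ah([1-6])\z/)
    if match
      level = match[1].to_i - 1 # h1 => lvl0, h2 => lvl1, ...
      hierarchy[:"lvl#{level}"] = text
      # A new heading invalidates every deeper level
      ((level + 1)..5).each { |lvl| hierarchy[:"lvl#{lvl}"] = nil }
    else
      # Copy the hash so later headings do not alter earlier snapshots
      snapshots << hierarchy.dup
    end
  end
  snapshots
end

HIERARCHIES = snapshot_hierarchies(NODES)
puts HIERARCHIES.last
```

The last snapshot has `lvl2` set to "Crossing the Threshold", because that heading replaced "The call to Adventure" at the same level.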
114
+ The `weight` attribute is used to provide an easy way to rank two records
115
+ relative to each other.
116
+
117
+ - `heading` gives the depth level in the hierarchy where the record is. Records
118
+ at the top level will have a value of 100, those under an `h1` will have 90, and so
119
+ on. Because our record is under a `h3`, it has 70.
120
+ - `position` is the position of the paragraph in the page. Here our paragraph is
121
+ the fourth paragraph of the page, so it will have a `position` of 3. It can
122
+ help you give more weight to the first items in the page.
123
+
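As a rough illustration of how the two weights could be used together, the sketch below recomputes the heading weight with the same formula as `heading_weight` in the library source and sorts a few hand-written records. The `rank` helper and the sample records are assumptions made for this example. In practice the ordering would more likely be configured on the search index than computed locally.

```ruby
# Same formula as the library: 100 at top level, then 10 less per
# heading depth. `heading_level` is zero-based (h1 => 0), or nil when
# the record is not under any heading.
def heading_weight(heading_level)
  return 100 if heading_level.nil?
  100 - ((heading_level + 1) * 10)
end

# Hypothetical helper: highest heading weight first, then page order.
def rank(records)
  records.sort_by { |r| [-r[:weight][:heading], r[:weight][:position]] }
end

records = [
  { text: 'under h3', weight: { heading: heading_weight(2), position: 3 } },
  { text: 'under h1', weight: { heading: heading_weight(0), position: 0 } },
  { text: 'under h2', weight: { heading: heading_weight(1), position: 1 } }
]
puts rank(records).map { |r| r[:text] }
```

This prints the records in the order `under h1`, `under h2`, `under h3`.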
124
+ ## Settings
125
+
126
+ When instantiating `AlgoliaHTMLExtractor`, you can pass a second `options`
127
+ argument. It currently accepts one key, `css_selector`.
128
+
129
+ ```ruby
130
+ page = AlgoliaHTMLExtractor.new(content, options: { css_selector: 'p,li' })
131
+ ```
132
+
133
+ This lets you change the default selector. Here, in addition to `<p>` paragraphs,
134
+ the library will also extract `<li>` list elements.
135
+
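To show how a custom selector combines with the default one, here is a small sketch based on what the constructor and `extract` do in `lib/algolia_html_extractor.rb`: user options are merged over the defaults, then joined with the heading selector. The constant names and the `combined_selector` helper are hypothetical and exist only for this example.

```ruby
DEFAULT_OPTIONS = { css_selector: 'p' }.freeze
HEADING_SELECTOR = 'h1,h2,h3,h4,h5,h6'.freeze

# Hypothetical helper: returns the selector matching every heading plus
# the elements that should be extracted as records.
def combined_selector(options = {})
  merged = DEFAULT_OPTIONS.merge(options) # user values override defaults
  "#{HEADING_SELECTOR},#{merged[:css_selector]}"
end

puts combined_selector                       # h1,h2,h3,h4,h5,h6,p
puts combined_selector(css_selector: 'p,li') # h1,h2,h3,h4,h5,h6,p,li
```

Headings stay in the combined selector so that the hierarchy can be tracked in document order, even when they are not extracted as records themselves.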
136
+ # CONTRIBUTING
137
+
138
+ I'm happy you'd like to contribute. All contributions are welcome, ranging from
139
+ feature requests to pull requests, but also including typo fixing, documentation
140
+ and generic bug reports.
141
+
142
+ ## Bug Reports and feature requests
143
+
144
+ For any bug or feature idea, please start by checking in the
145
+ [issues](https://github.com/pixelastic/html-hierarchy-extractor/issues) tab if
146
+ it hasn't been discussed already. If not, feel free to open a new issue.
147
+
148
+ ## Pull Requests
149
+
150
+ All PRs are welcome, from small typo fixes to large codebase changes. If you
151
+ think you'll need to change a lot of code in a lot of files, I would suggest you
152
+ open an issue first so we can discuss before you start working on something.
153
+
154
+ All PRs should be based on the `develop` branch (`master` only ever contains the
155
+ last released change).
156
+
157
+ ## Git Hooks
158
+
159
+ If you start working on the actual code, you should install the git hooks.
160
+
161
+ ```
162
+ cp ./scripts/git_hooks/* ./.git/hooks
163
+ ```
164
+
165
+ This will add `pre-commit` and `pre-push` scripts that will respectively check
166
+ that all files are lint-free before committing, and pass all tests before
167
+ pushing. If either of those two hooks gives you errors, you should fix the code
168
+ before committing or pushing.
169
+
170
+ Having those steps helps keep the codebase as clean as possible, and
171
+ avoids polluting PR discussions with style issues.
172
+
173
+ ## Development
174
+
175
+ The first thing you should do to get all your dependencies up to date is run `bundle
176
+ install` before running any other command.
177
+
178
+ ## Lint
179
+
180
+ `rake lint` will check all the files for potential linting issues. It uses
181
+ Rubocop, and the configuration can be found in `.rubocop.yml`.
182
+
183
+ ## Test
184
+
185
+ `rake test` will run all the tests.
186
+
187
+ `rake coverage` will do the same, but also adding the code coverage files to
188
+ `./coverage`. This should be useful in a CI environment.
189
+
190
+ `rake watch` will run Guard, which will do a live run of all your tests. Every
191
+ update to a file (code or test) will re-run all the bound tests. This is highly
192
+ recommended for TDD.
193
+
194
+ ## Using a local version of the gem
195
+
196
+ If you want to test a local version of the gem in your local project, I suggest
197
+ updating your project's `Gemfile` to point to the correct local directory:
198
+
199
+ ```ruby
200
+ gem "algolia_html_extractor", :path => "/path/to/local/gem/folder"
201
+ ```
202
+
203
+ You should also run `rake gemspec` from the `html-hierarchy-extractor`
204
+ repository the first time and if you added/deleted any file or dependency.
205
+
206
+ ## History
207
+
208
+ This gem was previously named `html-hierarchy-extractor` but has been renamed to
209
+ `algolia_html_extractor` to both make its intent clearer and follow gem naming
210
+ convention. That's also why this gem directly starts at v2.0.
211
+
212
+
213
+ [1]: https://www.algolia.com/
214
+ [2]: https://community.algolia.com/docsearch/
215
+ [3]: #Settings
216
+ [4]: http://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/Node
217
+ [5]: https://community.algolia.com/docsearch/
@@ -0,0 +1,144 @@
1
+ require 'nokogiri'
2
+ require 'digest/md5'
3
+
4
+ # Extract content from an HTML page in the form of items with associated
5
+ # hierarchy data
6
+ class AlgoliaHTMLExtractor
7
+ def initialize(input, options: {})
8
+ @dom = Nokogiri::HTML(input)
9
+ default_options = {
10
+ css_selector: 'p'
11
+ }
12
+ @options = default_options.merge(options)
13
+ end
14
+
15
+ # Returns the outer HTML of a given node
16
+ #
17
+ # eg.
18
+ # <p>foo</p> => <p>foo</p>
19
+ def extract_html(node)
20
+ node.to_s.strip
21
+ end
22
+
23
+ # Returns the text content of a given node (HTML stripped)
24
+ #
25
+ # eg.
26
+ # <p>foo</p> => foo
27
+ def extract_text(node)
28
+ node.content
29
+ end
30
+
31
+ # Returns the tag name of a given node
32
+ #
33
+ # eg
34
+ # <p>foo</p> => p
35
+ def extract_tag_name(node)
36
+ node.name.downcase
37
+ end
38
+
39
+ # Returns the anchor to the node
40
+ #
41
+ # eg.
42
+ # <h1 name="anchor">Foo</h1> => anchor
43
+ # <h1 id="anchor">Foo</h1> => anchor
44
+ # <h1><a name="anchor">Foo</a></h1> => anchor
45
+ def extract_anchor(node)
46
+ anchor = node.attr('name') || node.attr('id') || nil
47
+ return anchor unless anchor.nil?
48
+
49
+ # No anchor found directly in the header, search on children
50
+ subelement = node.css('[name],[id]')
51
+ return extract_anchor(subelement[0]) unless subelement.empty?
52
+
53
+ nil
54
+ end
55
+
56
+ ##
57
+ # Generate a unique identifier for the item
58
+ def uuid(item)
59
+ # We first get all the keys of the object, sorted alphabetically...
60
+ ordered_keys = item.keys.sort
61
+
62
+ # ...then we build a huge array of "key=value" pairs...
63
+ ordered_array = ordered_keys.map do |key|
64
+ value = item[key]
65
+ # We apply the method recursively on other hashes
66
+ value = uuid(value) if value.is_a?(Hash)
67
+ "#{key}=#{value}"
68
+ end
69
+
70
+ # ...then we build a unique md5 hash of it
71
+ Digest::MD5.hexdigest(ordered_array.join(','))
72
+ end
73
+
74
+ ##
75
+ # Get a relative numeric value of the importance of the heading
76
+ # 100 for top level, then -10 per heading
77
+ def heading_weight(heading_level)
78
+ weight = 100
79
+ return weight if heading_level.nil?
80
+ weight - ((heading_level + 1) * 10)
81
+ end
82
+
83
+ def extract
84
+ heading_selector = 'h1,h2,h3,h4,h5,h6'
85
+ # We select all nodes that match either the headings or the elements to
86
+ # extract. This will allow us to loop over them in the order they appear in the DOM
87
+ all_selector = "#{heading_selector},#{@options[:css_selector]}"
88
+
89
+ items = []
90
+ current_hierarchy = {
91
+ lvl0: nil,
92
+ lvl1: nil,
93
+ lvl2: nil,
94
+ lvl3: nil,
95
+ lvl4: nil,
96
+ lvl5: nil
97
+ }
98
+ current_position = 0 # Position of the DOM node in the tree
99
+ current_lvl = nil # Current closest hierarchy level
100
+ current_anchor = nil # Current closest anchor
101
+
102
+ @dom.css(all_selector).each do |node|
103
+ # If it's a heading, we update our current hierarchy
104
+ if node.matches?(heading_selector)
105
+ # Which level heading is it?
106
+ current_lvl = extract_tag_name(node).gsub(/^h/, '').to_i - 1
107
+ # Update this level, and set all the following ones to nil
108
+ current_hierarchy["lvl#{current_lvl}".to_sym] = extract_text(node)
109
+ (current_lvl + 1..6).each do |lvl|
110
+ current_hierarchy["lvl#{lvl}".to_sym] = nil
111
+ end
112
+ # Update the anchor, if the new heading has one
113
+ new_anchor = extract_anchor(node)
114
+ current_anchor = new_anchor if new_anchor
115
+ end
116
+
117
+ # Stop if node is not to be extracted
118
+ next unless node.matches?(@options[:css_selector])
119
+
120
+ # Stop if node is empty
121
+ text = extract_text(node)
122
+ next if text.empty?
123
+
124
+ item = {
125
+ html: extract_html(node),
126
+ text: text,
127
+ tag_name: extract_tag_name(node),
128
+ hierarchy: current_hierarchy.clone,
129
+ anchor: current_anchor,
130
+ node: node,
131
+ weight: {
132
+ position: current_position,
133
+ heading: heading_weight(current_lvl)
134
+ }
135
+ }
136
+ item[:uuid] = uuid(item)
137
+ items << item
138
+
139
+ current_position += 1
140
+ end
141
+
142
+ items
143
+ end
144
+ end
@@ -0,0 +1,5 @@
1
+ # Expose gem version
2
+ # rubocop:disable Style/SingleLineMethods
3
+ class AlgoliaHTMLExtractorVersion
4
+ def self.to_s; '2.0.1' end
5
+ end
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: algolia_html_extractor
3
3
  version: !ruby/object:Gem::Version
4
- version: 2.0.0
4
+ version: 2.0.1
5
5
  platform: ruby
6
6
  authors:
7
7
  - Tim Carry
@@ -198,7 +198,12 @@ email: tim@pixelastic.com
198
198
  executables: []
199
199
  extensions: []
200
200
  extra_rdoc_files: []
201
- files: []
201
+ files:
202
+ - CONTRIBUTING.md
203
+ - LICENSE.txt
204
+ - README.md
205
+ - lib/algolia_html_extractor.rb
206
+ - lib/version.rb
202
207
  homepage: https://github.com/algolia/html-extractor
203
208
  licenses:
204
209
  - MIT