html-hierarchy-extractor 1.0.11 → 1.0.12

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 4ae1be739dd1c3c223adf0a43dee8c2db518f79d
4
- data.tar.gz: 610da86bcb1051eaa2326cb3a0d9cf14fe82d790
3
+ metadata.gz: 0b21edfe3c30bf5534e1eba5acd9ce1e6d0ce940
4
+ data.tar.gz: 93333cb4915ad7d03335d69d8b4b2ea28644ab39
5
5
  SHA512:
6
- metadata.gz: d0ccd7c0fe90ada0c414f6af2d1b928d69390866e50409556e70a7545485adc4fba97c8988aa589de817d5005f7d80b4322a8d5f9b6c2657aa75924fd7ff493e
7
- data.tar.gz: c59ee344ad419cfb887b1d1fc6c45c33f0a774433799baae158320797db4b91d06ef151cb89eb4de36698345a0d3172e4c2a4251346a001ffbe501ae5271d399
6
+ metadata.gz: 4ab4be055f1c0270e665daf224d322a4eaa7dd72d2b07e80593835b75bd6980a2e383533feb60fb2ec905f818643f2f8ab72e2a90ffa96cb7e3fad90190eae46
7
+ data.tar.gz: eb0a01ba3102aa484b3386b0642304dd51c2eadb30dcf346edae7e3c6acf23a366212575e37ff7d0783be1d15fba44e4a45afa1ad6f9a73ccb3d0c8d9e717e92
@@ -0,0 +1,19 @@
1
+ ## Releasing
2
+
3
+ `rake build` will build
4
+
5
+ # Tagging and releasing
6
+
7
+ If you need to release a new version of the gem to RubyGems, you have to follow
8
+ these steps:
9
+
10
+ ```
11
+ # Bump the version (in develop)
12
+ ./scripts/bump_version minor
13
+
14
+ # Update master and release
15
+ ./scripts/release
16
+
17
+ # Install the gem locally (optional)
18
+ rake install
19
+ ```
@@ -0,0 +1,20 @@
1
+ Copyright (c) 2016 Pixelastic
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,222 @@
1
+ # html-hierarchy-extractor
2
+
3
+ ## ⚠ DEPRECATION NOTICE
4
+
5
+ This gem has been deprecated in favor of [algolia_html_extractor][1]. No further
6
+ development will be happening on this gem. The new gem took over where this one
7
+ stopped.
8
+
9
+ ## Description
10
+
11
+ This gem lets you extract the hierarchy of headings and content from any HTML
12
+ page into an array of elements.
13
+
14
+
15
+ Intended to be used with [Algolia][2] to improve relevance of search
16
+ results inside large HTML pages. The records created are compatible with the
17
+ [DocSearch][3] format.
18
+
19
+ ## Installation
20
+
21
+ ```ruby
22
+ # Gemfile
23
+ source 'https://rubygems.org'
24
+
25
+ gem 'html-hierarchy-extractor', '~> 1.0'
26
+ ```
27
+
28
+ ## How to use
29
+
30
+ ```ruby
31
+ require 'html-hierarchy-extractor'
32
+
33
+ content = File.read('./index.html')
34
+ page = HTMLHierarchyExtractor.new(content)
35
+ records = page.extract
36
+ puts records
37
+ ```
38
+
39
+ ## Records
40
+
41
+ `extract` will return an array of records. Each record will represent a `<p>`
42
+ paragraph of the initial text, along with its textual version (HTML removed),
43
+ heading hierarchy, and other interesting bits.
44
+
45
+ ## Example
46
+
47
+ Let's take the following HTML as input and see what records we get as output:
48
+
49
+ ```html
50
+ <!doctype html>
51
+ <html>
52
+ <body>
53
+ <h1 name="journey">The Hero's Journey</h1>
54
+ <p>Most stories always follow the same pattern.</p>
55
+ <h2 name="departure">Part One: Departure</h2>
56
+ <p>A story starts in a mundane world, and helps identify the hero. It helps put all the achievements of the story into perspective.</p>
57
+ <h3 name="calladventure">The call to Adventure</h3>
58
+ <p>Some out-of-the-ordinary event pushes the hero to start his journey.</p>
59
+ <h3 name="threshold">Crossing the Threshold</h3>
60
+ <p>The hero quits his job, hits the road, or whatever cuts him from his previous life.</p>
61
+ <h2 name="initiations">Part Two: Initiation</h2>
62
+ <h3 name="trials">The Road of Trials</h3>
63
+ <p>The road is filled with dangers. The hero has to find his inner strength to overcome them.</p>
64
+ <h3 name="ultimate">The Ultimate Boon</h3>
65
+ <p>The hero has found something, either physical or metaphorical that changes him.</p>
66
+ <h2 name="return">Part Three: Return</h2>
67
+ <h3 name="refusal">Refusal to Return</h3>
68
+ <p>The hero does not want to go back to his previous life at first. But then, an event will make him change his mind.</p>
69
+ <h3 name="master">Master of Two Worlds</h3>
70
+ <p>Armed with his new power/weapon, the hero can go back to his initial world and fix all the issues he had there.</p>
71
+ </body>
72
+ </html>
73
+ ```
74
+
75
+ Here is one of the records extracted:
76
+
77
+ ```ruby
78
+ {
79
+ :uuid => "1f5923d5a60e998704f201bbe9964811",
80
+ :tag_name => "p",
81
+ :html => "<p>The hero quits his job, hits the road, or whatever cuts him from his previous life.</p>",
82
+ :text => "The hero quits his job, hits the road, or whatever cuts him from his previous life.",
83
+ :node => #<Nokogiri::XML::Element:0x11a5850 name="p">,
84
+ :anchor => 'threshold',
85
+ :hierarchy => {
86
+ :lvl0 => "The Hero's Journey",
87
+ :lvl1 => "Part One: Departure",
88
+ :lvl2 => "Crossing the Threshold",
89
+ :lvl3 => nil,
90
+ :lvl4 => nil,
91
+ :lvl5 => nil,
92
+ :lvl6 => nil
93
+ },
94
+ :weight => {
95
+ :heading => 70,
96
+ :position => 3
97
+ }
98
+ }
99
+ ```
100
+
101
+ Each record has a `uuid` that uniquely identifies it (computed by a hash of all
102
+ the other values).
103
+
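As a rough illustration of how such an identifier can be derived, here is a minimal sketch modeled on the `uuid` method in `lib/html-hierarchy-extractor.rb` from this release. The method name `record_uuid` and the sample hashes are assumptions made for the example; they are not part of the gem's API.

```ruby
require 'digest/md5'

# Hypothetical helper: builds "key=value" pairs in sorted key order,
# recursing into nested hashes, then hashes the joined string with MD5.
def record_uuid(item)
  pairs = item.keys.sort.map do |key|
    value = item[key]
    value = record_uuid(value) if value.is_a?(Hash)
    "#{key}=#{value}"
  end
  Digest::MD5.hexdigest(pairs.join(','))
end

first  = record_uuid(text: 'foo', weight: { heading: 90, position: 0 })
second = record_uuid(weight: { position: 0, heading: 90 }, text: 'foo')

# Keys are sorted before hashing, so insertion order does not matter
puts first == second
```

Because keys are sorted first, two records with the same values get the same identifier regardless of key order, while any change to a value produces a different one.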
104
+ It also contains the HTML tag name in `tag_name` (by default `<p>`
105
+ paragraphs are extracted, but see the [settings][4] on how to change it).
106
+
107
+ `html` contains the whole `outerHTML` of the element, including the wrapping
108
+ tags and inner children. The `text` attribute contains the textual content,
109
+ stripping out all HTML.
110
+
111
+ `node` contains the [Nokogiri node][5] instance. The lib uses it internally to
112
+ extract all the relevant information, but it is also exposed if you want to process
113
+ the node further.
114
+
115
+ The `anchor` attribute contains the HTML anchor closest to the element. Here it
116
+ is `threshold` because this is the closest anchor in the hierarchy above.
117
+ Anchors are searched in `name` and `id` attributes of headings.
118
+
119
+ `hierarchy` then contains a snapshot of the current heading hierarchy of the
120
+ paragraph. The `lvlX` syntax is used to be compatible with the records
121
+ [DocSearch][6] is using.
122
+
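The snapshot behaviour can be sketched without Nokogiri by walking a simplified list of `[tag, text]` pairs in document order. This is an illustration under assumptions, not the gem's implementation: `snapshot_hierarchies` and the `NODES` constant are hypothetical names, and only `lvl0` to `lvl5` are tracked here.

```ruby
# Simplified stand-in for DOM nodes: [tag_name, text] pairs in document order
NODES = [
  ['h1', "The Hero's Journey"],
  ['p',  'Most stories always follow the same pattern.'],
  ['h2', 'Part One: Departure'],
  ['h3', 'The call to Adventure'],
  ['h3', 'Crossing the Threshold'],
  ['p',  'The hero quits his job.']
].freeze

# Hypothetical helper: returns one record per non-heading node, each with a
# copy of the heading hierarchy as it stood when the node was reached.
def snapshot_hierarchies(nodes)
  hierarchy = { lvl0: nil, lvl1: nil, lvl2: nil, lvl3: nil, lvl4: nil, lvl5: nil }
  records = []
  nodes.each do |tag, text|
    if tag =~ /\Ah([1-6])\z/
      level = Regexp.last_match(1).to_i - 1
      hierarchy[:"lvl#{level}"] = text
      # A new heading resets every deeper level
      ((level + 1)..5).each { |deeper| hierarchy[:"lvl#{deeper}"] = nil }
    else
      # dup so later headings do not alter records already stored
      records << { text: text, hierarchy: hierarchy.dup }
    end
  end
  records
end

snapshot_hierarchies(NODES).each do |record|
  puts "#{record[:text]} -> #{record[:hierarchy].values.compact.join(' > ')}"
end
```

Copying the hash at each record is what makes it a snapshot: the first paragraph keeps only `lvl0`, even though deeper headings appear later in the page.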
123
+ The `weight` attribute is used to provide an easy way to rank two records
124
+ relative to each other.
125
+
126
+ - `heading` gives the depth level in the hierarchy where the record is. Records
127
+ on top level will have a value of 100, those under a `h1` will have 90, and so
128
+ on. Because our record is under a `h3`, it has 70.
129
+ - `position` is the position of the paragraph in the page. Here our paragraph is
130
+ the fourth paragraph of the page, so it will have a `position` of 3. It can
131
+ help you give more weight to the first items in the page.
132
+
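To make the arithmetic concrete, the sketch below reproduces the weight rule described above and uses it to order a few records. The `rank` helper and the sample records are assumptions made for illustration; the gem itself only exposes the `weight` values and leaves ranking to you.

```ruby
# 100 at top level, minus 10 per heading level.
# heading_level is zero-based (h1 => 0, h2 => 1, ...) and nil when the
# record appears before any heading.
def heading_weight(heading_level)
  return 100 if heading_level.nil?
  100 - ((heading_level + 1) * 10)
end

# Hypothetical ranking: higher heading weight first, then earlier position
def rank(records)
  records.sort_by { |r| [-r[:weight][:heading], r[:weight][:position]] }
end

records = [
  { text: 'deep',  weight: { heading: heading_weight(2),   position: 3 } },
  { text: 'intro', weight: { heading: heading_weight(0),   position: 1 } },
  { text: 'top',   weight: { heading: heading_weight(nil), position: 0 } }
]

puts rank(records).map { |r| r[:text] }.inspect
# prints ["top", "intro", "deep"]
```

A record under an `h3` gets `100 - (2 + 1) * 10 = 70`, which matches the example record above.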
133
+ ## Settings
134
+
135
+ When instantiating `HTMLHierarchyExtractor`, you can pass a second, `options`
136
+ argument. This hash accepts one key, `css_selector`.
137
+
138
+ ```ruby
139
+ page = HTMLHierarchyExtractor.new(content, options: { css_selector: 'p,li' })
140
+ ```
141
+
142
+ This lets you change the default selector. Here, in addition to `<p>` paragraphs,
143
+ the library will extract `<li>` list elements as well.
144
+
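In the 1.0.12 source included in this diff, the constructor is declared as `initialize(input, options: {})` and merges the given hash over a default of `css_selector: 'p'`. The sketch below isolates that merge so it can be tried without Nokogiri; `resolve_options` is a hypothetical stand-in, not a method of the gem.

```ruby
# Hypothetical stand-in for the constructor's option handling:
# user-supplied options override the defaults key by key.
def resolve_options(options: {})
  default_options = { css_selector: 'p' }
  default_options.merge(options)
end

puts resolve_options[:css_selector]
# prints p

puts resolve_options(options: { css_selector: 'p,li' })[:css_selector]
# prints p,li
```

Because `options` is a keyword argument in that signature, the hash has to be passed as `options: { ... }` rather than as a bare second positional argument.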
145
+ # CONTRIBUTING
146
+
147
+ I'm happy you'd like to contribute. All contributions are welcome, ranging from
148
+ feature requests to pull requests, but also including typo fixing, documentation
149
+ and generic bug reports.
150
+
151
+ ## Bug Reports and feature requests
152
+
153
+ For any bug or idea for a new feature, please start by checking the
154
+ [issues][7] tab to see whether
155
+ it has already been discussed. If not, feel free to open a new issue.
156
+
157
+ ## Pull Requests
158
+
159
+ All PRs are welcome, from small typo fixes to large codebase changes. If you
160
+ think you'll need to change a lot of code in a lot of files, I suggest you
161
+ open an issue first so we can discuss before you start working on something.
162
+
163
+ All PRs should be based on the `develop` branch (`master` only ever contains the
164
+ last released change).
165
+
166
+ ## Git Hooks
167
+
168
+ If you start working on the actual code, you should install the git hooks.
169
+
170
+ ```
171
+ cp ./scripts/git_hooks/* ./.git/hooks
172
+ ```
173
+
174
+ This will add `pre-commit` and `pre-push` scripts that will respectively check
175
+ that all files are lint-free before committing, and that all tests pass before
176
+ pushing. If either of those two hooks gives you errors, you should fix the code
177
+ before committing or pushing.
178
+
179
+ Having those steps helps keep the codebase as clean as possible, and
180
+ avoids polluting PR discussions with style issues.
181
+
182
+ ## Development
183
+
184
+ First thing you should do to get all your dependencies up to date is run `bundle
185
+ install` before running any other command.
186
+
187
+ ## Lint
188
+
189
+ `rake lint` will check all the files for potential linting issues. It uses
190
+ Rubocop, and the configuration can be found in `.rubocop.yml`.
191
+
192
+ ## Test
193
+
194
+ `rake test` will run all the tests.
195
+
196
+ `rake coverage` will do the same, and also add the code coverage files to
197
+ `./coverage`. This should be useful in a CI environment.
198
+
199
+ `rake watch` will run Guard that will do a live run of all your tests. Every
200
+ update to a file (code or test) will re-run all the bound tests. This is highly
201
+ recommended for TDD.
202
+
203
+ ## Using a local version of the gem
204
+
205
+ If you want to test a local version of the gem in your local project, I suggest
206
+ updating your project's `Gemfile` to point to the correct local directory:
207
+
208
+ ```ruby
209
+ gem "html-hierarchy-extractor", :path => "/path/to/local/gem/folder"
210
+ ```
211
+
212
+ You should also run `rake gemspec` from the `html-hierarchy-extractor`
213
+ repository the first time and if you added/deleted any file or dependency.
214
+
215
+
216
+ [1]: https://github.com/algolia/html-extractor
217
+ [2]: https://www.algolia.com/
218
+ [3]: https://community.algolia.com/docsearch/
219
+ [4]: #settings
220
+ [5]: http://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/Node
221
+ [6]: https://community.algolia.com/docsearch/
222
+ [7]: https://github.com/pixelastic/html-hierarchy-extractor/issues
@@ -0,0 +1,148 @@
1
+ require 'nokogiri'
2
+ require 'digest/md5'
3
+
4
+ # Extract content from an HTML page in the form of items with associated
5
+ # hierarchy data
6
+ class HTMLHierarchyExtractor
7
+ def initialize(input, options: {})
8
+ @dom = Nokogiri::HTML(input)
9
+ default_options = {
10
+ css_selector: 'p'
11
+ }
12
+ @options = default_options.merge(options)
13
+
14
+ warn '[DEPRECATION] The gem html-hierarchy-extractor has been renamed '\
15
+ 'to algolia_html_extractor and will no longer be supported. '\
16
+ 'Please switch to algolia_html_extractor as soon as possible.'
17
+ end
18
+
19
+ # Returns the outer HTML of a given node
20
+ #
21
+ # eg.
22
+ # <p>foo</p> => <p>foo</p>
23
+ def extract_html(node)
24
+ node.to_s.strip
25
+ end
26
+
27
+ # Returns the text content of a given node (all HTML stripped)
28
+ #
29
+ # eg.
30
+ # <p>foo</p> => foo
31
+ def extract_text(node)
32
+ node.content
33
+ end
34
+
35
+ # Returns the tag name of a given node
36
+ #
37
+ # eg.
38
+ # <p>foo</p> => p
39
+ def extract_tag_name(node)
40
+ node.name.downcase
41
+ end
42
+
43
+ # Returns the anchor to the node
44
+ #
45
+ # eg.
46
+ # <h1 name="anchor">Foo</h1> => anchor
47
+ # <h1 id="anchor">Foo</h1> => anchor
48
+ # <h1><a name="anchor">Foo</a></h1> => anchor
49
+ def extract_anchor(node)
50
+ anchor = node.attr('name') || node.attr('id') || nil
51
+ return anchor unless anchor.nil?
52
+
53
+ # No anchor found directly in the header, search on children
54
+ subelement = node.css('[name],[id]')
55
+ return extract_anchor(subelement[0]) unless subelement.empty?
56
+
57
+ nil
58
+ end
59
+
60
+ ##
61
+ # Generate a unique identifier for the item
62
+ def uuid(item)
63
+ # We first get all the keys of the object, sorted alphabetically...
64
+ ordered_keys = item.keys.sort
65
+
66
+ # ...then we build a huge array of "key=value" pairs...
67
+ ordered_array = ordered_keys.map do |key|
68
+ value = item[key]
69
+ # We apply the method recursively on other hashes
70
+ value = uuid(value) if value.is_a?(Hash)
71
+ "#{key}=#{value}"
72
+ end
73
+
74
+ # ...then we build a unique md5 hash of it
75
+ Digest::MD5.hexdigest(ordered_array.join(','))
76
+ end
77
+
78
+ ##
79
+ # Get a relative numeric value of the importance of the heading
80
+ # 100 for top level, then -10 per heading level
81
+ def heading_weight(heading_level)
82
+ weight = 100
83
+ return weight if heading_level.nil?
84
+ weight - ((heading_level + 1) * 10)
85
+ end
86
+
87
+ def extract
88
+ heading_selector = 'h1,h2,h3,h4,h5,h6'
89
+ # We select all nodes that match either the headings or the elements to
90
+ # extract. This lets us loop over them in the order they appear in the DOM
91
+ all_selector = "#{heading_selector},#{@options[:css_selector]}"
92
+
93
+ items = []
94
+ current_hierarchy = {
95
+ lvl0: nil,
96
+ lvl1: nil,
97
+ lvl2: nil,
98
+ lvl3: nil,
99
+ lvl4: nil,
100
+ lvl5: nil
101
+ }
102
+ current_position = 0 # Position of the DOM node in the tree
103
+ current_lvl = nil # Current closest hierarchy level
104
+ current_anchor = nil # Current closest anchor
105
+
106
+ @dom.css(all_selector).each do |node|
107
+ # If it's a heading, we update our current hierarchy
108
+ if node.matches?(heading_selector)
109
+ # Which level heading is it?
110
+ current_lvl = extract_tag_name(node).gsub(/^h/, '').to_i - 1
111
+ # Update this level, and set all the following ones to nil
112
+ current_hierarchy["lvl#{current_lvl}".to_sym] = extract_text(node)
113
+ (current_lvl + 1..6).each do |lvl|
114
+ current_hierarchy["lvl#{lvl}".to_sym] = nil
115
+ end
116
+ # Update the anchor, if the new heading has one
117
+ new_anchor = extract_anchor(node)
118
+ current_anchor = new_anchor if new_anchor
119
+ end
120
+
121
+ # Skip this node if it is not to be extracted
122
+ next unless node.matches?(@options[:css_selector])
123
+
124
+ # Skip this node if it is empty
125
+ text = extract_text(node)
126
+ next if text.empty?
127
+
128
+ item = {
129
+ html: extract_html(node),
130
+ text: text,
131
+ tag_name: extract_tag_name(node),
132
+ hierarchy: current_hierarchy.clone,
133
+ anchor: current_anchor,
134
+ node: node,
135
+ weight: {
136
+ position: current_position,
137
+ heading: heading_weight(current_lvl)
138
+ }
139
+ }
140
+ item[:uuid] = uuid(item)
141
+ items << item
142
+
143
+ current_position += 1
144
+ end
145
+
146
+ items
147
+ end
148
+ end
@@ -0,0 +1,5 @@
1
+ # Expose gem version
2
+ # rubocop:disable Style/SingleLineMethods
3
+ class HTMLHierarchyExtractorVersion
4
+ def self.to_s; '1.0.12' end
5
+ end
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: html-hierarchy-extractor
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.0.11
4
+ version: 1.0.12
5
5
  platform: ruby
6
6
  authors:
7
7
  - Tim Carry
@@ -197,7 +197,12 @@ email: tim@pixelastic.com
197
197
  executables: []
198
198
  extensions: []
199
199
  extra_rdoc_files: []
200
- files: []
200
+ files:
201
+ - CONTRIBUTING.md
202
+ - LICENSE.txt
203
+ - README.md
204
+ - lib/html-hierarchy-extractor.rb
205
+ - lib/version.rb
201
206
  homepage: https://github.com/pixelastic/html-hierarchy-extractor
202
207
  licenses:
203
208
  - MIT