html-hierarchy-extractor 1.0.11 → 1.0.12

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 4ae1be739dd1c3c223adf0a43dee8c2db518f79d
4
- data.tar.gz: 610da86bcb1051eaa2326cb3a0d9cf14fe82d790
3
+ metadata.gz: 0b21edfe3c30bf5534e1eba5acd9ce1e6d0ce940
4
+ data.tar.gz: 93333cb4915ad7d03335d69d8b4b2ea28644ab39
5
5
  SHA512:
6
- metadata.gz: d0ccd7c0fe90ada0c414f6af2d1b928d69390866e50409556e70a7545485adc4fba97c8988aa589de817d5005f7d80b4322a8d5f9b6c2657aa75924fd7ff493e
7
- data.tar.gz: c59ee344ad419cfb887b1d1fc6c45c33f0a774433799baae158320797db4b91d06ef151cb89eb4de36698345a0d3172e4c2a4251346a001ffbe501ae5271d399
6
+ metadata.gz: 4ab4be055f1c0270e665daf224d322a4eaa7dd72d2b07e80593835b75bd6980a2e383533feb60fb2ec905f818643f2f8ab72e2a90ffa96cb7e3fad90190eae46
7
+ data.tar.gz: eb0a01ba3102aa484b3386b0642304dd51c2eadb30dcf346edae7e3c6acf23a366212575e37ff7d0783be1d15fba44e4a45afa1ad6f9a73ccb3d0c8d9e717e92
@@ -0,0 +1,19 @@
1
+ ## Releasing
2
+
3
+ `rake build` will build
4
+
5
+ # Tagging and releasing
6
+
7
+ If you need to release a new version of the gem to RubyGems, follow
8
+ these steps:
9
+
10
+ ```
11
+ # Bump the version (in develop)
12
+ ./scripts/bump_version minor
13
+
14
+ # Update master and release
15
+ ./scripts/release
16
+
17
+ # Install the gem locally (optional)
18
+ rake install
19
+ ```
@@ -0,0 +1,20 @@
1
+ Copyright (c) 2016 Pixelastic
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,222 @@
1
+ # html-hierarchy-extractor
2
+
3
+ ## ⚠ DEPRECATION NOTICE
4
+
5
+ This gem has been deprecated in favor of [algolia_html_extractor][1]. No further
6
+ development will happen on this gem. The new gem took over where this one
7
+ stopped.
8
+
9
+ ## Description
10
+
11
+ This gem lets you extract the hierarchy of headings and content from any HTML
12
+ page into an array of elements.
13
+
14
+
15
+ Intended to be used with [Algolia][2] to improve relevance of search
16
+ results inside large HTML pages. The records created are compatible with the
17
+ [DocSearch][3] format.
18
+
19
+ ## Installation
20
+
21
+ ```ruby
22
+ # Gemfile
23
+ source 'http://rubygems.org'
24
+
25
+ gem 'html-hierarchy-extractor', '~> 1.0'
26
+ ```
27
+
28
+ ## How to use
29
+
30
+ ```ruby
31
+ require 'html-hierarchy-extractor'
32
+
33
+ content = File.read('./index.html')
34
+ page = HTMLHierarchyExtractor.new(content)
35
+ records = page.extract
36
+ puts records
37
+ ```
38
+
39
+ ## Records
40
+
41
+ `extract` will return an array of records. Each record will represent a `<p>`
42
+ paragraph of the initial text, along with its textual version (HTML removed),
43
+ heading hierarchy, and other interesting bits.
44
+
45
+ ## Example
46
+
47
+ Let's take the following HTML as input and see what records we get as output:
48
+
49
+ ```html
50
+ <!doctype html>
51
+ <html>
52
+ <body>
53
+ <h1 name="journey">The Hero's Journey</h1>
54
+ <p>Most stories always follow the same pattern.</p>
55
+ <h2 name="departure">Part One: Departure</h2>
56
+ <p>A story starts in a mundane world, and helps identify the hero. It helps put all the achievements of the story into perspective.</p>
57
+ <h3 name="calladventure">The call to Adventure</h3>
58
+ <p>Some out-of-the-ordinary event pushes the hero to start his journey.</p>
59
+ <h3 name="threshold">Crossing the Threshold</h3>
60
+ <p>The hero quits his job, hits the road, or whatever cuts him from his previous life.</p>
61
+ <h2 name="initiations">Part Two: Initiation</h2>
62
+ <h3 name="trials">The Road of Trials</h3>
63
+ <p>The road is filled with dangers. The hero has to find his inner strength to overcome them.</p>
64
+ <h3 name="ultimate">The Ultimate Boon</h3>
65
+ <p>The hero has found something, either physical or metaphorical that changes him.</p>
66
+ <h2 name="return">Part Three: Return</h2>
67
+ <h3 name="refusal">Refusal to Return</h3>
68
+ <p>The hero does not want to go back to his previous life at first. But then, an event will make him change his mind.</p>
69
+ <h3 name="master">Master of Two Worlds</h3>
70
+ <p>Armed with his new power/weapon, the hero can go back to its initial world and fix all the issues he had there.</p>
71
+ </body>
72
+ </html>
73
+ ```
74
+
75
+ Here is one of the records extracted:
76
+
77
+ ```ruby
78
+ {
79
+ :uuid => "1f5923d5a60e998704f201bbe9964811",
80
+ :tag_name => "p",
81
+ :html => "<p>The hero quits his job, hits the road, or whatever cuts him from his previous life.</p>",
82
+ :text => "The hero quits his job, hits the road, or whatever cuts him from his previous life.",
83
+ :node => #<Nokogiri::XML::Element:0x11a5850 name="p">,
84
+ :anchor => 'threshold',
85
+ :hierarchy => {
86
+ :lvl0 => "The Hero's Journey",
87
+ :lvl1 => "Part One: Departure",
88
+ :lvl2 => "Crossing the Threshold",
89
+ :lvl3 => nil,
90
+ :lvl4 => nil,
91
+ :lvl5 => nil,
92
+ :lvl6 => nil
93
+ },
94
+ :weight => {
95
+ :heading => 70,
96
+ :position => 3
97
+ }
98
+ }
99
+ ```
100
+
101
+ Each record has a `uuid` that uniquely identifies it (computed by a hash of all
102
+ the other values).
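
To illustrate how such an identifier can be derived, here is a minimal standalone sketch that mirrors the `uuid` method of `lib/html-hierarchy-extractor.rb` as a free function. The sample hashes `a` and `b` are hypothetical and only serve to show that key order does not influence the result.

```ruby
require 'digest/md5'

# Build a stable identifier from a hash: sort the keys, turn each entry
# into a "key=value" string (recursing into nested hashes), then MD5 it.
def uuid(item)
  pairs = item.keys.sort.map do |key|
    value = item[key]
    value = uuid(value) if value.is_a?(Hash)
    "#{key}=#{value}"
  end
  Digest::MD5.hexdigest(pairs.join(','))
end

# Hypothetical records with the same content but different key order
a = { text: 'foo', weight: { position: 0, heading: 90 } }
b = { weight: { heading: 90, position: 0 }, text: 'foo' }

puts uuid(a) == uuid(b)
```

Because the keys are sorted before hashing, two records with identical values get the same identifier, while changing any value yields a different one.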
103
+
104
+ It also contains the HTML tag name in `tag_name` (by default `<p>`
105
+ paragraphs are extracted, but see the [settings][4] on how to change it).
106
+
107
+ `html` contains the whole outer HTML of the element, including the wrapping
108
+ tags and inner children. The `text` attribute contains the textual content,
109
+ stripping out all HTML.
110
+
111
+ `node` contains the [Nokogiri node][5] instance. The lib uses it internally to
112
+ extract all the relevant information, but it is also exposed if you want to process
113
+ the node further.
114
+
115
+ The `anchor` attribute contains the HTML anchor closest to the element. Here it
116
+ is `threshold` because this is the closest anchor in the hierarchy above.
117
+ Anchors are searched in `name` and `id` attributes of headings.
118
+
119
+ `hierarchy` then contains a snapshot of the current heading hierarchy of the
120
+ paragraph. The `lvlX` syntax is used to be compatible with the records
121
+ [DocSearch][6] is using.
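
The snapshot logic can be sketched without Nokogiri. The example below is a simplified illustration, not the gem's code: the `nodes` list of `[tag, text]` pairs is a hypothetical stand-in for the DOM, and only `lvl0` to `lvl5` are tracked.

```ruby
# Hypothetical input: headings and paragraphs in document order
nodes = [
  ['h1', "The Hero's Journey"],
  ['h2', 'Part One: Departure'],
  ['h3', 'The call to Adventure'],
  ['h3', 'Crossing the Threshold'],
  ['p',  'The hero quits his job...'],
  ['h2', 'Part Two: Initiation'],
  ['p',  'A paragraph directly under an h2']
]

hierarchy = (0..5).map { |i| ["lvl#{i}".to_sym, nil] }.to_h
snapshots = []

nodes.each do |tag, text|
  if tag =~ /\Ah([1-6])\z/
    # A heading sets its own level and resets every deeper level
    lvl = Regexp.last_match(1).to_i - 1
    hierarchy["lvl#{lvl}".to_sym] = text
    ((lvl + 1)..5).each { |deeper| hierarchy["lvl#{deeper}".to_sym] = nil }
  else
    # A content node stores a copy of the hierarchy as it is right now
    snapshots << hierarchy.clone
  end
end

snapshots.each { |snapshot| p snapshot }
```

Cloning the hash matters here: without it, every record would point at the same object and end up with the hierarchy of the last heading seen.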
122
+
123
+ The `weight` attribute is used to provide an easy way to rank two records
124
+ relative to each other.
125
+
126
+ - `heading` gives the depth level in the hierarchy where the record is. Records
127
+ on top level will have a value of 100, those under an `h1` will have 90, and so
128
+ on. Because our record is under an `h3`, it has 70.
129
+ - `position` is the position of the paragraph in the page. Here our paragraph is
130
+ the fourth paragraph of the page, so it will have a `position` of 3. It can
131
+ help you give more weight to the first items in the page.
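
As a worked example of the two values above, the sketch below reproduces the heading weight rule as a standalone function and then ranks a few records with it. The `records` array and the sort order (heading weight first, then position) are assumptions made for illustration; the gem itself only exposes the two numbers.

```ruby
# heading_level is zero-based (h1 => 0, h2 => 1, ...), nil means no heading
def heading_weight(heading_level)
  return 100 if heading_level.nil?
  100 - ((heading_level + 1) * 10)
end

# Weights for: no heading, h1, h2, h3
weights = [nil, 0, 1, 2].map { |lvl| heading_weight(lvl) }
p weights

# Hypothetical records, ranked by highest heading weight, then earliest position
records = [
  { text: 'deep',  weight: { heading: 70, position: 3 } },
  { text: 'intro', weight: { heading: 90, position: 0 } },
  { text: 'later', weight: { heading: 90, position: 5 } }
]
ranked = records.sort_by { |r| [-r[:weight][:heading], r[:weight][:position]] }
p ranked.map { |r| r[:text] }
```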
132
+
133
+ ## Settings
134
+
135
+ When instantiating `HTMLHierarchyExtractor`, you can pass a second `options`
136
+ argument. This argument accepts one key, `css_selector`.
137
+
138
+ ```ruby
139
+ page = HTMLHierarchyExtractor.new(content, options: { css_selector: 'p,li' })
140
+ ```
141
+
142
+ This lets you change the default selector. Here, in addition to `<p>` paragraphs,
143
+ the library will extract `<li>` list elements as well.
144
+
145
+ # CONTRIBUTING
146
+
147
+ I'm happy you'd like to contribute. All contributions are welcome, ranging from
148
+ feature requests to pull requests, but also including typo fixing, documentation
149
+ and generic bug reports.
150
+
151
+ ## Bug Reports and feature requests
152
+
153
+ For any bug or new feature idea, please start by checking the
154
+ [issues][7] tab to see if
155
+ it has already been discussed. If not, feel free to open a new issue.
156
+
157
+ ## Pull Requests
158
+
159
+ All PRs are welcome, from small typo fixes to large codebase changes. If you
160
+ think you'll need to change a lot of code in a lot of files, I suggest you
161
+ open an issue first so we can discuss before you start working on something.
162
+
163
+ All PRs should be based on the `develop` branch (`master` only ever contains the
164
+ last released change).
165
+
166
+ ## Git Hooks
167
+
168
+ If you start working on the actual code, you should install the git hooks.
169
+
170
+ ```
171
+ cp ./scripts/git_hooks/* ./.git/hooks
172
+ ```
173
+
174
+ This will add `pre-commit` and `pre-push` scripts that will respectively check
175
+ that all files are lint-free before committing, and that all tests pass before
176
+ pushing. If either of those two hooks gives you errors, you should fix the code
177
+ before committing or pushing.
178
+
179
+ Having those steps helps keep the codebase as clean as possible, and
180
+ avoids polluting PR discussions with style issues.
181
+
182
+ ## Development
183
+
184
+ First thing you should do to get all your dependencies up to date is run `bundle
185
+ install` before running any other command.
186
+
187
+ ## Lint
188
+
189
+ `rake lint` will check all the files for potential linting issues. It uses
190
+ Rubocop, and the configuration can be found in `.rubocop.yml`.
191
+
192
+ ## Test
193
+
194
+ `rake test` will run all the tests.
195
+
196
+ `rake coverage` will do the same, while also adding the code coverage files to
197
+ `./coverage`. This should be useful in a CI environment.
198
+
199
+ `rake watch` will run Guard that will do a live run of all your tests. Every
200
+ update to a file (code or test) will re-run all the bound tests. This is highly
201
+ recommended for TDD.
202
+
203
+ ## Using a local version of the gem
204
+
205
+ If you want to test a local version of the gem in your local project, I suggest
206
+ updating your project `Gemfile` to point to the correct local directory:
207
+
208
+ ```ruby
209
+ gem "html-hierarchy-extractor", :path => "/path/to/local/gem/folder"
210
+ ```
211
+
212
+ You should also run `rake gemspec` from the `html-hierarchy-extractor`
213
+ repository the first time and if you added/deleted any file or dependency.
214
+
215
+
216
+ [1]: https://github.com/algolia/html-extractor
217
+ [2]: https://www.algolia.com/
218
+ [3]: https://community.algolia.com/docsearch/
219
+ [4]: #Settings
220
+ [5]: http://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/Node
221
+ [6]: https://community.algolia.com/docsearch/
222
+ [7]: https://github.com/pixelastic/html-hierarchy-extractor/issues
@@ -0,0 +1,148 @@
1
+ require 'nokogiri'
2
+ require 'digest/md5'
3
+
4
+ # Extract content from an HTML page in the form of items with associated
5
+ # hierarchy data
6
+ class HTMLHierarchyExtractor
7
+ def initialize(input, options: {})
8
+ @dom = Nokogiri::HTML(input)
9
+ default_options = {
10
+ css_selector: 'p'
11
+ }
12
+ @options = default_options.merge(options)
13
+
14
+ warn '[DEPRECATION] The gem html-hierarchy-extractor has been renamed '\
15
+ 'to algolia_html_extractor and will no longer be supported. '\
16
+ 'Please switch to algolia_html_extractor as soon as possible.'
17
+ end
18
+
19
+ # Returns the outer HTML of a given node
20
+ #
21
+ # eg.
22
+ # <p>foo</p> => <p>foo</p>
23
+ def extract_html(node)
24
+ node.to_s.strip
25
+ end
26
+
27
+ # Returns the text content of a given node (HTML tags removed)
28
+ #
29
+ # eg.
30
+ # <p>foo</p> => foo
31
+ def extract_text(node)
32
+ node.content
33
+ end
34
+
35
+ # Returns the tag name of a given node
36
+ #
37
+ # eg
38
+ # <p>foo</p> => p
39
+ def extract_tag_name(node)
40
+ node.name.downcase
41
+ end
42
+
43
+ # Returns the anchor to the node
44
+ #
45
+ # eg.
46
+ # <h1 name="anchor">Foo</h1> => anchor
47
+ # <h1 id="anchor">Foo</h1> => anchor
48
+ # <h1><a name="anchor">Foo</a></h1> => anchor
49
+ def extract_anchor(node)
50
+ anchor = node.attr('name') || node.attr('id') || nil
51
+ return anchor unless anchor.nil?
52
+
53
+ # No anchor found directly in the header, search on children
54
+ subelement = node.css('[name],[id]')
55
+ return extract_anchor(subelement[0]) unless subelement.empty?
56
+
57
+ nil
58
+ end
59
+
60
+ ##
61
+ # Generate a unique identifier for the item
62
+ def uuid(item)
63
+ # We first get all the keys of the object, sorted alphabetically...
64
+ ordered_keys = item.keys.sort
65
+
66
+ # ...then we build a huge array of "key=value" pairs...
67
+ ordered_array = ordered_keys.map do |key|
68
+ value = item[key]
69
+ # We apply the method recursively on other hashes
70
+ value = uuid(value) if value.is_a?(Hash)
71
+ "#{key}=#{value}"
72
+ end
73
+
74
+ # ...then we build a unique md5 hash of it
75
+ Digest::MD5.hexdigest(ordered_array.join(','))
76
+ end
77
+
78
+ ##
79
+ # Get a relative numeric value of the importance of the heading
80
+ # 100 for top level, then -10 per heading
81
+ def heading_weight(heading_level)
82
+ weight = 100
83
+ return weight if heading_level.nil?
84
+ weight - ((heading_level + 1) * 10)
85
+ end
86
+
87
+ def extract
88
+ heading_selector = 'h1,h2,h3,h4,h5,h6'
89
+ # We select all nodes that match either the headings or the elements to
90
+ # extract. This will allow us to loop over them in the order they appear in the DOM
91
+ all_selector = "#{heading_selector},#{@options[:css_selector]}"
92
+
93
+ items = []
94
+ current_hierarchy = {
95
+ lvl0: nil,
96
+ lvl1: nil,
97
+ lvl2: nil,
98
+ lvl3: nil,
99
+ lvl4: nil,
100
+ lvl5: nil
101
+ }
102
+ current_position = 0 # Position of the DOM node in the tree
103
+ current_lvl = nil # Current closest hierarchy level
104
+ current_anchor = nil # Current closest anchor
105
+
106
+ @dom.css(all_selector).each do |node|
107
+ # If it's a heading, we update our current hierarchy
108
+ if node.matches?(heading_selector)
109
+ # Which level heading is it?
110
+ current_lvl = extract_tag_name(node).gsub(/^h/, '').to_i - 1
111
+ # Update this level, and set all the following ones to nil
112
+ current_hierarchy["lvl#{current_lvl}".to_sym] = extract_text(node)
113
+ (current_lvl + 1..6).each do |lvl|
114
+ current_hierarchy["lvl#{lvl}".to_sym] = nil
115
+ end
116
+ # Update the anchor, if the new heading has one
117
+ new_anchor = extract_anchor(node)
118
+ current_anchor = new_anchor if new_anchor
119
+ end
120
+
121
+ # Skip the node if it is not to be extracted
122
+ next unless node.matches?(@options[:css_selector])
123
+
124
+ # Skip the node if it is empty
125
+ text = extract_text(node)
126
+ next if text.empty?
127
+
128
+ item = {
129
+ html: extract_html(node),
130
+ text: text,
131
+ tag_name: extract_tag_name(node),
132
+ hierarchy: current_hierarchy.clone,
133
+ anchor: current_anchor,
134
+ node: node,
135
+ weight: {
136
+ position: current_position,
137
+ heading: heading_weight(current_lvl)
138
+ }
139
+ }
140
+ item[:uuid] = uuid(item)
141
+ items << item
142
+
143
+ current_position += 1
144
+ end
145
+
146
+ items
147
+ end
148
+ end
@@ -0,0 +1,5 @@
1
+ # Expose gem version
2
+ # rubocop:disable Style/SingleLineMethods
3
+ class HTMLHierarchyExtractorVersion
4
+ def self.to_s; '1.0.12' end
5
+ end
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: html-hierarchy-extractor
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.0.11
4
+ version: 1.0.12
5
5
  platform: ruby
6
6
  authors:
7
7
  - Tim Carry
@@ -197,7 +197,12 @@ email: tim@pixelastic.com
197
197
  executables: []
198
198
  extensions: []
199
199
  extra_rdoc_files: []
200
- files: []
200
+ files:
201
+ - CONTRIBUTING.md
202
+ - LICENSE.txt
203
+ - README.md
204
+ - lib/html-hierarchy-extractor.rb
205
+ - lib/version.rb
201
206
  homepage: https://github.com/pixelastic/html-hierarchy-extractor
202
207
  licenses:
203
208
  - MIT