algolia_html_extractor 2.1.0 → 2.2.0
- checksums.yaml +4 -4
- data/README.md +21 -11
- data/lib/algolia_html_extractor.rb +3 -3
- data/lib/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-metadata.gz:
-data.tar.gz:
+metadata.gz: bbf8df27c69c4d6f2f16de4bd7cf18fcd703fb43
+data.tar.gz: a01708af7fe1a3c42d364a099e443ac05f6f8a75
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 9d9d8af70a4310d871a96fd34a789de3ce0df0ba4621cf237727fcc514dbbfb9fd3d26a35ae3df6fd9b6574752e290d4254bdea7f1622cadba99a07a6a870adf
+data.tar.gz: e74cc7ca6db7fddc84c903715a44c70df47fb27f303ee1635579b89f47269fab168e9933582fef73269ad0e24fdeae97caa5c1924c57a0553242c33407f7492c
data/README.md
CHANGED
@@ -1,5 +1,11 @@
 # algolia_html_extractor

+[![Gem Version][1]](http://badge.fury.io/rb/algolia_html_extractor)
+[![Build Status][2]](https://travis-ci.org/algolia/html-extractor)
+[![Coverage Status][3]](https://coveralls.io/github/algolia/html-extractor?branch=master)
+[![Code Climate][4]](https://codeclimate.com/github/algolia/html-extractor)
+![Ruby >= 2.3.0][5]
+
 This gem can convert HTML content into JSON records ready to be pushed to
 Algolia.

@@ -93,13 +99,13 @@ Each record has a `objectID` that uniquely identify it (computed by a hash of al
 the other values).

 It also contains the HTML tag name in `tag_name` (by default `<p>`
-paragraphs are extracted, but see the [settings][
+paragraphs are extracted, but see the [settings][6] on how to change it).

 `html` contains the whole `outerContent` of the element, including the wrapping
 tags and inner children. The `text` attribute contains the textual content,
 stripping out all HTML.

-`node` contains the [Nokogiri node][
+`node` contains the [Nokogiri node][7] instance. The lib uses it internally to
 extract all the relevant information but is also exposed if you want to process
 the node further.

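Note on the hunk above: the README states that each record's `objectID` is computed by a hash of all the other values. The following is a minimal sketch of that idea only. The helper name `object_id_for`, the MD5 digest and the JSON serialization are all assumptions made for illustration; the diff does not show how the gem actually computes the value.

```ruby
require 'digest'
require 'json'

# Hypothetical helper: derive a stable identifier from a record's other values.
# MD5 and JSON serialization are assumptions for this sketch, not the gem's
# confirmed implementation.
def object_id_for(record)
  # Ignore any existing objectID and sort keys so that the digest does not
  # depend on the order in which the keys were inserted.
  payload = record.reject { |key, _| key == :objectID }
                  .sort_by { |key, _| key.to_s }
                  .to_h
  Digest::MD5.hexdigest(JSON.generate(payload))
end

record = { html: '<p>Hello</p>', tag_name: 'p', anchor: 'intro' }
record[:objectID] = object_id_for(record)
```

Because the identifier depends only on the other values, two records with identical content get the same `objectID`, and changing any value changes it.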
@@ -109,7 +115,7 @@ Anchors are searched in `name` and `id` attributes of headings.

 `hierarchy` then contains a snapshot of the current heading hierarchy of the
 paragraph. The `lvlX` syntax is used to be compatible with the records
-[DocSearch][
+[DocSearch][8] is using.

 The `weight` attribute is used to provide an easy way to rank two records
 relative to each other.
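Note on the hunk above: the README describes `hierarchy` as a snapshot of the current heading hierarchy, keyed with the `lvlX` syntax. The sketch below illustrates one way such a snapshot could be tracked. It is dependency-free on purpose: the input is a hypothetical flat list of `[tag_name, text]` pairs rather than Nokogiri nodes, and the mapping of `h1`..`h6` to `lvl0`..`lvl5` and the name `records_with_hierarchy` are assumptions, not taken from the diff.

```ruby
# Hypothetical input: [tag_name, text] pairs in document order.
# Assumed mapping: h1 => lvl0, h2 => lvl1, ... h6 => lvl5.
def records_with_hierarchy(nodes)
  current = { lvl0: nil, lvl1: nil, lvl2: nil, lvl3: nil, lvl4: nil, lvl5: nil }
  records = []

  nodes.each do |tag_name, text|
    if (match = /\Ah([1-6])\z/.match(tag_name))
      level = match[1].to_i - 1
      current[:"lvl#{level}"] = text
      # A new heading resets every deeper level
      ((level + 1)..5).each { |deeper| current[:"lvl#{deeper}"] = nil }
      next
    end

    # clone takes a snapshot, so later headings do not alter this record
    records << { tag_name: tag_name, content: text, hierarchy: current.clone }
  end

  records
end

records = records_with_hierarchy(
  [%w[h1 Intro], %w[p First], %w[h2 Details], %w[p Second], %w[h1 Other], %w[p Third]]
)
```

The `clone` call mirrors the `hierarchy: current_hierarchy.clone` line visible in the `lib` hunk of this diff: without it, every record would share one hash and end up with the last heading state.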
@@ -142,7 +148,7 @@ and generic bug reports.
 ## Bug Reports and feature requests

 For any bug or ideas of new features, please start by checking in the
-[issues]
+[issues][9] tab if
 it hasn't been discussed already. If not, feel free to open a new issue.

 ## Pull Requests
@@ -165,7 +171,7 @@ cp ./scripts/git_hooks/* ./.git/hooks
 This will add a `pre-commit` and `pre-push` scripts that will respectively check
 that all files are lint-free before committing, and pass all tests before
 pushing. If any of those two hooks give your errors, you should fix the code
-before
+before committing or pushing.

 Having those steps helps keeping the codebase clean as much as possible, and
 avoid polluting discussion in PR about style.
@@ -182,7 +188,7 @@ Rubocop, and the configuration can be found in `.rubocop.yml`.

 ## Test

-`rake test` will run all the tests.
+`rake test` will run all the tests.

 `rake coverage` will do the same, but also adding the code coverage files to
 `./coverage`. This should be useful in a CI environment.
@@ -210,8 +216,12 @@ This gem was previously named `html-hierarchy-extractor` but has been renamed to
 convention. That's also why this gem directly starts at v2.0.


-[1]: https://
-[2]: https://
-[3]:
-[4]:
-[5]: https://
+[1]: https://badge.fury.io/rb/algolia_html_extractor.svg
+[2]: https://travis-ci.org/algolia/html-extractor.svg?branch=master
+[3]: https://coveralls.io/repos/algolia/html-extractor/badge.svg?branch=master&service=github
+[4]: https://codeclimate.com/github/algolia/html-extractor/badges/gpa.svg
+[5]: https://img.shields.io/badge/ruby-%3E%3D%202.3.0-green.svg
+[6]: #Settings
+[7]: http://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/Node
+[8]: https://community.algolia.com/docsearch/
+[9]: https://github.com/pixelastic/html-hierarchy-extractor/issues
data/lib/algolia_html_extractor.rb
CHANGED
@@ -118,12 +118,12 @@ class AlgoliaHTMLExtractor
 next unless node.matches?(@options[:css_selector])

 # Stop if node is empty
-
-next if
+content = extract_text(node)
+next if content.empty?

 item = {
 html: extract_html(node),
-
+content: content,
 tag_name: extract_tag_name(node),
 hierarchy: current_hierarchy.clone,
 anchor: current_anchor,
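Note on the hunk above: the added lines compute the node's text once, skip the node when that text is empty, and reuse the same value under the `content` key. The sketch below reproduces that pattern in isolation. The `extract_text` stand-in, the hash-based "nodes", the `strip` call and the `build_items` name are assumptions so that the example runs without Nokogiri; only the extract-once, skip-if-empty, reuse flow comes from the diff.

```ruby
# Stand-in for the gem's extract_text, which operates on a Nokogiri node.
# Here a "node" is a plain hash with a :text key (assumption for the sketch).
def extract_text(node)
  node[:text].to_s.strip
end

# Hypothetical driver showing the pattern from the hunk.
def build_items(nodes)
  nodes.each_with_object([]) do |node, items|
    # Extract once, then reuse for both the emptiness check and the record
    content = extract_text(node)
    next if content.empty?

    items << { tag_name: node[:tag_name], content: content }
  end
end

items = build_items(
  [
    { tag_name: 'p', text: ' Hello ' },
    { tag_name: 'p', text: '   ' },
    { tag_name: 'p', text: nil }
  ]
)
```

Storing the result in a local variable avoids extracting the text twice and guarantees that the emptiness check and the stored value agree.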
data/lib/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: algolia_html_extractor
 version: !ruby/object:Gem::Version
-version: 2.
+version: 2.2.0
 platform: ruby
 authors:
 - Tim Carry
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2017-
+date: 2017-12-19 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
 name: awesome_print