algolia_html_extractor 2.1.0 → 2.2.0
- checksums.yaml +4 -4
- data/README.md +21 -11
- data/lib/algolia_html_extractor.rb +3 -3
- data/lib/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: bbf8df27c69c4d6f2f16de4bd7cf18fcd703fb43
+  data.tar.gz: a01708af7fe1a3c42d364a099e443ac05f6f8a75
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 9d9d8af70a4310d871a96fd34a789de3ce0df0ba4621cf237727fcc514dbbfb9fd3d26a35ae3df6fd9b6574752e290d4254bdea7f1622cadba99a07a6a870adf
+  data.tar.gz: e74cc7ca6db7fddc84c903715a44c70df47fb27f303ee1635579b89f47269fab168e9933582fef73269ad0e24fdeae97caa5c1924c57a0553242c33407f7492c
data/README.md
CHANGED
@@ -1,5 +1,11 @@
 # algolia_html_extractor
 
+[![Gem Version][1]](http://badge.fury.io/rb/algolia_html_extractor)
+[![Build Status][2]](https://travis-ci.org/algolia/html-extractor)
+[![Coverage Status][3]](https://coveralls.io/github/algolia/html-extractor?branch=master)
+[![Code Climate][4]](https://codeclimate.com/github/algolia/html-extractor)
+![Ruby >= 2.3.0][5]
+
 This gem can convert HTML content into JSON records ready to be pushed to
 Algolia.
 
@@ -93,13 +99,13 @@ Each record has a `objectID` that uniquely identify it (computed by a hash of al
 the other values).
 
 It also contains the HTML tag name in `tag_name` (by default `<p>`
-paragraphs are extracted, but see the [settings][
+paragraphs are extracted, but see the [settings][6] on how to change it).
 
 `html` contains the whole `outerContent` of the element, including the wrapping
 tags and inner children. The `text` attribute contains the textual content,
 stripping out all HTML.
 
-`node` contains the [Nokogiri node][
+`node` contains the [Nokogiri node][7] instance. The lib uses it internally to
 extract all the relevant information but is also exposed if you want to process
 the node further.
 
@@ -109,7 +115,7 @@ Anchors are searched in `name` and `id` attributes of headings.
 
 `hierarchy` then contains a snapshot of the current heading hierarchy of the
 paragraph. The `lvlX` syntax is used to be compatible with the records
-[DocSearch][
+[DocSearch][8] is using.
 
 The `weight` attribute is used to provide an easy way to rank two records
 relative to each other.
@@ -142,7 +148,7 @@ and generic bug reports.
 ## Bug Reports and feature requests
 
 For any bug or ideas of new features, please start by checking in the
-[issues]
+[issues][9] tab if
 it hasn't been discussed already. If not, feel free to open a new issue.
 
 ## Pull Requests
@@ -165,7 +171,7 @@ cp ./scripts/git_hooks/* ./.git/hooks
 This will add a `pre-commit` and `pre-push` scripts that will respectively check
 that all files are lint-free before committing, and pass all tests before
 pushing. If any of those two hooks give your errors, you should fix the code
-before
+before committing or pushing.
 
 Having those steps helps keeping the codebase clean as much as possible, and
 avoid polluting discussion in PR about style.
@@ -182,7 +188,7 @@ Rubocop, and the configuration can be found in `.rubocop.yml`.
 
 ## Test
 
-`rake test` will run all the tests.
+`rake test` will run all the tests.
 
 `rake coverage` will do the same, but also adding the code coverage files to
 `./coverage`. This should be useful in a CI environment.
@@ -210,8 +216,12 @@ This gem was previously named `html-hierarchy-extractor` but has been renamed to
 convention. That's also why this gem directly starts at v2.0.
 
 
-[1]: https://
-[2]: https://
-[3]:
-[4]:
-[5]: https://
+[1]: https://badge.fury.io/rb/algolia_html_extractor.svg
+[2]: https://travis-ci.org/algolia/html-extractor.svg?branch=master
+[3]: https://coveralls.io/repos/algolia/html-extractor/badge.svg?branch=master&service=github
+[4]: https://codeclimate.com/github/algolia/html-extractor/badges/gpa.svg
+[5]: https://img.shields.io/badge/ruby-%3E%3D%202.3.0-green.svg
+[6]: #Settings
+[7]: http://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/Node
+[8]: https://community.algolia.com/docsearch/
+[9]: https://github.com/pixelastic/html-hierarchy-extractor/issues
data/lib/algolia_html_extractor.rb
CHANGED
@@ -118,12 +118,12 @@ class AlgoliaHTMLExtractor
 next unless node.matches?(@options[:css_selector])
 
 # Stop if node is empty
-
-next if
+content = extract_text(node)
+next if content.empty?
 
 item = {
   html: extract_html(node),
-
+  content: content,
   tag_name: extract_tag_name(node),
   hierarchy: current_hierarchy.clone,
   anchor: current_anchor,
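The added lines in this hunk extract the node text once, skip the node when that text is empty, and reuse the same value in the record; `current_hierarchy.clone` stores a snapshot so that later headings do not change records already built. A minimal, self-contained sketch of that pattern follows. It is not the gem's implementation: `FakeNode`, `build_records`, the body of `extract_text`, and the single `lvl0` level driven by `h1` tags are simplifying assumptions, and the real code works on Nokogiri nodes.

```ruby
# Stand-in for a parsed HTML node (assumption: the gem uses Nokogiri nodes).
FakeNode = Struct.new(:tag_name, :text)

# Hypothetical equivalent of extract_text: the trimmed textual content.
def extract_text(node)
  node.text.to_s.strip
end

# Hypothetical driver: walks the nodes in document order and builds records.
def build_records(nodes)
  current_hierarchy = { lvl0: nil }
  records = []

  nodes.each do |node|
    # Simplification: only <h1> headings update the hierarchy.
    if node.tag_name == 'h1'
      current_hierarchy[:lvl0] = extract_text(node)
      next
    end

    # Extract once, stop if the node is empty, then reuse the value.
    content = extract_text(node)
    next if content.empty?

    records << {
      content: content,
      tag_name: node.tag_name,
      # clone keeps a snapshot; later headings won't alter this record.
      hierarchy: current_hierarchy.clone
    }
  end

  records
end
```

Storing the extracted text in a local variable avoids calling the extraction twice per node, once for the emptiness check and once for the record.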
data/lib/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: algolia_html_extractor
 version: !ruby/object:Gem::Version
-  version: 2.
+  version: 2.2.0
 platform: ruby
 authors:
 - Tim Carry
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2017-
+date: 2017-12-19 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: awesome_print