algolia_html_extractor 2.2.2 → 2.3.0
- checksums.yaml +4 -4
- data/README.md +21 -19
- data/lib/algolia_html_extractor.rb +1 -1
- data/lib/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 2935d48da689f8b8064f82fe45d08294973b537b
+data.tar.gz: e4eaf98057448f77c4ffa4472e479d8f331ccf48
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 2b97bd8f04d84d81bf5adbc4063163c43e3b5c53f2cfa8297b3fac07d0b3542ea3ca135068dbf9e63b3444b5604a2cd34645f79328491b14c7445eeafc897ab8
+data.tar.gz: 1c7e7f1edfb75945ef179dffb5138f3fdb408273653e2c2612fcd7ff73c5d86eebf68aa1911bc65d4974ab867667b36a9d9e26a672ff50d94981e5eae6f88fbc
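The SHA1 and SHA512 entries in checksums.yaml are hex digests of the two archives packed inside the `.gem` file. As a minimal sketch, assuming `metadata.gz` and `data.tar.gz` have already been extracted from the gem into the current directory, they could be checked with Ruby's standard `digest` library. The helper name `checksum_matches?` is hypothetical and is not part of this gem or of RubyGems.

```ruby
require 'digest'

# Hypothetical helper, not part of the gem or of RubyGems.
# Returns true when the hex digest of the file at `path` equals `expected_hex`.
def checksum_matches?(path, expected_hex, algorithm = Digest::SHA512)
  algorithm.file(path).hexdigest == expected_hex
end

# Example usage, assuming data.tar.gz sits in the current directory:
#   checksum_matches?('data.tar.gz',
#                     'e4eaf98057448f77c4ffa4472e479d8f331ccf48',
#                     Digest::SHA1)
```

Passing the digest class as an argument lets the same helper cover both the SHA1 and the SHA512 entries.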
data/README.md
CHANGED
@@ -1,10 +1,11 @@
 # algolia_html_extractor
 
-[![
-
-[![
-[![
-![
+[![gem version][1]](https://rubygems.org/gems/algolia_html_extractor)
+![ruby][2]
+[![build master][3]](https://travis-ci.org/algolia/html-extractor)
+[![coverage master][4]](https://coveralls.io/github/algolia/html-extractor?branch=master)
+[![build develop][5]](https://travis-ci.org/algolia/html-extractor)
+[![coverage develop][6]](https://coveralls.io/github/algolia/html-extractor?branch=develop)
 
 This gem can convert HTML content into JSON records ready to be pushed to
 Algolia.
@@ -88,7 +89,7 @@ Here is one of the records extracted:
 :lvl5 => nil,
 :lvl6 => nil
 },
-:
+:custom_ranking => {
 :heading => 70,
 :position => 3
 }
@@ -99,13 +100,13 @@ Each record has a `objectID` that uniquely identify it (computed by a hash of al
 the other values).
 
 It also contains the HTML tag name in `tag_name` (by default `<p>`
-paragraphs are extracted, but see the [settings][
+paragraphs are extracted, but see the [settings][7] on how to change it).
 
 `html` contains the whole `outerContent` of the element, including the wrapping
 tags and inner children. The `text` attribute contains the textual content,
 stripping out all HTML.
 
-`node` contains the [Nokogiri node][
+`node` contains the [Nokogiri node][8] instance. The lib uses it internally to
 extract all the relevant information but is also exposed if you want to process
 the node further.
 
@@ -115,9 +116,9 @@ Anchors are searched in `name` and `id` attributes of headings.
 
 `hierarchy` then contains a snapshot of the current heading hierarchy of the
 paragraph. The `lvlX` syntax is used to be compatible with the records
-[DocSearch][
+[DocSearch][9] is using.
 
-The `
+The `custom_ranking` attribute is used to provide an easy way to rank two records
 relative to each other.
 
 - `heading` gives the depth level in the hierarchy where the record is. Records
@@ -148,7 +149,7 @@ and generic bug reports.
 ## Bug Reports and feature requests
 
 For any bug or ideas of new features, please start by checking in the
-[issues][
+[issues][10] tab if
 it hasn't been discussed already. If not, feel free to open a new issue.
 
 ## Pull Requests
@@ -217,11 +218,12 @@ convention. That's also why this gem directly starts at v2.0.
 
 
 [1]: https://badge.fury.io/rb/algolia_html_extractor.svg
-[2]: https://
-[3]: https://
-[4]: https://
-[5]: https://img.shields.io/badge/
-[6]:
-[7]:
-[8]:
-[9]: https://
+[2]: https://img.shields.io/badge/ruby-%3E%3D%202.3.0-green.svg
+[3]: https://img.shields.io/badge/dynamic/json.svg?label=build%3Amaster&query=value&uri=https%3A%2F%2Fimg.shields.io%2Ftravis%2Falgolia%2Fhtml-extractor.json%3Fbranch%3Dmaster
+[4]: https://img.shields.io/badge/dynamic/json.svg?label=coverage%3Amaster&colorB=&prefix=&suffix=%25&query=$.covered_percent&uri=https%3A%2F%2Fcoveralls.io%2Fgithub%2Falgolia%2Fhtml-extractor.json%3Fbranch%3Dmaster
+[5]: https://img.shields.io/badge/dynamic/json.svg?label=build%3Adevelop&query=value&uri=https%3A%2F%2Fimg.shields.io%2Ftravis%2Falgolia%2Fhtml-extractor.json%3Fbranch%3Ddevelop
+[6]: https://img.shields.io/badge/dynamic/json.svg?label=coverage%3Adevelop&colorB=&prefix=&suffix=%25&query=$.covered_percent&uri=https%3A%2F%2Fcoveralls.io%2Fgithub%2Falgolia%2Fhtml-extractor.json%3Fbranch%3Ddevelop
+[7]: #Settings
+[8]: http://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/Node
+[9]: https://community.algolia.com/docsearch/
+[10]: https://github.com/pixelastic/html-hierarchy-extractor/issues
data/lib/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: algolia_html_extractor
 version: !ruby/object:Gem::Version
-version: 2.
+version: 2.3.0
 platform: ruby
 authors:
 - Tim Carry
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2018-
+date: 2018-03-12 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
 name: awesome_print