mechanizer 1.10 → 1.11
- checksums.yaml +4 -4
- data/README.md +13 -11
- data/Rakefile +1 -1
- data/lib/mechanizer/noko.rb +5 -2
- data/lib/mechanizer/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: ef9ac4cda832d55e693e46c28e9dd569cad29e096545338ea364bee468f7f02f
+  data.tar.gz: b58af71643298b06fb45ee3cbb2aaa2df3abc2579826e42c6dc197d739aa8943
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: e3034c83bdbb741c0a3f65ec798edf4daf7eff600a233ae6bc301e241ee85b774538c7725e51893375e390e9c17c120683dca1b22986aa1dcf237e6eb19b7901
+  data.tar.gz: d8ca61e01ce5c34fbb906aea820e16d947f79ebcf75bc64cddfc7bf1fe9dbaa744438cb507e9551eceda67f724e9ebb256b3285a481228b2f6b6f3ce5cb84312
data/README.md
CHANGED
@@ -7,13 +7,15 @@

 Light, easy to use wrapper for Mechanize and NokoGiri. No configuration or error handling to worry about. Simply enter the target URL and Mechanizer scrapes the page for you to easily parse.

-
+### Recommended Gems
 Note: URL MUST be in proper format and be valid, example:
+
 Correct: https://www.example.com
+
 Incorrect: www.example.com, example.com, https://example.com

-
-
+1. If you need to pre-format your URLs, try using `CrmFormatter gem`
+2. If you need to verify your URLs, try using `UrlVerifier gem`, which includes the `CrmFormatter gem` inside of it.

 Then, feed the results from those gems into this gem. The documentation below assumes the URLs are correctly formatted and have been verified before passing them through the `Mechanizer gem`.

@@ -35,14 +37,14 @@ Or install it yourself as:

 ## Usage

-
+### 1. Instantiate & Pass URL

 ```
 noko = Mechanizer::Noko.new
 noko_hash = noko.scrape({url: 'https://www.wikipedia.org'})
 ```

-
+### 2. To Customize Timeout:
 Default timeout is set to 60. You can adjust that time or omit it if 60 is fine.

 ```
@@ -51,7 +53,7 @@ args = {url: 'https://www.wikipedia.org', timeout: 30}
 noko_hash = noko.scrape(args)
 ```

-
+### 3. Noko Result in Hash Format

 ```
 err_msg = noko_hash[:err_msg]
@@ -59,7 +61,7 @@ page = noko_hash[:page]
 texts_and_hrefs = noko_hash[:texts_and_hrefs]
 ```

-
+### 4. Example Texts & Hrefs:

 ```
 texts_and_hrefs = [
@@ -73,17 +75,17 @@ texts_and_hrefs = [
 ]
 ```

-
+### 5. Example Parsing Page:
 There are several ways to parse and manipulate `noko_hash[:page]`. Essentially, you can parse the page using its css classes and html tags. You can use either or both together. Some pages are very straight forward, but others can require a lot of skill. Here is a good reference guide: [Nokogiri Tutorials](http://www.nokogiri.org/tutorials). All Nokogiri methods are available through this wrapper. This wrapper simply helps you avoid setting up, manages and reduces errors, and helps to automate your scraping process.

-
+For the Wikipedia URL in the example above, at the time of this README there is a group of icons on its homepage. If you right-click on any of them you can inspect. Look for any classes that interest you. In this example, it's `.other-project`. Simply paste it like below to get started. Remember, there are several ways to do this, so read the docs and explore what's available.

 ```
 other_projects = page.css('.other-project')&.text
 other_projects = other_projects.split("\n").reject(&:blank?)
 ```

-
+### 6. Results from Parsing Page (from example 5):

 ```
 other_projects = [
@@ -114,7 +116,7 @@ other_projects = [
 ]
 ```

-
+### 7. Automating Your Scraping:
 You may wish to automate your scraping for various reasons including:

 * Verifing Inventory Items and Pricing (car dealers, retail, menus, etc.),
data/Rakefile
CHANGED
@@ -26,6 +26,7 @@ end
 def run_mechanizer
   noko = Mechanizer::Noko.new
   args = {url: 'https://www.wikipedia.org', timeout: 30}
+  # args = {url: 'wikipedia', timeout: 30}
   noko_hash = noko.scrape(args)

   err_msg = noko_hash[:err_msg]
@@ -34,5 +35,4 @@ def run_mechanizer

   other_projects = page.css('.other-project')&.text
   other_projects = other_projects.split("\n").reject(&:blank?)
-
 end
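Review note: the commented-out `args` line in the Rakefile passes `'wikipedia'`, which is not an absolute URL, and the README states that URLs must be correctly formatted before they reach this gem. A minimal sketch of such a pre-check, using only Ruby's standard `uri` library, might look like the following. The helper name `absolute_url?` is hypothetical and is not part of mechanizer, CrmFormatter, or UrlVerifier.

```ruby
require 'uri'

# Hypothetical helper (an assumption for illustration, not part of the gem).
# Returns true only when the string parses as a URI that has both a scheme
# and a host; unparseable strings are treated as not absolute.
def absolute_url?(url)
  uri = URI.parse(url)
  uri.absolute? && !uri.host.nil?
rescue URI::InvalidURIError
  false
end

puts absolute_url?('https://www.wikipedia.org')
puts absolute_url?('wikipedia')
```

This only checks the shape of the string. It does not enforce the README's stricter `www.` convention and does not confirm that the host resolves.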
data/lib/mechanizer/noko.rb
CHANGED
@@ -5,7 +5,6 @@
 # require 'open-uri'
 # require 'whois'
 # require 'delayed_job'
-#
 # require 'timeout'
 # require 'net/ping'

@@ -72,7 +71,9 @@ module Mechanizer
 end

 def pre_noko_msg(url)
-
+  msg = "\n\n#{'='*40}\nSCRAPING: #{url}\nMax Wait Set: #{@timeout} Seconds\n\n"
+  puts msg
+  msg
 end

 def error_parser(err_msg)
@@ -86,6 +87,8 @@ module Mechanizer
     err_msg = "Error: TCP"
   elsif err_msg.include?("execution expired")
     err_msg = "Error: Runtime"
+  elsif err_msg.include?("absolute URL needed")
+    err_msg = "Error: URL Not Absolute"
   else
     err_msg = "Error: Undefined"
   end
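Review note: the two behavioral changes in `noko.rb` are that `pre_noko_msg` now returns the banner string as well as printing it, and that `error_parser` gains a branch for messages containing "absolute URL needed". The standalone sketch below mirrors only the branches visible in the hunks above. The method names `parse_error` and `pre_scrape_message` are hypothetical, the timeout is passed as a parameter instead of being read from `@timeout`, and the condition guarding "Error: TCP" is omitted because it is not shown in the diff.

```ruby
# Standalone sketch of the visible branches; not the gem's actual method.
# Maps a raw exception message to a short error label.
def parse_error(err_msg)
  if err_msg.include?('execution expired')
    'Error: Runtime'
  elsif err_msg.include?('absolute URL needed')
    'Error: URL Not Absolute'
  else
    'Error: Undefined'
  end
end

# Builds the banner, prints it, and returns it so a caller can log it
# or assert on it in a test.
def pre_scrape_message(url, timeout)
  msg = "\n\n#{'=' * 40}\nSCRAPING: #{url}\nMax Wait Set: #{timeout} Seconds\n\n"
  puts msg
  msg
end

# The input string here is an assumed example of a raw error message.
puts parse_error('absolute URL needed (not wikipedia)')
banner = pre_scrape_message('https://www.wikipedia.org', 30)
```

Returning the string from the banner method, as the 1.11 change does, makes the output testable without capturing standard output.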
data/lib/mechanizer/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: mechanizer
 version: !ruby/object:Gem::Version
-  version: '1.
+  version: '1.11'
 platform: ruby
 authors:
 - Adam Booth
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2018-07-
+date: 2018-07-04 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activesupport