http_crawler 0.2.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+   metadata.gz: f034023a8c50c41be3d4e423d39fa2ad44e75930
+   data.tar.gz: 745c8a86a328387f8b8c7ef68e351c5d63d4d61a
+ SHA512:
+   metadata.gz: aab1febfdc72a126e9edb1b496b661d236d3ae6689d5bf29350d4bffdbb4e9551e03487c2a7fc01d6ec7f15ff55d0872db8f90fb791474834982168aca88e531
+   data.tar.gz: 317de9a4ef0d5423b57de20cb9220cdecd598d431d67f03b025fd05c0a2ce2d013eec4e3da816ce50a35722c161aef01a90ee376add0e0ebc06c779b0e296370
data/.gitignore ADDED
@@ -0,0 +1,9 @@
+ /.bundle/
+ /.yardoc
+ /Gemfile.lock
+ /_yardoc/
+ /coverage/
+ /doc/
+ /pkg/
+ /spec/reports/
+ /tmp/
data/.idea/vcs.xml ADDED
@@ -0,0 +1,6 @@
+ <?xml version="1.0" encoding="UTF-8"?>
+ <project version="4">
+   <component name="VcsDirectoryMappings">
+     <mapping directory="$PROJECT_DIR$" vcs="Git" />
+   </component>
+ </project>
data/.rspec ADDED
@@ -0,0 +1 @@
+ --require spec_helper
data/CODE_OF_CONDUCT.md ADDED
@@ -0,0 +1,74 @@
1
+ # Contributor Covenant Code of Conduct
2
+
3
+ ## Our Pledge
4
+
5
+ In the interest of fostering an open and welcoming environment, we as
6
+ contributors and maintainers pledge to making participation in our project and
7
+ our community a harassment-free experience for everyone, regardless of age, body
8
+ size, disability, ethnicity, gender identity and expression, level of experience,
9
+ nationality, personal appearance, race, religion, or sexual identity and
10
+ orientation.
11
+
12
+ ## Our Standards
13
+
14
+ Examples of behavior that contributes to creating a positive environment
15
+ include:
16
+
17
+ * Using welcoming and inclusive language
18
+ * Being respectful of differing viewpoints and experiences
19
+ * Gracefully accepting constructive criticism
20
+ * Focusing on what is best for the community
21
+ * Showing empathy towards other community members
22
+
23
+ Examples of unacceptable behavior by participants include:
24
+
25
+ * The use of sexualized language or imagery and unwelcome sexual attention or
26
+ advances
27
+ * Trolling, insulting/derogatory comments, and personal or political attacks
28
+ * Public or private harassment
29
+ * Publishing others' private information, such as a physical or electronic
30
+ address, without explicit permission
31
+ * Other conduct which could reasonably be considered inappropriate in a
32
+ professional setting
33
+
34
+ ## Our Responsibilities
35
+
36
+ Project maintainers are responsible for clarifying the standards of acceptable
37
+ behavior and are expected to take appropriate and fair corrective action in
38
+ response to any instances of unacceptable behavior.
39
+
40
+ Project maintainers have the right and responsibility to remove, edit, or
41
+ reject comments, commits, code, wiki edits, issues, and other contributions
42
+ that are not aligned to this Code of Conduct, or to ban temporarily or
43
+ permanently any contributor for other behaviors that they deem inappropriate,
44
+ threatening, offensive, or harmful.
45
+
46
+ ## Scope
47
+
48
+ This Code of Conduct applies both within project spaces and in public spaces
49
+ when an individual is representing the project or its community. Examples of
50
+ representing a project or community include using an official project e-mail
51
+ address, posting via an official social media account, or acting as an appointed
52
+ representative at an online or offline event. Representation of a project may be
53
+ further defined and clarified by project maintainers.
54
+
55
+ ## Enforcement
56
+
57
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be
58
+ reported by contacting the project team at 1336098842@qq.com. All
59
+ complaints will be reviewed and investigated and will result in a response that
60
+ is deemed necessary and appropriate to the circumstances. The project team is
61
+ obligated to maintain confidentiality with regard to the reporter of an incident.
62
+ Further details of specific enforcement policies may be posted separately.
63
+
64
+ Project maintainers who do not follow or enforce the Code of Conduct in good
65
+ faith may face temporary or permanent repercussions as determined by other
66
+ members of the project's leadership.
67
+
68
+ ## Attribution
69
+
70
+ This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
71
+ available at [http://contributor-covenant.org/version/1/4][version]
72
+
73
+ [homepage]: http://contributor-covenant.org
74
+ [version]: http://contributor-covenant.org/version/1/4/
data/Gemfile ADDED
@@ -0,0 +1,10 @@
+ source "https://rubygems.org"
+
+ git_source(:github) {|repo_name| "https://github.com/#{repo_name}" }
+ # gem 'rchardet', '~> 1.8'
+ # gem 'nokogiri', '~> 1.8.4'
+ #
+ # gem "ruby-readability", :require => 'readability'
+
+ # Specify your gem's dependencies in http_crawler.gemspec
+ gemspec
data/README.md ADDED
@@ -0,0 +1,55 @@
+ # HttpCrawler
+
+ Welcome to your new gem! In this directory, you'll find the files you need to be able to package up your Ruby library into a gem. Put your Ruby code in the file `lib/http_crawler`. To experiment with that code, run `bin/console` for an interactive prompt.
+
+ TODO: Delete this and the text above, and describe your gem
+
+ ## Installation
+
+ Add this line to your application's Gemfile:
+
+ ```ruby
+ gem 'http_crawler'
+ ```
+
+ And then execute:
+
+     $ bundle
+
+ Or install it yourself as:
+
+     $ gem install http_crawler
+
+
+ ## Example: Baidu crawler
+
+
+ ### Calling the client class directly
+
+ ```ruby
+ client = HttpCrawler::Web::Baidu::Client.new
+ client.index # fetch the home page
+ ```
+
+ ### Calling via the alias helper
+ ```ruby
+ client = HttpCrawler::Client.for("baidu")
+ client.index # fetch the home page
+ ```
+
+
+ ## Example: test proxy API
+
+
+ ### Calling the client class directly
+
+ ```ruby
+ client = HttpCrawler::Proxy::TestProxyApi::Client.new
+ client.get_proxy # fetch a proxy
+ ```
+
+ ### Calling via the alias helper
+ ```ruby
+ client = HttpCrawler::Proxy.for("test_proxy_api")
+ client.get_proxy # fetch a proxy
+ ```
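
The response objects returned by these clients extend `Net::HTTPResponse` with parsing helpers (see `lib/http_crawler/net/response.rb` below). A minimal end-to-end sketch, assuming the files under `lib/http_crawler/` are loaded (the entry file `lib/http_crawler.rb` does not load them itself), that ActiveSupport is available for `camelize`/`constantize`, and that a `Rails.logger` stand-in exists since the library logs through it:

```ruby
require "http_crawler"
require "rchardet"                                # used by get_fetch for URL encoding checks
require "active_support/core_ext/string"          # Client.for relies on camelize/constantize
require "logger"

# Stand-in for Rails.logger when running outside Rails (assumption for this sketch).
Rails = Struct.new(:logger).new(Logger.new($stdout)) unless defined?(Rails)

client   = HttpCrawler::Client.for("baidu")       # resolves HttpCrawler::Web::Baidu::Client
response = client.index                           # GET "/" with redirects already followed
puts response.code
puts response.html.title                          # Nokogiri document built from decoding_body
```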
data/Rakefile ADDED
@@ -0,0 +1,2 @@
+ require "bundler/gem_tasks"
+ task :default => :spec
data/bin/console ADDED
@@ -0,0 +1,14 @@
+ #!/usr/bin/env ruby
+
+ require "bundler/setup"
+ require "http_crawler"
+
+ # You can add fixtures and/or initialization code here to make experimenting
+ # with your gem easier. You can also use a different console, if you like.
+
+ # (If you use this, don't forget to add pry to your Gemfile!)
+ # require "pry"
+ # Pry.start
+
+ require "irb"
+ IRB.start(__FILE__)
data/bin/setup ADDED
@@ -0,0 +1,8 @@
+ #!/usr/bin/env bash
+ set -euo pipefail
+ IFS=$'\n\t'
+ set -vx
+
+ bundle install
+
+ # Do any other automated setup that you need to do here
data/http_crawler.gemspec ADDED
@@ -0,0 +1,45 @@
+ # coding: utf-8
+ lib = File.expand_path("../lib", __FILE__)
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+ require "http_crawler/version"
+
+ Gem::Specification.new do |spec|
+   spec.name = "http_crawler"
+   spec.version = HttpCrawler::VERSION
+   spec.authors = ["jagger"]
+   spec.email = ["1336098842@qq.com"]
+
+   spec.summary = %q{HTTP crawler.}
+   spec.description = %q{A crawler extension package written on top of net/http by a junior developer.}
+   spec.homepage = "https://rubygems.org/gems/http_crawler"
+   spec.license = "MIT"
+
+   # Prevent pushing this gem to RubyGems.org. To allow pushes either set the 'allowed_push_host'
+   # to allow pushing to a single host or delete this section to allow pushing to any host.
+   if spec.respond_to?(:metadata)
+     spec.metadata["allowed_push_host"] = "https://rubygems.org"
+   else
+     raise "RubyGems 2.0 or newer is required to protect against " \
+       "public gem pushes."
+   end
+
+   spec.files = `git ls-files -z`.split("\x0").reject do |f|
+     f.match(%r{^(test|spec|features)/})
+   end
+
+   spec.files += Dir['lib/**/*.rb']
+
+   spec.bindir = "exe"
+   spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
+   spec.require_paths = ["lib"]
+
+   spec.add_development_dependency "rspec", "~> 3.8"
+   spec.add_development_dependency "bundler", "~> 1.15"
+   spec.add_development_dependency "rake", "~> 10.0"
+
+   spec.add_dependency "rchardet", "~> 1.8"
+   spec.add_dependency "nokogiri", "~> 1.8"
+   spec.add_dependency "ruby-readability", "~> 0.7.0"
+   spec.add_dependency "brotli", "~> 0.2.1"
+
+ end
data/lib/http_crawler/client.rb ADDED
@@ -0,0 +1,88 @@
+ load File.dirname(__FILE__) + '/http.rb'
+ load File.dirname(__FILE__) + '/object.rb'
+ load File.dirname(__FILE__) + '/string.rb'
+
+ module HttpCrawler
+   module Client
+
+     class << self
+
+       # Accepts a name such as
+       #   web_name = "biquge_duquanben"
+       # and returns an instance of HttpCrawler::Web::BiqugeDuquanben::Client.
+       #
+       def for(web_name, *args)
+         "HttpCrawler::Web::#{web_name.camelize}::Client".constantize.new(*args)
+       end
+
+       #
+       # Accepts a module name such as
+       #   module_name = "HttpCrawler::Web::BiqugeDuquanben"
+       # and returns an instance of HttpCrawler::Web::BiqugeDuquanben::Client.
+       #
+       def for_module(module_name, *args)
+         "#{module_name}::Client".constantize.new(*args)
+       end
+     end
+
+     attr_reader :http, :uri
+
+     #
+     # Raises an error if init_uri has not initialized @uri.
+     # Including classes must redefine init_uri.
+     #
+     def initialize
+       raise "Client uri is empty" unless init_uri
+       @http = HttpCrawler::HTTP.new(uri.host, uri.port)
+
+       @http.use_ssl = (uri.scheme == "https")
+
+       @http.open_timeout = 5
+       @http.read_timeout = 5
+       @http.proxy_key = "#{self.class}"
+       init_http
+
+       Rails.logger.debug "proxy_key => #{@http.proxy_key}"
+     end
+
+     # Initialize HTTP parameters; override in including classes as needed.
+     def init_http
+
+     end
+
+     # Raises an error if @uri is not initialized.
+     # Including classes must implement this as @uri = URI("http://host").
+     #
+     def init_uri
+       @uri = nil
+     end
+
+     def header
+       @header ||= init_header
+     end
+
+     def init_header
+       nil
+     end
+
+     def update_header(parameter = {})
+       nil
+     end
+
+     def update_proxy(proxy = {})
+       @http.update_proxy(proxy)
+     end
+
+     def auto_proxy=(value)
+       Rails.logger.debug "Enabling automatic proxy refresh"
+       @http.auto_proxy = value
+       @http.update_proxy if (value == true && @http.proxy? == false)
+     end
+
+     # Whether the response is a CAPTCHA / verification page.
+     def validation_page?(*arg)
+       false
+     end
+
+   end
+ end
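
`HttpCrawler::Client` is a mixin rather than a base class: a site-specific client includes it and must define `init_uri` (and may override `init_http` / `init_header`). A minimal sketch of a hypothetical client for example.com, modeled on the Baidu client further down; the module name and URL are illustrative only:

```ruby
module HttpCrawler
  module Web
    module Example                      # hypothetical site module
      class Client
        include(HttpCrawler::Client)

        # Required: point the client at the target site.
        def init_uri
          @uri = URI("https://example.com/")
        end

        # Optional: tighten the timeouts set up by HttpCrawler::Client#initialize.
        def init_http
          @http.open_timeout = 3
          @http.read_timeout = 3
        end

        def index(parameter = {})
          http.get_fetch("/", header)   # returns a Net::HTTPResponse
        end
      end
    end
  end
end

# HttpCrawler::Client.for("example") would now resolve to this class.
```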
data/lib/http_crawler/http.rb ADDED
@@ -0,0 +1,211 @@
+ load File.dirname(__FILE__) + '/net/http.rb'
+ load File.dirname(__FILE__) + '/net/response.rb'
+
+ module HttpCrawler
+   class HTTP < Net::HTTP
+
+     # Whether to fetch proxies automatically: true enables it, false disables it.
+     attr_accessor :auto_proxy
+     # Alias of the proxy API to use; maps to a proxy API maintained under HttpCrawler::Proxy.
+     attr_accessor :proxy_api
+     # Key used when requesting proxies from your own proxy pool.
+     attr_accessor :proxy_key
+     # Maximum number of retries after a failed request.
+     attr_accessor :max_error_num
+
+     def initialize(address, port = nil)
+       super(address, port)
+       @max_error_num = 2
+       @error_num = 0
+       @proxy_key = "default"
+     end
+
+     def http_error_sleep
+       sleep(0.5)
+     end
+
+     def server_error_sleep
+       sleep(3)
+     end
+
+     def proxy_api
+       @proxy_api ||= "my"
+     end
+     @@proxy_list = []
+     # Reset the proxy used by this connection.
+     def proxy(p = {})
+
+       raise 'proxy setting p_addr cannot be empty' unless p["p_addr"]
+       raise 'proxy setting p_port cannot be empty' unless p["p_port"]
+
+       p["p_user"] ||= nil
+       p["p_pass"] ||= nil
+
+       Rails.logger.info("Switching proxy to => #{p}")
+       # Must be false, otherwise the proxy settings are ignored.
+       @proxy_from_env = false
+
+       # Initialize the proxy fields.
+       @proxy_address = p["p_addr"]
+       @proxy_port = p["p_port"]
+       @proxy_user = p["p_user"]
+       @proxy_pass = p["p_pass"]
+
+     end
+
+     # Obtain a proxy, either by calling the proxy API or from a custom setup.
+     def get_proxy
+
+       while @@proxy_list.blank?
+         Rails.logger.debug("@@proxy_list is empty, refreshing")
+         proxy_client = HttpCrawler::Proxy.for(proxy_api)
+         proxy_r = proxy_client.get_proxy(key: proxy_key)
+         @@proxy_list.concat(proxy_r.parsing)
+         Rails.logger.debug("@@proxy_list => #{@@proxy_list}")
+         sleep(1)
+       end
+
+       p = @@proxy_list.delete_at(0)
+
+       Rails.logger.debug("Current IP => #{@proxy_address}:#{@proxy_port}, newest proxy => #{p}")
+
+       unless p && p["p_addr"] && p["p_port"]
+         Rails.logger.warn "No new proxy available; waiting 5 seconds before retrying"
+         sleep(5)
+         p = get_proxy
+       end
+
+       if (@proxy_address == p["p_addr"] && @proxy_port == p["p_port"])
+         Rails.logger.warn "No new proxy available; waiting 5 seconds before retrying"
+         sleep(5)
+         p = get_proxy
+       end
+       p
+     end
+
+     def update_proxy(p = {})
+       if p.blank?
+         proxy(get_proxy)
+       else
+         proxy(p)
+       end
+     end
+
+     # If auto_proxy is enabled, update the proxy and return true; otherwise return false.
+     def update_proxy?(p = {})
+       if auto_proxy
+         if p.blank?
+           proxy(get_proxy)
+         else
+           proxy(p)
+         end
+         return true
+       else
+         return false
+       end
+     end
+
+
+     # GET with redirect handling.
+     def get_fetch(uri_or_path, initheader = nil, dest = nil, limit = 10, &block)
+       # You should choose a better exception.
+       raise ArgumentError, 'too many HTTP repeated' if limit == 0
+       # Percent-encode uri_or_path if it is a non-ASCII String.
+       uri_or_path = URI.encode(uri_or_path) if String === uri_or_path && CharDet.detect(uri_or_path)["encoding"] != "ascii"
+
+       response = get(uri_or_path, initheader, dest, &block)
+       case response
+       when Net::HTTPSuccess then
+         response
+       when Net::HTTPRedirection then
+         location = response['location']
+         Rails.logger.warn "redirected to #{location}"
+         # Follow the Location header.
+         get_fetch(location, initheader, dest, limit - 1, &block)
+       when Net::HTTPServerError then
+         Rails.logger.warn "Net::HTTPServerError 5XX to #{address}"
+         server_error_sleep
+         # Retry the request.
+         get_fetch(uri_or_path, initheader, dest, limit - 1, &block)
+       else
+         server_error_sleep
+         response.error!
+       end
+     end
+
+     # POST with redirect handling.
+     def post_fetch(uri_or_path, data, initheader = nil, dest = nil, &block)
+       # Percent-encode uri_or_path if it is a non-ASCII String.
+       uri_or_path = URI.encode(uri_or_path) if String === uri_or_path && CharDet.detect(uri_or_path)["encoding"] != "ascii"
+       Rails.logger.debug "post_fetch => #{uri_or_path}"
+       response = post(uri_or_path, data, initheader, dest, &block)
+       case response
+       when Net::HTTPSuccess then
+         response
+       when Net::HTTPRedirection then
+         location = response['location']
+         Rails.logger.warn "redirected to #{location}"
+         # Follow the Location header.
+         get_fetch(location, initheader, dest, 9, &block)
+       when Net::HTTPServerError then
+         Rails.logger.warn "Net::HTTPServerError 5XX to #{address}"
+         server_error_sleep
+         # Retry the request.
+         post_fetch(uri_or_path, data, initheader, dest, &block)
+       else
+         server_error_sleep
+         response.error!
+       end
+     end
+
+     # def post_fetch
+
+     #
+     # Override the request method to add retry and proxy handling.
+     #
+     def request(req, body = nil, &block)
+       begin
+         Rails.logger.debug("#{req.class} => #{use_ssl? ? "https://" : "http://" }#{address}:#{port}#{req.path}") if started?
+         super(req, body, &block)
+       rescue => error
+         if started?
+           # started? tells whether the HTTP session is still open; without it the exception would be handled twice.
+           raise error
+         else
+           # Retry up to the maximum error count.
+           if @error_num < @max_error_num
+             @error_num += 1
+             http_error_sleep
+             retry # returns control to the top of the begin block
+           else
+             # Maximum error count exceeded; decide by error type.
+             case error
+             when Net::HTTPFatalError
+               raise error
+             when EOFError
+               Rails.logger.warn "EOFError!"
+               if update_proxy?
+                 proxy(get_proxy)
+                 http_error_sleep
+                 retry # returns control to the top of the begin block
+               else
+                 raise error
+               end
+             when Timeout::Error
+               Rails.logger.warn "Request timed out!"
+               if update_proxy?
+                 @error_num = 0
+                 http_error_sleep
+                 retry # returns control to the top of the begin block
+               else
+                 raise error
+               end
+             else
+               raise error
+             end
+           end
+         end
+       end # begin
+     end # def request(req, body = nil, &block)
+   end
+ end
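
A minimal sketch of driving `HttpCrawler::HTTP` directly, without the client layer. It assumes the library files are loaded, `rchardet` is required (`get_fetch` calls `CharDet`), and a `Rails.logger` stand-in exists; `auto_proxy` stays off so no proxy pool is involved:

```ruby
require "rchardet"
require "logger"

# Stand-in for Rails.logger outside a Rails app (assumption for this sketch).
Rails = Struct.new(:logger).new(Logger.new($stdout)) unless defined?(Rails)

http = HttpCrawler::HTTP.new("example.com", 443)
http.use_ssl       = true
http.auto_proxy    = false   # skip the proxy pool entirely
http.max_error_num = 3       # allow up to 3 retries on transient errors

response = http.get_fetch("/", { "User-Agent" => "http_crawler example" })
puts response.code           # redirects and 5xx retries are handled inside get_fetch
```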
data/lib/http_crawler/net/http.rb ADDED
@@ -0,0 +1,7 @@
+ module Net
+   class HTTP
+
+
+   end # class HTTP
+ end # module Net
+
data/lib/http_crawler/net/response.rb ADDED
@@ -0,0 +1,96 @@
+ module Net
+   class HTTPResponse
+
+     # Decompress and transcode the body data.
+     def decoding_body
+
+       return @decoding_body if @decoding_body
+       return nil unless body
+
+       # Decompress the data.
+       case header['Content-Encoding']
+       when 'gzip' then
+         sio = StringIO.new(body)
+         gz = Zlib::GzipReader.new(sio)
+         @decoding_body = gz.read()
+       when 'br'
+         @decoding_body = Brotli.inflate(body)
+       when 'deflate'
+         # Possibly incorrect; deflate decoding has not been fully worked out yet.
+         @decoding_body = Zlib::Inflate.inflate(body)
+       else
+         @decoding_body = body
+       end
+
+       # Determine the character encoding of the decompressed data.
+
+       # Take the charset from the response header.
+       encoding = header['Content-Type'][/charset=([^, ;"]*)/, 1]
+
+       # Fall back to the charset declared in the HTML.
+       encoding = @decoding_body[/charset=([^, ;"]*)/, 1] unless encoding
+
+       # Fall back to detection via CharDet.
+       encoding = CharDet.detect(@decoding_body)["encoding"] unless encoding
+
+       # Transcode to UTF-8.
+       begin
+         @decoding_body.force_encoding(encoding).encode!('utf-8') if encoding != @decoding_body.encoding
+       rescue => e
+         # On a transcoding error, re-detect the encoding with CharDet and try again.
+         cd = CharDet.detect(@decoding_body)["encoding"]
+         if (cd && cd != encoding)
+           @decoding_body.force_encoding(cd).encode!('utf-8') if encoding != @decoding_body.encoding
+         else
+           # Still failing: re-raise the exception.
+           raise e
+         end
+       end
+
+       @decoding_body
+     end
+
+     # def decoding_body
+
+     def html
+       @html ||= Nokogiri::HTML(decoding_body)
+     end
+
+     def json
+       @json ||= JSON.parse(decoding_body)
+       @json = JSON.parse(@json) if String === @json
+       @json
+     end
+
+     # Parse the body with ruby-readability.
+     def readability
+       @readability ||= Readability::Document.new(decoding_body, {do_not_guess_encoding: true})
+     end
+
+     # Parse the response; override per response type.
+     def parsing
+       nil
+     end
+
+     def get_date(str)
+       time = Time.now
+       case str
+       when /^(\d{1,2})小时前$/
+         time = time - $1.to_i.hours
+       when /^(\d{1,2})月(\d{1,2})日$/
+         time = Time.local(time.year, $1.to_i, $2.to_i)
+       when /^(\d{4})年(\d{1,2})月(\d{1,2})日$/
+         time = Time.local($1.to_i, $2.to_i, $3.to_i)
+       when /^(\d{1,2})月(\d{1,2})日[ ]{0,3}(\d{1,2}):(\d{1,2})$/ # e.g. 09月30日 12:04
+         time = Time.local(time.year, $1.to_i, $2.to_i, $3.to_i, $4.to_i)
+       end
+       return time
+     end
+
+
+     # Site-verification check: true means normal data, false means the site served a verification page.
+     def web_verify(*arg)
+       true
+     end
+   end # class Net::HTTPResponse
+ end
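
These helpers become available on every `Net::HTTPResponse` once this file is loaded. A short continuation of the previous sketch (reusing its `http` object), assuming `nokogiri`, `rchardet` and `json` are loaded:

```ruby
response = http.get_fetch("/")

body = response.decoding_body                     # gunzipped / Brotli-inflated, forced to UTF-8
doc  = response.html                              # memoized Nokogiri::HTML document
doc.css("a").first(5).each { |a| puts a["href"] } # print a few links

# For JSON endpoints the same object exposes a parsed structure:
# data = response.json
```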
data/lib/http_crawler/object.rb ADDED
@@ -0,0 +1,13 @@
+ class Object
+   def blank_0(&block)
+     if (self.blank?)
+       nil
+     else
+       if block_given?
+         block.call self
+       else
+         self
+       end
+     end
+   end
+ end
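
`Object#blank_0` is a small guard for scraped values: blank input short-circuits to `nil`, anything else is passed to the block. A usage sketch, assuming ActiveSupport's `#blank?` is available (the method relies on it):

```ruby
require "active_support/core_ext/object/blank"   # provides #blank?, which blank_0 relies on

title = ""                                       # e.g. a scraped field that came back empty
p title.blank_0 { |t| t.strip.upcase }           # => nil

title = " http_crawler "
p title.blank_0 { |t| t.strip.upcase }           # => "HTTP_CRAWLER"
```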
data/lib/http_crawler/proxy/README.md ADDED
@@ -0,0 +1,3 @@
+ # Proxy module
+ Maintains the proxy APIs used when making requests
+
data/lib/http_crawler/proxy/client.rb ADDED
@@ -0,0 +1,7 @@
+ module HttpCrawler
+   module Proxy
+     module Client
+
+     end
+   end
+ end
data/lib/http_crawler/proxy/response.rb ADDED
@@ -0,0 +1,10 @@
+
+ module HttpCrawler
+   module Proxy
+     module Response
+
+     end
+   end
+ end
+
+
data/lib/http_crawler/proxy/test_proxy_api/README.md ADDED
@@ -0,0 +1,18 @@
+ # Example: test proxy API
+
+
+ ### Calling the client class directly
+
+ ```ruby
+ client = HttpCrawler::Proxy::TestProxyApi::Client.new
+ client.get_proxy # fetch a proxy
+ ```
+
+ ### Calling via the alias helper
+ ```ruby
+ client = HttpCrawler::Proxy.for("test_proxy_api")
+ client.get_proxy # fetch a proxy
+ ```
+
+ ### response.rb
+ Holds the handlers for the different response types
data/lib/http_crawler/proxy/test_proxy_api/client.rb ADDED
@@ -0,0 +1,29 @@
+ module HttpCrawler
+   module Proxy
+     module TestProxyApi
+       class Client
+
+         include(HttpCrawler::Client)
+         include(HttpCrawler::Proxy::Client)
+
+         class << self
+           def new(*args)
+             @client ||= super(*args)
+           end
+         end
+
+         def init_uri
+           @uri = URI("http://127.0.0.1:1111/")
+         end
+
+         # http://39.108.59.38:7772/Tools/proxyIP.ashx?OrderNumber=ccd4c8912691f28861a1ed048fec88dc&poolIndex=22717&cache=1&qty=2
+         def get_proxy(parameter = {})
+           r = http.get_fetch("/api/get_proxy")
+           r.extend(HttpCrawler::Proxy::TestProxyApi::Response::GetProxy)
+         end
+
+       end
+     end # module TestProxyApi
+   end # module Proxy
+ end # module HttpCrawler
+
data/lib/http_crawler/proxy/test_proxy_api/response/get_proxy.rb ADDED
@@ -0,0 +1,24 @@
+ # Parses the proxy-query response
+ module HttpCrawler
+   module Proxy
+     module TestProxyApi
+       module Response
+         module GetProxy
+           def parsing
+             array = []
+             decoding_body.scan(/([^\n\r:]*):([^\n\r]*)/) do |v1, v2|
+               if v1 =~ /\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/
+                 array[array.length] = {"p_addr" => v1, "p_port" => v2, "p_user" => nil, "p_pass" => nil}
+               else
+                 Rails.logger.warn decoding_body
+               end
+             end
+             array
+           end
+         end # module GetProxy
+       end # module Response
+     end # module TestProxyApi
+   end # module Proxy
+ end # module HttpCrawler
+
+
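
`parsing` expects the proxy API to answer with one `ip:port` pair per line and converts it into the hash format the proxy pool consumes. A small illustration with made-up addresses, assuming the file above has been loaded (and using the corrected module name):

```ruby
# Fake response object exposing only decoding_body (addresses are made up).
FakeResponse = Struct.new(:decoding_body) do
  include HttpCrawler::Proxy::TestProxyApi::Response::GetProxy
end

r = FakeResponse.new("111.222.33.44:8080\n55.66.77.88:3128\n")
p r.parsing
# => [{"p_addr"=>"111.222.33.44", "p_port"=>"8080", "p_user"=>nil, "p_pass"=>nil},
#     {"p_addr"=>"55.66.77.88", "p_port"=>"3128", "p_user"=>nil, "p_pass"=>nil}]
```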
data/lib/http_crawler/proxy/test_proxy_api/response.rb ADDED
@@ -0,0 +1,12 @@
+ # Dir[File.dirname(__FILE__) + '/response/*.rb'].each {|file| require file}
+
+ module HttpCrawler
+   module Proxy
+     module TestProxyApi
+       module Response
+
+       end
+     end
+   end
+ end
+
data/lib/http_crawler/proxy.rb ADDED
@@ -0,0 +1,18 @@
+ module HttpCrawler
+   module Proxy
+
+     class << self
+
+       # Accepts a name such as
+       #   web_name = "feilong"
+       # and returns an instance of HttpCrawler::Proxy::Feilong::Client.
+       #
+       def for(web_name, *arg)
+         "HttpCrawler::Proxy::#{web_name.camelize}::Client".constantize.new(*arg)
+       end
+
+     end
+
+
+   end
+ end
data/lib/http_crawler/string.rb ADDED
@@ -0,0 +1,9 @@
+
+ class String
+   # Strip interfering characters from scraped text:
+   # removes newlines, tabs and spaces.
+   #
+   def del_inter
+     self.gsub(/(?:\n|\t| )/,"")
+   end
+ end
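
A quick usage example of `String#del_inter` on typical scraped fragments:

```ruby
puts "  hello \n world\t".del_inter   # => "helloworld"
puts " 第一章 \n\t 正文 ".del_inter     # => "第一章正文"
```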
data/lib/http_crawler/version.rb ADDED
@@ -0,0 +1,3 @@
+ module HttpCrawler
+   VERSION = "0.2.0"
+ end
data/lib/http_crawler/web/README.md ADDED
@@ -0,0 +1,4 @@
+ # Crawler maintenance
+
+
+ Holds the site-specific crawlers
data/lib/http_crawler/web/baidu/README.md ADDED
@@ -0,0 +1,19 @@
+ # Example: Baidu crawler
+
+
+ ### Calling the client class directly
+
+ ```ruby
+ client = HttpCrawler::Web::Baidu::Client.new
+ client.index # fetch the home page
+ ```
+
+ ### Calling via the alias helper
+ ```ruby
+ client = HttpCrawler::Client.for("baidu")
+ client.index # fetch the home page
+ ```
+
+
+ ### response.rb
+ Holds the handlers for the different response types
data/lib/http_crawler/web/baidu/client.rb ADDED
@@ -0,0 +1,25 @@
+ module HttpCrawler
+   module Web
+     module Baidu
+       class Client
+         include(HttpCrawler::Client)
+
+         def init_http
+           @http.open_timeout = 3
+           @http.read_timeout = 3
+         end
+
+         def init_uri
+           @uri = URI("https://www.baidu.com/")
+         end
+
+         def index(parameter = {})
+           r = http.get_fetch("/", header)
+           r.extend(HttpCrawler::Web::Baidu::Response::Index)
+         end
+
+       end
+     end # module Baidu
+   end # module Web
+ end # module HttpCrawler
+
data/lib/http_crawler/web/baidu/response/index.rb ADDED
@@ -0,0 +1,16 @@
+ # Parses the home-page response
+ module HttpCrawler
+   module Web
+     module Baidu
+       module Response
+         module Index
+           def parsing(parameter = {})
+             html
+           end
+         end # module Index
+       end # module Response
+     end # module Baidu
+   end # module Web
+ end # module HttpCrawler
+
+
data/lib/http_crawler/web/baidu/response.rb ADDED
@@ -0,0 +1,10 @@
+ # Dir[File.dirname(__FILE__) + '/response/*.rb'].each {|file| require file}
+
+ module HttpCrawler
+   module Web
+     module Baidu
+       module Response
+       end
+     end
+   end
+ end
data/lib/http_crawler/web.rb ADDED
@@ -0,0 +1,7 @@
+ # Dir[File.dirname(__FILE__) + '/web/*.rb'].each {|file| require file}
+
+ module HttpCrawler
+   module Web
+
+   end
+ end
data/lib/http_crawler.rb ADDED
@@ -0,0 +1,9 @@
+ require 'net/http'
+ require 'json'
+ require 'digest/md5'
+ require 'nokogiri'
+
+
+ module HttpCrawler
+   # Your code goes here...
+ end
metadata ADDED
@@ -0,0 +1,175 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: http_crawler
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.2.0
5
+ platform: ruby
6
+ authors:
7
+ - jagger
8
+ autorequire:
9
+ bindir: exe
10
+ cert_chain: []
11
+ date: 2018-12-28 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: rspec
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - "~>"
18
+ - !ruby/object:Gem::Version
19
+ version: '3.8'
20
+ type: :development
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - "~>"
25
+ - !ruby/object:Gem::Version
26
+ version: '3.8'
27
+ - !ruby/object:Gem::Dependency
28
+ name: bundler
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - "~>"
32
+ - !ruby/object:Gem::Version
33
+ version: '1.15'
34
+ type: :development
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - "~>"
39
+ - !ruby/object:Gem::Version
40
+ version: '1.15'
41
+ - !ruby/object:Gem::Dependency
42
+ name: rake
43
+ requirement: !ruby/object:Gem::Requirement
44
+ requirements:
45
+ - - "~>"
46
+ - !ruby/object:Gem::Version
47
+ version: '10.0'
48
+ type: :development
49
+ prerelease: false
50
+ version_requirements: !ruby/object:Gem::Requirement
51
+ requirements:
52
+ - - "~>"
53
+ - !ruby/object:Gem::Version
54
+ version: '10.0'
55
+ - !ruby/object:Gem::Dependency
56
+ name: rchardet
57
+ requirement: !ruby/object:Gem::Requirement
58
+ requirements:
59
+ - - "~>"
60
+ - !ruby/object:Gem::Version
61
+ version: '1.8'
62
+ type: :runtime
63
+ prerelease: false
64
+ version_requirements: !ruby/object:Gem::Requirement
65
+ requirements:
66
+ - - "~>"
67
+ - !ruby/object:Gem::Version
68
+ version: '1.8'
69
+ - !ruby/object:Gem::Dependency
70
+ name: nokogiri
71
+ requirement: !ruby/object:Gem::Requirement
72
+ requirements:
73
+ - - "~>"
74
+ - !ruby/object:Gem::Version
75
+ version: '1.8'
76
+ type: :runtime
77
+ prerelease: false
78
+ version_requirements: !ruby/object:Gem::Requirement
79
+ requirements:
80
+ - - "~>"
81
+ - !ruby/object:Gem::Version
82
+ version: '1.8'
83
+ - !ruby/object:Gem::Dependency
84
+ name: ruby-readability
85
+ requirement: !ruby/object:Gem::Requirement
86
+ requirements:
87
+ - - "~>"
88
+ - !ruby/object:Gem::Version
89
+ version: 0.7.0
90
+ type: :runtime
91
+ prerelease: false
92
+ version_requirements: !ruby/object:Gem::Requirement
93
+ requirements:
94
+ - - "~>"
95
+ - !ruby/object:Gem::Version
96
+ version: 0.7.0
97
+ - !ruby/object:Gem::Dependency
98
+ name: brotli
99
+ requirement: !ruby/object:Gem::Requirement
100
+ requirements:
101
+ - - "~>"
102
+ - !ruby/object:Gem::Version
103
+ version: 0.2.1
104
+ type: :runtime
105
+ prerelease: false
106
+ version_requirements: !ruby/object:Gem::Requirement
107
+ requirements:
108
+ - - "~>"
109
+ - !ruby/object:Gem::Version
110
+ version: 0.2.1
111
+ description: A crawler extension package written on top of net/http by a junior developer.
112
+ email:
113
+ - 1336098842@qq.com
114
+ executables: []
115
+ extensions: []
116
+ extra_rdoc_files: []
117
+ files:
118
+ - ".gitignore"
119
+ - ".idea/vcs.xml"
120
+ - ".rspec"
121
+ - CODE_OF_CONDUCT.md
122
+ - Gemfile
123
+ - README.md
124
+ - Rakefile
125
+ - bin/console
126
+ - bin/setup
127
+ - http_crawler.gemspec
128
+ - lib/http_crawler.rb
129
+ - lib/http_crawler/client.rb
130
+ - lib/http_crawler/http.rb
131
+ - lib/http_crawler/net/http.rb
132
+ - lib/http_crawler/net/response.rb
133
+ - lib/http_crawler/object.rb
134
+ - lib/http_crawler/proxy.rb
135
+ - lib/http_crawler/proxy/README.md
136
+ - lib/http_crawler/proxy/client.rb
137
+ - lib/http_crawler/proxy/response.rb
138
+ - lib/http_crawler/proxy/test_proxy_api/README.md
139
+ - lib/http_crawler/proxy/test_proxy_api/client.rb
140
+ - lib/http_crawler/proxy/test_proxy_api/response.rb
141
+ - lib/http_crawler/proxy/test_proxy_api/response/get_proxy.rb
142
+ - lib/http_crawler/string.rb
143
+ - lib/http_crawler/version.rb
144
+ - lib/http_crawler/web.rb
145
+ - lib/http_crawler/web/README.md
146
+ - lib/http_crawler/web/baidu/README.md
147
+ - lib/http_crawler/web/baidu/client.rb
148
+ - lib/http_crawler/web/baidu/response.rb
149
+ - lib/http_crawler/web/baidu/response/index.rb
150
+ homepage: https://rubygems.org/gems/http_crawler
151
+ licenses:
152
+ - MIT
153
+ metadata:
154
+ allowed_push_host: https://rubygems.org
155
+ post_install_message:
156
+ rdoc_options: []
157
+ require_paths:
158
+ - lib
159
+ required_ruby_version: !ruby/object:Gem::Requirement
160
+ requirements:
161
+ - - ">="
162
+ - !ruby/object:Gem::Version
163
+ version: '0'
164
+ required_rubygems_version: !ruby/object:Gem::Requirement
165
+ requirements:
166
+ - - ">="
167
+ - !ruby/object:Gem::Version
168
+ version: '0'
169
+ requirements: []
170
+ rubyforge_project:
171
+ rubygems_version: 2.6.14
172
+ signing_key:
173
+ specification_version: 4
174
+ summary: HTTP crawler.
175
+ test_files: []