list_spider 2.3.0 → 2.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/Gemfile CHANGED
@@ -1,6 +1,6 @@
1
- source 'https://rubygems.org'
2
-
3
- git_source(:github) { |repo_name| "https://github.com/#{repo_name}" }
4
-
5
- # Specify your gem's dependencies in list_spider.gemspec
6
- gemspec
1
+ source 'https://rubygems.org'
2
+
3
+ git_source(:github) { |repo_name| "https://github.com/#{repo_name}" }
4
+
5
+ # Specify your gem's dependencies in list_spider.gemspec
6
+ gemspec
data/README.md CHANGED
@@ -1,181 +1,181 @@
1
- # 关于list_spider
2
-
3
- list_spider是一个基于[em-http-request](https://github.com/igrigorik/em-http-request)的爬虫工具。
4
-
5
- 许多情况下,爬虫的工作是爬取链接,解析返回数据,从中提取链接,继续爬取,list_spider就是适用这种场景的爬虫工具。
6
-
7
- ## 功能特点
8
- * 去重过滤 (使用本地文件路径做唯一性校验)。
9
-
10
- * 支持UTF-8编码转换。
11
-
12
- * 默认增量爬取,已爬取的不再重复爬取(可以通过选项强制重新获取)。
13
-
14
- * 自由设置最大并发数和爬取任务间隔时间。
15
-
16
- * 支持http所有选项设置。
17
-
18
- ## 开始
19
-
20
- ```ruby
21
- gem install list_spider
22
- ```
23
-
24
- 或者添加到Gemfile
25
-
26
- ```ruby
27
- gem 'list_spider'
28
- ```
29
-
30
- ## 使用方法
31
- ```ruby
32
- require 'list_spider'
33
-
34
- DOWNLOAD_DIR = 'coolshell/'.freeze
35
-
36
- @next_list = []
37
-
38
- def parse_index_item(e)
39
- content = File.read(e.local_path)
40
- doc = Nokogiri::HTML(content)
41
- list_group = doc.css('h2.entry-title')
42
- link_list = list_group.css('a')
43
-
44
- link_list.each do |link|
45
- href = link['href']
46
- local_path = DOWNLOAD_DIR + link.content + '.html'
47
- # 可以存入数据库后续处理
48
- @next_list << TaskStruct.new(href, local_path)
49
- end
50
- end
51
-
52
- task_list = []
53
- task_list << TaskStruct.new(
54
- 'https://coolshell.cn/',
55
- DOWNLOAD_DIR + 'index.html',
56
- parse_method: method(:parse_index_item)
57
- )
58
-
59
- ListSpider.get_list(task_list)
60
- ListSpider.get_list(@next_list, max: 60)
61
- ```
62
-
63
- ## 或者使用更简单的一步完成
64
- ```ruby
65
- require 'list_spider'
66
-
67
- DOWNLOAD_DIR = 'coolshell/'.freeze
68
-
69
- def parse_index_item(e)
70
- content = File.read(e.local_path)
71
- doc = Nokogiri::HTML(content)
72
- list_group = doc.css('h2.entry-title')
73
- link_list = list_group.css('a')
74
-
75
- link_list.each do |link|
76
- href = link['href']
77
- local_path = DOWNLOAD_DIR + link.content + '.html'
78
- ListSpider.add_task(TaskStruct.new(href, local_path))
79
- end
80
- end
81
-
82
- # get_one是封装了get_list的简化形式,方便一个任务时调用
83
- ListSpider.get_one(
84
- TaskStruct.new(
85
- 'https://coolshell.cn/',
86
- DOWNLOAD_DIR + 'index.html',
87
- parse_method: method(:parse_index_item)
88
- ),
89
- max: 60
90
- )
91
- ```
92
-
93
- ## get_list/get_one参数
94
- ```
95
- # down_list: 要请求的TaskStruct数组
96
- # interval: 任务间隔,默认为0。若参数为Range对象,则随机间隔Range范围内的秒数。若设为RANDOM_TIME则随机间隔3到10秒。
97
- # max: 最大并发数,默认为50。若设为NO_LIMIT_CONCURRENT,则所有请求任务全部一起并发执行
98
-
99
- get_list(down_list, interval: DEFAULT_INTERVAL, max: DEFAULT_CONCURRNET_MAX)
100
- get_one(task, interval: DEFAULT_INTERVAL, max: DEFAULT_CONCURRNET_MAX)
101
- ```
102
-
103
- ## 下面是TaskStruct可以设置的选项,与[em-http-request](https://github.com/igrigorik/em-http-request)基本一致
104
-
105
- ```ruby
106
- new(href, # 请求链接
107
- local_path, # 保存数据的本地路径(此路径作为去重标准)
108
- # http方法,取值::get, :head, :delete, :put, :post, :patch, :options
109
- http_method: :get,
110
- custom_data: nil, # 自定义数据
111
- parse_method: nil, # 解析保存文件的回调,参数是TaskStruct对象本身
112
- # 请求成功后的回调,此时可能没有保存文件,比如301,404
113
- # 参数是TaskStruct对象本身和对应的EventMachine::HttpRequest对象
114
- # http_req.response_header.status 状态码
115
- # http_req.response_header 返回头
116
- # http_req.response 返回体
117
- callback: nil,
118
- # 请求失败后的回调
119
- # 参数是TaskStruct对象本身和对应的EventMachine::HttpRequest对象
120
- errback: nil,
121
- stream_callback: nil, # 流数据处理回调
122
- convert_to_utf8: false, # 是否转换为utf8编码
123
- overwrite_exist: false, # 是否覆盖现有文件
124
- # 请求设置
125
- redirects: 3, # 重定向次数
126
- keepalive: nil, # (暂不支持复用)
127
- file: nil, # 要上传的文件路径
128
- path: nil, # 请求路径,在流水线方式请求时有用(暂不支持)
129
- query: nil, # 查询字符串,可以是string或hash类型
130
- body: nil, # 请求体,可以是string或hash类型
131
- head: nil, # 请求头
132
- # 连接设置
133
- connect_timeout: 60, # 连接超时时间
134
- inactivity_timeout: nil, # 连接后超时时间
135
- # ssl设置
136
- # ssl: {
137
- # :private_key_file => '/tmp/server.key',
138
- # :cert_chain_file => '/tmp/server.crt',
139
- # :verify_peer => false
140
- # }
141
- ssl: nil,
142
- # bind: {
143
- # :host => '123.123.123.123', # use a specific interface for outbound request
144
- # :port => '123'
145
- # }
146
- bind: nil,
147
- # 代理设置
148
- # proxy: {
149
- # :host => '127.0.0.1', # proxy address
150
- # :port => 9000, # proxy port
151
- # :type => :socks5 # default proxy mode is HTTP proxy, change to :socks5 if required
152
-
153
- # :authorization => ['user', 'pass'] # proxy authorization header
154
- # }
155
- proxy: nil)
156
- ```
157
-
158
- ## 回调函数形式
159
-
160
- ```ruby
161
- # 文件成功保存后调用,通过parse_method参数传入
162
- def parse_eresponse(task_struct)
163
- # ...
164
- end
165
-
166
- # http请求成功后调用,通过callback参数传入
167
- def call_back(task_struct, http_req)
168
- # http_req 是EventMachine::HttpRequest对象
169
- # http_req.response_header.status
170
- # ...
171
- end
172
-
173
- # http请求出错后调用,通过errback参数传入
174
- def err_back(task_struct, http_req)
175
- # ...
176
- end
177
- ```
178
-
179
- ## License
180
-
181
- (MIT License) - Copyright (c) 2016 Charles Zhang
1
+ # 关于list_spider
2
+
3
+ list_spider是一个基于[em-http-request](https://github.com/igrigorik/em-http-request)的爬虫工具。
4
+
5
+ 许多情况下,爬虫的工作是爬取链接,解析返回数据,从中提取链接,继续爬取,list_spider就是适用这种场景的爬虫工具。
6
+
7
+ ## 功能特点
8
+ * 去重过滤 (使用本地文件路径做唯一性校验)。
9
+
10
+ * 支持UTF-8编码转换。
11
+
12
+ * 默认增量爬取,已爬取的不再重复爬取(可以通过选项强制重新获取)。
13
+
14
+ * 自由设置最大并发数和爬取任务间隔时间。
15
+
16
+ * 支持http所有选项设置。
17
+
18
+ ## 开始
19
+
20
+ ```ruby
21
+ gem install list_spider
22
+ ```
23
+
24
+ 或者添加到Gemfile
25
+
26
+ ```ruby
27
+ gem 'list_spider'
28
+ ```
29
+
30
+ ## 使用方法
31
+ ```ruby
32
+ require 'list_spider'
33
+
34
+ DOWNLOAD_DIR = 'coolshell/'.freeze
35
+
36
+ @next_list = []
37
+
38
+ def parse_index_item(e)
39
+ content = File.read(e.local_path)
40
+ doc = Nokogiri::HTML(content)
41
+ list_group = doc.css('h2.entry-title')
42
+ link_list = list_group.css('a')
43
+
44
+ link_list.each do |link|
45
+ href = link['href']
46
+ local_path = DOWNLOAD_DIR + link.content + '.html'
47
+ # 可以存入数据库后续处理
48
+ @next_list << TaskStruct.new(href, local_path)
49
+ end
50
+ end
51
+
52
+ task_list = []
53
+ task_list << TaskStruct.new(
54
+ 'https://coolshell.cn/',
55
+ DOWNLOAD_DIR + 'index.html',
56
+ parse_method: method(:parse_index_item)
57
+ )
58
+
59
+ ListSpider.get_list(task_list)
60
+ ListSpider.get_list(@next_list, max: 60)
61
+ ```
62
+
63
+ ## 或者使用更简单的一步完成
64
+ ```ruby
65
+ require 'list_spider'
66
+
67
+ DOWNLOAD_DIR = 'coolshell/'.freeze
68
+
69
+ def parse_index_item(e)
70
+ content = File.read(e.local_path)
71
+ doc = Nokogiri::HTML(content)
72
+ list_group = doc.css('h2.entry-title')
73
+ link_list = list_group.css('a')
74
+
75
+ link_list.each do |link|
76
+ href = link['href']
77
+ local_path = DOWNLOAD_DIR + link.content + '.html'
78
+ ListSpider.add_task(TaskStruct.new(href, local_path))
79
+ end
80
+ end
81
+
82
+ # get_one是封装了get_list的简化形式,方便一个任务时调用
83
+ ListSpider.get_one(
84
+ TaskStruct.new(
85
+ 'https://coolshell.cn/',
86
+ DOWNLOAD_DIR + 'index.html',
87
+ parse_method: method(:parse_index_item)
88
+ ),
89
+ max: 60
90
+ )
91
+ ```
92
+
93
+ ## get_list/get_one参数
94
+ ```
95
+ # down_list: 要请求的TaskStruct数组
96
+ # interval: 任务间隔,默认为0。若参数为Range对象,则随机间隔Range范围内的秒数。若设为RANDOM_TIME则随机间隔3到10秒。
97
+ # max: 最大并发数,默认为50。若设为NO_LIMIT_CONCURRENT,则所有请求任务全部一起并发执行
98
+
99
+ get_list(down_list, interval: DEFAULT_INTERVAL, max: DEFAULT_CONCURRNET_MAX)
100
+ get_one(task, interval: DEFAULT_INTERVAL, max: DEFAULT_CONCURRNET_MAX)
101
+ ```
102
+
103
+ ## 下面是TaskStruct可以设置的选项,与[em-http-request](https://github.com/igrigorik/em-http-request)基本一致
104
+
105
+ ```ruby
106
+ new(href, # 请求链接
107
+ local_path, # 保存数据的本地路径(此路径作为去重标准)
108
+ # http方法,取值::get, :head, :delete, :put, :post, :patch, :options
109
+ http_method: :get,
110
+ custom_data: nil, # 自定义数据
111
+ parse_method: nil, # 解析保存文件的回调,参数是TaskStruct对象本身
112
+ # 请求成功后的回调,此时可能没有保存文件,比如301,404
113
+ # 参数是TaskStruct对象本身和对应的EventMachine::HttpRequest对象
114
+ # http_req.response_header.status 状态码
115
+ # http_req.response_header 返回头
116
+ # http_req.response 返回体
117
+ callback: nil,
118
+ # 请求失败后的回调
119
+ # 参数是TaskStruct对象本身和对应的EventMachine::HttpRequest对象
120
+ errback: nil,
121
+ stream_callback: nil, # 流数据处理回调
122
+ convert_to_utf8: false, # 是否转换为utf8编码
123
+ overwrite_exist: false, # 是否覆盖现有文件
124
+ # 请求设置
125
+ redirects: 3, # 重定向次数
126
+ keepalive: nil, # (暂不支持复用)
127
+ file: nil, # 要上传的文件路径
128
+ path: nil, # 请求路径,在流水线方式请求时有用(暂不支持)
129
+ query: nil, # 查询字符串,可以是string或hash类型
130
+ body: nil, # 请求体,可以是string或hash类型
131
+ head: nil, # 请求头
132
+ # 连接设置
133
+ connect_timeout: 60, # 连接超时时间
134
+ inactivity_timeout: nil, # 连接后超时时间
135
+ # ssl设置
136
+ # ssl: {
137
+ # :private_key_file => '/tmp/server.key',
138
+ # :cert_chain_file => '/tmp/server.crt',
139
+ # :verify_peer => false
140
+ # }
141
+ ssl: nil,
142
+ # bind: {
143
+ # :host => '123.123.123.123', # use a specific interface for outbound request
144
+ # :port => '123'
145
+ # }
146
+ bind: nil,
147
+ # 代理设置
148
+ # proxy: {
149
+ # :host => '127.0.0.1', # proxy address
150
+ # :port => 9000, # proxy port
151
+ # :type => :socks5 # default proxy mode is HTTP proxy, change to :socks5 if required
152
+
153
+ # :authorization => ['user', 'pass'] # proxy authorization header
154
+ # }
155
+ proxy: nil)
156
+ ```
157
+
158
+ ## 回调函数形式
159
+
160
+ ```ruby
161
+ # 文件成功保存后调用,通过parse_method参数传入
162
+ def parse_eresponse(task_struct)
163
+ # ...
164
+ end
165
+
166
+ # http请求成功后调用,通过callback参数传入
167
+ def call_back(task_struct, http_req)
168
+ # http_req 是EventMachine::HttpRequest对象
169
+ # http_req.response_header.status
170
+ # ...
171
+ end
172
+
173
+ # http请求出错后调用,通过errback参数传入
174
+ def err_back(task_struct, http_req)
175
+ # ...
176
+ end
177
+ ```
178
+
179
+ ## License
180
+
181
+ (MIT License) - Copyright (c) 2016 Charles Zhang
data/Rakefile CHANGED
@@ -1,2 +1,2 @@
1
- require 'bundler/gem_tasks'
2
- task default: :spec
1
+ require 'bundler/gem_tasks'
2
+ task default: :spec
data/bin/console CHANGED
@@ -1,14 +1,14 @@
1
- #!/usr/bin/env ruby
2
-
3
- require 'bundler/setup'
4
- require 'list_spider'
5
-
6
- # You can add fixtures and/or initialization code here to make experimenting
7
- # with your gem easier. You can also use a different console, if you like.
8
-
9
- # (If you use this, don't forget to add pry to your Gemfile!)
10
- # require "pry"
11
- # Pry.start
12
-
13
- require 'irb'
14
- IRB.start(__FILE__)
1
+ #!/usr/bin/env ruby
2
+
3
+ require 'bundler/setup'
4
+ require 'list_spider'
5
+
6
+ # You can add fixtures and/or initialization code here to make experimenting
7
+ # with your gem easier. You can also use a different console, if you like.
8
+
9
+ # (If you use this, don't forget to add pry to your Gemfile!)
10
+ # require "pry"
11
+ # Pry.start
12
+
13
+ require 'irb'
14
+ IRB.start(__FILE__)
data/bin/setup CHANGED
@@ -1,8 +1,8 @@
1
- #!/usr/bin/env bash
2
- set -euo pipefail
3
- IFS=$'\n\t'
4
- set -vx
5
-
6
- bundle install
7
-
8
- # Do any other automated setup that you need to do here
1
+ #!/usr/bin/env bash
2
+ set -euo pipefail
3
+ IFS=$'\n\t'
4
+ set -vx
5
+
6
+ bundle install
7
+
8
+ # Do any other automated setup that you need to do here
data/check_code.sh CHANGED
@@ -1,3 +1,3 @@
1
- #!/bin/sh
2
-
1
+ #!/bin/sh
2
+
3
3
  rubocop -a -D -f simple -o rubocopresult
data/lib/file_filter.rb CHANGED
@@ -1,72 +1,72 @@
1
-
2
- class FileFilter
3
- # 4033
4
- # 920
5
- def initialize(dir_pattern, size_threshold: 1000,
6
- cust_judge: nil, process_block: nil)
7
- @dir_pattern = dir_pattern
8
- @size_threshold = size_threshold
9
- @cust_judge = cust_judge ? cust_judge : method(:default_judge)
10
- @total = 0
11
- @process_block = process_block
12
- end
13
-
14
- def default_judge(f)
15
- File.size(f) <= @size_threshold
16
- end
17
-
18
- def filter_file(f)
19
- if @cust_judge.call(f)
20
- @total += 1
21
- @process_block.call(f)
22
- end
23
- end
24
-
25
- def start
26
- Dir.glob(@dir_pattern) do |f|
27
- filter_file(f)
28
- end
29
- puts "total:#{@total}"
30
- end
31
-
32
- def self.delete(dir_pattern, size_threshold: 1000, cust_judge: nil)
33
- FileFilter.new(
34
- dir_pattern,
35
- size_threshold: size_threshold,
36
- cust_judge: cust_judge,
37
- process_block:
38
- proc do |f|
39
- puts "deleted file: #{f}"
40
- File.delete(f)
41
- end
42
- ).start
43
- end
44
-
45
- def self.check(dir_pattern, size_threshold: 1000, cust_judge: nil)
46
- FileFilter.new(
47
- dir_pattern,
48
- size_threshold: size_threshold,
49
- cust_judge: cust_judge,
50
- process_block:
51
- proc do |f|
52
- puts "filterd file: #{f}"
53
- end
54
- ).start
55
- end
56
-
57
- def self.check_save_result(dir_pattern, save_file_name: 'filtered_file.txt',
58
- size_threshold: 1000, cust_judge: nil)
59
- result_file = File.open(save_file_name, 'wt')
60
- FileFilter.new(
61
- dir_pattern,
62
- size_threshold: size_threshold,
63
- cust_judge: cust_judge,
64
- process_block:
65
- proc do |f|
66
- puts "filterd file: #{f}"
67
- result_file << f << "\n"
68
- end
69
- ).start
70
- result_file.close
71
- end
72
- end
1
+
2
+ class FileFilter
3
+ # 4033
4
+ # 920
5
+ def initialize(dir_pattern, size_threshold: 1000,
6
+ cust_judge: nil, process_block: nil)
7
+ @dir_pattern = dir_pattern
8
+ @size_threshold = size_threshold
9
+ @cust_judge = cust_judge ? cust_judge : method(:default_judge)
10
+ @total = 0
11
+ @process_block = process_block
12
+ end
13
+
14
+ def default_judge(f)
15
+ File.size(f) <= @size_threshold
16
+ end
17
+
18
+ def filter_file(f)
19
+ if @cust_judge.call(f)
20
+ @total += 1
21
+ @process_block.call(f)
22
+ end
23
+ end
24
+
25
+ def start
26
+ Dir.glob(@dir_pattern) do |f|
27
+ filter_file(f)
28
+ end
29
+ puts "total:#{@total}"
30
+ end
31
+
32
+ def self.delete(dir_pattern, size_threshold: 1000, cust_judge: nil)
33
+ FileFilter.new(
34
+ dir_pattern,
35
+ size_threshold: size_threshold,
36
+ cust_judge: cust_judge,
37
+ process_block:
38
+ proc do |f|
39
+ puts "deleted file: #{f}"
40
+ File.delete(f)
41
+ end
42
+ ).start
43
+ end
44
+
45
+ def self.check(dir_pattern, size_threshold: 1000, cust_judge: nil)
46
+ FileFilter.new(
47
+ dir_pattern,
48
+ size_threshold: size_threshold,
49
+ cust_judge: cust_judge,
50
+ process_block:
51
+ proc do |f|
52
+ puts "filterd file: #{f}"
53
+ end
54
+ ).start
55
+ end
56
+
57
+ def self.check_save_result(dir_pattern, save_file_name: 'filtered_file.txt',
58
+ size_threshold: 1000, cust_judge: nil)
59
+ result_file = File.open(save_file_name, 'wt')
60
+ FileFilter.new(
61
+ dir_pattern,
62
+ size_threshold: size_threshold,
63
+ cust_judge: cust_judge,
64
+ process_block:
65
+ proc do |f|
66
+ puts "filterd file: #{f}"
67
+ result_file << f << "\n"
68
+ end
69
+ ).start
70
+ result_file.close
71
+ end
72
+ end