baiduserp 2.1.14 → 2.2.9
- checksums.yaml +4 -4
- data/README.md +1 -231
- data/bin/baiduserp +39 -33
- data/lib/baiduserp.rb +6 -1
- data/lib/baiduserp/analyser-migrations/001_create_keywords_table.rb +17 -0
- data/lib/baiduserp/analyser-migrations/002_create_htmls_table.rb +17 -0
- data/lib/baiduserp/analyser-migrations/003_create_serps_table.rb +16 -0
- data/lib/baiduserp/analyser.rb +69 -0
- data/lib/baiduserp/client.rb +4 -0
- data/lib/baiduserp/parser.rb +1 -1
- data/lib/{parsers → baiduserp/parser}/ads_right.rb +0 -0
- data/lib/{parsers → baiduserp/parser}/ads_top.rb +4 -1
- data/lib/{parsers → baiduserp/parser}/con_ar.rb +0 -0
- data/lib/baiduserp/parser/pinpaizhuanqu.rb +8 -0
- data/lib/{parsers → baiduserp/parser}/ranks.rb +4 -1
- data/lib/{parsers → baiduserp/parser}/related_keywords.rb +0 -0
- data/lib/{parsers → baiduserp/parser}/result_num.rb +0 -0
- data/lib/{parsers → baiduserp/parser}/right_hotel.rb +0 -0
- data/lib/{parsers → baiduserp/parser}/right_personinfo.rb +0 -0
- data/lib/{parsers → baiduserp/parser}/right_relaperson.rb +0 -0
- data/lib/{parsers → baiduserp/parser}/right_weather.rb +0 -0
- data/lib/{parsers → baiduserp/parser}/zhixin.rb +0 -0
- data/lib/baiduserp/version.rb +1 -1
- metadata +46 -14
- data/lib/parsers/pinpaizhuanqu.rb +0 -5
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 1d1c2a61259d5e3134d0566e002ba551a791b06c
|
4
|
+
data.tar.gz: 8b3f78ad2190a17a2f530019c0b0719bb9abfb93
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 859d3c01765741cfe50eed92a05f5cbec93c68c003d13d2a2dad2e6b85e22d63d2171ded85fa3cd2271d7b830c522bfa95b3b27f91a269af621f811f23c3a87a
|
7
|
+
data.tar.gz: 5847dd8a81f4c61a0f72d2d8a2e6cd91e4aa1e9eed32f3c495453e31fe006a5c8162f902ebf2e831adb16731114fcb6b7d4a4867ed8ce575aecc39e2feca14c4
|
data/README.md
CHANGED
@@ -1,231 +1 @@
|
|
1
|
-
|
2
|
-
|
3
|
-
This gem is built specifically to parse Baidu search result pages, with the goal of extracting as much information as possible from a SERP.
|
4
|
-
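(Note: this is currently not a program for batch-processing keyword rankings, but it can serve as the Baidu SERP parsing module inside a batch rank-checking tool.)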
|
5
|
-
|
6
|
-
## Features
|
7
|
-
### Parses SERP results as completely as possible
|
8
|
-
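Baidu's SERP pages keep getting more complex: new result styles keep appearing in the left column, and a lot of content has been added to the right column as well.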
|
9
|
-
This gem parses a SERP page into Ruby data structures.
|
10
|
-
|
11
|
-
When analysing a SERP you may want to examine many kinds of information on the page, such as SEO rankings, SEM rankings, competitor rankings,
|
12
|
-
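title/description text, related keywords, the recommendations in the right column, whether a Baidu Open Platform result is present, and so on.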
|
13
|
-
|
14
|
-
This gem parses out all of the information above for later analysis. If usage grows, or Baidu launches new products,
|
15
|
-
parsers for new modules can be added as well.
|
16
|
-
|
17
|
-
|
18
|
-
### Command-line interface (useful for testing, and can output JSON)
|
19
|
-
Besides the Ruby API, users of other programming languages can call the command-line interface and get the results as JSON.
|
20
|
-
See below for detailed usage.
|
21
|
-
|
22
|
-
### Known issues
|
23
|
-
This is currently only a basically usable version and may have all sorts of problems; bug reports are welcome. [Known issues list](https://github.com/semseo/baiduserp/issues).
|
24
|
-
|
25
|
-
## Installation
|
26
|
-
|
27
|
-
1 System requirements
|
28
|
-
|
29
|
-
Linux or Mac. On Linux, a recent Ubuntu or Fedora release is recommended.
|
30
|
-
|
31
|
-
2 Install Ruby
|
32
|
-
|
33
|
-
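Only Ruby 1.9 and above is supported. The best way to install Ruby is via [RVM](https://rvm.io/); for how to use RVM, see [http://ruby-china.org/wiki/install_ruby_guide](http://ruby-china.org/wiki/install_ruby_guide), which also installs some Rails-related software you do not need but is very detailed.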
|
34
|
-
|
35
|
-
On recent Ubuntu or Fedora releases, Ruby 1.9 can also be installed with apt-get or yum.
|
36
|
-
|
37
|
-
3 Install gem dependencies
|
38
|
-
|
39
|
-
The gem depends on nokogiri, which in turn needs two system libraries.
|
40
|
-
So on Ubuntu or Fedora you need to run:
|
41
|
-
|
42
|
-
```
|
43
|
-
$ sudo apt-get install libxslt-dev libxslt libxml2-dev libxml2 # ubuntu
|
44
|
-
$ sudo yum install libxml2-devel libxml2 libxslt libxslt-devel # fedora
|
45
|
-
```
|
46
|
-
|
47
|
-
Once the dependencies above are installed:
|
48
|
-
|
49
|
-
`$ gem install nokogiri`
|
50
|
-
|
51
|
-
4 Finally, install the baiduserp gem
|
52
|
-
|
53
|
-
`$ gem install baiduserp`
|
54
|
-
|
55
|
-
## Usage
|
56
|
-
|
57
|
-
Ruby code example
|
58
|
-
|
59
|
-
```
|
60
|
-
require 'baiduserp'
|
61
|
-
require 'open-uri'
|
62
|
-
require 'pp'
|
63
|
-
|
64
|
-
pp Baiduserp.search 'keyword'
|
65
|
-
|
66
|
-
pp Baiduserp.parse open(http://www.baidu.com/s?wd=keyword).read.encode('UTF-8')
|
67
|
-
|
68
|
-
```
|
69
|
-
|
70
|
-
For non-Ruby programs and one-off debugging, a command-line tool is also provided; it can exchange data as JSON:
|
71
|
-
|
72
|
-
```
|
73
|
-
$ baiduserp -h
|
74
|
-
Usage:
|
75
|
-
1. baiduserp -s 'keyword' # search 'keyword' and print parse result
|
76
|
-
2. baiduserp -s 'keyword' -o output.json # -o means save result to a file
|
77
|
-
3. baiduserp -f 'file path' # parse html source code from file
|
78
|
-
4. baiduserp -s 'keyword' -j # search 'keyword' and print parse result in JSON format
|
79
|
-
-s, --search Keyword Search Keyword & Parse SERP
|
80
|
-
-j, --jsonprint Print result in JSON format
|
81
|
-
-o, --output Output Save Result to File in JSON format
|
82
|
-
-f, --file File Parse Local File
|
83
|
-
```
|
84
|
-
|
85
|
-
The final result is a data structure of nested hashes and arrays. A sample result follows:
|
86
|
-
|
87
|
-
```
|
88
|
-
$ baiduserp -s 香港
|
89
|
-
{:ads_right=>
|
90
|
-
[{:rank=>1,
|
91
|
-
:title=>"预订香港酒店上携程,全景图..",
|
92
|
-
:content=>"订香港酒店,享受有房保障,服务好,折扣低,返现高达201元,订香港酒店上携程超划算!",
|
93
|
-
:site=>"www.ctrip.com"},
|
94
|
-
{:rank=>2,
|
95
|
-
:title=>"香港-去哪儿网度假频道,聪明..",
|
96
|
-
:content=>"香港-去哪儿网度假频道,比价首选!180000条报价实时更新,先比价后出行!",
|
97
|
-
:site=>"dujia.qunar.com"},
|
98
|
-
{:rank=>3,
|
99
|
-
:title=>"香港香港怎么玩最划算?",
|
100
|
-
:content=>"深圳旅行社香港,香港旅游您的超值之选!天天出团港澳游专线,缤纷全程绝不",
|
101
|
-
:site=>"www.sztygl128.com"},
|
102
|
-
{:rank=>4,
|
103
|
-
:title=>"香港旅游攻略-150000条点评",
|
104
|
-
:content=>"还没来过香港?507个景点都玩遍?到到网告诉你网友怎么玩(150000张游记照片)!",
|
105
|
-
:site=>"www.daodao.com"},
|
106
|
-
{:rank=>5,
|
107
|
-
:title=>"香港旅游首选北京青年旅行社..",
|
108
|
-
:content=>"北京青年旅行社专业香港旅游旅行社,高品质服务,天天折扣价,",
|
109
|
-
:site=>"www.hqly8.com"},
|
110
|
-
{:rank=>6,
|
111
|
-
:title=>"香港旅游特价啦!香港旅游价..",
|
112
|
-
:content=>"本社邀你一起体验超值香港旅游,全程绝无强制购物,行程安排合理.",
|
113
|
-
:site=>"www.cctbj.net"},
|
114
|
-
{:rank=>7,
|
115
|
-
:title=>"香港旅游线路",
|
116
|
-
:content=>"北京旅行社提供香港咨询服务,多条精品旅游线路供您选择.",
|
117
|
-
:site=>"www.ctslyw.com"},
|
118
|
-
{:rank=>8,
|
119
|
-
:title=>"全新香港旅游报价,香港旅游..",
|
120
|
-
:content=>"北京国际旅行社,精选多条香港旅游线路,信誉保证,全程无隐性消费.",
|
121
|
-
:site=>"www.quly8.net"}],
|
122
|
-
:ads_top=>
|
123
|
-
[{:rank=>1,
|
124
|
-
:title=>"香港酒店预订 在Agoda立享1-7折",
|
125
|
-
:content=>"香港酒店预订,尽在Agoda,网上订购低价回馈,为您节省75%.",
|
126
|
-
:site=>"www.agoda.com"},
|
127
|
-
{:rank=>2,
|
128
|
-
:title=>"香港酒店预订 在Agoda立享1-7折",
|
129
|
-
:content=>"香港酒店预订,尽在Agoda,网上订购低价回馈,为您节省75%.",
|
130
|
-
:site=>"www.agoda.com"}],
|
131
|
-
:pinpaizhuanqu=>false,
|
132
|
-
:ranks=>
|
133
|
-
[{:rank=>1,
|
134
|
-
:url=>
|
135
|
-
"http://baike.baidu.com/link?url=Ujomxkw-4Whq7C7TI6do9nxHr3G0sO6ywJ3SZfr-lX4qQiht-2rnuGomrclwc4bJ",
|
136
|
-
:title=>"香港_百度百科",
|
137
|
-
:content=>nil,
|
138
|
-
:mu=>"http://baike.baidu.com/view/2607.htm",
|
139
|
-
:baiduopen=>false},
|
140
|
-
{:rank=>2,
|
141
|
-
:url=>"http://lvyou.baidu.com/xianggang/",
|
142
|
-
:title=>"2013香港旅游攻略_香港景点线路游记_百度旅游",
|
143
|
-
:content=>nil,
|
144
|
-
:mu=>"http://lvyou.baidu.com/xianggang/",
|
145
|
-
:baiduopen=>false},
|
146
|
-
{:rank=>3,
|
147
|
-
:url=>
|
148
|
-
"http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&fr=ala1&word=%CF%E3%B8%DB",
|
149
|
-
:title=>"香港_百度图片 - 举报图片",
|
150
|
-
:content=>nil,
|
151
|
-
:mu=>
|
152
|
-
"http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&fr=ala1&word=%CF%E3%B8%DB",
|
153
|
-
:baiduopen=>false},
|
154
|
-
{:rank=>4,
|
155
|
-
:url=>"http://www.gov.hk/sc/residents/",
|
156
|
-
:title=>"GovHK 香港政府一站通:本港居民",
|
157
|
-
:content=>
|
158
|
-
"香港政府为当地居民提供的资讯和服务,内容包括通讯及科技、文化、康乐及运动、教育及培训、就业、环境、政府、法律及治安、保健及医疗服务、房屋及社会服务、入境事务...",
|
159
|
-
:mu=>nil,
|
160
|
-
:baiduopen=>false},
|
161
|
-
{:rank=>5,
|
162
|
-
:url=>"http://www.baidu.com/s?rtt=2&tn=baiduwb&rn=20&cl=2&wd=%CF%E3%B8%DB",
|
163
|
-
:title=>"香港的最新微博结果",
|
164
|
-
:content=>nil,
|
165
|
-
:mu=>"http://www.baidu.com/s?rtt=2&tn=baiduwb&rn=20&cl=2&wd=%CF%E3%B8%DB",
|
166
|
-
:baiduopen=>false},
|
167
|
-
{:rank=>6,
|
168
|
-
:url=>"http://tieba.baidu.com/f?kw=%CF%E3%B8%DB&fr=ala0",
|
169
|
-
:title=>"香港吧 百度贴吧",
|
170
|
-
:content=>
|
171
|
-
"月活跃用户:38万人 累计发贴:202万 图片(1856) | 视频(61) | 精品贴(335) 香港和上海的夜景那个美?????????? 点击:439 回复:259 最近怎么那么多自以为漂亮的S!B女求认证啊。 点击:303 回复:69 什么时候,去香港不必签证,和去北京上海一样容... 点击:839 回复:187 查看更多香港吧内容>> tieba.baidu.com/香港?fr=ala0 2013-10-18",
|
172
|
-
:mu=>"http://tieba.baidu.com/f?kw=%CF%E3%B8%DB&fr=ala0",
|
173
|
-
:baiduopen=>false},
|
174
|
-
{:rank=>7,
|
175
|
-
:url=>"http://www.weather.com.cn/html/weather/101320101.shtml",
|
176
|
-
:title=>"香港天气预报_一周天气预报_中国天气网 - 最近访问:",
|
177
|
-
:content=>nil,
|
178
|
-
:mu=>"http://www.weather.com.cn/html/weather/101320101.shtml",
|
179
|
-
:baiduopen=>true},
|
180
|
-
{:rank=>8,
|
181
|
-
:url=>"http://www.mafengwo.cn/travel-scenic-spot/mafengwo/10189.html",
|
182
|
-
:title=>"2013香港旅游攻略,香港自助游攻略,蚂蜂窝香港出游攻略游记 - 蚂蜂窝",
|
183
|
-
:content=>
|
184
|
-
"在香港寻吃完全就是一场舌尖的盛宴,从街边小吃到世界顶级的米其林餐厅任您选择,茶餐厅、早茶、烧腊和及甜品极具港式风味,世界各地的美食料理也一个不落单。 香港...",
|
185
|
-
:mu=>nil,
|
186
|
-
:baiduopen=>false},
|
187
|
-
{:rank=>9,
|
188
|
-
:url=>"http://hongkong.cncn.com/",
|
189
|
-
:title=>"香港旅游攻略_香港香港旅游景点_香港旅游网",
|
190
|
-
:content=>
|
191
|
-
"香港欣欣旅游网,提供香港香港旅游景点推荐、10月香港旅游攻略、香港旅行社、香港旅游线路、香港酒店预订、香港旅游地图等出行指南及旅游服务●欣欣旅游网 CNCN.com ...",
|
192
|
-
:mu=>nil,
|
193
|
-
:baiduopen=>false},
|
194
|
-
{:rank=>10,
|
195
|
-
:url=>"http://www.baidu.com/s?tn=baidurt&rtt=1&bsst=1&wd=%CF%E3%B8%DB",
|
196
|
-
:title=>"香港的最新相关信息",
|
197
|
-
:content=>nil,
|
198
|
-
:mu=>"http://www.baidu.com/s?tn=baidurt&rtt=1&bsst=1&wd=%CF%E3%B8%DB",
|
199
|
-
:baiduopen=>false}],
|
200
|
-
:related_keywords=>
|
201
|
-
["香港电影",
|
202
|
-
"香港旅游",
|
203
|
-
"香港天气",
|
204
|
-
"香港大学",
|
205
|
-
"香港购物",
|
206
|
-
"香港地图",
|
207
|
-
"香港电视剧",
|
208
|
-
"香港地铁",
|
209
|
-
"香港中文大学",
|
210
|
-
"香港苹果官网"],
|
211
|
-
:result_num=>100000000,
|
212
|
-
:right_hotel=>nil,
|
213
|
-
:right_personinfo=>nil,
|
214
|
-
:right_relaperson=>
|
215
|
-
[{:title=>"香港特别行政区行政区划", :names=>["油尖旺区", "九龙城区", "湾仔", "元朗区", "西贡区"]},
|
216
|
-
{:title=>"全球性国际金融中心", :names=>["新加坡", "纽约", "东京", "伦敦"]},
|
217
|
-
{:title=>"其他人还搜", :names=>["台北", "上海", "直布罗陀", "海南", "深圳"]}],
|
218
|
-
:right_weather=>nil}
|
219
|
-
```
|
220
|
-
|
221
|
-
## Contributing
|
222
|
-
|
223
|
-
Contributions to help improve this gem are welcome:
|
224
|
-
|
225
|
-
1. Fork it
|
226
|
-
2. Create your feature branch (`git checkout -b my-new-feature`)
|
227
|
-
3. Commit your changes (`git commit -am 'Add some feature'`)
|
228
|
-
4. Push to the branch (`git push origin my-new-feature`)
|
229
|
-
5. Create new Pull Request
|
230
|
-
|
231
|
-
Alternatively, open an issue on the Issues page: bug reports, feature requests, suggestions, and so on.
|
1
|
+
Parses Baidu search result pages and returns structured data for further analysis.
|
data/bin/baiduserp
CHANGED
@@ -4,50 +4,56 @@ require 'baiduserp'
|
|
4
4
|
require 'optparse'
|
5
5
|
require 'json'
|
6
6
|
require 'pp'
|
7
|
+
require 'docopt'
|
7
8
|
|
8
|
-
|
9
|
+
cmd = File.basename(__FILE__)
|
10
|
+
|
11
|
+
doc = <<DOCOPT
|
9
12
|
1. baiduserp -s 'keyword' # search 'keyword' and print parse result
|
10
13
|
2. baiduserp -s 'keyword' -o output.json # -o means save result to a file
|
11
14
|
3. baiduserp -f 'file path' # parse html source code from file
|
12
15
|
4. baiduserp -s 'keyword' -j # search 'keyword' and print parse result in JSON format
|
13
|
-
"
|
14
|
-
|
15
|
-
options = {}
|
16
|
-
OptionParser.new do |opts|
|
17
|
-
opts.banner = usage
|
18
|
-
|
19
|
-
opts.on("-s Keyword", "--search Keyword", "Search Keyword & Parse SERP") do |v|
|
20
|
-
options[:keyword] = v
|
21
|
-
end
|
22
|
-
|
23
|
-
opts.on("-j","--jsonprint","Print result in JSON format") do |v|
|
24
|
-
options[:jsonprint] = v
|
25
|
-
end
|
26
16
|
|
27
|
-
|
28
|
-
|
29
|
-
|
30
|
-
|
31
|
-
|
32
|
-
|
33
|
-
|
34
|
-
|
17
|
+
Usage:
|
18
|
+
#{cmd} [options]
|
19
|
+
|
20
|
+
Options:
|
21
|
+
-h --help show this help message and exit
|
22
|
+
-v --version show version and exit
|
23
|
+
-a --analyse Name analyse as the given name
|
24
|
+
--keywords File uses with -a, import give keywords File before search
|
25
|
+
-s --search Keyword search Keyword and show result
|
26
|
+
-f --file File parse local file or given url
|
27
|
+
-j --json print JSON output
|
28
|
+
-o --output File output JSON result to File
|
29
|
+
|
30
|
+
DOCOPT
|
31
|
+
|
32
|
+
begin
|
33
|
+
options = Docopt::docopt(doc, version: Baiduserp::VERSION)
|
34
|
+
# pp options
|
35
|
+
rescue Docopt::Exit => e
|
36
|
+
puts e.message
|
37
|
+
end
|
35
38
|
|
36
39
|
result = ''
|
37
|
-
|
38
|
-
|
39
|
-
|
40
|
+
if options['--analyse']
|
41
|
+
analyse = Baiduserp.analyse(options['--analyse'])
|
42
|
+
analyse.import_keywords(options('--keywords'))
|
43
|
+
analyse.search
|
44
|
+
result = 'Analyse finished!'
|
45
|
+
elsif options['--search']
|
46
|
+
result = Baiduserp.search options['--search']
|
47
|
+
elsif options['--file']
|
48
|
+
result = Baiduserp.parse_file options['--file']
|
40
49
|
else
|
41
|
-
|
50
|
+
puts "At least given one of -a/-s/-f"
|
42
51
|
end
|
43
52
|
|
44
|
-
if options[
|
45
|
-
|
46
|
-
pp result
|
47
|
-
else
|
48
|
-
puts result.to_json
|
49
|
-
end
|
53
|
+
if options['--json']
|
54
|
+
puts result.to_json
|
50
55
|
else
|
51
|
-
|
56
|
+
pp result
|
52
57
|
end
|
53
58
|
|
59
|
+
open(options['--output'],'w').puts result.to_json if options['--output']
|
data/lib/baiduserp.rb
CHANGED
@@ -1,5 +1,6 @@
|
|
1
1
|
require "baiduserp/version"
|
2
|
-
require
|
2
|
+
require "baiduserp/parser"
|
3
|
+
require "baiduserp/analyser"
|
3
4
|
|
4
5
|
module Baiduserp
|
5
6
|
def self.search(keyword,page=1)
|
@@ -17,4 +18,8 @@ module Baiduserp
|
|
17
18
|
def self.parse_file(file_path)
|
18
19
|
Parser.new.parse_file file_path
|
19
20
|
end
|
21
|
+
|
22
|
+
def self.analyse(name,attrs={})
|
23
|
+
Analyser.new(name,attrs)
|
24
|
+
end
|
20
25
|
end
|
data/lib/baiduserp/analyser.rb
ADDED
@@ -0,0 +1,69 @@
|
|
1
|
+
require 'sequel'
|
2
|
+
require 'csv'
|
3
|
+
require 'date'
|
4
|
+
require 'yaml'
|
5
|
+
|
6
|
+
module Baiduserp
|
7
|
+
class Analyser
|
8
|
+
# Dir[File.expand_path('../analyser/*.rb', __FILE__)].each{|f| require f}
|
9
|
+
|
10
|
+
def initialize(name,attrs={})
|
11
|
+
@db_file = name + ".db"
|
12
|
+
@attrs = attrs
|
13
|
+
@keywords_imported = File.exists?(@db_file)
|
14
|
+
|
15
|
+
@db = Sequel.connect("sqlite://" + @db_file)
|
16
|
+
|
17
|
+
migrate!
|
18
|
+
|
19
|
+
@keywords = Class.new(Sequel::Model) do
|
20
|
+
set_dataset :keywords
|
21
|
+
end
|
22
|
+
|
23
|
+
@htmls = Class.new(Sequel::Model) do
|
24
|
+
set_dataset :htmls
|
25
|
+
end
|
26
|
+
|
27
|
+
@serps = Class.new(Sequel::Model) do
|
28
|
+
set_dataset :serps
|
29
|
+
end
|
30
|
+
|
31
|
+
import_keywords unless @keywords_imported
|
32
|
+
end
|
33
|
+
|
34
|
+
def run
|
35
|
+
|
36
|
+
end
|
37
|
+
|
38
|
+
def migrate!
|
39
|
+
Sequel.extension :migration, :core_extensions
|
40
|
+
Sequel::Migrator.apply(@db, File.expand_path('../analyser-migrations/',__FILE__))
|
41
|
+
end
|
42
|
+
|
43
|
+
def import_keywords(file=@attrs[:keywords])
|
44
|
+
CSV.foreach(file) do |l|
|
45
|
+
@keywords.insert(:term => l[0], :weight => l[1], :category => l[2])
|
46
|
+
end
|
47
|
+
end
|
48
|
+
|
49
|
+
def search(date=Date.today)
|
50
|
+
@keywords.each do |k|
|
51
|
+
next if @htmls.where(:date => date, :keyword_id => k[:id]).count > 0
|
52
|
+
puts k.to_hash
|
53
|
+
html = Baiduserp.get_search_html(k[:term])
|
54
|
+
serp = Baiduserp.parse(html)
|
55
|
+
@htmls.insert(:keyword_id => k[:id], :date => date, :content => html)
|
56
|
+
@serps.insert(:keyword_id => k[:id], :date => date, :content => YAML.dump(serp))
|
57
|
+
end
|
58
|
+
end
|
59
|
+
|
60
|
+
def _analyse_competitors(date=Date.today)
|
61
|
+
sites = Hash.new(0)
|
62
|
+
@serps.where(:date => date).each do |serp|
|
63
|
+
serp = YAML.load(serp[:content])
|
64
|
+
serp.sem_sites.each {|site| sites[site] += 1}
|
65
|
+
end
|
66
|
+
puts YAML.dump(sites)
|
67
|
+
end
|
68
|
+
end
|
69
|
+
end
|
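As an illustration of the column mapping that `import_keywords` applies in the hunk above, here is a minimal sketch using only Ruby's `csv` library. The helper `keyword_rows` and the sample file contents are hypothetical and not part of the gem; the sketch returns hashes rather than inserting into the Sequel `keywords` table.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical helper (not in the gem): builds the same hash per CSV row
# that import_keywords hands to @keywords.insert.
def keyword_rows(path)
  CSV.foreach(path).map do |row|
    { term: row[0], weight: row[1], category: row[2] }
  end
end

# Sample keywords file: term, weight, category (values are made up).
# The second row omits the category on purpose.
sample = Tempfile.new(['keywords', '.csv'])
sample.write("hong kong,10,travel\nhong kong weather,5\n")
sample.close

rows = keyword_rows(sample.path)
rows.each { |r| p r }
```

Note that CSV fields arrive as strings, and a row with fewer than three fields yields `nil` for the missing columns, so a keywords file with only a term column should still import under this mapping.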
data/lib/baiduserp/client.rb
CHANGED
@@ -41,6 +41,10 @@ module Baiduserp
|
|
41
41
|
response = self.class.get_serp(url)
|
42
42
|
end
|
43
43
|
|
44
|
+
if response.headers['Content-Length'].nil?
|
45
|
+
response = self.class.get_serp(url,retries)
|
46
|
+
end
|
47
|
+
|
44
48
|
if response.headers['Content-Length'].to_i != response.body.bytesize
|
45
49
|
issue_file = "/tmp/baiduserp_crawler_issue_#{Time.now.strftime("%Y%m%d%H%M%S")}.html"
|
46
50
|
open(issue_file,'w').puts(response.body)
|
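The length check in this hunk can be sketched in isolation as follows. This is a simplified illustration, not the gem's code: `Response` is a hypothetical stand-in for the HTTP response object, and `length_status` is an invented name that only classifies a response instead of re-requesting it or writing an issue file.

```ruby
# Hypothetical stand-in for the HTTP response object used by the client;
# only the two members read in the hunk above are modelled.
Response = Struct.new(:headers, :body)

# Classifies a response along the lines of the hunk:
#   :retry    - Content-Length header missing (the new branch re-requests)
#   :mismatch - declared length differs from the body size in bytes
#   :ok       - declared length matches
def length_status(response)
  declared = response.headers['Content-Length']
  return :retry if declared.nil?
  declared.to_i == response.body.bytesize ? :ok : :mismatch
end

p length_status(Response.new({}, 'abc'))
p length_status(Response.new({ 'Content-Length' => '5' }, 'abc'))
p length_status(Response.new({ 'Content-Length' => '3' }, 'abc'))
```

The comparison uses `bytesize` rather than `length` because `Content-Length` counts bytes, and the two differ for multi-byte text such as Chinese search results.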
data/lib/baiduserp/parser.rb
CHANGED
@@ -8,7 +8,7 @@ require 'baiduserp/result'
|
|
8
8
|
|
9
9
|
module Baiduserp
|
10
10
|
class Parser
|
11
|
-
Dir[File.expand_path('
|
11
|
+
Dir[File.expand_path('../parser/*.rb', __FILE__)].each{|f| require f}
|
12
12
|
|
13
13
|
def parse(html)
|
14
14
|
html = html.encode!('UTF-8','UTF-8',:invalid => :replace)
|
File without changes
|
data/lib/{parsers → baiduserp/parser}/ads_top.rb
CHANGED
@@ -3,7 +3,10 @@ class Baiduserp::Parser
|
|
3
3
|
result = []
|
4
4
|
rank = 0
|
5
5
|
|
6
|
-
file[:doc].search('div#content_left').first
|
6
|
+
part = file[:doc].search('div#content_left').first
|
7
|
+
return result if part.nil?
|
8
|
+
|
9
|
+
part.children.each do |div|
|
7
10
|
id = div['id'].to_i
|
8
11
|
break if id > 0 && id < 3000
|
9
12
|
next unless div['class'].to_s.include?('ec_pp_f')
|
File without changes
|
data/lib/{parsers → baiduserp/parser}/ranks.rb
CHANGED
@@ -1,7 +1,10 @@
|
|
1
1
|
class Baiduserp::Parser
|
2
2
|
def _parse_ranks(file)
|
3
3
|
result = []
|
4
|
-
file[:doc].search("div[@id='content_left']").first
|
4
|
+
part = file[:doc].search("div[@id='content_left']").first
|
5
|
+
return result if part.nil?
|
6
|
+
|
7
|
+
part.children.each do |table|
|
5
8
|
next if table.nil?
|
6
9
|
id = table['id'].to_i
|
7
10
|
next unless id > 0 && id < 3000
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
File without changes
|
data/lib/baiduserp/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: baiduserp
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 2.
|
4
|
+
version: 2.2.9
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- MingQian Zhang
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date: 2013-
|
11
|
+
date: 2013-12-02 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: nokogiri
|
@@ -52,6 +52,34 @@ dependencies:
|
|
52
52
|
- - '>='
|
53
53
|
- !ruby/object:Gem::Version
|
54
54
|
version: '0'
|
55
|
+
- !ruby/object:Gem::Dependency
|
56
|
+
name: sequel
|
57
|
+
requirement: !ruby/object:Gem::Requirement
|
58
|
+
requirements:
|
59
|
+
- - '>='
|
60
|
+
- !ruby/object:Gem::Version
|
61
|
+
version: '0'
|
62
|
+
type: :runtime
|
63
|
+
prerelease: false
|
64
|
+
version_requirements: !ruby/object:Gem::Requirement
|
65
|
+
requirements:
|
66
|
+
- - '>='
|
67
|
+
- !ruby/object:Gem::Version
|
68
|
+
version: '0'
|
69
|
+
- !ruby/object:Gem::Dependency
|
70
|
+
name: docopt
|
71
|
+
requirement: !ruby/object:Gem::Requirement
|
72
|
+
requirements:
|
73
|
+
- - '>='
|
74
|
+
- !ruby/object:Gem::Version
|
75
|
+
version: '0'
|
76
|
+
type: :runtime
|
77
|
+
prerelease: false
|
78
|
+
version_requirements: !ruby/object:Gem::Requirement
|
79
|
+
requirements:
|
80
|
+
- - '>='
|
81
|
+
- !ruby/object:Gem::Version
|
82
|
+
version: '0'
|
55
83
|
description: Parse Baidu SERP result page.
|
56
84
|
email:
|
57
85
|
- zmingqian@qq.com
|
@@ -60,24 +88,28 @@ executables:
|
|
60
88
|
extensions: []
|
61
89
|
extra_rdoc_files: []
|
62
90
|
files:
|
91
|
+
- lib/baiduserp/analyser-migrations/001_create_keywords_table.rb
|
92
|
+
- lib/baiduserp/analyser-migrations/002_create_htmls_table.rb
|
93
|
+
- lib/baiduserp/analyser-migrations/003_create_serps_table.rb
|
94
|
+
- lib/baiduserp/analyser.rb
|
63
95
|
- lib/baiduserp/client.rb
|
64
96
|
- lib/baiduserp/helper.rb
|
97
|
+
- lib/baiduserp/parser/ads_right.rb
|
98
|
+
- lib/baiduserp/parser/ads_top.rb
|
99
|
+
- lib/baiduserp/parser/con_ar.rb
|
100
|
+
- lib/baiduserp/parser/pinpaizhuanqu.rb
|
101
|
+
- lib/baiduserp/parser/ranks.rb
|
102
|
+
- lib/baiduserp/parser/related_keywords.rb
|
103
|
+
- lib/baiduserp/parser/result_num.rb
|
104
|
+
- lib/baiduserp/parser/right_hotel.rb
|
105
|
+
- lib/baiduserp/parser/right_personinfo.rb
|
106
|
+
- lib/baiduserp/parser/right_relaperson.rb
|
107
|
+
- lib/baiduserp/parser/right_weather.rb
|
108
|
+
- lib/baiduserp/parser/zhixin.rb
|
65
109
|
- lib/baiduserp/parser.rb
|
66
110
|
- lib/baiduserp/result.rb
|
67
111
|
- lib/baiduserp/version.rb
|
68
112
|
- lib/baiduserp.rb
|
69
|
-
- lib/parsers/ads_right.rb
|
70
|
-
- lib/parsers/ads_top.rb
|
71
|
-
- lib/parsers/con_ar.rb
|
72
|
-
- lib/parsers/pinpaizhuanqu.rb
|
73
|
-
- lib/parsers/ranks.rb
|
74
|
-
- lib/parsers/related_keywords.rb
|
75
|
-
- lib/parsers/result_num.rb
|
76
|
-
- lib/parsers/right_hotel.rb
|
77
|
-
- lib/parsers/right_personinfo.rb
|
78
|
-
- lib/parsers/right_relaperson.rb
|
79
|
-
- lib/parsers/right_weather.rb
|
80
|
-
- lib/parsers/zhixin.rb
|
81
113
|
- bin/baiduserp
|
82
114
|
- README.md
|
83
115
|
- lib/baiduserp/user_agents.yml
|