baiduserp 2.1.14 → 2.2.9
- checksums.yaml +4 -4
- data/README.md +1 -231
- data/bin/baiduserp +39 -33
- data/lib/baiduserp.rb +6 -1
- data/lib/baiduserp/analyser-migrations/001_create_keywords_table.rb +17 -0
- data/lib/baiduserp/analyser-migrations/002_create_htmls_table.rb +17 -0
- data/lib/baiduserp/analyser-migrations/003_create_serps_table.rb +16 -0
- data/lib/baiduserp/analyser.rb +69 -0
- data/lib/baiduserp/client.rb +4 -0
- data/lib/baiduserp/parser.rb +1 -1
- data/lib/{parsers → baiduserp/parser}/ads_right.rb +0 -0
- data/lib/{parsers → baiduserp/parser}/ads_top.rb +4 -1
- data/lib/{parsers → baiduserp/parser}/con_ar.rb +0 -0
- data/lib/baiduserp/parser/pinpaizhuanqu.rb +8 -0
- data/lib/{parsers → baiduserp/parser}/ranks.rb +4 -1
- data/lib/{parsers → baiduserp/parser}/related_keywords.rb +0 -0
- data/lib/{parsers → baiduserp/parser}/result_num.rb +0 -0
- data/lib/{parsers → baiduserp/parser}/right_hotel.rb +0 -0
- data/lib/{parsers → baiduserp/parser}/right_personinfo.rb +0 -0
- data/lib/{parsers → baiduserp/parser}/right_relaperson.rb +0 -0
- data/lib/{parsers → baiduserp/parser}/right_weather.rb +0 -0
- data/lib/{parsers → baiduserp/parser}/zhixin.rb +0 -0
- data/lib/baiduserp/version.rb +1 -1
- metadata +46 -14
- data/lib/parsers/pinpaizhuanqu.rb +0 -5
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 1d1c2a61259d5e3134d0566e002ba551a791b06c
+  data.tar.gz: 8b3f78ad2190a17a2f530019c0b0719bb9abfb93
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 859d3c01765741cfe50eed92a05f5cbec93c68c003d13d2a2dad2e6b85e22d63d2171ded85fa3cd2271d7b830c522bfa95b3b27f91a269af621f811f23c3a87a
+  data.tar.gz: 5847dd8a81f4c61a0f72d2d8a2e6cd91e4aa1e9eed32f3c495453e31fe006a5c8162f902ebf2e831adb16731114fcb6b7d4a4867ed8ce575aecc39e2feca14c4
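Note on checksums.yaml: the file maps each digest algorithm (SHA1, SHA512) to the hex digest of the two archive members, metadata.gz and data.tar.gz. The following is a minimal sketch of how such a file could be checked against the actual bytes. `verify_checksums` and the `files` hash are hypothetical names for illustration and are not part of the gem or of RubyGems.

```ruby
require 'digest/sha1'
require 'digest/sha2'
require 'yaml'

# Maps the algorithm names used in checksums.yaml to Ruby digest classes.
DIGESTS = { 'SHA1' => Digest::SHA1, 'SHA512' => Digest::SHA512 }.freeze

# checksums_yaml: the text of a checksums.yaml file.
# files: hypothetical hash of member name => raw bytes (e.g. read from the .gem archive).
# Returns a list of [algorithm, member] pairs whose digest does not match.
def verify_checksums(checksums_yaml, files)
  expected = YAML.safe_load(checksums_yaml)
  mismatches = []
  expected.each do |algo, entries|
    digest_class = DIGESTS.fetch(algo)
    entries.each do |name, hex|
      actual = digest_class.hexdigest(files.fetch(name))
      mismatches << [algo, name] unless actual == hex
    end
  end
  mismatches
end
```

An empty result means every listed digest matched; any pair in the result names the algorithm and member that differ.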
data/README.md
CHANGED
@@ -1,231 +1 @@
-
-
-This gem is built specifically to parse Baidu search result pages, with the goal of extracting as much information as possible from a SERP page.
-(Note: this is not currently a program for batch-processing keyword rankings, but it can serve as the Baidu SERP parsing module inside batch rank-checking software.)
-
-## Features
-### Parses SERP results as completely as possible
-Baidu's SERP pages keep getting more complex: new result styles keep appearing on the left side, and a lot of content has been added on the right side as well.
-This gem parses a SERP page into Ruby data structures.
-
-When analysing SERP pages you may want to look at many kinds of information on the page, such as SEO rankings, SEM rankings, competitor rankings,
-title/description text, related keywords, related recommendations on the right side, whether a Baidu Open Platform result is present, and so on.
-
-This gem parses out all of the above for later analysis. If usage grows, or Baidu launches new products,
-parsers for new modules can be added too.
-
-
-### Command-line interface (for testing, and for JSON output)
-Besides the Ruby API, users of other programming languages can use the command-line interface, which outputs result data as JSON.
-See below for detailed usage.
-
-### Known issues
-This is currently only a barely usable version and may have all kinds of problems; bug reports are welcome. [Known issues list](https://github.com/semseo/baiduserp/issues).
-
-## Installation
-
-1 System requirements
-
-Linux or Mac. On Linux, a recent Ubuntu or Fedora release is recommended.
-
-2 Install Ruby
-
-Only Ruby 1.9 and above is supported. The best way to install Ruby is via [RVM](https://rvm.io/); for how to use RVM, see [http://ruby-china.org/wiki/install_ruby_guide](http://ruby-china.org/wiki/install_ruby_guide). That guide also installs some Rails-related software you do not need, but it is very detailed.
-
-On the latest Ubuntu or Fedora releases, Ruby 1.9 can also be installed with apt-get or yum.
-
-3 Install gem dependencies
-
-The gem depends on nokogiri, which in turn needs two system libraries.
-So on Ubuntu or Fedora you need:
-
-```
-$ sudo apt-get install libxslt-dev libxslt libxml2-dev libxml2 # ubuntu
-$ sudo yum install libxml2-devel libxml2 libxslt libxslt-devel # fedora
-```
-
-Once these dependencies are installed,
-
-`$ gem install nokogiri`
-
-4 Finally, install the baiduserp gem
-
-`$ gem install baiduserp`
-
-## Usage
-
-Ruby code example
-
-```
-require 'baiduserp'
-require 'open-uri'
-require 'pp'
-
-pp Baiduserp.search 'keyword'
-
-pp Baiduserp.parse open(http://www.baidu.com/s?wd=keyword).read.encode('UTF-8')
-
-```
-
-For non-Ruby programs and one-off debugging, a command-line interface is also provided, which can exchange data as JSON:
-
-```
-$ baiduserp -h
-Usage:
-  1. baiduserp -s 'keyword' # search 'keyword' and print parse result
-  2. baiduserp -s 'keyword' -o output.json # -o means save result to a file
-  3. baiduserp -f 'file path' # parse html source code from file
-  4. baiduserp -s 'keyword' -j # search 'keyword' and print parse result in JSON format
-    -s, --search Keyword             Search Keyword & Parse SERP
-    -j, --jsonprint                  Print result in JSON format
-    -o, --output Output              Save Result to File in JSON format
-    -f, --file File                  Parse Local File
-```
-
-The final result is a data structure of nested hashes and arrays. A sample result:
-
-```
-$ baiduserp -s 香港
-{:ads_right=>
-  [{:rank=>1,
-    :title=>"预订香港酒店上携程,全景图..",
-    :content=>"订香港酒店,享受有房保障,服务好,折扣低,返现高达201元,订香港酒店上携程超划算!",
-    :site=>"www.ctrip.com"},
-   {:rank=>2,
-    :title=>"香港-去哪儿网度假频道,聪明..",
-    :content=>"香港-去哪儿网度假频道,比价首选!180000条报价实时更新,先比价后出行!",
-    :site=>"dujia.qunar.com"},
-   {:rank=>3,
-    :title=>"香港香港怎么玩最划算?",
-    :content=>"深圳旅行社香港,香港旅游您的超值之选!天天出团港澳游专线,缤纷全程绝不",
-    :site=>"www.sztygl128.com"},
-   {:rank=>4,
-    :title=>"香港旅游攻略-150000条点评",
-    :content=>"还没来过香港?507个景点都玩遍?到到网告诉你网友怎么玩(150000张游记照片)!",
-    :site=>"www.daodao.com"},
-   {:rank=>5,
-    :title=>"香港旅游首选北京青年旅行社..",
-    :content=>"北京青年旅行社专业香港旅游旅行社,高品质服务,天天折扣价,",
-    :site=>"www.hqly8.com"},
-   {:rank=>6,
-    :title=>"香港旅游特价啦!香港旅游价..",
-    :content=>"本社邀你一起体验超值香港旅游,全程绝无强制购物,行程安排合理.",
-    :site=>"www.cctbj.net"},
-   {:rank=>7,
-    :title=>"香港旅游线路",
-    :content=>"北京旅行社提供香港咨询服务,多条精品旅游线路供您选择.",
-    :site=>"www.ctslyw.com"},
-   {:rank=>8,
-    :title=>"全新香港旅游报价,香港旅游..",
-    :content=>"北京国际旅行社,精选多条香港旅游线路,信誉保证,全程无隐性消费.",
-    :site=>"www.quly8.net"}],
- :ads_top=>
-  [{:rank=>1,
-    :title=>"香港酒店预订 在Agoda立享1-7折",
-    :content=>"香港酒店预订,尽在Agoda,网上订购低价回馈,为您节省75%.",
-    :site=>"www.agoda.com"},
-   {:rank=>2,
-    :title=>"香港酒店预订 在Agoda立享1-7折",
-    :content=>"香港酒店预订,尽在Agoda,网上订购低价回馈,为您节省75%.",
-    :site=>"www.agoda.com"}],
- :pinpaizhuanqu=>false,
- :ranks=>
-  [{:rank=>1,
-    :url=>
-     "http://baike.baidu.com/link?url=Ujomxkw-4Whq7C7TI6do9nxHr3G0sO6ywJ3SZfr-lX4qQiht-2rnuGomrclwc4bJ",
-    :title=>"香港_百度百科",
-    :content=>nil,
-    :mu=>"http://baike.baidu.com/view/2607.htm",
-    :baiduopen=>false},
-   {:rank=>2,
-    :url=>"http://lvyou.baidu.com/xianggang/",
-    :title=>"2013香港旅游攻略_香港景点线路游记_百度旅游",
-    :content=>nil,
-    :mu=>"http://lvyou.baidu.com/xianggang/",
-    :baiduopen=>false},
-   {:rank=>3,
-    :url=>
-     "http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&fr=ala1&word=%CF%E3%B8%DB",
-    :title=>"香港_百度图片 - 举报图片",
-    :content=>nil,
-    :mu=>
-     "http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&fr=ala1&word=%CF%E3%B8%DB",
-    :baiduopen=>false},
-   {:rank=>4,
-    :url=>"http://www.gov.hk/sc/residents/",
-    :title=>"GovHK 香港政府一站通:本港居民",
-    :content=>
-     "香港政府为当地居民提供的资讯和服务,内容包括通讯及科技、文化、康乐及运动、教育及培训、就业、环境、政府、法律及治安、保健及医疗服务、房屋及社会服务、入境事务...",
-    :mu=>nil,
-    :baiduopen=>false},
-   {:rank=>5,
-    :url=>"http://www.baidu.com/s?rtt=2&tn=baiduwb&rn=20&cl=2&wd=%CF%E3%B8%DB",
-    :title=>"香港的最新微博结果",
-    :content=>nil,
-    :mu=>"http://www.baidu.com/s?rtt=2&tn=baiduwb&rn=20&cl=2&wd=%CF%E3%B8%DB",
-    :baiduopen=>false},
-   {:rank=>6,
-    :url=>"http://tieba.baidu.com/f?kw=%CF%E3%B8%DB&fr=ala0",
-    :title=>"香港吧 百度贴吧",
-    :content=>
-     "月活跃用户:38万人 累计发贴:202万 图片(1856) | 视频(61) | 精品贴(335) 香港和上海的夜景那个美?????????? 点击:439 回复:259 最近怎么那么多自以为漂亮的S!B女求认证啊。 点击:303 回复:69 什么时候,去香港不必签证,和去北京上海一样容... 点击:839 回复:187 查看更多香港吧内容>> tieba.baidu.com/香港?fr=ala0 2013-10-18",
-    :mu=>"http://tieba.baidu.com/f?kw=%CF%E3%B8%DB&fr=ala0",
-    :baiduopen=>false},
-   {:rank=>7,
-    :url=>"http://www.weather.com.cn/html/weather/101320101.shtml",
-    :title=>"香港天气预报_一周天气预报_中国天气网 - 最近访问:",
-    :content=>nil,
-    :mu=>"http://www.weather.com.cn/html/weather/101320101.shtml",
-    :baiduopen=>true},
-   {:rank=>8,
-    :url=>"http://www.mafengwo.cn/travel-scenic-spot/mafengwo/10189.html",
-    :title=>"2013香港旅游攻略,香港自助游攻略,蚂蜂窝香港出游攻略游记 - 蚂蜂窝",
-    :content=>
-     "在香港寻吃完全就是一场舌尖的盛宴,从街边小吃到世界顶级的米其林餐厅任您选择,茶餐厅、早茶、烧腊和及甜品极具港式风味,世界各地的美食料理也一个不落单。 香港...",
-    :mu=>nil,
-    :baiduopen=>false},
-   {:rank=>9,
-    :url=>"http://hongkong.cncn.com/",
-    :title=>"香港旅游攻略_香港香港旅游景点_香港旅游网",
-    :content=>
-     "香港欣欣旅游网,提供香港香港旅游景点推荐、10月香港旅游攻略、香港旅行社、香港旅游线路、香港酒店预订、香港旅游地图等出行指南及旅游服务●欣欣旅游网 CNCN.com ...",
-    :mu=>nil,
-    :baiduopen=>false},
-   {:rank=>10,
-    :url=>"http://www.baidu.com/s?tn=baidurt&rtt=1&bsst=1&wd=%CF%E3%B8%DB",
-    :title=>"香港的最新相关信息",
-    :content=>nil,
-    :mu=>"http://www.baidu.com/s?tn=baidurt&rtt=1&bsst=1&wd=%CF%E3%B8%DB",
-    :baiduopen=>false}],
- :related_keywords=>
-  ["香港电影",
-   "香港旅游",
-   "香港天气",
-   "香港大学",
-   "香港购物",
-   "香港地图",
-   "香港电视剧",
-   "香港地铁",
-   "香港中文大学",
-   "香港苹果官网"],
- :result_num=>100000000,
- :right_hotel=>nil,
- :right_personinfo=>nil,
- :right_relaperson=>
-  [{:title=>"香港特别行政区行政区划", :names=>["油尖旺区", "九龙城区", "湾仔", "元朗区", "西贡区"]},
-   {:title=>"全球性国际金融中心", :names=>["新加坡", "纽约", "东京", "伦敦"]},
-   {:title=>"其他人还搜", :names=>["台北", "上海", "直布罗陀", "海南", "深圳"]}],
- :right_weather=>nil}
-```
-
-## Contributing
-
-Help improving this gem is welcome:
-
-1. Fork it
-2. Create your feature branch (`git checkout -b my-new-feature`)
-3. Commit your changes (`git commit -am 'Add some feature'`)
-4. Push to the branch (`git push origin my-new-feature`)
-5. Create new Pull Request
-
-Alternatively, open an issue on the Issues page: bug reports, feature requests, suggestions, and so on.
+Parses Baidu search result pages and returns structured data for further analysis.
data/bin/baiduserp
CHANGED
@@ -4,50 +4,56 @@ require 'baiduserp'
 require 'optparse'
 require 'json'
 require 'pp'
+require 'docopt'
 
-
+cmd = File.basename(__FILE__)
+
+doc = <<DOCOPT
 1. baiduserp -s 'keyword' # search 'keyword' and print parse result
 2. baiduserp -s 'keyword' -o output.json # -o means save result to a file
 3. baiduserp -f 'file path' # parse html source code from file
 4. baiduserp -s 'keyword' -j # search 'keyword' and print parse result in JSON format
-"
-
-options = {}
-OptionParser.new do |opts|
-  opts.banner = usage
-
-  opts.on("-s Keyword", "--search Keyword", "Search Keyword & Parse SERP") do |v|
-    options[:keyword] = v
-  end
-
-  opts.on("-j","--jsonprint","Print result in JSON format") do |v|
-    options[:jsonprint] = v
-  end
 
-
-
-
-
-
-
-
-
+Usage:
+  #{cmd} [options]
+
+Options:
+  -h --help            show this help message and exit
+  -v --version         show version and exit
+  -a --analyse Name    analyse as the given name
+  --keywords File      uses with -a, import give keywords File before search
+  -s --search Keyword  search Keyword and show result
+  -f --file File       parse local file or given url
+  -j --json            print JSON output
+  -o --output File     output JSON result to File
+
+DOCOPT
+
+begin
+  options = Docopt::docopt(doc, version: Baiduserp::VERSION)
+  # pp options
+rescue Docopt::Exit => e
+  puts e.message
+end
 
 result = ''
-
-
-
+if options['--analyse']
+  analyse = Baiduserp.analyse(options['--analyse'])
+  analyse.import_keywords(options('--keywords'))
+  analyse.search
+  result = 'Analyse finished!'
+elsif options['--search']
+  result = Baiduserp.search options['--search']
+elsif options['--file']
+  result = Baiduserp.parse_file options['--file']
 else
-
+  puts "At least given one of -a/-s/-f"
 end
 
-if options[
-
-  pp result
-else
-  puts result.to_json
-end
+if options['--json']
+  puts result.to_json
 else
-
+  pp result
 end
 
+open(options['--output'],'w').puts result.to_json if options['--output']
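Note on the new bin/baiduserp flow: the script checks the docopt result in a fixed order, -a first, then -s, then -f. Two details in the added lines look fragile. `options('--keywords')` calls `options` as a method rather than indexing the hash, which would raise NoMethodError in plain Ruby. And when `Docopt::Exit` is rescued, `options` stays nil, so the later `options['--analyse']` lookup would fail as well. The sketch below shows the same branch order as a pure function over a docopt-style hash. `select_mode` is a hypothetical helper for illustration, not something the gem defines.

```ruby
# Hypothetical helper (not in the gem): mirrors the branch order of the new
# bin/baiduserp script. `options` is a docopt-style hash keyed by long flag
# names, or nil when argument parsing exited early.
def select_mode(options)
  return [:usage, nil] if options.nil?

  if options['--analyse']
    # Hash indexing, where the released script writes options('--keywords').
    [:analyse, { name: options['--analyse'], keywords: options['--keywords'] }]
  elsif options['--search']
    [:search, options['--search']]
  elsif options['--file']
    [:file, options['--file']]
  else
    [:usage, nil]
  end
end
```

Returning a tag plus payload keeps the precedence rule testable without touching the network or the database.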
data/lib/baiduserp.rb
CHANGED
@@ -1,5 +1,6 @@
 require "baiduserp/version"
-require
+require "baiduserp/parser"
+require "baiduserp/analyser"
 
 module Baiduserp
   def self.search(keyword,page=1)
@@ -17,4 +18,8 @@ module Baiduserp
   def self.parse_file(file_path)
     Parser.new.parse_file file_path
   end
+
+  def self.analyse(name,attrs={})
+    Analyser.new(name,attrs)
+  end
 end
data/lib/baiduserp/analyser.rb
ADDED
@@ -0,0 +1,69 @@
+require 'sequel'
+require 'csv'
+require 'date'
+require 'yaml'
+
+module Baiduserp
+  class Analyser
+    # Dir[File.expand_path('../analyser/*.rb', __FILE__)].each{|f| require f}
+
+    def initialize(name,attrs={})
+      @db_file = name + ".db"
+      @attrs = attrs
+      @keywords_imported = File.exists?(@db_file)
+
+      @db = Sequel.connect("sqlite://" + @db_file)
+
+      migrate!
+
+      @keywords = Class.new(Sequel::Model) do
+        set_dataset :keywords
+      end
+
+      @htmls = Class.new(Sequel::Model) do
+        set_dataset :htmls
+      end
+
+      @serps = Class.new(Sequel::Model) do
+        set_dataset :serps
+      end
+
+      import_keywords unless @keywords_imported
+    end
+
+    def run
+
+    end
+
+    def migrate!
+      Sequel.extension :migration, :core_extensions
+      Sequel::Migrator.apply(@db, File.expand_path('../analyser-migrations/',__FILE__))
+    end
+
+    def import_keywords(file=@attrs[:keywords])
+      CSV.foreach(file) do |l|
+        @keywords.insert(:term => l[0], :weight => l[1], :category => l[2])
+      end
+    end
+
+    def search(date=Date.today)
+      @keywords.each do |k|
+        next if @htmls.where(:date => date, :keyword_id => k[:id]).count > 0
+        puts k.to_hash
+        html = Baiduserp.get_search_html(k[:term])
+        serp = Baiduserp.parse(html)
+        @htmls.insert(:keyword_id => k[:id], :date => date, :content => html)
+        @serps.insert(:keyword_id => k[:id], :date => date, :content => YAML.dump(serp))
+      end
+    end
+
+    def _analyse_competitors(date=Date.today)
+      sites = Hash.new(0)
+      @serps.where(:date => date).each do |serp|
+        serp = YAML.load(serp[:content])
+        serp.sem_sites.each {|site| sites[site] += 1}
+      end
+      puts YAML.dump(sites)
+    end
+  end
+end
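Note on the Analyser flow: `search` skips any keyword that already has an HTML row for the given date, stores the parsed SERP as YAML, and `_analyse_competitors` later reloads those rows and counts site occurrences with `Hash.new(0)`. The sketch below reproduces that skip-then-tally logic with an in-memory log so it runs without Sequel or SQLite. `CrawlLog`, `crawl`, `tally_sites` and the `'sem_sites'` hash key are assumptions for illustration; the gem itself stores rows through Sequel models and calls `serp.sem_sites` on its own result object.

```ruby
require 'date'
require 'yaml'

# In-memory stand-in for the htmls/serps tables (hypothetical, not the gem's storage).
class CrawlLog
  def initialize
    @rows = []
  end

  # True when this keyword already has a row for the date.
  def crawled?(keyword_id, date)
    @rows.any? { |r| r[:keyword_id] == keyword_id && r[:date] == date }
  end

  # Stores the parsed SERP serialized as YAML, as the diff does for the serps table.
  def record(keyword_id, date, serp)
    @rows << { keyword_id: keyword_id, date: date, content: YAML.dump(serp) }
  end

  def on(date)
    @rows.select { |r| r[:date] == date }
  end
end

# fetch is any callable that returns a parsed SERP for a term; here a plain
# hash with a 'sem_sites' array is assumed.
def crawl(keywords, log, date, fetch)
  keywords.each do |k|
    next if log.crawled?(k[:id], date) # re-running on the same day does no new work
    log.record(k[:id], date, fetch.call(k[:term]))
  end
end

# Counts how often each site appears across all SERPs stored for the date.
def tally_sites(log, date)
  sites = Hash.new(0)
  log.on(date).each do |row|
    serp = YAML.load(row[:content])
    serp['sem_sites'].each { |site| sites[site] += 1 }
  end
  sites
end
```

Plain hashes with string keys are used here on purpose: recent Psych versions restrict which classes `YAML.load` will deserialize, so a custom result object may need extra handling there.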
data/lib/baiduserp/client.rb
CHANGED
@@ -41,6 +41,10 @@ module Baiduserp
         response = self.class.get_serp(url)
       end
 
+      if response.headers['Content-Length'].nil?
+        response = self.class.get_serp(url,retries)
+      end
+
       if response.headers['Content-Length'].to_i != response.body.bytesize
         issue_file = "/tmp/baiduserp_crawler_issue_#{Time.now.strftime("%Y%m%d%H%M%S")}.html"
         open(issue_file,'w').puts(response.body)
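Note on the client.rb change: the added guard refetches when the `Content-Length` header is missing, before the existing comparison against `response.body.bytesize`. Since `nil.to_i` is 0 in Ruby, a response without the header would otherwise count as a length mismatch whenever the body is non-empty. The sketch below separates the three cases and adds a bounded retry. `Response`, `length_status` and `fetch_with_retry` are hypothetical names, and the retry limit is an assumption: the diff does not show what the second argument of `get_serp` does.

```ruby
# Hypothetical stand-in for the HTTP client's response object.
Response = Struct.new(:headers, :body)

# Classifies a response the way the two checks in the diff distinguish them.
def length_status(response)
  declared = response.headers['Content-Length']
  return :missing_length if declared.nil?

  # bytesize, not length: Content-Length counts bytes, not characters.
  declared.to_i == response.body.bytesize ? :ok : :mismatch
end

# Calls the block until the header is present or max_attempts is reached,
# and returns the last response either way.
def fetch_with_retry(max_attempts)
  response = nil
  max_attempts.times do
    response = yield
    break unless length_status(response) == :missing_length
  end
  response
end
```

A caller would pass a block that performs the request, then branch on `length_status` to decide whether to dump the body to an issue file as the existing code does.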
data/lib/baiduserp/parser.rb
CHANGED
@@ -8,7 +8,7 @@ require 'baiduserp/result'
 
 module Baiduserp
   class Parser
-    Dir[File.expand_path('
+    Dir[File.expand_path('../parser/*.rb', __FILE__)].each{|f| require f}
 
     def parse(html)
       html = html.encode!('UTF-8','UTF-8',:invalid => :replace)
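Note on the parser.rb change: the new glob resolves `../parser/*.rb` relative to `__FILE__`. Because `File.expand_path` treats the file name as a path segment, `..` steps back to the directory containing parser.rb, so the pattern matches `lib/baiduserp/parser/*.rb`, the directory the parser files were moved into. A minimal sketch of the same loading pattern follows. `require_siblings` is a hypothetical helper, and the explicit `sort` is an addition for a predictable load order that the gem's one-liner does not have.

```ruby
# Hypothetical helper: requires every .rb file in `subdir`, resolved next to
# `base_file` the same way the new Parser line resolves '../parser/*.rb'.
def require_siblings(base_file, subdir)
  pattern = File.expand_path("../#{subdir}/*.rb", base_file)
  # Sorting makes the load order independent of filesystem listing order.
  Dir[pattern].sort.each { |f| require f }
end
```

Called as `require_siblings(__FILE__, 'parser')` from a file at `lib/baiduserp/parser.rb`, this would load the files under `lib/baiduserp/parser/`.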
data/lib/{parsers → baiduserp/parser}/ads_right.rb
File without changes
data/lib/{parsers → baiduserp/parser}/ads_top.rb
@@ -3,7 +3,10 @@ class Baiduserp::Parser
     result = []
     rank = 0
 
-    file[:doc].search('div#content_left').first
+    part = file[:doc].search('div#content_left').first
+    return result if part.nil?
+
+    part.children.each do |div|
       id = div['id'].to_i
       break if id > 0 && id < 3000
       next unless div['class'].to_s.include?('ec_pp_f')
data/lib/{parsers → baiduserp/parser}/con_ar.rb
File without changes
data/lib/{parsers → baiduserp/parser}/ranks.rb
@@ -1,7 +1,10 @@
 class Baiduserp::Parser
   def _parse_ranks(file)
     result = []
-    file[:doc].search("div[@id='content_left']").first
+    part = file[:doc].search("div[@id='content_left']").first
+    return result if part.nil?
+
+    part.children.each do |table|
       next if table.nil?
       id = table['id'].to_i
       next unless id > 0 && id < 3000
data/lib/{parsers → baiduserp/parser}/related_keywords.rb
File without changes
data/lib/{parsers → baiduserp/parser}/result_num.rb
File without changes
data/lib/{parsers → baiduserp/parser}/right_hotel.rb
File without changes
data/lib/{parsers → baiduserp/parser}/right_personinfo.rb
File without changes
data/lib/{parsers → baiduserp/parser}/right_relaperson.rb
File without changes
data/lib/{parsers → baiduserp/parser}/right_weather.rb
File without changes
data/lib/{parsers → baiduserp/parser}/zhixin.rb
File without changes
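Note on the ads_top.rb and ranks.rb changes: both parsers now keep the first `content_left` match in `part` and return the empty result when nothing matched, so a page without that container no longer fails on a method call against nil. The sketch below shows the same guard with a plain struct in place of a Nokogiri node. `Node` and `rank_ids` are hypothetical, and the id filter copies the `id > 0 && id < 3000` range from the ranks hunk.

```ruby
# Hypothetical stand-in for a parsed HTML node with an id attribute and children.
Node = Struct.new(:id, :children)

# matches: the list a selector search returned (possibly empty).
def rank_ids(matches)
  result = []
  part = matches.first
  return result if part.nil? # selector matched nothing: empty result, no error

  part.children.each do |child|
    id = child.id.to_i # a missing id becomes 0 and is filtered out below
    next unless id > 0 && id < 3000
    result << id
  end
  result
end
```

Without the early return, the empty-match case would call `children` on nil and raise NoMethodError.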
data/lib/baiduserp/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: baiduserp
 version: !ruby/object:Gem::Version
-  version: 2.
+  version: 2.2.9
 platform: ruby
 authors:
 - MingQian Zhang
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2013-
+date: 2013-12-02 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -52,6 +52,34 @@ dependencies:
     - - '>='
       - !ruby/object:Gem::Version
         version: '0'
+- !ruby/object:Gem::Dependency
+  name: sequel
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: docopt
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
 description: Parse Baidu SERP result page.
 email:
 - zmingqian@qq.com
@@ -60,24 +88,28 @@ executables:
 extensions: []
 extra_rdoc_files: []
 files:
+- lib/baiduserp/analyser-migrations/001_create_keywords_table.rb
+- lib/baiduserp/analyser-migrations/002_create_htmls_table.rb
+- lib/baiduserp/analyser-migrations/003_create_serps_table.rb
+- lib/baiduserp/analyser.rb
 - lib/baiduserp/client.rb
 - lib/baiduserp/helper.rb
+- lib/baiduserp/parser/ads_right.rb
+- lib/baiduserp/parser/ads_top.rb
+- lib/baiduserp/parser/con_ar.rb
+- lib/baiduserp/parser/pinpaizhuanqu.rb
+- lib/baiduserp/parser/ranks.rb
+- lib/baiduserp/parser/related_keywords.rb
+- lib/baiduserp/parser/result_num.rb
+- lib/baiduserp/parser/right_hotel.rb
+- lib/baiduserp/parser/right_personinfo.rb
+- lib/baiduserp/parser/right_relaperson.rb
+- lib/baiduserp/parser/right_weather.rb
+- lib/baiduserp/parser/zhixin.rb
 - lib/baiduserp/parser.rb
 - lib/baiduserp/result.rb
 - lib/baiduserp/version.rb
 - lib/baiduserp.rb
-- lib/parsers/ads_right.rb
-- lib/parsers/ads_top.rb
-- lib/parsers/con_ar.rb
-- lib/parsers/pinpaizhuanqu.rb
-- lib/parsers/ranks.rb
-- lib/parsers/related_keywords.rb
-- lib/parsers/result_num.rb
-- lib/parsers/right_hotel.rb
-- lib/parsers/right_personinfo.rb
-- lib/parsers/right_relaperson.rb
-- lib/parsers/right_weather.rb
-- lib/parsers/zhixin.rb
 - bin/baiduserp
 - README.md
 - lib/baiduserp/user_agents.yml