the_scrap 0.0.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/.gitignore +22 -0
- data/Gemfile +4 -0
- data/LICENSE.txt +22 -0
- data/README.md +435 -0
- data/Rakefile +2 -0
- data/lib/the_scrap/detail_obj.rb +41 -0
- data/lib/the_scrap/list_obj.rb +100 -0
- data/lib/the_scrap/scrap.rb +100 -0
- data/lib/the_scrap/version.rb +3 -0
- data/lib/the_scrap.rb +5 -0
- data/the_scrap.gemspec +24 -0
- metadata +97 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
---
SHA1:
  metadata.gz: 3b6dbb1e2bbe11284969c7a5bcc79e3bba665b96
  data.tar.gz: acfc7d5ac75f238fc77c578b52a68954e71b64c7
SHA512:
  metadata.gz: b14dd5d813c97c2a4c9f8c954d8d523004b3d3472e502c8e881301d4213d1539e417da1fb19ca0b864335356065e9b23d8ec8f546b4e464649f92a0814005ec9
  data.tar.gz: c989f8ccca09cdef3f892ca78264f9c53c8ef52beaa0020e3482109625e92a14f6e3a49c0b64131cb7a174688ce4734272e0b8c59cd6052ee98e130e3c5fc4b0

data/.gitignore
ADDED
@@ -0,0 +1,22 @@
*.gem
*.rbc
.bundle
.config
.yardoc
Gemfile.lock
InstalledFiles
_yardoc
coverage
doc/
lib/bundler/man
pkg
rdoc
spec/reports
test/tmp
test/version_tmp
tmp
*.bundle
*.so
*.o
*.a
mkmf.log

data/Gemfile
ADDED
data/LICENSE.txt
ADDED
@@ -0,0 +1,22 @@
Copyright (c) 2014 H.J.LeoChen

MIT License

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.md
ADDED
@@ -0,0 +1,435 @@
## The Scrap

The Scrap is a web scraping framework built on Nokogiri.

Its goals are simplicity, efficiency, easy customization, and broad adaptability.

## Installation

Add this line to your application's Gemfile:

    gem 'the_scrap'

And then execute:

    $ bundle

Or install it yourself as:

    $ gem install the_scrap

## Usage

### 0. Overview

```ruby
# encoding: utf-8
require 'rubygems'
require 'the_scrap'
require 'pp'

#create object
scrap = TheScrap::ListObj.new

#set start url
scrap.url = "http://fz.ganji.com/shouji/"

#fragment css selector
#Selects each table row or list element. Each matched fragment
#should contain one full record, whose fields are then extracted
#via the attr_* list below.
scrap.item_frag = ".layoutlist .list-bigpic"

#scrap attr list
scrap.attr_name = ['.ft-tit',:inner_html]
scrap.attr_detail_url = ['.ft-tit',:href]
scrap.attr_img = ['dt a img',:src]
scrap.attr_desc = '.feature p'
scrap.attr_price = '.fc-org'

#debug (in debug mode only the first item and the first page are processed)
scrap.debug = true
scrap.verbose = true

#html preprocess
scrap.html_proc << lambda { |html|
  #html.gsub(/abcd/,'efgh')
  html #each hook must return the (possibly modified) html
}

#filter scraped items
scrap.item_filters << lambda { |item_info|
  return false if item_info['name'].nil? || item_info['name'].length == 0
  return true
}

#data process
scrap.data_proc << lambda {|url,i|
  i['name'] = i['name'].strip
}

#result process
scrap.result_proc << lambda {|url,items|
  items.each do |item|
    pp item
  end
}

##### multi-page scraping can be added here, see section 2

##### detail-page scraping can be added here, see section 3

#scrap
scrap.scrap_list
```
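
Running this prints one hash per scraped item via `pp`, roughly of this shape (the values are illustrative placeholders):

```ruby
{"name"=>"...", "detail_url"=>"http://fz.ganji.com/...", "img"=>"http://...", "desc"=>"...", "price"=>"..."}
```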

### 1. Scraping a list

See the previous section.

### 2. Scraping a multi-page list

```ruby
#create ListObj

#...

########### has many pages ###########
#When enabled, multiple list pages are scraped using one of the
#pagination strategies below.

scrap.has_many_pages = true

#pagination strategies:

# [:next_page, :total_pages, :total_records]


#:next_page
scrap.pager_method = :next_page
scrap.next_page_css = ".next_page a"


#:total_pages
scrap.pager_method = :total_pages
scrap.get_page_count = lambda { |doc|
  if doc.css('.total_page').text =~ /(\d+)页/
    $~[1].to_i
  else
    0
  end
}

scrap.get_next_url = lambda { |url,next_page_number|
  #url is http://fz.ganji.com/shouji/
  #page url pattern http://fz.ganji.com/shouji/o#{page_number}/
  "#{url}o#{next_page_number}/"
}

#**:total_records is still in progress
scrap.pager_method = :total_records
#...

scrap.scrap_list
```
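
Until `:total_records` lands, it can be approximated with `:total_pages` by deriving the page count from the record count yourself, which is exactly what the TODO note in lib/the_scrap/list_obj.rb suggests. A minimal sketch, assuming a hypothetical `.total_records` node and an assumed page size of 20:

```ruby
scrap.pager_method = :total_pages
scrap.get_page_count = lambda { |doc|
  if doc.css('.total_records').text =~ /(\d+)/ #hypothetical selector
    ($~[1].to_f / 20).ceil                     #assumed 20 records per page
  else
    0
  end
}
```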

### 3. Scraping detail pages

**When a DetailObj runs inside a ListObj rather than standalone, the fields it scrapes are merged into the ListObj's result records.**

```ruby
#create ListObj

#extract detail page url
scrap.attr_detail_url = [".list a",:href]

...

################# has detail page ################
#When configured, the detail page URL scraped above is fetched
#and its fields are extracted as well.

#1. define a detail object
scrap_detail = TheScrap::DetailObj.new
scrap_detail.attr_title = ".Tbox h3"
scrap_detail.attr_detail = ".Tbox .newsatr"
scrap_detail.attr_content = [".Tbox .view",:inner_html]


#optional html preprocess
scrap_detail.html_proc << lambda { |response|
  response #must return the (possibly modified) html
}

#optional data process
scrap_detail.data_proc << lambda {|url,i|
}

#optional result process
#Optional: the scraped fields are merged into the list records,
#but they can also be persisted separately here.
scrap_detail.result_proc << lambda {|url,items|
}

#get url from the list attr and extract data with scrap_detail
scrap.detail_info << [scrap_detail,'detail_url']

#scrap.detail_info << [scrap_detail_1,'detail_url_1']

#...

scrap.scrap_list
```
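
A DetailObj can also be run on its own: per lib/the_scrap/detail_obj.rb, `scrap(url, item_info)` fills and returns the given hash. A minimal sketch (the URL is hypothetical):

```ruby
info = scrap_detail.scrap("http://example.com/news/1.html", {})
pp info #=> hash of the attr_* values scraped from that single page
```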


### 4. Attribute rules

Attributes are declared as **scrap.attr_#{name} = rule**.

**All scraped values are collected into one Hash: each attribute name becomes a key, and the extracted data its value.**

For example,

    scrap.attr_name = ".title"

yields item['name'] = "the content of the node matching .title".

A rule can be written in several ways.

#### 4.1 A plain CSS selector

With a plain CSS selector, the text content (inner_text) of the matching node is extracted.

```ruby
@book_info.attr_author = "#divBookInfo .title a"
```

#### 4.2 An array

    scrap.attr_name = [css_selector,attrs]

The first element of the array is the css_selector.

The second element can be one of the following:

**:frag_attr**

Reads an attribute of the fragment itself (e.g. of the list element), since in practice you sometimes need an attribute of the list item or table row:

    scrap.attr_name = [:frag_attr,'href']

The first element here is :frag_attr rather than a CSS selector, because the selector was already given in scrap.item_frag; this is the only place this special form appears.

**:inner_html**

Extracts the node's inner HTML.

**:join**

For a list whose elements should all be extracted and joined with commas, e.g. tags:

```html
<ul class="tags">
  <li>ruby</li>
  <li>rails</li>
  <li>activerecord</li>
</ul>
```

```ruby
scrap.attr_name = ['.tags', :join]
```

This yields a single string:

```ruby
"ruby,rails,activerecord"
```

**:array**

For a list whose elements should all be extracted into an Array:

```html
<ul class="tags">
  <li>ruby</li>
  <li>rails</li>
  <li>activerecord</li>
</ul>
```

```ruby
scrap.attr_name = ['.tags', :array]
```

This yields an array of strings:

```ruby
['ruby','rails','activerecord']
```

**:src**

Extracts the image's src attribute and resolves it with URI.join(current_page_url, src_value).

**:href**

Extracts the link's href attribute and resolves it with URI.join(current_page_url, href_value).
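
A hedged illustration of this URL resolution (the page URL, markup, and attribute name are hypothetical):

```ruby
# Suppose http://example.com/books/1 contains <img src="/covers/1.jpg">:
scrap.attr_cover = ['img', :src]
# item['cover'] #=> "http://example.com/covers/1.jpg" (via URI.join)
```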

**"else"** (any other attribute name)

Extracts that element attribute directly, with no further processing.


**Example**

```ruby
@book_info = TheScrap::DetailObj.new
@book_info.attr_name = "#divBookInfo .title h1"
@book_info.attr_author = "#divBookInfo .title a"
@book_info.attr_desc = [".intro .txt",:inner_html]
@book_info.attr_pic_url = ['.pic_box a img',:src]
@book_info.attr_chapters_url = ['.book_pic .opt li[1] a',:href]
@book_info.attr_book_info = ".info_box table tr"
@book_info.attr_cat_1 = '.box_title .page_site a[2]'
@book_info.attr_tags = ['.book_info .other .labels .box[1] a',:array]
@book_info.attr_user_tags = ['.book_info .other .labels .box[2] a',:join]
@book_info.attr_rate = '#bzhjshu'
@book_info.attr_rate_cnt = ["#div_pingjiarenshu",'title']
@book_info.attr_last_updated_at = "#divBookInfo .tabs .right"
@book_info.attr_last_chapter = '.updata_cont .title a'
@book_info.attr_last_chapter_desc = ['.updata_cont .cont a',:inner_html]
```

### 5. Pagination modes

See section 2, "Scraping a multi-page list".

### 6. Processing scraped records

Scraped records can be massaged before they are saved.

A simple example:

```ruby
baidu.data_proc << lambda {|url,i|
  i['title'] = i['title'].strip
  if i['ori_url'] =~ /view.aspx\?id=(\d+)/
    i['ori_id'] = $~[1].to_i
  end

  if i['detail'] =~ /发布时间:(.*?) /
    i['updated_at'] = i['created_at'] = $~[1]
  end

  if i['detail'] =~ /来源:(.*?)作者:/
    i['description'] = $~[1].strip
  end

  i.delete('detail')

  i['content'].gsub!(/<script type="text\/javascript">.*?<\/script>/m,'')
  i['content'].gsub!(/<style>.*?<\/style>/m,'')
  i['content'].gsub!(/<img class="img_(sina|qq)_share".*?>/m,'')
  if i['content'] =~ /image=(.*?)"/
    #i['image'] = open($~[1]) if $~[1].length > 0
  end

  i['site_id'] = @site_id
  i['cat_id'] = @cat_id

  time = Time.parse(i['updated_at'])
  prep = '['+time.strftime('%y%m%d')+']'
}
```

### 7. Result handling

#### mysql
```ruby
require 'active_record'
require 'mysql2'
require 'activerecord-import' #recommended


ActiveRecord::Base.establish_connection( :adapter => "mysql2", :host => "localhost",
  :database => "test", :username => "test", :password => "" )

ActiveRecord::Base.record_timestamps = false
class Article < ActiveRecord::Base
  validates :ori_id, :uniqueness => true
end

# OR load the Rails env!

scrap.result_proc << lambda {|url,items|
  articles = []
  items.each do |item|
    #item[:user_id] = 1
    articles << Article.new(item)
  end
  Article.import articles
}
```
#### mongodb

```ruby
require 'mongoid'

Mongoid.load!("./mongoid.yml", :production)
Mongoid.allow_dynamic_fields = true

class Article
  include Mongoid::Document
  #....
end

# OR load the Rails env!

scrap.result_proc << lambda {|url,items|
  items.each do |item|
    #item[:user_id] = 1
    Article.create(item)
  end
}
```

#### json, xml...

```ruby
#json
scrap.result_proc << lambda {|url,items|
  File.open("xxx.json",'w') { |f| f.write(items.to_json) }
}

#xml
scrap.result_proc << lambda {|url,items|
  articles = []
  items.each do |item|
    articles << item.to_xml
  end
  file = File.open("xxx.xml",'w')
  file.write('<articles>')
  file.write(articles.join(''))
  file.write('</articles>')
  file.close
}
```

## TODO

1. Multi-threaded scraping
2. Thread management
3. Better documentation


## Contributing

1. Fork it ( https://github.com/[my-github-username]/thescrap/fork )
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Commit your changes (`git commit -am 'Add some feature'`)
4. Push to the branch (`git push origin my-new-feature`)
5. Create a new Pull Request

data/Rakefile
ADDED
data/lib/the_scrap/detail_obj.rb
ADDED
@@ -0,0 +1,41 @@
# encoding: utf-8
module TheScrap
  class DetailObj < Scrap
    def scrap( url, item_info )
      return retryable(:tries => 3, :on => Timeout::Error) do
        do_scrap(url,item_info)
      end
    end

    def do_scrap( url, item_info )
      html = open(url).read
      html_proc.each do |dp|
        html = dp.call(html)
      end

      doc = Nokogiri::HTML(html,nil,encoding)
      get_attrs(url,doc,item_info)

      #has detail page?
      #detail objects can recurse another level down
      detail_info.each do |detail|
        detail[0].scrap(item_info[detail[1]],item_info)
      end

      #proc data
      data_proc.each do |dp|
        dp.call(url,item_info)
      end

      #proc result
      #persistence of detail records can be hooked in separately here
      result_proc.each do |rp|
        rp.call(url,[item_info])
      end

      pp item_info if debug?
      return item_info
    end
  end
end

data/lib/the_scrap/list_obj.rb
ADDED
@@ -0,0 +1,100 @@
# encoding: utf-8
module TheScrap
  class ListObj < Scrap
    attr_accessor :item_filters   #item filters
    attr_accessor :has_many_pages #whether the list spans multiple pages
    attr_accessor :pager_method   #pagination mode
    attr_accessor :next_page_css  #css selector for the next-page link (:next_page mode)
    attr_accessor :get_page_count #lambda returning the total page count (:total_pages mode); not a plain CSS rule because the number usually needs post-processing
    attr_accessor :get_next_url   #lambda building the next page URL (:total_pages mode)

    def initialize()
      super
      @item_filters = []
    end

    def scrap( url )
      items = []

      html = open(url).read
      html_proc.each do |dp|
        html = dp.call(html)
      end

      doc = Nokogiri::HTML(html,nil,encoding)
      doc.css(item_frag).each do |item|

        item_info = {}
        get_attrs(url,item,item_info)

        #filter items
        need_skip = false
        item_filters.each do |filter|
          unless filter.call(item_info)
            need_skip = true
            break
          end
        end
        next if need_skip

        #has detail page?
        detail_info.each do |detail|
          detail[0].scrap(item_info[detail[1]],item_info)
        end

        #proc data
        data_proc.each do |dp|
          dp.call(url,item_info)
        end

        items << item_info

        pp item_info if debug?
        break if debug?
      end

      result_proc.each do |rp|
        rp.call(url,items)
      end

      return doc,items
    end

    def scrap_list
      doc,items = retryable(:tries => 3, :on => Timeout::Error) do
        scrap(url)
      end

      return unless has_many_pages

      #TODO refactor this
      next_page_url = nil
      if pager_method == :next_page #follow the next-page link
        while node = doc.css(next_page_css).first
          next_page_url = URI.join(next_page_url||url,node['href']).to_s
          puts next_page_url if verbose?
          doc,items = retryable(:tries => 3, :on => Timeout::Error) do
            scrap(next_page_url)
          end
          break if items.count == 0
          break if debug?
        end
      elsif pager_method == :total_pages #the total page count is available; pages start at 1
        page_cnt = get_page_count.call(doc)
        (2..page_cnt).each do |idx|
          next_page_url = get_next_url.call(url,idx)
          puts next_page_url if verbose?
          doc,items = retryable(:tries => 3, :on => Timeout::Error) do
            scrap(next_page_url)
          end
          break if items.count == 0
          break if debug?
        end
      elsif pager_method == :total_records
        #TODO
        #the total record count is available; this can also be built on the
        #:total_pages mode by first computing the page count from the record count
      end
    end
  end
end
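
Because `scrap` returns both the parsed document and the item hashes (`return doc,items` above), it can also be called directly for a single page, skipping pagination. A small sketch:

```ruby
doc, items = scrap.scrap(scrap.url)
puts items.size
```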
data/lib/the_scrap/scrap.rb
ADDED
@@ -0,0 +1,100 @@
# encoding: utf-8
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'pp'
require 'timeout'

module TheScrap
  class Scrap
    attr_accessor :item_frag   #fragment selector for one record
    attr_accessor :url         #start URL
    attr_accessor :base_url    #base URL for images and links
    attr_accessor :html_proc   #hooks run on the raw html after download
    attr_accessor :data_proc   #hooks for massaging the scraped data
    attr_accessor :result_proc #hooks for persistence, file output, etc.
    attr_accessor :detail_info #detail page objects

    attr_accessor :encoding

    attr_accessor :debug
    alias_method :debug?, :debug

    attr_accessor :verbose
    alias_method :verbose?, :verbose

    def initialize()
      @attrs = {}
      @more_info = []
      @debug = false
      #@encoding = 'utf-8'
      @result_proc = []
      @detail_info = []
      @data_proc = []
      @html_proc = []
    end

    def retryable( options = {} )
      opts = { :tries => 1, :on => Exception }.merge(options)

      retry_exception, retries = opts[:on], opts[:tries]

      begin
        return yield
      rescue retry_exception
        if (retries -= 1) > 0
          sleep 2
          retry
        else
          raise
        end
      end
    end

    def method_missing( method_id, *arguments, &block )
      if(method_id =~ /attr_(.*)=/)
        name = $~[1]
        @attrs[name] = arguments.first
      end
    end

    protected
    #TODO document
    def get_attrs( url, doc, item_info )
      @attrs.keys.each do |k|
        unless @attrs[k].is_a? Array
          item_info[k] = doc.css(@attrs[k]).text.strip
        else
          option = @attrs[k]
          if option[0] == :frag_attr
            item_info[k] = doc[option[1]]
            next
          end

          node = doc.css(option[0]).first
          next unless node
          if(option[1] == :inner_html)
            item_info[k] = node.inner_html
          elsif(option[1] == :join)
            item_info[k] = doc.css(option[0]).map{|i|i.text}.join(',')
          elsif(option[1] == :array)
            item_info[k] = doc.css(option[0]).map{|i|i.text}
          else
            if [:href,:src].include? option[1].to_sym
              #escape spaces so URI.join below does not raise on them
              src = node[option[1]].strip.gsub(" ","%20")
              begin
                item_info[k] = URI.join(base_url||url,src).to_s
              rescue
                item_info[k] = src.to_s
              end
            else
              item_info[k] = node[option[1]].strip
            end
          end
        end
      end
    end
  end
end
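
The dynamic `attr_*` setters used throughout the README are handled by `method_missing` above, which stores each rule in `@attrs` under the attribute's name. A quick illustration (the attribute names are hypothetical):

```ruby
scrap = TheScrap::DetailObj.new
scrap.attr_title = ".title"      # stored as @attrs['title'] = ".title"
scrap.attr_img   = ['img', :src] # stored as @attrs['img'] = ['img', :src]
```
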
data/lib/the_scrap.rb
ADDED
data/the_scrap.gemspec
ADDED
@@ -0,0 +1,24 @@
# coding: utf-8
lib = File.expand_path('../lib', __FILE__)
$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
require 'the_scrap/version'

Gem::Specification.new do |spec|
  spec.name          = "the_scrap"
  spec.version       = TheScrap::VERSION
  spec.authors       = ["H.J.LeoChen"]
  spec.email         = ["hjleochen@hotmail.com"]
  spec.summary       = %q{The webpage scrapping.}
  spec.description   = %q{The webpage scrapping based Nokogiri.}
  spec.homepage      = ""
  spec.license       = "MIT"

  spec.files         = `git ls-files -z`.split("\x0")
  spec.executables   = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
  spec.test_files    = spec.files.grep(%r{^(test|spec|features)/})
  spec.require_paths = ["lib"]

  spec.add_development_dependency "bundler", "~> 1.6"
  spec.add_development_dependency "rake"
  spec.add_dependency "nokogiri"
end
metadata
ADDED
@@ -0,0 +1,97 @@
--- !ruby/object:Gem::Specification
name: the_scrap
version: !ruby/object:Gem::Version
  version: 0.0.1
platform: ruby
authors:
- H.J.LeoChen
autorequire:
bindir: bin
cert_chain: []
date: 2014-08-18 00:00:00.000000000 Z
dependencies:
- !ruby/object:Gem::Dependency
  name: bundler
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - ~>
      - !ruby/object:Gem::Version
        version: '1.6'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - ~>
      - !ruby/object:Gem::Version
        version: '1.6'
- !ruby/object:Gem::Dependency
  name: rake
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - '>='
      - !ruby/object:Gem::Version
        version: '0'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - '>='
      - !ruby/object:Gem::Version
        version: '0'
- !ruby/object:Gem::Dependency
  name: nokogiri
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - '>='
      - !ruby/object:Gem::Version
        version: '0'
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - '>='
      - !ruby/object:Gem::Version
        version: '0'
description: The webpage scrapping based Nokogiri.
email:
- hjleochen@hotmail.com
executables: []
extensions: []
extra_rdoc_files: []
files:
- .gitignore
- Gemfile
- LICENSE.txt
- README.md
- Rakefile
- lib/the_scrap.rb
- lib/the_scrap/detail_obj.rb
- lib/the_scrap/list_obj.rb
- lib/the_scrap/scrap.rb
- lib/the_scrap/version.rb
- the_scrap.gemspec
homepage: ''
licenses:
- MIT
metadata: {}
post_install_message:
rdoc_options: []
require_paths:
- lib
required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
  - - '>='
    - !ruby/object:Gem::Version
      version: '0'
required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
  - - '>='
    - !ruby/object:Gem::Version
      version: '0'
requirements: []
rubyforge_project:
rubygems_version: 2.2.2
signing_key:
specification_version: 4
summary: The webpage scrapping.
test_files: []