diffbot 0.1.0
- data/.gitignore +1 -0
- data/LICENSE +19 -0
- data/README.md +125 -0
- data/Rakefile +15 -0
- data/diffbot.gemspec +19 -0
- data/lib/diffbot.rb +45 -0
- data/lib/diffbot/article.rb +194 -0
- data/lib/diffbot/coercible_hash.rb +113 -0
- data/lib/diffbot/frontpage.rb +60 -0
- data/lib/diffbot/frontpage/dml_parser.rb +83 -0
- data/lib/diffbot/item.rb +55 -0
- data/lib/diffbot/request.rb +54 -0
- data/test/coercible_hash_test.rb +66 -0
- data/test/test_helper.rb +2 -0
- metadata +114 -0
data/.gitignore
ADDED
@@ -0,0 +1 @@
|
|
1
|
+
/pkg
|
data/LICENSE
ADDED
@@ -0,0 +1,19 @@
|
|
1
|
+
Copyright (c) 2012 Nicolás Sanguinetti for Tinder Inc.
|
2
|
+
|
3
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
4
|
+
of this software and associated documentation files (the "Software"), to deal
|
5
|
+
in the Software without restriction, including without limitation the rights
|
6
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
7
|
+
copies of the Software, and to permit persons to whom the Software is
|
8
|
+
furnished to do so, subject to the following conditions:
|
9
|
+
|
10
|
+
The above copyright notice and this permission notice shall be included in
|
11
|
+
all copies or substantial portions of the Software.
|
12
|
+
|
13
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
14
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
15
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
16
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
17
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
18
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
|
19
|
+
THE SOFTWARE.
|
data/README.md
ADDED
@@ -0,0 +1,125 @@
|
|
1
|
+
# Diffbot
|
2
|
+
|
3
|
+
This is a Ruby client for the [Diffbot](http://diffbot.com) API.
|
4
|
+
|
5
|
+
## Global Options
|
6
|
+
|
7
|
+
You can pass some settings to Diffbot like this:
|
8
|
+
|
9
|
+
``` ruby
|
10
|
+
Diffbot.configure do |config|
|
11
|
+
config.token = ENV["DIFFBOT_TOKEN"]
|
12
|
+
config.instrumentor = ActiveSupport::Notifications
|
13
|
+
end
|
14
|
+
```
|
15
|
+
|
16
|
+
The list of supported settings is:
|
17
|
+
|
18
|
+
* `token`: Your Diffbot API token. This will be used for all requests in which
|
19
|
+
you don't specify it manually (see below).
|
20
|
+
* `instrumentor`: An object that matches the [ActiveSupport::Notifications][1]
|
21
|
+
API, which will be used to trace network events. None is used by default.
|
22
|
+
* `article_defaults`: Pass a block to this method to configure the global
|
23
|
+
request settings used for Diffbot::Article requests. See below for the
|
24
|
+
supported options.
|
25
|
+
|
26
|
+
[1]: http://api.rubyonrails.org/classes/ActiveSupport/Notifications.html
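
As a rough illustration of the block-style configuration described above, the following self-contained sketch mimics the pattern with a stand-in module. `SketchConfig` is a hypothetical name used only here and is not part of the gem; the token value is a placeholder.

```ruby
# Hypothetical stand-in for the Diffbot module, for illustration only.
module SketchConfig
  class << self
    # Settings are plain accessors on the module itself.
    attr_accessor :token, :instrumentor

    # Yield the module so the caller can set options, then return it.
    def configure
      yield self
      self
    end
  end
end

SketchConfig.configure do |config|
  config.token = "placeholder-token"
end

puts SketchConfig.token
```

Settings that are never assigned, such as `instrumentor` here, simply stay `nil`.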
|
27
|
+
|
28
|
+
## Articles
|
29
|
+
|
30
|
+
In order to fetch an article, do this:
|
31
|
+
|
32
|
+
``` ruby
|
33
|
+
require "diffbot/article"
|
34
|
+
|
35
|
+
article = Diffbot::Article.fetch(article_url, diffbot_token)
|
36
|
+
|
37
|
+
# Now you can inspect the result:
|
38
|
+
article.title
|
39
|
+
article.author
|
40
|
+
article.date
|
41
|
+
article.text
|
42
|
+
# etc. See below for the full list of available response attributes.
|
43
|
+
```
|
44
|
+
|
45
|
+
This is a list of all the fields returned by the `Diffbot::Article.fetch` call:
|
46
|
+
|
47
|
+
* `url`: The URL of the article.
|
48
|
+
* `title`: The title of the article.
|
49
|
+
* `author`: The author of the article.
|
50
|
+
* `date`: The date on which this article was published.
|
51
|
+
* `media`: A list of media items attached to this article.
|
52
|
+
* `text`: The body of the article. This will be plain text unless you specify
|
53
|
+
the HTML option in the request.
|
54
|
+
* `tags`: A list of tags/keywords extracted from the article.
|
55
|
+
* `xpath`: The XPath at which this article was found in the page.
|
56
|
+
|
57
|
+
### Options
|
58
|
+
|
59
|
+
You can customize your request like this:
|
60
|
+
|
61
|
+
``` ruby
|
62
|
+
article = Diffbot::Article.fetch(article_url, diffbot_token) do |request|
|
63
|
+
request.html = true # Return HTML instead of plain text.
|
64
|
+
request.dont_strip_ads = true # Leave any inline ads within the article.
|
65
|
+
request.tags = true # Generate tags for the article.
|
66
|
+
request.comments = true # Extract the comments from the article as well.
|
67
|
+
request.summary = true # Return a summary text instead of the full text.
|
68
|
+
request.stats = true # Return performance, probabilistic scoring stats.
|
69
|
+
end
|
70
|
+
```
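
The options above use Ruby-style names, while the gem's source declares the API-side name `dontStripAds` for `dont_strip_ads`. The sketch below shows one way such options could end up in a query hash. It is an assumption-laden illustration: `sketch_article_query` is a hypothetical helper, not a method of the gem, and the URL and token are placeholders.

```ruby
# Hypothetical helper: build the query hash for an article request.
# Only dont_strip_ads is renamed; the other options keep their names.
def sketch_article_query(url, token, options = {})
  api_names = {
    html: :html,
    dont_strip_ads: :dontStripAds,
    tags: :tags,
    comments: :comments,
    summary: :summary,
    stats: :stats
  }

  query = { token: token, url: url }
  options.each do |name, value|
    key = api_names.fetch(name) { raise ArgumentError, "unknown option: #{name}" }
    # Options left as nil are not sent at all.
    query[key] = value unless value.nil?
  end
  query
end

query = sketch_article_query("http://example.com/post", "placeholder-token",
                             { html: true, dont_strip_ads: true })
p query
```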
|
71
|
+
|
72
|
+
## Frontpages
|
73
|
+
|
74
|
+
In order to fetch and analyze a front page, do this:
|
75
|
+
|
76
|
+
``` ruby
|
77
|
+
require "diffbot/frontpage"
|
78
|
+
|
79
|
+
frontpage = Diffbot::Frontpage.fetch(url, diffbot_token)
|
80
|
+
|
81
|
+
# Results are available in the returned object:
|
82
|
+
frontpage.title
|
83
|
+
frontpage.icon
|
84
|
+
frontpage.items #=> An array of Diffbot::Item instances
|
85
|
+
```
|
86
|
+
|
87
|
+
The fields you can extract from a Frontpage are:
|
88
|
+
|
89
|
+
* `title`: The title of the page.
|
90
|
+
* `icon`: The favicon of the page.
|
91
|
+
* `source_type`: What kind of page this is.
|
92
|
+
* `source_url`: The URL of the page.
|
93
|
+
* `items`: The list of `Diffbot::Item` representing each item on the page.
|
94
|
+
|
95
|
+
The instances of `Diffbot::Item` have the following fields:
|
96
|
+
|
97
|
+
* `id`: Unique identifier for this item.
|
98
|
+
* `title`: Title of the item.
|
99
|
+
* `link`: Extracted permalink of the item (if applicable).
|
100
|
+
* `description`: innerHTML content of the item.
|
101
|
+
* `summary`: A plain-text summary of the item.
|
102
|
+
* `pub_date`: Date when item was detected on page.
|
103
|
+
* `type`: The type of item, according to Diffbot. One of: `IMAGE`, `LINK`,
|
104
|
+
`STORY`, `CHUNK`.
|
105
|
+
* `img`: The main image extracted from this item.
|
106
|
+
* `xroot`: XPath of where the item was found on the page.
|
107
|
+
* `cluster`: XPath of the cluster of items where this item was found.
|
108
|
+
* `stats`: An object with the following attributes:
|
109
|
+
* `spam_score`: A Float between 0.0 and 1.0 indicating the probability this
|
110
|
+
item is spam/an advertisement.
|
111
|
+
* `static_rank`: A Float between 1.0 and 5.0 indicating the quality score of
|
112
|
+
the item.
|
113
|
+
* `fresh`: The percentage of the item that has changed compared to the
|
114
|
+
previous crawl.
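
In the gem's source, the three stats arrive from the API as string attributes under the short names `fresh`, `sr`, and `sp`, and are converted with `to_f`. The following is a minimal sketch of that conversion, assuming a plain hash of raw attributes; `SketchStats` and `sketch_stats` are hypothetical names, and the sample values are made up.

```ruby
# Hypothetical value object mirroring the three stats fields listed above.
SketchStats = Struct.new(:fresh, :static_rank, :spam_score)

# Convert the raw string attributes into floats under the longer names.
# A missing attribute is nil, and nil.to_f is 0.0.
def sketch_stats(raw)
  SketchStats.new(raw["fresh"].to_f, raw["sr"].to_f, raw["sp"].to_f)
end

stats = sketch_stats({ "fresh" => "0.25", "sr" => "3.5", "sp" => "0.1" })
puts stats.static_rank
```

One consequence of using `to_f` is that an absent attribute is indistinguishable from a real value of `0.0`.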
|
115
|
+
|
116
|
+
## TODO
|
117
|
+
|
118
|
+
* Implement the Follow API.
|
119
|
+
* Add tests for Article and Frontpage requests.
|
120
|
+
* Add a Frontpage.crawl method that, given the URL of a frontpage, fetches
|
121
|
+
the article for each item in the page.
|
122
|
+
|
123
|
+
## License
|
124
|
+
|
125
|
+
This is published under an MIT License, see LICENSE for further details.
|
data/Rakefile
ADDED
@@ -0,0 +1,15 @@
|
|
1
|
+
require "rake/testtask"
|
2
|
+
require "rubygems/package_task"
|
3
|
+
|
4
|
+
gem_spec = eval(File.read("./diffbot.gemspec")) rescue nil
|
5
|
+
Gem::PackageTask.new(gem_spec) do |pkg|
|
6
|
+
pkg.need_zip = false
|
7
|
+
pkg.need_tar = false
|
8
|
+
end
|
9
|
+
|
10
|
+
Rake::TestTask.new do |t|
|
11
|
+
t.pattern = "test/*_test.rb"
|
12
|
+
t.verbose = true
|
13
|
+
end
|
14
|
+
|
15
|
+
task default: :test
|
data/diffbot.gemspec
ADDED
@@ -0,0 +1,19 @@
|
|
1
|
+
Gem::Specification.new do |s|
|
2
|
+
s.name = "diffbot"
|
3
|
+
s.version = "0.1.0"
|
4
|
+
s.description = "Diffbot provides a concise API for analyzing and extracting semantic information from web pages using Diffbot (http://www.diffbot.com)."
|
5
|
+
s.summary = "Ruby interface to the Diffbot API"
|
6
|
+
s.authors = ["Nicolas Sanguinetti"]
|
7
|
+
s.email = "hi@nicolassanguinetti.info"
|
8
|
+
s.homepage = "http://github.com/tinder/diffbot"
|
9
|
+
s.has_rdoc = false
|
10
|
+
s.files = `git ls-files`.split "\n"
|
11
|
+
s.platform = Gem::Platform::RUBY
|
12
|
+
|
13
|
+
s.add_dependency("excon")
|
14
|
+
s.add_dependency("yajl-ruby")
|
15
|
+
s.add_dependency("nokogiri")
|
16
|
+
s.add_dependency("hashie")
|
17
|
+
|
18
|
+
s.add_development_dependency("minitest")
|
19
|
+
end
|
data/lib/diffbot.rb
ADDED
@@ -0,0 +1,45 @@
|
|
1
|
+
require "hashie/trash"
|
2
|
+
require "diffbot/coercible_hash"
|
3
|
+
require "diffbot/request"
|
4
|
+
require "diffbot/article"
|
5
|
+
require "diffbot/frontpage"
|
6
|
+
|
7
|
+
module Diffbot
|
8
|
+
# Public: Set global options. This is a nice API to group calls to the Diffbot
|
9
|
+
# module.
|
10
|
+
#
|
11
|
+
# Yields the Diffbot module so you can set options on it.
|
12
|
+
#
|
13
|
+
# Returns self.
|
14
|
+
def self.configure
|
15
|
+
yield self
|
16
|
+
self
|
17
|
+
end
|
18
|
+
|
19
|
+
# Public: Configure the default request parameters for Article requests. See
|
20
|
+
# Article::RequestParams documentation for the specific configuration values
|
21
|
+
# you can set.
|
22
|
+
#
|
23
|
+
# Yields the default Article::RequestParams object.
|
24
|
+
#
|
25
|
+
# Returns the default Article::RequestParams object.
|
26
|
+
def self.article_defaults
|
27
|
+
if block_given?
|
28
|
+
@article_defaults = Article::RequestParams.new
|
29
|
+
yield @article_defaults
|
30
|
+
else
|
31
|
+
@article_defaults ||= Article::RequestParams.new
|
32
|
+
end
|
33
|
+
|
34
|
+
@article_defaults
|
35
|
+
end
|
36
|
+
|
37
|
+
class << self
|
38
|
+
# Public: Your Diffbot API token.
|
39
|
+
attr_accessor :token
|
40
|
+
|
41
|
+
# Public: The object used for network instrumentation. Must match
|
42
|
+
# ActiveSupport::Notifications API.
|
43
|
+
attr_accessor :instrumentor
|
44
|
+
end
|
45
|
+
end
|
data/lib/diffbot/article.rb
ADDED
@@ -0,0 +1,194 @@
|
|
1
|
+
require "yajl"
|
2
|
+
require "diffbot"
|
3
|
+
require "diffbot/coercible_hash"
|
4
|
+
|
5
|
+
module Diffbot
|
6
|
+
# Representation of an article (i.e. a blog post or similar). This class offers
|
7
|
+
# a single entry point: the `.fetch` method, that, given a URL, will return
|
8
|
+
# the article as analyzed by Diffbot.
|
9
|
+
class Article < Hashie::Trash
|
10
|
+
extend CoercibleHash
|
11
|
+
|
12
|
+
# Public: Fetch an article from a URL.
|
13
|
+
#
|
14
|
+
# url - The article URL.
|
15
|
+
# token - The API token for Diffbot.
|
16
|
+
# parser - The callable object that will parse the raw output from the
|
17
|
+
# API. Defaults to Yajl::Parser.method(:parse).
|
18
|
+
# defaults - The default request options. See Diffbot.article_defaults.
|
19
|
+
#
|
20
|
+
# Yields the request configuration.
|
21
|
+
#
|
22
|
+
# Examples
|
23
|
+
#
|
24
|
+
# # Request an article with the default options.
|
25
|
+
# article = Diffbot::Article.fetch(url, api_token)
|
26
|
+
#
|
27
|
+
# # Pass options to the request. See Diffbot::Article::RequestParams to
|
28
|
+
# # see the available configuration options.
|
29
|
+
# article = Diffbot::Article.fetch(url, api_token) do |req|
|
30
|
+
# req.html = true
|
31
|
+
# end
|
32
|
+
#
|
33
|
+
# Returns a Diffbot::Article.
|
34
|
+
def self.fetch(url, token=Diffbot.token, parser=Yajl::Parser.method(:parse), defaults=Diffbot.article_defaults)
|
35
|
+
params = defaults.dup
|
36
|
+
yield params if block_given?
|
37
|
+
|
38
|
+
request = Diffbot::Request.new(token)
|
39
|
+
response = request.perform(:get, endpoint, params) do |req|
|
40
|
+
req[:query][:url] = url
|
41
|
+
end
|
42
|
+
|
43
|
+
new(parser.call(response.body))
|
44
|
+
end
|
45
|
+
|
46
|
+
# The API endpoint where requests should be made.
|
47
|
+
#
|
48
|
+
# Returns a URL.
|
49
|
+
def self.endpoint
|
50
|
+
"http://www.diffbot.com/api/article"
|
51
|
+
end
|
52
|
+
|
53
|
+
# Public: URL of the article.
|
54
|
+
property :url
|
55
|
+
|
56
|
+
# Public: Title of the article.
|
57
|
+
property :title
|
58
|
+
|
59
|
+
# Public: Author (or Authors) of the article.
|
60
|
+
property :author
|
61
|
+
|
62
|
+
# Public: Date of the article (as a string).
|
63
|
+
property :date
|
64
|
+
|
65
|
+
class MediaItem < Hashie::Trash
|
66
|
+
property :type
|
67
|
+
property :link
|
68
|
+
property :primary, default: false
|
69
|
+
end
|
70
|
+
|
71
|
+
# Public: List of media items related to the article. Each item is an
|
72
|
+
# object with the following attributes:
|
73
|
+
#
|
74
|
+
# type - Either `"image"` or `"video"`.
|
75
|
+
# link - The URL of the given media resource.
|
76
|
+
# primary - Only present in one of the items. This is assumed to be the most
|
77
|
+
# representative media for this article.
|
78
|
+
property :media
|
79
|
+
coerce_property :media, collection: MediaItem
|
80
|
+
|
81
|
+
# Public: The raw text of the article, without formatting.
|
82
|
+
property :text
|
83
|
+
|
84
|
+
# Public: The contents of the article in HTML, stripped of any ads or other
|
85
|
+
# chunks of HTML which are considered unrelated by Diffbot, unless you set
|
86
|
+
# the `dont_strip_ads` option in the request.
|
87
|
+
#
|
88
|
+
# Only present if you set `html` to true in the request.
|
89
|
+
property :html
|
90
|
+
|
91
|
+
# Public: A summary line for this article.
|
92
|
+
#
|
93
|
+
# Only present if you set `summary` to true in the request.
|
94
|
+
property :summary
|
95
|
+
|
96
|
+
# Public: A list of tags related to this article.
|
97
|
+
#
|
98
|
+
# Only present if you set `tags` to true in the request.
|
99
|
+
property :tags
|
100
|
+
|
101
|
+
# Public: The favicon of the page where this article was extracted from.
|
102
|
+
property :icon
|
103
|
+
|
104
|
+
class Stats < Hashie::Trash
|
105
|
+
property :fetch_time, from: :fetchTime
|
106
|
+
property :confidence
|
107
|
+
end
|
108
|
+
|
109
|
+
# Public: Returns an object with the following attributes:
|
110
|
+
#
|
111
|
+
# fetch_time - The time of the request, in ms.
|
112
|
+
# confidence - The confidence of Diffbot that the returned text is really
|
113
|
+
# the text of the article. Between 0.0 and 1.0.
|
114
|
+
#
|
115
|
+
# Only present if you set `stats` to true in the request.
|
116
|
+
property :stats
|
117
|
+
coerce_property :stats, class: Stats
|
118
|
+
|
119
|
+
# Public: The XPath selector at which the body of the article was found in
|
120
|
+
# the page.
|
121
|
+
property :xpath
|
122
|
+
|
123
|
+
# Public: If there was an error in the request, this will contain the error
|
124
|
+
# message.
|
125
|
+
property :error
|
126
|
+
|
127
|
+
# Public: If there was an error in the request, this will contain the error
|
128
|
+
# code.
|
129
|
+
property :error_code, from: :errorCode
|
130
|
+
|
131
|
+
# This represents the parameters you can pass to Diffbot to configure a
|
132
|
+
# given request. These are either set globally with Diffbot.article_defaults
|
133
|
+
# or on a per-request basis by passing a block to Diffbot::Article.fetch.
|
134
|
+
#
|
135
|
+
# Example:
|
136
|
+
#
|
137
|
+
# # All article requests will include the HTML and tags.
|
138
|
+
# Diffbot.configure do |config|
|
139
|
+
# config.article_defaults do |defaults|
|
140
|
+
# defaults.html = true
|
141
|
+
# defaults.tags = true
|
142
|
+
# end
|
143
|
+
# end
|
144
|
+
#
|
145
|
+
# # This article request will *also* include the summary.
|
146
|
+
# Diffbot::Article.fetch(url, token) do |req|
|
147
|
+
# req.summary = true
|
148
|
+
# end
|
149
|
+
class RequestParams < Hashie::Trash
|
150
|
+
# Public: Set to true to return HTML instead of plain-text.
|
151
|
+
#
|
152
|
+
# Defaults to nil.
|
153
|
+
#
|
154
|
+
# If enabled, sets the `html` key in the `Diffbot::Article`.
|
155
|
+
property :html
|
156
|
+
|
157
|
+
# Public: Set to true to keep any inline ads in the generated story.
|
158
|
+
#
|
159
|
+
# Defaults to nil.
|
160
|
+
#
|
161
|
+
# If enabled, it will change the `html` key in the `Diffbot::Article`.
|
162
|
+
property :dontStripAds, from: :dont_strip_ads
|
163
|
+
|
164
|
+
# Public: Set to true to generate tags for the extracted story.
|
165
|
+
#
|
166
|
+
# Defaults to nil.
|
167
|
+
#
|
168
|
+
# If enabled, sets the `tags` key in the `Diffbot::Article`.
|
169
|
+
property :tags
|
170
|
+
|
171
|
+
# Public: Set to true to find the comments and identify count, link, etc.
|
172
|
+
#
|
173
|
+
# Defaults to nil.
|
174
|
+
#
|
175
|
+
# If enabled, sets the `comments` key in the `Diffbot::Article`.
|
176
|
+
property :comments
|
177
|
+
|
178
|
+
# Public: Set to true to return a summary text.
|
179
|
+
#
|
180
|
+
# Defaults to nil.
|
181
|
+
#
|
182
|
+
# If enabled, sets the `summary` key in the `Diffbot::Article`.
|
183
|
+
property :summary
|
184
|
+
|
185
|
+
# Public: Set to true to include performance and probabilistic scoring
|
186
|
+
# stats.
|
187
|
+
#
|
188
|
+
# Defaults to nil.
|
189
|
+
#
|
190
|
+
# If enabled, sets the `stats` key in the `Diffbot::Article`.
|
191
|
+
property :stats
|
192
|
+
end
|
193
|
+
end
|
194
|
+
end
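
`Article.fetch` copies the defaults with `dup` before yielding them, so per-request changes should not leak back into the global defaults. The sketch below illustrates that idea with a plain `Struct`; `SketchParams` and `sketch_params_for_request` are hypothetical stand-ins, not the gem's classes.

```ruby
# Hypothetical stand-in for Article::RequestParams.
SketchParams = Struct.new(:html, :tags, :summary)

# Copy the defaults, let the caller adjust the copy, and return the copy.
def sketch_params_for_request(defaults)
  params = defaults.dup
  yield params if block_given?
  params
end

# Global defaults: HTML and tags on, summary unset.
defaults = SketchParams.new(true, true, nil)

per_request = sketch_params_for_request(defaults) { |req| req.summary = true }

puts per_request.summary.inspect
puts defaults.summary.inspect
```

Because `dup` makes a shallow copy, this isolation only covers top-level values; a nested mutable value would still be shared between the copy and the defaults.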
|
data/lib/diffbot/coercible_hash.rb
ADDED
@@ -0,0 +1,113 @@
|
|
1
|
+
module Diffbot
|
2
|
+
# Public: Extend a hash with this mixin to make keys coercible to certain
|
3
|
+
# classes. These keys, when assigned to the hash, will be transformed into the
|
4
|
+
# specified classes.
|
5
|
+
#
|
6
|
+
# The objects you pass as coercion types should implement either a `coerce` or
|
7
|
+
# a `new` method.
|
8
|
+
#
|
9
|
+
# You can define rules to coerce properties into classes or collections of
|
10
|
+
# classes. In the latter case, CoercibleHash will just map over whatever value
|
11
|
+
# is passed and attempt to coerce each item individually to the given class.
|
12
|
+
#
|
13
|
+
# Examples
|
14
|
+
#
|
15
|
+
# class Address < Struct.new(:street, :zipcode, :state)
|
16
|
+
# def self.coerce(address)
|
17
|
+
# new(address[:street], address[:zipcode], address[:state])
|
18
|
+
# end
|
19
|
+
# end
|
20
|
+
#
|
21
|
+
# class Person < Hash
|
22
|
+
# extend Diffbot::CoercibleHash
|
23
|
+
#
|
24
|
+
# coerce_property :address, Address
|
25
|
+
# coerce_property :children, collection: Person
|
26
|
+
#
|
27
|
+
# def name
|
28
|
+
# self["name"]
|
29
|
+
# end
|
30
|
+
# end
|
31
|
+
#
|
32
|
+
# person = Person.new(address: {
|
33
|
+
# street: "123 Example St.", zipcode: "12345", state: "XX"
|
34
|
+
# })
|
35
|
+
#
|
36
|
+
# person.address.street #=> "123 Example St."
|
37
|
+
# # etc.
|
38
|
+
#
|
39
|
+
# father = Person.new(name: "John", children: [
|
40
|
+
# { name: "Tim" }, { name: "Sarah" }
|
41
|
+
# ])
|
42
|
+
#
|
43
|
+
# father.name #=> "John"
|
44
|
+
# father.children.first.name #=> "Tim"
|
45
|
+
# father.children.last.name #=> "Sarah"
|
46
|
+
module CoercibleHash
|
47
|
+
# The coercion rules defined for this hash.
|
48
|
+
attr_reader :coercions
|
49
|
+
|
50
|
+
# Adds a #[]= that checks for coercion on the property and delegates to super.
|
51
|
+
def self.extended(base)
|
52
|
+
base.instance_variable_set("@coercions", {})
|
53
|
+
base.class_eval do
|
54
|
+
def []=(property, value)
|
55
|
+
if self.class.coercions.key?(property.to_s)
|
56
|
+
super property, self.class.coercions[property.to_s].(value)
|
57
|
+
else
|
58
|
+
super
|
59
|
+
end
|
60
|
+
end
|
61
|
+
end
|
62
|
+
end
|
63
|
+
|
64
|
+
# Public: Coerce a property of this hash into a given type. We will try to
|
65
|
+
# call .coerce on the object you pass as the class, and if it is not defined, we will
|
66
|
+
# call .new.
|
67
|
+
#
|
68
|
+
# property - The name of the property to coerce.
|
69
|
+
# class_or_options - Either a class to coerce to, or a hash with options:
|
70
|
+
# * class: The class to coerce to
|
71
|
+
# * collection: Coerce the key into an array of members of
|
72
|
+
# this class.
|
73
|
+
#
|
74
|
+
# Examples
|
75
|
+
#
|
76
|
+
# class Person < Hash
|
77
|
+
# extend Diffbot::CoercibleHash
|
78
|
+
#
|
79
|
+
# coerce_property :address, Address
|
80
|
+
#
|
81
|
+
# coerce_property :children, collection: Person
|
82
|
+
#
|
83
|
+
# coerce_property :dob, class: Date
|
84
|
+
# end
|
85
|
+
def coerce_property(property, options)
|
86
|
+
unless options.is_a?(Hash)
|
87
|
+
options = { class: options }
|
88
|
+
end
|
89
|
+
|
90
|
+
coercion_method = ->(obj) do
|
91
|
+
if obj.respond_to?(:coerce)
|
92
|
+
obj.method(:coerce)
|
93
|
+
elsif obj.respond_to?(:new)
|
94
|
+
obj.method(:new)
|
95
|
+
else
|
96
|
+
raise ArgumentError, "#{obj.inspect} implements neither .coerce nor .new"
|
97
|
+
end
|
98
|
+
end
|
99
|
+
|
100
|
+
if options.has_key?(:collection)
|
101
|
+
klass = options[:collection]
|
102
|
+
coercion = ->(value) { value.map { |el| coercion_method[klass][el] } }
|
103
|
+
elsif options.has_key?(:class)
|
104
|
+
klass = options[:class]
|
105
|
+
coercion = ->(value) { coercion_method[klass][value] }
|
106
|
+
else
|
107
|
+
raise ArgumentError, "You need to specify either :class or :collection"
|
108
|
+
end
|
109
|
+
|
110
|
+
coercions[property.to_s] = coercion
|
111
|
+
end
|
112
|
+
end
|
113
|
+
end
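
The lookup order used by `coerce_property` (prefer `.coerce`, fall back to `.new`, otherwise raise) can be tried in isolation. This is a minimal sketch of that order only, not the module itself; `sketch_coercer_for`, `SketchUpcase`, and `SketchPoint` are hypothetical names introduced for the demonstration.

```ruby
# Hypothetical helper: pick the callable used to coerce values into `target`.
def sketch_coercer_for(target)
  if target.respond_to?(:coerce)
    target.method(:coerce)
  elsif target.respond_to?(:new)
    target.method(:new)
  else
    raise ArgumentError, "#{target.inspect} implements neither .coerce nor .new"
  end
end

# A target that only defines .coerce.
module SketchUpcase
  def self.coerce(value)
    value.to_s.upcase
  end
end

# A target that only defines .new (every Struct class does).
SketchPoint = Struct.new(:x)

puts sketch_coercer_for(SketchUpcase).call("abc")
puts sketch_coercer_for(SketchPoint).call(3).x

# A collection rule maps the same callable over each element.
coerce_each = ->(values) { values.map { |el| sketch_coercer_for(SketchUpcase).call(el) } }
puts coerce_each.call(%w[a b]).inspect
```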
|
data/lib/diffbot/frontpage.rb
ADDED
@@ -0,0 +1,60 @@
|
|
1
|
+
require "nokogiri"
|
2
|
+
require "diffbot"
|
3
|
+
require "diffbot/item"
|
4
|
+
|
5
|
+
module Diffbot
|
6
|
+
# Representation of a front page. This class offers a single entry point: the
|
7
|
+
# `.fetch` method, that, given a URL, will return the front page as analyzed
|
8
|
+
# by Diffbot.
|
9
|
+
class Frontpage < Hashie::Trash
|
10
|
+
extend CoercibleHash
|
11
|
+
|
12
|
+
# Public: Fetch a frontpage's information from a URL.
|
13
|
+
#
|
14
|
+
# url - The frontpage URL.
|
15
|
+
# token - The API token for Diffbot.
|
16
|
+
# parser - The callable object that will parse the raw output from the
|
17
|
+
# API. Defaults to Diffbot::Frontpage::DmlParser.method(:parse).
|
18
|
+
#
|
19
|
+
# Examples
|
20
|
+
#
|
21
|
+
# # Request a frontpage with the default options.
|
22
|
+
# frontpage = Diffbot::Frontpage.fetch(url, api_token)
|
23
|
+
#
|
24
|
+
# Returns a Diffbot::Frontpage.
|
25
|
+
def self.fetch(url, token=Diffbot.token, parser=Diffbot::Frontpage::DmlParser.method(:parse))
|
26
|
+
request = Diffbot::Request.new(token)
|
27
|
+
response = request.perform(:get, endpoint) do |req|
|
28
|
+
req[:query][:url] = url
|
29
|
+
end
|
30
|
+
|
31
|
+
new(parser.call(response.body))
|
32
|
+
end
|
33
|
+
|
34
|
+
# The API endpoint where requests should be made.
|
35
|
+
#
|
36
|
+
# Returns a URL.
|
37
|
+
def self.endpoint
|
38
|
+
"http://www.diffbot.com/api/frontpage"
|
39
|
+
end
|
40
|
+
|
41
|
+
# Public: The title of the page.
|
42
|
+
property :title
|
43
|
+
|
44
|
+
# Public: The favicon of the page.
|
45
|
+
property :icon
|
46
|
+
|
47
|
+
# Public: The kind of page this is.
|
48
|
+
property :source_type, from: :sourceType
|
49
|
+
|
50
|
+
# Public: The URL where this page was extracted from.
|
51
|
+
property :source_url, from: :sourceURL
|
52
|
+
|
53
|
+
# Public: The items extracted from the page. These are instances of
|
54
|
+
# Diffbot::Item.
|
55
|
+
property :items
|
56
|
+
coerce_property :items, collection: Item
|
57
|
+
end
|
58
|
+
end
|
59
|
+
|
60
|
+
require "diffbot/frontpage/dml_parser"
|
data/lib/diffbot/frontpage/dml_parser.rb
ADDED
@@ -0,0 +1,83 @@
|
|
1
|
+
# Parser that takes the XML generated from Diffbot's Frontpage API call and
|
2
|
+
# returns a hash suitable for Diffbot::Frontpage.
|
3
|
+
class Diffbot::Frontpage::DmlParser
|
4
|
+
# Take the string of DML and convert it into a nice little hash we can pass to
|
5
|
+
# Diffbot::Frontpage.
|
6
|
+
#
|
7
|
+
# dml - A string of DML.
|
8
|
+
#
|
9
|
+
# Returns a Hash.
|
10
|
+
def self.parse(dml)
|
11
|
+
node = Nokogiri(dml).root
|
12
|
+
parser = new(node)
|
13
|
+
parser.parse
|
14
|
+
end
|
15
|
+
|
16
|
+
# Initialize the parser with a DML node.
|
17
|
+
#
|
18
|
+
# node - The root XML::Element.
|
19
|
+
def initialize(node)
|
20
|
+
@dml = node
|
21
|
+
end
|
22
|
+
|
23
|
+
# The root element of the DML document.
|
24
|
+
attr_reader :dml
|
25
|
+
|
26
|
+
# Parses the Diffbot Markup Language and generates a Hash that we can pass to
|
27
|
+
# Frontpage.new.
|
28
|
+
#
|
29
|
+
# Returns a Hash.
|
30
|
+
def parse
|
31
|
+
attrs = {}
|
32
|
+
|
33
|
+
info = dml % "info"
|
34
|
+
attrs["title"] = (info % "title").text
|
35
|
+
attrs["icon"] = (info % "icon").text
|
36
|
+
attrs["sourceType"] = (info % "sourceType").text
|
37
|
+
attrs["sourceURL"] = (info % "sourceURL").text
|
38
|
+
|
39
|
+
items = dml / "item"
|
40
|
+
attrs["items"] = items.map do |item|
|
41
|
+
ItemParser.new(item).parse
|
42
|
+
end
|
43
|
+
|
44
|
+
attrs
|
45
|
+
end
|
46
|
+
|
47
|
+
# Parser that takes the XML from a particular item from the XML returned from
|
48
|
+
# the frontpage API.
|
49
|
+
class ItemParser
|
50
|
+
# The root element of each item.
|
51
|
+
attr_reader :item
|
52
|
+
|
53
|
+
# Initialize the parser with an Item node.
|
54
|
+
#
|
55
|
+
# item_node - The root node of the item.
|
56
|
+
def initialize(item_node)
|
57
|
+
@item = item_node
|
58
|
+
end
|
59
|
+
|
60
|
+
# Parses the item's DML and generates a Hash that we can add to the DML
|
61
|
+
# parser's "items" key together with the other items.
|
62
|
+
#
|
63
|
+
# Returns a Hash.
|
64
|
+
def parse
|
65
|
+
attrs = {}
|
66
|
+
|
67
|
+
%w(title link pubDate description textSummary).each do |attr|
|
68
|
+
node = item % attr
|
69
|
+
attrs[attr] = node && node.text
|
70
|
+
end
|
71
|
+
|
72
|
+
%w(type img id xroot cluster).each do |attr|
|
73
|
+
attrs[attr] = item[attr]
|
74
|
+
end
|
75
|
+
|
76
|
+
attrs["stats"] = %w(fresh sp sr).each_with_object({}) do |attr, hash|
|
77
|
+
hash[attr] = item[attr].to_f
|
78
|
+
end
|
79
|
+
|
80
|
+
attrs
|
81
|
+
end
|
82
|
+
end
|
83
|
+
end
|
data/lib/diffbot/item.rb
ADDED
@@ -0,0 +1,55 @@
|
|
1
|
+
module Diffbot
|
2
|
+
class Item < Hashie::Trash
|
3
|
+
extend CoercibleHash
|
4
|
+
|
5
|
+
class Stats < Hashie::Trash
|
6
|
+
property :fresh
|
7
|
+
property :static_rank, from: :sr
|
8
|
+
property :spam_score, from: :sp
|
9
|
+
end
|
10
|
+
|
11
|
+
# Public: The identifier of this item.
|
12
|
+
property :id
|
13
|
+
|
14
|
+
# Public: The title of this item.
|
15
|
+
property :title
|
16
|
+
|
17
|
+
# Public: The permalink/URL for this item.
|
18
|
+
property :link
|
19
|
+
|
20
|
+
# Public: A string with the date of the item.
|
21
|
+
property :pub_date, from: :pubDate
|
22
|
+
|
23
|
+
# Public: The HTML from the item.
|
24
|
+
property :description
|
25
|
+
|
26
|
+
# Public: A summary line with text from the item.
|
27
|
+
property :summary, from: :textSummary
|
28
|
+
|
29
|
+
# Public: The type of the item. Can be either `IMAGE`, `LINK`, `STORY`, or
|
30
|
+
# `CHUNK` (a chunk of HTML).
|
31
|
+
property :type
|
32
|
+
|
33
|
+
# Public: The URL for the image of this item.
|
34
|
+
property :img
|
35
|
+
|
36
|
+
# Public: The XPath where this item is located.
|
37
|
+
property :xroot
|
38
|
+
|
39
|
+
# Public: The XPath for the cluster of items where this item comes from. If
|
40
|
+
# a frontpage has a main list of articles and a sidebar with
|
41
|
+
# "Top Articles", for example, both will be separate clusters, each with
|
42
|
+
# their own articles.
|
43
|
+
property :cluster
|
44
|
+
|
45
|
+
# Public: Stats extracted from this item. This is an object with the
|
46
|
+
# following attributes:
|
47
|
+
#
|
48
|
+
# fresh - The percentage of the item that has changed compared to the
|
49
|
+
# previous crawl.
|
50
|
+
# static_rank - The quality score of the item on a 1 to 5 scale.
|
51
|
+
# spam_score - The probability this item is spam/an advertisement.
|
52
|
+
property :stats
|
53
|
+
coerce_property :stats, class: Stats
|
54
|
+
end
|
55
|
+
end
|
data/lib/diffbot/request.rb
ADDED
@@ -0,0 +1,54 @@
|
|
1
|
+
require "excon"
|
2
|
+
|
3
|
+
module Diffbot
|
4
|
+
class Request
|
5
|
+
# The API token for Diffbot.
|
6
|
+
attr_reader :token
|
7
|
+
|
8
|
+
# Public: Initialize a new request to the API.
|
9
|
+
#
|
10
|
+
# token - The API token for Diffbot.
|
11
|
+
def initialize(token)
|
12
|
+
@token = token
|
13
|
+
end
|
14
|
+
|
15
|
+
# Public: Perform an HTTP request against Diffbot's API.
|
16
|
+
#
|
17
|
+
# method - The request method, one of :get, :head, :post, :put, or
|
18
|
+
# :delete.
|
19
|
+
# endpoint - The URL to which we'll make the request, as a String.
|
20
|
+
# query - A hash of query string params we want to pass along.
|
21
|
+
#
|
22
|
+
# Yields the request hash before making the request.
|
23
|
+
#
|
24
|
+
# Returns the response.
|
25
|
+
def perform(method, endpoint, query={})
|
26
|
+
request_options = build_request(method, query)
|
27
|
+
yield request_options if block_given?
|
28
|
+
|
29
|
+
request = Excon.new(endpoint)
|
30
|
+
|
31
|
+
request.request(request_options)
|
32
|
+
end
|
33
|
+
|
34
|
+
# Build the hash of options that Excon requires for an HTTP request.
|
35
|
+
#
|
36
|
+
# method - A Symbol with the HTTP method (:get, :post, etc).
|
37
|
+
# query_params - Any query parameters to add to the request.
|
38
|
+
#
|
39
|
+
# Returns a Hash.
|
40
|
+
def build_request(method, query_params={})
|
41
|
+
query = { token: token }.merge(query_params)
|
42
|
+
request = { query: query, method: method, headers: {} }
|
43
|
+
|
44
|
+
if Diffbot.instrumentor
|
45
|
+
request.update(
|
46
|
+
instrumentor: Diffbot.instrumentor,
|
47
|
+
instrumentor_name: "diffbot"
|
48
|
+
)
|
49
|
+
end
|
50
|
+
|
51
|
+
request
|
52
|
+
end
|
53
|
+
end
|
54
|
+
end
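
The shape of the options hash produced by `build_request` can be explored without making any network call. The sketch below reproduces that shape under stated assumptions: `sketch_build_request` is a hypothetical free-standing helper that takes the token and instrumentor as arguments, whereas the gem reads them from the request object and the `Diffbot` module.

```ruby
# Hypothetical helper mirroring the options hash built for an HTTP request.
def sketch_build_request(method, token, query_params = {}, instrumentor = nil)
  request = {
    # The token comes first, so a :token key in query_params would override it.
    query: { token: token }.merge(query_params),
    method: method,
    headers: {}
  }

  # Instrumentation keys are only added when an instrumentor is given.
  if instrumentor
    request.update(instrumentor: instrumentor, instrumentor_name: "diffbot")
  end

  request
end

options = sketch_build_request(:get, "placeholder-token", { html: true })

# Callers can still adjust the hash before sending, as the fetch methods do
# when they add the page URL to the query.
options[:query][:url] = "http://example.com/post"
p options
```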
|
data/test/coercible_hash_test.rb
ADDED
@@ -0,0 +1,66 @@
|
|
1
|
+
require "test_helper"
|
2
|
+
require "diffbot/coercible_hash"
|
3
|
+
|
4
|
+
describe Diffbot::CoercibleHash do
|
5
|
+
module Foo
|
6
|
+
def self.coerce(value)
|
7
|
+
"coerced #{value}"
|
8
|
+
end
|
9
|
+
end
|
10
|
+
|
11
|
+
module Bar
|
12
|
+
def self.new(value)
|
13
|
+
"initialized #{value}"
|
14
|
+
end
|
15
|
+
end
|
16
|
+
|
17
|
+
module Baz
|
18
|
+
def self.coerce(value)
|
19
|
+
"coerced #{value}"
|
20
|
+
end
|
21
|
+
|
22
|
+
def self.new(value)
|
23
|
+
"initialized #{value}"
|
24
|
+
end
|
25
|
+
end
|
26
|
+
|
27
|
+
class TestHash < Hash
|
28
|
+
extend Diffbot::CoercibleHash
|
29
|
+
|
30
|
+
coerce_property :foo, Foo
|
31
|
+
coerce_property :foos, collection: Foo
|
32
|
+
|
33
|
+
coerce_property :bar, Bar
|
34
|
+
|
35
|
+
coerce_property :baz, Baz
|
36
|
+
end
|
37
|
+
|
38
|
+
subject do
|
39
|
+
TestHash.new
|
40
|
+
end
|
41
|
+
|
42
|
+
it "coerces keys using the .coerce method" do
|
43
|
+
subject["foo"] = 1
|
44
|
+
subject["foo"].must_equal("coerced 1")
|
45
|
+
end
|
46
|
+
|
47
|
+
it "coerces collections" do
|
48
|
+
subject["foos"] = [1, 2, 3]
|
49
|
+
subject["foos"].must_equal(["coerced 1", "coerced 2", "coerced 3"])
|
50
|
+
end
|
51
|
+
|
52
|
+
it "coerces keys using the .new method" do
|
53
|
+
subject["bar"] = 2
|
54
|
+
subject["bar"].must_equal("initialized 2")
|
55
|
+
end
|
56
|
+
|
57
|
+
it "when both are present, prefers .coerce" do
|
58
|
+
subject["baz"] = 3
|
59
|
+
subject["baz"].must_equal("coerced 3")
|
60
|
+
end
|
61
|
+
|
62
|
+
it "coerces symbols as well" do
|
63
|
+
subject[:foo] = 2
|
64
|
+
subject[:foo].must_equal("coerced 2")
|
65
|
+
end
|
66
|
+
end
|
data/test/test_helper.rb
ADDED
metadata
ADDED
@@ -0,0 +1,114 @@
|
|
1
|
+
--- !ruby/object:Gem::Specification
|
2
|
+
name: diffbot
|
3
|
+
version: !ruby/object:Gem::Version
|
4
|
+
version: 0.1.0
|
5
|
+
prerelease:
|
6
|
+
platform: ruby
|
7
|
+
authors:
|
8
|
+
- Nicolas Sanguinetti
|
9
|
+
autorequire:
|
10
|
+
bindir: bin
|
11
|
+
cert_chain: []
|
12
|
+
date: 2012-02-06 00:00:00.000000000 Z
|
13
|
+
dependencies:
|
14
|
+
- !ruby/object:Gem::Dependency
|
15
|
+
name: excon
|
16
|
+
requirement: &70280593864880 !ruby/object:Gem::Requirement
|
17
|
+
none: false
|
18
|
+
requirements:
|
19
|
+
- - ! '>='
|
20
|
+
- !ruby/object:Gem::Version
|
21
|
+
version: '0'
|
22
|
+
type: :runtime
|
23
|
+
prerelease: false
|
24
|
+
version_requirements: *70280593864880
|
25
|
+
- !ruby/object:Gem::Dependency
|
26
|
+
name: yajl-ruby
|
27
|
+
requirement: &70280593864420 !ruby/object:Gem::Requirement
|
28
|
+
none: false
|
29
|
+
requirements:
|
30
|
+
- - ! '>='
|
31
|
+
- !ruby/object:Gem::Version
|
32
|
+
version: '0'
|
33
|
+
type: :runtime
|
34
|
+
prerelease: false
|
35
|
+
version_requirements: *70280593864420
|
36
|
+
- !ruby/object:Gem::Dependency
|
37
|
+
name: nokogiri
|
38
|
+
requirement: &70280593864000 !ruby/object:Gem::Requirement
|
39
|
+
none: false
|
40
|
+
requirements:
|
41
|
+
- - ! '>='
|
42
|
+
- !ruby/object:Gem::Version
|
43
|
+
version: '0'
|
44
|
+
type: :runtime
|
45
|
+
prerelease: false
|
46
|
+
version_requirements: *70280593864000
|
47
|
+
- !ruby/object:Gem::Dependency
|
48
|
+
name: hashie
|
49
|
+
requirement: &70280593863580 !ruby/object:Gem::Requirement
|
50
|
+
none: false
|
51
|
+
requirements:
|
52
|
+
- - ! '>='
|
53
|
+
- !ruby/object:Gem::Version
|
54
|
+
version: '0'
|
55
|
+
type: :runtime
|
56
|
+
prerelease: false
|
57
|
+
version_requirements: *70280593863580
|
58
|
+
- !ruby/object:Gem::Dependency
|
59
|
+
name: minitest
|
60
|
+
requirement: &70280593863160 !ruby/object:Gem::Requirement
|
61
|
+
none: false
|
62
|
+
requirements:
|
63
|
+
- - ! '>='
|
64
|
+
- !ruby/object:Gem::Version
|
65
|
+
version: '0'
|
66
|
+
type: :development
|
67
|
+
prerelease: false
|
68
|
+
version_requirements: *70280593863160
|
69
|
+
description: Diffbot provides a concise API for analyzing and extracting semantic
|
70
|
+
information from web pages using Diffbot (http://www.diffbot.com).
|
71
|
+
email: hi@nicolassanguinetti.info
|
72
|
+
executables: []
|
73
|
+
extensions: []
|
74
|
+
extra_rdoc_files: []
|
75
|
+
files:
|
76
|
+
- .gitignore
|
77
|
+
- LICENSE
|
78
|
+
- README.md
|
79
|
+
- Rakefile
|
80
|
+
- diffbot.gemspec
|
81
|
+
- lib/diffbot.rb
|
82
|
+
- lib/diffbot/article.rb
|
83
|
+
- lib/diffbot/coercible_hash.rb
|
84
|
+
- lib/diffbot/frontpage.rb
|
85
|
+
- lib/diffbot/frontpage/dml_parser.rb
|
86
|
+
- lib/diffbot/item.rb
|
87
|
+
- lib/diffbot/request.rb
|
88
|
+
- test/coercible_hash_test.rb
|
89
|
+
- test/test_helper.rb
|
90
|
+
homepage: http://github.com/tinder/diffbot
|
91
|
+
licenses: []
|
92
|
+
post_install_message:
|
93
|
+
rdoc_options: []
|
94
|
+
require_paths:
|
95
|
+
- lib
|
96
|
+
required_ruby_version: !ruby/object:Gem::Requirement
|
97
|
+
none: false
|
98
|
+
requirements:
|
99
|
+
- - ! '>='
|
100
|
+
- !ruby/object:Gem::Version
|
101
|
+
version: '0'
|
102
|
+
required_rubygems_version: !ruby/object:Gem::Requirement
|
103
|
+
none: false
|
104
|
+
requirements:
|
105
|
+
- - ! '>='
|
106
|
+
- !ruby/object:Gem::Version
|
107
|
+
version: '0'
|
108
|
+
requirements: []
|
109
|
+
rubyforge_project:
|
110
|
+
rubygems_version: 1.8.11
|
111
|
+
signing_key:
|
112
|
+
specification_version: 3
|
113
|
+
summary: Ruby interface to the Diffbot API
|
114
|
+
test_files: []
|