diffbot 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/.gitignore +1 -0
- data/LICENSE +19 -0
- data/README.md +125 -0
- data/Rakefile +15 -0
- data/diffbot.gemspec +19 -0
- data/lib/diffbot.rb +45 -0
- data/lib/diffbot/article.rb +194 -0
- data/lib/diffbot/coercible_hash.rb +113 -0
- data/lib/diffbot/frontpage.rb +60 -0
- data/lib/diffbot/frontpage/dml_parser.rb +83 -0
- data/lib/diffbot/item.rb +55 -0
- data/lib/diffbot/request.rb +54 -0
- data/test/coercible_hash_test.rb +66 -0
- data/test/test_helper.rb +2 -0
- metadata +114 -0
data/.gitignore
ADDED
@@ -0,0 +1 @@
+/pkg
data/LICENSE
ADDED
@@ -0,0 +1,19 @@
+Copyright (c) 2012 Nicolás Sanguinetti for Tinder Inc.
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,125 @@
+# Diffbot
+
+This is a Ruby client for the [Diffbot](http://diffbot.com) API.
+
+## Global Options
+
+You can pass some settings to Diffbot like this:
+
+``` ruby
+Diffbot.configure do |config|
+  config.token = ENV["DIFFBOT_TOKEN"]
+  config.instrumentor = ActiveSupport::Notifications
+end
+```
+
+The list of supported settings is:
+
+* `token`: Your Diffbot API token. This will be used for all requests in which
+  you don't specify it manually (see below).
+* `instrumentor`: An object that matches the [ActiveSupport::Notifications][1]
+  API, which will be used to trace network events. None is used by default.
+* `article_defaults`: Pass a block to this method to configure the global
+  request settings used for Diffbot::Article requests. See below for the
+  supported options.
+
+[1]: http://api.rubyonrails.org/classes/ActiveSupport/Notifications.html
+
+## Articles
+
+In order to fetch an article, do this:
+
+``` ruby
+require "diffbot/article"
+
+article = Diffbot::Article.fetch(article_url, diffbot_token)
+
+# Now you can inspect the result:
+article.title
+article.author
+article.date
+article.text
+# etc. See below for the full list of available response attributes.
+```
+
+This is a list of all the fields returned by the `Diffbot::Article.fetch` call:
+
+* `url`: The URL of the article.
+* `title`: The title of the article.
+* `author`: The author of the article.
+* `date`: The date on which this article was published.
+* `media`: A list of media items attached to this article.
+* `text`: The body of the article. This will be plain text unless you specify
+  the HTML option in the request.
+* `tags`: A list of tags/keywords extracted from the article.
+* `xpath`: The XPath at which this article was found in the page.
+
+### Options
+
+You can customize your request like this:
+
+``` ruby
+article = Diffbot::Article.fetch(article_url, diffbot_token) do |request|
+  request.html = true # Return HTML instead of plain text.
+  request.dont_strip_ads = true # Leave any inline ads within the article.
+  request.tags = true # Generate tags for the article.
+  request.comments = true # Extract the comments from the article as well.
+  request.summary = true # Return a summary text instead of the full text.
+  request.stats = true # Return performance, probabilistic scoring stats.
+end
+```
+
+## Frontpages
+
+In order to fetch and analyze a front page, do this:
+
+``` ruby
+require "diffbot/frontpage"
+
+frontpage = Diffbot::Frontpage.fetch(url, diffbot_token)
+
+# Results are available in the returned object:
+frontpage.title
+frontpage.icon
+frontpage.items #=> An array of Diffbot::Item instances
+```
+
+The fields you can extract from a Frontpage are:
+
+* `title`: The title of the page.
+* `icon`: The favicon of the page.
+* `source_type`: What kind of page this is.
+* `source_url`: The URL of the page.
+* `items`: The list of `Diffbot::Item` representing each item on the page.
+
+The instances of `Diffbot::Item` have the following fields:
+
+* `id`: Unique identifier for this item.
+* `title`: Title of the item.
+* `link`: Extracted permalink of the item (if applicable).
+* `description`: innerHTML content of the item.
+* `summary`: A plain-text summary of the item.
+* `pub_date`: Date when item was detected on page.
+* `type`: The type of item, according to Diffbot. One of: `IMAGE`, `LINK`,
+  `STORY`, `CHUNK`.
+* `img`: The main image extracted from this item.
+* `xroot`: XPath of where the item was found on the page.
+* `cluster`: XPath of the cluster of items where this item was found.
+* `stats`: An object with the following attributes:
+  * `spam_score`: A Float between 0.0 and 1.0 indicating the probability this
+    item is spam/an advertisement.
+  * `static_rank`: A Float between 1.0 and 5.0 indicating the quality score of
+    the item.
+  * `fresh`: The percentage of the item that has changed compared to the
+    previous crawl.
+
+## TODO
+
+* Implement the Follow API.
+* Add tests for Article and Frontpage requests.
+* Add a Frontpage.crawl method that, given the URL of a frontpage, fetches
+  the article for each item in the page.
+
+## License
+
+This is published under the MIT License; see LICENSE for further details.
data/Rakefile
ADDED
@@ -0,0 +1,15 @@
+require "rake/testtask"
+require "rubygems/package_task"
+
+gem_spec = eval(File.read("./diffbot.gemspec")) rescue nil
+Gem::PackageTask.new(gem_spec) do |pkg|
+  pkg.need_zip = false
+  pkg.need_tar = false
+end
+
+Rake::TestTask.new do |t|
+  t.pattern = "test/*_test.rb"
+  t.verbose = true
+end
+
+task default: :test
data/diffbot.gemspec
ADDED
@@ -0,0 +1,19 @@
+Gem::Specification.new do |s|
+  s.name = "diffbot"
+  s.version = "0.1.0"
+  s.description = "Diffbot provides a concise API for analyzing and extracting semantic information from web pages using Diffbot (http://www.diffbot.com)."
+  s.summary = "Ruby interface to the Diffbot API "
+  s.authors = ["Nicolas Sanguinetti"]
+  s.email = "hi@nicolassanguinetti.info"
+  s.homepage = "http://github.com/tinder/diffbot"
+  s.has_rdoc = false
+  s.files = `git ls-files`.split "\n"
+  s.platform = Gem::Platform::RUBY
+
+  s.add_dependency("excon")
+  s.add_dependency("yajl-ruby")
+  s.add_dependency("nokogiri")
+  s.add_dependency("hashie")
+
+  s.add_development_dependency("minitest")
+end
data/lib/diffbot.rb
ADDED
@@ -0,0 +1,45 @@
+require "hashie/trash"
+require "diffbot/coercible_hash"
+require "diffbot/request"
+require "diffbot/article"
+require "diffbot/frontpage"
+
+module Diffbot
+  # Public: Set global options. This is a nice API to group calls to the Diffbot
+  # module.
+  #
+  # Yields the Diffbot module so you can set options on it.
+  #
+  # Returns self.
+  def self.configure
+    yield self
+    self
+  end
+
+  # Public: Configure the default request parameters for Article requests. See
+  # Article::RequestParams documentation for the specific configuration values
+  # you can set.
+  #
+  # Yields the default Article::RequestParams object.
+  #
+  # Returns the default Article::RequestParams object.
+  def self.article_defaults
+    if block_given?
+      @article_defaults = Article::RequestParams.new
+      yield @article_defaults
+    else
+      @article_defaults ||= Article::RequestParams.new
+    end
+
+    @article_defaults
+  end
+
+  class << self
+    # Public: Your Diffbot API token.
+    attr_accessor :token
+
+    # Public: The object used for network instrumentation. Must match
+    # ActiveSupport::Notifications API.
+    attr_accessor :instrumentor
+  end
+end
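The `article_defaults` method in `lib/diffbot.rb` has two modes: called with a block it starts from a fresh `RequestParams` object and yields it, and called without one it memoizes and returns the current defaults. A minimal sketch of that behaviour follows, assuming a hypothetical `MiniDiffbot` module and a `Struct` named `Params` standing in for `Article::RequestParams` (neither name is part of the gem):

```ruby
# Hypothetical stand-in for Diffbot::Article::RequestParams.
Params = Struct.new(:html, :tags, :summary)

# Hypothetical module that mirrors the shape of the Diffbot module.
module MiniDiffbot
  class << self
    attr_accessor :token
  end

  # Yields the module itself so options can be set on it, then returns it.
  def self.configure
    yield self
    self
  end

  # With a block: start from a fresh Params object and yield it.
  # Without a block: return the memoized defaults.
  def self.article_defaults
    if block_given?
      @article_defaults = Params.new
      yield @article_defaults
    else
      @article_defaults ||= Params.new
    end

    @article_defaults
  end
end

MiniDiffbot.configure do |config|
  config.token = "my-token"
  config.article_defaults { |defaults| defaults.html = true }
end

# A second block call replaces the earlier defaults instead of merging.
MiniDiffbot.article_defaults { |defaults| defaults.tags = true }
```

One consequence visible in this sketch is that every block call discards earlier settings, so all defaults would normally be set in a single block.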
data/lib/diffbot/article.rb
ADDED
@@ -0,0 +1,194 @@
+require "yajl"
+require "diffbot"
+require "diffbot/coercible_hash"
+
+module Diffbot
+  # Representation of an article (i.e. a blog post or similar). This class offers
+  # a single entry point: the `.fetch` method, that, given a URL, will return
+  # the article as analyzed by Diffbot.
+  class Article < Hashie::Trash
+    extend CoercibleHash
+
+    # Public: Fetch an article from a URL.
+    #
+    # url      - The article URL.
+    # token    - The API token for Diffbot.
+    # parser   - The callable object that will parse the raw output from the
+    #            API. Defaults to Yajl::Parser.method(:parse).
+    # defaults - The default request options. See Diffbot.article_defaults.
+    #
+    # Yields the request configuration.
+    #
+    # Examples
+    #
+    #   # Request an article with the default options.
+    #   article = Diffbot::Article.fetch(url, api_token)
+    #
+    #   # Pass options to the request. See Diffbot::Article::RequestParams to
+    #   # see the available configuration options.
+    #   article = Diffbot::Article.fetch(url, api_token) do |req|
+    #     req.html = true
+    #   end
+    #
+    # Returns a Diffbot::Article.
+    def self.fetch(url, token=Diffbot.token, parser=Yajl::Parser.method(:parse), defaults=Diffbot.article_defaults)
+      params = defaults.dup
+      yield params if block_given?
+
+      request = Diffbot::Request.new(token)
+      response = request.perform(:get, endpoint, params) do |req|
+        req[:query][:url] = url
+      end
+
+      new(parser.call(response.body))
+    end
+
+    # The API endpoint where requests should be made.
+    #
+    # Returns a URL.
+    def self.endpoint
+      "http://www.diffbot.com/api/article"
+    end
+
+    # Public: URL of the article.
+    property :url
+
+    # Public: Title of the article.
+    property :title
+
+    # Public: Author (or Authors) of the article.
+    property :author
+
+    # Public: Date of the article (as a string).
+    property :date
+
+    class MediaItem < Hashie::Trash
+      property :type
+      property :link
+      property :primary, default: false
+    end
+
+    # Public: List of media items related to the articles. Each item is an
+    # object with the following attributes:
+    #
+    # type    - Either `"image"` or `"video"`.
+    # link    - The URL of the given media resource.
+    # primary - Only present in one of the items. This is assumed to be the most
+    #           representative media for this article.
+    property :media
+    coerce_property :media, collection: MediaItem
+
+    # Public: The raw text of the article, without formatting.
+    property :text
+
+    # Public: The contents of the article in HTML, stripped of any ads or other
+    # chunks of HTML which are considered unrelated by Diffbot, unless you set
+    # the `dont_strip_ads` option in the request.
+    #
+    # Only present if you set `html` to true in the request.
+    property :html
+
+    # Public: A summary line for this article.
+    #
+    # Only present if you set `summary` to true in the request.
+    property :summary
+
+    # Public: A list of tags related to this article.
+    #
+    # Only present if you set `tags` to true in the request.
+    property :tags
+
+    # Public: The favicon of the page where this article was extracted from.
+    property :icon
+
+    class Stats < Hashie::Trash
+      property :fetch_time, from: :fetchTime
+      property :confidence
+    end
+
+    # Public: Returns an object with the following attributes:
+    #
+    # fetch_time - The time of the request, in ms.
+    # confidence - The confidence of Diffbot that the returned text is really
+    #              the text of the article. Between 0.0 and 1.0.
+    #
+    # Only present if you set `stats` to true in the request.
+    property :stats
+    coerce_property :stats, class: Stats
+
+    # Public: The XPath selector at which the body of the article was found in
+    # the page.
+    property :xpath
+
+    # Public: If there was an error in the request, this will contain the error
+    # message.
+    property :error
+
+    # Public: If there was an error in the request, this will contain the error
+    # code.
+    property :error_code, from: :errorCode
+
+    # This represents the parameters you can pass to Diffbot to configure a
+    # given request. These are either set globally with Diffbot.article_defaults
+    # or on a request basis by passing a block to Diffbot::Article.fetch.
+    #
+    # Example:
+    #
+    #   # All article requests will include the HTML and tags.
+    #   Diffbot.configure do |config|
+    #     config.article_defaults do |defaults|
+    #       defaults.html = true
+    #       defaults.tags = true
+    #     end
+    #   end
+    #
+    #   # This article request will *also* include the summary.
+    #   Diffbot::Article.fetch(url, token) do |req|
+    #     req.summary = true
+    #   end
+    class RequestParams < Hashie::Trash
+      # Public: Set to true to return HTML instead of plain-text.
+      #
+      # Defaults to nil.
+      #
+      # If enabled, sets the `html` key in the `Diffbot::Article`.
+      property :html
+
+      # Public: Set to true to keep any inline ads in the generated story.
+      #
+      # Defaults to nil.
+      #
+      # If enabled, it will change the `html` key in the `Diffbot::Article`.
+      property :dontStripAds, from: :dont_strip_ads
+
+      # Public: Set to true to generate tags for the extracted story.
+      #
+      # Defaults to nil.
+      #
+      # If enabled, sets the `tags` key in the `Diffbot::Article`.
+      property :tags
+
+      # Public: Set to true to find the comments and identify count, link, etc.
+      #
+      # Defaults to nil.
+      #
+      # If enabled, sets the `comments` key in the `Diffbot::Article`.
+      property :comments
+
+      # Public: Set to true to return a summary text.
+      #
+      # Defaults to nil.
+      #
+      # If enabled, sets the `summary` key in the `Diffbot::Article`.
+      property :summary
+
+      # Public: Set to true to include performance and probabilistic scoring
+      # stats.
+      #
+      # Defaults to nil.
+      #
+      # If enabled, sets the `stats` key in the `Diffbot::Article`.
+      property :stats
+    end
+  end
+end
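`Article.fetch` copies the defaults with `dup` before yielding them, so per-request tweaks should not leak into the global defaults, and it accepts any callable as `parser`. The sketch below illustrates that flow under stated assumptions: `fetch_article`, `fake_response_body`, and `DEFAULTS` are hypothetical names, a canned body replaces the HTTP call, a plain Hash stands in for `RequestParams`, and the standard library's `JSON.parse` is swapped in for `Yajl::Parser.parse`.

```ruby
require "json"

# Hypothetical global defaults; a plain Hash stands in for RequestParams.
DEFAULTS = { html: true }

# Stand-in for the HTTP round trip: always returns a canned JSON body.
def fake_response_body(_query)
  '{"title":"Hello","errorCode":null}'
end

# Mirrors the shape of Article.fetch: copy the defaults, let the caller
# adjust the copy, add the URL, then hand the body to the parser.
def fetch_article(url, parser = JSON.method(:parse), defaults = DEFAULTS)
  params = defaults.dup
  yield params if block_given?

  query = params.merge(url: url)
  [query, parser.call(fake_response_body(query))]
end

query, article = fetch_article("http://example.com/post") do |req|
  req[:summary] = true
end
```

Because the parser is just an object that responds to `call`, a lambda or any other JSON library's parse method could be passed in the same position.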
data/lib/diffbot/coercible_hash.rb
ADDED
@@ -0,0 +1,113 @@
+module Diffbot
+  # Public: Extend a hash with this mixin to make keys coercible to certain
+  # classes. These keys, when assigned to the hash, will be transformed into the
+  # specified classes.
+  #
+  # The object you pass as coercion types should implement either a `coerce` or
+  # a `new` method.
+  #
+  # You can define rules to coerce properties into classes or collections of
+  # classes. In the latter case, CoercibleHash will just map over whatever value
+  # is passed and attempt to coerce each item individually to the given class.
+  #
+  # Examples
+  #
+  #   class Address < Struct.new(:street, :zipcode, :state)
+  #     def self.coerce(address)
+  #       new(address[:street], address[:zipcode], address[:state])
+  #     end
+  #   end
+  #
+  #   class Person < Hash
+  #     extend Diffbot::CoercibleHash
+  #
+  #     coerce_property :address, Address
+  #     coerce_property :children, collection: Person
+  #
+  #     def name
+  #       self["name"]
+  #     end
+  #   end
+  #
+  #   person = Person.new(address: {
+  #     street: "123 Example St.", zipcode: "12345", state: "XX"
+  #   })
+  #
+  #   person.address.street #=> "123 Example St."
+  #   # etc.
+  #
+  #   father = Person.new(name: "John", children: [
+  #     { name: "Tim" }, { name: "Sarah" }
+  #   ])
+  #
+  #   father.name #=> "John"
+  #   father.children.first.name #=> "Tim"
+  #   father.children.last.name #=> "Sarah"
+  module CoercibleHash
+    # The coercion rules defined for this hash.
+    attr_reader :coercions
+
+    # Adds a #[]= that checks for coercion on the property and delegates to super.
+    def self.extended(base)
+      base.instance_variable_set("@coercions", {})
+      base.class_eval do
+        def []=(property, value)
+          if self.class.coercions.key?(property.to_s)
+            super property, self.class.coercions[property.to_s].(value)
+          else
+            super
+          end
+        end
+      end
+    end
+
+    # Public: Coerce a property of this hash into a given type. We will try to
+    # call .coerce on the object you pass as the class, and if that fails, we will
+    # call .new.
+    #
+    # property         - The name of the property to coerce.
+    # class_or_options - Either a class to which to coerce, or a hash with options:
+    #                    * class: The class to which to coerce.
+    #                    * collection: Coerce the key into an array of members of
+    #                      this class.
+    #
+    # Examples
+    #
+    #   class Person < Hash
+    #     extend Diffbot::CoercibleHash
+    #
+    #     coerce_property :address, Address
+    #
+    #     coerce_property :children, collection: Person
+    #
+    #     coerce_property :dob, class: Date
+    #   end
+    def coerce_property(property, options)
+      unless options.is_a?(Hash)
+        options = { class: options }
+      end
+
+      coercion_method = ->(obj) do
+        if obj.respond_to?(:coerce)
+          obj.method(:coerce)
+        elsif obj.respond_to?(:new)
+          obj.method(:new)
+        else
+          raise ArgumentError, "#{obj.inspect} implements neither .coerce nor .new"
+        end
+      end
+
+      if options.has_key?(:collection)
+        klass = options[:collection]
+        coercion = ->(value) { value.map { |el| coercion_method[klass][el] } }
+      elsif options.has_key?(:class)
+        klass = options[:class]
+        coercion = ->(value) { coercion_method[klass][value] }
+      else
+        raise ArgumentError, "You need to specify either :class or :collection"
+      end
+
+      coercions[property.to_s] = coercion
+    end
+  end
+end
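The coercion rule in `CoercibleHash` comes down to two steps: pick `.coerce` if the target type responds to it, otherwise `.new`, and then apply that callable inside `[]=`. The sketch below reproduces only that idea in plain Ruby. `TinyCoercibleHash`, `coercion_for`, `Point`, and `Shape` are hypothetical names for illustration, and the sketch uses inheritance where the gem uses an `extend`-based mixin.

```ruby
# Pick the callable used to coerce values: .coerce wins over .new.
def coercion_for(type)
  if type.respond_to?(:coerce)
    type.method(:coerce)
  elsif type.respond_to?(:new)
    type.method(:new)
  else
    raise ArgumentError, "#{type.inspect} implements neither .coerce nor .new"
  end
end

# Hypothetical Hash subclass that coerces values on assignment.
class TinyCoercibleHash < Hash
  # Per-class table of property name => coercion callable.
  def self.coercions
    @coercions ||= {}
  end

  def self.coerce_property(name, type)
    coercions[name.to_s] = coercion_for(type)
  end

  # Keys are looked up by their string form, so symbols work too.
  def []=(key, value)
    coercion = self.class.coercions[key.to_s]
    super(key, coercion ? coercion.call(value) : value)
  end
end

# Hypothetical value type exposing a .coerce class method.
Point = Struct.new(:x, :y) do
  def self.coerce(pair)
    new(*pair)
  end
end

class Shape < TinyCoercibleHash
  coerce_property :origin, Point
end

shape = Shape.new
shape[:origin] = [1, 2]   # stored as a Point
shape[:name] = "square"   # no rule, stored unchanged
```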
data/lib/diffbot/frontpage.rb
ADDED
@@ -0,0 +1,60 @@
+require "nokogiri"
+require "diffbot"
+require "diffbot/item"
+
+module Diffbot
+  # Representation of a front page. This class offers a single entry point: the
+  # `.fetch` method, that, given a URL, will return the front page as analyzed
+  # by Diffbot.
+  class Frontpage < Hashie::Trash
+    extend CoercibleHash
+
+    # Public: Fetch a frontpage's information from a URL.
+    #
+    # url    - The frontpage URL.
+    # token  - The API token for Diffbot.
+    # parser - The callable object that will parse the raw output from the
+    #          API. Defaults to Diffbot::Frontpage::DmlParser.method(:parse).
+    #
+    # Examples
+    #
+    #   # Request a frontpage with the default options.
+    #   frontpage = Diffbot::Frontpage.fetch(url, api_token)
+    #
+    # Returns a Diffbot::Frontpage.
+    def self.fetch(url, token=Diffbot.token, parser=Diffbot::Frontpage::DmlParser.method(:parse))
+      request = Diffbot::Request.new(token)
+      response = request.perform(:get, endpoint) do |req|
+        req[:query][:url] = url
+      end
+
+      new(parser.call(response.body))
+    end
+
+    # The API endpoint where requests should be made.
+    #
+    # Returns a URL.
+    def self.endpoint
+      "http://www.diffbot.com/api/frontpage"
+    end
+
+    # Public: The title of the page.
+    property :title
+
+    # Public: The favicon of the page.
+    property :icon
+
+    # Public: What kind of page this is.
+    property :source_type, from: :sourceType
+
+    # Public: The URL where this page was extracted from.
+    property :source_url, from: :sourceURL
+
+    # Public: The items extracted from the page. These are instances of
+    # Diffbot::Item.
+    property :items
+    coerce_property :items, collection: Item
+  end
+end
+
+require "diffbot/frontpage/dml_parser"
data/lib/diffbot/frontpage/dml_parser.rb
ADDED
@@ -0,0 +1,83 @@
+# Parser that takes the XML generated from Diffbot's Frontpage API call and
+# returns a hash suitable for Diffbot::Frontpage.
+class Diffbot::Frontpage::DmlParser
+  # Take the string of DML and convert it into a nice little hash we can pass to
+  # Diffbot::Frontpage.
+  #
+  # dml - A string of DML.
+  #
+  # Returns a Hash.
+  def self.parse(dml)
+    node = Nokogiri(dml).root
+    parser = new(node)
+    parser.parse
+  end
+
+  # Initialize the parser with a DML node.
+  #
+  # node - The root XML::Element.
+  def initialize(node)
+    @dml = node
+  end
+
+  # The root element of the DML document.
+  attr_reader :dml
+
+  # Parses the Diffbot Markup Language and generates a Hash that we can pass to
+  # Frontpage.new.
+  #
+  # Returns a Hash.
+  def parse
+    attrs = {}
+
+    info = dml % "info"
+    attrs["title"] = (info % "title").text
+    attrs["icon"] = (info % "icon").text
+    attrs["sourceType"] = (info % "sourceType").text
+    attrs["sourceURL"] = (info % "sourceURL").text
+
+    items = dml / "item"
+    attrs["items"] = items.map do |item|
+      ItemParser.new(item).parse
+    end
+
+    attrs
+  end
+
+  # Parser that takes the XML from a particular item from the XML returned from
+  # the frontpage API.
+  class ItemParser
+    # The root element of each item.
+    attr_reader :item
+
+    # Initialize the parser with an Item node.
+    #
+    # item_node - The root node of the item.
+    def initialize(item_node)
+      @item = item_node
+    end
+
+    # Parses the item's DML and generates a Hash that we can add to the DML
+    # parser's "items" key together with the other items.
+    #
+    # Returns a Hash.
+    def parse
+      attrs = {}
+
+      %w(title link pubDate description textSummary).each do |attr|
+        node = item % attr
+        attrs[attr] = node && node.text
+      end
+
+      %w(type img id xroot cluster).each do |attr|
+        attrs[attr] = item[attr]
+      end
+
+      attrs["stats"] = %w(fresh sp sr).each_with_object({}) do |attr, hash|
+        hash[attr] = item[attr].to_f
+      end
+
+      attrs
+    end
+  end
+end
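`DmlParser` reads page-level fields from an `info` element and then builds one hash per `item`, taking some values from child elements and others from attributes. The sketch below walks the same steps with REXML from Ruby's bundled libraries in place of Nokogiri, so it is a stand-in and not the gem's own parser. The sample document is hypothetical: it is shaped only after the element and attribute names the parser reads, and real API responses may differ.

```ruby
require "rexml/document"

# Hypothetical DML sample built from the names the parser looks up.
DML = <<~XML
  <dml>
    <info>
      <title>Example News</title>
      <icon>http://example.com/favicon.ico</icon>
      <sourceType>html</sourceType>
      <sourceURL>http://example.com/</sourceURL>
    </info>
    <body>
      <item id="1" type="STORY" sp="0.1" sr="4.5" fresh="1.0">
        <title>First story</title>
        <link>http://example.com/1</link>
      </item>
    </body>
  </dml>
XML

root = REXML::Document.new(DML).root

# Page-level fields live under <info>.
attrs = {}
%w(title icon sourceType sourceURL).each do |name|
  attrs[name] = root.elements["info/#{name}"].text
end

# Each <item> contributes child-element text, attributes, and numeric stats.
attrs["items"] = REXML::XPath.match(root, ".//item").map do |item|
  parsed = {}

  %w(title link pubDate description textSummary).each do |name|
    node = item.elements[name]
    parsed[name] = node && node.text # nil when the element is absent
  end

  %w(type img id xroot cluster).each do |name|
    parsed[name] = item.attributes[name]
  end

  parsed["stats"] = %w(fresh sp sr).each_with_object({}) do |name, stats|
    stats[name] = item.attributes[name].to_f
  end

  parsed
end
```

The resulting hash keeps the raw names (`sourceType`, `pubDate`, `sp`, `sr`); in the gem, renaming them to snake_case is left to the `from:` options on the `Hashie::Trash` properties.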
data/lib/diffbot/item.rb
ADDED
@@ -0,0 +1,55 @@
+module Diffbot
+  class Item < Hashie::Trash
+    extend CoercibleHash
+
+    class Stats < Hashie::Trash
+      property :fresh
+      property :static_rank, from: :sr
+      property :spam_score, from: :sp
+    end
+
+    # Public: The identifier of this item.
+    property :id
+
+    # Public: The title of this item.
+    property :title
+
+    # Public: The permalink/URL for this item.
+    property :link
+
+    # Public: A string with the date of the item.
+    property :pub_date, from: :pubDate
+
+    # Public: The HTML from the item.
+    property :description
+
+    # Public: A summary line with text from the item.
+    property :summary, from: :textSummary
+
+    # Public: The type of the item. Can be either `IMAGE`, `LINK`, `STORY`, or
+    # `CHUNK` (a chunk of HTML).
+    property :type
+
+    # Public: The URL for the image of this item.
+    property :img
+
+    # Public: The XPath where this item is located.
+    property :xroot
+
+    # Public: The XPath for the cluster of items where this item comes from. If
+    # a frontpage has, for example, a main list of articles and a sidebar with
+    # "Top Articles", both will be separate clusters, each with
+    # their own articles.
+    property :cluster
+
+    # Public: Stats extracted from this item. This is an object with the
+    # following attributes:
+    #
+    # fresh       - The percentage of the item that has changed compared to the
+    #               previous crawl.
+    # static_rank - The quality score of the item on a 1 to 5 scale.
+    # spam_score  - The probability this item is spam/an advertisement.
+    property :stats
+    coerce_property :stats, class: Stats
+  end
+end
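`Item::Stats` relies on the `from:` option to expose the short attribute names from the frontpage response (`sr`, `sp`) under readable names. A minimal sketch of that renaming in plain Ruby follows, assuming a hypothetical `readable_stats` helper and `STATS_KEYS` table. It only illustrates the mapping and is not how `Hashie::Trash` implements it.

```ruby
# Short attribute names mapped to the property names declared on Item::Stats.
STATS_KEYS = {
  "sr"    => "static_rank",
  "sp"    => "spam_score",
  "fresh" => "fresh"
}.freeze

# Rename known keys; silently dropping unknown ones is this sketch's own choice.
def readable_stats(raw)
  raw.each_with_object({}) do |(key, value), out|
    name = STATS_KEYS[key.to_s]
    out[name] = value if name
  end
end

stats = readable_stats("sr" => 4.5, "sp" => 0.1, "fresh" => 1.0, "other" => 9)
```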
data/lib/diffbot/request.rb
ADDED
@@ -0,0 +1,54 @@
+require "excon"
+
+module Diffbot
+  class Request
+    # The API token for Diffbot.
+    attr_reader :token
+
+    # Public: Initialize a new request to the API.
+    #
+    # token - The API token for Diffbot.
+    def initialize(token)
+      @token = token
+    end
+
+    # Public: Perform an HTTP request against Diffbot's API.
+    #
+    # method   - The request method, one of :get, :head, :post, :put, or
+    #            :delete.
+    # endpoint - The URL to which we'll make the request, as a String.
+    # query    - A hash of query string params we want to pass along.
+    #
+    # Yields the request hash before making the request.
+    #
+    # Returns the response.
+    def perform(method, endpoint, query={})
+      request_options = build_request(method, query)
+      yield request_options if block_given?
+
+      request = Excon.new(endpoint)
+
+      request.request(request_options)
+    end
+
+    # Build the hash of options that Excon requires for an HTTP request.
+    #
+    # method       - A Symbol with the HTTP method (:get, :post, etc).
+    # query_params - Any query parameters to add to the request.
+    #
+    # Returns a Hash.
+    def build_request(method, query_params={})
+      query = { token: token }.merge(query_params)
+      request = { query: query, method: method, headers: {} }
+
+      if Diffbot.instrumentor
+        request.update(
+          instrumentor: Diffbot.instrumentor,
+          instrumentor_name: "diffbot"
+        )
+      end
+
+      request
+    end
+  end
+end
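`Request#build_request` puts the token into the query hash first and merges the caller's parameters over it; the `fetch` methods then add `:url` from inside the block passed to `perform`. The sketch below shows what such a query hash could look like as a query string, using `URI.encode_www_form` from the standard library. `build_query` is a hypothetical helper, and whether Excon serializes the hash in exactly this form is an assumption.

```ruby
require "uri"

# Mirrors the merge in Request#build_request: token first, then caller params.
def build_query(token, params = {})
  { token: token }.merge(params)
end

query = build_query("my-token", html: true)
query[:url] = "http://example.com/a b" # added later, as the fetch block does

# URI.encode_www_form percent-encodes each value for use in a query string.
query_string = URI.encode_www_form(query)
full_url = "http://www.diffbot.com/api/article?#{query_string}"
```

Since the caller's parameters are merged last, a `token` key in them would override the one given to the request object.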
data/test/coercible_hash_test.rb
ADDED
@@ -0,0 +1,66 @@
+require "test_helper"
+require "diffbot/coercible_hash"
+
+describe Diffbot::CoercibleHash do
+  module Foo
+    def self.coerce(value)
+      "coerced #{value}"
+    end
+  end
+
+  module Bar
+    def self.new(value)
+      "initialized #{value}"
+    end
+  end
+
+  module Baz
+    def self.coerce(value)
+      "coerced #{value}"
+    end
+
+    def self.new(value)
+      "initialized #{value}"
+    end
+  end
+
+  class TestHash < Hash
+    extend Diffbot::CoercibleHash
+
+    coerce_property :foo, Foo
+    coerce_property :foos, collection: Foo
+
+    coerce_property :bar, Bar
+
+    coerce_property :baz, Baz
+  end
+
+  subject do
+    TestHash.new
+  end
+
+  it "coerces keys using the .coerce method" do
+    subject["foo"] = 1
+    subject["foo"].must_equal("coerced 1")
+  end
+
+  it "coerces collections" do
+    subject["foos"] = [1, 2, 3]
+    subject["foos"].must_equal(["coerced 1", "coerced 2", "coerced 3"])
+  end
+
+  it "coerces keys using the .new method" do
+    subject["bar"] = 2
+    subject["bar"].must_equal("initialized 2")
+  end
+
+  it "when both are present, prefers .coerce" do
+    subject["baz"] = 3
+    subject["baz"].must_equal("coerced 3")
+  end
+
+  it "coerces symbols as well" do
+    subject[:foo] = 2
+    subject[:foo].must_equal("coerced 2")
+  end
+end
data/test/test_helper.rb
ADDED
metadata
ADDED
@@ -0,0 +1,114 @@
+--- !ruby/object:Gem::Specification
+name: diffbot
+version: !ruby/object:Gem::Version
+  version: 0.1.0
+  prerelease:
+platform: ruby
+authors:
+- Nicolas Sanguinetti
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2012-02-06 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: excon
+  requirement: &70280593864880 !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: *70280593864880
+- !ruby/object:Gem::Dependency
+  name: yajl-ruby
+  requirement: &70280593864420 !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: *70280593864420
+- !ruby/object:Gem::Dependency
+  name: nokogiri
+  requirement: &70280593864000 !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: *70280593864000
+- !ruby/object:Gem::Dependency
+  name: hashie
+  requirement: &70280593863580 !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: *70280593863580
+- !ruby/object:Gem::Dependency
+  name: minitest
+  requirement: &70280593863160 !ruby/object:Gem::Requirement
+    none: false
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: *70280593863160
+description: Diffbot provides a concise API for analyzing and extracting semantic
+  information from web pages using Diffbot (http://www.diffbot.com).
+email: hi@nicolassanguinetti.info
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- .gitignore
+- LICENSE
+- README.md
+- Rakefile
+- diffbot.gemspec
+- lib/diffbot.rb
+- lib/diffbot/article.rb
+- lib/diffbot/coercible_hash.rb
+- lib/diffbot/frontpage.rb
+- lib/diffbot/frontpage/dml_parser.rb
+- lib/diffbot/item.rb
+- lib/diffbot/request.rb
+- test/coercible_hash_test.rb
+- test/test_helper.rb
+homepage: http://github.com/tinder/diffbot
+licenses: []
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  none: false
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 1.8.11
+signing_key:
+specification_version: 3
+summary: Ruby interface to the Diffbot API
+test_files: []