diffbot 0.1.0

@@ -0,0 +1 @@
1
+ /pkg
data/LICENSE ADDED
@@ -0,0 +1,19 @@
1
+ Copyright (c) 2012 Nicolás Sanguinetti for Tinder Inc.
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining a copy
4
+ of this software and associated documentation files (the "Software"), to deal
5
+ in the Software without restriction, including without limitation the rights
6
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
7
+ copies of the Software, and to permit persons to whom the Software is
8
+ furnished to do so, subject to the following conditions:
9
+
10
+ The above copyright notice and this permission notice shall be included in
11
+ all copies or substantial portions of the Software.
12
+
13
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
14
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
15
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
16
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
17
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
18
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
19
+ THE SOFTWARE.
@@ -0,0 +1,125 @@
1
+ # Diffbot
2
+
3
+ This is a Ruby client for the [Diffbot](http://diffbot.com) API.
4
+
5
+ ## Global Options
6
+
7
+ You can pass some settings to Diffbot like this:
8
+
9
+ ``` ruby
10
+ Diffbot.configure do |config|
11
+ config.token = ENV["DIFFBOT_TOKEN"]
12
+ config.instrumentor = ActiveSupport::Notifications
13
+ end
14
+ ```
15
+
16
+ The list of supported settings is:
17
+
18
+ * `token`: Your Diffbot API token. This will be used for all requests in which
19
+ you don't specify it manually (see below).
20
+ * `instrumentor`: An object that matches the [ActiveSupport::Notifications][1]
21
+ API, which will be used to trace network events. None is used by default.
22
+ * `article_defaults`: Pass a block to this method to configure the global
23
+ request settings used for Diffbot::Article requests. See below for the
24
+ supported options.
25
+
26
+ [1]: http://api.rubyonrails.org/classes/ActiveSupport/Notifications.html
27
+
28
+ ## Articles
29
+
30
+ In order to fetch an article, do this:
31
+
32
+ ``` ruby
33
+ require "diffbot/article"
34
+
35
+ article = Diffbot::Article.fetch(article_url, diffbot_token)
36
+
37
+ # Now you can inspect the result:
38
+ article.title
39
+ article.author
40
+ article.date
41
+ article.text
42
+ # etc. See below for the full list of available response attributes.
43
+ ```
44
+
45
+ This is a list of all the fields returned by the `Diffbot::Article.fetch` call:
46
+
47
+ * `url`: The URL of the article.
48
+ * `title`: The title of the article.
49
+ * `author`: The author of the article.
50
+ * `date`: The date on which this article was published.
51
+ * `media`: A list of media items attached to this article.
52
+ * `text`: The body of the article. This will be plain text unless you specify
53
+ the HTML option in the request.
54
+ * `tags`: A list of tags/keywords extracted from the article.
55
+ * `xpath`: The XPath at which this article was found in the page.
56
+
57
+ ### Options
58
+
59
+ You can customize your request like this:
60
+
61
+ ``` ruby
62
+ article = Diffbot::Article.fetch(article_url, diffbot_token) do |request|
63
+ request.html = true # Return HTML instead of plain text.
64
+ request.dont_strip_ads = true # Leave any inline ads within the article.
65
+ request.tags = true # Generate tags for the article.
66
+ request.comments = true # Extract the comments from the article as well.
67
+ request.summary = true # Return a summary text instead of the full text.
68
+ request.stats = true # Return performance, probabilistic scoring stats.
69
+ end
70
+ ```
71
+
72
+ ## Frontpages
73
+
74
+ In order to fetch and analyze a front page, do this:
75
+
76
+ ``` ruby
77
+ require "diffbot/frontpage"
78
+
79
+ frontpage = Diffbot::Frontpage.fetch(url, diffbot_token)
80
+
81
+ # Results are available in the returned object:
82
+ frontpage.title
83
+ frontpage.icon
84
+ frontpage.items #=> An array of Diffbot::Item instances
85
+ ```
86
+
87
+ The fields you can extract from a Frontpage are:
88
+
89
+ * `title`: The title of the page.
90
+ * `icon`: The favicon of the page.
91
+ * `source_type`: What kind of page this is.
92
+ * `source_url`: The URL of the page.
93
+ * `items`: The list of `Diffbot::Item` representing each item on the page.
94
+
95
+ The instances of `Diffbot::Item` have the following fields:
96
+
97
+ * `id`: Unique identifier for this item.
98
+ * `title`: Title of the item.
99
+ * `link`: Extracted permalink of the item (if applicable).
100
+ * `description`: innerHTML content of the item.
101
+ * `summary`: A plain-text summary of the item.
102
+ * `pub_date`: Date when item was detected on page.
103
+ * `type`: The type of item, according to Diffbot. One of: `IMAGE`, `LINK`,
104
+ `STORY`, `CHUNK`.
105
+ * `img`: The main image extracted from this item.
106
+ * `xroot`: XPath of where the item was found on the page.
107
+ * `cluster`: XPath of the cluster of items where this item was found.
108
+ * `stats`: An object with the following attributes:
109
+ * `spam_score`: A Float between 0.0 and 1.0 indicating the probability this
110
+ item is spam/an advertisement.
111
+ * `static_rank`: A Float between 1.0 and 5.0 indicating the quality score of
112
+ the item.
113
+ * `fresh`: The percentage of the item that has changed compared to the
114
+ previous crawl.
115
+
116
+ ## TODO
117
+
118
+ * Implement the Follow API.
119
+ * Add tests for Article and Frontpage requests.
120
+ * Add a Frontpage.crawl method that, given the URL of a frontpage, fetches
121
+ the article for each item on the page.
122
+
123
+ ## License
124
+
125
+ This is published under an MIT License, see LICENSE for further details.
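The README's `instrumentor` setting expects an object that matches the ActiveSupport::Notifications API. As a rough sketch of what such an object can look like without Rails, assuming the HTTP layer only needs an `instrument(name, payload) { ... }` method: the class name `RecordingInstrumentor` and the event name `diffbot.request` below are hypothetical and do not come from the gem.

```ruby
# Hypothetical stand-in for ActiveSupport::Notifications: it records each
# event name with its payload and then runs the instrumented block.
class RecordingInstrumentor
  class << self
    # Every recorded event, as [name, payload] pairs.
    def events
      @events ||= []
    end

    # Record the event, then run the block (if any) and return its result.
    def instrument(name, payload = {})
      events << [name, payload]
      yield if block_given?
    end
  end
end

# Simulate what an instrumented HTTP client would do around a request.
result = RecordingInstrumentor.instrument("diffbot.request", method: :get) do
  :response
end
```

An object like this could then be assigned with `config.instrumentor = RecordingInstrumentor` inside `Diffbot.configure`, under the assumption that the underlying HTTP client calls nothing beyond `instrument`.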
@@ -0,0 +1,15 @@
1
+ require "rake/testtask"
2
+ require "rubygems/package_task"
3
+
4
+ gem_spec = eval(File.read("./diffbot.gemspec")) rescue nil
5
+ Gem::PackageTask.new(gem_spec) do |pkg|
6
+ pkg.need_zip = false
7
+ pkg.need_tar = false
8
+ end
9
+
10
+ Rake::TestTask.new do |t|
11
+ t.pattern = "test/*_test.rb"
12
+ t.verbose = true
13
+ end
14
+
15
+ task default: :test
@@ -0,0 +1,19 @@
1
+ Gem::Specification.new do |s|
2
+ s.name = "diffbot"
3
+ s.version = "0.1.0"
4
+ s.description = "Diffbot provides a concise API for analyzing and extracting semantic information from web pages using Diffbot (http://www.diffbot.com)."
5
+ s.summary = "Ruby interface to the Diffbot API "
6
+ s.authors = ["Nicolas Sanguinetti"]
7
+ s.email = "hi@nicolassanguinetti.info"
8
+ s.homepage = "http://github.com/tinder/diffbot"
9
+ s.has_rdoc = false
10
+ s.files = `git ls-files`.split "\n"
11
+ s.platform = Gem::Platform::RUBY
12
+
13
+ s.add_dependency("excon")
14
+ s.add_dependency("yajl-ruby")
15
+ s.add_dependency("nokogiri")
16
+ s.add_dependency("hashie")
17
+
18
+ s.add_development_dependency("minitest")
19
+ end
@@ -0,0 +1,45 @@
1
+ require "hashie/trash"
2
+ require "diffbot/coercible_hash"
3
+ require "diffbot/request"
4
+ require "diffbot/article"
5
+ require "diffbot/frontpage"
6
+
7
+ module Diffbot
8
+ # Public: Set global options. This is a nice API to group calls to the Diffbot
9
+ # module.
10
+ #
11
+ # Yields the Diffbot module so you can set options on it.
12
+ #
13
+ # Returns self.
14
+ def self.configure
15
+ yield self
16
+ self
17
+ end
18
+
19
+ # Public: Configure the default request parameters for Article requests. See
20
+ # Article::RequestParams documentation for the specific configuration values
21
+ # you can set.
22
+ #
23
+ # Yields the default Article::RequestParams object.
24
+ #
25
+ # Returns the default Article::RequestParams object.
26
+ def self.article_defaults
27
+ if block_given?
28
+ @article_defaults = Article::RequestParams.new
29
+ yield @article_defaults
30
+ else
31
+ @article_defaults ||= Article::RequestParams.new
32
+ end
33
+
34
+ @article_defaults
35
+ end
36
+
37
+ class << self
38
+ # Public: Your Diffbot API token.
39
+ attr_accessor :token
40
+
41
+ # Public: The object used for network instrumentation. Must match
42
+ # ActiveSupport::Notifications API.
43
+ attr_accessor :instrumentor
44
+ end
45
+ end
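The configuration pattern in `lib/diffbot.rb` can be reproduced in isolation. The sketch below is not the gem's code: `MiniClient` and `request_defaults` are hypothetical names, and a plain Hash stands in for `Article::RequestParams`. It illustrates one consequence of the `block_given?` branch: every call with a block starts from a fresh defaults object.

```ruby
# Hypothetical module mirroring the shape of Diffbot.configure.
module MiniClient
  class << self
    attr_accessor :token

    # Yields the module itself so settings can be grouped in one block.
    def configure
      yield self
      self
    end

    # With a block: replace the defaults with a fresh object and let the
    # block fill it. Without a block: return the current defaults,
    # creating them lazily.
    def request_defaults
      if block_given?
        @request_defaults = {}
        yield @request_defaults
      else
        @request_defaults ||= {}
      end
      @request_defaults
    end
  end
end

MiniClient.configure do |config|
  config.token = "secret"
  config.request_defaults { |defaults| defaults[:html] = true }
end

# A second block call discards the earlier :html default.
MiniClient.request_defaults { |defaults| defaults[:tags] = true }
```

After the second call only `:tags` is set, so in this pattern all defaults are best declared in a single block.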
@@ -0,0 +1,194 @@
1
+ require "yajl"
2
+ require "diffbot"
3
+ require "diffbot/coercible_hash"
4
+
5
+ module Diffbot
6
+ # Representation of an article (i.e. a blog post or similar). This class offers
7
+ # a single entry point: the `.fetch` method, that, given a URL, will return
8
+ # the article as analyzed by Diffbot.
9
+ class Article < Hashie::Trash
10
+ extend CoercibleHash
11
+
12
+ # Public: Fetch an article from a URL.
13
+ #
14
+ # url - The article URL.
15
+ # token - The API token for Diffbot.
16
+ # parser - The callable object that will parse the raw output from the
17
+ # API. Defaults to Yajl::Parser.method(:parse).
18
+ # defaults - The default request options. See Diffbot.article_defaults.
19
+ #
20
+ # Yields the request configuration.
21
+ #
22
+ # Examples
23
+ #
24
+ # # Request an article with the default options.
25
+ # article = Diffbot::Article.fetch(url, api_token)
26
+ #
27
+ # # Pass options to the request. See Diffbot::Article::RequestParams to
28
+ # # see the available configuration options.
29
+ # article = Diffbot::Article.fetch(url, api_token) do |req|
30
+ # req.html = true
31
+ # end
32
+ #
33
+ # Returns a Diffbot::Article.
34
+ def self.fetch(url, token=Diffbot.token, parser=Yajl::Parser.method(:parse), defaults=Diffbot.article_defaults)
35
+ params = defaults.dup
36
+ yield params if block_given?
37
+
38
+ request = Diffbot::Request.new(token)
39
+ response = request.perform(:get, endpoint, params) do |req|
40
+ req[:query][:url] = url
41
+ end
42
+
43
+ new(parser.call(response.body))
44
+ end
45
+
46
+ # The API endpoint where requests should be made.
47
+ #
48
+ # Returns a URL.
49
+ def self.endpoint
50
+ "http://www.diffbot.com/api/article"
51
+ end
52
+
53
+ # Public: URL of the article.
54
+ property :url
55
+
56
+ # Public: Title of the article.
57
+ property :title
58
+
59
+ # Public: Author (or authors) of the article.
60
+ property :author
61
+
62
+ # Public: Date of the article (as a string).
63
+ property :date
64
+
65
+ class MediaItem < Hashie::Trash
66
+ property :type
67
+ property :link
68
+ property :primary, default: false
69
+ end
70
+
71
+ # Public: List of media items related to the articles. Each item is an
72
+ # object with the following attributes:
73
+ #
74
+ # type - Either `"image"` or `"video"`.
75
+ # link - The URL of the given media resource.
76
+ # primary - Only present in one of the items. This is assumed to be the most
77
+ # representative media for this article.
78
+ property :media
79
+ coerce_property :media, collection: MediaItem
80
+
81
+ # Public: The raw text of the article, without formatting.
82
+ property :text
83
+
84
+ # Public: The contents of the article in HTML, stripped of any ads or other
85
+ # chunks of HTML which are considered unrelated by Diffbot, unless you set
86
+ # the `dont_strip_ads` option in the request.
87
+ #
88
+ # Only present if you set `html` to true in the request.
89
+ property :html
90
+
91
+ # Public: A summary line for this article.
92
+ #
93
+ # Only present if you set `summary` to true in the request.
94
+ property :summary
95
+
96
+ # Public: A list of tags related to this article.
97
+ #
98
+ # Only present if you set `tags` to true in the request.
99
+ property :tags
100
+
101
+ # Public: The favicon of the page from which this article was extracted.
102
+ property :icon
103
+
104
+ class Stats < Hashie::Trash
105
+ property :fetch_time, from: :fetchTime
106
+ property :confidence
107
+ end
108
+
109
+ # Public: Returns an object with the following attributes:
110
+ #
111
+ # fetch_time - The time of the request, in ms.
112
+ # confidence - The confidence of Diffbot that the returned text is really
113
+ # the text of the article. Between 0.0 and 1.0.
114
+ #
115
+ # Only present if you set `stats` to true in the request.
116
+ property :stats
117
+ coerce_property :stats, class: Stats
118
+
119
+ # Public: The XPath selector at which the body of the article was found in
120
+ # the page.
121
+ property :xpath
122
+
123
+ # Public: If there was an error in the request, this will contain the error
124
+ # message.
125
+ property :error
126
+
127
+ # Public: If there was an error in the request, this will contain the error
128
+ # code.
129
+ property :error_code, from: :errorCode
130
+
131
+ # This represents the parameters you can pass to Diffbot to configure a
132
+ # given request. These are either set globally with Diffbot.article_defaults
133
+ # or on a request basis by passing a block to Diffbot::Article.fetch.
134
+ #
135
+ # Example:
136
+ #
137
+ # # All article requests will include the HTML and tags.
138
+ # Diffbot.configure do |config|
139
+ # config.article_defaults do |defaults|
140
+ # defaults.html = true
141
+ # defaults.tags = true
142
+ # end
143
+ # end
144
+ #
145
+ # # This article request will *also* include the summary.
146
+ # Diffbot::Article.fetch(url, token) do |req|
147
+ # req.summary = true
148
+ # end
149
+ class RequestParams < Hashie::Trash
150
+ # Public: Set to true to return HTML instead of plain-text.
151
+ #
152
+ # Defaults to nil.
153
+ #
154
+ # If enabled, sets the `html` key in the `Diffbot::Article`.
155
+ property :html
156
+
157
+ # Public: Set to true to keep any inline ads in the generated story.
158
+ #
159
+ # Defaults to nil.
160
+ #
161
+ # If enabled, it will change the `html` key in the `Diffbot::Article`.
162
+ property :dontStripAds, from: :dont_strip_ads
163
+
164
+ # Public: Set to true to generate tags for the extracted story.
165
+ #
166
+ # Defaults to nil.
167
+ #
168
+ # If enabled, sets the `tags` key in the `Diffbot::Article`.
169
+ property :tags
170
+
171
+ # Public: Set to true to find the comments and identify count, link, etc.
172
+ #
173
+ # Defaults to nil.
174
+ #
175
+ # If enabled, sets the `comments` key in the `Diffbot::Article`.
176
+ property :comments
177
+
178
+ # Public: Set to true to return a summary text.
179
+ #
180
+ # Defaults to nil.
181
+ #
182
+ # If enabled, sets the `summary` key in the `Diffbot::Article`.
183
+ property :summary
184
+
185
+ # Public: Set to true to include performance and probabilistic scoring
186
+ # stats.
187
+ #
188
+ # Defaults to nil.
189
+ #
190
+ # If enabled, sets the `stats` key in the `Diffbot::Article`.
191
+ property :stats
192
+ end
193
+ end
194
+ end
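`Article.fetch` copies the global defaults with `dup` before yielding them to the caller's block, so per-request changes should not leak back into the defaults. A minimal sketch of that idea, assuming a `Struct` in place of the gem's `Hashie::Trash` subclass; `Params`, `GLOBAL_DEFAULTS`, and `params_for_request` are hypothetical names used only for illustration.

```ruby
# Hypothetical params object; the gem uses a Hashie::Trash subclass instead.
Params = Struct.new(:html, :tags, :summary)

# Stand-in for the globally configured defaults.
GLOBAL_DEFAULTS = Params.new(true, true, nil)

# Copy the defaults, let the caller adjust the copy, and return it.
def params_for_request(defaults = GLOBAL_DEFAULTS)
  params = defaults.dup
  yield params if block_given?
  params
end

per_request = params_for_request { |req| req.summary = true }
```

Here `per_request` carries the extra `summary` flag while `GLOBAL_DEFAULTS.summary` is still `nil`.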
@@ -0,0 +1,113 @@
1
+ module Diffbot
2
+ # Public: Extend a hash with this mixin to make keys coercible to certain
3
+ # classes. These keys, when assigned to the hash, will be transformed into the
4
+ # specified classes.
5
+ #
6
+ # The object you pass as coercion types should implement either a `coerce` or
7
+ # a `new` method.
8
+ #
9
+ # You can define rules to coerce properties into classes or collections of
10
+ # classes. In the latter case, CoercibleHash will just map over whatever value
11
+ # is passed and attempt to coerce each item individually to the given class.
12
+ #
13
+ # Examples
14
+ #
15
+ # class Address < Struct.new(:street, :zipcode, :state)
16
+ # def self.coerce(address)
17
+ # new(address[:street], address[:zipcode], address[:state])
18
+ # end
19
+ # end
20
+ #
21
+ # class Person < Hash
22
+ # extend Diffbot::CoercibleHash
23
+ #
24
+ # coerce_property :address, Address
25
+ # coerce_property :children, collection: Person
26
+ #
27
+ # def name
28
+ # self["name"]
29
+ # end
30
+ # end
31
+ #
32
+ # person = Person.new(address: {
33
+ # street: "123 Example St.", zipcode: "12345", state: "XX"
34
+ # })
35
+ #
36
+ # person.address.street #=> "123 Example St."
37
+ # # etc.
38
+ #
39
+ # father = Person.new(name: "John", children: [
40
+ # { name: "Tim" }, { name: "Sarah" }
41
+ # ])
42
+ #
43
+ # father.name #=> "John"
44
+ # father.children.first.name #=> "Tim"
45
+ # father.children.last.name #=> "Sarah"
46
+ module CoercibleHash
47
+ # The coercion rules defined for this hash.
48
+ attr_reader :coercions
49
+
50
+ # Adds a #[]= that checks for coercion on the property and delegates to super.
51
+ def self.extended(base)
52
+ base.instance_variable_set("@coercions", {})
53
+ base.class_eval do
54
+ def []=(property, value)
55
+ if self.class.coercions.key?(property.to_s)
56
+ super property, self.class.coercions[property.to_s].(value)
57
+ else
58
+ super
59
+ end
60
+ end
61
+ end
62
+ end
63
+
64
+ # Public: Coerce a property of this hash into a given type. We will try to
65
+ # call .coerce on the object you pass as the class, and if that fails, we will
66
+ # call .new.
67
+ #
68
+ # property - The name of the property to coerce.
69
+ # class_or_options - Either a class to coerce to, or a hash with options:
70
+ # * class: The class to coerce to.
71
+ # * collection: Coerce the key into an array of members of
72
+ # this class.
73
+ #
74
+ # Examples
75
+ #
76
+ # class Person < Hash
77
+ # extend Diffbot::CoercibleHash
78
+ #
79
+ # coerce_property :address, Address
80
+ #
81
+ # coerce_property :children, collection: Person
82
+ #
83
+ # coerce_property :dob, class: Date
84
+ # end
85
+ def coerce_property(property, options)
86
+ unless options.is_a?(Hash)
87
+ options = { class: options }
88
+ end
89
+
90
+ coercion_method = ->(obj) do
91
+ if obj.respond_to?(:coerce)
92
+ obj.method(:coerce)
93
+ elsif obj.respond_to?(:new)
94
+ obj.method(:new)
95
+ else
96
+ raise ArgumentError, "#{obj.inspect} implements neither .coerce nor .new"
97
+ end
98
+ end
99
+
100
+ if options.has_key?(:collection)
101
+ klass = options[:collection]
102
+ coercion = ->(value) { value.map { |el| coercion_method[klass][el] } }
103
+ elsif options.has_key?(:class)
104
+ klass = options[:class]
105
+ coercion = ->(value) { coercion_method[klass][value] }
106
+ else
107
+ raise ArgumentError, "You need to specify either :class or :collection"
108
+ end
109
+
110
+ coercions[property.to_s] = coercion
111
+ end
112
+ end
113
+ end
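The core of `CoercibleHash` is a per-class table of callables consulted in `[]=`. The following is a reduced sketch of that idea using only the standard library, not the module above: `MiniCoercible`, `coerce_key`, and `Settings` are hypothetical names, and the rules are passed as lambdas rather than resolved from `.coerce` or `.new`.

```ruby
# Hypothetical, reduced version of the coercion idea.
module MiniCoercible
  # The coercion rules registered on the extending class, keyed by String.
  def coercions
    @coercions ||= {}
  end

  # Register a callable that transforms values assigned under the given key.
  def coerce_key(key, callable)
    coercions[key.to_s] = callable
  end
end

class Settings < Hash
  extend MiniCoercible

  coerce_key :port, ->(value) { Integer(value) }
  coerce_key :hosts, ->(value) { value.map(&:downcase) }

  # Apply the registered rule, if any, before storing the value.
  def []=(key, value)
    rule = self.class.coercions[key.to_s]
    super(key, rule ? rule.call(value) : value)
  end
end

settings = Settings.new
settings[:port] = "8080"
settings["hosts"] = ["A.example", "B.example"]
```

Rules are looked up by the string form of the key, so symbol and string keys share the same rule, as the "coerces symbols as well" test expects of the real module.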
@@ -0,0 +1,60 @@
1
+ require "nokogiri"
2
+ require "diffbot"
3
+ require "diffbot/item"
4
+
5
+ module Diffbot
6
+ # Representation of a front page. This class offers a single entry point: the
7
+ # `.fetch` method, that, given a URL, will return the front page as analyzed
8
+ # by Diffbot.
9
+ class Frontpage < Hashie::Trash
10
+ extend CoercibleHash
11
+
12
+ # Public: Fetch a frontpage's information from a URL.
13
+ #
14
+ # url - The frontpage URL.
15
+ # token - The API token for Diffbot.
16
+ # parser - The callable object that will parse the raw output from the
17
+ # API. Defaults to Diffbot::Frontpage::DmlParser.method(:parse).
18
+ #
19
+ # Examples
20
+ #
21
+ # # Request a frontpage with the default options.
22
+ # frontpage = Diffbot::Frontpage.fetch(url, api_token)
23
+ #
24
+ # Returns a Diffbot::Frontpage.
25
+ def self.fetch(url, token=Diffbot.token, parser=Diffbot::Frontpage::DmlParser.method(:parse))
26
+ request = Diffbot::Request.new(token)
27
+ response = request.perform(:get, endpoint) do |req|
28
+ req[:query][:url] = url
29
+ end
30
+
31
+ new(parser.call(response.body))
32
+ end
33
+
34
+ # The API endpoint where requests should be made.
35
+ #
36
+ # Returns a URL.
37
+ def self.endpoint
38
+ "http://www.diffbot.com/api/frontpage"
39
+ end
40
+
41
+ # Public: The title of the page.
42
+ property :title
43
+
44
+ # Public: The favicon of the page.
45
+ property :icon
46
+
47
+ # Public: The kind of page this is.
48
+ property :source_type, from: :sourceType
49
+
50
+ # Public: The URL from which this page was extracted.
51
+ property :source_url, from: :sourceURL
52
+
53
+ # Public: The items extracted from the page. These are instances of
54
+ # Diffbot::Item.
55
+ property :items
56
+ coerce_property :items, collection: Item
57
+ end
58
+ end
59
+
60
+ require "diffbot/frontpage/dml_parser"
@@ -0,0 +1,83 @@
1
+ # Parser that takes the XML generated from Diffbot's Frontpage API call and
2
+ # returns a hash suitable for Diffbot::Frontpage.
3
+ class Diffbot::Frontpage::DmlParser
4
+ # Take the string of DML and convert it into a nice little hash we can pass to
5
+ # Diffbot::Frontpage.
6
+ #
7
+ # dml - A string of DML.
8
+ #
9
+ # Returns a Hash.
10
+ def self.parse(dml)
11
+ node = Nokogiri(dml).root
12
+ parser = new(node)
13
+ parser.parse
14
+ end
15
+
16
+ # Initialize the parser with a DML node.
17
+ #
18
+ # node - The root XML::Element.
19
+ def initialize(node)
20
+ @dml = node
21
+ end
22
+
23
+ # The root element of the DML document.
24
+ attr_reader :dml
25
+
26
+ # Parses the Diffbot Markup Language and generates a Hash that we can pass to
27
+ # Frontpage.new.
28
+ #
29
+ # Returns a Hash.
30
+ def parse
31
+ attrs = {}
32
+
33
+ info = dml % "info"
34
+ attrs["title"] = (info % "title").text
35
+ attrs["icon"] = (info % "icon").text
36
+ attrs["sourceType"] = (info % "sourceType").text
37
+ attrs["sourceURL"] = (info % "sourceURL").text
38
+
39
+ items = dml / "item"
40
+ attrs["items"] = items.map do |item|
41
+ ItemParser.new(item).parse
42
+ end
43
+
44
+ attrs
45
+ end
46
+
47
+ # Parser that takes the XML from a particular item from the XML returned from
48
+ # the frontpage API.
49
+ class ItemParser
50
+ # The root element of each item.
51
+ attr_reader :item
52
+
53
+ # Initialize the parser with an Item node.
54
+ #
55
+ # item_node - The root node of the item.
56
+ def initialize(item_node)
57
+ @item = item_node
58
+ end
59
+
60
+ # Parses the item's DML and generates a Hash that we can add to the DML
61
+ # parser's "items" key together with the other items.
62
+ #
63
+ # Returns a Hash.
64
+ def parse
65
+ attrs = {}
66
+
67
+ %w(title link pubDate description textSummary).each do |attr|
68
+ node = item % attr
69
+ attrs[attr] = node && node.text
70
+ end
71
+
72
+ %w(type img id xroot cluster).each do |attr|
73
+ attrs[attr] = item[attr]
74
+ end
75
+
76
+ attrs["stats"] = %w(fresh sp sr).each_with_object({}) do |attr, hash|
77
+ hash[attr] = item[attr].to_f
78
+ end
79
+
80
+ attrs
81
+ end
82
+ end
83
+ end
@@ -0,0 +1,55 @@
1
+ module Diffbot
2
+ class Item < Hashie::Trash
3
+ extend CoercibleHash
4
+
5
+ class Stats < Hashie::Trash
6
+ property :fresh
7
+ property :static_rank, from: :sr
8
+ property :spam_score, from: :sp
9
+ end
10
+
11
+ # Public: The identifier of this item.
12
+ property :id
13
+
14
+ # Public: The title of this item.
15
+ property :title
16
+
17
+ # Public: The permalink/URL for this item.
18
+ property :link
19
+
20
+ # Public: A string with the date of the item.
21
+ property :pub_date, from: :pubDate
22
+
23
+ # Public: The HTML from the item.
24
+ property :description
25
+
26
+ # Public: A summary line with text from the item.
27
+ property :summary, from: :textSummary
28
+
29
+ # Public: The type of the item. Can be either `IMAGE`, `LINK`, `STORY`, or
30
+ # `CHUNK` (a chunk of HTML).
31
+ property :type
32
+
33
+ # Public: The URL for the image of this item.
34
+ property :img
35
+
36
+ # Public: The XPath at which this item is located.
37
+ property :xroot
38
+
39
+ # Public: The XPath for the cluster of items where this item comes from. If
40
+ # a frontpage has, for example, a main list of articles and a sidebar with
41
+ # "Top Articles", both will be separate clusters, each with
42
+ # their own articles.
43
+ property :cluster
44
+
45
+ # Public: Stats extracted from this item. This is an object with the
46
+ # following attributes:
47
+ #
48
+ # fresh - The percentage of the item that has changed compared to the
49
+ # previous crawl.
50
+ # static_rank - The quality score of the item on a 1 to 5 scale.
51
+ # spam_score - The probability this item is spam/an advertisement.
52
+ property :stats
53
+ coerce_property :stats, class: Stats
54
+ end
55
+ end
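`Item::Stats` relies on `from:` to expose the terse attribute names produced by the parser (`sr`, `sp`) under readable names. As a rough illustration of that renaming with a plain Hash, assuming nothing about Hashie's internals; `STAT_NAMES` and `readable_stats` are hypothetical helpers.

```ruby
# Hypothetical mapping from the raw attribute names to readable ones.
STAT_NAMES = {
  "fresh" => :fresh,
  "sr"    => :static_rank,
  "sp"    => :spam_score
}.freeze

# Rename every key of the raw stats hash; unknown keys raise KeyError.
def readable_stats(raw)
  raw.each_with_object({}) do |(key, value), stats|
    stats[STAT_NAMES.fetch(key)] = value
  end
end

stats = readable_stats("fresh" => 0.25, "sp" => 0.1, "sr" => 4.0)
```

Using `fetch` makes an unexpected attribute fail loudly rather than being silently dropped, which is a design choice of this sketch rather than something the gem states.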
@@ -0,0 +1,54 @@
1
+ require "excon"
2
+
3
+ module Diffbot
4
+ class Request
5
+ # The API token for Diffbot.
6
+ attr_reader :token
7
+
8
+ # Public: Initialize a new request to the API.
9
+ #
10
+ # token - The API token for Diffbot.
11
+ def initialize(token)
12
+ @token = token
13
+ end
14
+
15
+ # Public: Perform an HTTP request against Diffbot's API.
16
+ #
17
+ # method - The request method, one of :get, :head, :post, :put, or
18
+ # :delete.
19
+ # endpoint - The URL to which we'll make the request, as a String.
20
+ # query - A hash of query string params we want to pass along.
21
+ #
22
+ # Yields the request hash before making the request.
23
+ #
24
+ # Returns the response.
25
+ def perform(method, endpoint, query={})
26
+ request_options = build_request(method, query)
27
+ yield request_options if block_given?
28
+
29
+ request = Excon.new(endpoint)
30
+
31
+ request.request(request_options)
32
+ end
33
+
34
+ # Build the hash of options that Excon requires for an HTTP request.
35
+ #
36
+ # method - A Symbol with the HTTP method (:get, :post, etc).
37
+ # query_params - Any query parameters to add to the request.
38
+ #
39
+ # Returns a Hash.
40
+ def build_request(method, query_params={})
41
+ query = { token: token }.merge(query_params)
42
+ request = { query: query, method: method, headers: {} }
43
+
44
+ if Diffbot.instrumentor
45
+ request.update(
46
+ instrumentor: Diffbot.instrumentor,
47
+ instrumentor_name: "diffbot"
48
+ )
49
+ end
50
+
51
+ request
52
+ end
53
+ end
54
+ end
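`build_request` merges the request-specific parameters over a hash that already contains the token. A small sketch of that merge and of how such a hash serializes to a query string, assuming standard form encoding via `URI.encode_www_form`; the helper `build_query` is hypothetical, and the actual encoding is done by the HTTP client.

```ruby
require "uri"

# Hypothetical helper following the same merge order as build_request:
# the token comes first, then the request-specific parameters.
def build_query(token, params = {})
  { token: token }.merge(params)
end

query = build_query("abc123", url: "http://example.com/post", html: true)
query_string = URI.encode_www_form(query)

# Because params are merged last, a :token entry in them takes precedence.
overridden = build_query("abc123", token: "other")
```

The merge order means a caller-supplied `token` parameter would replace the one given to the request object.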
@@ -0,0 +1,66 @@
1
+ require "test_helper"
2
+ require "diffbot/coercible_hash"
3
+
4
+ describe Diffbot::CoercibleHash do
5
+ module Foo
6
+ def self.coerce(value)
7
+ "coerced #{value}"
8
+ end
9
+ end
10
+
11
+ module Bar
12
+ def self.new(value)
13
+ "initialized #{value}"
14
+ end
15
+ end
16
+
17
+ module Baz
18
+ def self.coerce(value)
19
+ "coerced #{value}"
20
+ end
21
+
22
+ def self.new(value)
23
+ "initialized #{value}"
24
+ end
25
+ end
26
+
27
+ class TestHash < Hash
28
+ extend Diffbot::CoercibleHash
29
+
30
+ coerce_property :foo, Foo
31
+ coerce_property :foos, collection: Foo
32
+
33
+ coerce_property :bar, Bar
34
+
35
+ coerce_property :baz, Baz
36
+ end
37
+
38
+ subject do
39
+ TestHash.new
40
+ end
41
+
42
+ it "coerces keys using the .coerce method" do
43
+ subject["foo"] = 1
44
+ subject["foo"].must_equal("coerced 1")
45
+ end
46
+
47
+ it "coerces collections" do
48
+ subject["foos"] = [1, 2, 3]
49
+ subject["foos"].must_equal(["coerced 1", "coerced 2", "coerced 3"])
50
+ end
51
+
52
+ it "coerces keys using the .new method" do
53
+ subject["bar"] = 2
54
+ subject["bar"].must_equal("initialized 2")
55
+ end
56
+
57
+ it "when both are present, prefers .coerce" do
58
+ subject["baz"] = 3
59
+ subject["baz"].must_equal("coerced 3")
60
+ end
61
+
62
+ it "coerces symbols as well" do
63
+ subject[:foo] = 2
64
+ subject[:foo].must_equal("coerced 2")
65
+ end
66
+ end
@@ -0,0 +1,2 @@
1
+ require "minitest/spec"
2
+ require "minitest/autorun"
metadata ADDED
@@ -0,0 +1,114 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: diffbot
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ prerelease:
6
+ platform: ruby
7
+ authors:
8
+ - Nicolas Sanguinetti
9
+ autorequire:
10
+ bindir: bin
11
+ cert_chain: []
12
+ date: 2012-02-06 00:00:00.000000000 Z
13
+ dependencies:
14
+ - !ruby/object:Gem::Dependency
15
+ name: excon
16
+ requirement: &70280593864880 !ruby/object:Gem::Requirement
17
+ none: false
18
+ requirements:
19
+ - - ! '>='
20
+ - !ruby/object:Gem::Version
21
+ version: '0'
22
+ type: :runtime
23
+ prerelease: false
24
+ version_requirements: *70280593864880
25
+ - !ruby/object:Gem::Dependency
26
+ name: yajl-ruby
27
+ requirement: &70280593864420 !ruby/object:Gem::Requirement
28
+ none: false
29
+ requirements:
30
+ - - ! '>='
31
+ - !ruby/object:Gem::Version
32
+ version: '0'
33
+ type: :runtime
34
+ prerelease: false
35
+ version_requirements: *70280593864420
36
+ - !ruby/object:Gem::Dependency
37
+ name: nokogiri
38
+ requirement: &70280593864000 !ruby/object:Gem::Requirement
39
+ none: false
40
+ requirements:
41
+ - - ! '>='
42
+ - !ruby/object:Gem::Version
43
+ version: '0'
44
+ type: :runtime
45
+ prerelease: false
46
+ version_requirements: *70280593864000
47
+ - !ruby/object:Gem::Dependency
48
+ name: hashie
49
+ requirement: &70280593863580 !ruby/object:Gem::Requirement
50
+ none: false
51
+ requirements:
52
+ - - ! '>='
53
+ - !ruby/object:Gem::Version
54
+ version: '0'
55
+ type: :runtime
56
+ prerelease: false
57
+ version_requirements: *70280593863580
58
+ - !ruby/object:Gem::Dependency
59
+ name: minitest
60
+ requirement: &70280593863160 !ruby/object:Gem::Requirement
61
+ none: false
62
+ requirements:
63
+ - - ! '>='
64
+ - !ruby/object:Gem::Version
65
+ version: '0'
66
+ type: :development
67
+ prerelease: false
68
+ version_requirements: *70280593863160
69
+ description: Diffbot provides a concise API for analyzing and extracting semantic
70
+ information from web pages using Diffbot (http://www.diffbot.com).
71
+ email: hi@nicolassanguinetti.info
72
+ executables: []
73
+ extensions: []
74
+ extra_rdoc_files: []
75
+ files:
76
+ - .gitignore
77
+ - LICENSE
78
+ - README.md
79
+ - Rakefile
80
+ - diffbot.gemspec
81
+ - lib/diffbot.rb
82
+ - lib/diffbot/article.rb
83
+ - lib/diffbot/coercible_hash.rb
84
+ - lib/diffbot/frontpage.rb
85
+ - lib/diffbot/frontpage/dml_parser.rb
86
+ - lib/diffbot/item.rb
87
+ - lib/diffbot/request.rb
88
+ - test/coercible_hash_test.rb
89
+ - test/test_helper.rb
90
+ homepage: http://github.com/tinder/diffbot
91
+ licenses: []
92
+ post_install_message:
93
+ rdoc_options: []
94
+ require_paths:
95
+ - lib
96
+ required_ruby_version: !ruby/object:Gem::Requirement
97
+ none: false
98
+ requirements:
99
+ - - ! '>='
100
+ - !ruby/object:Gem::Version
101
+ version: '0'
102
+ required_rubygems_version: !ruby/object:Gem::Requirement
103
+ none: false
104
+ requirements:
105
+ - - ! '>='
106
+ - !ruby/object:Gem::Version
107
+ version: '0'
108
+ requirements: []
109
+ rubyforge_project:
110
+ rubygems_version: 1.8.11
111
+ signing_key:
112
+ specification_version: 3
113
+ summary: Ruby interface to the Diffbot API
114
+ test_files: []