pismo 0.2.3 → 0.4.0

data/README.rdoc CHANGED
@@ -2,35 +2,40 @@
2
2
 
3
3
  * http://github.com/peterc/pismo
4
4
 
5
- == STATUS:
6
-
7
- pismo is a VERY NEW project developed for use on http://coder.io/ - my forthcoming developer news aggregator. pismo is FAR FROM COMPLETE. If you're brave, you can have a PLAY with it as the examples below and those in the test suite/corpus do work - all tests pass.
8
-
9
- The prime missing features so far are the "external attributes" - where calls are made to external services like Delicious, Yahoo, Bing, etc, for getting third party data about documents. The structures are there but I'm still deciding how best to integrate these ideas.
10
-
11
5
  == DESCRIPTION:
12
6
 
13
- Pismo extracts metadata and machine-usable data from otherwise unstructured
14
- HTML documents, including titles, body text, graphics, date, and keywords.
15
-
16
- For example, if you have a blog post HTML file, Pismo should, in theory, be
17
- able to extract the title, the actual "content", images relating to the
18
- content, look up Delicious tags, and analyze for keywords.
7
+ Pismo extracts metadata and machine-usable data from mostly unstructured (or poorly structured)
8
+ HTML documents. These data include titles, feed URLs, ledes, body text, graphics, date, and keywords.
19
9
 
20
- Pismo only understands English. Je suis desolé.
10
+ For example, if you have a blog post HTML file, Pismo, in theory, should
11
+ extract the title, the actual "content", and analyze for keywords, among other things.
21
12
 
22
- == SYNOPSIS:
13
+ Pismo only understands (and much prefers) English. Je suis desolé.
23
14
 
24
- * Basic demo:
15
+ == EXAMPLES:
25
16
 
26
- require 'open-uri'
27
17
  require 'pismo'
28
- doc = Pismo::Document.new(open('http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html'))
18
+
19
+ # Load a Web page (you can pass an IO object or a string with existing HTML data along too, if you prefer)
20
+ doc = Pismo::Document.new('http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html')
21
+
29
22
  doc.title # => "Cramp: Asychronous Event-Driven Ruby Web App Framework"
30
23
  doc.author # => "Peter Cooper"
31
24
  doc.lede # => "Cramp (GitHub repo)is a new, asynchronous evented Web app framework by Pratik Naik of 37signals (and the Rails core team). It's built around Ruby's EventMachine library and was designed to use event-driven I/O throughout - making it ideal for situations where you need to handle a large number of open connections (such as Comet systems or streaming APIs.)"
32
25
  doc.keywords # => [["cramp", 7], ["controllers", 3], ["app", 3], ["basic", 2], ..., ... ]
26
+
27
+ == NEW IN 0.4.0:
28
+
29
+ Pismo is not perfect and you might like to instead see all of the potential titles/ledes/authors or feeds that Pismo can find. You can now do this and judge them by your metrics.
30
+
31
+ doc.titles # => [..., ..., ...]
32
+ doc.ledes # => [..., ..., ...]
33
+ doc.authors # => [..., ..., ...]
34
+ doc.feeds # => [..., ..., ...]
33
35
 
36
+ == STATUS:
37
+
38
+ Pismo is a work in progress and is being used heavily in the development of http://coder.io/. Pismo is used in production systems on both Ruby 1.8 and 1.9. I do not know how it fares on JRuby, Rubinius, or others yet.
34
39
 
35
40
  == COMMAND LINE TOOL:
36
41
 
@@ -55,8 +60,8 @@ great for testing, or perhaps calling it from a non Ruby script. The output is c
55
60
  * Fork the project.
56
61
  * Make your feature addition or bug fix.
57
62
  * Add tests for it. This is important so I don't break it in a future version unintentionally.
58
- * Commit, do not mess with Rakefile, version, or history.
59
- * Send me a pull request. I may or may not accept it.
63
+ * Commit, do not mess with Rakefile, version, or history as it's handled by Jeweler (which is awesome, btw).
64
+ * Send me a pull request. I may or may not accept it (sorry, practicality rules.. but message me and we can talk!)
60
65
 
61
66
  == COPYRIGHT AND LICENSE
62
67
 
@@ -65,4 +70,4 @@ Apache 2.0 License - See LICENSE for details.
65
70
  All except lib/pismo/readability.rb is Copyright (c) 2009, 2010 Peter Cooper
66
71
  lib/pismo/readability.rb is Copyright (c) 2009, 2010 Arc90 Inc, starrhorne, and iterationlabs
67
72
 
68
- The readability stuff was ganked from http://github.com/iterationlabs/ruby-readability
73
+ The readability stuff was ganked from http://github.com/iterationlabs/ruby-readability - sorry! I have respected the license, however. I have promised to contribute back to them directly and, hopefully, use that library as a regular dependency. But.. this takes time.
data/VERSION CHANGED
@@ -1 +1 @@
1
- 0.2.3
1
+ 0.4.0
@@ -23,11 +23,11 @@ module Pismo
23
23
 
24
24
  def load(handle, url = nil)
25
25
  @url = url if url
26
- @url = handle if handle =~ /^http/
26
+ @url = handle if handle =~ /\Ahttp/
27
27
 
28
- @html = if handle =~ /^http/
28
+ @html = if handle =~ /\Ahttp/
29
29
  open(handle).read
30
- elsif handle.is_a?(StringIO) || handle.is_a?(IO)
30
+ elsif handle.is_a?(StringIO) || handle.is_a?(IO) || handle.is_a?(Tempfile)
31
31
  handle.read
32
32
  else
33
33
  handle
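
Note on the hunk above: the change from `^` to `\A` matters because, in Ruby, `^` matches at the start of any line, while `\A` matches only at the very start of the string. A string of raw HTML that happens to contain a line beginning with "http" would therefore have been treated as a URL by the old check. The sketch below illustrates the difference; the helper names `looks_like_url_old?` and `looks_like_url_new?` are hypothetical and exist only for this illustration, they are not part of Pismo.

```ruby
# Hypothetical helpers mirroring the old and new checks in the load method.
def looks_like_url_old?(handle)
  !(handle =~ /^http/).nil?   # ^ anchors at the start of ANY line
end

def looks_like_url_new?(handle)
  !(handle =~ /\Ahttp/).nil?  # \A anchors at the start of the string only
end

html = "<html>\nhttp://example.com/ appears on the second line\n</html>"

looks_like_url_old?(html)                   # => true  (raw HTML mistaken for a URL)
looks_like_url_new?(html)                   # => false
looks_like_url_new?('http://example.com/')  # => true
```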
@@ -2,34 +2,62 @@ module Pismo
2
2
  # Internal attributes are different pieces of data we can extract from a document's content
3
3
  module InternalAttributes
4
4
  # Returns the title of the page/content - attempts to strip site name, etc, if possible
5
- def title
5
+ def title(all = false)
6
6
  # TODO: Memoizations
7
- title = @doc.match( 'h2.title',
8
- '.entry h2', # Common style
9
- '.entryheader h1', # Ruby Inside/Kubrick
10
- '.entry-title a', # Common Blogger/Blogspot rules
11
- '.post-title a',
12
- '.posttitle a',
13
- '.entry-title',
14
- '.post-title',
15
- '.posttitle',
16
- ['meta[@name="title"]', lambda { |el| el.attr('content') }],
17
- '#pname a', # Google Code style
18
- 'h1.headermain',
19
- 'h1.title',
20
- '.mxb h1' # BBC News
7
+ title = @doc.match(
8
+ [
9
+ '.entryheader h1', # Ruby Inside/Kubrick
10
+ '.entry-title a', # Common Blogger/Blogspot rules
11
+ '.post-title a',
12
+ '.post_title a',
13
+ '.posttitle a',
14
+ '.post-header h1',
15
+ '.entry-title',
16
+ '.post-title',
17
+ '.posttitle',
18
+ '.post_title',
19
+ '.pageTitle',
20
+ '.title h1',
21
+ '.post h2',
22
+ 'h2.title',
23
+ '.entry h2', # Common style
24
+ '.boite_titre a',
25
+ ['meta[@name="title"]', lambda { |el| el.attr('content') }],
26
+ '#pname a', # Google Code style
27
+ 'h1.headermain',
28
+ 'h1.title',
29
+ '.mxb h1', # BBC News
30
+ '#content h1',
31
+ '#content h2',
32
+ '#content h3',
33
+ 'a[@rel="bookmark"]',
34
+ '.products h2'
35
+ ],
36
+ all
21
37
  )
22
38
 
23
39
  # If all else fails, go to the HTML title
24
- unless title
25
- title = @doc.match('title')
26
- return unless title
27
-
28
- # Strip off any leading or trailing site names - a scrappy way to try it out..
29
- title = title.split(/\s+(\-|\||\:)\s+/).sort_by { |i| i.length }.last.strip
40
+ if all
41
+ return [html_title] if !title
42
+ return ([*title] + [html_title]).uniq
43
+ else
44
+ return html_title if !title
45
+ return title
30
46
  end
31
-
32
- title
47
+ end
48
+
49
+ def titles
50
+ title(true)
51
+ end
52
+
53
+
54
+ # HTML title
55
+ def html_title
56
+ title = @doc.match('title')
57
+ return unless title
58
+
59
+ # Strip off any leading or trailing site names - a scrappy way to try it out..
60
+ title = title.split(/\s+(\-|\||\:)\s+/).sort_by { |i| i.length }.last.to_s.strip
33
61
  end
34
62
 
35
63
  # Return an estimate of when the page/content was created
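
Note on `html_title` in the hunk above: the site-name stripping splits the `<title>` text on a dash, pipe, or colon that has whitespace on both sides, then keeps the longest piece. A minimal stand-alone sketch follows; `strip_site_name` is a hypothetical name chosen for this illustration, and the sample title is invented.

```ruby
# Hypothetical stand-alone version of the site-name stripping in html_title.
def strip_site_name(raw_title)
  # The separator must have whitespace on both sides, so "Event-Driven" and
  # "Cramp: ..." are left alone. Because the pattern contains a capture group,
  # split also returns the one-character separators; they lose to any longer piece.
  raw_title.split(/\s+(\-|\||\:)\s+/).sort_by { |part| part.length }.last.to_s.strip
end

strip_site_name('Ruby Inside | Cramp: Asynchronous Event-Driven Framework')
# => "Cramp: Asynchronous Event-Driven Framework"

strip_site_name('')
# => "" (split returns [], last returns nil, and the added to_s avoids a NoMethodError)
```

The heuristic assumes the article title is longer than the site name, which is why the source comment calls it "scrappy".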
@@ -43,7 +71,10 @@ module Pismo
43
71
  regexen = [
44
72
  /#{mo}\b\s+\d+\D{1,10}\d{4}/i,
45
73
  /(on\s+)?\d+\s+#{mo}\s+\D{1,10}\d+/i,
46
- /(on[^\d+]{1,10})?\d+(th|st|rd)?.{1,10}#{mo}\b[^\d]{1,10}\d+/i,
74
+ /(on[^\d+]{1,10})\d+(th|st|rd)?.{1,10}#{mo}\b[^\d]{1,10}\d+/i,
75
+ /\b\d{4}\-\d{2}\-\d{2}\b/i,
76
+ /\d+(th|st|rd).{1,10}#{mo}\b[^\d]{1,10}\d+/i,
77
+ /\d+\s+#{mo}\b[^\d]{1,10}\d+/i,
47
78
  /on\s+#{mo}\s+\d+/i,
48
79
  /#{mo}\s+\d+/i,
49
80
  /\d{4}[\.\/\-]\d{2}[\.\/\-]\d{2}/,
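
Note on the date patterns in the hunk above: one of the added entries, `/\b\d{4}\-\d{2}\-\d{2}\b/i`, picks up ISO-style dates, and the following hunk shows each pattern being applied with `String#[]`. The sketch below isolates that one pattern; `first_iso_date` and the sample strings are hypothetical, used only for illustration.

```ruby
# The ISO-style pattern added in 0.4.0, copied from the hunk above.
ISO_DATE = /\b\d{4}\-\d{2}\-\d{2}\b/i

# Hypothetical helper: returns the first matching substring, or nil.
def first_iso_date(html)
  html[ISO_DATE]
end

first_iso_date('<span class="date">Published 2010-02-14 by Peter</span>')
# => "2010-02-14"

first_iso_date('Order 12010-02-143')
# => nil (the \b anchors reject digits that run on at either end)
```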
@@ -54,7 +85,7 @@ module Pismo
54
85
 
55
86
  regexen.each do |r|
56
87
  datetime = @doc.to_html[r]
57
- p datetime
88
+ # p datetime
58
89
  break if datetime
59
90
  end
60
91
 
@@ -76,10 +107,13 @@ module Pismo
76
107
  # end
77
108
 
78
109
  # Returns the author of the page/content
79
- def author
80
- author = @doc.match('.post-author .fn',
110
+ def author(all = false)
111
+ author = @doc.match([
112
+ '.post-author .fn',
81
113
  '.wire_author',
82
114
  '.cnnByline b',
115
+ '.editorlink',
116
+ '.authors p',
83
117
  ['meta[@name="author"]', lambda { |el| el.attr('content') }], # Traditional meta tag style
84
118
  ['meta[@name="AUTHOR"]', lambda { |el| el.attr('content') }], # CNN style
85
119
  '.byline a', # Ruby Inside style
@@ -94,31 +128,48 @@ module Pismo
94
128
  '.auth',
95
129
  '.cT-storyDetails h5', # smh.com.au - worth dropping maybe..
96
130
  ['meta[@name="byl"]', lambda { |el| el.attr('content') }],
131
+ '.timestamp a',
97
132
  '.fn a',
98
133
  '.fn',
99
- '.byline-author'
100
- )
134
+ '.byline-author',
135
+ '.ArticleAuthor a',
136
+ '.blog_meta a',
137
+ 'cite a',
138
+ 'cite',
139
+ '.contributor_details h4 a'
140
+ ], all)
101
141
 
102
142
  return unless author
103
143
 
104
144
  # Strip off any "By [whoever]" section
105
- author.sub!(/^(post(ed)?\s)?by\W+/i, '')
145
+ if String === author
146
+ author.sub!(/^(post(ed)?\s)?by\W+/i, '')
147
+ elsif Array === author
148
+ author.map! { |a| a.sub(/^(post(ed)?\s)?by\W+/i, '') }.uniq!
149
+ end
106
150
 
107
151
  author
108
152
  end
109
153
 
154
+ def authors
155
+ author(true)
156
+ end
157
+
158
+
110
159
  # Returns the "description" of the page, usually comes from a meta tag
111
160
  def description
112
- @doc.match(
161
+ @doc.match([
113
162
  ['meta[@name="description"]', lambda { |el| el.attr('content') }],
114
163
  ['meta[@name="Description"]', lambda { |el| el.attr('content') }],
164
+ 'rdf:Description[@name="dc:description"]',
115
165
  '.description'
116
- )
166
+ ])
117
167
  end
118
168
 
119
- # Returns the "lede" or first paragraph of the story/page
120
- def lede
121
- lede = @doc.match(
169
+ # Returns the "lede(s)" or first paragraph(s) of the story/page
170
+ def lede(all = false)
171
+ lede = @doc.match([
172
+ '.post-text p',
122
173
  '#blogpost p',
123
174
  '.subhead',
124
175
  '//div[@class="entrytext"]//p[string-length()>10]', # Ruby Inside / Kubrick style
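
Note on `author` in the hunk above: now that `@doc.match` can return either a single string or an array (when `all` is true), the "By ..." prefix stripping has to handle both shapes. A minimal sketch of that idea follows, written with non-mutating calls; `clean_author` and `BYLINE_PREFIX` are hypothetical names for this illustration, while the regular expression itself is copied from the diff.

```ruby
# Prefix pattern copied from the diff: optional "post"/"posted", then "by".
BYLINE_PREFIX = /^(post(ed)?\s)?by\W+/i

# Hypothetical helper: accepts one author string or an array of candidates.
def clean_author(author)
  case author
  when String then author.sub(BYLINE_PREFIX, '')
  when Array  then author.map { |a| a.sub(BYLINE_PREFIX, '') }.uniq
  end
end

clean_author('Posted by Peter Cooper')         # => "Peter Cooper"
clean_author(['By: Jane Doe', 'by Jane Doe'])  # => ["Jane Doe"]
clean_author(nil)                              # => nil
```

The sketch uses `sub` and `uniq` rather than `sub!` and `uniq!` because the bang variants return nil when nothing changes; the diff sidesteps that by returning `author` afterwards instead of the result of the call.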
@@ -136,10 +187,24 @@ module Pismo
136
187
  '#content p',
137
188
  '#article p',
138
189
  '.post-body',
139
- '.entry-content'
140
- )
141
-
142
- lede[/^(.*?\.\s){2}/m] || lede
190
+ '.entry-content',
191
+ '.body p',
192
+ '.document_description_short p', # Scribd
193
+ '.single-post p',
194
+ 'p'
195
+ ], all)
196
+
197
+ if lede && String === lede
198
+ return lede[/^(.*?\.\s){2}/m] || lede
199
+ elsif lede && Array === lede
200
+ return lede.map { |l| l.to_s[/^(.*?\.\s){2}/m] || l }.uniq
201
+ else
202
+ return body ? body[/^(.*?\.\s){2}/m] : nil
203
+ end
204
+ end
205
+
206
+ def ledes
207
+ lede(true)
143
208
  end
144
209
 
145
210
  # Returns the "keywords" in the document (not the meta keywords - they're next to useless now)
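
Note on `lede` in the hunk above: the expression `/^(.*?\.\s){2}/m` trims a matched paragraph to its first two sentences, where a "sentence" is simply any text ending in a period followed by whitespace. If fewer than two such sentences exist, the whole text is kept. The sketch below shows that behaviour in isolation; `shorten_lede` and the sample text are hypothetical.

```ruby
# Pattern copied from the diff: two lazy runs, each ending in ". " (period + whitespace).
FIRST_TWO_SENTENCES = /^(.*?\.\s){2}/m

# Hypothetical helper mirroring `lede[...] || lede` from the diff.
def shorten_lede(text)
  text[FIRST_TWO_SENTENCES] || text
end

shorten_lede('Cramp is new. It is evented. It uses EventMachine.')
# => "Cramp is new. It is evented. " (note the trailing space)

shorten_lede('Only one sentence here.')
# => "Only one sentence here."
```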
@@ -150,7 +215,9 @@ module Pismo
150
215
 
151
216
  # Convert doc to lowercase, scrub out most HTML tags, then keep track of words
152
217
  cached_title = title
153
- body.downcase.gsub(/\<[^\>]{1,100}\>/, '').gsub(/\&\w+\;/, '').scan(/\b[a-z][a-z\'\+\#\.]*\b/).each do |word|
218
+ content_to_use = body.to_s.downcase + description.to_s.downcase
219
+
220
+ content_to_use.downcase.gsub(/\<[^\>]{1,100}\>/, '').gsub('. ', ' ').gsub(/\&\w+\;/, '').scan(/\b[a-z][a-z\+\.\'\+\#\-]*\b/).each do |word|
154
221
  next if word.length > options[:word_length_limit]
155
222
  word.gsub!(/\'\w+/, '')
156
223
  words[word] ||= 0
@@ -178,9 +245,9 @@ module Pismo
178
245
 
179
246
  # Returns URL to the site's favicon
180
247
  def favicon
181
- url = @doc.match( ['link[@rel="fluid-icon"]', lambda { |el| el.attr('href') }], # Get a Fluid icon if possible..
248
+ url = @doc.match([['link[@rel="fluid-icon"]', lambda { |el| el.attr('href') }], # Get a Fluid icon if possible..
182
249
  ['link[@rel="shortcut icon"]', lambda { |el| el.attr('href') }],
183
- ['link[@rel="icon"]', lambda { |el| el.attr('href') }])
250
+ ['link[@rel="icon"]', lambda { |el| el.attr('href') }]])
184
251
  if url && url !~ /^http/ && @url
185
252
  url = URI.join(@url , url).to_s
186
253
  end
@@ -188,17 +255,30 @@ module Pismo
188
255
  url
189
256
  end
190
257
 
191
- # Returns URL of Web feed
192
- def feed
193
- url = @doc.match( ['link[@type="application/rss+xml"]', lambda { |el| el.attr('href') }],
194
- ['link[@type="application/atom+xml"]', lambda { |el| el.attr('href') }]
258
+ # Returns URL(s) of Web feed(s)
259
+ def feed(all = false)
260
+ url = @doc.match([['link[@type="application/rss+xml"]', lambda { |el| el.attr('href') }],
261
+ ['link[@type="application/atom+xml"]', lambda { |el| el.attr('href') }]], all
195
262
  )
196
263
 
197
- if url && url !~ /^http/ && @url
264
+ if url && String === url && url !~ /^http/ && @url
198
265
  url = URI.join(@url , url).to_s
266
+ elsif url && Array === url
267
+ url.map! do |u|
268
+ if u !~ /^http/ && @url
269
+ URI.join(@url, u).to_s
270
+ else
271
+ u
272
+ end
273
+ end
274
+ url.uniq!
199
275
  end
200
276
 
201
277
  url
202
278
  end
279
+
280
+ def feeds
281
+ feed(true)
282
+ end
203
283
  end
204
284
  end
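
Note on `feed` in the hunk above: relative feed URLs are resolved against the document's URL with `URI.join`, and the new array branch does the same for every candidate before removing duplicates. The sketch below shows that resolution step on its own; `absolutize` and the example.com URLs are hypothetical and not part of Pismo.

```ruby
require 'uri'

# Hypothetical helper mirroring the String and Array branches of feed.
def absolutize(url, base)
  return url.map { |u| absolutize(u, base) }.uniq if url.is_a?(Array)
  return url if url.nil? || url =~ /^http/ || base.nil?
  URI.join(base, url).to_s
end

base = 'http://example.com/blog/post.html'

absolutize('/feed.xml', base)
# => "http://example.com/feed.xml"

absolutize(['/feed.xml', 'http://example.com/feed.xml'], base)
# => ["http://example.com/feed.xml"] (duplicates collapse after resolution)
```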
@@ -9,6 +9,8 @@
9
9
  # http://lab.arc90.com/experiments/readability/js/readability.js
10
10
  # * Copyright (c) 2009 Arc90 Inc
11
11
  # * Readability is licensed under the Apache License, Version 2.0.
12
+ #
13
+ # Minor edits and tweaks by Peter Cooper
12
14
 
13
15
  require 'nokogiri'
14
16
 
@@ -70,6 +72,9 @@ module Readability
70
72
 
71
73
  sibling_score_threshold = [10, best_candidate[:content_score] * 0.2].max
72
74
  output = Nokogiri::XML::Node.new('div', @html)
75
+
76
+ return output unless best_candidate[:elem]
77
+
73
78
  best_candidate[:elem].parent.children.each do |sibling|
74
79
  append = false
75
80
  append = true if sibling == best_candidate[:elem]
@@ -105,7 +110,7 @@ module Readability
105
110
  end
106
111
 
107
112
  best_candidate = sorted_candidates.first || { :elem => @html.css("body").first, :content_score => 0 }
108
- debug("Best candidate #{best_candidate[:elem].name}##{best_candidate[:elem][:id]}.#{best_candidate[:elem][:class]} with score #{best_candidate[:content_score]}")
113
+ #debug("Best candidate #{best_candidate[:elem].name}##{best_candidate[:elem][:id]}.#{best_candidate[:elem][:class]} with score #{best_candidate[:content_score]}")
109
114
 
110
115
  best_candidate
111
116
  end