pismo 0.2.3 → 0.4.0
- data/README.rdoc +25 -20
- data/VERSION +1 -1
- data/lib/pismo/document.rb +3 -3
- data/lib/pismo/internal_attributes.rb +127 -47
- data/lib/pismo/readability.rb +6 -1
- data/lib/pismo/stopwords.txt +452 -326
- data/lib/pismo.rb +10 -4
- data/pismo.gemspec +2 -2
- data/test/corpus/metadata_expected.yaml +17 -0
- metadata +2 -2
data/README.rdoc
CHANGED
@@ -2,35 +2,40 @@
 
 * http://github.com/peterc/pismo
 
-== STATUS:
-
-pismo is a VERY NEW project developed for use on http://coder.io/ - my forthcoming developer news aggregator. pismo is FAR FROM COMPLETE. If you're brave, you can have a PLAY with it as the examples below and those in the test suite/corpus do work - all tests pass.
-
-The prime missing features so far are the "external attributes" - where calls are made to external services like Delicious, Yahoo, Bing, etc, for getting third party data about documents. The structures are there but I'm still deciding how best to integrate these ideas.
-
 == DESCRIPTION:
 
-Pismo extracts metadata and machine-usable data from
-HTML documents
-
-For example, if you have a blog post HTML file, Pismo should, in theory, be
-able to extract the title, the actual "content", images relating to the
-content, look up Delicious tags, and analyze for keywords.
+Pismo extracts metadata and machine-usable data from mostly unstructured (or poorly structured)
+HTML documents. These data include titles, feed URLs, ledes, body text, graphics, date, and keywords.
 
-
+For example, if you have a blog post HTML file, Pismo, in theory, should
+extract the title, the actual "content", and analyze for keywords, among other things.
 
-
+Pismo only understands (and much prefers) English. Je suis desolé.
 
-
+== EXAMPLES:
 
-require 'open-uri'
 require 'pismo'
-
+
+# Load a Web page (you can pass an IO object or a string with existing HTML data along too, if you prefer)
+doc = Pismo::Document.new('http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html')
+
 doc.title # => "Cramp: Asychronous Event-Driven Ruby Web App Framework"
 doc.author # => "Peter Cooper"
 doc.lede # => "Cramp (GitHub repo)is a new, asynchronous evented Web app framework by Pratik Naik of 37signals (and the Rails core team). It's built around Ruby's EventMachine library and was designed to use event-driven I/O throughout - making it ideal for situations where you need to handle a large number of open connections (such as Comet systems or streaming APIs.)"
 doc.keywords # => [["cramp", 7], ["controllers", 3], ["app", 3], ["basic", 2], ..., ... ]
+
+== NEW IN 0.4.0:
+
+Pismo is not perfect and you might like to instead see all of the potential titles/ledes/authors or feeds that Pismo can find. You can now do this and judge them by your metrics.
+
+doc.titles # => [..., ..., ...]
+doc.ledes # => [..., ..., ...]
+doc.authors # => [..., ..., ...]
+doc.feeds # => [..., ..., ...]
 
+== STATUS:
+
+Pismo is a work in progress and is being used heavily in the development of http://coder.io/. Pismo is used in production systems on both Ruby 1.8 and 1.9. I do not know how it fares on JRuby, Rubinius, or others yet.
 
 == COMMAND LINE TOOL:
 
@@ -55,8 +60,8 @@ great for testing, or perhaps calling it from a non Ruby script. The output is c
 * Fork the project.
 * Make your feature addition or bug fix.
 * Add tests for it. This is important so I don't break it in a future version unintentionally.
-* Commit, do not mess with Rakefile, version, or history.
-* Send me a pull request. I may or may not accept it
+* Commit, do not mess with Rakefile, version, or history as it's handled by Jeweler (which is awesome, btw).
+* Send me a pull request. I may or may not accept it (sorry, practicality rules.. but message me and we can talk!)
 
 == COPYRIGHT AND LICENSE
 
@@ -65,4 +70,4 @@ Apache 2.0 License - See LICENSE for details.
 All except lib/pismo/readability.rb is Copyright (c) 2009, 2010 Peter Cooper
 lib/pismo/readability.rb is Copyright (c) 2009, 2010 Arc90 Inc, starrhorne, and iterationlabs
 
-The readability stuff was ganked from http://github.com/iterationlabs/ruby-readability
+The readability stuff was ganked from http://github.com/iterationlabs/ruby-readability - sorry! I have respected the license, however. I have promised to contribute back to them directly and, hopefully, use that library as a regular dependency. But.. this takes time.
data/VERSION
CHANGED
@@ -1 +1 @@
-0.
+0.4.0
data/lib/pismo/document.rb
CHANGED
@@ -23,11 +23,11 @@ module Pismo
 
     def load(handle, url = nil)
       @url = url if url
-      @url = handle if handle =~
+      @url = handle if handle =~ /\Ahttp/
 
-      @html = if handle =~
+      @html = if handle =~ /\Ahttp/
         open(handle).read
-      elsif handle.is_a?(StringIO) || handle.is_a?(IO)
+      elsif handle.is_a?(StringIO) || handle.is_a?(IO) || handle.is_a?(Tempfile)
         handle.read
       else
         handle
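Review note on the hunk above: the extra `Tempfile` test looks redundant at first glance, but in standard Ruby a `Tempfile` wraps a `File` by delegation and is not itself an `IO` subclass, so `is_a?(IO)` alone does not catch it. The sketch below mirrors only the branch order of `load`. `classify_handle` and its return symbols are hypothetical names used for illustration and are not part of Pismo.

```ruby
require 'stringio'
require 'tempfile'

# Hypothetical helper: reports which branch of Document#load a handle would take.
# It never fetches anything, so it is safe to run offline.
def classify_handle(handle)
  if handle.is_a?(String) && handle =~ /\Ahttp/
    :url        # \A anchors at the start of the whole string, not of any line
  elsif handle.is_a?(StringIO) || handle.is_a?(IO) || handle.is_a?(Tempfile)
    :io         # anything we can call #read on
  else
    :raw_html   # assume the caller passed the HTML itself
  end
end

tmp = Tempfile.new('pismo-sketch')
tmp.write('<html><title>Hi</title></html>')
tmp.rewind

kinds = [
  classify_handle('http://example.com/post.html'),
  classify_handle(StringIO.new('<p>in memory</p>')),
  classify_handle(tmp),
  classify_handle("<p>see\nhttp://example.com/</p>")
]
tmp.close!
```

The last input shows why `\A` is the safer anchor here: a `^` anchor would also match at the start of the second line of an HTML string.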
data/lib/pismo/internal_attributes.rb
CHANGED
@@ -2,34 +2,62 @@ module Pismo
   # Internal attributes are different pieces of data we can extract from a document's content
   module InternalAttributes
     # Returns the title of the page/content - attempts to strip site name, etc, if possible
-    def title
+    def title(all = false)
       # TODO: Memoizations
-      title = @doc.match(
-
-
-
-
-
-
-
-
-
-
-
-
-
+      title = @doc.match(
+        [
+          '.entryheader h1', # Ruby Inside/Kubrick
+          '.entry-title a', # Common Blogger/Blogspot rules
+          '.post-title a',
+          '.post_title a',
+          '.posttitle a',
+          '.post-header h1',
+          '.entry-title',
+          '.post-title',
+          '.posttitle',
+          '.post_title',
+          '.pageTitle',
+          '.title h1',
+          '.post h2',
+          'h2.title',
+          '.entry h2', # Common style
+          '.boite_titre a',
+          ['meta[@name="title"]', lambda { |el| el.attr('content') }],
+          '#pname a', # Google Code style
+          'h1.headermain',
+          'h1.title',
+          '.mxb h1', # BBC News
+          '#content h1',
+          '#content h2',
+          '#content h3',
+          'a[@rel="bookmark"]',
+          '.products h2'
+        ],
+        all
       )
 
       # If all else fails, go to the HTML title
-
-
-      return
-
-
-
+      if all
+        return [html_title] if !title
+        return ([*title] + [html_title]).uniq
+      else
+        return html_title if !title
+        return title
       end
-
-
+    end
+
+    def titles
+      title(true)
+    end
+
+
+    # HTML title
+    def html_title
+      title = @doc.match('title')
+      return unless title
+
+      # Strip off any leading or trailing site names - a scrappy way to try it out..
+      title = title.split(/\s+(\-|\||\:)\s+/).sort_by { |i| i.length }.last.to_s.strip
     end
 
     # Return an estimate of when the page/content was created
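Review note on `html_title` above: the "scrappy" site-name stripping splits the `<title>` text on ` - `, ` | ` or ` : ` and keeps the longest piece. A minimal standalone sketch of that one expression follows. `strip_site_name` is a hypothetical wrapper name, and the sample titles are made up apart from the Ruby Inside one quoted in the README.

```ruby
# Hypothetical wrapper around the expression used in html_title.
# Because the separator is in a capture group, String#split also returns the
# separators themselves ("-", "|", ":"); they are one character long, so they
# never win the longest-piece contest.
def strip_site_name(raw_title)
  raw_title.split(/\s+(\-|\||\:)\s+/).sort_by { |part| part.length }.last.to_s.strip
end

article_first = strip_site_name('Cramp: Asychronous Event-Driven Ruby Web App Framework - Ruby Inside')
site_first    = strip_site_name('Example Blog | A much longer article headline here')

# Limitation of the heuristic: when the site name is the longest piece, it wins.
misfire       = strip_site_name('Hi | The Longest Site Name')
```

Note that the colon in "Cramp: Asychronous" is not treated as a separator, because the pattern requires whitespace on both sides of the separator character.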
@@ -43,7 +71,10 @@ module Pismo
       regexen = [
         /#{mo}\b\s+\d+\D{1,10}\d{4}/i,
         /(on\s+)?\d+\s+#{mo}\s+\D{1,10}\d+/i,
-        /(on[^\d+]{1,10})
+        /(on[^\d+]{1,10})\d+(th|st|rd)?.{1,10}#{mo}\b[^\d]{1,10}\d+/i,
+        /\b\d{4}\-\d{2}\-\d{2}\b/i,
+        /\d+(th|st|rd).{1,10}#{mo}\b[^\d]{1,10}\d+/i,
+        /\d+\s+#{mo}\b[^\d]{1,10}\d+/i,
         /on\s+#{mo}\s+\d+/i,
         /#{mo}\s+\d+/i,
         /\d{4}[\.\/\-]\d{2}[\.\/\-]\d{2}/,
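Review note on the new date patterns: they are tried in order and the first one that matches wins, so more specific patterns need to come before looser ones. The sketch below exercises three of the added patterns in isolation. The definition of `mo` is not visible in this hunk, so the month alternation here is an assumption, and `first_date_match` is a hypothetical helper name.

```ruby
# Assumed stand-in for `mo`; the real definition lives outside the hunk shown.
mo = '(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*'

# Three of the patterns added in this hunk, copied as-is.
patterns = [
  /\b\d{4}\-\d{2}\-\d{2}\b/i,
  /\d+(th|st|rd).{1,10}#{mo}\b[^\d]{1,10}\d+/i,
  /\d+\s+#{mo}\b[^\d]{1,10}\d+/i
]

# Hypothetical helper: returns the text matched by the first pattern that hits.
def first_date_match(text, patterns)
  patterns.each do |r|
    found = text[r]
    return found if found
  end
  nil
end

iso     = first_date_match('Posted 2010-02-14 by someone', patterns)
ordinal = first_date_match('Published on 3rd of March, 2010', patterns)
plain   = first_date_match('Last updated 2 March 2010.', patterns)
none    = first_date_match('No date in this sentence', patterns)
```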
@@ -54,7 +85,7 @@ module Pismo
 
       regexen.each do |r|
         datetime = @doc.to_html[r]
-        p datetime
+        # p datetime
         break if datetime
       end
 
@@ -76,10 +107,13 @@ module Pismo
     # end
 
     # Returns the author of the page/content
-    def author
-      author = @doc.match(
+    def author(all = false)
+      author = @doc.match([
+        '.post-author .fn',
         '.wire_author',
         '.cnnByline b',
+        '.editorlink',
+        '.authors p',
         ['meta[@name="author"]', lambda { |el| el.attr('content') }], # Traditional meta tag style
         ['meta[@name="AUTHOR"]', lambda { |el| el.attr('content') }], # CNN style
         '.byline a', # Ruby Inside style
@@ -94,31 +128,48 @@ module Pismo
         '.auth',
         '.cT-storyDetails h5', # smh.com.au - worth dropping maybe..
         ['meta[@name="byl"]', lambda { |el| el.attr('content') }],
+        '.timestamp a',
         '.fn a',
         '.fn',
-        '.byline-author'
-
+        '.byline-author',
+        '.ArticleAuthor a',
+        '.blog_meta a',
+        'cite a',
+        'cite',
+        '.contributor_details h4 a'
+      ], all)
 
       return unless author
 
       # Strip off any "By [whoever]" section
-      author
+      if String === author
+        author.sub!(/^(post(ed)?\s)?by\W+/i, '')
+      elsif Array === author
+        author.map! { |a| a.sub(/^(post(ed)?\s)?by\W+/i, '') }.uniq!
+      end
 
       author
     end
 
+    def authors
+      author(true)
+    end
+
+
     # Returns the "description" of the page, usually comes from a meta tag
     def description
-      @doc.match(
+      @doc.match([
         ['meta[@name="description"]', lambda { |el| el.attr('content') }],
         ['meta[@name="Description"]', lambda { |el| el.attr('content') }],
+        'rdf:Description[@name="dc:description"]',
         '.description'
-      )
+      ])
     end
 
-    # Returns the "lede" or first paragraph of the story/page
-    def lede
-      lede = @doc.match(
+    # Returns the "lede(s)" or first paragraph(s) of the story/page
+    def lede(all = false)
+      lede = @doc.match([
+        '.post-text p',
         '#blogpost p',
         '.subhead',
         '//div[@class="entrytext"]//p[string-length()>10]', # Ruby Inside / Kubrick style
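Review note on the byline stripping in `author` above: the same regular expression is applied either to a single string or to each element of an array, and duplicates are then dropped. A minimal non-mutating sketch of that logic follows. `BYLINE_PREFIX` and `strip_byline` are hypothetical names and the sample bylines are made up. The hunk itself mutates in place with `sub!`, `map!` and `uniq!`.

```ruby
# The pattern from the hunk: an optional "post "/"posted " prefix, then "by",
# then at least one non-word character (space, colon, ...).
BYLINE_PREFIX = /^(post(ed)?\s)?by\W+/i

# Hypothetical helper: returns a cleaned copy instead of mutating its argument.
def strip_byline(author)
  case author
  when String then author.sub(BYLINE_PREFIX, '')
  when Array  then author.map { |a| a.sub(BYLINE_PREFIX, '') }.uniq
  end
end

single   = strip_byline('Posted by Peter Cooper')
coloned  = strip_byline('By: Jane Doe')
# "Byron" is left alone because \W+ needs a non-word character after "by".
untouched = strip_byline('Byron Smith')
many     = strip_byline(['by Jane Doe', 'Jane Doe', 'Posted by Sam'])
```

In the hunk, `uniq!` can return `nil` when nothing was removed, which is harmless there because the method goes on to return `author` rather than the result of the chain.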
@@ -136,10 +187,24 @@ module Pismo
         '#content p',
         '#article p',
         '.post-body',
-        '.entry-content'
-
-
-
+        '.entry-content',
+        '.body p',
+        '.document_description_short p', # Scribd
+        '.single-post p',
+        'p'
+      ], all)
+
+      if lede && String === lede
+        return lede[/^(.*?\.\s){2}/m] || lede
+      elsif lede && Array === lede
+        return lede.map { |l| l.to_s[/^(.*?\.\s){2}/m] || l }.uniq
+      else
+        return body ? body[/^(.*?\.\s){2}/m] : nil
+      end
+    end
+
+    def ledes
+      lede(true)
     end
 
     # Returns the "keywords" in the document (not the meta keywords - they're next to useless now)
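Review note on the lede trimming above: `/^(.*?\.\s){2}/m` keeps text up to and including the second "period followed by whitespace", and the `|| lede` fallback returns the whole string when there are fewer than two such breaks. A small sketch of that behavior, with `FIRST_TWO_SENTENCES` and `shorten` as hypothetical names and made-up sample text:

```ruby
# The pattern from the hunk, given a name for readability.
FIRST_TWO_SENTENCES = /^(.*?\.\s){2}/m

# Hypothetical helper mirroring `lede[...] || lede`.
def shorten(text)
  text[FIRST_TWO_SENTENCES] || text
end

two_of_three = shorten('One. Two. Three.')
too_short    = shorten('Only one sentence here.')
multiline    = shorten("One.\nTwo.\nThree")

# Limitation: any ". " counts as a sentence break, including abbreviations.
abbreviated  = shorten('See e.g. this. Then that. End.')
```

The match keeps the trailing whitespace after the second period, so callers that display the lede may want to strip it.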
@@ -150,7 +215,9 @@ module Pismo
 
       # Convert doc to lowercase, scrub out most HTML tags, then keep track of words
       cached_title = title
-      body.downcase
+      content_to_use = body.to_s.downcase + description.to_s.downcase
+
+      content_to_use.downcase.gsub(/\<[^\>]{1,100}\>/, '').gsub('. ', ' ').gsub(/\&\w+\;/, '').scan(/\b[a-z][a-z\+\.\'\+\#\-]*\b/).each do |word|
         next if word.length > options[:word_length_limit]
         word.gsub!(/\'\w+/, '')
         words[word] ||= 0
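Review note on the keyword pipeline above: the long chained expression lowercases the text, removes short tags, turns sentence breaks into spaces, drops entities, and then scans for word-like tokens. The sketch below splits that chain into steps and counts the tokens. `candidate_words` and `word_counts` are hypothetical names, and the stop-word filtering and `options[:word_length_limit]` check from the real method are deliberately left out.

```ruby
# Hypothetical helper: the same chain as in the hunk, one step per line.
def candidate_words(html)
  html.downcase
      .gsub(/\<[^\>]{1,100}\>/, '')         # strip tags up to 100 characters long
      .gsub('. ', ' ')                      # sentence breaks become plain spaces
      .gsub(/\&\w+\;/, '')                  # drop entities such as &amp;
      .scan(/\b[a-z][a-z\+\.\'\+\#\-]*\b/)  # words may contain + . ' # -
end

# Hypothetical helper: plain frequency count, no stop words or length limit.
def word_counts(html)
  counts = Hash.new(0)
  candidate_words(html).each { |word| counts[word] += 1 }
  counts
end

sample = '<p>Cramp is fast. Cramp &amp; Rack.</p>'
words  = candidate_words(sample)
counts = word_counts(sample)
```

A trailing period is not captured as part of a word, because the final `\b` forces the match to end on a letter.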
@@ -178,9 +245,9 @@ module Pismo
 
     # Returns URL to the site's favicon
     def favicon
-      url = @doc.match(
+      url = @doc.match([['link[@rel="fluid-icon"]', lambda { |el| el.attr('href') }], # Get a Fluid icon if possible..
         ['link[@rel="shortcut icon"]', lambda { |el| el.attr('href') }],
-        ['link[@rel="icon"]', lambda { |el| el.attr('href') }])
+        ['link[@rel="icon"]', lambda { |el| el.attr('href') }]])
       if url && url !~ /^http/ && @url
         url = URI.join(@url , url).to_s
       end
@@ -188,17 +255,30 @@ module Pismo
       url
     end
 
-    # Returns URL of Web feed
-    def feed
-      url = @doc.match(
-        ['link[@type="application/atom+xml"]', lambda { |el| el.attr('href') }]
+    # Returns URL(s) of Web feed(s)
+    def feed(all = false)
+      url = @doc.match([['link[@type="application/rss+xml"]', lambda { |el| el.attr('href') }],
+        ['link[@type="application/atom+xml"]', lambda { |el| el.attr('href') }]], all
       )
 
-      if url && url !~ /^http/ && @url
+      if url && String === url && url !~ /^http/ && @url
         url = URI.join(@url , url).to_s
+      elsif url && Array === url
+        url.map! do |u|
+          if u !~ /^http/ && @url
+            URI.join(@url, u).to_s
+          else
+            u
+          end
+        end
+        url.uniq!
       end
 
       url
     end
+
+    def feeds
+      feed(true)
+    end
   end
 end
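Review note on the feed handling above: when `all` is true, each discovered href is resolved against the document URL with `URI.join` and duplicates are removed afterwards. A standalone sketch of the array branch follows. `absolutize` is a hypothetical name and the example.com URLs are made up.

```ruby
require 'uri'

# Hypothetical helper mirroring the Array branch of `feed`: relative hrefs are
# resolved against base_url, absolute ones pass through, duplicates collapse.
def absolutize(urls, base_url)
  resolved = urls.map do |u|
    if u !~ /^http/ && base_url
      URI.join(base_url, u).to_s
    else
      u
    end
  end
  resolved.uniq
end

page  = 'http://example.com/blog/post.html'
feeds = absolutize(['/feed.xml', 'http://example.com/feed.xml', 'atom.xml'], page)

# Without a known document URL, hrefs are returned unchanged.
unresolved = absolutize(['/feed.xml'], nil)
```

Root-relative and absolute forms of the same feed collapse to one entry only because `uniq` runs after resolution, which is the order the hunk uses as well.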
data/lib/pismo/readability.rb
CHANGED
@@ -9,6 +9,8 @@
 # http://lab.arc90.com/experiments/readability/js/readability.js
 # * Copyright (c) 2009 Arc90 Inc
 # * Readability is licensed under the Apache License, Version 2.0.
+#
+# Minor edits and tweaks by Peter Cooper
 
 require 'nokogiri'
 
@@ -70,6 +72,9 @@ module Readability
 
       sibling_score_threshold = [10, best_candidate[:content_score] * 0.2].max
       output = Nokogiri::XML::Node.new('div', @html)
+
+      return output unless best_candidate[:elem]
+
       best_candidate[:elem].parent.children.each do |sibling|
         append = false
         append = true if sibling == best_candidate[:elem]
@@ -105,7 +110,7 @@ module Readability
       end
 
       best_candidate = sorted_candidates.first || { :elem => @html.css("body").first, :content_score => 0 }
-      debug("Best candidate #{best_candidate[:elem].name}##{best_candidate[:elem][:id]}.#{best_candidate[:elem][:class]} with score #{best_candidate[:content_score]}")
+      #debug("Best candidate #{best_candidate[:elem].name}##{best_candidate[:elem][:id]}.#{best_candidate[:elem][:class]} with score #{best_candidate[:content_score]}")
 
       best_candidate
     end