pismo 0.5.0 → 0.6.0
- data/LICENSE +19 -28
- data/NOTICE +4 -0
- data/README.markdown +37 -40
- data/Rakefile +3 -2
- data/VERSION +1 -1
- data/bin/pismo +15 -7
- data/lib/pismo/document.rb +2 -2
- data/lib/pismo/internal_attributes.rb +23 -16
- data/lib/pismo/reader.rb +390 -0
- data/lib/pismo.rb +3 -2
- data/pismo.gemspec +23 -15
- data/test/corpus/bbcnews2.html +1575 -0
- data/test/corpus/gmane.html +138 -0
- data/test/corpus/metadata_expected.yaml +20 -5
- data/test/corpus/queness.html +919 -0
- data/test/corpus/reader_expected.yaml +45 -0
- data/test/corpus/tweet.html +360 -0
- data/test/corpus/zefrank.html +535 -0
- data/test/test_corpus.rb +9 -1
- metadata +89 -34
- data/lib/pismo/readability.rb +0 -342
- data/test/test_readability.rb +0 -152
data/LICENSE
CHANGED
@@ -1,32 +1,23 @@
-
+Copyright 2009, 2010 Peter Cooper
 
-
-
-
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
 
+http://www.apache.org/licenses/LICENSE-2.0
 
-
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
 
-
-
-
-
-
-
-
-
-
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
+--
+
+In short, you can use Pismo for whatever you like, but please include
+a brief credit somewhere deep in your license file or similar, and,
+if you're a nice kinda person, let me know if you're using it and/or
+share any significant changes or improvements you make.
+
+Peter Cooper
+http://twitter.com/peterc
data/NOTICE
ADDED
data/README.markdown
CHANGED
@@ -1,67 +1,55 @@
 # pismo - Web page content analysis and metadata extraction
-http://github.com/peterc/pismo
 
 ## DESCRIPTION:
 
-Pismo extracts
-
+Pismo extracts machine-usable metadata from unstructured (or poorly structured) English-language HTML documents.
+Data that Pismo can extract include titles, feed URLs, ledes, body text, image URLs, date, and keywords.
+Pismo is used heavily in production on http://coder.io/ to extract data from Web pages.
 
-
-extract the title, the actual "content", and analyze for keywords, among other things.
+All tests pass on Ruby 1.8.7 (MRI) and Ruby 1.9.1-p378 (MRI).
 
-##
+## USAGE:
+
+A basic example of extracting basic metadata from a Web page:
 
 require 'pismo'
 
-# Load a Web page (you
+# Load a Web page (you could pass an IO object or a string with existing HTML data along, as you prefer)
 doc = Pismo::Document.new('http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html')
 
 doc.title # => "Cramp: Asychronous Event-Driven Ruby Web App Framework"
 doc.author # => "Peter Cooper"
-doc.lede # => "Cramp (GitHub repo)is a new, asynchronous evented Web app framework by Pratik Naik of 37signals (and the Rails core team). It's built around Ruby's EventMachine library and was designed to use event-driven I/O throughout - making it ideal for situations where you need to handle a large number of open connections (such as Comet systems or streaming APIs.)"
+doc.lede # => "Cramp (GitHub repo) is a new, asynchronous evented Web app framework by Pratik Naik of 37signals (and the Rails core team). It's built around Ruby's EventMachine library and was designed to use event-driven I/O throughout - making it ideal for situations where you need to handle a large number of open connections (such as Comet systems or streaming APIs.)"
 doc.keywords # => [["cramp", 7], ["controllers", 3], ["app", 3], ["basic", 2], ..., ... ]
 
-
-
-Pismo is a work in progress and is being used heavily in the development of http://coder.io/. Pismo is used in production systems on both Ruby 1.8 and 1.9. I do not know how it fares on JRuby, Rubinius, or others yet.
-
-Planned/forthcoming features include the fetching of "external" data like tags from Delicious, content analysis through 3rd party services, and extraction of graphics from the main article text (for thumbnailing, say).
-
-## NEW IN 0.5.0:
+There's also a shorter "convenience" method which might be handy in IRB - it does the same as Pismo::Document.new:
 
-
-
-You can now access Pismo's stopword list directly:
-
-Pismo.stopwords # => [.., .., ..]
+Pismo['http://www.rubyflow.com/items/4082'].title # => "Install Ruby as a non-root User"
 
-
-
-
-
-url = "http://www.rubyinside.com/the-why-what-and-how-of-rubinius-1-0-s-release-3261.html"
-Pismo[url].title # => "The Why, What, and How of Rubinius 1.0's Release"
-Pismo[url].author # => "Peter Cooper"
+The current metadata methods are #title, #titles, #author, #authors, #lede, #keywords, #sentences(qty), #body, #feed, #feeds, #favicon, #description and #datetime. These are not fully documented here yet, you'll just need to try them out. The plural methods like #titles, #authors, and #feeds will return multiple matches in an array, if present. This is so you can use your own techniques to choose a "best" result in ambiguous cases.
+
+## CAUTIONS / WARNINGS:
 
-
+There are some shortcomings or problems that I'm aware of and am going to pursue:
 
-
+* I do not know how Pismo fares on JRuby, Rubinius, or others yet.
+* The "Reader" content extraction algorithm is not perfect. It can sometimes return crap and can barf on certain types of characters for sentence extraction.
+* The author name extraction is quite poor.
+* The image extraction only handles images with absolute URLs.
+* The stopword list leaves a bit to be desired. It errs on the side of being too long rather than too short, though (1024 words long!)
 
-
-
-
-doc.feeds # => [..., ..., ...]
-
-## COMMAND LINE TOOL:
+## OTHER GROOVY STUFF:
+
+### Command Line Tool
 
 A command line tool called "pismo" is included so that you can get metadata about a page from the command line. This is
 great for testing, or perhaps calling it from a non Ruby script. The output is currently in YAML.
 
-
+#### Usage:
 
 ./bin/pismo http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html title lede author datetime
 
-
+#### Output:
 
 ---
 :url: http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html
@@ -69,6 +57,15 @@ great for testing, or perhaps calling it from a non Ruby script. The output is c
 :lede: Cramp (GitHub repo)is a new, asynchronous evented Web app framework by Pratik Naik of 37signals
 :author: Peter Cooper
 :datetime: 2010-01-07 12:00:00 +00:00
+
+If you call pismo without any arguments (except a URL), it starts an IRB session so you can directly work in Ruby. The URL provided is loaded
+and assigned to both the constant 'P' and the variable @p.
+
+### Stopword access
+
+You can access Pismo's stopword list directly:
+
+Pismo.stopwords # => [.., .., ..]
 
 ## Note on Patches/Pull Requests
 
@@ -81,8 +78,8 @@ great for testing, or perhaps calling it from a non Ruby script. The output is c
 ## COPYRIGHT AND LICENSE
 
 Apache 2.0 License - See LICENSE for details.
+Copyright (c) 2009, 2010 Peter Cooper
 
-
-lib/pismo/readability.rb is Copyright (c) 2009, 2010 Arc90 Inc, starrhorne, and iterationlabs
+In short, you can use Pismo for whatever you like commercial or not, but please include a brief credit (as in the NOTICE file - as per the Apache 2.0 License) somewhere deep in your license file or similar, and, if you're nice and have the time, let me know if you're using it and/or share any significant changes or improvements you make.
 
-
+http://github.com/peterc/pismo
data/Rakefile
CHANGED
@@ -13,9 +13,10 @@ begin
   gem.executables = "pismo"
   gem.default_executable = "pismo"
   gem.add_development_dependency "shoulda", ">= 0"
+  gem.add_development_dependency "awesome_print"
+  gem.add_dependency "jeweler"
   gem.add_dependency "nokogiri"
-  gem.add_dependency "
-  gem.add_dependency "httparty"
+  gem.add_dependency "sanitize"
   gem.add_dependency "fast-stemmer"
   gem.add_dependency "chronic"
 end
data/VERSION
CHANGED
@@ -1 +1 @@
-0.
+0.6.0
data/bin/pismo
CHANGED
@@ -18,6 +18,7 @@ require 'yaml'
 require 'rubygems'
 $:.unshift(File.dirname(__FILE__) + "/../lib")
 require 'pismo'
+require 'irb'
 
 url = ARGV.shift
 
@@ -27,10 +28,17 @@ end
 
 doc = Pismo.document(url)
 
-
-
-
-
-
-
-
+if ARGV.empty?
+  P = doc
+  @p = doc
+  puts "Pismo has loaded #{url} into @p and P"
+  IRB.start
+else
+  output = { :url => doc.url }
+
+  ARGV.each do |cmd|
+    output[cmd.to_sym] = doc.send(cmd) rescue nil
+  end
+
+  puts output.to_yaml
+end
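The else-branch above treats every remaining command-line argument as a method name, calls it on the document, and prints the collected values as YAML. The following is a minimal sketch of that dispatch pattern that runs without Pismo installed. `FakeDoc` and `collect_metadata` are hypothetical names used only for illustration; the real script works on the object returned by `Pismo.document(url)`.

```ruby
require 'yaml'

# Hypothetical stand-in for a Pismo document, so the sketch needs no network or gem.
FakeDoc = Struct.new(:url, :title, :author)

# Same shape as the loop in bin/pismo: each argument names a method to call.
# A failing call (for example an unknown method) is swallowed by the rescue
# modifier, so the value for that key reads back as nil.
def collect_metadata(doc, commands)
  output = { :url => doc.url }
  commands.each do |cmd|
    output[cmd.to_sym] = doc.send(cmd) rescue nil
  end
  output
end

doc = FakeDoc.new("http://example.com/", "Example", "Jane Doe")
result = collect_metadata(doc, %w[title author no_such_method])
puts result.to_yaml
```

One design point worth noting: `send` will invoke any method the object responds to, including private ones, so a stricter variant might check each argument against a list of known metadata methods before dispatching.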
data/lib/pismo/document.rb
CHANGED
@@ -23,9 +23,9 @@ module Pismo
 
 def load(handle, url = nil)
   @url = url if url
-  @url = handle if handle =~ /\Ahttp/
+  @url = handle if handle =~ /\Ahttp/i
 
-  @html = if handle =~ /\Ahttp/
+  @html = if handle =~ /\Ahttp/i
     open(handle).read
   elsif handle.is_a?(StringIO) || handle.is_a?(IO) || handle.is_a?(Tempfile)
     handle.read
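The only change in this hunk is the `i` flag on the URL check, which makes the match case-insensitive. A small sketch of the difference follows; the constant names and sample strings are made up for illustration.

```ruby
# The two patterns from the hunk above, before and after the change.
OLD_CHECK = /\Ahttp/
NEW_CHECK = /\Ahttp/i

samples = [
  "http://example.com/",    # matched by both
  "HTTP://EXAMPLE.COM/",    # matched only by the case-insensitive pattern
  "<html>http</html>"       # matched by neither: \A anchors at the start of the string
]

samples.each do |handle|
  old_hit = !(handle =~ OLD_CHECK).nil?
  new_hit = !(handle =~ NEW_CHECK).nil?
  puts "#{handle.inspect}: old=#{old_hit} new=#{new_hit}"
end
```

Both patterns accept any string that merely begins with "http", so a raw HTML string starting with those four letters would still be treated as a URL.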
data/lib/pismo/internal_attributes.rb
CHANGED
@@ -24,6 +24,7 @@ module Pismo
 '.title h1',
 '.post h2',
 'h2.title',
+'.entry h2 a',
 '.entry h2', # Common style
 '.boite_titre a',
 ['meta[@name="title"]', lambda { |el| el.attr('content') }],
@@ -66,8 +67,6 @@ module Pismo
   title = @doc.match('title')
   return unless title
   title
-  # Strip off any leading or trailing site names - a scrappy way to try it out..
-  #title = title.split(/\s+(\-|\||\:)\s+/).sort_by { |i| i.length }.last.to_s.strip
 end
 
 # Return an estimate of when the page/content was created
@@ -209,16 +208,16 @@ module Pismo
     '.entry-content',
     '.body p',
     '.document_description_short p', # Scribd
-    '.single-post p'
-    'p'
+    '.single-post p'
   ], all)
-
+
+  # TODO: Improve sentence extraction - this is dire even if it "works for now"
   if lede && String === lede
-    return lede[/^(
+    return lede[/^(.*?[\.\!\?]\s){2}/m] || lede
   elsif lede && Array === lede
-    return lede.map { |l| l.to_s[/^(
+    return lede.map { |l| l.to_s[/^(.*?[\.\!\?]\s){2}/m] || l }.uniq
   else
-    return
+    return reader_doc && !reader_doc.sentences(2).empty? ? reader_doc.sentences(2).join(' ') : nil
   end
 end
 
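The expression added in the hunk above, `/^(.*?[\.\!\?]\s){2}/m`, keeps everything up to and including the second sentence terminator that is followed by whitespace, and the `|| lede` fallback returns the whole string when no such match exists. A self-contained sketch of that behaviour follows; the helper name `first_two_sentences` and the sample strings are illustrative and not part of Pismo.

```ruby
# Regex copied from the added lines in the hunk above.
LEDE_RE = /^(.*?[\.\!\?]\s){2}/m

# Hypothetical helper: returns the matched prefix, or the input when nothing matches.
def first_two_sentences(text)
  text[LEDE_RE] || text
end

sample = "Cramp is a framework. It uses EventMachine. It handles many connections."

first_two_sentences(sample)
# => "Cramp is a framework. It uses EventMachine. "  (note the trailing space)

first_two_sentences("No terminator here")
# => "No terminator here"  (fallback to the whole string)

first_two_sentences("Dr. Smith arrived. He left.")
# => "Dr. Smith arrived. "  (the abbreviation counts as a sentence end)
```

The last case shows the kind of weakness the TODO comment alludes to: any period followed by whitespace is treated as a sentence boundary.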
@@ -226,6 +225,16 @@ module Pismo
   lede(true)
 end
 
+# Returns a string containing the first [limit] sentences as determined by the Reader algorithm
+def sentences(limit = 3)
+  reader_doc && !reader_doc.sentences.empty? ? reader_doc.sentences(limit).join(' ') : nil
+end
+
+# Returns any images with absolute URLs in the document
+def images(limit = 3)
+  reader_doc && !reader_doc.images.empty? ? reader_doc.images(limit) : nil
+end
+
 # Returns the "keywords" in the document (not the meta keywords - they're next to useless now)
 def keywords(options = {})
   options = { :stem_at => 20, :word_length_limit => 15, :limit => 20 }.merge(options)
@@ -253,15 +262,13 @@ module Pismo
   return w
 end
 
-
+def reader_doc
+  @reader_doc ||= Reader::Document.new(@doc.to_s)
+end
+
+# Returns body text as determined by Reader algorithm
 def body
-  @body ||=
-
-  # HACK: Remove annoying DIV that readability leaves around
-  @body.sub!(/\A\<div\>/, '')
-  @body.sub!(/\<\/div\>\Z/, '')
-
-  return @body
+  @body ||= reader_doc.content.strip
 end
 
 # Returns URL to the site's favicon
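The hunks above route `body`, `sentences`, and the `lede` fallback through a single memoized `reader_doc`, so the Reader document is built once per page no matter how many accessors are called. The sketch below illustrates that pattern under stated assumptions: `FakeReader` is a crude hypothetical stand-in for `Reader::Document` (whose real implementation in lib/pismo/reader.rb is not shown in this diff), and `Page` with its `builds` counter exists only to make the caching visible.

```ruby
# Hypothetical stand-in for Reader::Document; the tag stripping and sentence
# splitting here are deliberately naive and are NOT the Reader algorithm.
class FakeReader
  attr_reader :content

  def initialize(html)
    @content = html.gsub(/<[^>]+>/, ' ')
  end

  # Returns up to `limit` sentences as an array of strings.
  def sentences(limit = 3)
    @content.scan(/[^.!?]+[.!?]/).map { |s| s.strip }.first(limit)
  end
end

# Hypothetical host class mirroring the accessor shapes from the diff.
class Page
  attr_reader :builds

  def initialize(html)
    @html = html
    @builds = 0
  end

  # Built on first use, then served from the instance variable.
  def reader_doc
    @reader_doc ||= begin
      @builds += 1
      FakeReader.new(@html)
    end
  end

  # Same guard as the diff: nil when no sentences were found.
  def sentences(limit = 3)
    reader_doc && !reader_doc.sentences.empty? ? reader_doc.sentences(limit).join(' ') : nil
  end

  def body
    @body ||= reader_doc.content.strip
  end
end

page = Page.new("<p>First one. Second one. Third one. Fourth one.</p>")
page.sentences(2)  # => "First one. Second one."
page.body          # => "First one. Second one. Third one. Fourth one."
page.builds        # => 1
```

Because `||=` only assigns when the variable is nil or false, a constructor that raised would be retried on the next call, while a successfully built document is reused for the lifetime of the object.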