pismo 0.5.0 → 0.6.0
- data/LICENSE +19 -28
- data/NOTICE +4 -0
- data/README.markdown +37 -40
- data/Rakefile +3 -2
- data/VERSION +1 -1
- data/bin/pismo +15 -7
- data/lib/pismo/document.rb +2 -2
- data/lib/pismo/internal_attributes.rb +23 -16
- data/lib/pismo/reader.rb +390 -0
- data/lib/pismo.rb +3 -2
- data/pismo.gemspec +23 -15
- data/test/corpus/bbcnews2.html +1575 -0
- data/test/corpus/gmane.html +138 -0
- data/test/corpus/metadata_expected.yaml +20 -5
- data/test/corpus/queness.html +919 -0
- data/test/corpus/reader_expected.yaml +45 -0
- data/test/corpus/tweet.html +360 -0
- data/test/corpus/zefrank.html +535 -0
- data/test/test_corpus.rb +9 -1
- metadata +89 -34
- data/lib/pismo/readability.rb +0 -342
- data/test/test_readability.rb +0 -152
data/LICENSE
CHANGED
@@ -1,32 +1,23 @@
-
+Copyright 2009, 2010 Peter Cooper
 
-
-
-
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
 
+http://www.apache.org/licenses/LICENSE-2.0
 
-
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
 
-
-
-
-
-
-
-
-
-
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
+--
+
+In short, you can use Pismo for whatever you like, but please include
+a brief credit somewhere deep in your license file or similar, and,
+if you're a nice kinda person, let me know if you're using it and/or
+share any significant changes or improvements you make.
+
+Peter Cooper
+http://twitter.com/peterc
data/NOTICE
ADDED
data/README.markdown
CHANGED
@@ -1,67 +1,55 @@
 # pismo - Web page content analysis and metadata extraction
-http://github.com/peterc/pismo
 
 ## DESCRIPTION:
 
-Pismo extracts
-
+Pismo extracts machine-usable metadata from unstructured (or poorly structured) English-language HTML documents.
+Data that Pismo can extract include titles, feed URLs, ledes, body text, image URLs, date, and keywords.
+Pismo is used heavily in production on http://coder.io/ to extract data from Web pages.
 
-
-extract the title, the actual "content", and analyze for keywords, among other things.
+All tests pass on Ruby 1.8.7 (MRI) and Ruby 1.9.1-p378 (MRI).
 
-##
+## USAGE:
+
+A basic example of extracting basic metadata from a Web page:
 
 require 'pismo'
 
-# Load a Web page (you
+# Load a Web page (you could pass an IO object or a string with existing HTML data along, as you prefer)
 doc = Pismo::Document.new('http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html')
 
 doc.title # => "Cramp: Asychronous Event-Driven Ruby Web App Framework"
 doc.author # => "Peter Cooper"
-doc.lede # => "Cramp (GitHub repo)is a new, asynchronous evented Web app framework by Pratik Naik of 37signals (and the Rails core team). It's built around Ruby's EventMachine library and was designed to use event-driven I/O throughout - making it ideal for situations where you need to handle a large number of open connections (such as Comet systems or streaming APIs.)"
+doc.lede # => "Cramp (GitHub repo) is a new, asynchronous evented Web app framework by Pratik Naik of 37signals (and the Rails core team). It's built around Ruby's EventMachine library and was designed to use event-driven I/O throughout - making it ideal for situations where you need to handle a large number of open connections (such as Comet systems or streaming APIs.)"
 doc.keywords # => [["cramp", 7], ["controllers", 3], ["app", 3], ["basic", 2], ..., ... ]
 
-
-
-Pismo is a work in progress and is being used heavily in the development of http://coder.io/. Pismo is used in production systems on both Ruby 1.8 and 1.9. I do not know how it fares on JRuby, Rubinius, or others yet.
-
-Planned/forthcoming features include the fetching of "external" data like tags from Delicious, content analysis through 3rd party services, and extraction of graphics from the main article text (for thumbnailing, say).
-
-## NEW IN 0.5.0:
+There's also a shorter "convenience" method which might be handy in IRB - it does the same as Pismo::Document.new:
 
-
-
-You can now access Pismo's stopword list directly:
-
-Pismo.stopwords # => [.., .., ..]
+Pismo['http://www.rubyflow.com/items/4082'].title # => "Install Ruby as a non-root User"
 
-
-
-
-
-url = "http://www.rubyinside.com/the-why-what-and-how-of-rubinius-1-0-s-release-3261.html"
-Pismo[url].title # => "The Why, What, and How of Rubinius 1.0's Release"
-Pismo[url].author # => "Peter Cooper"
+The current metadata methods are #title, #titles, #author, #authors, #lede, #keywords, #sentences(qty), #body, #feed, #feeds, #favicon, #description and #datetime. These are not fully documented here yet, you'll just need to try them out. The plural methods like #titles, #authors, and #feeds will return multiple matches in an array, if present. This is so you can use your own techniques to choose a "best" result in ambiguous cases.
+
+## CAUTIONS / WARNINGS:
 
-
+There are some shortcomings or problems that I'm aware of and am going to pursue:
 
-
+* I do not know how Pismo fares on JRuby, Rubinius, or others yet.
+* The "Reader" content extraction algorithm is not perfect. It can sometimes return crap and can barf on certain types of characters for sentence extraction.
+* The author name extraction is quite poor.
+* The image extraction only handles images with absolute URLs.
+* The stopword list leaves a bit to be desired. It errs on the side of being too long rather than too short, though (1024 words long!)
 
-
-
-
-doc.feeds # => [..., ..., ...]
-
-## COMMAND LINE TOOL:
+## OTHER GROOVY STUFF:
+
+### Command Line Tool
 
 A command line tool called "pismo" is included so that you can get metadata about a page from the command line. This is
 great for testing, or perhaps calling it from a non Ruby script. The output is currently in YAML.
 
-
+#### Usage:
 
 ./bin/pismo http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html title lede author datetime
 
-
+#### Output:
 
 ---
 :url: http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html
@@ -69,6 +57,15 @@ great for testing, or perhaps calling it from a non Ruby script. The output is c
 :lede: Cramp (GitHub repo)is a new, asynchronous evented Web app framework by Pratik Naik of 37signals
 :author: Peter Cooper
 :datetime: 2010-01-07 12:00:00 +00:00
+
+If you call pismo without any arguments (except a URL), it starts an IRB session so you can directly work in Ruby. The URL provided is loaded
+and assigned to both the constant 'P' and the variable @p.
+
+### Stopword access
+
+You can access Pismo's stopword list directly:
+
+Pismo.stopwords # => [.., .., ..]
 
 ## Note on Patches/Pull Requests
 
@@ -81,8 +78,8 @@ great for testing, or perhaps calling it from a non Ruby script. The output is c
 ## COPYRIGHT AND LICENSE
 
 Apache 2.0 License - See LICENSE for details.
+Copyright (c) 2009, 2010 Peter Cooper
 
-
-lib/pismo/readability.rb is Copyright (c) 2009, 2010 Arc90 Inc, starrhorne, and iterationlabs
+In short, you can use Pismo for whatever you like commercial or not, but please include a brief credit (as in the NOTICE file - as per the Apache 2.0 License) somewhere deep in your license file or similar, and, if you're nice and have the time, let me know if you're using it and/or share any significant changes or improvements you make.
 
-
+http://github.com/peterc/pismo
data/Rakefile
CHANGED
@@ -13,9 +13,10 @@ begin
 gem.executables = "pismo"
 gem.default_executable = "pismo"
 gem.add_development_dependency "shoulda", ">= 0"
+gem.add_development_dependency "awesome_print"
+gem.add_dependency "jeweler"
 gem.add_dependency "nokogiri"
-gem.add_dependency "
-gem.add_dependency "httparty"
+gem.add_dependency "sanitize"
 gem.add_dependency "fast-stemmer"
 gem.add_dependency "chronic"
 end
data/VERSION
CHANGED
@@ -1 +1 @@
-0.
+0.6.0
data/bin/pismo
CHANGED
@@ -18,6 +18,7 @@ require 'yaml'
 require 'rubygems'
 $:.unshift(File.dirname(__FILE__) + "/../lib")
 require 'pismo'
+require 'irb'
 
 url = ARGV.shift
 
@@ -27,10 +28,17 @@ end
 
 doc = Pismo.document(url)
 
-
-
-
-
-
-
-
+if ARGV.empty?
+P = doc
+@p = doc
+puts "Pismo has loaded #{url} into @p and P"
+IRB.start
+else
+output = { :url => doc.url }
+
+ARGV.each do |cmd|
+output[cmd.to_sym] = doc.send(cmd) rescue nil
+end
+
+puts output.to_yaml
+end
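The new `else` branch in bin/pismo passes each remaining command-line argument to the document with `send` and prints the collected values as YAML. Here is a minimal sketch of that pattern that runs without Pismo or network access. `FakeDoc` and `collect_attributes` are hypothetical names introduced only for illustration; they are not part of the gem.

```ruby
require 'yaml'

# Hypothetical stand-in for a Pismo document (not part of the gem).
FakeDoc = Struct.new(:url, :title, :author)

# Build a hash of the requested attributes, mirroring the loop in bin/pismo.
# A name the document does not respond to raises NoMethodError, which the
# `rescue nil` modifier swallows, so looking that key up later yields nil.
def collect_attributes(doc, names)
  output = { :url => doc.url }
  names.each do |cmd|
    output[cmd.to_sym] = doc.send(cmd) rescue nil
  end
  output
end

doc = FakeDoc.new('http://example.com/', 'Example title', 'A. Writer')
puts collect_attributes(doc, %w[title author bogus]).to_yaml
```

Because `send` also invokes private methods, `public_send` (Ruby 1.9 and later) may be a safer choice when the method names come straight from user input.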
data/lib/pismo/document.rb
CHANGED
@@ -23,9 +23,9 @@ module Pismo
 
 def load(handle, url = nil)
 @url = url if url
-@url = handle if handle =~ /\Ahttp/
+@url = handle if handle =~ /\Ahttp/i
 
-@html = if handle =~ /\Ahttp/
+@html = if handle =~ /\Ahttp/i
 open(handle).read
 elsif handle.is_a?(StringIO) || handle.is_a?(IO) || handle.is_a?(Tempfile)
 handle.read
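The two changed lines in document.rb make the URL check case-insensitive, so a handle such as `HTTP://EXAMPLE.COM/` is now treated as a URL. Here is a small sketch of the check in isolation. `url_like?` is a hypothetical helper name, and the explicit `String` guard is an addition of this sketch, not something the diff contains.

```ruby
require 'stringio'

# Hypothetical helper: true when the handle is a String that starts with
# "http" in any letter case. The String guard avoids relying on `=~` being
# defined for IO-like handles.
def url_like?(handle)
  handle.is_a?(String) && !(handle =~ /\Ahttp/i).nil?
end

puts url_like?('HTTP://EXAMPLE.COM/')                   # upper-case scheme
puts url_like?('<html><body>http</body></html>')        # raw HTML string
puts url_like?(StringIO.new('http://example.com/'))     # IO-like handle
```

The pattern only anchors on the first four characters, so any string beginning with "http" (including "https://...") is treated as a URL.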
data/lib/pismo/internal_attributes.rb
CHANGED
@@ -24,6 +24,7 @@ module Pismo
 '.title h1',
 '.post h2',
 'h2.title',
+'.entry h2 a',
 '.entry h2', # Common style
 '.boite_titre a',
 ['meta[@name="title"]', lambda { |el| el.attr('content') }],
@@ -66,8 +67,6 @@ module Pismo
 title = @doc.match('title')
 return unless title
 title
-# Strip off any leading or trailing site names - a scrappy way to try it out..
-#title = title.split(/\s+(\-|\||\:)\s+/).sort_by { |i| i.length }.last.to_s.strip
 end
 
 # Return an estimate of when the page/content was created
@@ -209,16 +208,16 @@ module Pismo
 '.entry-content',
 '.body p',
 '.document_description_short p', # Scribd
-'.single-post p'
-'p'
+'.single-post p'
 ], all)
-
+
+# TODO: Improve sentence extraction - this is dire even if it "works for now"
 if lede && String === lede
-return lede[/^(
+return lede[/^(.*?[\.\!\?]\s){2}/m] || lede
 elsif lede && Array === lede
-return lede.map { |l| l.to_s[/^(
+return lede.map { |l| l.to_s[/^(.*?[\.\!\?]\s){2}/m] || l }.uniq
 else
-return
+return reader_doc && !reader_doc.sentences(2).empty? ? reader_doc.sentences(2).join(' ') : nil
 end
 end
 
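The new lede pattern in this hunk, `/^(.*?[\.\!\?]\s){2}/m`, keeps a prefix that ends at the second terminator-plus-whitespace pair, and the code falls back to the whole string when nothing matches. Here is a rough sketch of how it behaves on plain strings. `first_two_sentences` is a hypothetical wrapper written for this note, not a Pismo method.

```ruby
# Pattern copied from the hunk: lazily consume text up to ".", "!" or "?"
# followed by one whitespace character, and require that twice in a row.
TWO_SENTENCES = /^(.*?[\.\!\?]\s){2}/m

# Hypothetical wrapper: the matched prefix, or the input when there is no match.
def first_two_sentences(text)
  text[TWO_SENTENCES] || text
end

p first_two_sentences("Cramp is a new framework. It uses EventMachine. It is fast.")
# => "Cramp is a new framework. It uses EventMachine. "

p first_two_sentences("Only one sentence here.")
# => "Only one sentence here."

p first_two_sentences("Dr. Smith arrived. He sat. Then he left.")
# => "Dr. Smith arrived. "
```

The last call shows the kind of weakness the TODO comment hints at: an abbreviation such as "Dr." counts as a sentence end. The match also keeps its trailing whitespace.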
@@ -226,6 +225,16 @@ module Pismo
 lede(true)
 end
 
+# Returns a string containing the first [limit] sentences as determined by the Reader algorithm
+def sentences(limit = 3)
+reader_doc && !reader_doc.sentences.empty? ? reader_doc.sentences(limit).join(' ') : nil
+end
+
+# Returns any images with absolute URLs in the document
+def images(limit = 3)
+reader_doc && !reader_doc.images.empty? ? reader_doc.images(limit) : nil
+end
+
 # Returns the "keywords" in the document (not the meta keywords - they're next to useless now)
 def keywords(options = {})
 options = { :stem_at => 20, :word_length_limit => 15, :limit => 20 }.merge(options)
@@ -253,15 +262,13 @@ module Pismo
 return w
 end
 
-
+def reader_doc
+@reader_doc ||= Reader::Document.new(@doc.to_s)
+end
+
+# Returns body text as determined by Reader algorithm
 def body
-@body ||=
-
-# HACK: Remove annoying DIV that readability leaves around
-@body.sub!(/\A\<div\>/, '')
-@body.sub!(/\<\/div\>\Z/, '')
-
-return @body
+@body ||= reader_doc.content.strip
 end
 
 # Returns URL to the site's favicon
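The `reader_doc` and `body` methods added in this hunk both rely on `||=` memoization, so the Reader document is built once and then shared by `body`, `sentences` and `images`. Here is a minimal sketch of that pattern. `FakeReader` and `Page` are hypothetical stand-ins for `Reader::Document` and the Pismo document class, and the construction counter exists only to make the caching visible.

```ruby
# Hypothetical stand-in for Reader::Document that counts how often it is built.
class FakeReader
  @built = 0
  class << self
    attr_accessor :built
  end

  def initialize(html)
    self.class.built += 1
    @html = html
  end

  # Pretend "extracted" content, padded so that strip has something to do.
  def content
    "  #{@html}  "
  end
end

# Hypothetical stand-in for the document class touched by the diff.
class Page
  def initialize(html)
    @html = html
  end

  # Built on first use, then reused on every later call.
  def reader_doc
    @reader_doc ||= FakeReader.new(@html)
  end

  # Cached separately, so the strip also runs only once.
  def body
    @body ||= reader_doc.content.strip
  end
end

page = Page.new('hello')
page.body
page.body
page.reader_doc
puts FakeReader.built
```

One caveat of `||=` in general: a `nil` or `false` result is never cached, so a computation that legitimately returns `nil` would run again on every call.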