pismo 0.5.0 → 0.6.0

data/LICENSE CHANGED
@@ -1,32 +1,23 @@
- All EXCEPT the lib/pismo/readability.rb file:
+ Copyright 2009, 2010 Peter Cooper
 
- Copyright 2009, 2010 Peter Cooper
-
- Licensed under the Apache License, Version 2.0 (the "License");
- you may not use this file except in compliance with the License.
- You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
 
+ http://www.apache.org/licenses/LICENSE-2.0
 
- For lib/pismo/readability.rb:
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
 
- Copyright 2009, 2010 Arc90 Inc, starrhorne, and iterationlabs
-
- Licensed under the Apache License, Version 2.0 (the "License");
- you may not use this file except in compliance with the License.
- You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
+ --
+
+ In short, you can use Pismo for whatever you like, but please include
+ a brief credit somewhere deep in your license file or similar, and,
+ if you're a nice kinda person, let me know if you're using it and/or
+ share any significant changes or improvements you make.
+
+ Peter Cooper
+ http://twitter.com/peterc
data/NOTICE ADDED
@@ -0,0 +1,4 @@
+ Pismo is Copyright (c) 2009, 2010 Peter Cooper
+ Pismo is Apache 2.0 Licensed
+ Peter Cooper can be found at and contacted via http://twitter.com/peterc
+ The source can be found at http://github.com/peterc/pismo
data/README.markdown CHANGED
@@ -1,67 +1,55 @@
  # pismo - Web page content analysis and metadata extraction
- http://github.com/peterc/pismo
 
  ## DESCRIPTION:
 
- Pismo extracts metadata and machine-usable data from mostly unstructured (or poorly structured)
- English-language HTML documents. These data include titles, feed URLs, ledes, body text, graphics, date, and keywords.
+ Pismo extracts machine-usable metadata from unstructured (or poorly structured) English-language HTML documents.
+ Data that Pismo can extract include titles, feed URLs, ledes, body text, image URLs, date, and keywords.
+ Pismo is used heavily in production on http://coder.io/ to extract data from Web pages.
 
- For example, if you have a blog post HTML file, Pismo, in theory, should
- extract the title, the actual "content", and analyze for keywords, among other things.
+ All tests pass on Ruby 1.8.7 (MRI) and Ruby 1.9.1-p378 (MRI).
 
- ## EXAMPLES:
+ ## USAGE:
+
+ A basic example of extracting basic metadata from a Web page:
 
  require 'pismo'
 
- # Load a Web page (you can pass an IO object or a string with existing HTML data along too, if you prefer)
+ # Load a Web page (you could pass an IO object or a string with existing HTML data along, as you prefer)
  doc = Pismo::Document.new('http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html')
 
  doc.title # => "Cramp: Asychronous Event-Driven Ruby Web App Framework"
  doc.author # => "Peter Cooper"
- doc.lede # => "Cramp (GitHub repo)is a new, asynchronous evented Web app framework by Pratik Naik of 37signals (and the Rails core team). It's built around Ruby's EventMachine library and was designed to use event-driven I/O throughout - making it ideal for situations where you need to handle a large number of open connections (such as Comet systems or streaming APIs.)"
+ doc.lede # => "Cramp (GitHub repo) is a new, asynchronous evented Web app framework by Pratik Naik of 37signals (and the Rails core team). It's built around Ruby's EventMachine library and was designed to use event-driven I/O throughout - making it ideal for situations where you need to handle a large number of open connections (such as Comet systems or streaming APIs.)"
  doc.keywords # => [["cramp", 7], ["controllers", 3], ["app", 3], ["basic", 2], ..., ... ]
 
- ## STATUS:
-
- Pismo is a work in progress and is being used heavily in the development of http://coder.io/. Pismo is used in production systems on both Ruby 1.8 and 1.9. I do not know how it fares on JRuby, Rubinius, or others yet.
-
- Planned/forthcoming features include the fetching of "external" data like tags from Delicious, content analysis through 3rd party services, and extraction of graphics from the main article text (for thumbnailing, say).
-
- ## NEW IN 0.5.0:
+ There's also a shorter "convenience" method which might be handy in IRB - it does the same as Pismo::Document.new:
 
- ### Stopword access
-
- You can now access Pismo's stopword list directly:
-
- Pismo.stopwords # => [.., .., ..]
+ Pismo['http://www.rubyflow.com/items/4082'].title # => "Install Ruby as a non-root User"
 
- ### Convenience access method for IRB/debugging use
-
- Now you can get playing with Pismo faster. This is primarily useful for debugging/playing in IRB as it just uses open-uri and the Pismo document is cached in the class against the URL:
-
- url = "http://www.rubyinside.com/the-why-what-and-how-of-rubinius-1-0-s-release-3261.html"
- Pismo[url].title # => "The Why, What, and How of Rubinius 1.0's Release"
- Pismo[url].author # => "Peter Cooper"
+ The current metadata methods are #title, #titles, #author, #authors, #lede, #keywords, #sentences(qty), #body, #feed, #feeds, #favicon, #description and #datetime. These are not fully documented here yet, you'll just need to try them out. The plural methods like #titles, #authors, and #feeds will return multiple matches in an array, if present. This is so you can use your own techniques to choose a "best" result in ambiguous cases.
+
+ ## CAUTIONS / WARNINGS:
 
- ### Arrays of all matches for titles, ledes, authors, and feeds
+ There are some shortcomings or problems that I'm aware of and am going to pursue:
 
- Pismo is not perfect and you might like to instead see all of the potential titles/ledes/authors or feeds that Pismo can find. You can now do this and judge them by your metrics.
+ * I do not know how Pismo fares on JRuby, Rubinius, or others yet.
+ * The "Reader" content extraction algorithm is not perfect. It can sometimes return crap and can barf on certain types of characters for sentence extraction.
+ * The author name extraction is quite poor.
+ * The image extraction only handles images with absolute URLs.
+ * The stopword list leaves a bit to be desired. It errs on the side of being too long rather than too short, though (1024 words long!)
 
- doc.titles # => [..., ..., ...]
- doc.ledes # => [..., ..., ...]
- doc.authors # => [..., ..., ...]
- doc.feeds # => [..., ..., ...]
-
- ## COMMAND LINE TOOL:
+ ## OTHER GROOVY STUFF:
+
+ ### Command Line Tool
 
  A command line tool called "pismo" is included so that you can get metadata about a page from the command line. This is
  great for testing, or perhaps calling it from a non Ruby script. The output is currently in YAML.
 
- ### Usage:
+ #### Usage:
 
  ./bin/pismo http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html title lede author datetime
 
- ### Output:
+ #### Output:
 
  ---
  :url: http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html
@@ -69,6 +57,15 @@ great for testing, or perhaps calling it from a non Ruby script. The output is c
  :lede: Cramp (GitHub repo)is a new, asynchronous evented Web app framework by Pratik Naik of 37signals
  :author: Peter Cooper
  :datetime: 2010-01-07 12:00:00 +00:00
+
+ If you call pismo without any arguments (except a URL), it starts an IRB session so you can directly work in Ruby. The URL provided is loaded
+ and assigned to both the constant 'P' and the variable @p.
+
+ ### Stopword access
+
+ You can access Pismo's stopword list directly:
+
+ Pismo.stopwords # => [.., .., ..]
 
  ## Note on Patches/Pull Requests
 
@@ -81,8 +78,8 @@ great for testing, or perhaps calling it from a non Ruby script. The output is c
  ## COPYRIGHT AND LICENSE
 
  Apache 2.0 License - See LICENSE for details.
+ Copyright (c) 2009, 2010 Peter Cooper
 
- All except lib/pismo/readability.rb is Copyright (c) 2009, 2010 Peter Cooper
- lib/pismo/readability.rb is Copyright (c) 2009, 2010 Arc90 Inc, starrhorne, and iterationlabs
+ In short, you can use Pismo for whatever you like commercial or not, but please include a brief credit (as in the NOTICE file - as per the Apache 2.0 License) somewhere deep in your license file or similar, and, if you're nice and have the time, let me know if you're using it and/or share any significant changes or improvements you make.
 
- The readability stuff was ganked from http://github.com/iterationlabs/ruby-readability - sorry! I have respected the license, however. I have promised to contribute back to them directly and, hopefully, use that library as a regular dependency. But.. this takes time.
+ http://github.com/peterc/pismo
data/Rakefile CHANGED
@@ -13,9 +13,10 @@ begin
  gem.executables = "pismo"
  gem.default_executable = "pismo"
  gem.add_development_dependency "shoulda", ">= 0"
+ gem.add_development_dependency "awesome_print"
+ gem.add_dependency "jeweler"
  gem.add_dependency "nokogiri"
- gem.add_dependency "loofah"
- gem.add_dependency "httparty"
+ gem.add_dependency "sanitize"
  gem.add_dependency "fast-stemmer"
  gem.add_dependency "chronic"
  end
data/VERSION CHANGED
@@ -1 +1 @@
- 0.5.0
+ 0.6.0
data/bin/pismo CHANGED
@@ -18,6 +18,7 @@ require 'yaml'
  require 'rubygems'
  $:.unshift(File.dirname(__FILE__) + "/../lib")
  require 'pismo'
+ require 'irb'
 
  url = ARGV.shift
 
@@ -27,10 +28,17 @@ end
 
  doc = Pismo.document(url)
 
- output = { :url => doc.url }
-
- (ARGV.empty? ? Pismo::Document::ATTRIBUTE_METHODS : ARGV).each do |cmd|
- output[cmd.to_sym] = doc.send(cmd) rescue nil
- end
-
- puts output.to_yaml
+ if ARGV.empty?
+ P = doc
+ @p = doc
+ puts "Pismo has loaded #{url} into @p and P"
+ IRB.start
+ else
+ output = { :url => doc.url }
+
+ ARGV.each do |cmd|
+ output[cmd.to_sym] = doc.send(cmd) rescue nil
+ end
+
+ puts output.to_yaml
+ end
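The attribute-dispatch half of this change can be sketched on its own. This is a minimal illustration, not the gem's code: `FakeDoc` and `build_output` are hypothetical names standing in for a Pismo document and the `else` branch above, so the snippet runs without the gem or network access. It uses `public_send` with an explicit `rescue` where the script uses `send ... rescue nil`; that substitution is my assumption, made only to keep the failure case easy to read.

```ruby
require 'yaml'

# Hypothetical stand-in for a Pismo document; the script gets the real one
# from Pismo.document(url).
FakeDoc = Struct.new(:url, :title, :author)

# Call each requested attribute by name and record nil when the call fails
# (for example, when the attribute does not exist).
def build_output(doc, cmds)
  output = { :url => doc.url }
  cmds.each do |cmd|
    output[cmd.to_sym] = begin
      doc.public_send(cmd)
    rescue StandardError
      nil
    end
  end
  output
end

doc = FakeDoc.new('http://example.com/post', 'A Post', 'Someone')
result = build_output(doc, %w[title author bogus])
puts result.to_yaml
```

An unknown attribute such as `bogus` ends up as a `nil` entry in the YAML rather than aborting the run, which matches the intent of the `rescue nil` in the script.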
@@ -23,9 +23,9 @@ module Pismo
 
  def load(handle, url = nil)
  @url = url if url
- @url = handle if handle =~ /\Ahttp/
+ @url = handle if handle =~ /\Ahttp/i
 
- @html = if handle =~ /\Ahttp/
+ @html = if handle =~ /\Ahttp/i
  open(handle).read
  elsif handle.is_a?(StringIO) || handle.is_a?(IO) || handle.is_a?(Tempfile)
  handle.read
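The effect of adding the `i` flag is easy to check in isolation. The sketch below is an illustration only: `url_handle?` is a hypothetical helper, and the `is_a?(String)` guard is my addition so that non-string handles such as IO objects are never matched against the pattern.

```ruby
# Hypothetical helper: true when a handle is a string that starts with "http",
# compared case-insensitively as in the updated pattern.
def url_handle?(handle)
  handle.is_a?(String) && !(handle =~ /\Ahttp/i).nil?
end

lower = url_handle?('http://example.com/')
upper = url_handle?('HTTP://EXAMPLE.COM/')
markup = url_handle?('<html><body>hi</body></html>')

puts lower
puts upper
puts markup
```

Note that the pattern only inspects the prefix, so any string beginning with "http" is treated as a URL, whether or not a valid scheme follows.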
@@ -24,6 +24,7 @@ module Pismo
  '.title h1',
  '.post h2',
  'h2.title',
+ '.entry h2 a',
  '.entry h2', # Common style
  '.boite_titre a',
  ['meta[@name="title"]', lambda { |el| el.attr('content') }],
@@ -66,8 +67,6 @@ module Pismo
  title = @doc.match('title')
  return unless title
  title
- # Strip off any leading or trailing site names - a scrappy way to try it out..
- #title = title.split(/\s+(\-|\||\:)\s+/).sort_by { |i| i.length }.last.to_s.strip
  end
 
  # Return an estimate of when the page/content was created
@@ -209,16 +208,16 @@ module Pismo
  '.entry-content',
  '.body p',
  '.document_description_short p', # Scribd
- '.single-post p',
- 'p'
+ '.single-post p'
  ], all)
-
+
+ # TODO: Improve sentence extraction - this is dire even if it "works for now"
  if lede && String === lede
- return lede[/^(.*?\.\s){2}/m] || lede
+ return lede[/^(.*?[\.\!\?]\s){2}/m] || lede
  elsif lede && Array === lede
- return lede.map { |l| l.to_s[/^(.*?\.\s){2}/m] || l }.uniq
+ return lede.map { |l| l.to_s[/^(.*?[\.\!\?]\s){2}/m] || l }.uniq
  else
- return body ? body[/^(.*?\.\s){2}/m] : nil
+ return reader_doc && !reader_doc.sentences(2).empty? ? reader_doc.sentences(2).join(' ') : nil
  end
  end
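To see what the widened sentence-boundary pattern changes, the two regular expressions from this hunk can be run against the same text. The patterns are copied from the diff; the sample text and the constant names are made up for illustration.

```ruby
# Patterns copied from the removed and added lines of the hunk.
OLD_PATTERN = /^(.*?\.\s){2}/m
NEW_PATTERN = /^(.*?[\.\!\?]\s){2}/m

# Made-up sample: the first two sentences end in "?" and "!".
text = "Is this new? Yes it is! It has a third sentence. And a fourth. The end."

old_lede = text[OLD_PATTERN]
new_lede = text[NEW_PATTERN]

puts old_lede.inspect
puts new_lede.inspect
```

With the old pattern only a period ends a sentence, so the match runs on until it has seen two periods and returns four sentences. The new pattern stops after the question and the exclamation. Both results keep the trailing space consumed by `\s`.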
 
@@ -226,6 +225,16 @@ module Pismo
  lede(true)
  end
 
+ # Returns a string containing the first [limit] sentences as determined by the Reader algorithm
+ def sentences(limit = 3)
+ reader_doc && !reader_doc.sentences.empty? ? reader_doc.sentences(limit).join(' ') : nil
+ end
+
+ # Returns any images with absolute URLs in the document
+ def images(limit = 3)
+ reader_doc && !reader_doc.images.empty? ? reader_doc.images(limit) : nil
+ end
+
  # Returns the "keywords" in the document (not the meta keywords - they're next to useless now)
  def keywords(options = {})
  options = { :stem_at => 20, :word_length_limit => 15, :limit => 20 }.merge(options)
@@ -253,15 +262,13 @@ module Pismo
  return w
  end
 
- # Returns body text as determined by Arc90's Readability algorithm
+ def reader_doc
+ @reader_doc ||= Reader::Document.new(@doc.to_s)
+ end
+
+ # Returns body text as determined by Reader algorithm
  def body
- @body ||= Readability::Document.new(@doc.to_s).content.strip
-
- # HACK: Remove annoying DIV that readability leaves around
- @body.sub!(/\A\<div\>/, '')
- @body.sub!(/\<\/div\>\Z/, '')
-
- return @body
+ @body ||= reader_doc.content.strip
  end
 
  # Returns URL to the site's favicon
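The `reader_doc` accessor added here memoizes one parsed document that `body`, `sentences` and `images` then share. The following is a rough sketch of that pattern under stated assumptions: `FakeReader` and `Page` are hypothetical classes, and the tag stripping and sentence splitting inside `FakeReader` are crude placeholders, not the Reader algorithm.

```ruby
# Hypothetical stand-in for Reader::Document; it counts how often it is built.
class FakeReader
  @builds = 0
  class << self
    attr_accessor :builds
  end

  def initialize(html)
    self.class.builds += 1
    # Placeholder extraction: replace tags with spaces and trim.
    @text = html.gsub(/<[^>]+>/, ' ').strip
  end

  def content
    @text
  end

  # Placeholder sentence splitter, returns at most `limit` sentences.
  def sentences(limit = 3)
    @text.scan(/[^.!?]+[.!?]/).map(&:strip).first(limit)
  end
end

class Page
  def initialize(html)
    @html = html
  end

  # Built on first use, then reused by every method below.
  def reader_doc
    @reader_doc ||= FakeReader.new(@html)
  end

  def body
    @body ||= reader_doc.content.strip
  end

  # Same guard shape as the diff: nil when no sentences were found.
  def sentences(limit = 3)
    reader_doc && !reader_doc.sentences.empty? ? reader_doc.sentences(limit).join(' ') : nil
  end
end

page = Page.new('<p>One. Two! Three? Four.</p>')
puts page.body
puts page.sentences(2)
puts FakeReader.builds
```

Calling `body` and `sentences` on the same page builds the stand-in reader only once, which is the point of routing both through the memoized accessor.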