RubyGems - pismo - Versions diffs - 0.2.3 → 0.4.0 - Mend

pismo 0.2.3 → 0.4.0

Files changed (10)

data/README.rdoc +25 -20
data/VERSION +1 -1
data/lib/pismo/document.rb +3 -3
data/lib/pismo/internal_attributes.rb +127 -47
data/lib/pismo/readability.rb +6 -1
data/lib/pismo/stopwords.txt +452 -326
data/lib/pismo.rb +10 -4
data/pismo.gemspec +2 -2
data/test/corpus/metadata_expected.yaml +17 -0
metadata +2 -2

data/README.rdoc CHANGED

@@ -2,35 +2,40 @@
 * http://github.com/peterc/pismo
-== STATUS:
-pismo is a VERY NEW project developed for use on http://coder.io/ - my forthcoming developer news aggregator. pismo is FAR FROM COMPLETE. If you're brave, you can have a PLAY with it as the examples below and those in the test suite/corpus do work - all tests pass.
-The prime missing features so far are the "external attributes" - where calls are made to external services like Delicious, Yahoo, Bing, etc, for getting third party data about documents. The structures are there but I'm still deciding how best to integrate these ideas.
 == DESCRIPTION:
-Pismo extracts metadata and machine-usable data from otherwise unstructured
-HTML documents, including titles, body text, graphics, date, and keywords.
-For example, if you have a blog post HTML file, Pismo should, in theory, be
-able to extract the title, the actual "content", images relating to the
-content, look up Delicious tags, and analyze for keywords.
+Pismo extracts metadata and machine-usable data from mostly unstructured (or poorly structured)
+HTML documents. These data include titles, feed URLs, ledes, body text, graphics, date, and keywords.
-Pismo only understands English. Je suis desolé.
+For example, if you have a blog post HTML file, Pismo, in theory, should
+extract the title, the actual "content", and analyze for keywords, among other things.
-== SYNOPSIS:
+Pismo only understands (and much prefers) English. Je suis desolé.
-* Basic demo:
+== EXAMPLES:
-    require 'open-uri'
     require 'pismo'
-    doc = Pismo::Document.new(open('http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html'))
+    # Load a Web page (you can pass an IO object or a string with existing HTML data along too, if you prefer)
+    doc = Pismo::Document.new('http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html')
     doc.title     # => "Cramp: Asychronous Event-Driven Ruby Web App Framework"
     doc.author    # => "Peter Cooper"
     doc.lede      # => "Cramp (GitHub repo)is a new, asynchronous evented Web app framework by Pratik Naik of 37signals (and the Rails core team). It's built around Ruby's EventMachine library and was designed to use event-driven I/O throughout - making it ideal for situations where you need to handle a large number of open connections (such as Comet systems or streaming APIs.)"
     doc.keywords  # => [["cramp", 7], ["controllers", 3], ["app", 3], ["basic", 2], ..., ... ]
+== NEW IN 0.4.0:
+  Pismo is not perfect and you might like to instead see all of the potential titles/ledes/authors or feeds that Pismo can find. You can now do this and judge them by your metrics.
+    doc.titles    # => [..., ..., ...]
+    doc.ledes    # => [..., ..., ...]
+    doc.authors    # => [..., ..., ...]
+    doc.feeds    # => [..., ..., ...]
+== STATUS:
+Pismo is a work in progress and is being used heavily in the development of http://coder.io/. Pismo is used in production systems on both Ruby 1.8 and 1.9. I do not know how it fares on JRuby, Rubinius, or others yet.
 == COMMAND LINE TOOL:
@@ -55,8 +60,8 @@ great for testing, or perhaps calling it from a non Ruby script. The output is c
 * Fork the project.
 * Make your feature addition or bug fix.
 * Add tests for it. This is important so I don't break it in a future version unintentionally.
-* Commit, do not mess with Rakefile, version, or history.
-* Send me a pull request. I may or may not accept it.
+* Commit, do not mess with Rakefile, version, or history as it's handled by Jeweler (which is awesome, btw).
+* Send me a pull request. I may or may not accept it (sorry, practicality rules.. but message me and we can talk!)
 == COPYRIGHT AND LICENSE
@@ -65,4 +70,4 @@ Apache 2.0 License - See LICENSE for details.
 All except lib/pismo/readability.rb is Copyright (c) 2009, 2010 Peter Cooper
 lib/pismo/readability.rb is Copyright (c) 2009, 2010 Arc90 Inc, starrhorne, and iterationlabs
-The readability stuff was ganked from http://github.com/iterationlabs/ruby-readability
+The readability stuff was ganked from http://github.com/iterationlabs/ruby-readability - sorry! I have respected the license, however. I have promised to contribute back to them directly and, hopefully, use that library as a regular dependency. But.. this takes time.

data/VERSION CHANGED

@@ -1 +1 @@
-0.2.3
+0.4.0

data/lib/pismo/document.rb CHANGED

@@ -23,11 +23,11 @@ module Pismo
     def load(handle, url = nil)
       @url = url if url
-      @url = handle if handle =~ /^http/
+      @url = handle if handle =~ /\Ahttp/
-      @html = if handle =~ /^http/
+      @html = if handle =~ /\Ahttp/
                 open(handle).read
-              elsif handle.is_a?(StringIO) || handle.is_a?(IO)
+              elsif handle.is_a?(StringIO) || handle.is_a?(IO) || handle.is_a?(Tempfile)
                 handle.read
               else
                 handle
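
The anchor change in this hunk matters because, in Ruby, `^` matches at the start of any line, while `\A` matches only at the start of the string. A minimal sketch of the difference follows; the helper names are hypothetical and are not part of pismo. It also shows why a separate `Tempfile` check is useful: `Tempfile` delegates to `File` rather than inheriting from `IO`.

```ruby
require 'tempfile'

# 0.2.3-style check: `^` anchors at the start of ANY line in the string.
def line_anchored_url?(handle)
  !(handle =~ /^http/).nil?
end

# 0.4.0-style check: `\A` anchors only at the very start of the string.
def string_anchored_url?(handle)
  !(handle =~ /\Ahttp/).nil?
end

# Raw HTML whose second line happens to begin with a URL.
html = "<p>intro</p>\nhttp://example.com/ is mentioned here"

puts line_anchored_url?(html)    # true: the old check mistakes this HTML for a URL
puts string_anchored_url?(html)  # false: the new check does not

# Tempfile wraps a File through delegation, so is_a?(IO) is false for it.
def tempfile_counts_as_io?
  tmp = Tempfile.new('pismo-sketch')
  tmp.is_a?(IO)
ensure
  tmp.close! if tmp
end

puts tempfile_counts_as_io?      # false
```

With the old pattern, a string of HTML containing a line that starts with "http" would have been fetched as if it were a URL.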

data/lib/pismo/internal_attributes.rb CHANGED

@@ -2,34 +2,62 @@ module Pismo
   # Internal attributes are different pieces of data we can extract from a document's content
   module InternalAttributes
     # Returns the title of the page/content - attempts to strip site name, etc, if possible
-    def title
+    def title(all = false)
       # TODO: Memoizations
-      title = @doc.match( 'h2.title',
-                          '.entry h2',                                                      # Common style
-                          '.entryheader h1',                                                # Ruby Inside/Kubrick
-                          '.entry-title a',                                               # Common Blogger/Blogspot rules
-                          '.post-title a',
-                          '.posttitle a',
-                          '.entry-title',
-                          '.post-title',
-                          '.posttitle',
-                          ['meta[@name="title"]', lambda { |el| el.attr('content') }],
-                          '#pname a',                                                       # Google Code style
-                          'h1.headermain',
-                          'h1.title',
-                          '.mxb h1'                                                         # BBC News
+      title = @doc.match(
+                          [
+                            '.entryheader h1',                                                # Ruby Inside/Kubrick
+                            '.entry-title a',                                               # Common Blogger/Blogspot rules
+                            '.post-title a',
+                            '.post_title a',
+                            '.posttitle a',
+                            '.post-header h1',
+                            '.entry-title',
+                            '.post-title',
+                            '.posttitle',
+                            '.post_title',
+                            '.pageTitle',
+                            '.title h1',
+                            '.post h2',
+                            'h2.title',
+                            '.entry h2',                                                      # Common style
+                            '.boite_titre a',
+                            ['meta[@name="title"]', lambda { |el| el.attr('content') }],
+                            '#pname a',                                                       # Google Code style
+                            'h1.headermain',
+                            'h1.title',
+                            '.mxb h1',                                                        # BBC News
+                            '#content h1',
+                            '#content h2',
+                            '#content h3',
+                            'a[@rel="bookmark"]',
+                            '.products h2'
+                          ],
+                          all
                         )
       # If all else fails, go to the HTML title
-      unless title
-        title = @doc.match('title')
-        return unless title
-        # Strip off any leading or trailing site names - a scrappy way to try it out..
-        title = title.split(/\s+(\-|\||\:)\s+/).sort_by { |i| i.length }.last.strip
+      if all
+        return [html_title] if !title
+        return ([*title] + [html_title]).uniq
+      else
+        return html_title if !title
+        return title
       end
-      title
+    end
+    def titles
+      title(true)
+    end
+    # HTML title
+    def html_title
+      title = @doc.match('title')
+      return unless title
+      # Strip off any leading or trailing site names - a scrappy way to try it out..
+      title = title.split(/\s+(\-|\||\:)\s+/).sort_by { |i| i.length }.last.to_s.strip
     end
     # Return an estimate of when the page/content was created
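
The site-name stripping in `html_title` above can be tried in isolation. This is a sketch only: `strip_site_name` is a hypothetical name, and the body copies the expression from the diff. It splits the title on " - ", " | ", or " : ", keeps the longest piece, and relies on the `to_s` added in 0.4.0 to survive an empty title.

```ruby
# Hypothetical standalone version of the clean-up step in html_title.
def strip_site_name(raw_title)
  # The capture group makes split return the separators as well, but each
  # is a single character, so it never wins the longest-segment contest.
  raw_title.split(/\s+(\-|\||\:)\s+/).sort_by { |part| part.length }.last.to_s.strip
end

puts strip_site_name('My Post Title | Example Site')  # My Post Title
puts strip_site_name('').inspect                      # ""
```

Without `to_s`, the empty-title case would call `strip` on `nil`, because splitting an empty string yields an empty array.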
@@ -43,7 +71,10 @@ module Pismo
       regexen = [
         /#{mo}\b\s+\d+\D{1,10}\d{4}/i,
         /(on\s+)?\d+\s+#{mo}\s+\D{1,10}\d+/i,
-        /(on[^\d+]{1,10})?\d+(th|st|rd)?.{1,10}#{mo}\b[^\d]{1,10}\d+/i,
+        /(on[^\d+]{1,10})\d+(th|st|rd)?.{1,10}#{mo}\b[^\d]{1,10}\d+/i,
+        /\b\d{4}\-\d{2}\-\d{2}\b/i,
+        /\d+(th|st|rd).{1,10}#{mo}\b[^\d]{1,10}\d+/i,
+        /\d+\s+#{mo}\b[^\d]{1,10}\d+/i,
         /on\s+#{mo}\s+\d+/i,
         /#{mo}\s+\d+/i,
         /\d{4}[\.\/\-]\d{2}[\.\/\-]\d{2}/,
@@ -54,7 +85,7 @@ module Pismo
       regexen.each do |r|
         datetime = @doc.to_html[r]
-        p datetime
+        # p datetime
         break if datetime
       end
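
The loop above tries each pattern in order and stops at the first one that matches. A minimal sketch of that first-match-wins lookup, using only the two patterns from the diff that do not depend on the `mo` month fragment (its definition is not shown here). `DATE_PATTERNS` and `first_date_string` are hypothetical names.

```ruby
# Two patterns copied from the diff; the month-name patterns are omitted
# because they interpolate `mo`, which this diff does not define.
DATE_PATTERNS = [
  /\b\d{4}\-\d{2}\-\d{2}\b/i,
  /\d{4}[\.\/\-]\d{2}[\.\/\-]\d{2}/
].freeze

def first_date_string(html)
  datetime = nil
  DATE_PATTERNS.each do |r|
    datetime = html[r]  # String#[] with a Regexp returns the matched text or nil
    break if datetime
  end
  datetime
end

puts first_date_string('<span>Posted 2010-03-14</span>')  # 2010-03-14
puts first_date_string('<span>2010/03/14</span>')         # 2010/03/14
puts first_date_string('<p>no date here</p>').inspect     # nil
```

Because the first hit wins, the order of the list decides which format is preferred when a page contains several date-like strings.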
@@ -76,10 +107,13 @@ module Pismo
     # end
     # Returns the author of the page/content
-    def author
-      author = @doc.match('.post-author .fn',
+    def author(all = false)
+      author = @doc.match([
+                          '.post-author .fn',
                           '.wire_author',
                           '.cnnByline b',
+                          '.editorlink',
+                          '.authors p',
                           ['meta[@name="author"]', lambda { |el| el.attr('content') }],     # Traditional meta tag style
                           ['meta[@name="AUTHOR"]', lambda { |el| el.attr('content') }],     # CNN style
                           '.byline a',                                                      # Ruby Inside style
@@ -94,31 +128,48 @@ module Pismo
                           '.auth',
                           '.cT-storyDetails h5',                                            # smh.com.au - worth dropping maybe..
                           ['meta[@name="byl"]', lambda { |el| el.attr('content') }],
+                          '.timestamp a',
                           '.fn a',
                           '.fn',
-                          '.byline-author'
-                          )
+                          '.byline-author',
+                          '.ArticleAuthor a',
+                          '.blog_meta a',
+                          'cite a',
+                          'cite',
+                          '.contributor_details h4 a'
+                          ], all)
       return unless author
       # Strip off any "By [whoever]" section
-      author.sub!(/^(post(ed)?\s)?by\W+/i, '')
+      if String === author
+        author.sub!(/^(post(ed)?\s)?by\W+/i, '')
+      elsif Array === author
+        author.map! { |a| a.sub(/^(post(ed)?\s)?by\W+/i, '') }.uniq!
+      end
       author
     end
+    def authors
+      author(true)
+    end
     # Returns the "description" of the page, usually comes from a meta tag
     def description
-      @doc.match(
+      @doc.match([
                   ['meta[@name="description"]', lambda { |el| el.attr('content') }],
                   ['meta[@name="Description"]', lambda { |el| el.attr('content') }],
+                  'rdf:Description[@name="dc:description"]',
                   '.description'
-       )
+       ])
     end
-    # Returns the "lede" or first paragraph of the story/page
-    def lede
-      lede = @doc.match(
+    # Returns the "lede(s)" or first paragraph(s) of the story/page
+    def lede(all = false)
+      lede = @doc.match([
+                  '.post-text p',
                   '#blogpost p',
                   '.subhead',
                   '//div[@class="entrytext"]//p[string-length()>10]',                      # Ruby Inside / Kubrick style
@@ -136,10 +187,24 @@ module Pismo
                   '#content p',
                   '#article p',
                   '.post-body',
-                  '.entry-content'
-                  )
-      lede[/^(.*?\.\s){2}/m] || lede
+                  '.entry-content',
+                  '.body p',
+                  '.document_description_short p',    # Scribd
+                  '.single-post p',
+                  'p'
+                  ], all)
+      if lede && String === lede
+        return lede[/^(.*?\.\s){2}/m] || lede
+      elsif lede && Array === lede
+        return lede.map { |l| l.to_s[/^(.*?\.\s){2}/m] || l }.uniq
+      else
+        return body ? body[/^(.*?\.\s){2}/m] : nil
+      end
+    end
+    def ledes
+      lede(true)
     end
     # Returns the "keywords" in the document (not the meta keywords - they're next to useless now)
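
The lede truncation above keeps roughly the first two sentences and falls back to the whole text when fewer than two are found. A small sketch of that expression on its own; `truncate_lede` and the constant name are hypothetical, while the pattern is the one in the diff.

```ruby
# Pattern from the diff: two lazy runs, each ending in a period plus whitespace.
FIRST_TWO_SENTENCES = /^(.*?\.\s){2}/m

def truncate_lede(text)
  text[FIRST_TWO_SENTENCES] || text
end

puts truncate_lede('First sentence. Second sentence. Third sentence.').inspect
# "First sentence. Second sentence. "
puts truncate_lede('Only one sentence here.').inspect
# "Only one sentence here."
```

Note that the match keeps the trailing whitespace after the second period, so callers may want to strip the result.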
@@ -150,7 +215,9 @@ module Pismo
       # Convert doc to lowercase, scrub out most HTML tags, then keep track of words
       cached_title = title
-      body.downcase.gsub(/\<[^\>]{1,100}\>/, '').gsub(/\&\w+\;/, '').scan(/\b[a-z][a-z\'\+\#\.]*\b/).each do |word|
+      content_to_use = body.to_s.downcase + description.to_s.downcase
+      content_to_use.downcase.gsub(/\<[^\>]{1,100}\>/, '').gsub('. ', ' ').gsub(/\&\w+\;/, '').scan(/\b[a-z][a-z\+\.\'\+\#\-]*\b/).each do |word|
         next if word.length > options[:word_length_limit]
         word.gsub!(/\'\w+/, '')
         words[word] ||= 0
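
The keyword pass above can be approximated as a standalone word counter. This is a sketch under assumptions: `count_words` and `WORD_LENGTH_LIMIT` are hypothetical, the limit of 15 stands in for `options[:word_length_limit]` (whose default is not shown in this diff), and the stopword filtering that pismo applies elsewhere is left out.

```ruby
# Hypothetical stand-in for options[:word_length_limit].
WORD_LENGTH_LIMIT = 15

def count_words(body, description = nil)
  words = Hash.new(0)
  # Body and description are concatenated directly, as in the diff.
  content_to_use = body.to_s.downcase + description.to_s.downcase
  content_to_use.gsub(/\<[^\>]{1,100}\>/, '')  # drop most HTML tags
                .gsub('. ', ' ')               # drop sentence-ending periods
                .gsub(/\&\w+\;/, '')           # drop HTML entities
                .scan(/\b[a-z][a-z\+\.\'\#\-]*\b/).each do |word|
    next if word.length > WORD_LENGTH_LIMIT
    word = word.gsub(/\'\w+/, '')              # "ruby's" becomes "ruby"
    words[word] += 1
  end
  words
end

counts = count_words("<p>Ruby is great. Ruby's fast.</p>")
puts counts['ruby']  # 2
puts counts['fast']  # 1
```

The character class keeps `+`, `#`, `.`, and `-` inside words, which lets tokens such as "c++", "c#", or hyphenated terms survive as single keywords.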
@@ -178,9 +245,9 @@ module Pismo
     # Returns URL to the site's favicon
     def favicon
-      url = @doc.match( ['link[@rel="fluid-icon"]', lambda { |el| el.attr('href') }],      # Get a Fluid icon if possible..
+      url = @doc.match([['link[@rel="fluid-icon"]', lambda { |el| el.attr('href') }],      # Get a Fluid icon if possible..
                         ['link[@rel="shortcut icon"]', lambda { |el| el.attr('href') }],
-                        ['link[@rel="icon"]', lambda { |el| el.attr('href') }])
+                        ['link[@rel="icon"]', lambda { |el| el.attr('href') }]])
       if url && url !~ /^http/ && @url
         url = URI.join(@url , url).to_s
       end
@@ -188,17 +255,30 @@ module Pismo
       url
     end
-    # Returns URL of Web feed
-    def feed
-      url = @doc.match( ['link[@type="application/rss+xml"]', lambda { |el| el.attr('href') }],
-                        ['link[@type="application/atom+xml"]', lambda { |el| el.attr('href') }]
+    # Returns URL(s) of Web feed(s)
+    def feed(all = false)
+      url = @doc.match([['link[@type="application/rss+xml"]', lambda { |el| el.attr('href') }],
+                        ['link[@type="application/atom+xml"]', lambda { |el| el.attr('href') }]], all
       )
-      if url && url !~ /^http/ && @url
+      if url && String === url && url !~ /^http/ && @url
         url = URI.join(@url , url).to_s
+      elsif url && Array === url
+        url.map! do |u|
+          if u !~ /^http/ && @url
+            URI.join(@url, u).to_s
+          else
+            u
+          end
+        end
+        url.uniq!
       end
       url
     end
+    def feeds
+      feed(true)
+    end
   end
 end
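
The feed hunk above resolves relative feed URLs against the page URL, for either a single string or an array. A minimal sketch of that logic using `URI.join` from the standard library; `absolutize` is a hypothetical helper, not a pismo method.

```ruby
require 'uri'

# Hypothetical helper: resolve one URL, or a list of URLs, against a base URL.
def absolutize(url, base)
  case url
  when String
    (url !~ /^http/ && base) ? URI.join(base, url).to_s : url
  when Array
    url.map { |u| absolutize(u, base) }.uniq
  end
end

page = 'http://example.com/blog/post.html'
puts absolutize('/feed.xml', page)  # http://example.com/feed.xml
puts absolutize('feed.xml', page)   # http://example.com/blog/feed.xml
puts absolutize(['/feed.xml', 'http://example.com/feed.xml'], page).inspect
# ["http://example.com/feed.xml"]
```

Deduplicating after resolution, as the diff does with `uniq!`, collapses a relative and an absolute link that point to the same feed.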

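The byline clean-up in the `author` method of the internal_attributes.rb diff above strips a leading "By" or "Posted by" from either one string or an array of candidates. A sketch of the same idea written without in-place mutation; `clean_byline` and `BYLINE_PREFIX` are hypothetical names, and the pattern is the one from the diff.

```ruby
# Pattern from the diff: optional "post"/"posted", then "by", then non-word characters.
BYLINE_PREFIX = /^(post(ed)?\s)?by\W+/i

# Hypothetical helper: accepts a single byline, a list of bylines, or nil.
def clean_byline(author)
  case author
  when String then author.sub(BYLINE_PREFIX, '')
  when Array  then author.map { |a| a.sub(BYLINE_PREFIX, '') }.uniq
  end
end

puts clean_byline('By Peter Cooper')                   # Peter Cooper
puts clean_byline('Posted by Jane Doe')                # Jane Doe
puts clean_byline(['By Jane Doe', 'Jane Doe']).inspect # ["Jane Doe"]
puts clean_byline('Byron Smith')                       # Byron Smith
```

The `\W+` after "by" is what protects names such as "Byron": the prefix is removed only when "by" is followed by a non-word character.
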
data/lib/pismo/readability.rb CHANGED

@@ -9,6 +9,8 @@
 #   http://lab.arc90.com/experiments/readability/js/readability.js
 #   * Copyright (c) 2009 Arc90 Inc
 #   * Readability is licensed under the Apache License, Version 2.0.
+#
+# Minor edits and tweaks by Peter Cooper
 require 'nokogiri'
@@ -70,6 +72,9 @@ module Readability
       sibling_score_threshold = [10, best_candidate[:content_score] * 0.2].max
       output = Nokogiri::XML::Node.new('div', @html)
+      return output unless best_candidate[:elem]
       best_candidate[:elem].parent.children.each do |sibling|
         append = false
         append = true if sibling == best_candidate[:elem]
@@ -105,7 +110,7 @@ module Readability
       end
       best_candidate = sorted_candidates.first || { :elem => @html.css("body").first, :content_score => 0 }
-      debug("Best candidate #{best_candidate[:elem].name}##{best_candidate[:elem][:id]}.#{best_candidate[:elem][:class]} with score #{best_candidate[:content_score]}")
+      #debug("Best candidate #{best_candidate[:elem].name}##{best_candidate[:elem][:id]}.#{best_candidate[:elem][:class]} with score #{best_candidate[:content_score]}")
       best_candidate
     end