pismo 0.2.0

data/.document ADDED
@@ -0,0 +1,5 @@
+ README.rdoc
+ lib/**/*.rb
+ bin/*
+ features/**/*.feature
+ LICENSE
data/.gitignore ADDED
@@ -0,0 +1,21 @@
+ ## MAC OS
+ .DS_Store
+
+ ## TEXTMATE
+ *.tmproj
+ tmtags
+
+ ## EMACS
+ *~
+ \#*
+ .\#*
+
+ ## VIM
+ *.swp
+
+ ## PROJECT::GENERAL
+ coverage
+ rdoc
+ pkg
+
+ ## PROJECT::SPECIFIC
data/LICENSE ADDED
@@ -0,0 +1,32 @@
+ All EXCEPT the lib/pismo/readability.rb file:
+
+ Copyright 2009, 2010 Peter Cooper
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+
+ For lib/pismo/readability.rb:
+
+ Copyright 2009, 2010 Arc90 Inc, starrhorne, and iterationlabs
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
data/README.rdoc ADDED
@@ -0,0 +1,68 @@
+ = pismo (Web page content analyzer and metadata extractor)
+
+ * http://github.com/peterc/pismo
+
+ == STATUS:
+
+ pismo is a VERY NEW project developed for use on http://coder.io/ - my forthcoming developer news aggregator. pismo is FAR FROM COMPLETE. If you're brave, though, you can have a PLAY with it - the examples below and those in the test suite/corpus work, and all tests pass.
+
+ The main missing features so far are the "external attributes" - calls made to external services like Delicious, Yahoo, and Bing to get third-party data about documents. The structures are there, but I'm still deciding how best to integrate these ideas.
+
+ == DESCRIPTION:
+
+ Pismo extracts metadata and machine-usable data from otherwise unstructured
+ HTML documents, including titles, body text, graphics, dates, and keywords.
+
+ For example, if you have a blog post HTML file, Pismo should, in theory, be
+ able to extract the title, the actual "content", and images relating to the
+ content, as well as look up Delicious tags and analyze the text for keywords.
+
+ Pismo only understands English. Je suis désolé.
+
+ == SYNOPSIS:
+
+ * Basic demo:
+
+     require 'open-uri'
+     require 'pismo'
+     doc = Pismo::Document.new(open('http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html'))
+     doc.title     # => "Cramp: Asychronous Event-Driven Ruby Web App Framework"
+     doc.author    # => "Peter Cooper"
+     doc.lede      # => "Cramp (GitHub repo)is a new, asynchronous evented Web app framework by Pratik Naik of 37signals (and the Rails core team). It's built around Ruby's EventMachine library and was designed to use event-driven I/O throughout - making it ideal for situations where you need to handle a large number of open connections (such as Comet systems or streaming APIs.)"
+     doc.keywords  # => [["cramp", 7], ["controllers", 3], ["app", 3], ["basic", 2], ..., ... ]
+
+ == COMMAND LINE TOOL:
+
+ A command line tool called "pismo" is included so that you can get metadata about a page from the command line. This is
+ great for testing, or for calling it from a non-Ruby script. The output is currently in YAML.
+
+ * Usage:
+
+     ./bin/pismo http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html title lede author datetime
+
+ * Output:
+
+     ---
+     :url: http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html
+     :title: "Cramp: Asychronous Event-Driven Ruby Web App Framework"
+     :lede: Cramp (GitHub repo)is a new, asynchronous evented Web app framework by Pratik Naik of 37signals
+     :author: Peter Cooper
+     :datetime: 2010-01-07 12:00:00 +00:00
+
+ == Note on Patches/Pull Requests
+
+ * Fork the project.
+ * Make your feature addition or bug fix.
+ * Add tests for it. This is important so I don't break it in a future version unintentionally.
+ * Commit, but do not mess with the Rakefile, version, or history.
+ * Send me a pull request. I may or may not accept it.
+
+ == COPYRIGHT AND LICENSE
+
+ Apache 2.0 License - see LICENSE for details.
+
+ Everything except lib/pismo/readability.rb is Copyright (c) 2009, 2010 Peter Cooper.
+ lib/pismo/readability.rb is Copyright (c) 2009, 2010 Arc90 Inc, starrhorne, and iterationlabs.
+
+ The readability code was ganked from http://github.com/iterationlabs/ruby-readability
data/Rakefile ADDED
@@ -0,0 +1,95 @@
+ require 'rubygems'
+ require 'rake'
+
+ begin
+   require 'jeweler'
+   Jeweler::Tasks.new do |gem|
+     gem.name = "pismo"
+     gem.summary = %Q{Extracts or retrieves content-related metadata from HTML pages}
+     gem.description = %Q{Pismo extracts and retrieves content-related metadata from HTML pages - you can use the resulting data in an organized way, such as a summary/first paragraph, del.icio.us tags, first image used in the content block, etc.}
+     gem.email = "git@peterc.org"
+     gem.homepage = "http://github.com/peterc/pismo"
+     gem.authors = ["Peter Cooper"]
+     gem.add_development_dependency "shoulda", ">= 0"
+     gem.add_dependency "nokogiri"
+     gem.add_dependency "loofah"
+     gem.add_dependency "httparty"
+     gem.add_dependency "fast-stemmer"
+     gem.add_dependency "chronic"
+   end
+   Jeweler::GemcutterTasks.new
+ rescue LoadError
+   puts "Jeweler (or a dependency) not available. Install it with: gem install jeweler"
+ end
+
+ require 'rake/testtask'
+ Rake::TestTask.new(:test) do |test|
+   test.libs << 'lib' << 'test'
+   test.pattern = 'test/**/test_*.rb'
+   test.verbose = true
+ end
+
+ begin
+   require 'rcov/rcovtask'
+   Rcov::RcovTask.new do |test|
+     test.libs << 'test'
+     test.pattern = 'test/**/test_*.rb'
+     test.verbose = true
+   end
+ rescue LoadError
+   task :rcov do
+     abort "RCov is not available. In order to run rcov, you must: sudo gem install spicycode-rcov"
+   end
+ end
+
+ task :test => :check_dependencies
+
+ task :default => :test
+
+ require 'rake/rdoctask'
+ Rake::RDocTask.new do |rdoc|
+   version = File.exist?('VERSION') ? File.read('VERSION') : ""
+
+   rdoc.rdoc_dir = 'rdoc'
+   rdoc.title = "pismo #{version}"
+   rdoc.rdoc_files.include('README*')
+   rdoc.rdoc_files.include('lib/**/*.rb')
+ end
+
+ desc 'Automatically run something when code is changed'
+ task :on_update do
+   require 'find'
+   files = {}
+
+   loop do
+     changed = false
+     Find.find(File.dirname(__FILE__)) do |file|
+       next unless file =~ /\.rb$/
+       ctime = File.ctime(file).to_i
+
+       if ctime != files[file]
+         files[file] = ctime
+         changed = true
+       end
+     end
+
+     if changed
+       system ARGV[1] || 'rake'
+       puts "\n" + Time.now.to_s
+     end
+
+     sleep 4
+   end
+ end
+
+ desc 'Console mode'
+ task :console do
+   require 'irb'
+   require 'lib/pismo'
+   require 'open-uri'
+   @d = Pismo.document(ARGV[1] || open('./test/corpus/bbcnews.html'))
+
+   # Get around IRB's issues with ARGV..
+   ARGV = []
+   IRB.start
+ end
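As an aside, the `:on_update` task's change detection boils down to snapshotting file ctimes and diffing against the snapshot on each pass. A single-pass, stdlib-only sketch of that idea (the helper name is hypothetical; the real task wraps this in a loop and shells out to rake):

```ruby
require 'find'
require 'tmpdir'

# Snapshot each .rb file's ctime under dir and return the files whose
# ctime differs from the previous snapshot, updating the snapshot as we go.
def changed_ruby_files(dir, snapshot)
  changed = []
  Find.find(dir) do |file|
    next unless file =~ /\.rb$/
    ctime = File.ctime(file).to_i
    if ctime != snapshot[file]
      snapshot[file] = ctime
      changed << file
    end
  end
  changed
end

Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'example.rb'), "puts 'hi'\n")
  snapshot = {}
  puts changed_ruby_files(dir, snapshot).length  # first pass sees the new file
  puts changed_ruby_files(dir, snapshot).length  # nothing changed since the snapshot
end
```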
data/VERSION ADDED
@@ -0,0 +1 @@
+ 0.2.0
data/bin/pismo ADDED
@@ -0,0 +1,36 @@
+ #!/usr/bin/env ruby
+
+ # pismo
+ #
+ # Get metadata about a page from the command line.
+ #
+ # Usage:
+ #   ./bin/pismo http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html title description author
+ # Output:
+ #   ---
+ #   :url: http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html
+ #   :title: "Cramp: Asychronous Event-Driven Ruby Web App Framework"
+ #   :description: The ideal book for beginners or developers merely new to Ruby. Goes from installation to OOP, webapps, SQL, and GUI apps.
+ #   :author: Peter Cooper
+
+ require 'yaml'
+ require 'rubygems'
+ $:.unshift(File.dirname(__FILE__) + "/../lib")
+ require 'pismo'
+
+ url = ARGV.shift
+
+ # If the first argument isn't a URL, treat it as a local file and read its HTML
+ unless url =~ /^http/
+   url = File.read(url)
+ end
+
+ doc = Pismo.document(url)
+
+ output = { :url => doc.url }
+
+ (ARGV.empty? ? Pismo::Document::ATTRIBUTE_METHODS : ARGV).each do |cmd|
+   output[cmd.to_sym] = doc.send(cmd) rescue nil
+ end
+
+ puts output.to_yaml
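The attribute loop above is a small dynamic-dispatch pattern: each requested attribute name is `send`-ed to the document, and any failure simply becomes nil in the output hash. A sketch of just that pattern using a plain stand-in object (`FakeDoc` and the helper name are hypothetical, for illustration only; this is not the Pismo API):

```ruby
require 'yaml'

# Stand-in for a document object with a couple of attribute methods.
class FakeDoc
  def title;  "An Example Title"; end
  def author; "A. N. Author";     end
end

# Send each requested attribute name to the document; unknown or
# failing attributes become nil, as in bin/pismo's `rescue nil`.
def attributes_as_hash(doc, names)
  output = {}
  names.each do |cmd|
    output[cmd.to_sym] = begin
      doc.send(cmd)
    rescue StandardError
      nil
    end
  end
  output
end

puts attributes_as_hash(FakeDoc.new, %w[title author bogus]).to_yaml
```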
@@ -0,0 +1,50 @@
+ require 'pismo/internal_attributes'
+ require 'pismo/external_attributes'
+
+ module Pismo
+
+   # Pismo::Document represents a single HTML document within Pismo
+   class Document
+     attr_reader :doc, :url
+
+     ATTRIBUTE_METHODS = InternalAttributes.instance_methods + ExternalAttributes.instance_methods
+
+     include Pismo::InternalAttributes
+     include Pismo::ExternalAttributes
+
+     def initialize(handle, url = nil)
+       load(handle, url)
+     end
+
+     # An HTML representation of the document
+     def html
+       @doc.to_s
+     end
+
+     def load(handle, url = nil)
+       @url = url if url
+       @url = handle if handle =~ /^http/
+
+       @html = if handle =~ /^http/
+         open(handle).read
+       elsif handle.is_a?(StringIO) || handle.is_a?(IO)
+         handle.read
+       else
+         handle
+       end
+
+       @html = clean_html(@html)
+
+       @doc = Nokogiri::HTML(@html)
+     end
+
+     def clean_html(html)
+       html.gsub!('&#8217;', '\'')
+       html.gsub!('&#8221;', '"')
+       html.gsub!('&#8211;', '-')
+       html.gsub!('&#8220;', '"')
+       html.gsub!('&nbsp;', ' ')
+       html
+     end
+   end
+ end
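The `clean_html` step above normalises a handful of common HTML entities to plain ASCII before the markup is handed to Nokogiri. A non-mutating, stdlib-only sketch of the same substitutions (the constant and helper names are hypothetical):

```ruby
# Map of the entities clean_html replaces to their plain-text equivalents.
ENTITIES = {
  '&#8217;' => "'",   # right single quote
  '&#8221;' => '"',   # right double quote
  '&#8211;' => '-',   # en dash
  '&#8220;' => '"',   # left double quote
  '&nbsp;'  => ' '    # non-breaking space
}.freeze

# Apply each replacement in turn, returning a new string.
def normalise_entities(html)
  ENTITIES.reduce(html) { |text, (entity, plain)| text.gsub(entity, plain) }
end

puts normalise_entities('&#8220;Hello&nbsp;world&#8221;')  # => "Hello world" (with quotes)
```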
@@ -0,0 +1,14 @@
+ module Pismo
+   # External attributes return data that comes from external services or programs (e.g. Delicious tags)
+   module ExternalAttributes
+     #include HTTParty
+     #
+     #def delicious_tags
+     #  delicious_info["top_tags"].sort_by { |k, v| v }.reverse.first(5) rescue []
+     #end
+     #
+     #def delicious_info
+     #  @delicious_info ||= self.class.get('http://feeds.delicious.com/v2/json/urlinfo/' + Digest::MD5.hexdigest(@url)).first rescue nil
+     #end
+   end
+ end
@@ -0,0 +1,202 @@
+ module Pismo
+   # Internal attributes are different pieces of data we can extract from a document's content
+   module InternalAttributes
+     # Returns the title of the page/content - attempts to strip site name, etc, if possible
+     def title
+       title = @doc.match('h2.title',
+                          '.entry h2',        # Common style
+                          '.entryheader h1',  # Ruby Inside/Kubrick
+                          '.entry-title a',   # Common Blogger/Blogspot rules
+                          '.post-title a',
+                          '.posttitle a',
+                          '.entry-title',
+                          '.post-title',
+                          '.posttitle',
+                          ['meta[@name="title"]', lambda { |el| el.attr('content') }],
+                          '#pname a',         # Google Code style
+                          'h1.headermain',
+                          'h1.title',
+                          '.mxb h1'           # BBC News
+                         )
+
+       # If all else fails, go to the HTML title
+       unless title
+         title = @doc.match('title')
+         return unless title
+
+         # Strip off any leading or trailing site names - a scrappy way to try it out..
+         title = title.split(/\s+(\-|\||\:)\s+/).sort_by { |i| i.length }.last.strip
+       end
+
+       title
+     end
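The fallback branch above applies a simple heuristic to the raw `<title>` tag: split on common site-name separators (" - ", " | ", " : ") and keep the longest fragment, on the assumption that the article title is longer than the site name. A standalone sketch of just that heuristic (the helper name is hypothetical):

```ruby
# Split a raw <title> string on spaced separator characters and keep
# the longest fragment, which is usually the article title itself.
def strip_site_name(raw_title)
  raw_title.split(/\s+(\-|\||\:)\s+/).sort_by { |part| part.length }.last.strip
end

puts strip_site_name('Cramp: A New Framework - Ruby Inside')  # => "Cramp: A New Framework"
```

Note that the capture group in the split regexp means the separators themselves appear in the split result, but they lose to any real title fragment on length.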
+
+     # Returns an estimate of when the page/content was created.
+     # As clients of this library should be doing HTTP retrieval themselves, they can fall back
+     # on the Last-Modified HTTP header if they so wish. This method is rough and based on content only.
+     def datetime
+       # TODO: Clean all this mess up
+
+       mo = %r{(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|January|February|March|April|May|June|July|August|September|October|November|December)}i
+
+       regexen = [
+         /#{mo}\b\s+\d+\D{1,10}\d{4}/i,
+         /(on\s+)?\d+\s+#{mo}\s+\D{1,10}\d+/i,
+         /(on[^\d+]{1,10})?\d+(th|st|rd)?.{1,10}#{mo}\b[^\d]{1,10}\d+/i,
+         /on\s+#{mo}\s+\d+/i,
+         /#{mo}\s+\d+/i,
+         /\d{4}[\.\/\-]\d{2}[\.\/\-]\d{2}/,
+         /\d{2}[\.\/\-]\d{2}[\.\/\-]\d{4}/
+       ]
+
+       datetime = nil
+
+       regexen.each do |r|
+         datetime = @doc.to_html[r]
+         break if datetime
+       end
+
+       return unless datetime && datetime.length > 4
+
+       # Clean up the string for use by Chronic
+       datetime.strip!
+       datetime.gsub!(/(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)[^\w]*/i, '')
+       datetime.gsub!(/(mon|tues|tue|weds|wed|thurs|thur|thu|fri|sat|sun)[^\w]*/i, '')
+       datetime.sub!(/on\s+/, '')
+       datetime.gsub!(/\,/, '')
+       datetime.sub!(/(\d+)(th|st|rd)/, '\1')
+
+       Chronic.parse(datetime) || datetime
+     end
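Before the matched date string reaches Chronic, the method scrubs out the parts Chronic trips over: day-of-week names, a leading "on", commas, and ordinal suffixes. A sketch of just that cleanup stage, with the Chronic call itself omitted so it runs on the stdlib alone (the helper name is hypothetical):

```ruby
# Normalise a scraped date string the way datetime does before parsing:
# strip day names, a leading "on", commas, and ordinal suffixes (4th -> 4).
def clean_date_string(str)
  str = str.strip
  str.gsub!(/(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)[^\w]*/i, '')
  str.gsub!(/(mon|tues|tue|weds|wed|thurs|thur|thu|fri|sat|sun)[^\w]*/i, '')
  str.sub!(/on\s+/, '')
  str.gsub!(/\,/, '')
  str.sub!(/(\d+)(th|st|rd)/, '\1')
  str
end

puts clean_date_string('on Monday, January 4th, 2010')  # => "January 4 2010"
```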
+
+     # TODO: Attempt to work out what type of site or page the page is from the provided URL
+     # def site_type
+     # end
+
+     # Returns the author of the page/content
+     def author
+       author = @doc.match('.post-author .fn',
+                           '.wire_author',
+                           '.cnnByline b',
+                           ['meta[@name="author"]', lambda { |el| el.attr('content') }],  # Traditional meta tag style
+                           ['meta[@name="AUTHOR"]', lambda { |el| el.attr('content') }],  # CNN style
+                           '.byline a',               # Ruby Inside style
+                           '.post_subheader_left a',  # TechCrunch style
+                           '.byl',                    # BBC News style
+                           '.meta a',
+                           '.articledata .author a',
+                           '#owners a',               # Google Code style
+                           '.author a',
+                           '.author',
+                           '.auth a',
+                           '.auth',
+                           '.cT-storyDetails h5',     # smh.com.au - worth dropping maybe..
+                           ['meta[@name="byl"]', lambda { |el| el.attr('content') }],
+                           '.fn a',
+                           '.fn',
+                           '.byline-author'
+                          )
+
+       return unless author
+
+       # Strip off any "By [whoever]" section
+       author.sub!(/^(post(ed)?\s)?by\W+/i, '')
+
+       author
+     end
+
+     # Returns the "description" of the page, which usually comes from a meta tag
+     def description
+       @doc.match(
+         ['meta[@name="description"]', lambda { |el| el.attr('content') }],
+         ['meta[@name="Description"]', lambda { |el| el.attr('content') }],
+         '.description'
+       )
+     end
+
+     # Returns the "lede" or first paragraph of the story/page
+     def lede
+       lede = @doc.match(
+         '#blogpost p',
+         '.subhead',
+         '//div[@class="entrytext"]//p[string-length()>10]',  # Ruby Inside / Kubrick style
+         'section p',
+         '.entry .text p',
+         '.entry-content p',
+         '#wikicontent p',                                    # Google Code style
+         '//td[@class="storybody"]/p[string-length()>10]',    # BBC News style
+         '//div[@class="entry"]//p[string-length()>100]',
+         # The below is a horrible, horrible way to pluck out lead paras from crappy Blogspot blogs that
+         # don't use <p> tags..
+         ['.entry-content', lambda { |el| el.inner_html[/(#{el.inner_text[0..4].strip}.*?)\<br/, 1] }],
+         ['.entry', lambda { |el| el.inner_html[/(#{el.inner_text[0..4].strip}.*?)\<br/, 1] }],
+         '.entry',
+         '#content p',
+         '#article p',
+         '.post-body',
+         '.entry-content'
+       )
+
+       lede[/^(.*?\.\s){2}/m] || lede
+     end
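The last line of `lede` trims the matched text to roughly its first two sentences using the regexp `[/^(.*?\.\s){2}/m]`, falling back to the whole string when there aren't two full-stop-plus-space boundaries to match. A standalone sketch of that trimming step (the helper name is hypothetical):

```ruby
# Keep roughly the first two sentences of a text: two non-greedy runs
# each ending in a period followed by whitespace; else the whole string.
def first_two_sentences(text)
  text[/^(.*?\.\s){2}/m] || text
end

puts first_two_sentences('One. Two. Three.')  # => "One. Two. " (trailing space kept)
```

Note the heuristic is crude - abbreviations like "e.g. " also count as sentence boundaries.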
+
+     # Returns the "keywords" in the document (not the meta keywords - they're next to useless now)
+     def keywords(options = {})
+       options = { :stem_at => 10, :word_length_limit => 15, :limit => 20 }.merge(options)
+
+       words = {}
+
+       # Convert doc to lowercase, scrub out most HTML tags
+       body.downcase.gsub(/\<[^\>]{1,100}\>/, '').gsub(/\&\w+\;/, '').scan(/\b[a-z][a-z\'\#\.]*\b/).each do |word|
+         next if word.length > options[:word_length_limit]
+         word.gsub!(/\'\w+/, '')
+         words[word] ||= 0
+         words[word] += 1
+       end
+
+       # Stem the words and stop words if necessary
+       d = words.keys.uniq.map { |a| a.length > options[:stem_at] ? a.stem : a }
+       s = File.read(File.dirname(__FILE__) + '/stopwords.txt').split.map { |a| a.length > options[:stem_at] ? a.stem : a }
+
+       w = words.delete_if { |k1, v1| s.include?(k1) || (v1 < 2 && words.size > 80) }.sort_by { |k2, v2| v2 }.reverse.first(options[:limit])
+       return w
+     end
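Stripped of the stemming and the stopwords.txt lookup, the core of `keywords` is a word-frequency tally: lowercase the text, count words, drop stop words, and rank by descending count. A minimal sketch of that core (the tiny inline stop-word list and the helper name are illustrative only; the gem loads a full list from stopwords.txt and uses fast-stemmer):

```ruby
# A deliberately small stand-in stop-word list, for demonstration.
STOP_WORDS = %w[a an and the is it of to].freeze

# Tally lowercase words, rank by descending frequency, drop stop words,
# and return up to `limit` [word, count] pairs.
def top_keywords(text, limit = 5)
  counts = Hash.new(0)
  text.downcase.scan(/\b[a-z][a-z']*\b/) { |word| counts[word] += 1 }
  counts.sort_by { |_, count| -count }
        .reject  { |word, _| STOP_WORDS.include?(word) }
        .first(limit)
end

puts top_keywords('Cramp is an async framework. Cramp uses EventMachine.').inspect
```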
+
+     # Returns body text as determined by Arc90's Readability algorithm
+     def body
+       @body ||= Readability::Document.new(@doc.to_s).content.strip
+
+       # HACK: Remove annoying DIV that readability leaves around
+       @body.gsub!(/\A\<div\>/, '')
+       @body.gsub!(/\<\/div\>\Z/, '')
+
+       return @body
+     end
+
+     # Returns URL to the site's favicon
+     def favicon
+       url = @doc.match(['link[@rel="fluid-icon"]', lambda { |el| el.attr('href') }],  # Get a Fluid icon if possible..
+                        ['link[@rel="shortcut icon"]', lambda { |el| el.attr('href') }],
+                        ['link[@rel="icon"]', lambda { |el| el.attr('href') }])
+       if url && url !~ /^http/ && @url
+         url = URI.join(@url, url).to_s
+       end
+
+       url
+     end
+
+     # Returns URL of Web feed
+     def feed
+       url = @doc.match(['link[@type="application/rss+xml"]', lambda { |el| el.attr('href') }],
+                        ['link[@type="application/atom+xml"]', lambda { |el| el.attr('href') }]
+                       )
+
+       if url && url !~ /^http/ && @url
+         url = URI.join(@url, url).to_s
+       end
+
+       url
+     end
+   end
+ end
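Both `favicon` and `feed` end with the same resolution step: if the extracted href isn't absolute and the page's own URL is known, join the two with the stdlib's `URI.join`. A standalone sketch of that step (the helper name is hypothetical):

```ruby
require 'uri'

# Resolve a possibly-relative href against the page's URL. Absolute
# hrefs pass through untouched, as does anything we can't resolve.
def absolutise(href, page_url)
  return href if href.nil? || href =~ /^http/ || page_url.nil?
  URI.join(page_url, href).to_s
end

puts absolutise('/favicon.ico', 'http://www.rubyinside.com/some/post.html')
# => "http://www.rubyinside.com/favicon.ico"
```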