RubyGems - pismo - Versions diffs - 0.4.0 → 0.5.0 - Mend

pismo 0.4.0 → 0.5.0

Files changed (10)

data/{README.rdoc → README.markdown} +36 -21
data/VERSION +1 -1
data/lib/pismo/document.rb +4 -0
data/lib/pismo/internal_attributes.rb +30 -9
data/lib/pismo/readability.rb +30 -9
data/lib/pismo/stopwords.txt +2 -0
data/lib/pismo.rb +21 -2
data/pismo.gemspec +4 -4
data/test/corpus/metadata_expected.yaml +2 -1
metadata +4 -4

data/{README.rdoc → README.markdown} RENAMED

@@ -1,18 +1,15 @@
-= pismo (Web page content analyzer and metadata extractor)
+# pismo - Web page content analysis and metadata extraction
+http://github.com/peterc/pismo
-* http://github.com/peterc/pismo
-== DESCRIPTION:
+## DESCRIPTION:
 Pismo extracts metadata and machine-usable data from mostly unstructured (or poorly structured)
-HTML documents. These data include titles, feed URLs, ledes, body text, graphics, date, and keywords.
+English-language HTML documents. These data include titles, feed URLs, ledes, body text, graphics, date, and keywords.
 For example, if you have a blog post HTML file, Pismo, in theory, should
 extract the title, the actual "content", and analyze for keywords, among other things.
-Pismo only understands (and much prefers) English. Je suis desolé.
-== EXAMPLES:
+## EXAMPLES:
     require 'pismo'
@@ -23,30 +20,48 @@ Pismo only understands (and much prefers) English. Je suis desolé.
     doc.author    # => "Peter Cooper"
     doc.lede      # => "Cramp (GitHub repo)is a new, asynchronous evented Web app framework by Pratik Naik of 37signals (and the Rails core team). It's built around Ruby's EventMachine library and was designed to use event-driven I/O throughout - making it ideal for situations where you need to handle a large number of open connections (such as Comet systems or streaming APIs.)"
     doc.keywords  # => [["cramp", 7], ["controllers", 3], ["app", 3], ["basic", 2], ..., ... ]
+## STATUS:
-== NEW IN 0.4.0:
+Pismo is a work in progress and is being used heavily in the development of http://coder.io/. Pismo is used in production systems on both Ruby 1.8 and 1.9. I do not know how it fares on JRuby, Rubinius, or others yet.
-  Pismo is not perfect and you might like to instead see all of the potential titles/ledes/authors or feeds that Pismo can find. You can now do this and judge them by your metrics.
+Planned/forthcoming features include the fetching of "external" data like tags from Delicious, content analysis through 3rd party services, and extraction of graphics from the main article text (for thumbnailing, say).
-    doc.titles    # => [..., ..., ...]
-    doc.ledes    # => [..., ..., ...]
-    doc.authors    # => [..., ..., ...]
-    doc.feeds    # => [..., ..., ...]
+## NEW IN 0.5.0:
+### Stopword access
+You can now access Pismo's stopword list directly:
+    Pismo.stopwords    # => [.., .., ..]
-== STATUS:
+### Convenience access method for IRB/debugging use
+Now you can get playing with Pismo faster. This is primarily useful for debugging/playing in IRB as it just uses open-uri and the Pismo document is cached in the class against the URL:
+    url = "http://www.rubyinside.com/the-why-what-and-how-of-rubinius-1-0-s-release-3261.html"
+    Pismo[url].title   # => "The Why, What, and How of Rubinius 1.0's Release"
+    Pismo[url].author  # => "Peter Cooper"
-Pismo is a work in progress and is being used heavily in the development of http://coder.io/. Pismo is used in production systems on both Ruby 1.8 and 1.9. I do not know how it fares on JRuby, Rubinius, or others yet.
+### Arrays of all matches for titles, ledes, authors, and feeds
-== COMMAND LINE TOOL:
+Pismo is not perfect and you might like to instead see all of the potential titles/ledes/authors or feeds that Pismo can find. You can now do this and judge them by your metrics.
+    doc.titles    # => [..., ..., ...]
+    doc.ledes     # => [..., ..., ...]
+    doc.authors   # => [..., ..., ...]
+    doc.feeds     # => [..., ..., ...]
+## COMMAND LINE TOOL:
 A command line tool called "pismo" is included so that you can get metadata about a page from the command line. This is
 great for testing, or perhaps calling it from a non Ruby script. The output is currently in YAML.
-* Usage:
+### Usage:
     ./bin/pismo http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html title lede author datetime
-* Output:
+### Output:
     ---
     :url: http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html
@@ -55,7 +70,7 @@ great for testing, or perhaps calling it from a non Ruby script. The output is c
     :author: Peter Cooper
     :datetime: 2010-01-07 12:00:00 +00:00
-== Note on Patches/Pull Requests
+## Note on Patches/Pull Requests
 * Fork the project.
 * Make your feature addition or bug fix.
@@ -63,7 +78,7 @@ great for testing, or perhaps calling it from a non Ruby script. The output is c
 * Commit, do not mess with Rakefile, version, or history as it's handled by Jeweler (which is awesome, btw).
 * Send me a pull request. I may or may not accept it (sorry, practicality rules.. but message me and we can talk!)
-== COPYRIGHT AND LICENSE
+## COPYRIGHT AND LICENSE
 Apache 2.0 License - See LICENSE for details.

data/VERSION CHANGED

@@ -1 +1 @@
-0.4.0
+0.5.0

data/lib/pismo/document.rb CHANGED

@@ -38,6 +38,10 @@ module Pismo
       @doc = Nokogiri::HTML(@html)
     end
+    def match(args = [], all = false)
+      @doc.match([*args], all)
+    end
     def clean_html(html)
       html.gsub!('&#8217;', '\'')
       html.gsub!('&#8221;', '"')
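
The new `match` wrapper passes `[*args]` through to the underlying document, so callers can hand it either one selector or a list. A minimal sketch of that normalization, using a hypothetical `normalize_queries` helper that is not part of Pismo:

```ruby
# Hypothetical helper that mirrors only the [*args] step of Document#match.
def normalize_queries(args = [])
  [*args]
end

single = normalize_queries('.entry-title')           # one selector becomes a one-element list
many   = normalize_queries(['.entry-title', 'h1'])   # a list is passed through unchanged
none   = normalize_queries(nil)                      # nil becomes an empty list

# A lone [selector, lambda] pair is splatted one level, so the pair is lost:
extractor = lambda { |el| el }
flattened = normalize_queries(['meta[@name="title"]', extractor])
wrapped   = normalize_queries([['meta[@name="title"]', extractor]])
```

If that reading is right, a single selector/lambda pair probably needs to be wrapped in an outer array, as in `wrapped`, to survive the splat as one query.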

data/lib/pismo/internal_attributes.rb CHANGED

@@ -6,6 +6,7 @@ module Pismo
       # TODO: Memoizations
       title = @doc.match(
                           [
+                            '#pname a',                                                       # Google Code style
                             '.entryheader h1',                                                # Ruby Inside/Kubrick
                             '.entry-title a',                                               # Common Blogger/Blogspot rules
                             '.post-title a',
@@ -14,16 +15,18 @@ module Pismo
                             '.post-header h1',
                             '.entry-title',
                             '.post-title',
+                            '.post h3 a',
+                            'a.datitle',          # Slashdot style
                             '.posttitle',
                             '.post_title',
                             '.pageTitle',
+                            '#main h1.title',
                             '.title h1',
                             '.post h2',
                             'h2.title',
                             '.entry h2',                                                      # Common style
                             '.boite_titre a',
                             ['meta[@name="title"]', lambda { |el| el.attr('content') }],
-                            '#pname a',                                                       # Google Code style
                             'h1.headermain',
                             'h1.title',
                             '.mxb h1',                                                        # BBC News
@@ -31,7 +34,14 @@ module Pismo
                             '#content h2',
                             '#content h3',
                             'a[@rel="bookmark"]',
-                            '.products h2'
+                            '.products h2',
+                            '.caption h3',
+                            '#main h2',
+                            '#body h1',
+                            '#wrapper h1',
+                            '#page h1',
+                            '.asset-header h1',
+                            '#body_content h2'
                           ],
                           all
                         )
@@ -55,9 +65,9 @@ module Pismo
     def html_title
       title = @doc.match('title')
       return unless title
+      title
       # Strip off any leading or trailing site names - a scrappy way to try it out..
-      title = title.split(/\s+(\-|\||\:)\s+/).sort_by { |i| i.length }.last.to_s.strip
+      #title = title.split(/\s+(\-|\||\:)\s+/).sort_by { |i| i.length }.last.to_s.strip
     end
     # Return an estimate of when the page/content was created
@@ -115,8 +125,10 @@ module Pismo
                           '.editorlink',
                           '.authors p',
                           ['meta[@name="author"]', lambda { |el| el.attr('content') }],     # Traditional meta tag style
+                          ['meta[@name="Author"]', lambda { |el| el.attr('content') }],     # CNN style
                           ['meta[@name="AUTHOR"]', lambda { |el| el.attr('content') }],     # CNN style
                           '.byline a',                                                      # Ruby Inside style
+                          '.byline',
                           '.post_subheader_left a',                                         # TechCrunch style
                           '.byl',                                                           # BBC News style
                           '.meta a',
@@ -144,6 +156,11 @@ module Pismo
       # Strip off any "By [whoever]" section
       if String === author
         author.sub!(/^(post(ed)?\s)?by\W+/i, '')
+        author.tr!('^a-zA-Z 0-9\'', '|')
+        author = author.split(/\|{2,}/).first.to_s
+        author.gsub!(/\s+/, ' ')
+        author.gsub!(/\|/, '')
+        author.strip!
       elsif Array === author
         author.map! { |a| a.sub(/^(post(ed)?\s)?by\W+/i, '') }.uniq!
       end
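
The added author clean-up lines can be read in isolation. The sketch below restates them as a standalone, non-mutating function; the name `clean_author` and the sample inputs are assumptions for illustration, not Pismo API:

```ruby
# Hypothetical standalone version of the String branch added in this hunk.
def clean_author(raw)
  # Drop a leading "By", "Post by" or "Posted by"
  author = raw.sub(/^(post(ed)?\s)?by\W+/i, '')
  # Replace everything except letters, digits, spaces and apostrophes with "|"
  author = author.tr("^a-zA-Z 0-9'", '|')
  # A run of two or more replaced characters is treated as the end of the name
  author = author.split(/\|{2,}/).first.to_s
  # Collapse whitespace, drop stray single separators, trim
  author.gsub(/\s+/, ' ').delete('|').strip
end

byline = clean_author("Posted by Peter Cooper -- June 1")
plain  = clean_author("by Jane O'Neil")
```

Because the kept set is ASCII-only, accented letters are treated like punctuation, which fits the README's stated focus on English-language documents.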
@@ -161,6 +178,7 @@ module Pismo
       @doc.match([
                   ['meta[@name="description"]', lambda { |el| el.attr('content') }],
                   ['meta[@name="Description"]', lambda { |el| el.attr('content') }],
+                  ['meta[@name="DESCRIPTION"]', lambda { |el| el.attr('content') }],
                   'rdf:Description[@name="dc:description"]',
                   '.description'
        ])
@@ -171,6 +189,7 @@ module Pismo
       lede = @doc.match([
                   '.post-text p',
                   '#blogpost p',
+                  '.story-teaser',
                   '.subhead',
                   '//div[@class="entrytext"]//p[string-length()>10]',                      # Ruby Inside / Kubrick style
                   'section p',
@@ -209,24 +228,26 @@ module Pismo
     # Returns the "keywords" in the document (not the meta keywords - they're next to useless now)
     def keywords(options = {})
-      options = { :stem_at => 10, :word_length_limit => 15, :limit => 20 }.merge(options)
+      options = { :stem_at => 20, :word_length_limit => 15, :limit => 20 }.merge(options)
       words = {}
       # Convert doc to lowercase, scrub out most HTML tags, then keep track of words
       cached_title = title
       content_to_use = body.to_s.downcase + description.to_s.downcase
-      content_to_use.downcase.gsub(/\<[^\>]{1,100}\>/, '').gsub('. ', ' ').gsub(/\&\w+\;/, '').scan(/\b[a-z][a-z\+\.\'\+\#\-]*\b/).each do |word|
+      # old regex for safe keeping -- \b[a-z][a-z\+\.\'\+\#\-]*\b
+      content_to_use.downcase.gsub(/\<[^\>]{1,100}\>/, '').gsub(/\.+\s+/, ' ').gsub(/\&\w+\;/, '').scan(/(\b|\s|\A)([a-z0-9][a-z0-9\+\.\'\+\#\-\/\\]*)(\b|\s|\Z)/i).map{ |ta1| ta1[1] }.each do |word|
         next if word.length > options[:word_length_limit]
         word.gsub!(/\'\w+/, '')
         words[word] ||= 0
-        words[word] += (cached_title =~ /#{word}/i ? 5 : 1)
+        words[word] += (cached_title.downcase.include?(word) ? 5 : 1)
       end
       # Stem the words and stop words if necessary
       d = words.keys.uniq.map { |a| a.length > options[:stem_at] ? a.stem : a }
-      s = File.read(File.dirname(__FILE__) + '/stopwords.txt').split.map { |a| a.length > options[:stem_at] ? a.stem : a }
+      s = Pismo.stopwords.map { |a| a.length > options[:stem_at] ? a.stem : a }
       w = words.delete_if { |k1, v1| s.include?(k1) || (v1 < 2 && words.size > 80) }.sort_by { |k2, v2| v2 }.reverse.first(options[:limit])
       return w
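
One plausible reason for replacing `cached_title =~ /#{word}/i` with `include?` is that the widened scan now yields tokens containing regex metacharacters such as `.`. The sketch below compares the two checks; `title_weight` and the sample strings are made up for illustration:

```ruby
title = "Version 100 released"
word  = "1.0"

# Interpolated into a regex, "." matches any character, so "100" counts as a hit.
regex_hit = !(title =~ /#{word}/i).nil?

# A plain substring test treats the token literally.
substring_hit = title.downcase.include?(word)

# Escaping the token would be another way to get literal matching.
escaped_hit = !(title =~ /#{Regexp.escape(word)}/i).nil?

# Hypothetical helper mirroring the new weighting: 5 if the token is in the title, else 1.
def title_weight(title, word)
  title.downcase.include?(word) ? 5 : 1
end
```

Here `regex_hit` is true while `substring_hit` and `escaped_hit` are false, so the substring form avoids boosting a keyword on an accidental pattern match.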

data/lib/pismo/readability.rb CHANGED

@@ -14,6 +14,8 @@
 require 'nokogiri'
+IS_RUBY19 = "a".respond_to?(:encoding)
 module Readability
   class Document
     TEXT_LENGTH_THRESHOLD = 25
@@ -28,14 +30,14 @@ module Readability
     end
     def make_html
-      @html = Nokogiri::HTML(@input, nil, 'UTF-8')
+      @html = Nokogiri::HTML(@input) #, nil, 'UTF-8')
     end
     REGEXES = {
         :unlikelyCandidatesRe => /combx|comment|disqus|foot|header|menu|meta|nav|rss|shoutbox|sidebar|sponsor/i,
         :okMaybeItsACandidateRe => /and|article|body|column|main/i,
-        :positiveRe => /article|body|content|entry|hentry|page|pagination|post|text/i,
-        :negativeRe => /combx|comment|contact|foot|footer|footnote|link|media|meta|promo|related|scroll|shoutbox|sponsor|tags/i,
+        :positiveRe => /article|body|content|entry|hentry|page|pagination|post|story|text/i,
+        :negativeRe => /combx|comment|contact|foot|box_wrap|footer|footnote|link|media|meta|promo|related|scroll|shoutbox|sponsor|tags/i,
         :divToPElementsRe => /<(a|blockquote|dl|div|img|ol|p|pre|table|ul)/i,
         :replaceBrsRe => /(<br[^>]*>[ \n\r\t]*){2,}/i,
         :replaceFontsRe => /<(\/?)font[^>]*>/i,
@@ -135,8 +137,16 @@ module Readability
         candidates[grand_parent_node] ||= score_node(grand_parent_node) if grand_parent_node
         content_score = 1
-        content_score += inner_text.split(',').length
-        content_score += [(inner_text.length / 100).to_i, 3].min
+        begin
+          content_score += inner_text.split(',').length
+          content_score += [(inner_text.length / 100).to_i, 3].min
+        rescue => e
+          raise e unless IS_RUBY19
+          inner_text.force_encoding('ASCII-8BIT')
+          content_score += inner_text.split(',').length
+          content_score += [(inner_text.length / 100).to_i, 3].min
+        end
         candidates[parent_node][:content_score] += content_score
         candidates[grand_parent_node][:content_score] += content_score / 2.0 if grand_parent_node
@@ -209,7 +219,8 @@ module Readability
       @html.css("*").each do |elem|
         if elem.name.downcase == "div"
           # transform <div>s that do not contain other block elements into <p>s
-          if elem.inner_html !~ REGEXES[:divToPElementsRe]
+          elem_inner_html = IS_RUBY19 ? elem.inner_html.dup.force_encoding('ASCII-8BIT') : elem.inner_html
+          if elem_inner_html !~ REGEXES[:divToPElementsRe]
             debug("Altering div(##{elem[:id]}.#{elem[:class]}) to p");
             elem.name = "p"
           end
@@ -255,7 +266,7 @@ module Readability
         if weight + content_score < 0
           el.remove
           debug("Conditionally cleaned #{name}##{el[:id]}.#{el[:class]} with weight #{weight} and content score #{content_score} because score + content score was less than zero.")
-        elsif el.text.count(",") < 10
+        elsif (IS_RUBY19 && el.text.force_encoding("ASCII-8BIT").count(",") < 10) || (!IS_RUBY19 && el.text.count(",") < 10)
           counts = %w[p img li a embed input].inject({}) { |m, kind| m[kind] = el.css(kind).length; m }
           counts["li"] -= 100
@@ -308,13 +319,23 @@ module Readability
           # Otherwise, replace the element with its contents
         else
-          el.swap(el.text)
+          begin
+            el.swap(el.text)
+          rescue => e
+            raise e unless IS_RUBY19
+            el.swap(el.text.force_encoding("ASCII-8BIT"))
+          end
         end
       end
       # Get rid of duplicate whitespace
-      node.to_html.gsub(/[\r\n\f]+/, "\n" ).gsub(/[\t ]+/, " ").gsub(/&nbsp;/, " ")
+      begin
+        node.to_html.gsub(/[\r\n\f]+/, "\n" ).gsub(/[\t ]+/, " ").gsub(/&nbsp;/, " ")
+      rescue => e
+        raise e unless IS_RUBY19
+        node.to_html.force_encoding("ASCII-8BIT").gsub(/[\r\n\f]+/, "\n" ).gsub(/[\t ]+/, " ").gsub(/&nbsp;/, " ")
+      end
     end
   end
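
The `IS_RUBY19` branches above all follow one pattern: when a string operation fails on Ruby 1.9, retry it on a binary (`ASCII-8BIT`) view of the same bytes. The sketch below shows a related approach rather than the code in this diff: it checks `String#valid_encoding?` up front instead of rescuing, and works on a copy instead of mutating the text. Both helper names are hypothetical:

```ruby
# Return the text unchanged when its bytes are valid for its encoding,
# otherwise a binary-tagged copy that byte-oriented operations can handle.
def binary_safe(text)
  return text if text.valid_encoding?
  text.dup.force_encoding(Encoding::BINARY)
end

# Comma and length part of the readability content score (without the base of 1).
def comma_score(text)
  safe = binary_safe(text)
  safe.split(',').length + [safe.length / 100, 3].min
end

broken = "caf\xE9, bar, baz"   # a Latin-1 byte inside a UTF-8 string
score  = comma_score(broken)
```

One trade-off to keep in mind: on the binary copy, `length` counts bytes rather than characters, so the length bonus can differ slightly for multibyte text.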

data/lib/pismo/stopwords.txt CHANGED

@@ -1016,7 +1016,9 @@ your
 yours
 yourself
 yourselves
+generally
 z
 Zachary
 zero
 Zoe
+congratulations

data/lib/pismo.rb CHANGED

@@ -11,11 +11,24 @@ require 'pismo/document'
 require 'pismo/readability'
 module Pismo
-  # Sugar method to make creating document objects nicer
+  # Sugar methods to make creating document objects nicer
   def self.document(handle, url = nil)
     Document.new(handle, url)
   end
+  # Load a URL, as with Pismo['http://www.rubyinside.com'], and caches the Pismo document
+  # (mostly useful for debugging use)
+  def self.[](url)
+    @docs ||= {}
+    @docs[url] ||= Pismo::Document.new(open(url))
+  end
+  # Return stopword list
+  def self.stopwords
+    @stopwords ||= File.read(File.dirname(__FILE__) + '/pismo/stopwords.txt').split rescue []
+  end
   class NFunctions
     def self.match_href(list, expression)
       list.find_all { |node| node['href'] =~ /#{expression}/ }
@@ -33,7 +46,13 @@ class Nokogiri::HTML::Document
     r = [] if all
     [*queries].each do |query|
       if query.is_a?(String)
-        result = self.search(query).first.inner_text.strip rescue nil
+        if el = self.search(query).first
+          if el.name.downcase == "meta"
+            result = el['content'].strip rescue nil
+          else
+            result = el.inner_text.strip rescue nil
+          end
+        end
       elsif query.is_a?(Array)
         result = query[1].call(self.search(query.first).first).strip rescue nil
       end
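
The `Pismo.[]` convenience method added in this file memoizes documents in a hash keyed by URL. A minimal sketch of the same pattern, with a hypothetical `fetch_document` stand-in in place of `open(url)` and `Pismo::Document.new` so that it runs without network access:

```ruby
module CachedLookup
  @calls = 0

  class << self
    attr_reader :calls

    # Hypothetical stand-in for the expensive fetch-and-parse step.
    def fetch_document(url)
      @calls += 1
      "document for #{url}"
    end

    # Same shape as Pismo.[]: build the value once per URL, then reuse it.
    def [](url)
      @docs ||= {}
      @docs[url] ||= fetch_document(url)
    end
  end
end

first  = CachedLookup["http://example.com/post"]
second = CachedLookup["http://example.com/post"]   # served from the cache
```

The cache here is never invalidated and `||=` does not retain `nil` or `false` results, which is consistent with the diff describing the method as mostly useful for debugging. Also note that on Ruby 3.0 and later, open-uri no longer patches `Kernel#open`, so `open(url)` as written in the diff would likely need to become `URI.open(url)`.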

data/pismo.gemspec CHANGED

@@ -5,24 +5,24 @@
 Gem::Specification.new do |s|
   s.name = %q{pismo}
-  s.version = "0.4.0"
+  s.version = "0.5.0"
   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
   s.authors = ["Peter Cooper"]
-  s.date = %q{2010-05-15}
+  s.date = %q{2010-06-01}
   s.default_executable = %q{pismo}
   s.description = %q{Pismo extracts and retrieves content-related metadata from HTML pages - you can use the resulting data in an organized way, such as a summary/first paragraph, body text, keywords, RSS feed URL, favicon, etc.}
   s.email = %q{git@peterc.org}
   s.executables = ["pismo"]
   s.extra_rdoc_files = [
     "LICENSE",
-     "README.rdoc"
+     "README.markdown"
   ]
   s.files = [
     ".document",
      ".gitignore",
      "LICENSE",
-     "README.rdoc",
+     "README.markdown",
      "Rakefile",
      "VERSION",
      "bin/pismo",

data/test/corpus/metadata_expected.yaml CHANGED

@@ -21,6 +21,7 @@
   :title: Gay Muslims made homeless by family violence
   :titles:
     - Gay Muslims made homeless by family violence
+    - BBC News - Gay Muslims made homeless by family violence
   :author: Poonam Taneja
   :authors:
     - Poonam Taneja
@@ -39,7 +40,7 @@
   :authors:
     - ymo1965
 :spolsky:
-  :title: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
+  :title: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) - Joel on Software
   :description: Haven't mastered the basics of Unicode and character sets? Please don't write another line of code until you've read this article.
   :ledes:
     - Ever wonder about that mysterious Content-Type tag? You know, the one you're supposed to put in HTML and you never quite know what it should be?
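
These changed expectations appear to follow from `html_title` in `internal_attributes.rb` no longer stripping site names. The heuristic that was commented out can be sketched on its own, under the hypothetical name `strip_site_name`:

```ruby
# The 0.4.0 heuristic: split on " - ", " | " or " : " and keep the longest piece.
def strip_site_name(title)
  title.split(/\s+(\-|\||\:)\s+/).sort_by { |part| part.length }.last.to_s.strip
end

bbc   = strip_site_name("BBC News - Gay Muslims made homeless by family violence")
about = strip_site_name("About - Joel on Software")
```

The first call keeps the article title, while the second keeps the site name because it is the longer piece. That kind of miss may be why the original comment calls the approach "scrappy" and why 0.5.0 returns the full `<title>` text instead.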

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: pismo
 version: !ruby/object:Gem::Version
-  version: 0.4.0
+  version: 0.5.0
 platform: ruby
 authors:
 - Peter Cooper
@@ -9,7 +9,7 @@ autorequire:
 bindir: bin
 cert_chain: []
-date: 2010-05-15 00:00:00 +01:00
+date: 2010-06-01 00:00:00 +01:00
 default_executable: pismo
 dependencies:
 - !ruby/object:Gem::Dependency
@@ -80,12 +80,12 @@ extensions: []
 extra_rdoc_files:
 - LICENSE
-- README.rdoc
+- README.markdown
 files:
 - .document
 - .gitignore
 - LICENSE
-- README.rdoc
+- README.markdown
 - Rakefile
 - VERSION
 - bin/pismo