RubyGems - pismo - Versions diffs - 0.6.2 → 0.7.0 - Mend

pismo 0.6.2 → 0.7.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (11)

data/README.markdown +9 -5
data/VERSION +1 -1
data/bin/pismo +1 -0
data/lib/pismo/document.rb +11 -4
data/lib/pismo/internal_attributes.rb +2 -2
data/lib/pismo/reader.rb +36 -27
data/lib/pismo/stopwords.txt +10 -69
data/pismo.gemspec +2 -2
data/test/corpus/metadata_expected.yaml +1 -2
data/test/corpus/reader_expected.yaml +2 -5
metadata +2 -2

data/README.markdown CHANGED

@@ -27,6 +27,7 @@ There's also a shorter "convenience" method which might be handy in IRB - it doe
     Pismo['http://www.rubyflow.com/items/4082'].title   # => "Install Ruby as a non-root User"
 The current metadata methods are:
 * title
 * titles
 * author
@@ -50,11 +51,14 @@ The html_body and body methods will be of particular interest. They return the "
 There are some shortcomings or problems that I'm aware of and am going to pursue:
-* I do not know how Pismo fares on JRuby, Rubinius, or others yet.
-* The "Reader" content extraction algorithm is not perfect. It can sometimes return crap and can barf on certain types of characters for sentence extraction.
-* The author name extraction is quite poor.
-* The image extraction only handles images with absolute URLs.
-* The stopword list leaves a bit to be desired. It errs on the side of being too long rather than too short, though (1024 words long!)
+* I do not know how Pismo fares on Rubinius or other versions of 1.9 (e.g. 1.9.2) yet
+* pismo does not install on JRuby due to a problem in the fast-stemmer dependency
+* Some users have had issues with using Pismo from irb. This appears to be related to Nokogiri use causing a segfault
+* The "Reader" content extraction algorithm is not perfect. It can sometimes return crap and can barf on certain types of characters for sentence extraction
+* The author name extraction isn't very strong and is best avoided for now
+* The image extraction only deals with images with absolute URLs
+* The stopword list is a little too long (~1000 words) and needs to be trimmed
+* The corpus in test/corpus needs significantly extending
 ## OTHER GROOVY STUFF:

data/VERSION CHANGED

@@ -1 +1 @@
-0.6.2
+0.7.0

data/bin/pismo CHANGED

@@ -32,6 +32,7 @@ if ARGV.empty?
   P = doc
   @p = doc
   puts "Pismo has loaded #{url} into @p and P"
+  puts "Note: There have been several reports of Nokogiri segfaulting while using Pismo from irb. If this happens, try the same code as a standalone Ruby app."
   IRB.start
 else
   output = { :url => doc.url }

data/lib/pismo/document.rb CHANGED

@@ -33,7 +33,7 @@ module Pismo
                 handle
               end
-      @html = clean_html(@html)
+      @html = self.class.clean_html(@html)
       @doc = Nokogiri::HTML(@html)
     end
@@ -42,11 +42,18 @@ module Pismo
       @doc.match([*args], all)
     end
-    def clean_html(html)
-      html.gsub!('&#8217;', '\'')
-      html.gsub!('&#8221;', '"')
+    def self.clean_html(html)
+      # Normalize stupid entities
+      # TODO: Optimize this so we don't need all these sequential gsubs
+      html.gsub!("&#8194;", " ")
+      html.gsub!("&#8195;", " ")
+      html.gsub!("&#8201;", " ")
       html.gsub!('&#8211;', '-')
+      html.gsub!("&#8216;", "'")
+      html.gsub!('&#8217;', "'")
       html.gsub!('&#8220;', '"')
+      html.gsub!('&#8221;', '"')
+      html.gsub!("&#8230;", '...')
       html.gsub!('&nbsp;', ' ')
       html
     end
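
The TODO in this hunk asks for a way to avoid the sequential gsubs. One possible approach, sketched here and not taken from the gem, is a single pass with a lookup table. `ENTITY_MAP`, `ENTITY_PATTERN` and `normalize_entities` are hypothetical names. Unlike the `gsub!` calls in the diff, this version returns a new string and leaves its argument untouched.

```ruby
# Hypothetical single-pass variant of the entity clean-up shown in the hunk above.
# The entity list mirrors the one in the 0.7.0 version of clean_html.
ENTITY_MAP = {
  '&#8194;' => ' ',   # en space
  '&#8195;' => ' ',   # em space
  '&#8201;' => ' ',   # thin space
  '&#8211;' => '-',   # en dash
  '&#8216;' => "'",   # left single quote
  '&#8217;' => "'",   # right single quote
  '&#8220;' => '"',   # left double quote
  '&#8221;' => '"',   # right double quote
  '&#8230;' => '...', # horizontal ellipsis
  '&nbsp;'  => ' '    # non-breaking space
}.freeze

# Regexp.union escapes each key and joins them into one alternation.
ENTITY_PATTERN = Regexp.union(ENTITY_MAP.keys)

# When gsub is given a Hash, each match is replaced by the value stored under it.
def normalize_entities(html)
  html.gsub(ENTITY_PATTERN, ENTITY_MAP)
end

puts normalize_entities('It&#8217;s &#8220;fine&#8221;&#8230;&nbsp;ok')
```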

data/lib/pismo/internal_attributes.rb CHANGED

@@ -130,7 +130,6 @@ module Pismo
                           '.byline',
                           '.post_subheader_left a',                                         # TechCrunch style
                           '.byl',                                                           # BBC News style
-                          '.meta a',
                           '.articledata .author a',
                           '#owners a',                                                      # Google Code style
                           '.author a',
@@ -147,7 +146,8 @@ module Pismo
                           '.blog_meta a',
                           'cite a',
                           'cite',
-                          '.contributor_details h4 a'
+                          '.contributor_details h4 a',
+                          '.meta a'
                           ], all)
       return unless author
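
Moving `'.meta a'` to the end of the list only matters if the selectors are tried in order and the first hit wins. That is an assumption about how the match helper behaves, not something this diff shows. Under that assumption, the effect of the reordering can be sketched with a plain Hash standing in for a parsed page. `first_match` and the sample data are hypothetical.

```ruby
# Hypothetical stand-in for selector matching: the page is a Hash of
# selector => matched text, and selectors are tried in list order.
def first_match(page, selectors)
  selectors.each do |selector|
    value = page[selector]
    return value if value
  end
  nil
end

# A page where '.meta a' holds non-author link text.
page = {
  '.meta a' => 'Filed under: News',
  '.contributor_details h4 a' => 'Jane Doe'
}

old_order = ['.byline', '.meta a', '.contributor_details h4 a']
new_order = ['.byline', '.contributor_details h4 a', '.meta a']

puts first_match(page, old_order)
puts first_match(page, new_order)
```

With the generic `'.meta a'` selector demoted, the more specific selectors get the first chance to match.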

data/lib/pismo/reader.rb CHANGED

@@ -8,7 +8,7 @@ module Pismo
       attr_reader :raw_content, :doc, :content_candidates
       # Elements to keep for /input/ sanitization
-      OK_ELEMENTS = %w{a td br th tbody table tr div span img strong em b i body html head title p h1 h2 h3 h4 h5 h6 pre code tt ul li ol blockquote font big small section article abbr audio video cite dd dt figure caption sup form dl dt dd}
+      OK_ELEMENTS = %w{a td br th tbody table tr div span img strong em b i body html head title p h1 h2 h3 h4 h5 h6 pre code tt ul li ol blockquote font big small section article abbr audio video cite dd dt figure caption sup form dl dt dd center}
       # Build a tree of attributes that are allowed for each element.. doing it this messy way due to how Sanitize works, alas
       OK_ATTRIBUTES = {}
@@ -21,7 +21,7 @@ module Pismo
       GOOD_WORDS = %w{content post blogpost main story body entry text desc asset hentry single entrytext postcontent bodycontent}.uniq
       # Words that indicate crap in general
-      BAD_WORDS = %w{reply metadata options commenting comments comment about footer header outer credit sidebar widget subscribe clearfix date social bookmarks links share video watch excerpt related supplement accessibility offscreen meta title signup blq secondary feedback featured clearfix small job jobs listing listings navigation nav byline addcomment postcomment trackback neighbor ads commentform fbfans login similar thumb link blogroll grid twitter wrapper container nav sitesub printfooter editsection visualclear catlinks hidden toc contentsub caption disqus rss shoutbox sponsor}.uniq
+      BAD_WORDS = %w{reply metadata options commenting comments comment about footer header outer credit sidebar widget subscribe clearfix date social bookmarks links share video watch excerpt related supplement accessibility offscreen meta title signup blq secondary feedback featured clearfix small job jobs listing listings navigation nav byline addcomment postcomment trackback neighbor ads commentform fbfans login similar thumb link blogroll grid twitter wrapper container nav sitesub printfooter editsection visualclear catlinks hidden toc contentsub caption disqus rss shoutbox sponsor blogcomments}.uniq
       # Words that kill a branch dead
       FATAL_WORDS = %w{comments comment bookmarks social links ads related similar footer digg totop metadata sitesub nav sidebar commenting options addcomment leaderboard offscreen job prevlink prevnext navigation reply-link hide hidden sidebox archives vcard}
@@ -39,7 +39,7 @@ module Pismo
       # Create a document object based on the raw HTML content provided
       def initialize(raw_content)
-        @raw_content = raw_content
+        @raw_content = Pismo::Document.clean_html(raw_content)
         build_doc
       end
@@ -59,6 +59,17 @@ module Pismo
         # Remove scripts manually, Sanitize and/or Nokogiri seem to go a bit funny with them
         @raw_content.gsub!(/\<script .*?\<\/script\>/im, '')
+        # Get rid of bullshit "smart" quotes and other Unicode nonsense
+        @raw_content.force_encoding("ASCII-8BIT") if RUBY_VERSION > "1.9"
+        @raw_content.gsub!("\xe2\x80\x89", " ")
+        @raw_content.gsub!("\xe2\x80\x99", "'")
+        @raw_content.gsub!("\xe2\x80\x98", "'")
+        @raw_content.gsub!("\xe2\x80\x9c", '"')
+        @raw_content.gsub!("\xe2\x80\x9d", '"')
+        @raw_content.gsub!("\xe2\x80\xf6", '.')
+        @raw_content.force_encoding("UTF-8") if RUBY_VERSION > "1.9"
         # Sanitize the HTML
         @raw_content = Sanitize.clean(@raw_content,
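
The byte strings in this hunk are meant to be the UTF-8 encodings of typographic characters. A small check like the following, which is not part of the gem, prints the encoding of each character so the literals can be compared. `utf8_escape` is a hypothetical helper. Note that U+2026 (horizontal ellipsis) encodes as `\xe2\x80\xa6`, and `0xf6` is not a valid UTF-8 continuation byte, so the `"\xe2\x80\xf6"` line may not match real ellipsis characters.

```ruby
# Render a character's UTF-8 bytes in the same \xNN notation the hunk uses.
def utf8_escape(char)
  char.encode('UTF-8').bytes.map { |b| format('\x%02x', b) }.join
end

CHARACTERS = {
  'thin space (U+2009)'          => "\u2009",
  'left single quote (U+2018)'   => "\u2018",
  'right single quote (U+2019)'  => "\u2019",
  'left double quote (U+201C)'   => "\u201c",
  'right double quote (U+201D)'  => "\u201d",
  'horizontal ellipsis (U+2026)' => "\u2026"
}.freeze

CHARACTERS.each do |name, char|
  puts "#{name}: #{utf8_escape(char)}"
end
```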
@@ -70,8 +81,6 @@ module Pismo
         @doc = Nokogiri::HTML(@raw_content, nil, 'utf-8')
-        #ap @raw_content
-        #exit
         build_analysis_tree
       end
@@ -102,20 +111,34 @@ module Pismo
           # Assume that no content we'll want comes in a total package of fewer than 80 characters!
           next unless el.text.to_s.strip.length >= 80
-          ids = (el['id'].to_s + ' ' + el['class'].to_s).downcase.strip.scan(/[a-z]+/)
           path_segments = el.path.scan(/[a-z]+/)[2..-1] || []
           depth = path_segments.length
+          local_ids = (el['id'].to_s + ' ' + el['class'].to_s).downcase.strip.scan(/[a-z]+/)
+          ids = local_ids
+          cp = el.parent
+          (depth - 1).times do
+            ids += (cp['id'].to_s + ' ' + cp['class'].to_s).downcase.strip.scan(/[a-z]+/)
+            cp = cp.parent
+          end if depth > 1
+          #puts "IDS"
+          #ap ids
+          #puts "LOCAL IDS"
+          #ap local_ids
           branch = {}
           branch[:ids] = ids
+          branch[:local_ids] = local_ids
           branch[:score] = -(BAD_WORDS & ids).size
-          branch[:score] += (GOOD_WORDS & ids).size
-          next if branch[:score] < 0
+          branch[:score] += ((GOOD_WORDS & ids).size * 2)
+          next if branch[:score] < -5
           #puts "#{ids.join(",")} - #{branch[:score].to_s} - #{el.text.to_s.strip.length}"
           # Elements that have an ID or class are more likely to be our winners
-          branch[:score] += 2 unless ids.empty?
+          branch[:score] += 2 unless local_ids.empty?
           branch[:name] = el.name
           branch[:depth] = depth
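
The scoring change in this hunk can be read in isolation as follows. This is a simplified sketch, not the gem's code: `GOOD` and `BAD` are shortened subsets of the word lists in the diff, and `initial_score` is a hypothetical name. Here `ids` holds the element's own id/class words plus those gathered from its ancestors, while `local_ids` holds only the element's own.

```ruby
# Shortened stand-ins for GOOD_WORDS and BAD_WORDS.
GOOD = %w[content post main story body entry text].freeze
BAD  = %w[comment comments footer sidebar meta nav].freeze

# Returns the starting score for a branch, or nil when the branch is skipped.
def initial_score(ids, local_ids)
  # Array#& yields unique matches, so a word repeated on several
  # ancestors is only counted once.
  score = -(BAD & ids).size
  score += (GOOD & ids).size * 2      # good words now count double
  return nil if score < -5            # cut-off loosened from 0 to -5
  score += 2 unless local_ids.empty?  # bonus only for the element's own id/class
  score
end

local_ids = %w[entry]
ids = local_ids + %w[content wrapper sidebar]

puts initial_score(ids, local_ids).inspect
puts initial_score(BAD.dup, []).inspect
```

In the first call one bad word is outweighed by two good words; in the second, six bad words and no good ones push the score below the cut-off.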
@@ -198,6 +221,7 @@ module Pismo
           branch[:score] -= 5 if branch[:bad_child_count] > 20
           branch[:score] += depth
+          branch[:score] *= 0.8 if ids.length > 10
@@ -212,8 +236,7 @@ module Pismo
         # Sort the branches by their score in reverse order
         @content_candidates = sorted_tree.reverse.first([5, sorted_tree.length].min)
-        @content_candidates #.map { |i| [i[0], i[1][:name], i[1][:ids].join(','), i[1][:score] ]}
-        #ap @content_candidates
+        #ap @content_candidates #.map { |i| [i[0], i[1][:name], i[1][:ids].join(','), i[1][:score] ]}
         #t2 = Time.now.to_i + (Time.now.usec.to_f / 1000000)
         #puts t2 - t1
         #exit
@@ -278,7 +301,7 @@ module Pismo
             next
           end
-          if el.name == "p" && el.text !~ /\.(\s|$)/ && el.inner_html !~ /\<img/
+          if el.name == "p" && el.text !~ /(\.|\?|\!|\"|\')(\s|$)/ && el.inner_html !~ /\<img/
             el.remove
             next
           end
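
The widened pattern in this hunk keeps paragraphs that end in `?`, `!` or a closing quote, not just a full stop. The two expressions are copied from the diff; the helper name `sentence_like?` and the sample strings are made up for this illustration.

```ruby
OLD_ENDING = /\.(\s|$)/
NEW_ENDING = /(\.|\?|\!|\"|\')(\s|$)/

# True when the text contains sentence-ending punctuation followed by
# whitespace or the end of a line.
def sentence_like?(text, pattern)
  !!(text =~ pattern)
end

SAMPLES = [
  'This is a sentence.',
  'Is this a sentence?',
  'He said "stop"',
  'Read more'
].freeze

SAMPLES.each do |text|
  puts "#{text.inspect}: old=#{sentence_like?(text, OLD_ENDING)} new=#{sentence_like?(text, NEW_ENDING)}"
end
```

In the gem, a `p` element that fails this check and contains no image is removed, so the second and third samples would have been dropped under 0.6.2.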
@@ -321,29 +344,15 @@ module Pismo
         # Remove empty tags
         clean_html.gsub!(/<(\w+)><\/\1>/, "")
-        # Trim leading space from lines but without removing blank lines
-        #clean_html.gsub!(/^\ +(?=\S)/, '')
         # Just a messy, hacky way to make output look nicer with subsequent paragraphs..
         clean_html.gsub!(/<\/(div|p|h1|h2|h3|h4|h5|h6)>/, '</\1>' + "\n\n")
-        # Get rid of bullshit "smart" quotes
-        clean_html.force_encoding("ASCII-8BIT") if RUBY_VERSION > "1.9"
-        clean_html.gsub!("\xe2\x80\x89", " ")
-        clean_html.gsub!("\xe2\x80\x99", "'")
-        clean_html.gsub!("\xe2\x80\x98", "'")
-        clean_html.gsub!("\xe2\x80\x9c", '"')
-        clean_html.gsub!("\xe2\x80\x9d", '"')
-        clean_html.force_encoding("UTF-8") if RUBY_VERSION > "1.9"
         @content[[clean, index]] = clean_html
       end
       def sentences(qty = 3)
-      #  ap content
         clean_content = Sanitize.clean(content, :elements => NON_HEADER_ELEMENTS, :attributes => OK_CLEAN_ATTRIBUTES, :remove_contents => %w{h1 h2 h3 h4 h5 h6})
-        #ap clean_content
-      #exit
         fodder = ''
         doc = Nokogiri::HTML(clean_content, nil, 'utf-8')

data/lib/pismo/stopwords.txt CHANGED

@@ -1,9 +1,3 @@
-0
-1
-10
-100
-20
-a
 a's
 Aaliyah
 Aaron
@@ -17,8 +11,6 @@ accordingly
 across
 actually
 Adam
-add
-added
 Addison
 Adrian
 after
@@ -67,7 +59,6 @@ annual
 another
 Anthony
 Antonio
-any
 anybody
 anyhow
 anyone
@@ -107,7 +98,6 @@ Avery
 away
 awesome
 awfully
-b
 Bailey
 based
 basically
@@ -118,7 +108,6 @@ become
 becomes
 becoming
 been
-before
 beforehand
 behind
 being
@@ -130,16 +119,12 @@ beside
 besides
 best
 better
-between
 beyond
 big
 biggest
-bit
-bits
 Blake
 both
 bother
-box
 Brady
 Brandon
 Brayden
@@ -152,10 +137,7 @@ Brooke
 Brooklyn
 Bryan
 Bryce
-built
 but
-by
-c
 c'mon
 c's
 Caden
@@ -192,15 +174,14 @@ Cody
 Cole
 Colin
 Colton
-com
 come
 comes
 coming
 comment
 company
-compared
 compelling
 concerning
+congratulations
 Connor
 consequently
 consider
@@ -220,7 +201,6 @@ covering
 cunt
 currently
 customizable
-d
 damn
 Daniel
 Danielle
@@ -258,7 +238,6 @@ driven
 drove
 during
 Dylan
-e
 each
 easier
 edu
@@ -278,8 +257,6 @@ end
 english
 enough
 entirely
-episodes
-equals
 Eric
 Erin
 es
@@ -305,12 +282,10 @@ existing
 extensive
 extra
 extremely
-f
 Faith
 false
 fame
 far
-favorite
 feb
 february
 feel
@@ -335,19 +310,20 @@ fuck
 full
 further
 furthermore
-g
 Gabriel
 Gabriella
 Gabrielle
 Garrett
+gave
 Gavin
+generally
 get
 gets
 getting
+give
 given
 gives
 glory
-go
 goal
 goes
 going
@@ -358,7 +334,6 @@ gotten
 Grace
 great
 greetings
-h
 had
 hadn't
 Hailey
@@ -395,23 +370,18 @@ himself
 hire
 his
 hither
-homepage
 hopefully
-hour
-hours
 how
 howbeit
 however
 huge
 Hunter
-i
 i'd
 i'll
 i'm
 i've
 Ian
 ie
-if
 ignored
 imagine
 immediate
@@ -428,7 +398,6 @@ indicates
 informative
 inhibits
 inner
-inside
 insofar
 instead
 interest
@@ -448,9 +417,7 @@ it'll
 it's
 its
 itself
-itunes
 Ivan
-j
 Jack
 Jackson
 Jacob
@@ -492,7 +459,6 @@ jun
 june
 just
 Justin
-k
 Kaden
 Kaitlyn
 Kaleb
@@ -513,7 +479,6 @@ known
 knows
 Kyle
 Kylie
-l
 la
 Landon
 last
@@ -541,8 +506,6 @@ line
 listing
 listings
 little
-live
-loading
 Logan
 look
 looking
@@ -555,7 +518,6 @@ ltd
 Lucas
 Luis
 Luke
-m
 Mackenzie
 Madeline
 Madison
@@ -592,8 +554,6 @@ Michelle
 might
 Miguel
 mile
-minute
-minutes
 more
 moreover
 Morgan
@@ -601,11 +561,9 @@ most
 mostly
 moving
 much
-multiple
 must
 my
 myself
-n
 name
 namely
 Natalie
@@ -644,7 +602,6 @@ novel
 november
 now
 nowhere
-o
 Obie
 obviously
 oct
@@ -670,7 +627,6 @@ or
 org
 oriented
 Oscar
-other
 others
 otherwise
 ought
@@ -678,12 +634,9 @@ our
 ours
 ourselves
 out
-outside
-over
 overall
 Owen
 own
-p
 Paige
 par
 Parker
@@ -694,7 +647,6 @@ Patrick
 Paul
 peasy
 per
-perform
 perhaps
 piece
 placed
@@ -714,11 +666,9 @@ proud
 provide
 provides
 put
-q
 que
 quite
 qv
-r
 Rachel
 rather
 rd
@@ -744,7 +694,6 @@ Riley
 Robert
 run
 Ryan
-s
 safest
 said
 Samantha
@@ -778,7 +727,6 @@ september
 serious
 seriously
 set
-Seth
 settings
 seven
 several
@@ -822,6 +770,7 @@ step
 Stephanie
 Steven
 still
+stuff
 sub
 subscribe
 such
@@ -831,7 +780,6 @@ sup
 sur
 sure
 Sydney
-t
 t's
 take
 taken
@@ -871,6 +819,8 @@ they'd
 they'll
 they're
 they've
+thing
+things
 think
 third
 this
@@ -909,12 +859,9 @@ twice
 two
 Tyler
 typically
-u
 ultra
 un
-under
 unfortunately
-unless
 unlikely
 unsurprisingly
 until
@@ -929,7 +876,6 @@ uses
 using
 usually
 uucp
-v
 value
 Vanessa
 various
@@ -940,7 +886,6 @@ Victoria
 Vincent
 viz
 vs
-w
 walks
 want
 wants
@@ -982,6 +927,8 @@ who's
 whoever
 whole
 whom
+approximate
+approximately
 whose
 why
 will
@@ -1000,11 +947,8 @@ would
 wouldn't
 wrapped
 Wyatt
-x
 Xavier
-y
 yeah
-years
 yes
 yet
 you
@@ -1016,9 +960,6 @@ your
 yours
 yourself
 yourselves
-generally
-z
 Zachary
 zero
-Zoe
-congratulations
+Zoe

data/pismo.gemspec CHANGED

@@ -5,11 +5,11 @@
 Gem::Specification.new do |s|
   s.name = %q{pismo}
-  s.version = "0.6.2"
+  s.version = "0.7.0"
   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
   s.authors = ["Peter Cooper"]
-  s.date = %q{2010-06-20}
+  s.date = %q{2010-07-27}
   s.default_executable = %q{pismo}
   s.description = %q{Pismo extracts and retrieves content-related metadata from HTML pages - you can use the resulting data in an organized way, such as a summary/first paragraph, body text, keywords, RSS feed URL, favicon, etc.}
   s.email = %q{git@peterc.org}

data/test/corpus/metadata_expected.yaml CHANGED

@@ -42,7 +42,7 @@
 :spolsky:
   :title: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) - Joel on Software
   :description: Haven't mastered the basics of Unicode and character sets? Please don't write another line of code until you've read this article.
-  :lede: I've been dismayed to discover just how many software developers aren't really completely up to speed on the mysterious world of character sets, encodings, Unicode, all that stuff. A couple of years ago, a beta tester for FogBUGZ was wondering whether it could handle incoming email in Japanese. Japanese?
+  :lede: Ever wonder about that mysterious Content-Type tag? You know, the one you're supposed to put in HTML and you never quite know what it should be? Did you ever get an email from your friends in Bulgaria with the subject line "????
   :author: Joel Spolsky
   :favicon: /favicon.ico
   :feed: http://www.joelonsoftware.com/rss.xml
@@ -61,7 +61,6 @@
 :tweet:
   :lede: Gobsmacked that TeX/LaTeX (document formatting tools) for OS X is a 1.3GB (yes, GIGAbytes) download OS X. Wow..!
   :sentences: Gobsmacked that TeX/LaTeX (document formatting tools) for OS X is a 1.3GB (yes, GIGAbytes) download OS Wow..!
-  :datetime: 2010-06-05 12:00:00 +01:00
 :cant_read:
   :sentences: "For those of us who grew up as weird kids in the 1980s, the work of Berkeley Breathed was as important as those twin eternal pillars of weird-kid-dom: Monty Python and Mad magazine. In a word: seminal. In two words: fucking seminal."
 :gmane:

data/test/corpus/reader_expected.yaml CHANGED

@@ -27,16 +27,13 @@
 - "I'm just aching to know if the new Apple tablet (insert caveats, weasel words and qualifiers here) is a potential Cintiq competitor."
 - "I don't think it will be, but you never know."
 :spolsky:
-- "I've been dismayed to discover just how many software developers aren't really completely up to speed on the mysterious world of character sets, encodings, Unicode, all that stuff."
-- "A couple of years ago, a beta tester for FogBUGZ was wondering whether it could handle incoming email in Japanese."
+- "Ever wonder about that mysterious Content-Type tag?"
+- "You know, the one you're supposed to put in HTML and you never quite know what it should be?"
 :techcrunch:
 - "Last week, we covered Googlle opening a school in India."
 - "Googlle, not to be confused with Google."
 :tweet:
 - "Gobsmacked that TeX/LaTeX (document formatting tools) for OS X is a 1.3GB (yes, GIGAbytes) download OS Wow..!"
-:youtube:
-- "The location filter shows you popular videos from the selected country or region on lists like Most Viewed and in search results.If you would like to change either of these preferences, please use the links in the footer at the bottom of the page."
-- "Click \"OK\" to accept these settings or click \"Cancel\" to set your language preference to \"English (UK)\" and your location filter to \"Worldwide\"."
 :zefrank:
 - "If there's anyone who knows how to marshal an online audience, it's Ze Frank."
 - "Ze is best-known for his 2006 program \"The Show,\" in which he made a new 2-3 minute video every day for 1 year."

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: pismo
 version: !ruby/object:Gem::Version
-  version: 0.6.2
+  version: 0.7.0
 platform: ruby
 authors:
 - Peter Cooper
@@ -9,7 +9,7 @@ autorequire:
 bindir: bin
 cert_chain: []
-date: 2010-06-20 00:00:00 +01:00
+date: 2010-07-27 00:00:00 +01:00
 default_executable: pismo
 dependencies:
 - !ruby/object:Gem::Dependency