pismo 0.7.0 → 0.7.1
- data/.gitignore +5 -0
- data/Gemfile +4 -0
- data/README.markdown +13 -2
- data/Rakefile +2 -28
- data/bin/pismo +1 -1
- data/lib/pismo.rb +7 -3
- data/lib/pismo/internal_attributes.rb +20 -21
- data/lib/pismo/stopwords.txt +40 -3
- data/lib/pismo/version.rb +3 -0
- data/pismo.gemspec +24 -94
- data/test/corpus/metadata_expected.yaml +8 -8
- metadata +81 -45
- data/VERSION +0 -1
    
        data/.gitignore
    CHANGED
    
    
    
        data/Gemfile
    ADDED
    
    
    
        data/README.markdown
    CHANGED
    
    | @@ -6,7 +6,11 @@ Pismo extracts machine-usable metadata from unstructured (or poorly structured) | |
| 6 6 | 
             
            Data that Pismo can extract include titles, feed URLs, ledes, body text, image URLs, date, and keywords.
         | 
| 7 7 | 
             
            Pismo is used heavily in production on http://coder.io/ to extract data from Web pages.
         | 
| 8 8 |  | 
| 9 | 
            -
            All tests pass on Ruby 1.8.7  | 
| 9 | 
            +
            All tests pass on Ruby 1.8.7, Ruby 1.9.2 (both MRI) and JRuby 1.5.6.
         | 
| 10 | 
            +
             | 
| 11 | 
            +
            ## NEWS:
         | 
| 12 | 
            +
             | 
| 13 | 
            +
            December 19, 2010: Version 1.7.1 has been released - it includes a patch from Darcy Laycock to fix keyword extraction problems on some pages, has switched from Jeweler to Bundler for management of the gem, and adds support for JRuby 1.5.6 by skipping stemming on that platform.
         | 
| 10 14 |  | 
| 11 15 | 
             
            ## USAGE:
         | 
| 12 16 |  | 
| @@ -46,12 +50,19 @@ The current metadata methods are: | |
| 46 50 | 
             
            These methods are not fully documented here yet - you'll just need to try them out. The plural methods like #titles, #authors, and #feeds will return multiple matches in an array, if present. This is so you can use your own techniques to choose a "best" result in ambiguous cases.
         | 
| 47 51 |  | 
| 48 52 | 
             
            The html_body and body methods will be of particular interest. They return the "body" of the page as determined by Pismo's "Reader" (like Arc90's Readability or Safari Reader) algorithm. #body returns it as plain-text, #html_body maintains some basic HTML styling.
         | 
| 53 | 
            +
             | 
| 54 | 
            +
            New! The keywords method accepts optional arguments. These are the current defaults:
         | 
| 55 | 
            +
             | 
| 56 | 
            +
                :stem_at => 20, :word_length_limit => 15, :limit => 20, :remove_stopwords => true, :minimum_score => 2
         | 
| 57 | 
            +
                
         | 
| 58 | 
            +
            You can also pass an array to keywords with :hints => arr if you want only words of your choosing to be found.
         | 
| 49 59 |  | 
| 50 60 | 
             
            ## CAVEATS AND SHORTCOMINGS:
         | 
| 51 61 |  | 
| 52 62 | 
             
            There are some shortcomings or problems that I'm aware of and am going to pursue:
         | 
| 53 63 |  | 
| 54 | 
            -
            * I do not know how Pismo fares on Rubinius | 
| 64 | 
            +
            * I do not know how Pismo fares on Rubinius
         | 
| 65 | 
            +
            * pismo requires Bundler - get it :-)
         | 
| 55 66 | 
             
            * pismo does not install on JRuby due to a problem in the fast-stemmer dependency
         | 
| 56 67 | 
             
            * Some users have had issues with using Pismo from irb. This appears to be related to Nokogiri use causing a segfault
         | 
| 57 68 | 
             
            * The "Reader" content extraction algorithm is not perfect. It can sometimes return crap and can barf on certain types of characters for sentence extraction
         | 
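
The README addition above lists the defaults for the keywords method and the optional :hints array. The following is a minimal sketch of how caller-supplied options override those defaults through Hash#merge, which is the same idiom the changed keywords method uses later in this diff. KEYWORD_DEFAULTS and keyword_options are hypothetical names used only for illustration; they are not part of Pismo.

```ruby
# Defaults as listed in the README for this release.
KEYWORD_DEFAULTS = {
  :stem_at => 20,
  :word_length_limit => 15,
  :limit => 20,
  :remove_stopwords => true,
  :minimum_score => 2
}.freeze

# Hypothetical helper: options passed by the caller win over the defaults,
# and keys that have no default (such as :hints) are simply carried along.
def keyword_options(options = {})
  KEYWORD_DEFAULTS.merge(options)
end

opts = keyword_options(:limit => 5, :hints => %w[ruby rails])
puts opts[:limit]          # the caller's value
puts opts[:minimum_score]  # the untouched default
puts opts[:hints].inspect
```

With the gem installed, the equivalent call on a Pismo document object would presumably look like doc.keywords(:limit => 5, :hints => %w[ruby rails]), where doc is assumed to be an already loaded document.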
    
        data/Rakefile
    CHANGED
    
    | @@ -1,29 +1,5 @@ | |
| 1 | 
            -
            require ' | 
| 2 | 
            -
             | 
| 3 | 
            -
             | 
| 4 | 
            -
            begin
         | 
| 5 | 
            -
              require 'jeweler'
         | 
| 6 | 
            -
              Jeweler::Tasks.new do |gem|
         | 
| 7 | 
            -
                gem.name = "pismo"
         | 
| 8 | 
            -
                gem.summary = %Q{Extracts or retrieves content-related metadata from HTML pages}
         | 
| 9 | 
            -
                gem.description = %Q{Pismo extracts and retrieves content-related metadata from HTML pages - you can use the resulting data in an organized way, such as a summary/first paragraph, body text, keywords, RSS feed URL, favicon, etc.}
         | 
| 10 | 
            -
                gem.email = "git@peterc.org"
         | 
| 11 | 
            -
                gem.homepage = "http://github.com/peterc/pismo"
         | 
| 12 | 
            -
                gem.authors = ["Peter Cooper"]
         | 
| 13 | 
            -
                gem.executables = "pismo"
         | 
| 14 | 
            -
                gem.default_executable = "pismo"
         | 
| 15 | 
            -
                gem.add_development_dependency "shoulda", ">= 0"
         | 
| 16 | 
            -
                gem.add_development_dependency "awesome_print"
         | 
| 17 | 
            -
                gem.add_dependency "jeweler"
         | 
| 18 | 
            -
                gem.add_dependency "nokogiri"
         | 
| 19 | 
            -
                gem.add_dependency "sanitize"
         | 
| 20 | 
            -
                gem.add_dependency "fast-stemmer"
         | 
| 21 | 
            -
                gem.add_dependency "chronic"
         | 
| 22 | 
            -
              end
         | 
| 23 | 
            -
              Jeweler::GemcutterTasks.new
         | 
| 24 | 
            -
            rescue LoadError
         | 
| 25 | 
            -
              puts "Jeweler (or a dependency) not available. Install it with: gem install jeweler"
         | 
| 26 | 
            -
            end
         | 
| 1 | 
            +
            require 'bundler'
         | 
| 2 | 
            +
            Bundler::GemHelper.install_tasks
         | 
| 27 3 |  | 
| 28 4 | 
             
            require 'rake/testtask'
         | 
| 29 5 | 
             
            Rake::TestTask.new(:test) do |test|
         | 
| @@ -45,8 +21,6 @@ rescue LoadError | |
| 45 21 | 
             
              end
         | 
| 46 22 | 
             
            end
         | 
| 47 23 |  | 
| 48 | 
            -
            task :test => :check_dependencies
         | 
| 49 | 
            -
             | 
| 50 24 | 
             
            task :default => :test
         | 
| 51 25 |  | 
| 52 26 | 
             
            require 'rake/rdoctask'
         | 
    
        data/bin/pismo
    CHANGED
    
    | @@ -32,7 +32,7 @@ if ARGV.empty? | |
| 32 32 | 
             
              P = doc
         | 
| 33 33 | 
             
              @p = doc
         | 
| 34 34 | 
             
              puts "Pismo has loaded #{url} into @p and P"
         | 
| 35 | 
            -
              puts "Note: There have been several reports of Nokogiri segfaulting while using Pismo from irb. If this happens, try the same code as a standalone Ruby app."
         | 
| 35 | 
            +
              #puts "Note: There have been several reports of Nokogiri segfaulting while using Pismo from irb. If this happens, try the same code as a standalone Ruby app."
         | 
| 36 36 | 
             
              IRB.start
         | 
| 37 37 | 
             
            else
         | 
| 38 38 | 
             
              output = { :url => doc.url }
         | 
    
        data/lib/pismo.rb
    CHANGED
    
    | @@ -2,7 +2,6 @@ | |
| 2 2 |  | 
| 3 3 | 
             
            require 'open-uri'
         | 
| 4 4 | 
             
            require 'nokogiri'
         | 
| 5 | 
            -
            require 'fast_stemmer'
         | 
| 6 5 | 
             
            require 'chronic'
         | 
| 7 6 | 
             
            require 'sanitize'
         | 
| 8 7 | 
             
            require 'tempfile'
         | 
| @@ -11,6 +10,12 @@ $: << File.dirname(__FILE__) | |
| 11 10 | 
             
            require 'pismo/document'
         | 
| 12 11 | 
             
            require 'pismo/reader'
         | 
| 13 12 |  | 
| 13 | 
            +
            if RUBY_PLATFORM == "java"
         | 
| 14 | 
            +
              class String; def stem; self; end; end
         | 
| 15 | 
            +
            else
         | 
| 16 | 
            +
              require 'fast_stemmer'
         | 
| 17 | 
            +
            end
         | 
| 18 | 
            +
             | 
| 14 19 | 
             
            module Pismo
         | 
| 15 20 | 
             
              # Sugar methods to make creating document objects nicer
         | 
| 16 21 | 
             
              def self.document(handle, url = nil)
         | 
| @@ -59,8 +64,7 @@ class Nokogiri::HTML::Document | |
| 59 64 | 
             
                  end
         | 
| 60 65 |  | 
| 61 66 | 
             
                  if result
         | 
| 62 | 
            -
             | 
| 63 | 
            -
                  #  result.gsub!(/\342\200\224/, '-')
         | 
| 67 | 
            +
                    # TODO: Sort out sanitization in a more centralized way
         | 
| 64 68 | 
             
                    result.gsub!('’', '\'')
         | 
| 65 69 | 
             
                    result.gsub!('—', '-')
         | 
| 66 70 | 
             
                    if all
         | 
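
The platform check in the first hunk above defines a no-op String#stem on JRuby instead of requiring fast_stemmer. The sketch below shows a related pattern, not what the diff itself does: attempt the require and fall back on LoadError, so that any platform where the gem is missing degrades the same way. It assumes that an unstemmed word is an acceptable result.

```ruby
# Variant of the fallback in this diff: rescue LoadError instead of
# testing RUBY_PLATFORM.
begin
  require 'fast_stemmer'   # provides String#stem when the gem is installed
rescue LoadError
  # No stemmer available: make stem a no-op so callers need no special case.
  class String
    def stem
      self
    end
  end
end

# Prints a stemmed or an unchanged word, depending on which branch ran.
puts "running".stem
```

The trade-off is that a missing gem on MRI is silently tolerated, whereas the explicit RUBY_PLATFORM check in the diff still fails loudly there.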
    
        data/lib/pismo/internal_attributes.rb
    CHANGED
    
    | @@ -15,6 +15,7 @@ module Pismo | |
| 15 15 | 
             
                                        '.post-header h1',
         | 
| 16 16 | 
             
                                        '.entry-title',
         | 
| 17 17 | 
             
                                        '.post-title',
         | 
| 18 | 
            +
                                        '.post h1',
         | 
| 18 19 | 
             
                                        '.post h3 a',
         | 
| 19 20 | 
             
                                        'a.datitle',          # Slashdot style
         | 
| 20 21 | 
             
                                        '.posttitle',
         | 
| @@ -93,9 +94,7 @@ module Pismo | |
| 93 94 | 
             
                  datetime = 10
         | 
| 94 95 |  | 
| 95 96 | 
             
                  regexen.each do |r|
         | 
| 96 | 
            -
                    datetime = @doc.to_html[r]
         | 
| 97 | 
            -
                    # p datetime
         | 
| 98 | 
            -
                    break if datetime
         | 
| 97 | 
            +
                    break if datetime = @doc.to_html[r]
         | 
| 99 98 | 
             
                  end
         | 
| 100 99 |  | 
| 101 100 | 
             
                  return unless datetime && datetime.length > 4
         | 
| @@ -111,10 +110,6 @@ module Pismo | |
| 111 110 | 
             
                  Chronic.parse(datetime) || datetime
         | 
| 112 111 | 
             
                end
         | 
| 113 112 |  | 
| 114 | 
            -
                # TODO: Attempts to work out what type of site or page the page is from the provided URL
         | 
| 115 | 
            -
                # def site_type
         | 
| 116 | 
            -
                # end
         | 
| 117 | 
            -
                
         | 
| 118 113 | 
             
                # Returns the author of the page/content
         | 
| 119 114 | 
             
                def author(all = false)
         | 
| 120 115 | 
             
                  author = @doc.match([
         | 
| @@ -189,13 +184,15 @@ module Pismo | |
| 189 184 | 
             
                              '.post-text p',
         | 
| 190 185 | 
             
                              '#blogpost p',
         | 
| 191 186 | 
             
                              '.story-teaser',
         | 
| 192 | 
            -
                              ' | 
| 187 | 
            +
                              '.article .body p',
         | 
| 188 | 
            +
                              '//div[@class="entrytext"]//p[string-length()>40]',                      # Ruby Inside / Kubrick style
         | 
| 193 189 | 
             
                              'section p',
         | 
| 194 190 | 
             
                              '.entry .text p',
         | 
| 191 | 
            +
                              '.hentry .content p',
         | 
| 195 192 | 
             
                              '.entry-content p',
         | 
| 196 193 | 
             
                              '#wikicontent p',                                                        # Google Code style
         | 
| 197 194 | 
             
                              '.wikistyle p',                                                          # GitHub style
         | 
| 198 | 
            -
                              '//td[@class="storybody"]/p[string-length()> | 
| 195 | 
            +
                              '//td[@class="storybody"]/p[string-length()>40]',                        # BBC News style
         | 
| 199 196 | 
             
                              '//div[@class="entry"]//p[string-length()>100]',
         | 
| 200 197 | 
             
                              # The below is a horrible, horrible way to pluck out lead paras from crappy Blogspot blogs that
         | 
| 201 198 | 
             
                              # don't use <p> tags..
         | 
| @@ -212,16 +209,16 @@ module Pismo | |
| 212 209 |  | 
| 213 210 | 
             
                  # TODO: Improve sentence extraction - this is dire even if it "works for now"
         | 
| 214 211 | 
             
                  if lede && String === lede
         | 
| 215 | 
            -
                    return (lede[/^(.*?[\.\!\?]\s){ | 
| 212 | 
            +
                    return (lede[/^(.*?[\.\!\?]\s){1,3}/m] || lede).to_s.strip
         | 
| 216 213 | 
             
                  elsif lede && Array === lede
         | 
| 217 | 
            -
                    return lede.map { |l| l.to_s[/^(.*?[\.\!\?]\s){ | 
| 214 | 
            +
                    return lede.map { |l| l.to_s[/^(.*?[\.\!\?]\s){1,3}/m].strip || l }.uniq
         | 
| 218 215 | 
             
                  else
         | 
| 219 | 
            -
                    return reader_doc && !reader_doc.sentences( | 
| 216 | 
            +
                    return reader_doc && !reader_doc.sentences(4).empty? ? reader_doc.sentences(4).join(' ') : nil
         | 
| 220 217 | 
             
                  end
         | 
| 221 218 | 
             
                end
         | 
| 222 219 |  | 
| 223 220 | 
             
                def ledes
         | 
| 224 | 
            -
                  lede(true)
         | 
| 221 | 
            +
                  lede(true) rescue []
         | 
| 225 222 | 
             
                end
         | 
| 226 223 |  | 
| 227 224 | 
             
                # Returns a string containing the first [limit] sentences as determined by the Reader algorithm
         | 
| @@ -236,29 +233,31 @@ module Pismo | |
| 236 233 |  | 
| 237 234 | 
             
                # Returns the "keywords" in the document (not the meta keywords - they're next to useless now)
         | 
| 238 235 | 
             
                def keywords(options = {})
         | 
| 239 | 
            -
                  options = { :stem_at => 20, :word_length_limit => 15, :limit => 20 }.merge(options)
         | 
| 236 | 
            +
                  options = { :stem_at => 20, :word_length_limit => 15, :limit => 20, :remove_stopwords => true, :minimum_score => 2 }.merge(options)
         | 
| 240 237 |  | 
| 241 238 | 
             
                  words = {}
         | 
| 242 239 |  | 
| 243 240 | 
             
                  # Convert doc to lowercase, scrub out most HTML tags, then keep track of words
         | 
| 244 | 
            -
                  cached_title = title
         | 
| 241 | 
            +
                  cached_title = title.to_s
         | 
| 245 242 | 
             
                  content_to_use = body.to_s.downcase + " " + description.to_s.downcase
         | 
| 246 243 |  | 
| 247 244 | 
             
                  # old regex for safe keeping -- \b[a-z][a-z\+\.\'\+\#\-]*\b
         | 
| 248 | 
            -
                  content_to_use.downcase.gsub(/\<[^\>]{1,100}\>/, '').gsub(/\.+\s+/, ' ').gsub(/\&\w+\;/, '').scan(/(\b|\s|\A)([a-z0-9][a-z0-9\+\.\' | 
| 245 | 
            +
                  content_to_use.downcase.gsub(/\<[^\>]{1,100}\>/, '').gsub(/\.+\s+/, ' ').gsub(/\&\w+\;/, '').scan(/(\b|\s|\A)([a-z0-9][a-z0-9\+\.\'\+\#\-\\]*)(\b|\s|\Z)/i).map{ |ta1| ta1[1] }.compact.each do |word|
         | 
| 249 246 | 
             
                    next if word.length > options[:word_length_limit]
         | 
| 250 | 
            -
                    word.gsub!( | 
| 247 | 
            +
                    word.gsub!(/^[\']/, '')
         | 
| 248 | 
            +
                    word.gsub!(/[\.\-\']$/, '')
         | 
| 249 | 
            +
                    next if options[:hints] && !options[:hints].include?(word)
         | 
| 251 250 | 
             
                    words[word] ||= 0
         | 
| 252 | 
            -
                    words[word] += (cached_title.downcase | 
| 251 | 
            +
                    words[word] += (cached_title.downcase =~ /\b#{word}\b/ ? 5 : 1)
         | 
| 253 252 | 
             
                  end
         | 
| 254 253 |  | 
| 255 254 | 
             
                  # Stem the words and stop words if necessary
         | 
| 256 255 | 
             
                  d = words.keys.uniq.map { |a| a.length > options[:stem_at] ? a.stem : a }
         | 
| 257 256 | 
             
                  s = Pismo.stopwords.map { |a| a.length > options[:stem_at] ? a.stem : a }
         | 
| 258 257 |  | 
| 259 | 
            -
             | 
| 260 | 
            -
                   | 
| 261 | 
            -
                   | 
| 258 | 
            +
                  words.delete_if { |k1, v1| v1 < options[:minimum_score] }
         | 
| 259 | 
            +
                  words.delete_if { |k1, v1| s.include?(k1) } if options[:remove_stopwords]
         | 
| 260 | 
            +
                  words.sort_by { |k2, v2| v2 }.reverse.first(options[:limit])
         | 
| 262 261 | 
             
                end
         | 
| 263 262 |  | 
| 264 263 | 
             
                def reader_doc
         | 
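
The keywords hunk above scores each word, weights words that also appear in the title, and then filters by minimum score and stopwords. The following is a simplified, self-contained sketch of that scoring flow, not the gem's implementation. keyword_scores and the tiny STOPWORDS list are hypothetical stand-ins, the word pattern is deliberately simpler than the one in the diff, and Regexp.escape is added here although the diff interpolates the word directly.

```ruby
# Tiny stand-in for lib/pismo/stopwords.txt (assumption, for illustration only).
STOPWORDS = %w[a and is of the].freeze

# Hypothetical helper mirroring the flow in the diff:
# count words, weight title words, drop low scores and stopwords, sort.
def keyword_scores(text, title, minimum_score = 2, limit = 20)
  title = title.to_s.downcase
  words = Hash.new(0)

  text.to_s.downcase.scan(/[a-z0-9][a-z0-9'\-]*/).each do |word|
    # A word found in the title counts 5 per occurrence, otherwise 1.
    words[word] += (title =~ /\b#{Regexp.escape(word)}\b/ ? 5 : 1)
  end

  words.delete_if { |_, score| score < minimum_score }
  words.delete_if { |word, _| STOPWORDS.include?(word) }
  words.sort_by { |_, score| score }.reverse.first(limit)
end

p keyword_scores("Ruby is fun. Ruby and the web.", "Ruby news")
# => [["ruby", 10]]
```

In this sample only "ruby" survives: it occurs twice and matches the title, so it scores 10, while every other word scores 1 and falls below the minimum score of 2.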
    
        data/lib/pismo/stopwords.txt
    CHANGED
    
    | @@ -1,3 +1,4 @@ | |
| 1 | 
            +
            a
         | 
| 1 2 | 
             
            a's
         | 
| 2 3 | 
             
            Aaliyah
         | 
| 3 4 | 
             
            Aaron
         | 
| @@ -70,6 +71,8 @@ apart | |
| 70 71 | 
             
            appear
         | 
| 71 72 | 
             
            appreciate
         | 
| 72 73 | 
             
            appropriate
         | 
| 74 | 
            +
            approximate
         | 
| 75 | 
            +
            approximately
         | 
| 73 76 | 
             
            apr
         | 
| 74 77 | 
             
            april
         | 
| 75 78 | 
             
            are
         | 
| @@ -138,6 +141,7 @@ Brooklyn | |
| 138 141 | 
             
            Bryan
         | 
| 139 142 | 
             
            Bryce
         | 
| 140 143 | 
             
            but
         | 
| 144 | 
            +
            by
         | 
| 141 145 | 
             
            c'mon
         | 
| 142 146 | 
             
            c's
         | 
| 143 147 | 
             
            Caden
         | 
| @@ -238,6 +242,7 @@ driven | |
| 238 242 | 
             
            drove
         | 
| 239 243 | 
             
            during
         | 
| 240 244 | 
             
            Dylan
         | 
| 245 | 
            +
            e
         | 
| 241 246 | 
             
            each
         | 
| 242 247 | 
             
            easier
         | 
| 243 248 | 
             
            edu
         | 
| @@ -282,6 +287,7 @@ existing | |
| 282 287 | 
             
            extensive
         | 
| 283 288 | 
             
            extra
         | 
| 284 289 | 
             
            extremely
         | 
| 290 | 
            +
            f
         | 
| 285 291 | 
             
            Faith
         | 
| 286 292 | 
             
            false
         | 
| 287 293 | 
             
            fame
         | 
| @@ -310,6 +316,7 @@ fuck | |
| 310 316 | 
             
            full
         | 
| 311 317 | 
             
            further
         | 
| 312 318 | 
             
            furthermore
         | 
| 319 | 
            +
            g
         | 
| 313 320 | 
             
            Gabriel
         | 
| 314 321 | 
             
            Gabriella
         | 
| 315 322 | 
             
            Gabrielle
         | 
| @@ -334,6 +341,7 @@ gotten | |
| 334 341 | 
             
            Grace
         | 
| 335 342 | 
             
            great
         | 
| 336 343 | 
             
            greetings
         | 
| 344 | 
            +
            h
         | 
| 337 345 | 
             
            had
         | 
| 338 346 | 
             
            hadn't
         | 
| 339 347 | 
             
            Hailey
         | 
| @@ -376,12 +384,14 @@ howbeit | |
| 376 384 | 
             
            however
         | 
| 377 385 | 
             
            huge
         | 
| 378 386 | 
             
            Hunter
         | 
| 387 | 
            +
            i
         | 
| 379 388 | 
             
            i'd
         | 
| 380 389 | 
             
            i'll
         | 
| 381 390 | 
             
            i'm
         | 
| 382 391 | 
             
            i've
         | 
| 383 392 | 
             
            Ian
         | 
| 384 393 | 
             
            ie
         | 
| 394 | 
            +
            if
         | 
| 385 395 | 
             
            ignored
         | 
| 386 396 | 
             
            imagine
         | 
| 387 397 | 
             
            immediate
         | 
| @@ -418,6 +428,7 @@ it's | |
| 418 428 | 
             
            its
         | 
| 419 429 | 
             
            itself
         | 
| 420 430 | 
             
            Ivan
         | 
| 431 | 
            +
            j
         | 
| 421 432 | 
             
            Jack
         | 
| 422 433 | 
             
            Jackson
         | 
| 423 434 | 
             
            Jacob
         | 
| @@ -440,6 +451,7 @@ Jessica | |
| 440 451 | 
             
            Jesus
         | 
| 441 452 | 
             
            jim
         | 
| 442 453 | 
             
            jimmy
         | 
| 454 | 
            +
            jnr
         | 
| 443 455 | 
             
            Jocelyn
         | 
| 444 456 | 
             
            Joel
         | 
| 445 457 | 
             
            John
         | 
| @@ -450,6 +462,7 @@ Jose | |
| 450 462 | 
             
            Joseph
         | 
| 451 463 | 
             
            Joshua
         | 
| 452 464 | 
             
            Josiah
         | 
| 465 | 
            +
            jr
         | 
| 453 466 | 
             
            Juan
         | 
| 454 467 | 
             
            jul
         | 
| 455 468 | 
             
            Julia
         | 
| @@ -459,6 +472,7 @@ jun | |
| 459 472 | 
             
            june
         | 
| 460 473 | 
             
            just
         | 
| 461 474 | 
             
            Justin
         | 
| 475 | 
            +
            k
         | 
| 462 476 | 
             
            Kaden
         | 
| 463 477 | 
             
            Kaitlyn
         | 
| 464 478 | 
             
            Kaleb
         | 
| @@ -479,6 +493,7 @@ known | |
| 479 493 | 
             
            knows
         | 
| 480 494 | 
             
            Kyle
         | 
| 481 495 | 
             
            Kylie
         | 
| 496 | 
            +
            l
         | 
| 482 497 | 
             
            la
         | 
| 483 498 | 
             
            Landon
         | 
| 484 499 | 
             
            last
         | 
| @@ -518,6 +533,7 @@ ltd | |
| 518 533 | 
             
            Lucas
         | 
| 519 534 | 
             
            Luis
         | 
| 520 535 | 
             
            Luke
         | 
| 536 | 
            +
            m
         | 
| 521 537 | 
             
            Mackenzie
         | 
| 522 538 | 
             
            Madeline
         | 
| 523 539 | 
             
            Madison
         | 
| @@ -564,6 +580,7 @@ much | |
| 564 580 | 
             
            must
         | 
| 565 581 | 
             
            my
         | 
| 566 582 | 
             
            myself
         | 
| 583 | 
            +
            n
         | 
| 567 584 | 
             
            name
         | 
| 568 585 | 
             
            namely
         | 
| 569 586 | 
             
            Natalie
         | 
| @@ -602,6 +619,7 @@ novel | |
| 602 619 | 
             
            november
         | 
| 603 620 | 
             
            now
         | 
| 604 621 | 
             
            nowhere
         | 
| 622 | 
            +
            o
         | 
| 605 623 | 
             
            Obie
         | 
| 606 624 | 
             
            obviously
         | 
| 607 625 | 
             
            oct
         | 
| @@ -637,6 +655,7 @@ out | |
| 637 655 | 
             
            overall
         | 
| 638 656 | 
             
            Owen
         | 
| 639 657 | 
             
            own
         | 
| 658 | 
            +
            p
         | 
| 640 659 | 
             
            Paige
         | 
| 641 660 | 
             
            par
         | 
| 642 661 | 
             
            Parker
         | 
| @@ -666,9 +685,11 @@ proud | |
| 666 685 | 
             
            provide
         | 
| 667 686 | 
             
            provides
         | 
| 668 687 | 
             
            put
         | 
| 688 | 
            +
            q
         | 
| 669 689 | 
             
            que
         | 
| 670 690 | 
             
            quite
         | 
| 671 691 | 
             
            qv
         | 
| 692 | 
            +
            r
         | 
| 672 693 | 
             
            Rachel
         | 
| 673 694 | 
             
            rather
         | 
| 674 695 | 
             
            rd
         | 
| @@ -694,6 +715,7 @@ Riley | |
| 694 715 | 
             
            Robert
         | 
| 695 716 | 
             
            run
         | 
| 696 717 | 
             
            Ryan
         | 
| 718 | 
            +
            s
         | 
| 697 719 | 
             
            safest
         | 
| 698 720 | 
             
            said
         | 
| 699 721 | 
             
            Samantha
         | 
| @@ -764,6 +786,7 @@ specify | |
| 764 786 | 
             
            specifying
         | 
| 765 787 | 
             
            spoke
         | 
| 766 788 | 
             
            spread
         | 
| 789 | 
            +
            sr
         | 
| 767 790 | 
             
            stand
         | 
| 768 791 | 
             
            started
         | 
| 769 792 | 
             
            step
         | 
| @@ -780,6 +803,7 @@ sup | |
| 780 803 | 
             
            sur
         | 
| 781 804 | 
             
            sure
         | 
| 782 805 | 
             
            Sydney
         | 
| 806 | 
            +
            t
         | 
| 783 807 | 
             
            t's
         | 
| 784 808 | 
             
            take
         | 
| 785 809 | 
             
            taken
         | 
| @@ -859,6 +883,7 @@ twice | |
| 859 883 | 
             
            two
         | 
| 860 884 | 
             
            Tyler
         | 
| 861 885 | 
             
            typically
         | 
| 886 | 
            +
            u
         | 
| 862 887 | 
             
            ultra
         | 
| 863 888 | 
             
            un
         | 
| 864 889 | 
             
            unfortunately
         | 
| @@ -876,6 +901,7 @@ uses | |
| 876 901 | 
             
            using
         | 
| 877 902 | 
             
            usually
         | 
| 878 903 | 
             
            uucp
         | 
| 904 | 
            +
            v
         | 
| 879 905 | 
             
            value
         | 
| 880 906 | 
             
            Vanessa
         | 
| 881 907 | 
             
            various
         | 
| @@ -886,6 +912,7 @@ Victoria | |
| 886 912 | 
             
            Vincent
         | 
| 887 913 | 
             
            viz
         | 
| 888 914 | 
             
            vs
         | 
| 915 | 
            +
            w
         | 
| 889 916 | 
             
            walks
         | 
| 890 917 | 
             
            want
         | 
| 891 918 | 
             
            wants
         | 
| @@ -927,8 +954,6 @@ who's | |
| 927 954 | 
             
            whoever
         | 
| 928 955 | 
             
            whole
         | 
| 929 956 | 
             
            whom
         | 
| 930 | 
            -
            approximate
         | 
| 931 | 
            -
            approximately
         | 
| 932 957 | 
             
            whose
         | 
| 933 958 | 
             
            why
         | 
| 934 959 | 
             
            will
         | 
| @@ -948,6 +973,7 @@ wouldn't | |
| 948 973 | 
             
            wrapped
         | 
| 949 974 | 
             
            Wyatt
         | 
| 950 975 | 
             
            Xavier
         | 
| 976 | 
            +
            y
         | 
| 951 977 | 
             
            yeah
         | 
| 952 978 | 
             
            yes
         | 
| 953 979 | 
             
            yet
         | 
| @@ -960,6 +986,17 @@ your | |
| 960 986 | 
             
            yours
         | 
| 961 987 | 
             
            yourself
         | 
| 962 988 | 
             
            yourselves
         | 
| 989 | 
            +
            z
         | 
| 963 990 | 
             
            Zachary
         | 
| 964 991 | 
             
            zero
         | 
| 965 | 
            -
            Zoe
         | 
| 992 | 
            +
            Zoe
         | 
| 993 | 
            +
            0
         | 
| 994 | 
            +
            1
         | 
| 995 | 
            +
            2
         | 
| 996 | 
            +
            3
         | 
| 997 | 
            +
            4
         | 
| 998 | 
            +
            5
         | 
| 999 | 
            +
            6
         | 
| 1000 | 
            +
            7
         | 
| 1001 | 
            +
            8
         | 
| 1002 | 
            +
            9
         | 
    
        data/pismo.gemspec
    CHANGED
    
    | @@ -1,101 +1,31 @@ | |
| 1 | 
            -
            # Generated by jeweler
         | 
| 2 | 
            -
            # DO NOT EDIT THIS FILE DIRECTLY
         | 
| 3 | 
            -
            # Instead, edit Jeweler::Tasks in Rakefile, and run the gemspec command
         | 
| 4 1 | 
             
            # -*- encoding: utf-8 -*-
         | 
| 2 | 
            +
            $:.push File.expand_path("../lib", __FILE__)
         | 
| 3 | 
            +
            require "pismo/version"
         | 
| 5 4 |  | 
| 6 5 | 
             
            Gem::Specification.new do |s|
         | 
| 7 | 
            -
              s.name | 
| 8 | 
            -
              s.version | 
| 9 | 
            -
             | 
| 10 | 
            -
              s. | 
| 11 | 
            -
              s. | 
| 12 | 
            -
              s. | 
| 13 | 
            -
              s. | 
| 6 | 
            +
              s.name        = "pismo"
         | 
| 7 | 
            +
              s.version     = Pismo::VERSION
         | 
| 8 | 
            +
              s.platform    = Gem::Platform::RUBY
         | 
| 9 | 
            +
              s.authors     = ["Peter Cooper"]
         | 
| 10 | 
            +
              s.email       = ["git@peterc.org"]
         | 
| 11 | 
            +
              s.homepage    = "http://github.com/peterc/pismo"
         | 
| 12 | 
            +
              s.summary     = %q{TODO: Write a gem summary}
         | 
| 14 13 | 
             
              s.description = %q{Pismo extracts and retrieves content-related metadata from HTML pages - you can use the resulting data in an organized way, such as a summary/first paragraph, body text, keywords, RSS feed URL, favicon, etc.}
         | 
| 15 | 
            -
              s. | 
| 16 | 
            -
              s. | 
| 17 | 
            -
              s. | 
| 18 | 
            -
                "LICENSE",
         | 
| 19 | 
            -
                 "README.markdown"
         | 
| 20 | 
            -
              ]
         | 
| 21 | 
            -
              s.files = [
         | 
| 22 | 
            -
                ".document",
         | 
| 23 | 
            -
                 ".gitignore",
         | 
| 24 | 
            -
                 "LICENSE",
         | 
| 25 | 
            -
                 "NOTICE",
         | 
| 26 | 
            -
                 "README.markdown",
         | 
| 27 | 
            -
                 "Rakefile",
         | 
| 28 | 
            -
                 "VERSION",
         | 
| 29 | 
            -
                 "bin/pismo",
         | 
| 30 | 
            -
                 "lib/pismo.rb",
         | 
| 31 | 
            -
                 "lib/pismo/document.rb",
         | 
| 32 | 
            -
                 "lib/pismo/external_attributes.rb",
         | 
| 33 | 
            -
                 "lib/pismo/internal_attributes.rb",
         | 
| 34 | 
            -
                 "lib/pismo/reader.rb",
         | 
| 35 | 
            -
                 "lib/pismo/stopwords.txt",
         | 
| 36 | 
            -
                 "pismo.gemspec",
         | 
| 37 | 
            -
                 "test/corpus/bbcnews.html",
         | 
| 38 | 
            -
                 "test/corpus/bbcnews2.html",
         | 
| 39 | 
            -
                 "test/corpus/briancray.html",
         | 
| 40 | 
            -
                 "test/corpus/cant_read.html",
         | 
| 41 | 
            -
                 "test/corpus/factor.html",
         | 
| 42 | 
            -
                 "test/corpus/gmane.html",
         | 
| 43 | 
            -
                 "test/corpus/huffington.html",
         | 
| 44 | 
            -
                 "test/corpus/metadata_expected.yaml",
         | 
| 45 | 
            -
                 "test/corpus/metadata_expected.yaml.old",
         | 
| 46 | 
            -
                 "test/corpus/queness.html",
         | 
| 47 | 
            -
                 "test/corpus/reader_expected.yaml",
         | 
| 48 | 
            -
                 "test/corpus/rubyinside.html",
         | 
| 49 | 
            -
                 "test/corpus/rww.html",
         | 
| 50 | 
            -
                 "test/corpus/spolsky.html",
         | 
| 51 | 
            -
                 "test/corpus/techcrunch.html",
         | 
| 52 | 
            -
                 "test/corpus/tweet.html",
         | 
| 53 | 
            -
                 "test/corpus/youtube.html",
         | 
| 54 | 
            -
                 "test/corpus/zefrank.html",
         | 
| 55 | 
            -
                 "test/helper.rb",
         | 
| 56 | 
            -
                 "test/test_corpus.rb",
         | 
| 57 | 
            -
                 "test/test_pismo_document.rb"
         | 
| 58 | 
            -
              ]
         | 
| 59 | 
            -
              s.homepage = %q{http://github.com/peterc/pismo}
         | 
| 60 | 
            -
              s.rdoc_options = ["--charset=UTF-8"]
         | 
| 61 | 
            -
              s.require_paths = ["lib"]
         | 
| 62 | 
            -
              s.rubygems_version = %q{1.3.5}
         | 
| 63 | 
            -
              s.summary = %q{Extracts or retrieves content-related metadata from HTML pages}
         | 
| 64 | 
            -
              s.test_files = [
         | 
| 65 | 
            -
                "test/helper.rb",
         | 
| 66 | 
            -
                 "test/test_corpus.rb",
         | 
| 67 | 
            -
                 "test/test_pismo_document.rb"
         | 
| 68 | 
            -
              ]
         | 
| 14 | 
            +
              s.summary     = %q{Extracts or retrieves content-related metadata from HTML pages}
         | 
| 15 | 
            +
              s.date        = %q{2010-07-27}
         | 
| 16 | 
            +
              s.default_executable = %q{pismo}
         | 
| 69 17 |  | 
| 70 | 
            -
               | 
| 71 | 
            -
                current_version = Gem::Specification::CURRENT_SPECIFICATION_VERSION
         | 
| 72 | 
            -
                s.specification_version = 3
         | 
| 18 | 
            +
              s.rubyforge_project = "pismo"
         | 
| 73 19 |  | 
| 74 | 
            -
             | 
| 75 | 
            -
             | 
| 76 | 
            -
             | 
| 77 | 
            -
             | 
| 78 | 
            -
             | 
| 79 | 
            -
             | 
| 80 | 
            -
             | 
| 81 | 
            -
             | 
| 82 | 
            -
             | 
| 83 | 
            -
             | 
| 84 | 
            -
             | 
| 85 | 
            -
                  s.add_dependency(%q<jeweler>, [">= 0"])
         | 
| 86 | 
            -
                  s.add_dependency(%q<nokogiri>, [">= 0"])
         | 
| 87 | 
            -
                  s.add_dependency(%q<sanitize>, [">= 0"])
         | 
| 88 | 
            -
                  s.add_dependency(%q<fast-stemmer>, [">= 0"])
         | 
| 89 | 
            -
                  s.add_dependency(%q<chronic>, [">= 0"])
         | 
| 90 | 
            -
                end
         | 
| 91 | 
            -
              else
         | 
| 92 | 
            -
                s.add_dependency(%q<shoulda>, [">= 0"])
         | 
| 93 | 
            -
                s.add_dependency(%q<awesome_print>, [">= 0"])
         | 
| 94 | 
            -
                s.add_dependency(%q<jeweler>, [">= 0"])
         | 
| 95 | 
            -
                s.add_dependency(%q<nokogiri>, [">= 0"])
         | 
| 96 | 
            -
                s.add_dependency(%q<sanitize>, [">= 0"])
         | 
| 97 | 
            -
                s.add_dependency(%q<fast-stemmer>, [">= 0"])
         | 
| 98 | 
            -
                s.add_dependency(%q<chronic>, [">= 0"])
         | 
| 99 | 
            -
              end
         | 
| 20 | 
            +
              s.files         = `git ls-files`.split("\n")
         | 
| 21 | 
            +
              s.test_files    = `git ls-files -- {test,spec,features}/*`.split("\n")
         | 
| 22 | 
            +
              s.executables   = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
         | 
| 23 | 
            +
              s.require_paths = ["lib"]
         | 
| 24 | 
            +
              
         | 
| 25 | 
            +
              s.add_dependency(%q<shoulda>, [">= 0"])
         | 
| 26 | 
            +
              s.add_dependency(%q<awesome_print>, [">= 0"])
         | 
| 27 | 
            +
              s.add_dependency(%q<nokogiri>, [">= 0"])
         | 
| 28 | 
            +
              s.add_dependency(%q<sanitize>, [">= 0"])
         | 
| 29 | 
            +
              s.add_dependency(%q<fast-stemmer>, [">= 0"])
         | 
| 30 | 
            +
              s.add_dependency(%q<chronic>, [">= 0"])
         | 
| 100 31 | 
             
            end
         | 
| 101 | 
            -
             | 
    
        data/test/corpus/metadata_expected.yaml
    CHANGED
    
| @@ -2,14 +2,14 @@ | |
| 2 2 | 
             
            :rww: 
         | 
| 3 3 | 
             
              :title: "Cartoon: Apple Tablet: Now With Barometer and Bird Call Generator"
         | 
| 4 4 | 
             
              :feed: http://www.readwriteweb.com/rss.xml
         | 
| 5 | 
            -
              :lede: I'm just aching to know if the new Apple tablet (insert caveats, weasel words and qualifiers here) is a potential Cintiq competitor. I don't think it will be, but you never know. It may also have a built in barometer and bird call generator.
         | 
| 5 | 
            +
              :lede: I'm just aching to know if the new Apple tablet (insert caveats, weasel words and qualifiers here) is a potential Cintiq competitor. I don't think it will be, but you never know. It may also have a built in barometer and bird call generator. I'm never sure if Apple does themselves more good than harm with the secrecy and anticipation that surrounds the run-up to these announcements.
         | 
| 6 6 | 
             
              :feeds:
         | 
| 7 7 | 
             
                - http://www.readwriteweb.com/rss.xml
         | 
| 8 8 | 
             
                - http://www.readwriteweb.com/archives/2010/01/cartoon_apple_tablet_now_with_barometer_and_bird_c.xml  
         | 
| 9 9 | 
             
            :briancray: 
         | 
| 10 10 | 
             
              :title: 5 great examples of popular blog posts that you should know
         | 
| 11 11 | 
             
              :feed: http://feeds.feedburner.com/briancray/blog
         | 
| 12 | 
            -
              :lede: "This is a mock post. | 
| 12 | 
            +
              :lede: "This is a mock post."
         | 
| 13 13 | 
             
            :huffington:
         | 
| 14 14 | 
             
              :title: Afghans Losing Hope After 8 Years Of War
         | 
| 15 15 | 
             
              :author: TODD PITMAN
         | 
| @@ -31,9 +31,9 @@ | |
| 31 31 | 
             
              :feed: http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/england/rss.xml
         | 
| 32 32 | 
             
            :factor:
         | 
| 33 33 | 
             
              :title: Factor's bootstrap process explained
         | 
| 34 | 
            -
              :lede: "Separation of concerns between Factor VM and library codeThe Factor VM implements an abstract machine consisting of a data heap of objects, a code heap of machine code blocks, and a set of stacks. The VM loads an image file on startup, which becomes the data and code heap."
         | 
| 34 | 
            +
              :lede: "Separation of concerns between Factor VM and library codeThe Factor VM implements an abstract machine consisting of a data heap of objects, a code heap of machine code blocks, and a set of stacks. The VM loads an image file on startup, which becomes the data and code heap. It then begins executing code in the image, by calling a special startup quotation.When new source files are loaded into a running Factor instance by the developer, they are parsed and compiled into a collection of objects -- words, quotations, and other literals, along with executable machine code."
         | 
| 35 35 | 
             
              :ledes:
         | 
| 36 | 
            -
                - "Separation of concerns between Factor VM and library codeThe Factor VM implements an abstract machine consisting of a data heap of objects, a code heap of machine code blocks, and a set of stacks. The VM loads an image file on startup, which becomes the data and code heap."
         | 
| 36 | 
            +
                - "Separation of concerns between Factor VM and library codeThe Factor VM implements an abstract machine consisting of a data heap of objects, a code heap of machine code blocks, and a set of stacks. The VM loads an image file on startup, which becomes the data and code heap. It then begins executing code in the image, by calling a special startup quotation.When new source files are loaded into a running Factor instance by the developer, they are parsed and compiled into a collection of objects -- words, quotations, and other literals, along with executable machine code."
         | 
| 37 37 | 
             
            :youtube:
         | 
| 38 38 | 
             
              :title: YMO - Rydeen (Official Video)
         | 
| 39 39 | 
             
              :author: ymo1965
         | 
| @@ -42,7 +42,7 @@ | |
| 42 42 | 
             
            :spolsky:
         | 
| 43 43 | 
             
              :title: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) - Joel on Software
         | 
| 44 44 | 
             
              :description: Haven't mastered the basics of Unicode and character sets? Please don't write another line of code until you've read this article.
         | 
| 45 | 
            -
              :lede: Ever wonder about that mysterious Content-Type tag? You know, the one you're supposed to put in HTML and you never quite know what it should be? Did you ever get an email from your friends in Bulgaria with the subject line "????
         | 
| 45 | 
            +
              :lede: "Ever wonder about that mysterious Content-Type tag? You know, the one you're supposed to put in HTML and you never quite know what it should be? Did you ever get an email from your friends in Bulgaria with the subject line \"???? "
         | 
| 46 46 | 
             
              :author: Joel Spolsky
         | 
| 47 47 | 
             
              :favicon: /favicon.ico
         | 
| 48 48 | 
             
              :feed: http://www.joelonsoftware.com/rss.xml
         | 
| @@ -52,14 +52,14 @@ | |
| 52 52 | 
             
            :rubyinside: 
         | 
| 53 53 | 
             
              :title: "CoffeeScript: A New Language With A Pure Ruby Compiler"
         | 
| 54 54 | 
             
              :author: Peter Cooper
         | 
| 55 | 
            -
              :lede: CoffeeScript (GitHub repo) is a new programming language with a pure Ruby compiler. | 
| 55 | 
            +
              :lede: CoffeeScript (GitHub repo) is a new programming language with a pure Ruby compiler.
         | 
| 56 56 | 
             
              :feed: http://www.rubyinside.com/feed/
         | 
| 57 57 | 
             
            :zefrank:
         | 
| 58 58 | 
             
              :sentences: If there's anyone who knows how to marshal an online audience, it's Ze Frank. Ze is best-known for his 2006 program "The Show," in which he made a new 2-3 minute video every day for 1 year. Topics ranged from "fingers in food" to the mysteries of airport signage to a tour de force summary of creatives' addiction to un-executed ideas, aka brain crack.
         | 
| 59 59 | 
             
              :title: "Ze Frank on Imaginary Audiences :: Articles :: The 99 Percent"
         | 
| 60 60 | 
             
              :description: We chat with the Internet's most notorious mass-collaboration instigator Ze Frank about idea execution and how to build armies of sportsracers.
         | 
| 61 61 | 
             
            :tweet:
         | 
| 62 | 
            -
              :lede: Gobsmacked that TeX/LaTeX (document formatting tools) for OS X is a 1.3GB (yes, GIGAbytes) download OS X. | 
| 62 | 
            +
              :lede: Gobsmacked that TeX/LaTeX (document formatting tools) for OS X is a 1.3GB (yes, GIGAbytes) download OS X.
         | 
| 63 63 | 
             
              :sentences: Gobsmacked that TeX/LaTeX (document formatting tools) for OS X is a 1.3GB (yes, GIGAbytes) download OS Wow..!
         | 
| 64 64 | 
             
            :cant_read:
         | 
| 65 65 | 
             
              :sentences: "For those of us who grew up as weird kids in the 1980s, the work of Berkeley Breathed was as important as those twin eternal pillars of weird-kid-dom: Monty Python and Mad magazine. In a word: seminal. In two words: fucking seminal."
         | 
| @@ -67,6 +67,6 @@ | |
| 67 67 | 
             
              :sentences: I am pleased to report that the GCC Steering Committee and the FSF have approved the use of C++ in GCC itself. Of course, there's no reason for us to use C++ features just because we can. The goal is a better compiler for users, not a C++ code base for its own sake.
         | 
| 68 68 | 
             
            :queness:
         | 
| 69 69 | 
             
              :title: 18 Incredible CSS3 Effects You Have Never Seen Before
         | 
| 70 | 
            -
              :lede: "CSS3 is hot these days and will soon be available in most modern browser. Just recently, I started to become aware to the present of CSS3 around the web. I can see some of the websites such as twitter and designer portfolios websites are using it."
         | 
| 70 | 
            +
              :lede: "CSS3 is hot these days and will soon be available in most modern browser. Just recently, I started to become aware to the present of CSS3 around the web. I can see some of the websites such as twitter and designer portfolios websites are using it. Also, I have started to implement it to my own project as well and I really love it!"
         | 
| 71 71 | 
             
              :sentences: CSS3 is hot these days and will soon be available in most modern browser. Just recently, I started to become aware to the present of CSS3 around the web. I can see some of the websites such as twitter and designer portfolios websites are using it.
         | 
| 72 72 | 
             
              :datetime: 2010-06-02 12:00:00 +01:00
         | 
    
        metadata
    CHANGED
    
    | @@ -1,7 +1,12 @@ | |
| 1 1 | 
             
            --- !ruby/object:Gem::Specification 
         | 
| 2 2 | 
             
            name: pismo
         | 
| 3 3 | 
             
            version: !ruby/object:Gem::Version 
         | 
| 4 | 
            -
               | 
| 4 | 
            +
              prerelease: false
         | 
| 5 | 
            +
              segments: 
         | 
| 6 | 
            +
              - 0
         | 
| 7 | 
            +
              - 7
         | 
| 8 | 
            +
              - 1
         | 
| 9 | 
            +
              version: 0.7.1
         | 
| 5 10 | 
             
            platform: ruby
         | 
| 6 11 | 
             
            authors: 
         | 
| 7 12 | 
             
            - Peter Cooper
         | 
| @@ -14,91 +19,99 @@ default_executable: pismo | |
| 14 19 | 
             
            dependencies: 
         | 
| 15 20 | 
             
            - !ruby/object:Gem::Dependency 
         | 
| 16 21 | 
             
              name: shoulda
         | 
| 17 | 
            -
               | 
| 18 | 
            -
               | 
| 19 | 
            -
             | 
| 22 | 
            +
              prerelease: false
         | 
| 23 | 
            +
              requirement: &id001 !ruby/object:Gem::Requirement 
         | 
| 24 | 
            +
                none: false
         | 
| 20 25 | 
             
                requirements: 
         | 
| 21 26 | 
             
                - - ">="
         | 
| 22 27 | 
             
                  - !ruby/object:Gem::Version 
         | 
| 28 | 
            +
                    segments: 
         | 
| 29 | 
            +
                    - 0
         | 
| 23 30 | 
             
                    version: "0"
         | 
| 24 | 
            -
             | 
| 31 | 
            +
              type: :runtime
         | 
| 32 | 
            +
              version_requirements: *id001
         | 
| 25 33 | 
             
            - !ruby/object:Gem::Dependency 
         | 
| 26 34 | 
             
              name: awesome_print
         | 
| 27 | 
            -
               | 
| 28 | 
            -
               | 
| 29 | 
            -
             | 
| 35 | 
            +
              prerelease: false
         | 
| 36 | 
            +
              requirement: &id002 !ruby/object:Gem::Requirement 
         | 
| 37 | 
            +
                none: false
         | 
| 30 38 | 
             
                requirements: 
         | 
| 31 39 | 
             
                - - ">="
         | 
| 32 40 | 
             
                  - !ruby/object:Gem::Version 
         | 
| 41 | 
            +
                    segments: 
         | 
| 42 | 
            +
                    - 0
         | 
| 33 43 | 
             
                    version: "0"
         | 
| 34 | 
            -
                version: 
         | 
| 35 | 
            -
            - !ruby/object:Gem::Dependency 
         | 
| 36 | 
            -
              name: jeweler
         | 
| 37 44 | 
             
              type: :runtime
         | 
| 38 | 
            -
               | 
| 39 | 
            -
              version_requirements: !ruby/object:Gem::Requirement 
         | 
| 40 | 
            -
                requirements: 
         | 
| 41 | 
            -
                - - ">="
         | 
| 42 | 
            -
                  - !ruby/object:Gem::Version 
         | 
| 43 | 
            -
                    version: "0"
         | 
| 44 | 
            -
                version: 
         | 
| 45 | 
            +
              version_requirements: *id002
         | 
| 45 46 | 
             
            - !ruby/object:Gem::Dependency 
         | 
| 46 47 | 
             
              name: nokogiri
         | 
| 47 | 
            -
               | 
| 48 | 
            -
               | 
| 49 | 
            -
             | 
| 48 | 
            +
              prerelease: false
         | 
| 49 | 
            +
              requirement: &id003 !ruby/object:Gem::Requirement 
         | 
| 50 | 
            +
                none: false
         | 
| 50 51 | 
             
                requirements: 
         | 
| 51 52 | 
             
                - - ">="
         | 
| 52 53 | 
             
                  - !ruby/object:Gem::Version 
         | 
| 54 | 
            +
                    segments: 
         | 
| 55 | 
            +
                    - 0
         | 
| 53 56 | 
             
                    version: "0"
         | 
| 54 | 
            -
             | 
| 57 | 
            +
              type: :runtime
         | 
| 58 | 
            +
              version_requirements: *id003
         | 
| 55 59 | 
             
            - !ruby/object:Gem::Dependency 
         | 
| 56 60 | 
             
              name: sanitize
         | 
| 57 | 
            -
               | 
| 58 | 
            -
               | 
| 59 | 
            -
             | 
| 61 | 
            +
              prerelease: false
         | 
| 62 | 
            +
              requirement: &id004 !ruby/object:Gem::Requirement 
         | 
| 63 | 
            +
                none: false
         | 
| 60 64 | 
             
                requirements: 
         | 
| 61 65 | 
             
                - - ">="
         | 
| 62 66 | 
             
                  - !ruby/object:Gem::Version 
         | 
| 67 | 
            +
                    segments: 
         | 
| 68 | 
            +
                    - 0
         | 
| 63 69 | 
             
                    version: "0"
         | 
| 64 | 
            -
             | 
| 70 | 
            +
              type: :runtime
         | 
| 71 | 
            +
              version_requirements: *id004
         | 
| 65 72 | 
             
            - !ruby/object:Gem::Dependency 
         | 
| 66 73 | 
             
              name: fast-stemmer
         | 
| 67 | 
            -
               | 
| 68 | 
            -
               | 
| 69 | 
            -
             | 
| 74 | 
            +
              prerelease: false
         | 
| 75 | 
            +
              requirement: &id005 !ruby/object:Gem::Requirement 
         | 
| 76 | 
            +
                none: false
         | 
| 70 77 | 
             
                requirements: 
         | 
| 71 78 | 
             
                - - ">="
         | 
| 72 79 | 
             
                  - !ruby/object:Gem::Version 
         | 
| 80 | 
            +
                    segments: 
         | 
| 81 | 
            +
                    - 0
         | 
| 73 82 | 
             
                    version: "0"
         | 
| 74 | 
            -
             | 
| 83 | 
            +
              type: :runtime
         | 
| 84 | 
            +
              version_requirements: *id005
         | 
| 75 85 | 
             
            - !ruby/object:Gem::Dependency 
         | 
| 76 86 | 
             
              name: chronic
         | 
| 77 | 
            -
               | 
| 78 | 
            -
               | 
| 79 | 
            -
             | 
| 87 | 
            +
              prerelease: false
         | 
| 88 | 
            +
              requirement: &id006 !ruby/object:Gem::Requirement 
         | 
| 89 | 
            +
                none: false
         | 
| 80 90 | 
             
                requirements: 
         | 
| 81 91 | 
             
                - - ">="
         | 
| 82 92 | 
             
                  - !ruby/object:Gem::Version 
         | 
| 93 | 
            +
                    segments: 
         | 
| 94 | 
            +
                    - 0
         | 
| 83 95 | 
             
                    version: "0"
         | 
| 84 | 
            -
             | 
| 96 | 
            +
              type: :runtime
         | 
| 97 | 
            +
              version_requirements: *id006
         | 
| 85 98 | 
             
            description: Pismo extracts and retrieves content-related metadata from HTML pages - you can use the resulting data in an organized way, such as a summary/first paragraph, body text, keywords, RSS feed URL, favicon, etc.
         | 
| 86 | 
            -
            email:  | 
| 99 | 
            +
            email: 
         | 
| 100 | 
            +
            - git@peterc.org
         | 
| 87 101 | 
             
            executables: 
         | 
| 88 102 | 
             
            - pismo
         | 
| 89 103 | 
             
            extensions: []
         | 
| 90 104 |  | 
| 91 | 
            -
            extra_rdoc_files: 
         | 
| 92 | 
            -
             | 
| 93 | 
            -
            - README.markdown
         | 
| 105 | 
            +
            extra_rdoc_files: []
         | 
| 106 | 
            +
             | 
| 94 107 | 
             
            files: 
         | 
| 95 108 | 
             
            - .document
         | 
| 96 109 | 
             
            - .gitignore
         | 
| 110 | 
            +
            - Gemfile
         | 
| 97 111 | 
             
            - LICENSE
         | 
| 98 112 | 
             
            - NOTICE
         | 
| 99 113 | 
             
            - README.markdown
         | 
| 100 114 | 
             
            - Rakefile
         | 
| 101 | 
            -
            - VERSION
         | 
| 102 115 | 
             
            - bin/pismo
         | 
| 103 116 | 
             
            - lib/pismo.rb
         | 
| 104 117 | 
             
            - lib/pismo/document.rb
         | 
| @@ -106,6 +119,7 @@ files: | |
| 106 119 | 
             
            - lib/pismo/internal_attributes.rb
         | 
| 107 120 | 
             
            - lib/pismo/reader.rb
         | 
| 108 121 | 
             
            - lib/pismo/stopwords.txt
         | 
| 122 | 
            +
            - lib/pismo/version.rb
         | 
| 109 123 | 
             
            - pismo.gemspec
         | 
| 110 124 | 
             
            - test/corpus/bbcnews.html
         | 
| 111 125 | 
             
            - test/corpus/bbcnews2.html
         | 
| @@ -133,30 +147,52 @@ homepage: http://github.com/peterc/pismo | |
| 133 147 | 
             
            licenses: []
         | 
| 134 148 |  | 
| 135 149 | 
             
            post_install_message: 
         | 
| 136 | 
            -
            rdoc_options: 
         | 
| 137 | 
            -
             | 
| 150 | 
            +
            rdoc_options: []
         | 
| 151 | 
            +
             | 
| 138 152 | 
             
            require_paths: 
         | 
| 139 153 | 
             
            - lib
         | 
| 140 154 | 
             
            required_ruby_version: !ruby/object:Gem::Requirement 
         | 
| 155 | 
            +
              none: false
         | 
| 141 156 | 
             
              requirements: 
         | 
| 142 157 | 
             
              - - ">="
         | 
| 143 158 | 
             
                - !ruby/object:Gem::Version 
         | 
| 159 | 
            +
                  segments: 
         | 
| 160 | 
            +
                  - 0
         | 
| 144 161 | 
             
                  version: "0"
         | 
| 145 | 
            -
              version: 
         | 
| 146 162 | 
             
            required_rubygems_version: !ruby/object:Gem::Requirement 
         | 
| 163 | 
            +
              none: false
         | 
| 147 164 | 
             
              requirements: 
         | 
| 148 165 | 
             
              - - ">="
         | 
| 149 166 | 
             
                - !ruby/object:Gem::Version 
         | 
| 167 | 
            +
                  segments: 
         | 
| 168 | 
            +
                  - 0
         | 
| 150 169 | 
             
                  version: "0"
         | 
| 151 | 
            -
              version: 
         | 
| 152 170 | 
             
            requirements: []
         | 
| 153 171 |  | 
| 154 | 
            -
            rubyforge_project: 
         | 
| 155 | 
            -
            rubygems_version: 1.3. | 
| 172 | 
            +
            rubyforge_project: pismo
         | 
| 173 | 
            +
            rubygems_version: 1.3.7
         | 
| 156 174 | 
             
            signing_key: 
         | 
| 157 175 | 
             
            specification_version: 3
         | 
| 158 176 | 
             
            summary: Extracts or retrieves content-related metadata from HTML pages
         | 
| 159 177 | 
             
            test_files: 
         | 
| 178 | 
            +
            - test/corpus/bbcnews.html
         | 
| 179 | 
            +
            - test/corpus/bbcnews2.html
         | 
| 180 | 
            +
            - test/corpus/briancray.html
         | 
| 181 | 
            +
            - test/corpus/cant_read.html
         | 
| 182 | 
            +
            - test/corpus/factor.html
         | 
| 183 | 
            +
            - test/corpus/gmane.html
         | 
| 184 | 
            +
            - test/corpus/huffington.html
         | 
| 185 | 
            +
            - test/corpus/metadata_expected.yaml
         | 
| 186 | 
            +
            - test/corpus/metadata_expected.yaml.old
         | 
| 187 | 
            +
            - test/corpus/queness.html
         | 
| 188 | 
            +
            - test/corpus/reader_expected.yaml
         | 
| 189 | 
            +
            - test/corpus/rubyinside.html
         | 
| 190 | 
            +
            - test/corpus/rww.html
         | 
| 191 | 
            +
            - test/corpus/spolsky.html
         | 
| 192 | 
            +
            - test/corpus/techcrunch.html
         | 
| 193 | 
            +
            - test/corpus/tweet.html
         | 
| 194 | 
            +
            - test/corpus/youtube.html
         | 
| 195 | 
            +
            - test/corpus/zefrank.html
         | 
| 160 196 | 
             
            - test/helper.rb
         | 
| 161 197 | 
             
            - test/test_corpus.rb
         | 
| 162 198 | 
             
            - test/test_pismo_document.rb
         | 
    
        data/VERSION
    DELETED
    
    | @@ -1 +0,0 @@ | |
| 1 | 
            -
            0.7.0
         |