nddrylliog_pismo 0.7.3

Files changed (43)
  1. data/.document +5 -0
  2. data/.gitignore +29 -0
  3. data/Gemfile +4 -0
  4. data/LICENSE +23 -0
  5. data/NOTICE +4 -0
  6. data/README.markdown +131 -0
  7. data/Rakefile +72 -0
  8. data/bin/pismo +45 -0
  9. data/lib/pismo.rb +82 -0
  10. data/lib/pismo/document.rb +67 -0
  11. data/lib/pismo/external_attributes.rb +14 -0
  12. data/lib/pismo/internal_attributes.rb +316 -0
  13. data/lib/pismo/reader.rb +19 -0
  14. data/lib/pismo/reader/base.rb +259 -0
  15. data/lib/pismo/reader/cluster.rb +171 -0
  16. data/lib/pismo/reader/tree.rb +154 -0
  17. data/lib/pismo/stopwords.txt +1002 -0
  18. data/lib/pismo/version.rb +3 -0
  19. data/pismo.gemspec +30 -0
  20. data/test/corpus/bbcnews.html +2131 -0
  21. data/test/corpus/bbcnews2.html +1575 -0
  22. data/test/corpus/briancray.html +269 -0
  23. data/test/corpus/cant_read.html +426 -0
  24. data/test/corpus/factor.html +1362 -0
  25. data/test/corpus/gmane.html +138 -0
  26. data/test/corpus/huffington.html +2932 -0
  27. data/test/corpus/metadata_expected.yaml +72 -0
  28. data/test/corpus/metadata_expected.yaml.old +122 -0
  29. data/test/corpus/queness.html +919 -0
  30. data/test/corpus/reader_expected.yaml +39 -0
  31. data/test/corpus/readers/cluster_expected.yaml +45 -0
  32. data/test/corpus/readers/tree_expected.yaml +55 -0
  33. data/test/corpus/rubyinside.html +318 -0
  34. data/test/corpus/rww.html +1351 -0
  35. data/test/corpus/spolsky.html +298 -0
  36. data/test/corpus/techcrunch.html +1285 -0
  37. data/test/corpus/tweet.html +360 -0
  38. data/test/corpus/youtube.html +2348 -0
  39. data/test/corpus/zefrank.html +535 -0
  40. data/test/helper.rb +15 -0
  41. data/test/test_corpus.rb +54 -0
  42. data/test/test_pismo_document.rb +34 -0
  43. metadata +156 -0
data/.document ADDED
@@ -0,0 +1,5 @@
+ README.rdoc
+ lib/**/*.rb
+ bin/*
+ features/**/*.feature
+ LICENSE
data/.gitignore ADDED
@@ -0,0 +1,29 @@
+ pkg/*
+ *.gem
+ .bundle
+ Gemfile.lock
+
+ ## MAC OS
+ .DS_Store
+
+ ## TEXTMATE
+ *.tmproj
+ tmtags
+
+ ## NETBEANS
+ nbproject
+
+ ## EMACS
+ *~
+ \#*
+ .\#*
+
+ ## VIM
+ *.swp
+
+ ## PROJECT::GENERAL
+ coverage
+ rdoc
+ pkg
+
+ ## PROJECT::SPECIFIC
data/Gemfile ADDED
@@ -0,0 +1,4 @@
+ source "http://rubygems.org"
+
+ # Specify your gem's dependencies in pismo.gemspec
+ gemspec
data/LICENSE ADDED
@@ -0,0 +1,23 @@
+ Copyright 2009, 2010 Peter Cooper
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+ --
+
+ In short, you can use Pismo for whatever you like, but please include
+ a brief credit somewhere deep in your license file or similar, and,
+ if you're a nice kinda person, let me know if you're using it and/or
+ share any significant changes or improvements you make.
+
+ Peter Cooper
+ http://twitter.com/peterc
data/NOTICE ADDED
@@ -0,0 +1,4 @@
+ Pismo is Copyright (c) 2009, 2010 Peter Cooper
+ Pismo is Apache 2.0 Licensed
+ Peter Cooper can be found at and contacted via http://twitter.com/peterc
+ The source can be found at http://github.com/peterc/pismo
data/README.markdown ADDED
@@ -0,0 +1,131 @@
+ # pismo - Web page content analysis and metadata extraction
+
+ ## DESCRIPTION:
+
+ Pismo extracts machine-usable metadata from unstructured (or poorly structured) English-language HTML documents.
+ Data that Pismo can extract include titles, feed URLs, ledes, body text, image URLs, date, and keywords.
+ Pismo is used heavily in production on http://coder.io/ to extract data from Web pages.
+
+ All tests pass on Ruby 1.8.7, Ruby 1.9.2 (both MRI) and JRuby 1.5.6.
+
+ ## NEWS:
+
+ December 19, 2010: Version 0.7.2 has been released - it includes a patch from Darcy Laycock to fix keyword extraction problems on some pages, has switched from Jeweler to Bundler for management of the gem, and adds support for JRuby 1.5.6 by skipping stemming on that platform.
+
+ ## USAGE:
+
+ A basic example of extracting basic metadata from a Web page:
+
+     require 'pismo'
+
+     # Load a Web page (you could pass an IO object or a string with existing HTML data along, as you prefer)
+     doc = Pismo::Document.new('http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html')
+
+     doc.title # => "Cramp: Asychronous Event-Driven Ruby Web App Framework"
+     doc.author # => "Peter Cooper"
+     doc.lede # => "Cramp (GitHub repo) is a new, asynchronous evented Web app framework by Pratik Naik of 37signals (and the Rails core team). It's built around Ruby's EventMachine library and was designed to use event-driven I/O throughout - making it ideal for situations where you need to handle a large number of open connections (such as Comet systems or streaming APIs.)"
+     doc.keywords # => [["cramp", 7], ["controllers", 3], ["app", 3], ["basic", 2], ..., ... ]
+
+ There's also a shorter "convenience" method which might be handy in IRB - it does the same as Pismo::Document.new:
+
+     Pismo['http://www.rubyflow.com/items/4082'].title # => "Install Ruby as a non-root User"
+
+ The current metadata methods are:
+
+ * title
+ * titles
+ * author
+ * authors
+ * lede
+ * keywords
+ * sentences(qty)
+ * body
+ * html_body
+ * feed
+ * feeds
+ * favicon
+ * description
+ * datetime
+
+ These methods are not fully documented here yet - you'll just need to try them out. The plural methods like #titles, #authors, and #feeds will return multiple matches in an array, if present. This is so you can use your own techniques to choose a "best" result in ambiguous cases.
+
+ The html_body and body methods will be of particular interest. They return the "body" of the page as determined by Pismo's "Reader". #body returns it as plain-text, #html_body maintains some basic HTML styling.
+
+ The default reader is the "tree" reader. This works in a similar fashion to Arc90's Readability or Safari Reader algorithm.
+
+ New! The keywords method accepts optional arguments. These are the current defaults:
+
+     :stem_at => 20, :word_length_limit => 15, :limit => 20, :remove_stopwords => true, :minimum_score => 2
+
+ You can also pass an array to keywords with :hints => arr if you want only words of your choosing to be found.
+
+ ## CAVEATS AND SHORTCOMINGS:
+
+ There are some shortcomings or problems that I'm aware of and am going to pursue:
+
+ * I do not know how Pismo fares on Rubinius
+ * pismo requires Bundler - get it :-)
+ * pismo does not install on JRuby due to a problem in the fast-stemmer dependency
+ * Some users have had issues with using Pismo from irb. This appears to be related to Nokogiri use causing a segfault
+ * The "Reader" content extraction algorithm is not perfect. It can sometimes return crap and can barf on certain types of characters for sentence extraction
+ * The author name extraction isn't very strong and is best avoided for now
+ * The image extraction only deals with images with absolute URLs
+ * The stopword list is a little too long (~1000 words) and needs to be trimmed
+ * The corpus in test/corpus needs significantly extending
+
+ ## OTHER GROOVY STUFF:
+
+ ### Command Line Tool
+
+ A command line tool called "pismo" is included so that you can get metadata about a page from the command line. This is
+ great for testing, or perhaps calling it from a non Ruby script. The output is currently in YAML.
+
+ #### Usage:
+
+     ./bin/pismo http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html title lede author datetime
+
+ #### Output:
+
+     ---
+     :url: http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html
+     :title: "Cramp: Asychronous Event-Driven Ruby Web App Framework"
+     :lede: Cramp (GitHub repo)is a new, asynchronous evented Web app framework by Pratik Naik of 37signals
+     :author: Peter Cooper
+     :datetime: 2010-01-07 12:00:00 +00:00
+
+ If you call pismo without any arguments (except a URL), it starts an IRB session so you can directly work in Ruby. The URL provided is loaded
+ and assigned to both the constant 'P' and the variable @p.
+
+ ### Stopword access
+
+ You can access Pismo's stopword list directly:
+
+     Pismo.stopwords # => [.., .., ..]
+
+ ### Alternate readers
+
+ Pismo supports different readers for extracting the #body and #html_body from the web page.
+
+ The "cluster" reader uses an algorithm that tries to cluster contiguous content blocks together to identify the main document body. This is based on the ExtractContent gem (http://rubyforge.org/projects/extractcontent/).
+
+ The reader can be specified as part of #Document.new :
+
+     doc = Document.new(url, :reader => :cluster)
+
+
+ ## Note on Patches/Pull Requests
+
+ * Fork the project.
+ * Make your feature addition or bug fix.
+ * Add tests for it. This is important so I don't break it in a future version unintentionally.
+ * Commit, do not mess with Rakefile, version, or history as it's handled by Jeweler (which is awesome, btw).
+ * Send me a pull request. I may or may not accept it (sorry, practicality rules.. but message me and we can talk!)
+
+ ## COPYRIGHT AND LICENSE
+
+ Apache 2.0 License - See LICENSE for details.
+ Copyright (c) 2009, 2010 Peter Cooper
+
+ In short, you can use Pismo for whatever you like commercial or not, but please include a brief credit (as in the NOTICE file - as per the Apache 2.0 License) somewhere deep in your license file or similar, and, if you're nice and have the time, let me know if you're using it and/or share any significant changes or improvements you make.
+
+ http://github.com/peterc/pismo
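
Note on the keywords section of the README: it shows the shape of the `doc.keywords` result (word/count pairs) and lists the default options, but nothing that can be run without fetching a page. The sketch below is not Pismo's implementation. `keyword_counts` and its `:stopwords` option are hypothetical; only `:limit` and `:minimum_score` borrow their names and default values from the README. It assumes plain frequency counting and skips stemming and word-length limits entirely.

```ruby
# Hypothetical helper, not part of Pismo: counts word frequencies and
# returns [word, count] pairs shaped like the README's doc.keywords example.
def keyword_counts(text, options = {})
  # :limit and :minimum_score reuse the README's default values;
  # :stopwords is an assumption made for this sketch.
  opts = { :limit => 20, :minimum_score => 2, :stopwords => [] }.merge(options)

  counts = Hash.new(0)
  text.downcase.scan(/[a-z]+/) do |word|
    counts[word] += 1 unless opts[:stopwords].include?(word)
  end

  # Keep words that meet the minimum score, highest count first,
  # ties broken alphabetically, then cut to the limit.
  counts.to_a
        .select { |_word, count| count >= opts[:minimum_score] }
        .sort_by { |word, count| [-count, word] }
        .first(opts[:limit])
end

sample = "Cramp cramp cramp app app the the the basic"
p keyword_counts(sample, :stopwords => %w[the])
# prints [["cramp", 3], ["app", 2]]
```

Here "basic" is dropped because it appears only once, below the `:minimum_score` of 2, and "the" is dropped as a stopword.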
data/Rakefile ADDED
@@ -0,0 +1,72 @@
+ require 'bundler'
+ Bundler::GemHelper.install_tasks
+
+ require 'rake/testtask'
+ Rake::TestTask.new(:test) do |test|
+   test.libs << 'lib' << 'test'
+   test.pattern = 'test/**/test_*.rb'
+   test.verbose = true
+ end
+
+ begin
+   require 'rcov/rcovtask'
+   Rcov::RcovTask.new do |test|
+     test.libs << 'test'
+     test.pattern = 'test/**/test_*.rb'
+     test.verbose = true
+   end
+ rescue LoadError
+   task :rcov do
+     abort "RCov is not available. In order to run rcov, you must: sudo gem install spicycode-rcov"
+   end
+ end
+
+ task :default => :test
+
+ require 'rake/rdoctask'
+ Rake::RDocTask.new do |rdoc|
+   version = File.exist?('VERSION') ? File.read('VERSION') : ""
+
+   rdoc.rdoc_dir = 'rdoc'
+   rdoc.title = "pismo #{version}"
+   rdoc.rdoc_files.include('README*')
+   rdoc.rdoc_files.include('lib/**/*.rb')
+ end
+
+ desc 'Automatically run something when code is changed'
+ task :on_update do
+   require 'find'
+   files = {}
+
+   loop do
+     changed = false
+     Find.find(File.dirname(__FILE__)) do |file|
+       next unless file =~ /\.rb$/
+       ctime = File.ctime(file).to_i
+
+       if ctime != files[file]
+         files[file] = ctime
+         changed = true
+       end
+     end
+
+     if changed
+       system ARGV[1] || 'rake'
+       puts "\n" + Time.now.to_s
+     end
+
+     sleep 4
+   end
+ end
+
+ desc 'Console mode'
+ task :console do
+   require 'irb'
+   require 'lib/pismo'
+   require 'open-uri'
+   @d = Pismo.document(ARGV[1] || open('./test/corpus/bbcnews.html'))
+
+   # Get around IRB's issues with ARGV..
+   ARGV = []
+   IRB.start
+ end
data/bin/pismo ADDED
@@ -0,0 +1,45 @@
+ #!/usr/bin/env ruby
+
+ # pismo
+ #
+ # get metadata about a page from the command line
+ #
+ # Usage:
+ # ./bin/pismo http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html title description author
+ # Output:
+ # ---
+ # :url: http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html
+ # :title: "Cramp: Asychronous Event-Driven Ruby Web App Framework"
+ # :description: The ideal book for beginners or developers merely new to Ruby. Goes from installation to OOP, webapps, SQL, and GUI apps.
+ # :author: Peter Cooper
+
+
+ require 'yaml'
+ require 'rubygems'
+ $:.unshift(File.dirname(__FILE__) + "/../lib")
+ require 'pismo'
+ require 'irb'
+
+ url = ARGV.shift
+
+ unless url =~ /^http/
+   url = File.read(url)
+ end
+
+ doc = Pismo.document(url)
+
+ if ARGV.empty?
+   P = doc
+   @p = doc
+   puts "Pismo has loaded #{url} into @p and P"
+   #puts "Note: There have been several reports of Nokogiri segfaulting while using Pismo from irb. If this happens, try the same code as a standalone Ruby app."
+   IRB.start
+ else
+   output = { :url => doc.url }
+
+   ARGV.each do |cmd|
+     output[cmd.to_sym] = doc.send(cmd)
+   end
+
+   puts output.to_yaml
+ end
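
Note on bin/pismo: the script passes every remaining command-line argument straight to `doc.send`, so any method name, including private ones, can be invoked from the shell. One possible hardening is sketched below under stated assumptions. `FakeDoc` is a stand-in for `Pismo::Document` so the example runs without network access, and `collect_attributes` with its `allowed` list is hypothetical. In Pismo the list could plausibly be derived from `Document::ATTRIBUTE_METHODS`, which document.rb defines, but that wiring is not shown in this diff.

```ruby
require 'yaml'

# Stand-in for Pismo::Document so the sketch needs no network access.
FakeDoc = Struct.new(:url, :title, :author)

# Hypothetical helper: builds the same kind of hash bin/pismo prints,
# but only for attribute names found in an explicit allow-list.
def collect_attributes(doc, names, allowed)
  output = { :url => doc.url }
  names.each do |name|
    next unless allowed.include?(name.to_s)   # silently skip unknown names
    output[name.to_sym] = doc.public_send(name)
  end
  output
end

doc = FakeDoc.new("http://example.com/", "An example title", "A. Writer")
puts collect_attributes(doc, %w[title exit], %w[title author]).to_yaml
```

The `exit` argument is ignored here, whereas with a bare `send` it would be called on the document object.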
data/lib/pismo.rb ADDED
@@ -0,0 +1,82 @@
+ # encoding: utf-8
+
+ require 'open-uri'
+ require 'nokogiri'
+ require 'chronic'
+ require 'sanitize'
+ require 'tempfile'
+
+ $: << File.dirname(__FILE__)
+ require 'pismo/document'
+ require 'pismo/reader'
+ require 'pismo/reader/base'
+ require 'pismo/reader/tree'
+ require 'pismo/reader/cluster'
+
+ if RUBY_PLATFORM == "java"
+   class String; def stem; self; end; end
+ else
+   require 'fast_stemmer'
+ end
+
+ module Pismo
+   # Sugar methods to make creating document objects nicer
+   def self.document(handle, options = {})
+     Document.new(handle, options)
+   end
+
+   # Load a URL, as with Pismo['http://www.rubyinside.com'], and cache the Pismo document
+   # (mostly useful for debugging use)
+   def self.[](url)
+     @docs ||= {}
+     @docs[url] ||= Pismo::Document.new(url)
+   end
+
+
+   # Return stopword list
+   def self.stopwords
+     @stopwords ||= File.read(File.dirname(__FILE__) + '/pismo/stopwords.txt').split rescue []
+   end
+
+   class NFunctions
+     def self.match_href(list, expression)
+       list.find_all { |node| node['href'] =~ /#{expression}/ }
+     end
+   end
+ end
+
+ # Add some sugar to Nokogiri
+ class Nokogiri::HTML::Document
+   def get_the(search)
+     self.search(search).first rescue nil
+   end
+
+   def match(queries = [], all = false)
+     r = [] if all
+     [*queries].each do |query|
+       if query.is_a?(String)
+         if el = self.search(query).first
+           if el.name.downcase == "meta"
+             result = el['content'].strip rescue nil
+           else
+             result = el.inner_text.strip rescue nil
+           end
+         end
+       elsif query.is_a?(Array)
+         result = query[1].call(self.search(query.first).first).strip rescue nil
+       end
+
+       if result
+         # TODO: Sort out sanitization in a more centralized way
+         result.gsub!('’', '\'')
+         result.gsub!('—', '-')
+         if all
+           r << result
+         else
+           return result
+         end
+       end
+     end
+     all && !r.empty? ? r : nil
+   end
+ end
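
Note on `Nokogiri::HTML::Document#match`: the method tries each query in order and either returns the first non-nil result or, when `all` is true, collects every result and returns nil if there are none. The control flow is easier to see without Nokogiri in the way. In the sketch below, `first_or_all` is a hypothetical name, and plain lambdas returning a String or nil stand in for the CSS queries and their extraction step. The character clean-up that the original applies to each result is left out.

```ruby
# Hypothetical stand-in for the control flow of #match: each extractor
# is a callable that returns a String or nil.
def first_or_all(extractors, all = false)
  results = []
  extractors.each do |extractor|
    result = extractor.call
    next unless result
    return result unless all   # single-result mode: first hit wins
    results << result
  end
  # Mirrors the original's final line: an array only in "all" mode
  # and only when something matched, otherwise nil.
  all && !results.empty? ? results : nil
end

extractors = [
  lambda { nil },
  lambda { "Title from <title>" },
  lambda { "Title from <h1>" }
]

p first_or_all(extractors)         # prints "Title from <title>"
p first_or_all(extractors, true)   # prints ["Title from <title>", "Title from <h1>"]
p first_or_all([lambda { nil }])   # prints nil
```

This ordering is what lets callers list the most trustworthy selector first and fall back to weaker ones.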
data/lib/pismo/document.rb ADDED
@@ -0,0 +1,67 @@
+ require 'pismo/internal_attributes'
+ require 'pismo/external_attributes'
+
+ module Pismo
+
+   # Pismo::Document represents a single HTML document within Pismo
+   class Document
+     attr_reader :doc, :url, :options
+
+     ATTRIBUTE_METHODS = InternalAttributes.instance_methods + ExternalAttributes.instance_methods
+
+     include Pismo::InternalAttributes
+     include Pismo::ExternalAttributes
+
+     def initialize(handle, options = {})
+       @options = options
+       url = @options.delete(:url)
+       load(handle, url)
+     end
+
+     # An HTML representation of the document
+     def html
+       @doc.to_s
+     end
+
+     def load(handle, url = nil)
+       @url = url if url
+       @url = handle if handle =~ /\Ahttp/i
+
+       @html = if handle =~ /\Ahttp/i
+         open(handle).read
+       elsif handle.is_a?(StringIO) || handle.is_a?(IO) || handle.is_a?(Tempfile)
+         handle.read
+       else
+         handle
+       end
+
+       @html = self.class.clean_html(@html)
+
+       @doc = Nokogiri::HTML(@html)
+     end
+
+     def match(args = [], all = false)
+       @doc.match([*args], all)
+     end
+
+     def self.clean_html(html)
+       # Normalize stupid entities
+       # TODO: Optimize this so we don't need all these sequential gsubs
+       html.encode!('UTF-8', 'UTF-8', :invalid => :replace)
+       html.gsub!("&#8194;", " ")
+       html.gsub!("&#8195;", " ")
+       html.gsub!("&#8201;", " ")
+       html.gsub!('&#8211;', '-')
+       html.gsub!("&#8216;", "'")
+       html.gsub!('&#8217;', "'")
+       html.gsub!('&#8220;', '"')
+       html.gsub!('&#8221;', '"')
+       html.gsub!("&#8230;", '...')
+       html.gsub!('&nbsp;', ' ')
+       html.gsub!('&lt;', '<')
+       html.gsub!('&gt;', '>')
+       html.gsub!('&amp;', '&')
+       html
+     end
+   end
+ end
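
Note on `Document.clean_html`: the TODO asks for a way to avoid the chain of sequential `gsub!` calls. One option, sketched here as an assumption and not as the author's plan, is a lookup table plus a single `gsub` over a combined pattern. `ENTITY_MAP`, `ENTITY_PATTERN` and `normalize_entities` are hypothetical names. The table copies the thirteen substitutions from the diff; the `encode!` step is left out, and the sketch returns a new string where the original mutates its argument.

```ruby
# Hypothetical single-pass variant of the substitutions in clean_html.
ENTITY_MAP = {
  "&#8194;" => " ",
  "&#8195;" => " ",
  "&#8201;" => " ",
  "&#8211;" => "-",
  "&#8216;" => "'",
  "&#8217;" => "'",
  "&#8220;" => '"',
  "&#8221;" => '"',
  "&#8230;" => "...",
  "&nbsp;"  => " ",
  "&lt;"    => "<",
  "&gt;"    => ">",
  "&amp;"   => "&"
}.freeze

# One alternation that matches any key of the table literally.
ENTITY_PATTERN = Regexp.union(ENTITY_MAP.keys)

# Scans the input once and looks each matched entity up in the table.
def normalize_entities(html)
  html.gsub(ENTITY_PATTERN) { |entity| ENTITY_MAP[entity] }
end

p normalize_entities("Tom &amp; Jerry&#8217;s &lt;b&gt;&#8230;")
# prints "Tom & Jerry's <b>..."
```

Because replaced text is never rescanned, a double-escaped input such as `&amp;lt;` comes out as `&lt;`, which is also what the sequential version yields since it handles `&amp;` last.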