pismo 0.2.0

data/.document ADDED
@@ -0,0 +1,5 @@
+ README.rdoc
+ lib/**/*.rb
+ bin/*
+ features/**/*.feature
+ LICENSE
data/.gitignore ADDED
@@ -0,0 +1,21 @@
+ ## MAC OS
+ .DS_Store
+
+ ## TEXTMATE
+ *.tmproj
+ tmtags
+
+ ## EMACS
+ *~
+ \#*
+ .\#*
+
+ ## VIM
+ *.swp
+
+ ## PROJECT::GENERAL
+ coverage
+ rdoc
+ pkg
+
+ ## PROJECT::SPECIFIC
data/LICENSE ADDED
@@ -0,0 +1,32 @@
+ All EXCEPT the lib/pismo/readability.rb file:
+
+ Copyright 2009, 2010 Peter Cooper
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+
+ For lib/pismo/readability.rb:
+
+ Copyright 2009, 2010 Arc90 Inc, starrhorne, and iterationlabs
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
data/README.rdoc ADDED
@@ -0,0 +1,68 @@
+ = pismo (Web page content analyzer and metadata extractor)
+
+ * http://github.com/peterc/pismo
+
+ == STATUS:
+
+ pismo is a VERY NEW project developed for use on http://coder.io/ - my forthcoming developer news aggregator. pismo is FAR FROM COMPLETE. If you're brave, though, you can have a PLAY with it - the examples below and those in the test suite/corpus work, and all tests pass.
+
+ The main missing features so far are the "external attributes" - calls made to external services like Delicious, Yahoo, and Bing to get third-party data about documents. The structures are there, but I'm still deciding how best to integrate these ideas.
+
+ == DESCRIPTION:
+
+ Pismo extracts metadata and machine-usable data from otherwise unstructured
+ HTML documents, including titles, body text, graphics, dates, and keywords.
+
+ For example, if you have a blog post HTML file, Pismo should, in theory, be
+ able to extract the title, the actual "content", and images relating to the
+ content, as well as look up Delicious tags and analyze the text for keywords.
+
+ Pismo only understands English. Je suis désolé.
+
+ == SYNOPSIS:
+
+ * Basic demo:
+
+     require 'open-uri'
+     require 'pismo'
+     doc = Pismo::Document.new(open('http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html'))
+     doc.title     # => "Cramp: Asychronous Event-Driven Ruby Web App Framework"
+     doc.author    # => "Peter Cooper"
+     doc.lede      # => "Cramp (GitHub repo)is a new, asynchronous evented Web app framework by Pratik Naik of 37signals (and the Rails core team). It's built around Ruby's EventMachine library and was designed to use event-driven I/O throughout - making it ideal for situations where you need to handle a large number of open connections (such as Comet systems or streaming APIs.)"
+     doc.keywords  # => [["cramp", 7], ["controllers", 3], ["app", 3], ["basic", 2], ..., ... ]
+
+ == COMMAND LINE TOOL:
+
+ A command line tool called "pismo" is included so that you can get metadata about a page from the command line. This is
+ great for testing, or for calling it from a non-Ruby script. The output is currently in YAML.
+
+ * Usage:
+
+     ./bin/pismo http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html title lede author datetime
+
+ * Output:
+
+     ---
+     :url: http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html
+     :title: "Cramp: Asychronous Event-Driven Ruby Web App Framework"
+     :lede: Cramp (GitHub repo)is a new, asynchronous evented Web app framework by Pratik Naik of 37signals
+     :author: Peter Cooper
+     :datetime: 2010-01-07 12:00:00 +00:00
+
+ == Note on Patches/Pull Requests
+
+ * Fork the project.
+ * Make your feature addition or bug fix.
+ * Add tests for it. This is important so I don't break it in a future version unintentionally.
+ * Commit, but do not mess with the Rakefile, version, or history.
+ * Send me a pull request. I may or may not accept it.
+
+ == COPYRIGHT AND LICENSE
+
+ Apache 2.0 License - see LICENSE for details.
+
+ Everything except lib/pismo/readability.rb is Copyright (c) 2009, 2010 Peter Cooper.
+ lib/pismo/readability.rb is Copyright (c) 2009, 2010 Arc90 Inc, starrhorne, and iterationlabs.
+
+ The readability code was ganked from http://github.com/iterationlabs/ruby-readability
data/Rakefile ADDED
@@ -0,0 +1,95 @@
+ require 'rubygems'
+ require 'rake'
+
+ begin
+   require 'jeweler'
+   Jeweler::Tasks.new do |gem|
+     gem.name = "pismo"
+     gem.summary = %Q{Extracts or retrieves content-related metadata from HTML pages}
+     gem.description = %Q{Pismo extracts and retrieves content-related metadata from HTML pages - you can use the resulting data in an organized way, such as a summary/first paragraph, del.icio.us tags, first image used in the content block, etc.}
+     gem.email = "git@peterc.org"
+     gem.homepage = "http://github.com/peterc/pismo"
+     gem.authors = ["Peter Cooper"]
+     gem.add_development_dependency "shoulda", ">= 0"
+     gem.add_dependency "nokogiri"
+     gem.add_dependency "loofah"
+     gem.add_dependency "httparty"
+     gem.add_dependency "fast-stemmer"
+     gem.add_dependency "chronic"
+   end
+   Jeweler::GemcutterTasks.new
+ rescue LoadError
+   puts "Jeweler (or a dependency) not available. Install it with: gem install jeweler"
+ end
+
+ require 'rake/testtask'
+ Rake::TestTask.new(:test) do |test|
+   test.libs << 'lib' << 'test'
+   test.pattern = 'test/**/test_*.rb'
+   test.verbose = true
+ end
+
+ begin
+   require 'rcov/rcovtask'
+   Rcov::RcovTask.new do |test|
+     test.libs << 'test'
+     test.pattern = 'test/**/test_*.rb'
+     test.verbose = true
+   end
+ rescue LoadError
+   task :rcov do
+     abort "RCov is not available. In order to run rcov, you must: sudo gem install spicycode-rcov"
+   end
+ end
+
+ task :test => :check_dependencies
+
+ task :default => :test
+
+ require 'rake/rdoctask'
+ Rake::RDocTask.new do |rdoc|
+   version = File.exist?('VERSION') ? File.read('VERSION') : ""
+
+   rdoc.rdoc_dir = 'rdoc'
+   rdoc.title = "pismo #{version}"
+   rdoc.rdoc_files.include('README*')
+   rdoc.rdoc_files.include('lib/**/*.rb')
+ end
+
+ desc 'Automatically run something when code is changed'
+ task :on_update do
+   require 'find'
+   files = {}
+
+   loop do
+     changed = false
+     Find.find(File.dirname(__FILE__)) do |file|
+       next unless file =~ /\.rb$/
+       ctime = File.ctime(file).to_i
+
+       if ctime != files[file]
+         files[file] = ctime
+         changed = true
+       end
+     end
+
+     if changed
+       system ARGV[1] || 'rake'
+       puts "\n" + Time.now.to_s
+     end
+
+     sleep 4
+   end
+ end
+
+ desc 'Console mode'
+ task :console do
+   require 'irb'
+   require 'lib/pismo'
+   require 'open-uri'
+   @d = Pismo.document(ARGV[1] || open('./test/corpus/bbcnews.html'))
+
+   # Get around IRB's issues with ARGV..
+   ARGV = []
+   IRB.start
+ end
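As an aside, the `:on_update` task's change detection boils down to snapshotting file ctimes and diffing against the snapshot on each pass. A single-pass, stdlib-only sketch of that idea (the helper name is hypothetical; the real task wraps this in a loop and shells out to rake):

```ruby
require 'find'
require 'tmpdir'

# Snapshot each .rb file's ctime under dir and return the files whose
# ctime differs from the previous snapshot, updating the snapshot as we go.
def changed_ruby_files(dir, snapshot)
  changed = []
  Find.find(dir) do |file|
    next unless file =~ /\.rb$/
    ctime = File.ctime(file).to_i
    if ctime != snapshot[file]
      snapshot[file] = ctime
      changed << file
    end
  end
  changed
end

Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'example.rb'), "puts 'hi'\n")
  snapshot = {}
  puts changed_ruby_files(dir, snapshot).length  # first pass sees the new file
  puts changed_ruby_files(dir, snapshot).length  # nothing changed since the snapshot
end
```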
data/VERSION ADDED
@@ -0,0 +1 @@
+ 0.2.0
data/bin/pismo ADDED
@@ -0,0 +1,36 @@
+ #!/usr/bin/env ruby
+
+ # pismo
+ #
+ # Get metadata about a page from the command line.
+ #
+ # Usage:
+ #   ./bin/pismo http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html title description author
+ # Output:
+ #   ---
+ #   :url: http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html
+ #   :title: "Cramp: Asychronous Event-Driven Ruby Web App Framework"
+ #   :description: The ideal book for beginners or developers merely new to Ruby. Goes from installation to OOP, webapps, SQL, and GUI apps.
+ #   :author: Peter Cooper
+
+ require 'yaml'
+ require 'rubygems'
+ $:.unshift(File.dirname(__FILE__) + "/../lib")
+ require 'pismo'
+
+ url = ARGV.shift
+
+ # If the first argument isn't a URL, treat it as a local file and read its HTML
+ unless url =~ /^http/
+   url = File.read(url)
+ end
+
+ doc = Pismo.document(url)
+
+ output = { :url => doc.url }
+
+ (ARGV.empty? ? Pismo::Document::ATTRIBUTE_METHODS : ARGV).each do |cmd|
+   output[cmd.to_sym] = doc.send(cmd) rescue nil
+ end
+
+ puts output.to_yaml
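The attribute loop above is a small dynamic-dispatch pattern: each requested attribute name is `send`-ed to the document, and any failure simply becomes nil in the output hash. A sketch of just that pattern using a plain stand-in object (`FakeDoc` and the helper name are hypothetical, for illustration only; this is not the Pismo API):

```ruby
require 'yaml'

# Stand-in for a document object with a couple of attribute methods.
class FakeDoc
  def title;  "An Example Title"; end
  def author; "A. N. Author";     end
end

# Send each requested attribute name to the document; unknown or
# failing attributes become nil, as in bin/pismo's `rescue nil`.
def attributes_as_hash(doc, names)
  output = {}
  names.each do |cmd|
    output[cmd.to_sym] = begin
      doc.send(cmd)
    rescue StandardError
      nil
    end
  end
  output
end

puts attributes_as_hash(FakeDoc.new, %w[title author bogus]).to_yaml
```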
@@ -0,0 +1,50 @@
+ require 'pismo/internal_attributes'
+ require 'pismo/external_attributes'
+
+ module Pismo
+
+   # Pismo::Document represents a single HTML document within Pismo
+   class Document
+     attr_reader :doc, :url
+
+     ATTRIBUTE_METHODS = InternalAttributes.instance_methods + ExternalAttributes.instance_methods
+
+     include Pismo::InternalAttributes
+     include Pismo::ExternalAttributes
+
+     def initialize(handle, url = nil)
+       load(handle, url)
+     end
+
+     # An HTML representation of the document
+     def html
+       @doc.to_s
+     end
+
+     def load(handle, url = nil)
+       @url = url if url
+       @url = handle if handle =~ /^http/
+
+       @html = if handle =~ /^http/
+         open(handle).read
+       elsif handle.is_a?(StringIO) || handle.is_a?(IO)
+         handle.read
+       else
+         handle
+       end
+
+       @html = clean_html(@html)
+
+       @doc = Nokogiri::HTML(@html)
+     end
+
+     def clean_html(html)
+       html.gsub!('&#8217;', '\'')
+       html.gsub!('&#8221;', '"')
+       html.gsub!('&#8211;', '-')
+       html.gsub!('&#8220;', '"')
+       html.gsub!('&nbsp;', ' ')
+       html
+     end
+   end
+ end
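The `clean_html` step above normalises a handful of common HTML entities to plain ASCII before the markup is handed to Nokogiri. A non-mutating, stdlib-only sketch of the same substitutions (the constant and helper names are hypothetical):

```ruby
# Map of the entities clean_html replaces to their plain-text equivalents.
ENTITIES = {
  '&#8217;' => "'",   # right single quote
  '&#8221;' => '"',   # right double quote
  '&#8211;' => '-',   # en dash
  '&#8220;' => '"',   # left double quote
  '&nbsp;'  => ' '    # non-breaking space
}.freeze

# Apply each replacement in turn, returning a new string.
def normalise_entities(html)
  ENTITIES.reduce(html) { |text, (entity, plain)| text.gsub(entity, plain) }
end

puts normalise_entities('&#8220;Hello&nbsp;world&#8221;')  # => "Hello world" (with quotes)
```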
@@ -0,0 +1,14 @@
+ module Pismo
+   # External attributes return data that comes from external services or programs (e.g. Delicious tags)
+   module ExternalAttributes
+     #include HTTParty
+     #
+     #def delicious_tags
+     #  delicious_info["top_tags"].sort_by { |k, v| v }.reverse.first(5) rescue []
+     #end
+     #
+     #def delicious_info
+     #  @delicious_info ||= self.class.get('http://feeds.delicious.com/v2/json/urlinfo/' + Digest::MD5.hexdigest(@url)).first rescue nil
+     #end
+   end
+ end
@@ -0,0 +1,202 @@
+ module Pismo
+   # Internal attributes are different pieces of data we can extract from a document's content
+   module InternalAttributes
+     # Returns the title of the page/content - attempts to strip site name, etc, if possible
+     def title
+       title = @doc.match('h2.title',
+                          '.entry h2',        # Common style
+                          '.entryheader h1',  # Ruby Inside/Kubrick
+                          '.entry-title a',   # Common Blogger/Blogspot rules
+                          '.post-title a',
+                          '.posttitle a',
+                          '.entry-title',
+                          '.post-title',
+                          '.posttitle',
+                          ['meta[@name="title"]', lambda { |el| el.attr('content') }],
+                          '#pname a',         # Google Code style
+                          'h1.headermain',
+                          'h1.title',
+                          '.mxb h1'           # BBC News
+                         )
+
+       # If all else fails, go to the HTML title
+       unless title
+         title = @doc.match('title')
+         return unless title
+
+         # Strip off any leading or trailing site names - a scrappy way to try it out..
+         title = title.split(/\s+(\-|\||\:)\s+/).sort_by { |i| i.length }.last.strip
+       end
+
+       title
+     end
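The fallback branch above applies a simple heuristic to the raw `<title>` tag: split on common site-name separators (" - ", " | ", " : ") and keep the longest fragment, on the assumption that the article title is longer than the site name. A standalone sketch of just that heuristic (the helper name is hypothetical):

```ruby
# Split a raw <title> string on spaced separator characters and keep
# the longest fragment, which is usually the article title itself.
def strip_site_name(raw_title)
  raw_title.split(/\s+(\-|\||\:)\s+/).sort_by { |part| part.length }.last.strip
end

puts strip_site_name('Cramp: A New Framework - Ruby Inside')  # => "Cramp: A New Framework"
```

Note that the capture group in the split regexp means the separators themselves appear in the split result, but they lose to any real title fragment on length.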
+
+     # Returns an estimate of when the page/content was created.
+     # As clients of this library should be doing HTTP retrieval themselves, they can fall back
+     # on the Last-Modified HTTP header if they so wish. This method is rough and based on content only.
+     def datetime
+       # TODO: Clean all this mess up
+
+       mo = %r{(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|January|February|March|April|May|June|July|August|September|October|November|December)}i
+
+       regexen = [
+         /#{mo}\b\s+\d+\D{1,10}\d{4}/i,
+         /(on\s+)?\d+\s+#{mo}\s+\D{1,10}\d+/i,
+         /(on[^\d+]{1,10})?\d+(th|st|rd)?.{1,10}#{mo}\b[^\d]{1,10}\d+/i,
+         /on\s+#{mo}\s+\d+/i,
+         /#{mo}\s+\d+/i,
+         /\d{4}[\.\/\-]\d{2}[\.\/\-]\d{2}/,
+         /\d{2}[\.\/\-]\d{2}[\.\/\-]\d{4}/
+       ]
+
+       datetime = nil
+
+       regexen.each do |r|
+         datetime = @doc.to_html[r]
+         break if datetime
+       end
+
+       return unless datetime && datetime.length > 4
+
+       # Clean up the string for use by Chronic
+       datetime.strip!
+       datetime.gsub!(/(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)[^\w]*/i, '')
+       datetime.gsub!(/(mon|tues|tue|weds|wed|thurs|thur|thu|fri|sat|sun)[^\w]*/i, '')
+       datetime.sub!(/on\s+/, '')
+       datetime.gsub!(/\,/, '')
+       datetime.sub!(/(\d+)(th|st|rd)/, '\1')
+
+       Chronic.parse(datetime) || datetime
+     end
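Before the matched date string reaches Chronic, the method scrubs out the parts Chronic trips over: day-of-week names, a leading "on", commas, and ordinal suffixes. A sketch of just that cleanup stage, with the Chronic call itself omitted so it runs on the stdlib alone (the helper name is hypothetical):

```ruby
# Normalise a scraped date string the way datetime does before parsing:
# strip day names, a leading "on", commas, and ordinal suffixes (4th -> 4).
def clean_date_string(str)
  str = str.strip
  str.gsub!(/(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)[^\w]*/i, '')
  str.gsub!(/(mon|tues|tue|weds|wed|thurs|thur|thu|fri|sat|sun)[^\w]*/i, '')
  str.sub!(/on\s+/, '')
  str.gsub!(/\,/, '')
  str.sub!(/(\d+)(th|st|rd)/, '\1')
  str
end

puts clean_date_string('on Monday, January 4th, 2010')  # => "January 4 2010"
```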
+
+     # TODO: Attempt to work out what type of site or page the page is from the provided URL
+     # def site_type
+     # end
+
+     # Returns the author of the page/content
+     def author
+       author = @doc.match('.post-author .fn',
+                           '.wire_author',
+                           '.cnnByline b',
+                           ['meta[@name="author"]', lambda { |el| el.attr('content') }],  # Traditional meta tag style
+                           ['meta[@name="AUTHOR"]', lambda { |el| el.attr('content') }],  # CNN style
+                           '.byline a',               # Ruby Inside style
+                           '.post_subheader_left a',  # TechCrunch style
+                           '.byl',                    # BBC News style
+                           '.meta a',
+                           '.articledata .author a',
+                           '#owners a',               # Google Code style
+                           '.author a',
+                           '.author',
+                           '.auth a',
+                           '.auth',
+                           '.cT-storyDetails h5',     # smh.com.au - worth dropping maybe..
+                           ['meta[@name="byl"]', lambda { |el| el.attr('content') }],
+                           '.fn a',
+                           '.fn',
+                           '.byline-author'
+                          )
+
+       return unless author
+
+       # Strip off any "By [whoever]" section
+       author.sub!(/^(post(ed)?\s)?by\W+/i, '')
+
+       author
+     end
+
+     # Returns the "description" of the page, which usually comes from a meta tag
+     def description
+       @doc.match(
+         ['meta[@name="description"]', lambda { |el| el.attr('content') }],
+         ['meta[@name="Description"]', lambda { |el| el.attr('content') }],
+         '.description'
+       )
+     end
+
+     # Returns the "lede" or first paragraph of the story/page
+     def lede
+       lede = @doc.match(
+         '#blogpost p',
+         '.subhead',
+         '//div[@class="entrytext"]//p[string-length()>10]',  # Ruby Inside / Kubrick style
+         'section p',
+         '.entry .text p',
+         '.entry-content p',
+         '#wikicontent p',                                    # Google Code style
+         '//td[@class="storybody"]/p[string-length()>10]',    # BBC News style
+         '//div[@class="entry"]//p[string-length()>100]',
+         # The below is a horrible, horrible way to pluck out lead paras from crappy Blogspot blogs that
+         # don't use <p> tags..
+         ['.entry-content', lambda { |el| el.inner_html[/(#{el.inner_text[0..4].strip}.*?)\<br/, 1] }],
+         ['.entry', lambda { |el| el.inner_html[/(#{el.inner_text[0..4].strip}.*?)\<br/, 1] }],
+         '.entry',
+         '#content p',
+         '#article p',
+         '.post-body',
+         '.entry-content'
+       )
+
+       lede[/^(.*?\.\s){2}/m] || lede
+     end
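The last line of `lede` trims the matched text to roughly its first two sentences using the regexp `[/^(.*?\.\s){2}/m]`, falling back to the whole string when there aren't two full-stop-plus-space boundaries to match. A standalone sketch of that trimming step (the helper name is hypothetical):

```ruby
# Keep roughly the first two sentences of a text: two non-greedy runs
# each ending in a period followed by whitespace; else the whole string.
def first_two_sentences(text)
  text[/^(.*?\.\s){2}/m] || text
end

puts first_two_sentences('One. Two. Three.')  # => "One. Two. " (trailing space kept)
```

Note the heuristic is crude - abbreviations like "e.g. " also count as sentence boundaries.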
+
+     # Returns the "keywords" in the document (not the meta keywords - they're next to useless now)
+     def keywords(options = {})
+       options = { :stem_at => 10, :word_length_limit => 15, :limit => 20 }.merge(options)
+
+       words = {}
+
+       # Convert doc to lowercase, scrub out most HTML tags
+       body.downcase.gsub(/\<[^\>]{1,100}\>/, '').gsub(/\&\w+\;/, '').scan(/\b[a-z][a-z\'\#\.]*\b/).each do |word|
+         next if word.length > options[:word_length_limit]
+         word.gsub!(/\'\w+/, '')
+         words[word] ||= 0
+         words[word] += 1
+       end
+
+       # Stem the words and stop words if necessary
+       d = words.keys.uniq.map { |a| a.length > options[:stem_at] ? a.stem : a }
+       s = File.read(File.dirname(__FILE__) + '/stopwords.txt').split.map { |a| a.length > options[:stem_at] ? a.stem : a }
+
+       w = words.delete_if { |k1, v1| s.include?(k1) || (v1 < 2 && words.size > 80) }.sort_by { |k2, v2| v2 }.reverse.first(options[:limit])
+       return w
+     end
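Stripped of the stemming and the stopwords.txt lookup, the core of `keywords` is a word-frequency tally: lowercase the text, count words, drop stop words, and rank by descending count. A minimal sketch of that core (the tiny inline stop-word list and the helper name are illustrative only; the gem loads a full list from stopwords.txt and uses fast-stemmer):

```ruby
# A deliberately small stand-in stop-word list, for demonstration.
STOP_WORDS = %w[a an and the is it of to].freeze

# Tally lowercase words, rank by descending frequency, drop stop words,
# and return up to `limit` [word, count] pairs.
def top_keywords(text, limit = 5)
  counts = Hash.new(0)
  text.downcase.scan(/\b[a-z][a-z']*\b/) { |word| counts[word] += 1 }
  counts.sort_by { |_, count| -count }
        .reject  { |word, _| STOP_WORDS.include?(word) }
        .first(limit)
end

puts top_keywords('Cramp is an async framework. Cramp uses EventMachine.').inspect
```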
+
+     # Returns body text as determined by Arc90's Readability algorithm
+     def body
+       @body ||= Readability::Document.new(@doc.to_s).content.strip
+
+       # HACK: Remove annoying DIV that readability leaves around
+       @body.gsub!(/\A\<div\>/, '')
+       @body.gsub!(/\<\/div\>\Z/, '')
+
+       return @body
+     end
+
+     # Returns URL to the site's favicon
+     def favicon
+       url = @doc.match(['link[@rel="fluid-icon"]', lambda { |el| el.attr('href') }],  # Get a Fluid icon if possible..
+                        ['link[@rel="shortcut icon"]', lambda { |el| el.attr('href') }],
+                        ['link[@rel="icon"]', lambda { |el| el.attr('href') }])
+       if url && url !~ /^http/ && @url
+         url = URI.join(@url, url).to_s
+       end
+
+       url
+     end
+
+     # Returns URL of Web feed
+     def feed
+       url = @doc.match(['link[@type="application/rss+xml"]', lambda { |el| el.attr('href') }],
+                        ['link[@type="application/atom+xml"]', lambda { |el| el.attr('href') }]
+                       )
+
+       if url && url !~ /^http/ && @url
+         url = URI.join(@url, url).to_s
+       end
+
+       url
+     end
+   end
+ end
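Both `favicon` and `feed` end with the same resolution step: if the extracted href isn't absolute and the page's own URL is known, join the two with the stdlib's `URI.join`. A standalone sketch of that step (the helper name is hypothetical):

```ruby
require 'uri'

# Resolve a possibly-relative href against the page's URL. Absolute
# hrefs pass through untouched, as does anything we can't resolve.
def absolutise(href, page_url)
  return href if href.nil? || href =~ /^http/ || page_url.nil?
  URI.join(page_url, href).to_s
end

puts absolutise('/favicon.ico', 'http://www.rubyinside.com/some/post.html')
# => "http://www.rubyinside.com/favicon.ico"
```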