pismo 0.4.0 → 0.5.0

@@ -1,18 +1,15 @@
1
- = pismo (Web page content analyzer and metadata extractor)
1
+ # pismo - Web page content analysis and metadata extraction
2
+ http://github.com/peterc/pismo
2
3
 
3
- * http://github.com/peterc/pismo
4
-
5
- == DESCRIPTION:
4
+ ## DESCRIPTION:
6
5
 
7
6
  Pismo extracts metadata and machine-usable data from mostly unstructured (or poorly structured)
8
- HTML documents. These data include titles, feed URLs, ledes, body text, graphics, date, and keywords.
7
+ English-language HTML documents. These data include titles, feed URLs, ledes, body text, graphics, date, and keywords.
9
8
 
10
9
  For example, if you have a blog post HTML file, Pismo, in theory, should
11
10
  extract the title, the actual "content", and analyze for keywords, among other things.
12
11
 
13
- Pismo only understands (and much prefers) English. Je suis desolé.
14
-
15
- == EXAMPLES:
12
+ ## EXAMPLES:
16
13
 
17
14
  require 'pismo'
18
15
 
@@ -23,30 +20,48 @@ Pismo only understands (and much prefers) English. Je suis desolé.
23
20
  doc.author # => "Peter Cooper"
24
21
  doc.lede # => "Cramp (GitHub repo)is a new, asynchronous evented Web app framework by Pratik Naik of 37signals (and the Rails core team). It's built around Ruby's EventMachine library and was designed to use event-driven I/O throughout - making it ideal for situations where you need to handle a large number of open connections (such as Comet systems or streaming APIs.)"
25
22
  doc.keywords # => [["cramp", 7], ["controllers", 3], ["app", 3], ["basic", 2], ..., ... ]
23
+
24
+ ## STATUS:
26
25
 
27
- == NEW IN 0.4.0:
26
+ Pismo is a work in progress and is being used heavily in the development of http://coder.io/. Pismo is used in production systems on both Ruby 1.8 and 1.9. I do not know how it fares on JRuby, Rubinius, or others yet.
28
27
 
29
- Pismo is not perfect and you might like to instead see all of the potential titles/ledes/authors or feeds that Pismo can find. You can now do this and judge them by your metrics.
28
+ Planned/forthcoming features include the fetching of "external" data like tags from Delicious, content analysis through 3rd party services, and extraction of graphics from the main article text (for thumbnailing, say).
30
29
 
31
- doc.titles # => [..., ..., ...]
32
- doc.ledes # => [..., ..., ...]
33
- doc.authors # => [..., ..., ...]
34
- doc.feeds # => [..., ..., ...]
30
+ ## NEW IN 0.5.0:
31
+
32
+ ### Stopword access
33
+
34
+ You can now access Pismo's stopword list directly:
35
+
36
+ Pismo.stopwords # => [.., .., ..]
35
37
 
36
- == STATUS:
38
+ ### Convenience access method for IRB/debugging use
39
+
40
+ Now you can get playing with Pismo faster. This is primarily useful for debugging/playing in IRB as it just uses open-uri and the Pismo document is cached in the class against the URL:
41
+
42
+ url = "http://www.rubyinside.com/the-why-what-and-how-of-rubinius-1-0-s-release-3261.html"
43
+ Pismo[url].title # => "The Why, What, and How of Rubinius 1.0's Release"
44
+ Pismo[url].author # => "Peter Cooper"
37
45
 
38
- Pismo is a work in progress and is being used heavily in the development of http://coder.io/. Pismo is used in production systems on both Ruby 1.8 and 1.9. I do not know how it fares on JRuby, Rubinius, or others yet.
46
+ ### Arrays of all matches for titles, ledes, authors, and feeds
39
47
 
40
- == COMMAND LINE TOOL:
48
+ Pismo is not perfect and you might like to instead see all of the potential titles/ledes/authors or feeds that Pismo can find. You can now do this and judge them by your metrics.
49
+
50
+ doc.titles # => [..., ..., ...]
51
+ doc.ledes # => [..., ..., ...]
52
+ doc.authors # => [..., ..., ...]
53
+ doc.feeds # => [..., ..., ...]
54
+
55
+ ## COMMAND LINE TOOL:
41
56
 
42
57
  A command line tool called "pismo" is included so that you can get metadata about a page from the command line. This is
43
58
  great for testing, or perhaps calling it from a non Ruby script. The output is currently in YAML.
44
59
 
45
- * Usage:
60
+ ### Usage:
46
61
 
47
62
  ./bin/pismo http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html title lede author datetime
48
63
 
49
- * Output:
64
+ ### Output:
50
65
 
51
66
  ---
52
67
  :url: http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html
@@ -55,7 +70,7 @@ great for testing, or perhaps calling it from a non Ruby script. The output is c
55
70
  :author: Peter Cooper
56
71
  :datetime: 2010-01-07 12:00:00 +00:00
57
72
 
58
- == Note on Patches/Pull Requests
73
+ ## Note on Patches/Pull Requests
59
74
 
60
75
  * Fork the project.
61
76
  * Make your feature addition or bug fix.
@@ -63,7 +78,7 @@ great for testing, or perhaps calling it from a non Ruby script. The output is c
63
78
  * Commit, do not mess with Rakefile, version, or history as it's handled by Jeweler (which is awesome, btw).
64
79
  * Send me a pull request. I may or may not accept it (sorry, practicality rules.. but message me and we can talk!)
65
80
 
66
- == COPYRIGHT AND LICENSE
81
+ ## COPYRIGHT AND LICENSE
67
82
 
68
83
  Apache 2.0 License - See LICENSE for details.
69
84
 
data/VERSION CHANGED
@@ -1 +1 @@
1
- 0.4.0
1
+ 0.5.0
@@ -38,6 +38,10 @@ module Pismo
38
38
  @doc = Nokogiri::HTML(@html)
39
39
  end
40
40
 
41
+ def match(args = [], all = false)
42
+ @doc.match([*args], all)
43
+ end
44
+
41
45
  def clean_html(html)
42
46
  html.gsub!('’', '\'')
43
47
  html.gsub!('”', '"')
@@ -6,6 +6,7 @@ module Pismo
6
6
  # TODO: Memoizations
7
7
  title = @doc.match(
8
8
  [
9
+ '#pname a', # Google Code style
9
10
  '.entryheader h1', # Ruby Inside/Kubrick
10
11
  '.entry-title a', # Common Blogger/Blogspot rules
11
12
  '.post-title a',
@@ -14,16 +15,18 @@ module Pismo
14
15
  '.post-header h1',
15
16
  '.entry-title',
16
17
  '.post-title',
18
+ '.post h3 a',
19
+ 'a.datitle', # Slashdot style
17
20
  '.posttitle',
18
21
  '.post_title',
19
22
  '.pageTitle',
23
+ '#main h1.title',
20
24
  '.title h1',
21
25
  '.post h2',
22
26
  'h2.title',
23
27
  '.entry h2', # Common style
24
28
  '.boite_titre a',
25
29
  ['meta[@name="title"]', lambda { |el| el.attr('content') }],
26
- '#pname a', # Google Code style
27
30
  'h1.headermain',
28
31
  'h1.title',
29
32
  '.mxb h1', # BBC News
@@ -31,7 +34,14 @@ module Pismo
31
34
  '#content h2',
32
35
  '#content h3',
33
36
  'a[@rel="bookmark"]',
34
- '.products h2'
37
+ '.products h2',
38
+ '.caption h3',
39
+ '#main h2',
40
+ '#body h1',
41
+ '#wrapper h1',
42
+ '#page h1',
43
+ '.asset-header h1',
44
+ '#body_content h2'
35
45
  ],
36
46
  all
37
47
  )
@@ -55,9 +65,9 @@ module Pismo
55
65
  def html_title
56
66
  title = @doc.match('title')
57
67
  return unless title
58
-
68
+ title
59
69
  # Strip off any leading or trailing site names - a scrappy way to try it out..
60
- title = title.split(/\s+(\-|\||\:)\s+/).sort_by { |i| i.length }.last.to_s.strip
70
+ #title = title.split(/\s+(\-|\||\:)\s+/).sort_by { |i| i.length }.last.to_s.strip
61
71
  end
62
72
 
63
73
  # Return an estimate of when the page/content was created
@@ -115,8 +125,10 @@ module Pismo
115
125
  '.editorlink',
116
126
  '.authors p',
117
127
  ['meta[@name="author"]', lambda { |el| el.attr('content') }], # Traditional meta tag style
128
+ ['meta[@name="Author"]', lambda { |el| el.attr('content') }], # CNN style
118
129
  ['meta[@name="AUTHOR"]', lambda { |el| el.attr('content') }], # CNN style
119
130
  '.byline a', # Ruby Inside style
131
+ '.byline',
120
132
  '.post_subheader_left a', # TechCrunch style
121
133
  '.byl', # BBC News style
122
134
  '.meta a',
@@ -144,6 +156,11 @@ module Pismo
144
156
  # Strip off any "By [whoever]" section
145
157
  if String === author
146
158
  author.sub!(/^(post(ed)?\s)?by\W+/i, '')
159
+ author.tr!('^a-zA-Z 0-9\'', '|')
160
+ author = author.split(/\|{2,}/).first.to_s
161
+ author.gsub!(/\s+/, ' ')
162
+ author.gsub!(/\|/, '')
163
+ author.strip!
147
164
  elsif Array === author
148
165
  author.map! { |a| a.sub(/^(post(ed)?\s)?by\W+/i, '') }.uniq!
149
166
  end
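The author cleanup added in this hunk can be read as a small pipeline. Below is a minimal sketch of it, assuming a standalone helper named `clean_author` (a hypothetical name, not part of Pismo) and using non-mutating calls in place of the bang methods so it also works on frozen strings:

```ruby
# Hypothetical helper mirroring the String branch of the author cleanup.
def clean_author(raw)
  # Strip a leading "By", "Post by" or "Posted by".
  author = raw.sub(/^(post(ed)?\s)?by\W+/i, '')
  # Replace every character that is not a letter, digit, space or apostrophe with "|".
  author = author.tr('^a-zA-Z 0-9\'', '|')
  # A run of two or more "|" is treated as the end of the name; keep what precedes it.
  author = author.split(/\|{2,}/).first.to_s
  # Collapse whitespace, drop any isolated "|" and trim.
  author.gsub(/\s+/, ' ').delete('|').strip
end

puts clean_author("By Peter Cooper -- June 1, 2010")
```

The pipe character acts only as a placeholder: single punctuation marks inside a name are dropped, while two or more in a row cut the string off at that point.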
@@ -161,6 +178,7 @@ module Pismo
161
178
  @doc.match([
162
179
  ['meta[@name="description"]', lambda { |el| el.attr('content') }],
163
180
  ['meta[@name="Description"]', lambda { |el| el.attr('content') }],
181
+ ['meta[@name="DESCRIPTION"]', lambda { |el| el.attr('content') }],
164
182
  'rdf:Description[@name="dc:description"]',
165
183
  '.description'
166
184
  ])
@@ -171,6 +189,7 @@ module Pismo
171
189
  lede = @doc.match([
172
190
  '.post-text p',
173
191
  '#blogpost p',
192
+ '.story-teaser',
174
193
  '.subhead',
175
194
  '//div[@class="entrytext"]//p[string-length()>10]', # Ruby Inside / Kubrick style
176
195
  'section p',
@@ -209,24 +228,26 @@ module Pismo
209
228
 
210
229
  # Returns the "keywords" in the document (not the meta keywords - they're next to useless now)
211
230
  def keywords(options = {})
212
- options = { :stem_at => 10, :word_length_limit => 15, :limit => 20 }.merge(options)
231
+ options = { :stem_at => 20, :word_length_limit => 15, :limit => 20 }.merge(options)
213
232
 
214
233
  words = {}
215
234
 
216
235
  # Convert doc to lowercase, scrub out most HTML tags, then keep track of words
217
236
  cached_title = title
218
237
  content_to_use = body.to_s.downcase + description.to_s.downcase
219
-
220
- content_to_use.downcase.gsub(/\<[^\>]{1,100}\>/, '').gsub('. ', ' ').gsub(/\&\w+\;/, '').scan(/\b[a-z][a-z\+\.\'\+\#\-]*\b/).each do |word|
238
+
239
+ # old regex for safe keeping -- \b[a-z][a-z\+\.\'\+\#\-]*\b
240
+ content_to_use.downcase.gsub(/\<[^\>]{1,100}\>/, '').gsub(/\.+\s+/, ' ').gsub(/\&\w+\;/, '').scan(/(\b|\s|\A)([a-z0-9][a-z0-9\+\.\'\+\#\-\/\\]*)(\b|\s|\Z)/i).map{ |ta1| ta1[1] }.each do |word|
221
241
  next if word.length > options[:word_length_limit]
222
242
  word.gsub!(/\'\w+/, '')
223
243
  words[word] ||= 0
224
- words[word] += (cached_title =~ /#{word}/i ? 5 : 1)
244
+ words[word] += (cached_title.downcase.include?(word) ? 5 : 1)
225
245
  end
226
246
 
227
247
  # Stem the words and stop words if necessary
228
248
  d = words.keys.uniq.map { |a| a.length > options[:stem_at] ? a.stem : a }
229
- s = File.read(File.dirname(__FILE__) + '/stopwords.txt').split.map { |a| a.length > options[:stem_at] ? a.stem : a }
249
+ s = Pismo.stopwords.map { |a| a.length > options[:stem_at] ? a.stem : a }
250
+
230
251
 
231
252
  w = words.delete_if { |k1, v1| s.include?(k1) || (v1 < 2 && words.size > 80) }.sort_by { |k2, v2| v2 }.reverse.first(options[:limit])
232
253
  return w
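The effect of the new tokenizing regex is easiest to see on a short string. Below is a minimal sketch of just the counting step, assuming a hypothetical helper `count_words`. Stemming and stopword filtering are left out, and the duplicated `\+` in the character class is dropped, which should not change what it matches:

```ruby
# Hypothetical helper: tokenize text and weight words that also appear in the title.
def count_words(text, title, word_length_limit = 15)
  counts = Hash.new(0)
  cleaned = text.downcase
                .gsub(/<[^>]{1,100}>/, '') # scrub most HTML tags
                .gsub(/\.+\s+/, ' ')       # sentence-ending periods become spaces
                .gsub(/&\w+;/, '')         # drop HTML entities
  tokens = cleaned.scan(/(\b|\s|\A)([a-z0-9][a-z0-9\+\.\'\#\-\/\\]*)(\b|\s|\Z)/i)
  tokens.map { |groups| groups[1] }.each do |word|
    next if word.length > word_length_limit
    word = word.gsub(/'\w+/, '') # "ruby's" is counted as "ruby"
    counts[word] += title.downcase.include?(word) ? 5 : 1
  end
  counts
end

p count_words("<p>Cramp is new. Cramp uses C++ today</p>", "Cramp framework")
```

Because a token may now end at whitespace rather than only at a word boundary, `c++` survives as a keyword candidate; the old regex would have stopped at `c`.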
@@ -14,6 +14,8 @@
14
14
 
15
15
  require 'nokogiri'
16
16
 
17
+ IS_RUBY19 = "a".respond_to?(:encoding)
18
+
17
19
  module Readability
18
20
  class Document
19
21
  TEXT_LENGTH_THRESHOLD = 25
@@ -28,14 +30,14 @@ module Readability
28
30
  end
29
31
 
30
32
  def make_html
31
- @html = Nokogiri::HTML(@input, nil, 'UTF-8')
33
+ @html = Nokogiri::HTML(@input) #, nil, 'UTF-8')
32
34
  end
33
35
 
34
36
  REGEXES = {
35
37
  :unlikelyCandidatesRe => /combx|comment|disqus|foot|header|menu|meta|nav|rss|shoutbox|sidebar|sponsor/i,
36
38
  :okMaybeItsACandidateRe => /and|article|body|column|main/i,
37
- :positiveRe => /article|body|content|entry|hentry|page|pagination|post|text/i,
38
- :negativeRe => /combx|comment|contact|foot|footer|footnote|link|media|meta|promo|related|scroll|shoutbox|sponsor|tags/i,
39
+ :positiveRe => /article|body|content|entry|hentry|page|pagination|post|story|text/i,
40
+ :negativeRe => /combx|comment|contact|foot|box_wrap|footer|footnote|link|media|meta|promo|related|scroll|shoutbox|sponsor|tags/i,
39
41
  :divToPElementsRe => /<(a|blockquote|dl|div|img|ol|p|pre|table|ul)/i,
40
42
  :replaceBrsRe => /(<br[^>]*>[ \n\r\t]*){2,}/i,
41
43
  :replaceFontsRe => /<(\/?)font[^>]*>/i,
@@ -135,8 +137,16 @@ module Readability
135
137
  candidates[grand_parent_node] ||= score_node(grand_parent_node) if grand_parent_node
136
138
 
137
139
  content_score = 1
138
- content_score += inner_text.split(',').length
139
- content_score += [(inner_text.length / 100).to_i, 3].min
140
+
141
+ begin
142
+ content_score += inner_text.split(',').length
143
+ content_score += [(inner_text.length / 100).to_i, 3].min
144
+ rescue => e
145
+ raise e unless IS_RUBY19
146
+ inner_text.force_encoding('ASCII-8BIT')
147
+ content_score += inner_text.split(',').length
148
+ content_score += [(inner_text.length / 100).to_i, 3].min
149
+ end
140
150
 
141
151
  candidates[parent_node][:content_score] += content_score
142
152
  candidates[grand_parent_node][:content_score] += content_score / 2.0 if grand_parent_node
@@ -209,7 +219,8 @@ module Readability
209
219
  @html.css("*").each do |elem|
210
220
  if elem.name.downcase == "div"
211
221
  # transform <div>s that do not contain other block elements into <p>s
212
- if elem.inner_html !~ REGEXES[:divToPElementsRe]
222
+ elem_inner_html = IS_RUBY19 ? elem.inner_html.dup.force_encoding('ASCII-8BIT') : elem.inner_html
223
+ if elem_inner_html !~ REGEXES[:divToPElementsRe]
213
224
  debug("Altering div(##{elem[:id]}.#{elem[:class]}) to p");
214
225
  elem.name = "p"
215
226
  end
@@ -255,7 +266,7 @@ module Readability
255
266
  if weight + content_score < 0
256
267
  el.remove
257
268
  debug("Conditionally cleaned #{name}##{el[:id]}.#{el[:class]} with weight #{weight} and content score #{content_score} because score + content score was less than zero.")
258
- elsif el.text.count(",") < 10
269
+ elsif (IS_RUBY19 && el.text.force_encoding("ASCII-8BIT").count(",") < 10) || (!IS_RUBY19 && el.text.count(",") < 10)
259
270
  counts = %w[p img li a embed input].inject({}) { |m, kind| m[kind] = el.css(kind).length; m }
260
271
  counts["li"] -= 100
261
272
 
@@ -308,13 +319,23 @@ module Readability
308
319
 
309
320
  # Otherwise, replace the element with its contents
310
321
  else
311
- el.swap(el.text)
322
+ begin
323
+ el.swap(el.text)
324
+ rescue => e
325
+ raise e unless IS_RUBY19
326
+ el.swap(el.text.force_encoding("ASCII-8BIT"))
327
+ end
312
328
  end
313
329
 
314
330
  end
315
331
 
316
332
  # Get rid of duplicate whitespace
317
- node.to_html.gsub(/[\r\n\f]+/, "\n" ).gsub(/[\t ]+/, " ").gsub(/&nbsp;/, " ")
333
+ begin
334
+ node.to_html.gsub(/[\r\n\f]+/, "\n" ).gsub(/[\t ]+/, " ").gsub(/&nbsp;/, " ")
335
+ rescue => e
336
+ raise e unless IS_RUBY19
337
+ node.to_html.force_encoding("ASCII-8BIT").gsub(/[\r\n\f]+/, "\n" ).gsub(/[\t ]+/, " ").gsub(/&nbsp;/, " ")
338
+ end
318
339
  end
319
340
 
320
341
  end
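The Ruby 1.9 changes in this file share one pattern: try the string operation, and if it raises, retry on a binary (ASCII-8BIT) view of the same bytes. Below is a minimal sketch of that pattern, assuming a hypothetical helper `comma_count` and rescuing `ArgumentError` specifically, whereas the diff rescues any `StandardError` and re-raises on Ruby 1.8. Whether the first attempt actually raises on invalid bytes depends on the Ruby version, so the sketch is written to give the same result either way:

```ruby
# Hypothetical helper: count comma-separated pieces even if the text has invalid bytes.
def comma_count(text)
  text.split(',').length
rescue ArgumentError
  # Retry on a binary copy so invalid UTF-8 sequences no longer matter.
  text.dup.force_encoding('ASCII-8BIT').split(',').length
end

puts comma_count("a,\xFFb,c")
```

Working on a `dup` leaves the caller's string and its encoding untouched, which differs from the in-place `force_encoding` calls in the diff.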
@@ -1016,7 +1016,9 @@ your
1016
1016
  yours
1017
1017
  yourself
1018
1018
  yourselves
1019
+ generally
1019
1020
  z
1020
1021
  Zachary
1021
1022
  zero
1022
1023
  Zoe
1024
+ congratulations
data/lib/pismo.rb CHANGED
@@ -11,11 +11,24 @@ require 'pismo/document'
11
11
  require 'pismo/readability'
12
12
 
13
13
  module Pismo
14
- # Sugar method to make creating document objects nicer
14
+ # Sugar methods to make creating document objects nicer
15
15
  def self.document(handle, url = nil)
16
16
  Document.new(handle, url)
17
17
  end
18
18
 
19
+ # Loads a URL, as with Pismo['http://www.rubyinside.com'], and caches the Pismo document
20
+ # (mostly useful for debugging use)
21
+ def self.[](url)
22
+ @docs ||= {}
23
+ @docs[url] ||= Pismo::Document.new(open(url))
24
+ end
25
+
26
+
27
+ # Return stopword list
28
+ def self.stopwords
29
+ @stopwords ||= File.read(File.dirname(__FILE__) + '/pismo/stopwords.txt').split rescue []
30
+ end
31
+
19
32
  class NFunctions
20
33
  def self.match_href(list, expression)
21
34
  list.find_all { |node| node['href'] =~ /#{expression}/ }
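The `Pismo[url]` accessor above is a class-level memoizing lookup. Below is a minimal sketch of the same pattern without any network access, assuming a hypothetical module `DocCache` and a pluggable `loader` in place of `open(url)` and `Pismo::Document.new`:

```ruby
# Hypothetical stand-in for the Pismo module; only the caching pattern is shown.
module DocCache
  class << self
    attr_writer :loader

    # Build the value for a URL once, then return the cached object afterwards.
    def [](url)
      @docs ||= {}
      @docs[url] ||= @loader.call(url)
    end
  end
end

DocCache.loader = lambda { |url| "document for #{url}" }
first  = DocCache['http://www.rubyinside.com']
second = DocCache['http://www.rubyinside.com']
puts first.equal?(second)
```

The cache lives for the lifetime of the process and is never invalidated, which fits the stated debugging and IRB use but would serve stale documents in a long-running program.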
@@ -33,7 +46,13 @@ class Nokogiri::HTML::Document
33
46
  r = [] if all
34
47
  [*queries].each do |query|
35
48
  if query.is_a?(String)
36
- result = self.search(query).first.inner_text.strip rescue nil
49
+ if el = self.search(query).first
50
+ if el.name.downcase == "meta"
51
+ result = el['content'].strip rescue nil
52
+ else
53
+ result = el.inner_text.strip rescue nil
54
+ end
55
+ end
37
56
  elsif query.is_a?(Array)
38
57
  result = query[1].call(self.search(query.first).first).strip rescue nil
39
58
  end
data/pismo.gemspec CHANGED
@@ -5,24 +5,24 @@
5
5
 
6
6
  Gem::Specification.new do |s|
7
7
  s.name = %q{pismo}
8
- s.version = "0.4.0"
8
+ s.version = "0.5.0"
9
9
 
10
10
  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
11
11
  s.authors = ["Peter Cooper"]
12
- s.date = %q{2010-05-15}
12
+ s.date = %q{2010-06-01}
13
13
  s.default_executable = %q{pismo}
14
14
  s.description = %q{Pismo extracts and retrieves content-related metadata from HTML pages - you can use the resulting data in an organized way, such as a summary/first paragraph, body text, keywords, RSS feed URL, favicon, etc.}
15
15
  s.email = %q{git@peterc.org}
16
16
  s.executables = ["pismo"]
17
17
  s.extra_rdoc_files = [
18
18
  "LICENSE",
19
- "README.rdoc"
19
+ "README.markdown"
20
20
  ]
21
21
  s.files = [
22
22
  ".document",
23
23
  ".gitignore",
24
24
  "LICENSE",
25
- "README.rdoc",
25
+ "README.markdown",
26
26
  "Rakefile",
27
27
  "VERSION",
28
28
  "bin/pismo",
@@ -21,6 +21,7 @@
21
21
  :title: Gay Muslims made homeless by family violence
22
22
  :titles:
23
23
  - Gay Muslims made homeless by family violence
24
+ - BBC News - Gay Muslims made homeless by family violence
24
25
  :author: Poonam Taneja
25
26
  :authors:
26
27
  - Poonam Taneja
@@ -39,7 +40,7 @@
39
40
  :authors:
40
41
  - ymo1965
41
42
  :spolsky:
42
- :title: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
43
+ :title: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) - Joel on Software
43
44
  :description: Haven't mastered the basics of Unicode and character sets? Please don't write another line of code until you've read this article.
44
45
  :ledes:
45
46
  - Ever wonder about that mysterious Content-Type tag? You know, the one you're supposed to put in HTML and you never quite know what it should be?
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: pismo
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.4.0
4
+ version: 0.5.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Peter Cooper
@@ -9,7 +9,7 @@ autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
11
 
12
- date: 2010-05-15 00:00:00 +01:00
12
+ date: 2010-06-01 00:00:00 +01:00
13
13
  default_executable: pismo
14
14
  dependencies:
15
15
  - !ruby/object:Gem::Dependency
@@ -80,12 +80,12 @@ extensions: []
80
80
 
81
81
  extra_rdoc_files:
82
82
  - LICENSE
83
- - README.rdoc
83
+ - README.markdown
84
84
  files:
85
85
  - .document
86
86
  - .gitignore
87
87
  - LICENSE
88
- - README.rdoc
88
+ - README.markdown
89
89
  - Rakefile
90
90
  - VERSION
91
91
  - bin/pismo