pismo 0.6.1 → 0.6.2

data/README.markdown CHANGED
@@ -26,11 +26,27 @@ There's also a shorter "convenience" method which might be handy in IRB - it doe
 
  Pismo['http://www.rubyflow.com/items/4082'].title # => "Install Ruby as a non-root User"
 
- The current metadata methods are #title, #titles, #author, #authors, #lede, #keywords, #sentences(qty), #body, #html_body, #feed, #feeds, #favicon, #description and #datetime. These are not fully documented here yet, you'll just need to try them out. The plural methods like #titles, #authors, and #feeds will return multiple matches in an array, if present. This is so you can use your own techniques to choose a "best" result in ambiguous cases.
+ The current metadata methods are:
+ * title
+ * titles
+ * author
+ * authors
+ * lede
+ * keywords
+ * sentences(qty)
+ * body
+ * html_body
+ * feed
+ * feeds
+ * favicon
+ * description
+ * datetime
+
+ These methods are not fully documented here yet - you'll just need to try them out. The plural methods like #titles, #authors, and #feeds will return multiple matches in an array, if present. This is so you can use your own techniques to choose a "best" result in ambiguous cases.
 
  The html_body and body methods will be of particular interest. They return the "body" of the page as determined by Pismo's "Reader" (like Arc90's Readability or Safari Reader) algorithm. #body returns it as plain-text, #html_body maintains some basic HTML styling.
 
- ## CAUTIONS / WARNINGS:
+ ## CAVEATS AND SHORTCOMINGS:
 
  There are some shortcomings or problems that I'm aware of and am going to pursue:
 
data/VERSION CHANGED
@@ -1 +1 @@
- 0.6.1
+ 0.6.2
data/bin/pismo CHANGED
@@ -37,7 +37,7 @@ else
  output = { :url => doc.url }
 
  ARGV.each do |cmd|
- output[cmd.to_sym] = doc.send(cmd) rescue nil
+ output[cmd.to_sym] = doc.send(cmd)
  end
 
  puts output.to_yaml
@@ -216,7 +216,7 @@ module Pismo
  elsif lede && Array === lede
  return lede.map { |l| l.to_s[/^(.*?[\.\!\?]\s){2}/m].strip || l }.uniq
  else
- return reader_doc && !reader_doc.sentences(2).empty? ? reader_doc.sentences(2).join(' ') : nil
+ return reader_doc && !reader_doc.sentences(3).empty? ? reader_doc.sentences(3).join(' ') : nil
  end
  end
 
@@ -242,7 +242,7 @@ module Pismo
 
  # Convert doc to lowercase, scrub out most HTML tags, then keep track of words
  cached_title = title
- content_to_use = body.to_s.downcase + description.to_s.downcase
+ content_to_use = body.to_s.downcase + " " + description.to_s.downcase
 
  # old regex for safe keeping -- \b[a-z][a-z\+\.\'\+\#\-]*\b
  content_to_use.downcase.gsub(/\<[^\>]{1,100}\>/, '').gsub(/\.+\s+/, ' ').gsub(/\&\w+\;/, '').scan(/(\b|\s|\A)([a-z0-9][a-z0-9\+\.\'\+\#\-\/\\]*)(\b|\s|\Z)/i).map{ |ta1| ta1[1] }.each do |word|
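The added `" "` in this hunk keeps the last word of the body from fusing with the first word of the description before the text is scanned for keywords. A minimal sketch of the effect, assuming made-up sample strings and a simplified word scan in place of Pismo's actual regex:

```ruby
# Hypothetical sample strings; not taken from Pismo's test corpus.
body = "Notes on installing Ruby"
description = "Rails deployment tips"

# 0.6.1 behaviour: the two strings are joined with no separator.
joined_old = body.to_s.downcase + description.to_s.downcase

# 0.6.2 behaviour: a space keeps the boundary words apart.
joined_new = body.to_s.downcase + " " + description.to_s.downcase

# Simplified stand-in for the keyword scan (assumption: plain
# alphanumeric runs only, unlike the real pattern in the hunk).
words_old = joined_old.scan(/[a-z0-9]+/)
words_new = joined_new.scan(/[a-z0-9]+/)

puts words_old.inspect
puts words_new.inspect
```

Without the separator, the boundary words are counted as one bogus token, so neither of them contributes to the keyword tally.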
data/lib/pismo/reader.rb CHANGED
@@ -1,5 +1,6 @@
  require 'nokogiri'
  require 'sanitize'
+ begin; require 'ap'; rescue LoadError; end
 
  module Pismo
  module Reader
@@ -20,7 +21,7 @@ module Pismo
  GOOD_WORDS = %w{content post blogpost main story body entry text desc asset hentry single entrytext postcontent bodycontent}.uniq
 
  # Words that indicate crap in general
- BAD_WORDS = %w{reply metadata options commenting comments comment about footer header outer credit sidebar widget subscribe clearfix date social bookmarks links share video watch excerpt related supplement accessibility offscreen meta title signup blq secondary feedback featured clearfix small job jobs listing listings navigation nav byline addcomment postcomment trackback neighbor snap nopreview ads commentform fbfans login similar thumb link blogroll grid twitter wrapper container nav sitesub printfooter editsection visualclear catlinks hidden toc contentsub caption disqus rss shoutbox sponsor}.uniq
+ BAD_WORDS = %w{reply metadata options commenting comments comment about footer header outer credit sidebar widget subscribe clearfix date social bookmarks links share video watch excerpt related supplement accessibility offscreen meta title signup blq secondary feedback featured clearfix small job jobs listing listings navigation nav byline addcomment postcomment trackback neighbor ads commentform fbfans login similar thumb link blogroll grid twitter wrapper container nav sitesub printfooter editsection visualclear catlinks hidden toc contentsub caption disqus rss shoutbox sponsor}.uniq
 
  # Words that kill a branch dead
  FATAL_WORDS = %w{comments comment bookmarks social links ads related similar footer digg totop metadata sitesub nav sidebar commenting options addcomment leaderboard offscreen job prevlink prevnext navigation reply-link hide hidden sidebox archives vcard}
@@ -69,6 +70,8 @@ module Pismo
 
  @doc = Nokogiri::HTML(@raw_content, nil, 'utf-8')
 
+ #ap @raw_content
+ #exit
  build_analysis_tree
  end
 
@@ -221,7 +224,7 @@ module Pismo
  # Return the content from best match number of index (default 0) and, optionally, clean it to plain-text
  def content(clean = false, index = 0)
  return @content[[clean, index]] if @content[[clean, index]]
- return '' unless @content_candidates && !@content_candidates.empty?
+ return '' if !@content_candidates || @content_candidates.empty?
 
  content_branch = @doc.at(@content_candidates[index].first)
  orphans_to_remove = []
@@ -361,7 +364,8 @@ module Pismo
  fodder = content(true) if fodder.to_s.length < 50
  fodder.gsub!(/\b\w\W\s/, '')
 
- sentences = fodder.scan(/([\&\w\s\-\'\,\+\.\/\\\:\#\(\)\=\"\?\!]+?[\.\?\!])(\s|\Z)/im).map { |s| s.first }
+ #sentences = fodder.scan(/([\&\w\s\-\'\,\+\.\/\\\:\#\(\)\=\"\?\!]+?[\.\?\!])(\s|\Z)/im).map { |s| s.first }
+ sentences = fodder.scan(/(.+?[\.\?\!])(\s|\Z)/im).map { |s| s.first.strip }
 
  sentences.compact!
  sentences.map! { |s| s.strip }
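The hunk above swaps the whitelist character class in the sentence scan for a plain `.+?`. A small sketch of what that changes, assuming a made-up input string and a hypothetical `split_sentences` helper that wraps the same `scan` call (the two patterns are copied from the hunk):

```ruby
# Patterns copied from the removed and added lines of the hunk.
OLD_SENTENCE_RE = /([\&\w\s\-\'\,\+\.\/\\\:\#\(\)\=\"\?\!]+?[\.\?\!])(\s|\Z)/im
NEW_SENTENCE_RE = /(.+?[\.\?\!])(\s|\Z)/im

# Hypothetical helper for illustration; Pismo does this inline.
def split_sentences(text, pattern)
  text.scan(pattern).map { |s| s.first.strip }
end

# Made-up sample text containing a character ("%") that the old
# whitelist does not include.
text = "Costs rose 5% last year. Nobody noticed!"

old_result = split_sentences(text, OLD_SENTENCE_RE)
new_result = split_sentences(text, NEW_SENTENCE_RE)

puts old_result.inspect
puts new_result.inspect
```

With the old pattern, a character outside the whitelist cuts off the start of the sentence that contains it. The new pattern accepts any character up to the terminating `.`, `?` or `!`, so the sentence survives whole.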
data/pismo.gemspec CHANGED
@@ -5,7 +5,7 @@
 
  Gem::Specification.new do |s|
  s.name = %q{pismo}
- s.version = "0.6.1"
+ s.version = "0.6.2"
 
  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
  s.authors = ["Peter Cooper"]
@@ -2,7 +2,7 @@
  :rww:
  :title: "Cartoon: Apple Tablet: Now With Barometer and Bird Call Generator"
  :feed: http://www.readwriteweb.com/rss.xml
- :lede: I'm just aching to know if the new Apple tablet (insert caveats, weasel words and qualifiers here) is a potential Cintiq competitor. I don't think it will be, but you never know.
+ :lede: I'm just aching to know if the new Apple tablet (insert caveats, weasel words and qualifiers here) is a potential Cintiq competitor. I don't think it will be, but you never know. It may also have a built in barometer and bird call generator.
  :feeds:
  - http://www.readwriteweb.com/rss.xml
  - http://www.readwriteweb.com/archives/2010/01/cartoon_apple_tablet_now_with_barometer_and_bird_c.xml
@@ -42,7 +42,7 @@
  :spolsky:
  :title: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) - Joel on Software
  :description: Haven't mastered the basics of Unicode and character sets? Please don't write another line of code until you've read this article.
- :lede: I've been dismayed to discover just how many software developers aren't really completely up to speed on the mysterious world of character sets, encodings, Unicode, all that stuff. A couple of years ago, a beta tester for FogBUGZ was wondering whether it could handle incoming email in Japanese.
+ :lede: I've been dismayed to discover just how many software developers aren't really completely up to speed on the mysterious world of character sets, encodings, Unicode, all that stuff. A couple of years ago, a beta tester for FogBUGZ was wondering whether it could handle incoming email in Japanese. Japanese?
  :author: Joel Spolsky
  :favicon: /favicon.ico
  :feed: http://www.joelonsoftware.com/rss.xml
@@ -68,6 +68,6 @@
  :sentences: I am pleased to report that the GCC Steering Committee and the FSF have approved the use of C++ in GCC itself. Of course, there's no reason for us to use C++ features just because we can. The goal is a better compiler for users, not a C++ code base for its own sake.
  :queness:
  :title: 18 Incredible CSS3 Effects You Have Never Seen Before
- :lede: "CSS3 is hot these days and will soon be available in most modern browser. Just recently, I started to become aware to the present of CSS3 around the web."
+ :lede: "CSS3 is hot these days and will soon be available in most modern browser. Just recently, I started to become aware to the present of CSS3 around the web. I can see some of the websites such as twitter and designer portfolios websites are using it."
  :sentences: CSS3 is hot these days and will soon be available in most modern browser. Just recently, I started to become aware to the present of CSS3 around the web. I can see some of the websites such as twitter and designer portfolios websites are using it.
  :datetime: 2010-06-02 12:00:00 +01:00
@@ -17,9 +17,6 @@
  :gmane:
  - "I am pleased to report that the GCC Steering Committee and the FSF have approved the use of C++ in GCC itself."
  - "Of course, there's no reason for us to use C++ features just because we can."
- :huffington:
- - "The man on the motorcycle was going the wrong way down a one-way street, gesturing indignantly for the phalanx of traffic-clogged cars in front of him to move."
- - "\"Brother, why are you angry with us?\" said a passenger leaning out of one of the vehicles blocking his path."
  :queness:
  - "CSS3 is hot these days and will soon be available in most modern browser."
  - "Just recently, I started to become aware to the present of CSS3 around the web."
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: pismo
  version: !ruby/object:Gem::Version
- version: 0.6.1
+ version: 0.6.2
  platform: ruby
  authors:
  - Peter Cooper