pismo 0.6.1 → 0.6.2
- data/README.markdown +18 -2
- data/VERSION +1 -1
- data/bin/pismo +1 -1
- data/lib/pismo/internal_attributes.rb +2 -2
- data/lib/pismo/reader.rb +7 -3
- data/pismo.gemspec +1 -1
- data/test/corpus/metadata_expected.yaml +3 -3
- data/test/corpus/reader_expected.yaml +0 -3
- metadata +1 -1
data/README.markdown
CHANGED
@@ -26,11 +26,27 @@ There's also a shorter "convenience" method which might be handy in IRB - it doe
 
 Pismo['http://www.rubyflow.com/items/4082'].title # => "Install Ruby as a non-root User"
 
-The current metadata methods are
+The current metadata methods are:
+* title
+* titles
+* author
+* authors
+* lede
+* keywords
+* sentences(qty)
+* body
+* html_body
+* feed
+* feeds
+* favicon
+* description
+* datetime
+
+These methods are not fully documented here yet - you'll just need to try them out. The plural methods like #titles, #authors, and #feeds will return multiple matches in an array, if present. This is so you can use your own techniques to choose a "best" result in ambiguous cases.
 
 The html_body and body methods will be of particular interest. They return the "body" of the page as determined by Pismo's "Reader" (like Arc90's Readability or Safari Reader) algorithm. #body returns it as plain-text, #html_body maintains some basic HTML styling.
 
-##
+## CAVEATS AND SHORTCOMINGS:
 
 There are some shortcomings or problems that I'm aware of and am going to pursue:
 
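The README text added in this hunk says the plural methods return every match as an array so that callers can choose a "best" result themselves. A minimal sketch of one way to do that follows. `pick_best` is a hypothetical helper, not part of Pismo, and the "most frequent, then earliest" heuristic is an assumption chosen only for illustration.

```ruby
# Hypothetical helper (not part of Pismo): pick one value from an array of
# candidates such as the one #titles, #authors, or #feeds is described as returning.
def pick_best(candidates)
  # Normalise to an array of non-empty, stripped strings.
  cleaned = Array(candidates).map { |c| c.to_s.strip }.reject(&:empty?)
  return nil if cleaned.empty?

  # Count how often each candidate appears.
  counts = Hash.new(0)
  cleaned.each { |c| counts[c] += 1 }

  # Prefer the most frequent candidate; break ties in favour of the earliest one.
  cleaned.max_by { |c| [counts[c], -cleaned.index(c)] }
end

pick_best(["Install Ruby", "RubyFlow", "Install Ruby"]) # => "Install Ruby"
```

Any other ranking (longest string, first match, similarity to the page's URL) could be swapped in at the `max_by` step.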
data/VERSION
CHANGED
@@ -1 +1 @@
-0.6.
+0.6.2
data/lib/pismo/internal_attributes.rb
CHANGED
@@ -216,7 +216,7 @@ module Pismo
 elsif lede && Array === lede
 return lede.map { |l| l.to_s[/^(.*?[\.\!\?]\s){2}/m].strip || l }.uniq
 else
-return reader_doc && !reader_doc.sentences(
+return reader_doc && !reader_doc.sentences(3).empty? ? reader_doc.sentences(3).join(' ') : nil
 end
 end
 
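The added line in this hunk falls back to the first three reader sentences, joined with spaces, and returns nil when there is no reader document or it yields no sentences. The sketch below restates that logic in isolation. `StubReaderDoc` and `fallback_lede` are hypothetical names, and the stub models only a `sentences(qty)` method; it is not Pismo's reader.

```ruby
# Hypothetical stand-in for the reader document; only #sentences is modelled.
StubReaderDoc = Struct.new(:all_sentences) do
  def sentences(qty = 3)
    all_sentences.first(qty)
  end
end

# Mirrors the new fallback: up to three sentences joined by a space, or nil.
def fallback_lede(reader_doc)
  return nil unless reader_doc

  picked = reader_doc.sentences(3)
  picked.empty? ? nil : picked.join(' ')
end

doc = StubReaderDoc.new(["One.", "Two.", "Three.", "Four."])
fallback_lede(doc) # => "One. Two. Three."
fallback_lede(nil) # => nil
```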
@@ -242,7 +242,7 @@ module Pismo
 
 # Convert doc to lowercase, scrub out most HTML tags, then keep track of words
 cached_title = title
-content_to_use = body.to_s.downcase + description.to_s.downcase
+content_to_use = body.to_s.downcase + " " + description.to_s.downcase
 
 # old regex for safe keeping -- \b[a-z][a-z\+\.\'\+\#\-]*\b
 content_to_use.downcase.gsub(/\<[^\>]{1,100}\>/, '').gsub(/\.+\s+/, ' ').gsub(/\&\w+\;/, '').scan(/(\b|\s|\A)([a-z0-9][a-z0-9\+\.\'\+\#\-\/\\]*)(\b|\s|\Z)/i).map{ |ta1| ta1[1] }.each do |word|
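The added line in this hunk puts a space between the body text and the description before the words are scanned. The sketch below shows why a separator matters when two strings are concatenated for word counting. `combined_words` is a hypothetical helper, and it uses `String#split` as a simplified stand-in for the scan regex in the hunk.

```ruby
# Hypothetical helper: join two texts with the given separator and split into words.
# String#split is a simplification of the word-scanning regex used in the diff.
def combined_words(body, description, separator)
  (body.to_s.downcase + separator + description.to_s.downcase).split
end

# Without a separator the last word of the body fuses with the first word
# of the description.
combined_words("Ruby is fun", "Parsing made easy", "")
# => ["ruby", "is", "funparsing", "made", "easy"]

# With a single space the boundary is preserved.
combined_words("Ruby is fun", "Parsing made easy", " ")
# => ["ruby", "is", "fun", "parsing", "made", "easy"]
```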
data/lib/pismo/reader.rb
CHANGED
@@ -1,5 +1,6 @@
 require 'nokogiri'
 require 'sanitize'
+begin; require 'ap'; rescue LoadError; end
 
 module Pismo
 module Reader
@@ -20,7 +21,7 @@ module Pismo
 GOOD_WORDS = %w{content post blogpost main story body entry text desc asset hentry single entrytext postcontent bodycontent}.uniq
 
 # Words that indicate crap in general
-BAD_WORDS = %w{reply metadata options commenting comments comment about footer header outer credit sidebar widget subscribe clearfix date social bookmarks links share video watch excerpt related supplement accessibility offscreen meta title signup blq secondary feedback featured clearfix small job jobs listing listings navigation nav byline addcomment postcomment trackback neighbor
+BAD_WORDS = %w{reply metadata options commenting comments comment about footer header outer credit sidebar widget subscribe clearfix date social bookmarks links share video watch excerpt related supplement accessibility offscreen meta title signup blq secondary feedback featured clearfix small job jobs listing listings navigation nav byline addcomment postcomment trackback neighbor ads commentform fbfans login similar thumb link blogroll grid twitter wrapper container nav sitesub printfooter editsection visualclear catlinks hidden toc contentsub caption disqus rss shoutbox sponsor}.uniq
 
 # Words that kill a branch dead
 FATAL_WORDS = %w{comments comment bookmarks social links ads related similar footer digg totop metadata sitesub nav sidebar commenting options addcomment leaderboard offscreen job prevlink prevnext navigation reply-link hide hidden sidebox archives vcard}
@@ -69,6 +70,8 @@ module Pismo
 
 @doc = Nokogiri::HTML(@raw_content, nil, 'utf-8')
 
+#ap @raw_content
+#exit
 build_analysis_tree
 end
 
@@ -221,7 +224,7 @@ module Pismo
 # Return the content from best match number of index (default 0) and, optionally, clean it to plain-text
 def content(clean = false, index = 0)
 return @content[[clean, index]] if @content[[clean, index]]
-return ''
+return '' if !@content_candidates || @content_candidates.empty?
 
 content_branch = @doc.at(@content_candidates[index].first)
 orphans_to_remove = []
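The added guard returns an empty string when there are no content candidates, before the next line indexes into `@content_candidates`. The sketch below isolates that pattern. `first_candidate_selector` is a hypothetical function, and the `[selector, score]` pair shape is an assumption inferred from the `.first` call in the hunk.

```ruby
# Hypothetical illustration of the guard: bail out with '' when the candidate
# list is nil or empty, otherwise take the first element of the chosen pair.
def first_candidate_selector(content_candidates, index = 0)
  return '' if !content_candidates || content_candidates.empty?

  content_candidates[index].first
end

first_candidate_selector(nil)                    # => ""
first_candidate_selector([])                     # => ""
first_candidate_selector([["div#main", 42]])     # => "div#main"
```

Without such a guard, indexing an empty array yields nil, and calling `.first` on nil raises NoMethodError.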
@@ -361,7 +364,8 @@ module Pismo
 fodder = content(true) if fodder.to_s.length < 50
 fodder.gsub!(/\b\w\W\s/, '')
 
-sentences = fodder.scan(/([\&\w\s\-\'\,\+\.\/\\\:\#\(\)\=\"\?\!]+?[\.\?\!])(\s|\Z)/im).map { |s| s.first }
+#sentences = fodder.scan(/([\&\w\s\-\'\,\+\.\/\\\:\#\(\)\=\"\?\!]+?[\.\?\!])(\s|\Z)/im).map { |s| s.first }
+sentences = fodder.scan(/(.+?[\.\?\!])(\s|\Z)/im).map { |s| s.first.strip }
 
 sentences.compact!
 sentences.map! { |s| s.strip }
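This hunk swaps a sentence regex built on a whitelist of characters for one that accepts any character up to a terminator followed by whitespace or end of input. The sketch below runs both patterns, copied from the hunk, over a sample string. `split_sentences` and the sample text are hypothetical and exist only to show the difference on characters such as `%` and `;` that the whitelist does not include.

```ruby
# Both patterns are copied from the diff above.
OLD_SENTENCE_RE = /([\&\w\s\-\'\,\+\.\/\\\:\#\(\)\=\"\?\!]+?[\.\?\!])(\s|\Z)/im
NEW_SENTENCE_RE = /(.+?[\.\?\!])(\s|\Z)/im

# Hypothetical helper: return the stripped first capture of every match.
def split_sentences(text, pattern)
  text.scan(pattern).map { |s| s.first.strip }
end

sample = "Prices rose 5%. Sales fell; nobody noticed."

# The permissive pattern keeps both sentences intact.
split_sentences(sample, NEW_SENTENCE_RE)
# => ["Prices rose 5%.", "Sales fell; nobody noticed."]

# The whitelist pattern cannot cross '%' or ';', so it loses the first
# sentence and the start of the second.
split_sentences(sample, OLD_SENTENCE_RE)
```

The trade-off is that the permissive pattern also treats any period followed by whitespace as a sentence end, so abbreviations can still split a sentence.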
data/test/corpus/metadata_expected.yaml
CHANGED
@@ -2,7 +2,7 @@
 :rww:
 :title: "Cartoon: Apple Tablet: Now With Barometer and Bird Call Generator"
 :feed: http://www.readwriteweb.com/rss.xml
-:lede: I'm just aching to know if the new Apple tablet (insert caveats, weasel words and qualifiers here) is a potential Cintiq competitor. I don't think it will be, but you never know.
+:lede: I'm just aching to know if the new Apple tablet (insert caveats, weasel words and qualifiers here) is a potential Cintiq competitor. I don't think it will be, but you never know. It may also have a built in barometer and bird call generator.
 :feeds:
 - http://www.readwriteweb.com/rss.xml
 - http://www.readwriteweb.com/archives/2010/01/cartoon_apple_tablet_now_with_barometer_and_bird_c.xml
@@ -42,7 +42,7 @@
 :spolsky:
 :title: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) - Joel on Software
 :description: Haven't mastered the basics of Unicode and character sets? Please don't write another line of code until you've read this article.
-:lede: I've been dismayed to discover just how many software developers aren't really completely up to speed on the mysterious world of character sets, encodings, Unicode, all that stuff. A couple of years ago, a beta tester for FogBUGZ was wondering whether it could handle incoming email in Japanese.
+:lede: I've been dismayed to discover just how many software developers aren't really completely up to speed on the mysterious world of character sets, encodings, Unicode, all that stuff. A couple of years ago, a beta tester for FogBUGZ was wondering whether it could handle incoming email in Japanese. Japanese?
 :author: Joel Spolsky
 :favicon: /favicon.ico
 :feed: http://www.joelonsoftware.com/rss.xml
@@ -68,6 +68,6 @@
 :sentences: I am pleased to report that the GCC Steering Committee and the FSF have approved the use of C++ in GCC itself. Of course, there's no reason for us to use C++ features just because we can. The goal is a better compiler for users, not a C++ code base for its own sake.
 :queness:
 :title: 18 Incredible CSS3 Effects You Have Never Seen Before
-:lede: "CSS3 is hot these days and will soon be available in most modern browser. Just recently, I started to become aware to the present of CSS3 around the web."
+:lede: "CSS3 is hot these days and will soon be available in most modern browser. Just recently, I started to become aware to the present of CSS3 around the web. I can see some of the websites such as twitter and designer portfolios websites are using it."
 :sentences: CSS3 is hot these days and will soon be available in most modern browser. Just recently, I started to become aware to the present of CSS3 around the web. I can see some of the websites such as twitter and designer portfolios websites are using it.
 :datetime: 2010-06-02 12:00:00 +01:00
data/test/corpus/reader_expected.yaml
CHANGED
@@ -17,9 +17,6 @@
 :gmane:
 - "I am pleased to report that the GCC Steering Committee and the FSF have approved the use of C++ in GCC itself."
 - "Of course, there's no reason for us to use C++ features just because we can."
-:huffington:
-- "The man on the motorcycle was going the wrong way down a one-way street, gesturing indignantly for the phalanx of traffic-clogged cars in front of him to move."
-- "\"Brother, why are you angry with us?\" said a passenger leaning out of one of the vehicles blocking his path."
 :queness:
 - "CSS3 is hot these days and will soon be available in most modern browser."
 - "Just recently, I started to become aware to the present of CSS3 around the web."