pismo 0.6.2 → 0.7.0
- data/README.markdown +9 -5
- data/VERSION +1 -1
- data/bin/pismo +1 -0
- data/lib/pismo/document.rb +11 -4
- data/lib/pismo/internal_attributes.rb +2 -2
- data/lib/pismo/reader.rb +36 -27
- data/lib/pismo/stopwords.txt +10 -69
- data/pismo.gemspec +2 -2
- data/test/corpus/metadata_expected.yaml +1 -2
- data/test/corpus/reader_expected.yaml +2 -5
- metadata +2 -2
data/README.markdown
CHANGED
@@ -27,6 +27,7 @@ There's also a shorter "convenience" method which might be handy in IRB - it doe
|
|
27
27
|
Pismo['http://www.rubyflow.com/items/4082'].title # => "Install Ruby as a non-root User"
|
28
28
|
|
29
29
|
The current metadata methods are:
|
30
|
+
|
30
31
|
* title
|
31
32
|
* titles
|
32
33
|
* author
|
@@ -50,11 +51,14 @@ The html_body and body methods will be of particular interest. They return the "
|
|
50
51
|
|
51
52
|
There are some shortcomings or problems that I'm aware of and am going to pursue:
|
52
53
|
|
53
|
-
* I do not know how Pismo fares on
|
54
|
-
*
|
55
|
-
*
|
56
|
-
* The
|
57
|
-
* The
|
54
|
+
* I do not know how Pismo fares on Rubinius or other versions of 1.9 (e.g. 1.9.2) yet
|
55
|
+
* pismo does not install on JRuby due to a problem in the fast-stemmer dependency
|
56
|
+
* Some users have had issues with using Pismo from irb. This appears to be related to Nokogiri use causing a segfault
|
57
|
+
* The "Reader" content extraction algorithm is not perfect. It can sometimes return crap and can barf on certain types of characters for sentence extraction
|
58
|
+
* The author name extraction isn't very strong and is best avoided for now
|
59
|
+
* The image extraction only deals with images with absolute URLs
|
60
|
+
* The stopword list is a little too long (~1000 words) and needs to be trimmed
|
61
|
+
* The corpus in test/corpus needs significantly extending
|
58
62
|
|
59
63
|
## OTHER GROOVY STUFF:
|
60
64
|
|
data/VERSION
CHANGED
@@ -1 +1 @@
|
|
1
|
-
0.6.2
|
1
|
+
0.7.0
|
data/bin/pismo
CHANGED
@@ -32,6 +32,7 @@ if ARGV.empty?
|
|
32
32
|
P = doc
|
33
33
|
@p = doc
|
34
34
|
puts "Pismo has loaded #{url} into @p and P"
|
35
|
+
puts "Note: There have been several reports of Nokogiri segfaulting while using Pismo from irb. If this happens, try the same code as a standalone Ruby app."
|
35
36
|
IRB.start
|
36
37
|
else
|
37
38
|
output = { :url => doc.url }
|
data/lib/pismo/document.rb
CHANGED
@@ -33,7 +33,7 @@ module Pismo
|
|
33
33
|
handle
|
34
34
|
end
|
35
35
|
|
36
|
-
@html = clean_html(@html)
|
36
|
+
@html = self.class.clean_html(@html)
|
37
37
|
|
38
38
|
@doc = Nokogiri::HTML(@html)
|
39
39
|
end
|
@@ -42,11 +42,18 @@ module Pismo
|
|
42
42
|
@doc.match([*args], all)
|
43
43
|
end
|
44
44
|
|
45
|
-
def clean_html(html)
|
46
|
-
|
47
|
-
|
45
|
+
def self.clean_html(html)
|
46
|
+
# Normalize stupid entities
|
47
|
+
# TODO: Optimize this so we don't need all these sequential gsubs
|
48
|
+
html.gsub!(" ", " ")
|
49
|
+
html.gsub!(" ", " ")
|
50
|
+
html.gsub!(" ", " ")
|
48
51
|
html.gsub!('–', '-')
|
52
|
+
html.gsub!("‘", "'")
|
53
|
+
html.gsub!('’', "'")
|
49
54
|
html.gsub!('“', '"')
|
55
|
+
html.gsub!('”', '"')
|
56
|
+
html.gsub!("…", '...')
|
50
57
|
html.gsub!(' ', ' ')
|
51
58
|
html
|
52
59
|
end
|
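The TODO in the hunk above asks for a way to avoid the sequential `gsub!` calls. One possible approach, shown here only as a sketch, is a single pass with `String#gsub` and a replacement hash. The entity list, the constant names and the method name `normalize_entities` are assumptions made for illustration; they are not taken from Pismo.

```ruby
# Hypothetical single-pass alternative to a chain of gsub! calls.
# The entities listed here are an illustrative assumption, not Pismo's list.
ENTITY_REPLACEMENTS = {
  "&nbsp;"  => " ",
  "&#8211;" => "-",
  "&#8216;" => "'",
  "&#8217;" => "'",
  "&#8220;" => '"',
  "&#8221;" => '"',
  "&#8230;" => "..."
}.freeze

# One alternation that matches any of the keys literally.
ENTITY_PATTERN = Regexp.union(ENTITY_REPLACEMENTS.keys)

# Returns a new string; each match is looked up in the hash.
def normalize_entities(html)
  html.gsub(ENTITY_PATTERN, ENTITY_REPLACEMENTS)
end

puts normalize_entities("Tom&#8217;s&nbsp;&#8220;quick&#8221; test&#8230;")
```

Unlike the `gsub!` version in the diff, this sketch does not mutate its argument, which may or may not be what a caller wants.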
@@ -130,7 +130,6 @@ module Pismo
|
|
130
130
|
'.byline',
|
131
131
|
'.post_subheader_left a', # TechCrunch style
|
132
132
|
'.byl', # BBC News style
|
133
|
-
'.meta a',
|
134
133
|
'.articledata .author a',
|
135
134
|
'#owners a', # Google Code style
|
136
135
|
'.author a',
|
@@ -147,7 +146,8 @@ module Pismo
|
|
147
146
|
'.blog_meta a',
|
148
147
|
'cite a',
|
149
148
|
'cite',
|
150
|
-
'.contributor_details h4 a'
|
149
|
+
'.contributor_details h4 a',
|
150
|
+
'.meta a'
|
151
151
|
], all)
|
152
152
|
|
153
153
|
return unless author
|
data/lib/pismo/reader.rb
CHANGED
@@ -8,7 +8,7 @@ module Pismo
|
|
8
8
|
attr_reader :raw_content, :doc, :content_candidates
|
9
9
|
|
10
10
|
# Elements to keep for /input/ sanitization
|
11
|
-
OK_ELEMENTS = %w{a td br th tbody table tr div span img strong em b i body html head title p h1 h2 h3 h4 h5 h6 pre code tt ul li ol blockquote font big small section article abbr audio video cite dd dt figure caption sup form dl dt dd}
|
11
|
+
OK_ELEMENTS = %w{a td br th tbody table tr div span img strong em b i body html head title p h1 h2 h3 h4 h5 h6 pre code tt ul li ol blockquote font big small section article abbr audio video cite dd dt figure caption sup form dl dt dd center}
|
12
12
|
|
13
13
|
# Build a tree of attributes that are allowed for each element.. doing it this messy way due to how Sanitize works, alas
|
14
14
|
OK_ATTRIBUTES = {}
|
@@ -21,7 +21,7 @@ module Pismo
|
|
21
21
|
GOOD_WORDS = %w{content post blogpost main story body entry text desc asset hentry single entrytext postcontent bodycontent}.uniq
|
22
22
|
|
23
23
|
# Words that indicate crap in general
|
24
|
-
BAD_WORDS = %w{reply metadata options commenting comments comment about footer header outer credit sidebar widget subscribe clearfix date social bookmarks links share video watch excerpt related supplement accessibility offscreen meta title signup blq secondary feedback featured clearfix small job jobs listing listings navigation nav byline addcomment postcomment trackback neighbor ads commentform fbfans login similar thumb link blogroll grid twitter wrapper container nav sitesub printfooter editsection visualclear catlinks hidden toc contentsub caption disqus rss shoutbox sponsor}.uniq
|
24
|
+
BAD_WORDS = %w{reply metadata options commenting comments comment about footer header outer credit sidebar widget subscribe clearfix date social bookmarks links share video watch excerpt related supplement accessibility offscreen meta title signup blq secondary feedback featured clearfix small job jobs listing listings navigation nav byline addcomment postcomment trackback neighbor ads commentform fbfans login similar thumb link blogroll grid twitter wrapper container nav sitesub printfooter editsection visualclear catlinks hidden toc contentsub caption disqus rss shoutbox sponsor blogcomments}.uniq
|
25
25
|
|
26
26
|
# Words that kill a branch dead
|
27
27
|
FATAL_WORDS = %w{comments comment bookmarks social links ads related similar footer digg totop metadata sitesub nav sidebar commenting options addcomment leaderboard offscreen job prevlink prevnext navigation reply-link hide hidden sidebox archives vcard}
|
@@ -39,7 +39,7 @@ module Pismo
|
|
39
39
|
|
40
40
|
# Create a document object based on the raw HTML content provided
|
41
41
|
def initialize(raw_content)
|
42
|
-
@raw_content = raw_content
|
42
|
+
@raw_content = Pismo::Document.clean_html(raw_content)
|
43
43
|
build_doc
|
44
44
|
end
|
45
45
|
|
@@ -59,6 +59,17 @@ module Pismo
|
|
59
59
|
|
60
60
|
# Remove scripts manually, Sanitize and/or Nokogiri seem to go a bit funny with them
|
61
61
|
@raw_content.gsub!(/\<script .*?\<\/script\>/im, '')
|
62
|
+
|
63
|
+
# Get rid of bullshit "smart" quotes and other Unicode nonsense
|
64
|
+
@raw_content.force_encoding("ASCII-8BIT") if RUBY_VERSION > "1.9"
|
65
|
+
@raw_content.gsub!("\xe2\x80\x89", " ")
|
66
|
+
@raw_content.gsub!("\xe2\x80\x99", "'")
|
67
|
+
@raw_content.gsub!("\xe2\x80\x98", "'")
|
68
|
+
@raw_content.gsub!("\xe2\x80\x9c", '"')
|
69
|
+
@raw_content.gsub!("\xe2\x80\x9d", '"')
|
70
|
+
@raw_content.gsub!("\xe2\x80\xf6", '.')
|
71
|
+
@raw_content.force_encoding("UTF-8") if RUBY_VERSION > "1.9"
|
72
|
+
|
62
73
|
|
63
74
|
# Sanitize the HTML
|
64
75
|
@raw_content = Sanitize.clean(@raw_content,
|
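The byte-level replacements in the hunk above switch the string to ASCII-8BIT so that raw UTF-8 byte sequences can be matched. On Ruby 1.9 or later the same characters can also be addressed by code point, which avoids the encoding round trip. The following is a minimal sketch under that assumption; `PUNCTUATION_MAP` and `normalize_punctuation` are hypothetical names, and the ellipsis entry assumes that U+2026 is the character the last replacement was meant to target.

```ruby
# Hypothetical code-point based variant of the byte-level replacements.
PUNCTUATION_MAP = {
  "\u2009" => " ",   # thin space                  (UTF-8: e2 80 89)
  "\u2018" => "'",   # left single quotation mark  (UTF-8: e2 80 98)
  "\u2019" => "'",   # right single quotation mark (UTF-8: e2 80 99)
  "\u201C" => '"',   # left double quotation mark  (UTF-8: e2 80 9c)
  "\u201D" => '"',   # right double quotation mark (UTF-8: e2 80 9d)
  "\u2026" => "..."  # horizontal ellipsis         (UTF-8: e2 80 a6), assumed target
}.freeze

PUNCTUATION_PATTERN = Regexp.union(PUNCTUATION_MAP.keys)

# Expects a UTF-8 string and returns a new string with plain ASCII punctuation.
def normalize_punctuation(text)
  text.gsub(PUNCTUATION_PATTERN, PUNCTUATION_MAP)
end

puts normalize_punctuation("\u201CIt\u2019s fine\u201D\u2026")
```

A tentative observation: in UTF-8 the horizontal ellipsis is `\xe2\x80\xa6`, and `0xf6` is not a valid continuation byte, so the `\xe2\x80\xf6` pattern would not be expected to match well-formed UTF-8 input.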
@@ -70,8 +81,6 @@ module Pismo
|
|
70
81
|
|
71
82
|
@doc = Nokogiri::HTML(@raw_content, nil, 'utf-8')
|
72
83
|
|
73
|
-
#ap @raw_content
|
74
|
-
#exit
|
75
84
|
build_analysis_tree
|
76
85
|
end
|
77
86
|
|
@@ -102,20 +111,34 @@ module Pismo
|
|
102
111
|
# Assume that no content we'll want comes in a total package of fewer than 80 characters!
|
103
112
|
next unless el.text.to_s.strip.length >= 80
|
104
113
|
|
105
|
-
ids = (el['id'].to_s + ' ' + el['class'].to_s).downcase.strip.scan(/[a-z]+/)
|
106
114
|
path_segments = el.path.scan(/[a-z]+/)[2..-1] || []
|
107
115
|
depth = path_segments.length
|
116
|
+
|
117
|
+
local_ids = (el['id'].to_s + ' ' + el['class'].to_s).downcase.strip.scan(/[a-z]+/)
|
118
|
+
ids = local_ids
|
119
|
+
|
120
|
+
cp = el.parent
|
121
|
+
(depth - 1).times do
|
122
|
+
ids += (cp['id'].to_s + ' ' + cp['class'].to_s).downcase.strip.scan(/[a-z]+/)
|
123
|
+
cp = cp.parent
|
124
|
+
end if depth > 1
|
125
|
+
|
126
|
+
#puts "IDS"
|
127
|
+
#ap ids
|
128
|
+
#puts "LOCAL IDS"
|
129
|
+
#ap local_ids
|
108
130
|
|
109
131
|
branch = {}
|
110
132
|
branch[:ids] = ids
|
133
|
+
branch[:local_ids] = local_ids
|
111
134
|
branch[:score] = -(BAD_WORDS & ids).size
|
112
|
-
branch[:score] += (GOOD_WORDS & ids).size
|
113
|
-
next if branch[:score] <
|
135
|
+
branch[:score] += ((GOOD_WORDS & ids).size * 2)
|
136
|
+
next if branch[:score] < -5
|
114
137
|
|
115
138
|
#puts "#{ids.join(",")} - #{branch[:score].to_s} - #{el.text.to_s.strip.length}"
|
116
139
|
|
117
140
|
# Elements that have an ID or class are more likely to be our winners
|
118
|
-
branch[:score] += 2 unless
|
141
|
+
branch[:score] += 2 unless local_ids.empty?
|
119
142
|
|
120
143
|
branch[:name] = el.name
|
121
144
|
branch[:depth] = depth
|
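The scoring change in the hunk above collects id and class tokens from the element and its ancestors, subtracts one point per matching bad word, adds two per matching good word, skips branches below -5, and gives a bonus of 2 when the element itself carries an id or class. A simplified sketch of that idea follows. `Node`, `tokens` and `score` are hypothetical stand-ins (the real code works on Nokogiri elements and walks `depth - 1` parents rather than all of them), and the word lists are shortened for illustration.

```ruby
# Shortened word lists; the constants in reader.rb are much longer.
GOOD_WORDS = %w{content post main story body entry text}
BAD_WORDS  = %w{comment comments footer header sidebar nav meta}

# Hypothetical stand-in for a parsed element: id, class and parent only.
Node = Struct.new(:id, :klass, :parent)

# Lower-cased alphabetic tokens from an element's id and class attributes.
def tokens(node)
  (node.id.to_s + ' ' + node.klass.to_s).downcase.strip.scan(/[a-z]+/)
end

# Returns a score, or nil when the branch would be skipped.
def score(node)
  local_ids = tokens(node)
  ids = local_ids.dup

  # Walk up the tree, accumulating tokens from every ancestor.
  cp = node.parent
  while cp
    ids += tokens(cp)
    cp = cp.parent
  end

  total = -(BAD_WORDS & ids).size          # each distinct bad word costs 1
  total += (GOOD_WORDS & ids).size * 2     # each distinct good word earns 2
  return nil if total < -5

  total += 2 unless local_ids.empty?       # bonus for having its own id/class
  total
end

page     = Node.new("main", nil, nil)
article  = Node.new(nil, "post entry", page)
comments = Node.new("comments", nil, page)

puts score(article)
puts score(comments)
```

Because `&` is array intersection, a word that appears several times along the ancestor chain is counted once.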
@@ -198,6 +221,7 @@ module Pismo
|
|
198
221
|
branch[:score] -= 5 if branch[:bad_child_count] > 20
|
199
222
|
|
200
223
|
branch[:score] += depth
|
224
|
+
branch[:score] *= 0.8 if ids.length > 10
|
201
225
|
|
202
226
|
|
203
227
|
|
@@ -212,8 +236,7 @@ module Pismo
|
|
212
236
|
# Sort the branches by their score in reverse order
|
213
237
|
@content_candidates = sorted_tree.reverse.first([5, sorted_tree.length].min)
|
214
238
|
|
215
|
-
@content_candidates #.map { |i| [i[0], i[1][:name], i[1][:ids].join(','), i[1][:score] ]}
|
216
|
-
#ap @content_candidates
|
239
|
+
#ap @content_candidates #.map { |i| [i[0], i[1][:name], i[1][:ids].join(','), i[1][:score] ]}
|
217
240
|
#t2 = Time.now.to_i + (Time.now.usec.to_f / 1000000)
|
218
241
|
#puts t2 - t1
|
219
242
|
#exit
|
@@ -278,7 +301,7 @@ module Pismo
|
|
278
301
|
next
|
279
302
|
end
|
280
303
|
|
281
|
-
if el.name == "p" && el.text !~
|
304
|
+
if el.name == "p" && el.text !~ /(\.|\?|\!|\"|\')(\s|$)/ && el.inner_html !~ /\<img/
|
282
305
|
el.remove
|
283
306
|
next
|
284
307
|
end
|
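The new condition in the hunk above removes a `<p>` element only when its text has no sentence-like ending and its markup contains no image. The check can be tried in isolation with plain strings, as in this sketch; `drop_paragraph?` is a hypothetical helper name, and the arguments stand in for `el.text` and `el.inner_html`.

```ruby
# Terminal punctuation or a closing quote, followed by whitespace or end of line.
SENTENCE_END = /(\.|\?|\!|\"|\')(\s|$)/

# True when a paragraph would be removed: no sentence ending and no <img> tag.
def drop_paragraph?(text, inner_html)
  text !~ SENTENCE_END && inner_html !~ /<img/
end

puts drop_paragraph?("Read more", "Read more")
puts drop_paragraph?("It works.", "It works.")
puts drop_paragraph?("", '<img src="a.png">')
```

Short navigation-style fragments such as "Read more" are dropped, while image-only paragraphs survive even though they contain no sentence.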
@@ -321,29 +344,15 @@ module Pismo
|
|
321
344
|
# Remove empty tags
|
322
345
|
clean_html.gsub!(/<(\w+)><\/\1>/, "")
|
323
346
|
|
324
|
-
# Trim leading space from lines but without removing blank lines
|
325
|
-
#clean_html.gsub!(/^\ +(?=\S)/, '')
|
326
|
-
|
327
347
|
# Just a messy, hacky way to make output look nicer with subsequent paragraphs..
|
328
348
|
clean_html.gsub!(/<\/(div|p|h1|h2|h3|h4|h5|h6)>/, '</\1>' + "\n\n")
|
329
|
-
|
330
|
-
# Get rid of bullshit "smart" quotes
|
331
|
-
clean_html.force_encoding("ASCII-8BIT") if RUBY_VERSION > "1.9"
|
332
|
-
clean_html.gsub!("\xe2\x80\x89", " ")
|
333
|
-
clean_html.gsub!("\xe2\x80\x99", "'")
|
334
|
-
clean_html.gsub!("\xe2\x80\x98", "'")
|
335
|
-
clean_html.gsub!("\xe2\x80\x9c", '"')
|
336
|
-
clean_html.gsub!("\xe2\x80\x9d", '"')
|
337
|
-
clean_html.force_encoding("UTF-8") if RUBY_VERSION > "1.9"
|
338
349
|
|
339
350
|
@content[[clean, index]] = clean_html
|
340
351
|
end
|
341
352
|
|
342
353
|
def sentences(qty = 3)
|
343
|
-
# ap content
|
344
354
|
clean_content = Sanitize.clean(content, :elements => NON_HEADER_ELEMENTS, :attributes => OK_CLEAN_ATTRIBUTES, :remove_contents => %w{h1 h2 h3 h4 h5 h6})
|
345
|
-
|
346
|
-
#exit
|
355
|
+
|
347
356
|
fodder = ''
|
348
357
|
doc = Nokogiri::HTML(clean_content, nil, 'utf-8')
|
349
358
|
|
data/lib/pismo/stopwords.txt
CHANGED
@@ -1,9 +1,3 @@
|
|
1
|
-
0
|
2
|
-
1
|
3
|
-
10
|
4
|
-
100
|
5
|
-
20
|
6
|
-
a
|
7
1
|
a's
|
8
2
|
Aaliyah
|
9
3
|
Aaron
|
@@ -17,8 +11,6 @@ accordingly
|
|
17
11
|
across
|
18
12
|
actually
|
19
13
|
Adam
|
20
|
-
add
|
21
|
-
added
|
22
14
|
Addison
|
23
15
|
Adrian
|
24
16
|
after
|
@@ -67,7 +59,6 @@ annual
|
|
67
59
|
another
|
68
60
|
Anthony
|
69
61
|
Antonio
|
70
|
-
any
|
71
62
|
anybody
|
72
63
|
anyhow
|
73
64
|
anyone
|
@@ -107,7 +98,6 @@ Avery
|
|
107
98
|
away
|
108
99
|
awesome
|
109
100
|
awfully
|
110
|
-
b
|
111
101
|
Bailey
|
112
102
|
based
|
113
103
|
basically
|
@@ -118,7 +108,6 @@ become
|
|
118
108
|
becomes
|
119
109
|
becoming
|
120
110
|
been
|
121
|
-
before
|
122
111
|
beforehand
|
123
112
|
behind
|
124
113
|
being
|
@@ -130,16 +119,12 @@ beside
|
|
130
119
|
besides
|
131
120
|
best
|
132
121
|
better
|
133
|
-
between
|
134
122
|
beyond
|
135
123
|
big
|
136
124
|
biggest
|
137
|
-
bit
|
138
|
-
bits
|
139
125
|
Blake
|
140
126
|
both
|
141
127
|
bother
|
142
|
-
box
|
143
128
|
Brady
|
144
129
|
Brandon
|
145
130
|
Brayden
|
@@ -152,10 +137,7 @@ Brooke
|
|
152
137
|
Brooklyn
|
153
138
|
Bryan
|
154
139
|
Bryce
|
155
|
-
built
|
156
140
|
but
|
157
|
-
by
|
158
|
-
c
|
159
141
|
c'mon
|
160
142
|
c's
|
161
143
|
Caden
|
@@ -192,15 +174,14 @@ Cody
|
|
192
174
|
Cole
|
193
175
|
Colin
|
194
176
|
Colton
|
195
|
-
com
|
196
177
|
come
|
197
178
|
comes
|
198
179
|
coming
|
199
180
|
comment
|
200
181
|
company
|
201
|
-
compared
|
202
182
|
compelling
|
203
183
|
concerning
|
184
|
+
congratulations
|
204
185
|
Connor
|
205
186
|
consequently
|
206
187
|
consider
|
@@ -220,7 +201,6 @@ covering
|
|
220
201
|
cunt
|
221
202
|
currently
|
222
203
|
customizable
|
223
|
-
d
|
224
204
|
damn
|
225
205
|
Daniel
|
226
206
|
Danielle
|
@@ -258,7 +238,6 @@ driven
|
|
258
238
|
drove
|
259
239
|
during
|
260
240
|
Dylan
|
261
|
-
e
|
262
241
|
each
|
263
242
|
easier
|
264
243
|
edu
|
@@ -278,8 +257,6 @@ end
|
|
278
257
|
english
|
279
258
|
enough
|
280
259
|
entirely
|
281
|
-
episodes
|
282
|
-
equals
|
283
260
|
Eric
|
284
261
|
Erin
|
285
262
|
es
|
@@ -305,12 +282,10 @@ existing
|
|
305
282
|
extensive
|
306
283
|
extra
|
307
284
|
extremely
|
308
|
-
f
|
309
285
|
Faith
|
310
286
|
false
|
311
287
|
fame
|
312
288
|
far
|
313
|
-
favorite
|
314
289
|
feb
|
315
290
|
february
|
316
291
|
feel
|
@@ -335,19 +310,20 @@ fuck
|
|
335
310
|
full
|
336
311
|
further
|
337
312
|
furthermore
|
338
|
-
g
|
339
313
|
Gabriel
|
340
314
|
Gabriella
|
341
315
|
Gabrielle
|
342
316
|
Garrett
|
317
|
+
gave
|
343
318
|
Gavin
|
319
|
+
generally
|
344
320
|
get
|
345
321
|
gets
|
346
322
|
getting
|
323
|
+
give
|
347
324
|
given
|
348
325
|
gives
|
349
326
|
glory
|
350
|
-
go
|
351
327
|
goal
|
352
328
|
goes
|
353
329
|
going
|
@@ -358,7 +334,6 @@ gotten
|
|
358
334
|
Grace
|
359
335
|
great
|
360
336
|
greetings
|
361
|
-
h
|
362
337
|
had
|
363
338
|
hadn't
|
364
339
|
Hailey
|
@@ -395,23 +370,18 @@ himself
|
|
395
370
|
hire
|
396
371
|
his
|
397
372
|
hither
|
398
|
-
homepage
|
399
373
|
hopefully
|
400
|
-
hour
|
401
|
-
hours
|
402
374
|
how
|
403
375
|
howbeit
|
404
376
|
however
|
405
377
|
huge
|
406
378
|
Hunter
|
407
|
-
i
|
408
379
|
i'd
|
409
380
|
i'll
|
410
381
|
i'm
|
411
382
|
i've
|
412
383
|
Ian
|
413
384
|
ie
|
414
|
-
if
|
415
385
|
ignored
|
416
386
|
imagine
|
417
387
|
immediate
|
@@ -428,7 +398,6 @@ indicates
|
|
428
398
|
informative
|
429
399
|
inhibits
|
430
400
|
inner
|
431
|
-
inside
|
432
401
|
insofar
|
433
402
|
instead
|
434
403
|
interest
|
@@ -448,9 +417,7 @@ it'll
|
|
448
417
|
it's
|
449
418
|
its
|
450
419
|
itself
|
451
|
-
itunes
|
452
420
|
Ivan
|
453
|
-
j
|
454
421
|
Jack
|
455
422
|
Jackson
|
456
423
|
Jacob
|
@@ -492,7 +459,6 @@ jun
|
|
492
459
|
june
|
493
460
|
just
|
494
461
|
Justin
|
495
|
-
k
|
496
462
|
Kaden
|
497
463
|
Kaitlyn
|
498
464
|
Kaleb
|
@@ -513,7 +479,6 @@ known
|
|
513
479
|
knows
|
514
480
|
Kyle
|
515
481
|
Kylie
|
516
|
-
l
|
517
482
|
la
|
518
483
|
Landon
|
519
484
|
last
|
@@ -541,8 +506,6 @@ line
|
|
541
506
|
listing
|
542
507
|
listings
|
543
508
|
little
|
544
|
-
live
|
545
|
-
loading
|
546
509
|
Logan
|
547
510
|
look
|
548
511
|
looking
|
@@ -555,7 +518,6 @@ ltd
|
|
555
518
|
Lucas
|
556
519
|
Luis
|
557
520
|
Luke
|
558
|
-
m
|
559
521
|
Mackenzie
|
560
522
|
Madeline
|
561
523
|
Madison
|
@@ -592,8 +554,6 @@ Michelle
|
|
592
554
|
might
|
593
555
|
Miguel
|
594
556
|
mile
|
595
|
-
minute
|
596
|
-
minutes
|
597
557
|
more
|
598
558
|
moreover
|
599
559
|
Morgan
|
@@ -601,11 +561,9 @@ most
|
|
601
561
|
mostly
|
602
562
|
moving
|
603
563
|
much
|
604
|
-
multiple
|
605
564
|
must
|
606
565
|
my
|
607
566
|
myself
|
608
|
-
n
|
609
567
|
name
|
610
568
|
namely
|
611
569
|
Natalie
|
@@ -644,7 +602,6 @@ novel
|
|
644
602
|
november
|
645
603
|
now
|
646
604
|
nowhere
|
647
|
-
o
|
648
605
|
Obie
|
649
606
|
obviously
|
650
607
|
oct
|
@@ -670,7 +627,6 @@ or
|
|
670
627
|
org
|
671
628
|
oriented
|
672
629
|
Oscar
|
673
|
-
other
|
674
630
|
others
|
675
631
|
otherwise
|
676
632
|
ought
|
@@ -678,12 +634,9 @@ our
|
|
678
634
|
ours
|
679
635
|
ourselves
|
680
636
|
out
|
681
|
-
outside
|
682
|
-
over
|
683
637
|
overall
|
684
638
|
Owen
|
685
639
|
own
|
686
|
-
p
|
687
640
|
Paige
|
688
641
|
par
|
689
642
|
Parker
|
@@ -694,7 +647,6 @@ Patrick
|
|
694
647
|
Paul
|
695
648
|
peasy
|
696
649
|
per
|
697
|
-
perform
|
698
650
|
perhaps
|
699
651
|
piece
|
700
652
|
placed
|
@@ -714,11 +666,9 @@ proud
|
|
714
666
|
provide
|
715
667
|
provides
|
716
668
|
put
|
717
|
-
q
|
718
669
|
que
|
719
670
|
quite
|
720
671
|
qv
|
721
|
-
r
|
722
672
|
Rachel
|
723
673
|
rather
|
724
674
|
rd
|
@@ -744,7 +694,6 @@ Riley
|
|
744
694
|
Robert
|
745
695
|
run
|
746
696
|
Ryan
|
747
|
-
s
|
748
697
|
safest
|
749
698
|
said
|
750
699
|
Samantha
|
@@ -778,7 +727,6 @@ september
|
|
778
727
|
serious
|
779
728
|
seriously
|
780
729
|
set
|
781
|
-
Seth
|
782
730
|
settings
|
783
731
|
seven
|
784
732
|
several
|
@@ -822,6 +770,7 @@ step
|
|
822
770
|
Stephanie
|
823
771
|
Steven
|
824
772
|
still
|
773
|
+
stuff
|
825
774
|
sub
|
826
775
|
subscribe
|
827
776
|
such
|
@@ -831,7 +780,6 @@ sup
|
|
831
780
|
sur
|
832
781
|
sure
|
833
782
|
Sydney
|
834
|
-
t
|
835
783
|
t's
|
836
784
|
take
|
837
785
|
taken
|
@@ -871,6 +819,8 @@ they'd
|
|
871
819
|
they'll
|
872
820
|
they're
|
873
821
|
they've
|
822
|
+
thing
|
823
|
+
things
|
874
824
|
think
|
875
825
|
third
|
876
826
|
this
|
@@ -909,12 +859,9 @@ twice
|
|
909
859
|
two
|
910
860
|
Tyler
|
911
861
|
typically
|
912
|
-
u
|
913
862
|
ultra
|
914
863
|
un
|
915
|
-
under
|
916
864
|
unfortunately
|
917
|
-
unless
|
918
865
|
unlikely
|
919
866
|
unsurprisingly
|
920
867
|
until
|
@@ -929,7 +876,6 @@ uses
|
|
929
876
|
using
|
930
877
|
usually
|
931
878
|
uucp
|
932
|
-
v
|
933
879
|
value
|
934
880
|
Vanessa
|
935
881
|
various
|
@@ -940,7 +886,6 @@ Victoria
|
|
940
886
|
Vincent
|
941
887
|
viz
|
942
888
|
vs
|
943
|
-
w
|
944
889
|
walks
|
945
890
|
want
|
946
891
|
wants
|
@@ -982,6 +927,8 @@ who's
|
|
982
927
|
whoever
|
983
928
|
whole
|
984
929
|
whom
|
930
|
+
approximate
|
931
|
+
approximately
|
985
932
|
whose
|
986
933
|
why
|
987
934
|
will
|
@@ -1000,11 +947,8 @@ would
|
|
1000
947
|
wouldn't
|
1001
948
|
wrapped
|
1002
949
|
Wyatt
|
1003
|
-
x
|
1004
950
|
Xavier
|
1005
|
-
y
|
1006
951
|
yeah
|
1007
|
-
years
|
1008
952
|
yes
|
1009
953
|
yet
|
1010
954
|
you
|
@@ -1016,9 +960,6 @@ your
|
|
1016
960
|
yours
|
1017
961
|
yourself
|
1018
962
|
yourselves
|
1019
|
-
generally
|
1020
|
-
z
|
1021
963
|
Zachary
|
1022
964
|
zero
|
1023
|
-
Zoe
|
1024
|
-
congratulations
|
965
|
+
Zoe
|
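For readers unfamiliar with how a stopword list like the one above is typically applied, here is a generic sketch of keyword counting with stopwords filtered out. It is not Pismo's keyword code: the `STOPWORDS` set is a tiny stand-in rather than the contents of stopwords.txt, and `keywords` is a hypothetical helper.

```ruby
require 'set'

# Tiny stand-in list; the real file holds several hundred entries.
STOPWORDS = Set.new(%w{the and but after a's}.map { |w| w.downcase })

# Returns up to `limit` [word, count] pairs, most frequent first,
# with ties broken alphabetically.
def keywords(text, limit = 5)
  counts = Hash.new(0)
  text.downcase.scan(/[a-z']+/).each do |word|
    next if STOPWORDS.include?(word)
    counts[word] += 1
  end
  counts.sort_by { |word, count| [-count, word] }.first(limit)
end

p keywords("The cat and the hat. The cat sat.", 2)
```

Lower-casing both the list and the input means capitalized entries, such as the proper names in the file, would match regardless of case in this sketch.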
data/pismo.gemspec
CHANGED
@@ -5,11 +5,11 @@
|
|
5
5
|
|
6
6
|
Gem::Specification.new do |s|
|
7
7
|
s.name = %q{pismo}
|
8
|
-
s.version = "0.6.2"
|
8
|
+
s.version = "0.7.0"
|
9
9
|
|
10
10
|
s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
|
11
11
|
s.authors = ["Peter Cooper"]
|
12
|
-
s.date = %q{2010-
|
12
|
+
s.date = %q{2010-07-27}
|
13
13
|
s.default_executable = %q{pismo}
|
14
14
|
s.description = %q{Pismo extracts and retrieves content-related metadata from HTML pages - you can use the resulting data in an organized way, such as a summary/first paragraph, body text, keywords, RSS feed URL, favicon, etc.}
|
15
15
|
s.email = %q{git@peterc.org}
|
data/test/corpus/metadata_expected.yaml
CHANGED
@@ -42,7 +42,7 @@
|
|
42
42
|
:spolsky:
|
43
43
|
:title: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) - Joel on Software
|
44
44
|
:description: Haven't mastered the basics of Unicode and character sets? Please don't write another line of code until you've read this article.
|
45
|
-
:lede:
|
45
|
+
:lede: Ever wonder about that mysterious Content-Type tag? You know, the one you're supposed to put in HTML and you never quite know what it should be? Did you ever get an email from your friends in Bulgaria with the subject line "????
|
46
46
|
:author: Joel Spolsky
|
47
47
|
:favicon: /favicon.ico
|
48
48
|
:feed: http://www.joelonsoftware.com/rss.xml
|
@@ -61,7 +61,6 @@
|
|
61
61
|
:tweet:
|
62
62
|
:lede: Gobsmacked that TeX/LaTeX (document formatting tools) for OS X is a 1.3GB (yes, GIGAbytes) download OS X. Wow..!
|
63
63
|
:sentences: Gobsmacked that TeX/LaTeX (document formatting tools) for OS X is a 1.3GB (yes, GIGAbytes) download OS Wow..!
|
64
|
-
:datetime: 2010-06-05 12:00:00 +01:00
|
65
64
|
:cant_read:
|
66
65
|
:sentences: "For those of us who grew up as weird kids in the 1980s, the work of Berkeley Breathed was as important as those twin eternal pillars of weird-kid-dom: Monty Python and Mad magazine. In a word: seminal. In two words: fucking seminal."
|
67
66
|
:gmane:
|
data/test/corpus/reader_expected.yaml
CHANGED
@@ -27,16 +27,13 @@
|
|
27
27
|
- "I'm just aching to know if the new Apple tablet (insert caveats, weasel words and qualifiers here) is a potential Cintiq competitor."
|
28
28
|
- "I don't think it will be, but you never know."
|
29
29
|
:spolsky:
|
30
|
-
- "
|
31
|
-
- "
|
30
|
+
- "Ever wonder about that mysterious Content-Type tag?"
|
31
|
+
- "You know, the one you're supposed to put in HTML and you never quite know what it should be?"
|
32
32
|
:techcrunch:
|
33
33
|
- "Last week, we covered Googlle opening a school in India."
|
34
34
|
- "Googlle, not to be confused with Google."
|
35
35
|
:tweet:
|
36
36
|
- "Gobsmacked that TeX/LaTeX (document formatting tools) for OS X is a 1.3GB (yes, GIGAbytes) download OS Wow..!"
|
37
|
-
:youtube:
|
38
|
-
- "The location filter shows you popular videos from the selected country or region on lists like Most Viewed and in search results.If you would like to change either of these preferences, please use the links in the footer at the bottom of the page."
|
39
|
-
- "Click \"OK\" to accept these settings or click \"Cancel\" to set your language preference to \"English (UK)\" and your location filter to \"Worldwide\"."
|
40
37
|
:zefrank:
|
41
38
|
- "If there's anyone who knows how to marshal an online audience, it's Ze Frank."
|
42
39
|
- "Ze is best-known for his 2006 program \"The Show,\" in which he made a new 2-3 minute video every day for 1 year."
|
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: pismo
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.6.2
|
4
|
+
version: 0.7.0
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Peter Cooper
|
@@ -9,7 +9,7 @@ autorequire:
|
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
11
|
|
12
|
-
date: 2010-
|
12
|
+
date: 2010-07-27 00:00:00 +01:00
|
13
13
|
default_executable: pismo
|
14
14
|
dependencies:
|
15
15
|
- !ruby/object:Gem::Dependency
|