pismo 0.7.0 → 0.7.1
- data/.gitignore +5 -0
- data/Gemfile +4 -0
- data/README.markdown +13 -2
- data/Rakefile +2 -28
- data/bin/pismo +1 -1
- data/lib/pismo.rb +7 -3
- data/lib/pismo/internal_attributes.rb +20 -21
- data/lib/pismo/stopwords.txt +40 -3
- data/lib/pismo/version.rb +3 -0
- data/pismo.gemspec +24 -94
- data/test/corpus/metadata_expected.yaml +8 -8
- metadata +81 -45
- data/VERSION +0 -1
data/.gitignore
CHANGED
data/Gemfile
ADDED
data/README.markdown
CHANGED
@@ -6,7 +6,11 @@ Pismo extracts machine-usable metadata from unstructured (or poorly structured)
 Data that Pismo can extract include titles, feed URLs, ledes, body text, image URLs, date, and keywords.
 Pismo is used heavily in production on http://coder.io/ to extract data from Web pages.
 
-All tests pass on Ruby 1.8.7
+All tests pass on Ruby 1.8.7, Ruby 1.9.2 (both MRI) and JRuby 1.5.6.
+
+## NEWS:
+
+December 19, 2010: Version 1.7.1 has been released - it includes a patch from Darcy Laycock to fix keyword extraction problems on some pages, has switched from Jeweler to Bundler for management of the gem, and adds support for JRuby 1.5.6 by skipping stemming on that platform.
 
 ## USAGE:
 
@@ -46,12 +50,19 @@ The current metadata methods are:
 These methods are not fully documented here yet - you'll just need to try them out. The plural methods like #titles, #authors, and #feeds will return multiple matches in an array, if present. This is so you can use your own techniques to choose a "best" result in ambiguous cases.
 
 The html_body and body methods will be of particular interest. They return the "body" of the page as determined by Pismo's "Reader" (like Arc90's Readability or Safari Reader) algorithm. #body returns it as plain-text, #html_body maintains some basic HTML styling.
+
+New! The keywords method accepts optional arguments. These are the current defaults:
+
+:stem_at => 20, :word_length_limit => 15, :limit => 20, :remove_stopwords => true, :minimum_score => 2
+
+You can also pass an array to keywords with :hints => arr if you want only words of your choosing to be found.
 
 ## CAVEATS AND SHORTCOMINGS:
 
 There are some shortcomings or problems that I'm aware of and am going to pursue:
 
-* I do not know how Pismo fares on Rubinius
+* I do not know how Pismo fares on Rubinius
+* pismo requires Bundler - get it :-)
 * pismo does not install on JRuby due to a problem in the fast-stemmer dependency
 * Some users have had issues with using Pismo from irb. This appears to be related to Nokogiri use causing a segfault
 * The "Reader" content extraction algorithm is not perfect. It can sometimes return crap and can barf on certain types of characters for sentence extraction
data/Rakefile
CHANGED
@@ -1,29 +1,5 @@
-require '
-
-
-begin
-require 'jeweler'
-Jeweler::Tasks.new do |gem|
-gem.name = "pismo"
-gem.summary = %Q{Extracts or retrieves content-related metadata from HTML pages}
-gem.description = %Q{Pismo extracts and retrieves content-related metadata from HTML pages - you can use the resulting data in an organized way, such as a summary/first paragraph, body text, keywords, RSS feed URL, favicon, etc.}
-gem.email = "git@peterc.org"
-gem.homepage = "http://github.com/peterc/pismo"
-gem.authors = ["Peter Cooper"]
-gem.executables = "pismo"
-gem.default_executable = "pismo"
-gem.add_development_dependency "shoulda", ">= 0"
-gem.add_development_dependency "awesome_print"
-gem.add_dependency "jeweler"
-gem.add_dependency "nokogiri"
-gem.add_dependency "sanitize"
-gem.add_dependency "fast-stemmer"
-gem.add_dependency "chronic"
-end
-Jeweler::GemcutterTasks.new
-rescue LoadError
-puts "Jeweler (or a dependency) not available. Install it with: gem install jeweler"
-end
+require 'bundler'
+Bundler::GemHelper.install_tasks
 
 require 'rake/testtask'
 Rake::TestTask.new(:test) do |test|
@@ -45,8 +21,6 @@ rescue LoadError
 end
 end
 
-task :test => :check_dependencies
-
 task :default => :test
 
 require 'rake/rdoctask'
data/bin/pismo
CHANGED
@@ -32,7 +32,7 @@ if ARGV.empty?
 P = doc
 @p = doc
 puts "Pismo has loaded #{url} into @p and P"
-puts "Note: There have been several reports of Nokogiri segfaulting while using Pismo from irb. If this happens, try the same code as a standalone Ruby app."
+#puts "Note: There have been several reports of Nokogiri segfaulting while using Pismo from irb. If this happens, try the same code as a standalone Ruby app."
 IRB.start
 else
 output = { :url => doc.url }
data/lib/pismo.rb
CHANGED
@@ -2,7 +2,6 @@
 
 require 'open-uri'
 require 'nokogiri'
-require 'fast_stemmer'
 require 'chronic'
 require 'sanitize'
 require 'tempfile'
@@ -11,6 +10,12 @@ $: << File.dirname(__FILE__)
 require 'pismo/document'
 require 'pismo/reader'
 
+if RUBY_PLATFORM == "java"
+class String; def stem; self; end; end
+else
+require 'fast_stemmer'
+end
+
 module Pismo
 # Sugar methods to make creating document objects nicer
 def self.document(handle, url = nil)
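The hunk above skips stemming on JRuby by checking `RUBY_PLATFORM` and defining a no-op `String#stem`. A possible alternative, shown here only as a sketch and not as what the release does, is to attempt the require and fall back when it fails, so that any platform without the native extension gets the same no-op. The fallback behaviour (returning the word unchanged) mirrors the one-liner in the diff.

```ruby
# Sketch only: try the real stemmer first, fall back to an identity
# String#stem when the native extension cannot be loaded.
begin
  require 'fast_stemmer'
rescue LoadError
  class String
    # No-op "stemmer": returns the word unchanged, as the JRuby branch
    # in the diff does.
    def stem
      self
    end
  end
end

# Either way, callers can rely on String#stem being defined.
puts "running".stem
```

The trade-off is that a broken installation of fast-stemmer would silently degrade keyword quality instead of failing loudly, which may or may not be acceptable.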
@@ -59,8 +64,7 @@ class Nokogiri::HTML::Document
 end
 
 if result
-
-# result.gsub!(/\342\200\224/, '-')
+# TODO: Sort out sanitization in a more centralized way
 result.gsub!('’', '\'')
 result.gsub!('—', '-')
 if all
data/lib/pismo/internal_attributes.rb
CHANGED
@@ -15,6 +15,7 @@ module Pismo
 '.post-header h1',
 '.entry-title',
 '.post-title',
+'.post h1',
 '.post h3 a',
 'a.datitle', # Slashdot style
 '.posttitle',
@@ -93,9 +94,7 @@ module Pismo
 datetime = 10
 
 regexen.each do |r|
-datetime = @doc.to_html[r]
-# p datetime
-break if datetime
+break if datetime = @doc.to_html[r]
 end
 
 return unless datetime && datetime.length > 4
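The rewritten loop relies on `String#[]` with a regular expression, which returns the matched substring or `nil`, and stops at the first pattern that matches. A minimal standalone sketch of that idiom follows; the helper name and the two date patterns are hypothetical and are not taken from Pismo's `regexen` list.

```ruby
# Hypothetical helper: return the first substring matched by any of the
# given patterns, trying them in order, or nil when none match.
def first_regex_match(text, patterns)
  patterns.each do |pattern|
    match = text[pattern]   # String#[] with a Regexp: substring or nil
    return match if match
  end
  nil
end

# Illustrative patterns only
patterns = [
  /\d{1,2} \w+ \d{4}/,      # e.g. "19 December 2010"
  /\d{4}-\d{2}-\d{2}/       # e.g. "2010-12-19"
]

puts first_regex_match("Posted on 2010-12-19 by someone", patterns).inspect
puts first_regex_match("no date here", patterns).inspect
```

Returning from a helper avoids the assignment-inside-condition form (`break if datetime = ...`), which some Ruby versions flag with a warning when warnings are enabled.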
@@ -111,10 +110,6 @@ module Pismo
 Chronic.parse(datetime) || datetime
 end
 
-# TODO: Attempts to work out what type of site or page the page is from the provided URL
-# def site_type
-# end
-
 # Returns the author of the page/content
 def author(all = false)
 author = @doc.match([
@@ -189,13 +184,15 @@ module Pismo
 '.post-text p',
 '#blogpost p',
 '.story-teaser',
-'
+'.article .body p',
+'//div[@class="entrytext"]//p[string-length()>40]', # Ruby Inside / Kubrick style
 'section p',
 '.entry .text p',
+'.hentry .content p',
 '.entry-content p',
 '#wikicontent p', # Google Code style
 '.wikistyle p', # GitHub style
-'//td[@class="storybody"]/p[string-length()>
+'//td[@class="storybody"]/p[string-length()>40]', # BBC News style
 '//div[@class="entry"]//p[string-length()>100]',
 # The below is a horrible, horrible way to pluck out lead paras from crappy Blogspot blogs that
 # don't use <p> tags..
@@ -212,16 +209,16 @@ module Pismo
 
 # TODO: Improve sentence extraction - this is dire even if it "works for now"
 if lede && String === lede
-return (lede[/^(.*?[\.\!\?]\s){
+return (lede[/^(.*?[\.\!\?]\s){1,3}/m] || lede).to_s.strip
 elsif lede && Array === lede
-return lede.map { |l| l.to_s[/^(.*?[\.\!\?]\s){
+return lede.map { |l| l.to_s[/^(.*?[\.\!\?]\s){1,3}/m].strip || l }.uniq
 else
-return reader_doc && !reader_doc.sentences(
+return reader_doc && !reader_doc.sentences(4).empty? ? reader_doc.sentences(4).join(' ') : nil
 end
 end
 
 def ledes
-lede(true)
+lede(true) rescue []
 end
 
 # Returns a string containing the first [limit] sentences as determined by the Reader algorithm
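The added lines in this hunk cut a lede down to at most three sentences with the pattern `/^(.*?[\.\!\?]\s){1,3}/m`. The sketch below exercises that same pattern outside Pismo; the helper names are hypothetical. One detail worth noting: in the Array branch of the diff, `.strip` is applied before the `|| l` fallback, so an entry with no sentence terminator would call `strip` on `nil`. The sketch applies the fallback first, as the String branch does.

```ruby
# Pattern copied from the added lines: one to three runs of text, each
# ending in ., ! or ? followed by whitespace.
LEDE_PATTERN = /^(.*?[\.\!\?]\s){1,3}/m

# Hypothetical helper: first three sentences, or the whole text when the
# pattern does not match (for example, no trailing whitespace after the
# final full stop).
def first_sentences(text)
  (text.to_s[LEDE_PATTERN] || text).to_s.strip
end

# Hypothetical helper for the array case: fall back before stripping so
# a non-matching entry never calls strip on nil.
def first_sentences_each(texts)
  texts.map { |t| first_sentences(t) }.uniq
end

puts first_sentences("One. Two! Three? Four. Five.")
puts first_sentences_each(["A. B. ", "No stop", "A. B. "]).inspect
```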
@@ -236,29 +233,31 @@ module Pismo
 
 # Returns the "keywords" in the document (not the meta keywords - they're next to useless now)
 def keywords(options = {})
-options = { :stem_at => 20, :word_length_limit => 15, :limit => 20 }.merge(options)
+options = { :stem_at => 20, :word_length_limit => 15, :limit => 20, :remove_stopwords => true, :minimum_score => 2 }.merge(options)
 
 words = {}
 
 # Convert doc to lowercase, scrub out most HTML tags, then keep track of words
-cached_title = title
+cached_title = title.to_s
 content_to_use = body.to_s.downcase + " " + description.to_s.downcase
 
 # old regex for safe keeping -- \b[a-z][a-z\+\.\'\+\#\-]*\b
-content_to_use.downcase.gsub(/\<[^\>]{1,100}\>/, '').gsub(/\.+\s+/, ' ').gsub(/\&\w+\;/, '').scan(/(\b|\s|\A)([a-z0-9][a-z0-9\+\.\'
+content_to_use.downcase.gsub(/\<[^\>]{1,100}\>/, '').gsub(/\.+\s+/, ' ').gsub(/\&\w+\;/, '').scan(/(\b|\s|\A)([a-z0-9][a-z0-9\+\.\'\+\#\-\\]*)(\b|\s|\Z)/i).map{ |ta1| ta1[1] }.compact.each do |word|
 next if word.length > options[:word_length_limit]
-word.gsub!(
+word.gsub!(/^[\']/, '')
+word.gsub!(/[\.\-\']$/, '')
+next if options[:hints] && !options[:hints].include?(word)
 words[word] ||= 0
-words[word] += (cached_title.downcase
+words[word] += (cached_title.downcase =~ /\b#{word}\b/ ? 5 : 1)
 end
 
 # Stem the words and stop words if necessary
 d = words.keys.uniq.map { |a| a.length > options[:stem_at] ? a.stem : a }
 s = Pismo.stopwords.map { |a| a.length > options[:stem_at] ? a.stem : a }
 
-
-
-
+words.delete_if { |k1, v1| v1 < options[:minimum_score] }
+words.delete_if { |k1, v1| s.include?(k1) } if options[:remove_stopwords]
+words.sort_by { |k2, v2| v2 }.reverse.first(options[:limit])
 end
 
 def reader_doc
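To make the new `:minimum_score`, `:remove_stopwords` and `:hints` options easier to follow, here is a much-reduced standalone sketch of the scoring idea in this hunk: a word scores 5 per occurrence when it also appears in the title and 1 otherwise, low scorers and stopwords are dropped, and the top `:limit` pairs are returned. It is not Pismo's implementation. The function name, the tiny stopword list and the simplified tokenizer are assumptions, stemming and the word-length limit are omitted, and ties are broken alphabetically, which the original does not do.

```ruby
# Hypothetical stand-in for Pismo.stopwords
SKETCH_STOPWORDS = %w[a the is of and to in].freeze

# Simplified keyword scoring, loosely modelled on the hunk above.
def sketch_keywords(text, title, options = {})
  options = { :limit => 20, :minimum_score => 2, :remove_stopwords => true }.merge(options)
  scores = Hash.new(0)
  downcased_title = title.to_s.downcase

  text.to_s.downcase.scan(/[a-z0-9][a-z0-9'\-]*/).each do |word|
    # With :hints, only the listed words are counted at all
    next if options[:hints] && !options[:hints].include?(word)
    # Words that also appear in the title are weighted 5x
    scores[word] += (downcased_title =~ /\b#{Regexp.escape(word)}\b/ ? 5 : 1)
  end

  scores.delete_if { |_word, score| score < options[:minimum_score] }
  scores.delete_if { |word, _score| SKETCH_STOPWORDS.include?(word) } if options[:remove_stopwords]
  # Highest score first; alphabetical tie-break is an addition of this sketch
  scores.sort_by { |word, score| [-score, word] }.first(options[:limit])
end

text = "Ruby is fun. Ruby gems are fun to write. Rake is a tool."
p sketch_keywords(text, "Ruby gems")
p sketch_keywords(text, "Ruby gems", :hints => ["fun"])
```

The sketch escapes each word before interpolating it into the title regex; the diff interpolates `word` directly, which is worth keeping in mind because the tokenizer there admits characters such as `+` and `.`.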
data/lib/pismo/stopwords.txt
CHANGED
@@ -1,3 +1,4 @@
|
|
1
|
+
a
|
1
2
|
a's
|
2
3
|
Aaliyah
|
3
4
|
Aaron
|
@@ -70,6 +71,8 @@ apart
|
|
70
71
|
appear
|
71
72
|
appreciate
|
72
73
|
appropriate
|
74
|
+
approximate
|
75
|
+
approximately
|
73
76
|
apr
|
74
77
|
april
|
75
78
|
are
|
@@ -138,6 +141,7 @@ Brooklyn
|
|
138
141
|
Bryan
|
139
142
|
Bryce
|
140
143
|
but
|
144
|
+
by
|
141
145
|
c'mon
|
142
146
|
c's
|
143
147
|
Caden
|
@@ -238,6 +242,7 @@ driven
|
|
238
242
|
drove
|
239
243
|
during
|
240
244
|
Dylan
|
245
|
+
e
|
241
246
|
each
|
242
247
|
easier
|
243
248
|
edu
|
@@ -282,6 +287,7 @@ existing
|
|
282
287
|
extensive
|
283
288
|
extra
|
284
289
|
extremely
|
290
|
+
f
|
285
291
|
Faith
|
286
292
|
false
|
287
293
|
fame
|
@@ -310,6 +316,7 @@ fuck
|
|
310
316
|
full
|
311
317
|
further
|
312
318
|
furthermore
|
319
|
+
g
|
313
320
|
Gabriel
|
314
321
|
Gabriella
|
315
322
|
Gabrielle
|
@@ -334,6 +341,7 @@ gotten
|
|
334
341
|
Grace
|
335
342
|
great
|
336
343
|
greetings
|
344
|
+
h
|
337
345
|
had
|
338
346
|
hadn't
|
339
347
|
Hailey
|
@@ -376,12 +384,14 @@ howbeit
|
|
376
384
|
however
|
377
385
|
huge
|
378
386
|
Hunter
|
387
|
+
i
|
379
388
|
i'd
|
380
389
|
i'll
|
381
390
|
i'm
|
382
391
|
i've
|
383
392
|
Ian
|
384
393
|
ie
|
394
|
+
if
|
385
395
|
ignored
|
386
396
|
imagine
|
387
397
|
immediate
|
@@ -418,6 +428,7 @@ it's
|
|
418
428
|
its
|
419
429
|
itself
|
420
430
|
Ivan
|
431
|
+
j
|
421
432
|
Jack
|
422
433
|
Jackson
|
423
434
|
Jacob
|
@@ -440,6 +451,7 @@ Jessica
|
|
440
451
|
Jesus
|
441
452
|
jim
|
442
453
|
jimmy
|
454
|
+
jnr
|
443
455
|
Jocelyn
|
444
456
|
Joel
|
445
457
|
John
|
@@ -450,6 +462,7 @@ Jose
|
|
450
462
|
Joseph
|
451
463
|
Joshua
|
452
464
|
Josiah
|
465
|
+
jr
|
453
466
|
Juan
|
454
467
|
jul
|
455
468
|
Julia
|
@@ -459,6 +472,7 @@ jun
|
|
459
472
|
june
|
460
473
|
just
|
461
474
|
Justin
|
475
|
+
k
|
462
476
|
Kaden
|
463
477
|
Kaitlyn
|
464
478
|
Kaleb
|
@@ -479,6 +493,7 @@ known
|
|
479
493
|
knows
|
480
494
|
Kyle
|
481
495
|
Kylie
|
496
|
+
l
|
482
497
|
la
|
483
498
|
Landon
|
484
499
|
last
|
@@ -518,6 +533,7 @@ ltd
|
|
518
533
|
Lucas
|
519
534
|
Luis
|
520
535
|
Luke
|
536
|
+
m
|
521
537
|
Mackenzie
|
522
538
|
Madeline
|
523
539
|
Madison
|
@@ -564,6 +580,7 @@ much
|
|
564
580
|
must
|
565
581
|
my
|
566
582
|
myself
|
583
|
+
n
|
567
584
|
name
|
568
585
|
namely
|
569
586
|
Natalie
|
@@ -602,6 +619,7 @@ novel
|
|
602
619
|
november
|
603
620
|
now
|
604
621
|
nowhere
|
622
|
+
o
|
605
623
|
Obie
|
606
624
|
obviously
|
607
625
|
oct
|
@@ -637,6 +655,7 @@ out
|
|
637
655
|
overall
|
638
656
|
Owen
|
639
657
|
own
|
658
|
+
p
|
640
659
|
Paige
|
641
660
|
par
|
642
661
|
Parker
|
@@ -666,9 +685,11 @@ proud
|
|
666
685
|
provide
|
667
686
|
provides
|
668
687
|
put
|
688
|
+
q
|
669
689
|
que
|
670
690
|
quite
|
671
691
|
qv
|
692
|
+
r
|
672
693
|
Rachel
|
673
694
|
rather
|
674
695
|
rd
|
@@ -694,6 +715,7 @@ Riley
|
|
694
715
|
Robert
|
695
716
|
run
|
696
717
|
Ryan
|
718
|
+
s
|
697
719
|
safest
|
698
720
|
said
|
699
721
|
Samantha
|
@@ -764,6 +786,7 @@ specify
|
|
764
786
|
specifying
|
765
787
|
spoke
|
766
788
|
spread
|
789
|
+
sr
|
767
790
|
stand
|
768
791
|
started
|
769
792
|
step
|
@@ -780,6 +803,7 @@ sup
|
|
780
803
|
sur
|
781
804
|
sure
|
782
805
|
Sydney
|
806
|
+
t
|
783
807
|
t's
|
784
808
|
take
|
785
809
|
taken
|
@@ -859,6 +883,7 @@ twice
|
|
859
883
|
two
|
860
884
|
Tyler
|
861
885
|
typically
|
886
|
+
u
|
862
887
|
ultra
|
863
888
|
un
|
864
889
|
unfortunately
|
@@ -876,6 +901,7 @@ uses
|
|
876
901
|
using
|
877
902
|
usually
|
878
903
|
uucp
|
904
|
+
v
|
879
905
|
value
|
880
906
|
Vanessa
|
881
907
|
various
|
@@ -886,6 +912,7 @@ Victoria
|
|
886
912
|
Vincent
|
887
913
|
viz
|
888
914
|
vs
|
915
|
+
w
|
889
916
|
walks
|
890
917
|
want
|
891
918
|
wants
|
@@ -927,8 +954,6 @@ who's
|
|
927
954
|
whoever
|
928
955
|
whole
|
929
956
|
whom
|
930
|
-
approximate
|
931
|
-
approximately
|
932
957
|
whose
|
933
958
|
why
|
934
959
|
will
|
@@ -948,6 +973,7 @@ wouldn't
|
|
948
973
|
wrapped
|
949
974
|
Wyatt
|
950
975
|
Xavier
|
976
|
+
y
|
951
977
|
yeah
|
952
978
|
yes
|
953
979
|
yet
|
@@ -960,6 +986,17 @@ your
|
|
960
986
|
yours
|
961
987
|
yourself
|
962
988
|
yourselves
|
989
|
+
z
|
963
990
|
Zachary
|
964
991
|
zero
|
965
|
-
Zoe
|
992
|
+
Zoe
|
993
|
+
0
|
994
|
+
1
|
995
|
+
2
|
996
|
+
3
|
997
|
+
4
|
998
|
+
5
|
999
|
+
6
|
1000
|
+
7
|
1001
|
+
8
|
1002
|
+
9
|
data/pismo.gemspec
CHANGED
@@ -1,101 +1,31 @@
-# Generated by jeweler
-# DO NOT EDIT THIS FILE DIRECTLY
-# Instead, edit Jeweler::Tasks in Rakefile, and run the gemspec command
 # -*- encoding: utf-8 -*-
+$:.push File.expand_path("../lib", __FILE__)
+require "pismo/version"
 
 Gem::Specification.new do |s|
-s.name
-s.version
-
-s.
-s.
-s.
-s.
+s.name = "pismo"
+s.version = Pismo::VERSION
+s.platform = Gem::Platform::RUBY
+s.authors = ["Peter Cooper"]
+s.email = ["git@peterc.org"]
+s.homepage = "http://github.com/peterc/pismo"
+s.summary = %q{TODO: Write a gem summary}
 s.description = %q{Pismo extracts and retrieves content-related metadata from HTML pages - you can use the resulting data in an organized way, such as a summary/first paragraph, body text, keywords, RSS feed URL, favicon, etc.}
-s.
-s.
-s.
-"LICENSE",
-"README.markdown"
-]
-s.files = [
-".document",
-".gitignore",
-"LICENSE",
-"NOTICE",
-"README.markdown",
-"Rakefile",
-"VERSION",
-"bin/pismo",
-"lib/pismo.rb",
-"lib/pismo/document.rb",
-"lib/pismo/external_attributes.rb",
-"lib/pismo/internal_attributes.rb",
-"lib/pismo/reader.rb",
-"lib/pismo/stopwords.txt",
-"pismo.gemspec",
-"test/corpus/bbcnews.html",
-"test/corpus/bbcnews2.html",
-"test/corpus/briancray.html",
-"test/corpus/cant_read.html",
-"test/corpus/factor.html",
-"test/corpus/gmane.html",
-"test/corpus/huffington.html",
-"test/corpus/metadata_expected.yaml",
-"test/corpus/metadata_expected.yaml.old",
-"test/corpus/queness.html",
-"test/corpus/reader_expected.yaml",
-"test/corpus/rubyinside.html",
-"test/corpus/rww.html",
-"test/corpus/spolsky.html",
-"test/corpus/techcrunch.html",
-"test/corpus/tweet.html",
-"test/corpus/youtube.html",
-"test/corpus/zefrank.html",
-"test/helper.rb",
-"test/test_corpus.rb",
-"test/test_pismo_document.rb"
-]
-s.homepage = %q{http://github.com/peterc/pismo}
-s.rdoc_options = ["--charset=UTF-8"]
-s.require_paths = ["lib"]
-s.rubygems_version = %q{1.3.5}
-s.summary = %q{Extracts or retrieves content-related metadata from HTML pages}
-s.test_files = [
-"test/helper.rb",
-"test/test_corpus.rb",
-"test/test_pismo_document.rb"
-]
+s.summary = %q{Extracts or retrieves content-related metadata from HTML pages}
+s.date = %q{2010-07-27}
+s.default_executable = %q{pismo}
 
-
-current_version = Gem::Specification::CURRENT_SPECIFICATION_VERSION
-s.specification_version = 3
+s.rubyforge_project = "pismo"
 
-
-
-
-
-
-
-
-
-
-
-
-s.add_dependency(%q<jeweler>, [">= 0"])
-s.add_dependency(%q<nokogiri>, [">= 0"])
-s.add_dependency(%q<sanitize>, [">= 0"])
-s.add_dependency(%q<fast-stemmer>, [">= 0"])
-s.add_dependency(%q<chronic>, [">= 0"])
-end
-else
-s.add_dependency(%q<shoulda>, [">= 0"])
-s.add_dependency(%q<awesome_print>, [">= 0"])
-s.add_dependency(%q<jeweler>, [">= 0"])
-s.add_dependency(%q<nokogiri>, [">= 0"])
-s.add_dependency(%q<sanitize>, [">= 0"])
-s.add_dependency(%q<fast-stemmer>, [">= 0"])
-s.add_dependency(%q<chronic>, [">= 0"])
-end
+s.files = `git ls-files`.split("\n")
+s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
+s.executables = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
+s.require_paths = ["lib"]
+
+s.add_dependency(%q<shoulda>, [">= 0"])
+s.add_dependency(%q<awesome_print>, [">= 0"])
+s.add_dependency(%q<nokogiri>, [">= 0"])
+s.add_dependency(%q<sanitize>, [">= 0"])
+s.add_dependency(%q<fast-stemmer>, [">= 0"])
+s.add_dependency(%q<chronic>, [">= 0"])
 end
-
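In the removed Jeweler block of the Rakefile, `shoulda` and `awesome_print` were declared with `add_development_dependency`, whereas the new gemspec lists them with `add_dependency`, and the packaged metadata records them as `type: :runtime`. If that change was unintended, the distinction could be restored along the following lines. This is a sketch with a trimmed-down specification (the literal version string and empty file list are placeholders), not the gemspec shipped in 0.7.1.

```ruby
require 'rubygems'

# Trimmed-down, illustrative specification; only the dependency
# declarations are the point here.
spec = Gem::Specification.new do |s|
  s.name    = "pismo"
  s.version = "0.7.1"   # placeholder; the real gemspec reads Pismo::VERSION
  s.authors = ["Peter Cooper"]
  s.summary = "Extracts or retrieves content-related metadata from HTML pages"
  s.files   = []        # placeholder; the real gemspec uses `git ls-files`

  # Needed by anyone who installs the gem
  %w[nokogiri sanitize fast-stemmer chronic].each do |name|
    s.add_dependency(name)
  end

  # Needed only to run the test suite or to debug
  %w[shoulda awesome_print].each do |name|
    s.add_development_dependency(name)
  end
end

p spec.runtime_dependencies.map { |d| d.name }
p spec.development_dependencies.map { |d| d.name }
```

Declared this way, installing the gem would not pull in the testing libraries, while Bundler would still install them for development when the Gemfile loads the gemspec.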
data/test/corpus/metadata_expected.yaml
CHANGED
@@ -2,14 +2,14 @@
|
|
2
2
|
:rww:
|
3
3
|
:title: "Cartoon: Apple Tablet: Now With Barometer and Bird Call Generator"
|
4
4
|
:feed: http://www.readwriteweb.com/rss.xml
|
5
|
-
:lede: I'm just aching to know if the new Apple tablet (insert caveats, weasel words and qualifiers here) is a potential Cintiq competitor. I don't think it will be, but you never know. It may also have a built in barometer and bird call generator.
|
5
|
+
:lede: I'm just aching to know if the new Apple tablet (insert caveats, weasel words and qualifiers here) is a potential Cintiq competitor. I don't think it will be, but you never know. It may also have a built in barometer and bird call generator. I'm never sure if Apple does themselves more good than harm with the secrecy and anticipation that surrounds the run-up to these announcements.
|
6
6
|
:feeds:
|
7
7
|
- http://www.readwriteweb.com/rss.xml
|
8
8
|
- http://www.readwriteweb.com/archives/2010/01/cartoon_apple_tablet_now_with_barometer_and_bird_c.xml
|
9
9
|
:briancray:
|
10
10
|
:title: 5 great examples of popular blog posts that you should know
|
11
11
|
:feed: http://feeds.feedburner.com/briancray/blog
|
12
|
-
:lede: "This is a mock post.
|
12
|
+
:lede: "This is a mock post."
|
13
13
|
:huffington:
|
14
14
|
:title: Afghans Losing Hope After 8 Years Of War
|
15
15
|
:author: TODD PITMAN
|
@@ -31,9 +31,9 @@
|
|
31
31
|
:feed: http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/england/rss.xml
|
32
32
|
:factor:
|
33
33
|
:title: Factor's bootstrap process explained
|
34
|
-
:lede: "Separation of concerns between Factor VM and library codeThe Factor VM implements an abstract machine consisting of a data heap of objects, a code heap of machine code blocks, and a set of stacks. The VM loads an image file on startup, which becomes the data and code heap."
|
34
|
+
:lede: "Separation of concerns between Factor VM and library codeThe Factor VM implements an abstract machine consisting of a data heap of objects, a code heap of machine code blocks, and a set of stacks. The VM loads an image file on startup, which becomes the data and code heap. It then begins executing code in the image, by calling a special startup quotation.When new source files are loaded into a running Factor instance by the developer, they are parsed and compiled into a collection of objects -- words, quotations, and other literals, along with executable machine code."
|
35
35
|
:ledes:
|
36
|
-
- "Separation of concerns between Factor VM and library codeThe Factor VM implements an abstract machine consisting of a data heap of objects, a code heap of machine code blocks, and a set of stacks. The VM loads an image file on startup, which becomes the data and code heap."
|
36
|
+
- "Separation of concerns between Factor VM and library codeThe Factor VM implements an abstract machine consisting of a data heap of objects, a code heap of machine code blocks, and a set of stacks. The VM loads an image file on startup, which becomes the data and code heap. It then begins executing code in the image, by calling a special startup quotation.When new source files are loaded into a running Factor instance by the developer, they are parsed and compiled into a collection of objects -- words, quotations, and other literals, along with executable machine code."
|
37
37
|
:youtube:
|
38
38
|
:title: YMO - Rydeen (Official Video)
|
39
39
|
:author: ymo1965
|
@@ -42,7 +42,7 @@
|
|
42
42
|
:spolsky:
|
43
43
|
:title: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) - Joel on Software
|
44
44
|
:description: Haven't mastered the basics of Unicode and character sets? Please don't write another line of code until you've read this article.
|
45
|
-
:lede: Ever wonder about that mysterious Content-Type tag? You know, the one you're supposed to put in HTML and you never quite know what it should be? Did you ever get an email from your friends in Bulgaria with the subject line "????
|
45
|
+
:lede: "Ever wonder about that mysterious Content-Type tag? You know, the one you're supposed to put in HTML and you never quite know what it should be? Did you ever get an email from your friends in Bulgaria with the subject line \"???? "
|
46
46
|
:author: Joel Spolsky
|
47
47
|
:favicon: /favicon.ico
|
48
48
|
:feed: http://www.joelonsoftware.com/rss.xml
|
@@ -52,14 +52,14 @@
|
|
52
52
|
:rubyinside:
|
53
53
|
:title: "CoffeeScript: A New Language With A Pure Ruby Compiler"
|
54
54
|
:author: Peter Cooper
|
55
|
-
:lede: CoffeeScript (GitHub repo) is a new programming language with a pure Ruby compiler.
|
55
|
+
:lede: CoffeeScript (GitHub repo) is a new programming language with a pure Ruby compiler.
|
56
56
|
:feed: http://www.rubyinside.com/feed/
|
57
57
|
:zefrank:
|
58
58
|
:sentences: If there's anyone who knows how to marshal an online audience, it's Ze Frank. Ze is best-known for his 2006 program "The Show," in which he made a new 2-3 minute video every day for 1 year. Topics ranged from "fingers in food" to the mysteries of airport signage to a tour de force summary of creatives' addiction to un-executed ideas, aka brain crack.
|
59
59
|
:title: "Ze Frank on Imaginary Audiences :: Articles :: The 99 Percent"
|
60
60
|
:description: We chat with the Internet's most notorious mass-collaboration instigator Ze Frank about idea execution and how to build armies of sportsracers.
|
61
61
|
:tweet:
|
62
|
-
:lede: Gobsmacked that TeX/LaTeX (document formatting tools) for OS X is a 1.3GB (yes, GIGAbytes) download OS X.
|
62
|
+
:lede: Gobsmacked that TeX/LaTeX (document formatting tools) for OS X is a 1.3GB (yes, GIGAbytes) download OS X.
|
63
63
|
:sentences: Gobsmacked that TeX/LaTeX (document formatting tools) for OS X is a 1.3GB (yes, GIGAbytes) download OS Wow..!
|
64
64
|
:cant_read:
|
65
65
|
:sentences: "For those of us who grew up as weird kids in the 1980s, the work of Berkeley Breathed was as important as those twin eternal pillars of weird-kid-dom: Monty Python and Mad magazine. In a word: seminal. In two words: fucking seminal."
|
@@ -67,6 +67,6 @@
|
|
67
67
|
:sentences: I am pleased to report that the GCC Steering Committee and the FSF have approved the use of C++ in GCC itself. Of course, there's no reason for us to use C++ features just because we can. The goal is a better compiler for users, not a C++ code base for its own sake.
|
68
68
|
:queness:
|
69
69
|
:title: 18 Incredible CSS3 Effects You Have Never Seen Before
|
70
|
-
:lede: "CSS3 is hot these days and will soon be available in most modern browser. Just recently, I started to become aware to the present of CSS3 around the web. I can see some of the websites such as twitter and designer portfolios websites are using it."
|
70
|
+
:lede: "CSS3 is hot these days and will soon be available in most modern browser. Just recently, I started to become aware to the present of CSS3 around the web. I can see some of the websites such as twitter and designer portfolios websites are using it. Also, I have started to implement it to my own project as well and I really love it!"
|
71
71
|
:sentences: CSS3 is hot these days and will soon be available in most modern browser. Just recently, I started to become aware to the present of CSS3 around the web. I can see some of the websites such as twitter and designer portfolios websites are using it.
|
72
72
|
:datetime: 2010-06-02 12:00:00 +01:00
|
metadata
CHANGED
@@ -1,7 +1,12 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: pismo
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
|
4
|
+
prerelease: false
|
5
|
+
segments:
|
6
|
+
- 0
|
7
|
+
- 7
|
8
|
+
- 1
|
9
|
+
version: 0.7.1
|
5
10
|
platform: ruby
|
6
11
|
authors:
|
7
12
|
- Peter Cooper
|
@@ -14,91 +19,99 @@ default_executable: pismo
|
|
14
19
|
dependencies:
|
15
20
|
- !ruby/object:Gem::Dependency
|
16
21
|
name: shoulda
|
17
|
-
|
18
|
-
|
19
|
-
|
22
|
+
prerelease: false
|
23
|
+
requirement: &id001 !ruby/object:Gem::Requirement
|
24
|
+
none: false
|
20
25
|
requirements:
|
21
26
|
- - ">="
|
22
27
|
- !ruby/object:Gem::Version
|
28
|
+
segments:
|
29
|
+
- 0
|
23
30
|
version: "0"
|
24
|
-
|
31
|
+
type: :runtime
|
32
|
+
version_requirements: *id001
|
25
33
|
- !ruby/object:Gem::Dependency
|
26
34
|
name: awesome_print
|
27
|
-
|
28
|
-
|
29
|
-
|
35
|
+
prerelease: false
|
36
|
+
requirement: &id002 !ruby/object:Gem::Requirement
|
37
|
+
none: false
|
30
38
|
requirements:
|
31
39
|
- - ">="
|
32
40
|
- !ruby/object:Gem::Version
|
41
|
+
segments:
|
42
|
+
- 0
|
33
43
|
version: "0"
|
34
|
-
version:
|
35
|
-
- !ruby/object:Gem::Dependency
|
36
|
-
name: jeweler
|
37
44
|
type: :runtime
|
38
|
-
|
39
|
-
version_requirements: !ruby/object:Gem::Requirement
|
40
|
-
requirements:
|
41
|
-
- - ">="
|
42
|
-
- !ruby/object:Gem::Version
|
43
|
-
version: "0"
|
44
|
-
version:
|
45
|
+
version_requirements: *id002
|
45
46
|
- !ruby/object:Gem::Dependency
|
46
47
|
name: nokogiri
|
47
|
-
|
48
|
-
|
49
|
-
|
48
|
+
prerelease: false
|
49
|
+
requirement: &id003 !ruby/object:Gem::Requirement
|
50
|
+
none: false
|
50
51
|
requirements:
|
51
52
|
- - ">="
|
52
53
|
- !ruby/object:Gem::Version
|
54
|
+
segments:
|
55
|
+
- 0
|
53
56
|
version: "0"
|
54
|
-
|
57
|
+
type: :runtime
|
58
|
+
version_requirements: *id003
|
55
59
|
- !ruby/object:Gem::Dependency
|
56
60
|
name: sanitize
|
57
|
-
|
58
|
-
|
59
|
-
|
61
|
+
prerelease: false
|
62
|
+
requirement: &id004 !ruby/object:Gem::Requirement
|
63
|
+
none: false
|
60
64
|
requirements:
|
61
65
|
- - ">="
|
62
66
|
- !ruby/object:Gem::Version
|
67
|
+
segments:
|
68
|
+
- 0
|
63
69
|
version: "0"
|
64
|
-
|
70
|
+
type: :runtime
|
71
|
+
version_requirements: *id004
|
65
72
|
- !ruby/object:Gem::Dependency
|
66
73
|
name: fast-stemmer
|
67
|
-
|
68
|
-
|
69
|
-
|
74
|
+
prerelease: false
|
75
|
+
requirement: &id005 !ruby/object:Gem::Requirement
|
76
|
+
none: false
|
70
77
|
requirements:
|
71
78
|
- - ">="
|
72
79
|
- !ruby/object:Gem::Version
|
80
|
+
segments:
|
81
|
+
- 0
|
73
82
|
version: "0"
|
74
|
-
|
83
|
+
type: :runtime
|
84
|
+
version_requirements: *id005
|
75
85
|
- !ruby/object:Gem::Dependency
|
76
86
|
name: chronic
|
77
|
-
|
78
|
-
|
79
|
-
|
87
|
+
prerelease: false
|
88
|
+
requirement: &id006 !ruby/object:Gem::Requirement
|
89
|
+
none: false
|
80
90
|
requirements:
|
81
91
|
- - ">="
|
82
92
|
- !ruby/object:Gem::Version
|
93
|
+
segments:
|
94
|
+
- 0
|
83
95
|
version: "0"
|
84
|
-
|
96
|
+
type: :runtime
|
97
|
+
version_requirements: *id006
|
85
98
|
description: Pismo extracts and retrieves content-related metadata from HTML pages - you can use the resulting data in an organized way, such as a summary/first paragraph, body text, keywords, RSS feed URL, favicon, etc.
|
86
|
-
email:
|
99
|
+
email:
|
100
|
+
- git@peterc.org
|
87
101
|
executables:
|
88
102
|
- pismo
|
89
103
|
extensions: []
|
90
104
|
|
91
|
-
extra_rdoc_files:
|
92
|
-
|
93
|
-
- README.markdown
|
105
|
+
extra_rdoc_files: []
|
106
|
+
|
94
107
|
files:
|
95
108
|
- .document
|
96
109
|
- .gitignore
|
110
|
+
- Gemfile
|
97
111
|
- LICENSE
|
98
112
|
- NOTICE
|
99
113
|
- README.markdown
|
100
114
|
- Rakefile
|
101
|
-
- VERSION
|
102
115
|
- bin/pismo
|
103
116
|
- lib/pismo.rb
|
104
117
|
- lib/pismo/document.rb
|
@@ -106,6 +119,7 @@ files:
|
|
106
119
|
- lib/pismo/internal_attributes.rb
|
107
120
|
- lib/pismo/reader.rb
|
108
121
|
- lib/pismo/stopwords.txt
|
122
|
+
- lib/pismo/version.rb
|
109
123
|
- pismo.gemspec
|
110
124
|
- test/corpus/bbcnews.html
|
111
125
|
- test/corpus/bbcnews2.html
|
@@ -133,30 +147,52 @@ homepage: http://github.com/peterc/pismo
|
|
133
147
|
licenses: []
|
134
148
|
|
135
149
|
post_install_message:
|
136
|
-
rdoc_options:
|
137
|
-
|
150
|
+
rdoc_options: []
|
151
|
+
|
138
152
|
require_paths:
|
139
153
|
- lib
|
140
154
|
required_ruby_version: !ruby/object:Gem::Requirement
|
155
|
+
none: false
|
141
156
|
requirements:
|
142
157
|
- - ">="
|
143
158
|
- !ruby/object:Gem::Version
|
159
|
+
segments:
|
160
|
+
- 0
|
144
161
|
version: "0"
|
145
|
-
version:
|
146
162
|
required_rubygems_version: !ruby/object:Gem::Requirement
|
163
|
+
none: false
|
147
164
|
requirements:
|
148
165
|
- - ">="
|
149
166
|
- !ruby/object:Gem::Version
|
167
|
+
segments:
|
168
|
+
- 0
|
150
169
|
version: "0"
|
151
|
-
version:
|
152
170
|
requirements: []
|
153
171
|
|
154
|
-
rubyforge_project:
|
155
|
-
rubygems_version: 1.3.
|
172
|
+
rubyforge_project: pismo
|
173
|
+
rubygems_version: 1.3.7
|
156
174
|
signing_key:
|
157
175
|
specification_version: 3
|
158
176
|
summary: Extracts or retrieves content-related metadata from HTML pages
|
159
177
|
test_files:
|
178
|
+
- test/corpus/bbcnews.html
|
179
|
+
- test/corpus/bbcnews2.html
|
180
|
+
- test/corpus/briancray.html
|
181
|
+
- test/corpus/cant_read.html
|
182
|
+
- test/corpus/factor.html
|
183
|
+
- test/corpus/gmane.html
|
184
|
+
- test/corpus/huffington.html
|
185
|
+
- test/corpus/metadata_expected.yaml
|
186
|
+
- test/corpus/metadata_expected.yaml.old
|
187
|
+
- test/corpus/queness.html
|
188
|
+
- test/corpus/reader_expected.yaml
|
189
|
+
- test/corpus/rubyinside.html
|
190
|
+
- test/corpus/rww.html
|
191
|
+
- test/corpus/spolsky.html
|
192
|
+
- test/corpus/techcrunch.html
|
193
|
+
- test/corpus/tweet.html
|
194
|
+
- test/corpus/youtube.html
|
195
|
+
- test/corpus/zefrank.html
|
160
196
|
- test/helper.rb
|
161
197
|
- test/test_corpus.rb
|
162
198
|
- test/test_pismo_document.rb
|
data/VERSION
DELETED
@@ -1 +0,0 @@
-0.7.0