pismo 0.7.0 → 0.7.1

data/.gitignore CHANGED
@@ -1,3 +1,8 @@
1
+ pkg/*
2
+ *.gem
3
+ .bundle
4
+ Gemfile.lock
5
+
1
6
  ## MAC OS
2
7
  .DS_Store
3
8
 
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source "http://rubygems.org"
2
+
3
+ # Specify your gem's dependencies in pismo.gemspec
4
+ gemspec
@@ -6,7 +6,11 @@ Pismo extracts machine-usable metadata from unstructured (or poorly structured)
6
6
  Data that Pismo can extract include titles, feed URLs, ledes, body text, image URLs, date, and keywords.
7
7
  Pismo is used heavily in production on http://coder.io/ to extract data from Web pages.
8
8
 
9
- All tests pass on Ruby 1.8.7 (MRI) and Ruby 1.9.1-p378 (MRI).
9
+ All tests pass on Ruby 1.8.7, Ruby 1.9.2 (both MRI) and JRuby 1.5.6.
10
+
11
+ ## NEWS:
12
+
13
+ December 19, 2010: Version 1.7.1 has been released - it includes a patch from Darcy Laycock to fix keyword extraction problems on some pages, has switched from Jeweler to Bundler for management of the gem, and adds support for JRuby 1.5.6 by skipping stemming on that platform.
10
14
 
11
15
  ## USAGE:
12
16
 
@@ -46,12 +50,19 @@ The current metadata methods are:
46
50
  These methods are not fully documented here yet - you'll just need to try them out. The plural methods like #titles, #authors, and #feeds will return multiple matches in an array, if present. This is so you can use your own techniques to choose a "best" result in ambiguous cases.
47
51
 
48
52
  The html_body and body methods will be of particular interest. They return the "body" of the page as determined by Pismo's "Reader" (like Arc90's Readability or Safari Reader) algorithm. #body returns it as plain-text, #html_body maintains some basic HTML styling.
53
+
54
+ New! The keywords method accepts optional arguments. These are the current defaults:
55
+
56
+ :stem_at => 20, :word_length_limit => 15, :limit => 20, :remove_stopwords => true, :minimum_score => 2
57
+
58
+ You can also pass an array to keywords with :hints => arr if you want only words of your choosing to be found.
49
59
 
50
60
  ## CAVEATS AND SHORTCOMINGS:
51
61
 
52
62
  There are some shortcomings or problems that I'm aware of and am going to pursue:
53
63
 
54
- * I do not know how Pismo fares on Rubinius or other versions of 1.9 (e.g. 1.9.2) yet
64
+ * I do not know how Pismo fares on Rubinius
65
+ * pismo requires Bundler - get it :-)
55
66
  * pismo does not install on JRuby due to a problem in the fast-stemmer dependency
56
67
  * Some users have had issues with using Pismo from irb. This appears to be related to Nokogiri use causing a segfault
57
68
  * The "Reader" content extraction algorithm is not perfect. It can sometimes return crap and can barf on certain types of characters for sentence extraction
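The README hunk above lists the new defaults for the keywords method. As a rough sketch of how caller-supplied options override those defaults, the snippet below uses a hypothetical helper, `keyword_options`, which is not part of pismo; only the option names and default values come from the diff.

```ruby
# Defaults copied from the README text in this diff.
KEYWORD_DEFAULTS = {
  :stem_at           => 20,
  :word_length_limit => 15,
  :limit             => 20,
  :remove_stopwords  => true,
  :minimum_score     => 2
}.freeze

# Hypothetical helper (not a pismo method): options passed by the
# caller win over the defaults, everything else keeps its default.
def keyword_options(overrides = {})
  KEYWORD_DEFAULTS.merge(overrides)
end

opts = keyword_options(:limit => 5, :hints => %w[ruby rails])
puts opts.inspect
```

With pismo itself, the equivalent call would presumably be `doc.keywords(:limit => 5, :hints => %w[ruby rails])`, going by the README wording.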
data/Rakefile CHANGED
@@ -1,29 +1,5 @@
1
- require 'rubygems'
2
- require 'rake'
3
-
4
- begin
5
- require 'jeweler'
6
- Jeweler::Tasks.new do |gem|
7
- gem.name = "pismo"
8
- gem.summary = %Q{Extracts or retrieves content-related metadata from HTML pages}
9
- gem.description = %Q{Pismo extracts and retrieves content-related metadata from HTML pages - you can use the resulting data in an organized way, such as a summary/first paragraph, body text, keywords, RSS feed URL, favicon, etc.}
10
- gem.email = "git@peterc.org"
11
- gem.homepage = "http://github.com/peterc/pismo"
12
- gem.authors = ["Peter Cooper"]
13
- gem.executables = "pismo"
14
- gem.default_executable = "pismo"
15
- gem.add_development_dependency "shoulda", ">= 0"
16
- gem.add_development_dependency "awesome_print"
17
- gem.add_dependency "jeweler"
18
- gem.add_dependency "nokogiri"
19
- gem.add_dependency "sanitize"
20
- gem.add_dependency "fast-stemmer"
21
- gem.add_dependency "chronic"
22
- end
23
- Jeweler::GemcutterTasks.new
24
- rescue LoadError
25
- puts "Jeweler (or a dependency) not available. Install it with: gem install jeweler"
26
- end
1
+ require 'bundler'
2
+ Bundler::GemHelper.install_tasks
27
3
 
28
4
  require 'rake/testtask'
29
5
  Rake::TestTask.new(:test) do |test|
@@ -45,8 +21,6 @@ rescue LoadError
45
21
  end
46
22
  end
47
23
 
48
- task :test => :check_dependencies
49
-
50
24
  task :default => :test
51
25
 
52
26
  require 'rake/rdoctask'
data/bin/pismo CHANGED
@@ -32,7 +32,7 @@ if ARGV.empty?
32
32
  P = doc
33
33
  @p = doc
34
34
  puts "Pismo has loaded #{url} into @p and P"
35
- puts "Note: There have been several reports of Nokogiri segfaulting while using Pismo from irb. If this happens, try the same code as a standalone Ruby app."
35
+ #puts "Note: There have been several reports of Nokogiri segfaulting while using Pismo from irb. If this happens, try the same code as a standalone Ruby app."
36
36
  IRB.start
37
37
  else
38
38
  output = { :url => doc.url }
@@ -2,7 +2,6 @@
2
2
 
3
3
  require 'open-uri'
4
4
  require 'nokogiri'
5
- require 'fast_stemmer'
6
5
  require 'chronic'
7
6
  require 'sanitize'
8
7
  require 'tempfile'
@@ -11,6 +10,12 @@ $: << File.dirname(__FILE__)
11
10
  require 'pismo/document'
12
11
  require 'pismo/reader'
13
12
 
13
+ if RUBY_PLATFORM == "java"
14
+ class String; def stem; self; end; end
15
+ else
16
+ require 'fast_stemmer'
17
+ end
18
+
14
19
  module Pismo
15
20
  # Sugar methods to make creating document objects nicer
16
21
  def self.document(handle, url = nil)
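The hunk above swaps the unconditional `require 'fast_stemmer'` for a platform check, so that JRuby gets a no-op `String#stem`. The following is a minimal sketch of the same idea. The `install_identity_stem` helper and the `rescue LoadError` branch are assumptions added here so the example also runs where the fast-stemmer gem is missing; the diff itself has no such rescue.

```ruby
# Hypothetical helper: define String#stem as a no-op that returns the
# word unchanged, which is what the diff does on JRuby.
def install_identity_stem
  String.class_eval do
    def stem
      self
    end
  end
end

if RUBY_PLATFORM == "java"
  # JRuby reports "java" here; skip the native stemmer entirely.
  install_identity_stem
else
  begin
    require 'fast_stemmer'
  rescue LoadError
    # Assumption for this sketch only: fall back instead of failing.
    install_identity_stem
  end
end

stemmed = "extraction".stem
puts stemmed
```

The trade-off is that keyword stemming silently does nothing on the fallback path, so long words are counted in their unstemmed form.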
@@ -59,8 +64,7 @@ class Nokogiri::HTML::Document
59
64
  end
60
65
 
61
66
  if result
62
- # result.gsub!(/\342\200\231/, '\'')
63
- # result.gsub!(/\342\200\224/, '-')
67
+ # TODO: Sort out sanitization in a more centralized way
64
68
  result.gsub!('’', '\'')
65
69
  result.gsub!('—', '-')
66
70
  if all
@@ -15,6 +15,7 @@ module Pismo
15
15
  '.post-header h1',
16
16
  '.entry-title',
17
17
  '.post-title',
18
+ '.post h1',
18
19
  '.post h3 a',
19
20
  'a.datitle', # Slashdot style
20
21
  '.posttitle',
@@ -93,9 +94,7 @@ module Pismo
93
94
  datetime = 10
94
95
 
95
96
  regexen.each do |r|
96
- datetime = @doc.to_html[r]
97
- # p datetime
98
- break if datetime
97
+ break if datetime = @doc.to_html[r]
99
98
  end
100
99
 
101
100
  return unless datetime && datetime.length > 4
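The datetime hunk above collapses the loop body into a single `break if datetime = ...` line: each regex is tried in order and the first one that matches wins. A small sketch of that pattern follows; `first_match` and the two sample patterns are illustrative names and values, not taken from pismo.

```ruby
# Hypothetical helper: return the first substring matched by any of the
# patterns, trying them in order, or nil when none match.
def first_match(html, patterns)
  match = nil
  patterns.each do |re|
    # String#[] with a Regexp returns the matched text or nil.
    break if (match = html[re])
  end
  match
end

# Illustrative patterns only; pismo's real list is not shown in this diff.
patterns = [/\d{1,2} \w+ \d{4}/, /\d{4}-\d{2}-\d{2}/]

iso_only  = first_match("Posted on 2010-12-19 by someone", patterns)
both      = first_match("19 December 2010 and 2010-12-19", patterns)
no_result = first_match("no date here", patterns)
```

Because the assignment happens on every iteration, the variable ends up nil when no pattern matches, which is what the later `return unless datetime` guard relies on.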
@@ -111,10 +110,6 @@ module Pismo
111
110
  Chronic.parse(datetime) || datetime
112
111
  end
113
112
 
114
- # TODO: Attempts to work out what type of site or page the page is from the provided URL
115
- # def site_type
116
- # end
117
-
118
113
  # Returns the author of the page/content
119
114
  def author(all = false)
120
115
  author = @doc.match([
@@ -189,13 +184,15 @@ module Pismo
189
184
  '.post-text p',
190
185
  '#blogpost p',
191
186
  '.story-teaser',
192
- '//div[@class="entrytext"]//p[string-length()>10]', # Ruby Inside / Kubrick style
187
+ '.article .body p',
188
+ '//div[@class="entrytext"]//p[string-length()>40]', # Ruby Inside / Kubrick style
193
189
  'section p',
194
190
  '.entry .text p',
191
+ '.hentry .content p',
195
192
  '.entry-content p',
196
193
  '#wikicontent p', # Google Code style
197
194
  '.wikistyle p', # GitHub style
198
- '//td[@class="storybody"]/p[string-length()>10]', # BBC News style
195
+ '//td[@class="storybody"]/p[string-length()>40]', # BBC News style
199
196
  '//div[@class="entry"]//p[string-length()>100]',
200
197
  # The below is a horrible, horrible way to pluck out lead paras from crappy Blogspot blogs that
201
198
  # don't use <p> tags..
@@ -212,16 +209,16 @@ module Pismo
212
209
 
213
210
  # TODO: Improve sentence extraction - this is dire even if it "works for now"
214
211
  if lede && String === lede
215
- return (lede[/^(.*?[\.\!\?]\s){2}/m] || lede).to_s.strip
212
+ return (lede[/^(.*?[\.\!\?]\s){1,3}/m] || lede).to_s.strip
216
213
  elsif lede && Array === lede
217
- return lede.map { |l| l.to_s[/^(.*?[\.\!\?]\s){2}/m].strip || l }.uniq
214
+ return lede.map { |l| l.to_s[/^(.*?[\.\!\?]\s){1,3}/m].strip || l }.uniq
218
215
  else
219
- return reader_doc && !reader_doc.sentences(3).empty? ? reader_doc.sentences(3).join(' ') : nil
216
+ return reader_doc && !reader_doc.sentences(4).empty? ? reader_doc.sentences(4).join(' ') : nil
220
217
  end
221
218
  end
222
219
 
223
220
  def ledes
224
- lede(true)
221
+ lede(true) rescue []
225
222
  end
226
223
 
227
224
  # Returns a string containing the first [limit] sentences as determined by the Reader algorithm
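The lede hunk above changes the sentence quantifier from `{2}` to `{1,3}`. The sketch below applies both regexes from the diff to made-up sample strings to show the practical difference; `first_sentences` is a hypothetical wrapper around the `(lede[regex] || lede).to_s.strip` expression used in the String branch.

```ruby
# Both patterns are copied from the diff.
SENTENCES_OLD = /^(.*?[\.\!\?]\s){2}/m     # exactly two sentences
SENTENCES_NEW = /^(.*?[\.\!\?]\s){1,3}/m   # one to three sentences

# Hypothetical wrapper mirroring the String branch of the lede method:
# fall back to the whole text when the pattern does not match.
def first_sentences(lede, pattern)
  (lede[pattern] || lede).to_s.strip
end

long_text  = "One. Two. Three. Four. "
short_text = "This is a mock post. "

old_long  = first_sentences(long_text, SENTENCES_OLD)
new_long  = first_sentences(long_text, SENTENCES_NEW)
old_short = short_text[SENTENCES_OLD]
new_short = short_text[SENTENCES_NEW]
```

With the old pattern a single-sentence lede fails to match and the code falls back to the untrimmed text; the new pattern accepts one sentence and otherwise takes up to three.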
@@ -236,29 +233,31 @@ module Pismo
236
233
 
237
234
  # Returns the "keywords" in the document (not the meta keywords - they're next to useless now)
238
235
  def keywords(options = {})
239
- options = { :stem_at => 20, :word_length_limit => 15, :limit => 20 }.merge(options)
236
+ options = { :stem_at => 20, :word_length_limit => 15, :limit => 20, :remove_stopwords => true, :minimum_score => 2 }.merge(options)
240
237
 
241
238
  words = {}
242
239
 
243
240
  # Convert doc to lowercase, scrub out most HTML tags, then keep track of words
244
- cached_title = title
241
+ cached_title = title.to_s
245
242
  content_to_use = body.to_s.downcase + " " + description.to_s.downcase
246
243
 
247
244
  # old regex for safe keeping -- \b[a-z][a-z\+\.\'\+\#\-]*\b
248
- content_to_use.downcase.gsub(/\<[^\>]{1,100}\>/, '').gsub(/\.+\s+/, ' ').gsub(/\&\w+\;/, '').scan(/(\b|\s|\A)([a-z0-9][a-z0-9\+\.\'\+\#\-\/\\]*)(\b|\s|\Z)/i).map{ |ta1| ta1[1] }.each do |word|
245
+ content_to_use.downcase.gsub(/\<[^\>]{1,100}\>/, '').gsub(/\.+\s+/, ' ').gsub(/\&\w+\;/, '').scan(/(\b|\s|\A)([a-z0-9][a-z0-9\+\.\'\+\#\-\\]*)(\b|\s|\Z)/i).map{ |ta1| ta1[1] }.compact.each do |word|
249
246
  next if word.length > options[:word_length_limit]
250
- word.gsub!(/\'\w+/, '')
247
+ word.gsub!(/^[\']/, '')
248
+ word.gsub!(/[\.\-\']$/, '')
249
+ next if options[:hints] && !options[:hints].include?(word)
251
250
  words[word] ||= 0
252
- words[word] += (cached_title.downcase.include?(word) ? 5 : 1)
251
+ words[word] += (cached_title.downcase =~ /\b#{word}\b/ ? 5 : 1)
253
252
  end
254
253
 
255
254
  # Stem the words and stop words if necessary
256
255
  d = words.keys.uniq.map { |a| a.length > options[:stem_at] ? a.stem : a }
257
256
  s = Pismo.stopwords.map { |a| a.length > options[:stem_at] ? a.stem : a }
258
257
 
259
-
260
- w = words.delete_if { |k1, v1| s.include?(k1) || (v1 < 2 && words.size > 80) }.sort_by { |k2, v2| v2 }.reverse.first(options[:limit])
261
- return w
258
+ words.delete_if { |k1, v1| v1 < options[:minimum_score] }
259
+ words.delete_if { |k1, v1| s.include?(k1) } if options[:remove_stopwords]
260
+ words.sort_by { |k2, v2| v2 }.reverse.first(options[:limit])
262
261
  end
263
262
 
264
263
  def reader_doc
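The keywords hunk above adds hint filtering, a whole-word title match, a configurable minimum score, and optional stopword removal. The sketch below is a simplified, standalone approximation of that scoring flow, not pismo's implementation: `score_keywords` is a hypothetical function, stemming is left out, the scan regex is reduced, and `Regexp.escape` is added as a precaution the diff does not have.

```ruby
# Simplified approximation of the scoring steps shown in the diff.
def score_keywords(text, title, stopwords, options = {})
  options = { :word_length_limit => 15, :limit => 20,
              :remove_stopwords => true, :minimum_score => 2 }.merge(options)
  words = Hash.new(0)

  text.downcase.scan(/[a-z0-9][a-z0-9\+\.\'\#\-]*/).each do |word|
    next if word.length > options[:word_length_limit]
    word = word.sub(/[\.\-\']+\z/, '')   # trim trailing punctuation
    next if options[:hints] && !options[:hints].include?(word)
    # Words that appear in the title as whole words score 5, others 1.
    in_title = title.downcase =~ /\b#{Regexp.escape(word)}\b/
    words[word] += (in_title ? 5 : 1)
  end

  words.delete_if { |_, score| score < options[:minimum_score] }
  words.delete_if { |word, _| stopwords.include?(word) } if options[:remove_stopwords]
  words.sort_by { |_, score| -score }.first(options[:limit])
end

text      = "Ruby ruby ruby is a language. The language is nice."
stopwords = %w[is a the]
result    = score_keywords(text, "Ruby news", stopwords)
hinted    = score_keywords(text, "Ruby news", stopwords, :hints => ["language"])
```

Here `result` is a list of `[word, score]` pairs ordered by descending score, the same shape the diff's final `sort_by ... reverse.first(limit)` line returns.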
@@ -1,3 +1,4 @@
1
+ a
1
2
  a's
2
3
  Aaliyah
3
4
  Aaron
@@ -70,6 +71,8 @@ apart
70
71
  appear
71
72
  appreciate
72
73
  appropriate
74
+ approximate
75
+ approximately
73
76
  apr
74
77
  april
75
78
  are
@@ -138,6 +141,7 @@ Brooklyn
138
141
  Bryan
139
142
  Bryce
140
143
  but
144
+ by
141
145
  c'mon
142
146
  c's
143
147
  Caden
@@ -238,6 +242,7 @@ driven
238
242
  drove
239
243
  during
240
244
  Dylan
245
+ e
241
246
  each
242
247
  easier
243
248
  edu
@@ -282,6 +287,7 @@ existing
282
287
  extensive
283
288
  extra
284
289
  extremely
290
+ f
285
291
  Faith
286
292
  false
287
293
  fame
@@ -310,6 +316,7 @@ fuck
310
316
  full
311
317
  further
312
318
  furthermore
319
+ g
313
320
  Gabriel
314
321
  Gabriella
315
322
  Gabrielle
@@ -334,6 +341,7 @@ gotten
334
341
  Grace
335
342
  great
336
343
  greetings
344
+ h
337
345
  had
338
346
  hadn't
339
347
  Hailey
@@ -376,12 +384,14 @@ howbeit
376
384
  however
377
385
  huge
378
386
  Hunter
387
+ i
379
388
  i'd
380
389
  i'll
381
390
  i'm
382
391
  i've
383
392
  Ian
384
393
  ie
394
+ if
385
395
  ignored
386
396
  imagine
387
397
  immediate
@@ -418,6 +428,7 @@ it's
418
428
  its
419
429
  itself
420
430
  Ivan
431
+ j
421
432
  Jack
422
433
  Jackson
423
434
  Jacob
@@ -440,6 +451,7 @@ Jessica
440
451
  Jesus
441
452
  jim
442
453
  jimmy
454
+ jnr
443
455
  Jocelyn
444
456
  Joel
445
457
  John
@@ -450,6 +462,7 @@ Jose
450
462
  Joseph
451
463
  Joshua
452
464
  Josiah
465
+ jr
453
466
  Juan
454
467
  jul
455
468
  Julia
@@ -459,6 +472,7 @@ jun
459
472
  june
460
473
  just
461
474
  Justin
475
+ k
462
476
  Kaden
463
477
  Kaitlyn
464
478
  Kaleb
@@ -479,6 +493,7 @@ known
479
493
  knows
480
494
  Kyle
481
495
  Kylie
496
+ l
482
497
  la
483
498
  Landon
484
499
  last
@@ -518,6 +533,7 @@ ltd
518
533
  Lucas
519
534
  Luis
520
535
  Luke
536
+ m
521
537
  Mackenzie
522
538
  Madeline
523
539
  Madison
@@ -564,6 +580,7 @@ much
564
580
  must
565
581
  my
566
582
  myself
583
+ n
567
584
  name
568
585
  namely
569
586
  Natalie
@@ -602,6 +619,7 @@ novel
602
619
  november
603
620
  now
604
621
  nowhere
622
+ o
605
623
  Obie
606
624
  obviously
607
625
  oct
@@ -637,6 +655,7 @@ out
637
655
  overall
638
656
  Owen
639
657
  own
658
+ p
640
659
  Paige
641
660
  par
642
661
  Parker
@@ -666,9 +685,11 @@ proud
666
685
  provide
667
686
  provides
668
687
  put
688
+ q
669
689
  que
670
690
  quite
671
691
  qv
692
+ r
672
693
  Rachel
673
694
  rather
674
695
  rd
@@ -694,6 +715,7 @@ Riley
694
715
  Robert
695
716
  run
696
717
  Ryan
718
+ s
697
719
  safest
698
720
  said
699
721
  Samantha
@@ -764,6 +786,7 @@ specify
764
786
  specifying
765
787
  spoke
766
788
  spread
789
+ sr
767
790
  stand
768
791
  started
769
792
  step
@@ -780,6 +803,7 @@ sup
780
803
  sur
781
804
  sure
782
805
  Sydney
806
+ t
783
807
  t's
784
808
  take
785
809
  taken
@@ -859,6 +883,7 @@ twice
859
883
  two
860
884
  Tyler
861
885
  typically
886
+ u
862
887
  ultra
863
888
  un
864
889
  unfortunately
@@ -876,6 +901,7 @@ uses
876
901
  using
877
902
  usually
878
903
  uucp
904
+ v
879
905
  value
880
906
  Vanessa
881
907
  various
@@ -886,6 +912,7 @@ Victoria
886
912
  Vincent
887
913
  viz
888
914
  vs
915
+ w
889
916
  walks
890
917
  want
891
918
  wants
@@ -927,8 +954,6 @@ who's
927
954
  whoever
928
955
  whole
929
956
  whom
930
- approximate
931
- approximately
932
957
  whose
933
958
  why
934
959
  will
@@ -948,6 +973,7 @@ wouldn't
948
973
  wrapped
949
974
  Wyatt
950
975
  Xavier
976
+ y
951
977
  yeah
952
978
  yes
953
979
  yet
@@ -960,6 +986,17 @@ your
960
986
  yours
961
987
  yourself
962
988
  yourselves
989
+ z
963
990
  Zachary
964
991
  zero
965
- Zoe
992
+ Zoe
993
+ 0
994
+ 1
995
+ 2
996
+ 3
997
+ 4
998
+ 5
999
+ 6
1000
+ 7
1001
+ 8
1002
+ 9
@@ -0,0 +1,3 @@
1
+ module Pismo
2
+ VERSION = "0.7.1"
3
+ end
@@ -1,101 +1,31 @@
1
- # Generated by jeweler
2
- # DO NOT EDIT THIS FILE DIRECTLY
3
- # Instead, edit Jeweler::Tasks in Rakefile, and run the gemspec command
4
1
  # -*- encoding: utf-8 -*-
2
+ $:.push File.expand_path("../lib", __FILE__)
3
+ require "pismo/version"
5
4
 
6
5
  Gem::Specification.new do |s|
7
- s.name = %q{pismo}
8
- s.version = "0.7.0"
9
-
10
- s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
11
- s.authors = ["Peter Cooper"]
12
- s.date = %q{2010-07-27}
13
- s.default_executable = %q{pismo}
6
+ s.name = "pismo"
7
+ s.version = Pismo::VERSION
8
+ s.platform = Gem::Platform::RUBY
9
+ s.authors = ["Peter Cooper"]
10
+ s.email = ["git@peterc.org"]
11
+ s.homepage = "http://github.com/peterc/pismo"
12
+ s.summary = %q{TODO: Write a gem summary}
14
13
  s.description = %q{Pismo extracts and retrieves content-related metadata from HTML pages - you can use the resulting data in an organized way, such as a summary/first paragraph, body text, keywords, RSS feed URL, favicon, etc.}
15
- s.email = %q{git@peterc.org}
16
- s.executables = ["pismo"]
17
- s.extra_rdoc_files = [
18
- "LICENSE",
19
- "README.markdown"
20
- ]
21
- s.files = [
22
- ".document",
23
- ".gitignore",
24
- "LICENSE",
25
- "NOTICE",
26
- "README.markdown",
27
- "Rakefile",
28
- "VERSION",
29
- "bin/pismo",
30
- "lib/pismo.rb",
31
- "lib/pismo/document.rb",
32
- "lib/pismo/external_attributes.rb",
33
- "lib/pismo/internal_attributes.rb",
34
- "lib/pismo/reader.rb",
35
- "lib/pismo/stopwords.txt",
36
- "pismo.gemspec",
37
- "test/corpus/bbcnews.html",
38
- "test/corpus/bbcnews2.html",
39
- "test/corpus/briancray.html",
40
- "test/corpus/cant_read.html",
41
- "test/corpus/factor.html",
42
- "test/corpus/gmane.html",
43
- "test/corpus/huffington.html",
44
- "test/corpus/metadata_expected.yaml",
45
- "test/corpus/metadata_expected.yaml.old",
46
- "test/corpus/queness.html",
47
- "test/corpus/reader_expected.yaml",
48
- "test/corpus/rubyinside.html",
49
- "test/corpus/rww.html",
50
- "test/corpus/spolsky.html",
51
- "test/corpus/techcrunch.html",
52
- "test/corpus/tweet.html",
53
- "test/corpus/youtube.html",
54
- "test/corpus/zefrank.html",
55
- "test/helper.rb",
56
- "test/test_corpus.rb",
57
- "test/test_pismo_document.rb"
58
- ]
59
- s.homepage = %q{http://github.com/peterc/pismo}
60
- s.rdoc_options = ["--charset=UTF-8"]
61
- s.require_paths = ["lib"]
62
- s.rubygems_version = %q{1.3.5}
63
- s.summary = %q{Extracts or retrieves content-related metadata from HTML pages}
64
- s.test_files = [
65
- "test/helper.rb",
66
- "test/test_corpus.rb",
67
- "test/test_pismo_document.rb"
68
- ]
14
+ s.summary = %q{Extracts or retrieves content-related metadata from HTML pages}
15
+ s.date = %q{2010-07-27}
16
+ s.default_executable = %q{pismo}
69
17
 
70
- if s.respond_to? :specification_version then
71
- current_version = Gem::Specification::CURRENT_SPECIFICATION_VERSION
72
- s.specification_version = 3
18
+ s.rubyforge_project = "pismo"
73
19
 
74
- if Gem::Version.new(Gem::RubyGemsVersion) >= Gem::Version.new('1.2.0') then
75
- s.add_development_dependency(%q<shoulda>, [">= 0"])
76
- s.add_development_dependency(%q<awesome_print>, [">= 0"])
77
- s.add_runtime_dependency(%q<jeweler>, [">= 0"])
78
- s.add_runtime_dependency(%q<nokogiri>, [">= 0"])
79
- s.add_runtime_dependency(%q<sanitize>, [">= 0"])
80
- s.add_runtime_dependency(%q<fast-stemmer>, [">= 0"])
81
- s.add_runtime_dependency(%q<chronic>, [">= 0"])
82
- else
83
- s.add_dependency(%q<shoulda>, [">= 0"])
84
- s.add_dependency(%q<awesome_print>, [">= 0"])
85
- s.add_dependency(%q<jeweler>, [">= 0"])
86
- s.add_dependency(%q<nokogiri>, [">= 0"])
87
- s.add_dependency(%q<sanitize>, [">= 0"])
88
- s.add_dependency(%q<fast-stemmer>, [">= 0"])
89
- s.add_dependency(%q<chronic>, [">= 0"])
90
- end
91
- else
92
- s.add_dependency(%q<shoulda>, [">= 0"])
93
- s.add_dependency(%q<awesome_print>, [">= 0"])
94
- s.add_dependency(%q<jeweler>, [">= 0"])
95
- s.add_dependency(%q<nokogiri>, [">= 0"])
96
- s.add_dependency(%q<sanitize>, [">= 0"])
97
- s.add_dependency(%q<fast-stemmer>, [">= 0"])
98
- s.add_dependency(%q<chronic>, [">= 0"])
99
- end
20
+ s.files = `git ls-files`.split("\n")
21
+ s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
22
+ s.executables = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
23
+ s.require_paths = ["lib"]
24
+
25
+ s.add_dependency(%q<shoulda>, [">= 0"])
26
+ s.add_dependency(%q<awesome_print>, [">= 0"])
27
+ s.add_dependency(%q<nokogiri>, [">= 0"])
28
+ s.add_dependency(%q<sanitize>, [">= 0"])
29
+ s.add_dependency(%q<fast-stemmer>, [">= 0"])
30
+ s.add_dependency(%q<chronic>, [">= 0"])
100
31
  end
101
-
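The gemspec hunk above reads the version from a constant and declares every gem with `add_dependency`, which is why the metadata further down lists shoulda and awesome_print as `:runtime` instead of `:development`. The sketch below shows the version-constant pattern plus one possible alternative that keeps the test gems as development dependencies. This is an assumption about what might be preferable, not what 0.7.1 ships; `PismoSketch` and the gem name are hypothetical stand-ins.

```ruby
require 'rubygems'

# Stand-in for lib/pismo/version.rb; a separate module name is used so
# the sketch does not clash with an installed pismo.
module PismoSketch
  VERSION = "0.7.1"
end

spec = Gem::Specification.new do |s|
  s.name          = "pismo-sketch"          # hypothetical name
  s.version       = PismoSketch::VERSION    # single source of truth
  s.authors       = ["Peter Cooper"]
  s.summary       = "Extracts or retrieves content-related metadata from HTML pages"
  s.require_paths = ["lib"]

  # Assumption: test-only gems stay out of the runtime dependency list.
  s.add_development_dependency "shoulda", ">= 0"
  s.add_development_dependency "awesome_print", ">= 0"

  s.add_dependency "nokogiri", ">= 0"
  s.add_dependency "sanitize", ">= 0"
  s.add_dependency "fast-stemmer", ">= 0"
  s.add_dependency "chronic", ">= 0"
end

puts spec.version
```

Declared this way, installing the gem would pull in only the four runtime gems, while Bundler would still install the development ones for anyone working on the source.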
@@ -2,14 +2,14 @@
2
2
  :rww:
3
3
  :title: "Cartoon: Apple Tablet: Now With Barometer and Bird Call Generator"
4
4
  :feed: http://www.readwriteweb.com/rss.xml
5
- :lede: I'm just aching to know if the new Apple tablet (insert caveats, weasel words and qualifiers here) is a potential Cintiq competitor. I don't think it will be, but you never know. It may also have a built in barometer and bird call generator.
5
+ :lede: I'm just aching to know if the new Apple tablet (insert caveats, weasel words and qualifiers here) is a potential Cintiq competitor. I don't think it will be, but you never know. It may also have a built in barometer and bird call generator. I'm never sure if Apple does themselves more good than harm with the secrecy and anticipation that surrounds the run-up to these announcements.
6
6
  :feeds:
7
7
  - http://www.readwriteweb.com/rss.xml
8
8
  - http://www.readwriteweb.com/archives/2010/01/cartoon_apple_tablet_now_with_barometer_and_bird_c.xml
9
9
  :briancray:
10
10
  :title: 5 great examples of popular blog posts that you should know
11
11
  :feed: http://feeds.feedburner.com/briancray/blog
12
- :lede: "This is a mock post. While there is a place for all of these posts, I'm trying to make a point that original blogs are being shut out by formulaic blogs."
12
+ :lede: "This is a mock post."
13
13
  :huffington:
14
14
  :title: Afghans Losing Hope After 8 Years Of War
15
15
  :author: TODD PITMAN
@@ -31,9 +31,9 @@
31
31
  :feed: http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/england/rss.xml
32
32
  :factor:
33
33
  :title: Factor's bootstrap process explained
34
- :lede: "Separation of concerns between Factor VM and library codeThe Factor VM implements an abstract machine consisting of a data heap of objects, a code heap of machine code blocks, and a set of stacks. The VM loads an image file on startup, which becomes the data and code heap."
34
+ :lede: "Separation of concerns between Factor VM and library codeThe Factor VM implements an abstract machine consisting of a data heap of objects, a code heap of machine code blocks, and a set of stacks. The VM loads an image file on startup, which becomes the data and code heap. It then begins executing code in the image, by calling a special startup quotation.When new source files are loaded into a running Factor instance by the developer, they are parsed and compiled into a collection of objects -- words, quotations, and other literals, along with executable machine code."
35
35
  :ledes:
36
- - "Separation of concerns between Factor VM and library codeThe Factor VM implements an abstract machine consisting of a data heap of objects, a code heap of machine code blocks, and a set of stacks. The VM loads an image file on startup, which becomes the data and code heap."
36
+ - "Separation of concerns between Factor VM and library codeThe Factor VM implements an abstract machine consisting of a data heap of objects, a code heap of machine code blocks, and a set of stacks. The VM loads an image file on startup, which becomes the data and code heap. It then begins executing code in the image, by calling a special startup quotation.When new source files are loaded into a running Factor instance by the developer, they are parsed and compiled into a collection of objects -- words, quotations, and other literals, along with executable machine code."
37
37
  :youtube:
38
38
  :title: YMO - Rydeen (Official Video)
39
39
  :author: ymo1965
@@ -42,7 +42,7 @@
42
42
  :spolsky:
43
43
  :title: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) - Joel on Software
44
44
  :description: Haven't mastered the basics of Unicode and character sets? Please don't write another line of code until you've read this article.
45
- :lede: Ever wonder about that mysterious Content-Type tag? You know, the one you're supposed to put in HTML and you never quite know what it should be? Did you ever get an email from your friends in Bulgaria with the subject line "????
45
+ :lede: "Ever wonder about that mysterious Content-Type tag? You know, the one you're supposed to put in HTML and you never quite know what it should be? Did you ever get an email from your friends in Bulgaria with the subject line \"???? "
46
46
  :author: Joel Spolsky
47
47
  :favicon: /favicon.ico
48
48
  :feed: http://www.joelonsoftware.com/rss.xml
@@ -52,14 +52,14 @@
52
52
  :rubyinside:
53
53
  :title: "CoffeeScript: A New Language With A Pure Ruby Compiler"
54
54
  :author: Peter Cooper
55
- :lede: CoffeeScript (GitHub repo) is a new programming language with a pure Ruby compiler. Creator Jeremy Ashkenas calls it "JavaScript's less ostentatious kid brother" - mostly because it compiles into JavaScript and shares most of the same constructs, but with a different, tighter syntax.
55
+ :lede: CoffeeScript (GitHub repo) is a new programming language with a pure Ruby compiler.
56
56
  :feed: http://www.rubyinside.com/feed/
57
57
  :zefrank:
58
58
  :sentences: If there's anyone who knows how to marshal an online audience, it's Ze Frank. Ze is best-known for his 2006 program "The Show," in which he made a new 2-3 minute video every day for 1 year. Topics ranged from "fingers in food" to the mysteries of airport signage to a tour de force summary of creatives' addiction to un-executed ideas, aka brain crack.
59
59
  :title: "Ze Frank on Imaginary Audiences :: Articles :: The 99 Percent"
60
60
  :description: We chat with the Internet's most notorious mass-collaboration instigator Ze Frank about idea execution and how to build armies of sportsracers.
61
61
  :tweet:
62
- :lede: Gobsmacked that TeX/LaTeX (document formatting tools) for OS X is a 1.3GB (yes, GIGAbytes) download OS X. Wow..!
62
+ :lede: Gobsmacked that TeX/LaTeX (document formatting tools) for OS X is a 1.3GB (yes, GIGAbytes) download OS X.
63
63
  :sentences: Gobsmacked that TeX/LaTeX (document formatting tools) for OS X is a 1.3GB (yes, GIGAbytes) download OS Wow..!
64
64
  :cant_read:
65
65
  :sentences: "For those of us who grew up as weird kids in the 1980s, the work of Berkeley Breathed was as important as those twin eternal pillars of weird-kid-dom: Monty Python and Mad magazine. In a word: seminal. In two words: fucking seminal."
@@ -67,6 +67,6 @@
67
67
  :sentences: I am pleased to report that the GCC Steering Committee and the FSF have approved the use of C++ in GCC itself. Of course, there's no reason for us to use C++ features just because we can. The goal is a better compiler for users, not a C++ code base for its own sake.
68
68
  :queness:
69
69
  :title: 18 Incredible CSS3 Effects You Have Never Seen Before
70
- :lede: "CSS3 is hot these days and will soon be available in most modern browser. Just recently, I started to become aware to the present of CSS3 around the web. I can see some of the websites such as twitter and designer portfolios websites are using it."
70
+ :lede: "CSS3 is hot these days and will soon be available in most modern browser. Just recently, I started to become aware to the present of CSS3 around the web. I can see some of the websites such as twitter and designer portfolios websites are using it. Also, I have started to implement it to my own project as well and I really love it!"
71
71
  :sentences: CSS3 is hot these days and will soon be available in most modern browser. Just recently, I started to become aware to the present of CSS3 around the web. I can see some of the websites such as twitter and designer portfolios websites are using it.
72
72
  :datetime: 2010-06-02 12:00:00 +01:00
metadata CHANGED
@@ -1,7 +1,12 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: pismo
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.7.0
4
+ prerelease: false
5
+ segments:
6
+ - 0
7
+ - 7
8
+ - 1
9
+ version: 0.7.1
5
10
  platform: ruby
6
11
  authors:
7
12
  - Peter Cooper
@@ -14,91 +19,99 @@ default_executable: pismo
14
19
  dependencies:
15
20
  - !ruby/object:Gem::Dependency
16
21
  name: shoulda
17
- type: :development
18
- version_requirement:
19
- version_requirements: !ruby/object:Gem::Requirement
22
+ prerelease: false
23
+ requirement: &id001 !ruby/object:Gem::Requirement
24
+ none: false
20
25
  requirements:
21
26
  - - ">="
22
27
  - !ruby/object:Gem::Version
28
+ segments:
29
+ - 0
23
30
  version: "0"
24
- version:
31
+ type: :runtime
32
+ version_requirements: *id001
25
33
  - !ruby/object:Gem::Dependency
26
34
  name: awesome_print
27
- type: :development
28
- version_requirement:
29
- version_requirements: !ruby/object:Gem::Requirement
35
+ prerelease: false
36
+ requirement: &id002 !ruby/object:Gem::Requirement
37
+ none: false
30
38
  requirements:
31
39
  - - ">="
32
40
  - !ruby/object:Gem::Version
41
+ segments:
42
+ - 0
33
43
  version: "0"
34
- version:
35
- - !ruby/object:Gem::Dependency
36
- name: jeweler
37
44
  type: :runtime
38
- version_requirement:
39
- version_requirements: !ruby/object:Gem::Requirement
40
- requirements:
41
- - - ">="
42
- - !ruby/object:Gem::Version
43
- version: "0"
44
- version:
45
+ version_requirements: *id002
45
46
  - !ruby/object:Gem::Dependency
46
47
  name: nokogiri
47
- type: :runtime
48
- version_requirement:
49
- version_requirements: !ruby/object:Gem::Requirement
48
+ prerelease: false
49
+ requirement: &id003 !ruby/object:Gem::Requirement
50
+ none: false
50
51
  requirements:
51
52
  - - ">="
52
53
  - !ruby/object:Gem::Version
54
+ segments:
55
+ - 0
53
56
  version: "0"
54
- version:
57
+ type: :runtime
58
+ version_requirements: *id003
55
59
  - !ruby/object:Gem::Dependency
56
60
  name: sanitize
57
- type: :runtime
58
- version_requirement:
59
- version_requirements: !ruby/object:Gem::Requirement
61
+ prerelease: false
62
+ requirement: &id004 !ruby/object:Gem::Requirement
63
+ none: false
60
64
  requirements:
61
65
  - - ">="
62
66
  - !ruby/object:Gem::Version
67
+ segments:
68
+ - 0
63
69
  version: "0"
64
- version:
70
+ type: :runtime
71
+ version_requirements: *id004
65
72
  - !ruby/object:Gem::Dependency
66
73
  name: fast-stemmer
67
- type: :runtime
68
- version_requirement:
69
- version_requirements: !ruby/object:Gem::Requirement
74
+ prerelease: false
75
+ requirement: &id005 !ruby/object:Gem::Requirement
76
+ none: false
70
77
  requirements:
71
78
  - - ">="
72
79
  - !ruby/object:Gem::Version
80
+ segments:
81
+ - 0
73
82
  version: "0"
74
- version:
83
+ type: :runtime
84
+ version_requirements: *id005
75
85
  - !ruby/object:Gem::Dependency
76
86
  name: chronic
77
- type: :runtime
78
- version_requirement:
79
- version_requirements: !ruby/object:Gem::Requirement
87
+ prerelease: false
88
+ requirement: &id006 !ruby/object:Gem::Requirement
89
+ none: false
80
90
  requirements:
81
91
  - - ">="
82
92
  - !ruby/object:Gem::Version
93
+ segments:
94
+ - 0
83
95
  version: "0"
84
- version:
96
+ type: :runtime
97
+ version_requirements: *id006
85
98
  description: Pismo extracts and retrieves content-related metadata from HTML pages - you can use the resulting data in an organized way, such as a summary/first paragraph, body text, keywords, RSS feed URL, favicon, etc.
86
- email: git@peterc.org
99
+ email:
100
+ - git@peterc.org
87
101
  executables:
88
102
  - pismo
89
103
  extensions: []
90
104
 
91
- extra_rdoc_files:
92
- - LICENSE
93
- - README.markdown
105
+ extra_rdoc_files: []
106
+
94
107
  files:
95
108
  - .document
96
109
  - .gitignore
110
+ - Gemfile
97
111
  - LICENSE
98
112
  - NOTICE
99
113
  - README.markdown
100
114
  - Rakefile
101
- - VERSION
102
115
  - bin/pismo
103
116
  - lib/pismo.rb
104
117
  - lib/pismo/document.rb
@@ -106,6 +119,7 @@ files:
106
119
  - lib/pismo/internal_attributes.rb
107
120
  - lib/pismo/reader.rb
108
121
  - lib/pismo/stopwords.txt
122
+ - lib/pismo/version.rb
109
123
  - pismo.gemspec
110
124
  - test/corpus/bbcnews.html
111
125
  - test/corpus/bbcnews2.html
@@ -133,30 +147,52 @@ homepage: http://github.com/peterc/pismo
133
147
  licenses: []
134
148
 
135
149
  post_install_message:
136
- rdoc_options:
137
- - --charset=UTF-8
150
+ rdoc_options: []
151
+
138
152
  require_paths:
139
153
  - lib
140
154
  required_ruby_version: !ruby/object:Gem::Requirement
155
+ none: false
141
156
  requirements:
142
157
  - - ">="
143
158
  - !ruby/object:Gem::Version
159
+ segments:
160
+ - 0
144
161
  version: "0"
145
- version:
146
162
  required_rubygems_version: !ruby/object:Gem::Requirement
163
+ none: false
147
164
  requirements:
148
165
  - - ">="
149
166
  - !ruby/object:Gem::Version
167
+ segments:
168
+ - 0
150
169
  version: "0"
151
- version:
152
170
  requirements: []
153
171
 
154
- rubyforge_project:
155
- rubygems_version: 1.3.5
172
+ rubyforge_project: pismo
173
+ rubygems_version: 1.3.7
156
174
  signing_key:
157
175
  specification_version: 3
158
176
  summary: Extracts or retrieves content-related metadata from HTML pages
159
177
  test_files:
178
+ - test/corpus/bbcnews.html
179
+ - test/corpus/bbcnews2.html
180
+ - test/corpus/briancray.html
181
+ - test/corpus/cant_read.html
182
+ - test/corpus/factor.html
183
+ - test/corpus/gmane.html
184
+ - test/corpus/huffington.html
185
+ - test/corpus/metadata_expected.yaml
186
+ - test/corpus/metadata_expected.yaml.old
187
+ - test/corpus/queness.html
188
+ - test/corpus/reader_expected.yaml
189
+ - test/corpus/rubyinside.html
190
+ - test/corpus/rww.html
191
+ - test/corpus/spolsky.html
192
+ - test/corpus/techcrunch.html
193
+ - test/corpus/tweet.html
194
+ - test/corpus/youtube.html
195
+ - test/corpus/zefrank.html
160
196
  - test/helper.rb
161
197
  - test/test_corpus.rb
162
198
  - test/test_pismo_document.rb
data/VERSION DELETED
@@ -1 +0,0 @@
1
- 0.7.0