bwkfanboy 0.0.1

data/doc/plugin.rdoc ADDED
@@ -0,0 +1,118 @@
= HOWTO Write a \Plugin

First of all, look at the examples provided with bwkfanboy. They are
intended to be 100% working, since I wrote them for myself.

Basically, all you need is to write a class named _Page_ that
inherits from Bwkfanboy::Parse, overrides the #myparse method in the
child, and contains a simple module named _Meta_ inside your _Page_
class.

== Skeleton

Here is a skeleton of a plugin:

  require 'nokogiri'

  class Page < Bwkfanboy::Parse
    module Meta
      URI = 'http://example.org/news'
      ENC = 'UTF-8'
      VERSION = 1
      COPYRIGHT = '(c) 2010 John Doe'
      TITLE = "News from example.org"
      CONTENT_TYPE = 'html'
    end

    def myparse()
      # read stdin and parse it
      doc = Nokogiri::HTML(STDIN, nil, Meta::ENC)
      doc.xpath("XPATH QUERY").each {|i|
        t = clean(i.xpath("XPATH QUERY").text())
        l = clean(i.xpath("XPATH QUERY").text())
        u = date(i.xpath("XPATH QUERY").text())
        a = clean(i.xpath("XPATH QUERY").text())
        c = clean(i.xpath("XPATH QUERY").text())

        self << { title: t, link: l, updated: u, author: a, content: c }
      }
    end
  end

As you see, we are using Nokogiri for HTML parsing. You are not
required to use it too--use whatever parser you like. Nokogiri is
nice because it's able to read broken HTML and search through it via
XPath. If you would like to use, for example, REXML, beware that it
loves only strict XML--you may need to clean the HTML first with an
external utility such as Tidy.

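To illustrate the difference, here is a small stdlib-only sketch (the
sample markup is invented): REXML happily parses well-formed XML but
raises on an unclosed tag, which is exactly the kind of input
Nokogiri would quietly repair.

```ruby
require 'rexml/document'

# Well-formed XML is fine for REXML...
doc = REXML::Document.new('<p><b>bold</b></p>')
puts doc.elements['//b'].text          # => "bold"

# ...but broken HTML (an unclosed <b> tag) makes it raise, which is
# why such input may need a cleanup pass before REXML sees it.
begin
  REXML::Document.new('<p>unclosed <b>bold')
rescue REXML::ParseException
  puts 'REXML refused the broken markup'
end
```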
Bwkfanboy loads a plugin from a single file as valid Ruby code. This
means that the plugin can contain *any* Ruby code, but it doesn't
mean that it should.

=== \Meta

Module _Meta_ may contain only constants--and *all* the constants
listed in the skeleton are required.

* <tt>URI</tt>--can be an <tt>http(s)://</tt> or <tt>ftp://</tt> URL
  or just a path to a file on your local machine, such as
  <tt>/home/bob/huzza.html</tt>. This is the source that bwkfanboy
  will transform into the Atom feed.

* <tt>ENC</tt>--the encoding of the URI.

* <tt>VERSION</tt>--the version of the plugin.

* <tt>COPYRIGHT</tt>--some boring string.

* <tt>TITLE</tt>--a short description of the future feed. It'll be
  used later in the resulting XML.

* <tt>CONTENT_TYPE</tt>--one of the +xhtml+, +html+ or +text+ values.
  This is a very important constant because it specifies the format
  in which entries will be placed in the feed. Usually it's safe to
  use +html+.

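bwkfanboy verifies these constants itself after loading a plugin (see
the check method in the Parse class); here is a rough standalone
sketch of that validation, using a made-up _Meta_ module:

```ruby
# A stand-in Meta module; the values are invented for illustration.
module Meta
  URI = 'http://example.org/news'
  ENC = 'UTF-8'
  VERSION = 1
  COPYRIGHT = '(c) 2010 John Doe'
  TITLE = 'News from example.org'
  CONTENT_TYPE = 'html'
end

# Every required constant must be defined and non-blank.
[:URI, :ENC, :VERSION, :COPYRIGHT, :TITLE, :CONTENT_TYPE].each do |i|
  fail "Meta::#{i} not defined or empty" if !Meta.const_defined?(i) ||
                                            Meta.const_get(i).to_s =~ /^\s*$/
end
puts 'Meta looks complete'
```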
=== myparse

In the #myparse method, read stdin. Its contents are the raw HTML
you want to parse. The general idea:

* An Atom feed must contain at least 1 entry, so look in the HTML for
  some crap which you break into 5 pieces: the title of the entry, a
  link for it, a date for the entry, the author of the entry and its
  contents.

* After you scan and grab 1 entry, create a hash and add it to _self_
  as in the skeleton:

    self << { title: t, link: l, updated: u, author: a, content: c }

  Here the variables _t_, _l_, _u_, _a_ and _c_ contain the actual
  values of the 5 pieces for the entry. The names of the keys in the
  hash are important, of course--don't invent your own.

* There will probably be more crap in the HTML that you can use to
  construct another entry. Keep parsing and adding entries.

* While you're scanning, use the 2 helper methods for cleaning each
  piece: \#clean, which removes duplicate spaces, and #date, which
  parses a string and returns a date in ISO 8601 format. You may
  override the #date method if you like.

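For reference, the two helpers behave roughly like this standalone
sketch (mirroring clean and date in the Parse class; the sample
inputs are invented):

```ruby
require 'date'

# Collapses runs of whitespace into single spaces and trims the ends,
# like Bwkfanboy::Parse#clean.
def clean(s)
  s.gsub(/\s+/, ' ').strip
end

# Parses a date string and returns it in ISO 8601 format, like
# Bwkfanboy::Parse#date (minus the fallback to the current time).
def date(s)
  DateTime.parse(clean(s)).iso8601
end

puts clean("  Brian   Kernighan\n")  # => "Brian Kernighan"
puts date('May 5, 2010')             # => "2010-05-05T00:00:00+00:00"
```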
== How to test all this

To test how nicely your plugin works, save the HTML page to a file
and type:

  % bwkparser_parse -vv path/to/a/plugin.rb < saved_page.html

to see the result as plain text, or

  % bwkparser_parse -v path/to/a/plugin.rb < saved_page.html

to see it as pretty JSON.

<tt>bwkparser_parse</tt> returns 0 if no errors occurred, or >= 1 if
you have errors in your plugin code. N.B.: the output from
<tt>bwkparser_parse</tt> is always in UTF-8.
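Between parsing and feed generation, bwkfanboy passes the entries
around as JSON with the shape sketched below (cf. to_json in the
Parse class and the bundled JSON schema); all the entry values here
are invented:

```ruby
require 'json'

# A hand-built example of the intermediate structure; one entry only.
entry = {
  title:   'An example entry',
  link:    'http://example.org/news/1',
  updated: '2010-05-05T00:00:00+00:00',
  author:  'John Doe',
  content: 'Hello.'
}

feed = {
  channel: {
    updated: entry[:updated],
    id:      'http://example.org/news',
    author:  'bwkfanboy',
    title:   'News from example.org',
    link:    'http://example.org/news',
    x_entries_content_type: 'html'
  },
  x_entries: [entry]
}

puts JSON.pretty_generate(feed)
```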
@@ -0,0 +1,143 @@
require 'json'
require 'date'

require_relative 'utils'

# :include: ../../doc/README.rdoc
module Bwkfanboy

  # :include: ../../doc/plugin.rdoc
  class Parse
    ENTRIES_MAX = 64

    def initialize()
      @entries = []
    end

    # Invokes #myparse & checks whether it has grabbed something.
    def parse()
      @entries = []
      begin
        myparse()
      rescue
        @entries = []
        Utils.errx(1, "parsing failed: #{$!}\n\nBacktrace:\n\n#{$!.backtrace.join("\n")}")
      end
      Utils.errx(1, "plugin returned no output") if @entries.length == 0
    end

    # Prints entries as 'key: value' formatted strings. Intended for
    # debugging.
    def dump()
      @entries.each {|i|
        puts "title   : " + i[:title]
        puts "link    : " + i[:link]
        puts "updated : " + i[:updated]
        puts "author  : " + i[:author]
        puts "content : " + i[:content]
        puts ""
      }
    end

    def to_json()
      # guess the time of the most recent entry
      u = DateTime.parse() # the no-arg default is January 1, 4713 BCE
      @entries.each {|i|
        t = DateTime.parse(i[:updated])
        u = t if t > u
      }

      m = get_meta()
      j = {
        channel: {
          updated: u,
          id: m::URI,
          author: Meta::NAME, # just a placeholder
          title: m::TITLE,
          link: m::URI,
          x_entries_content_type: m::CONTENT_TYPE
        },
        x_entries: @entries
      }
      Utils::cfg[:verbose] >= 1 ? JSON.pretty_generate(j) : JSON.generate(j)
    end

    # After loading a plugin, one can do basic validation of the
    # plugin's class with the help of this method.
    def check
      m = get_meta()
      begin
        [:URI, :ENC, :VERSION, :COPYRIGHT, :TITLE, :CONTENT_TYPE].each {|i|
          fail "#{m}::#{i} not defined or empty" if (! m.const_defined?(i) || m.const_get(i) =~ /^\s*$/)
        }
      rescue
        Utils.errx(1, "incomplete plugin: #{$!}")
      end
    end

    # Prints plugin's meta information.
    def dump_info()
      m = get_meta()
      puts "Version   : #{m::VERSION}"
      puts "Copyright : #{m::COPYRIGHT}"
      puts "Title     : #{m::TITLE}"
      puts "URI       : #{m::URI}"
    end

    protected

    # This *must* be overridden in the child.
    def myparse()
      raise "plugin isn't finished yet"
    end

    # Tries to parse _s_ as a date string. Returns the result in
    # ISO 8601 format.
    def date(s)
      begin
        DateTime.parse(clean(s)).iso8601()
      rescue
        Utils.vewarnx(2, "#{s} is unparsable; date is set to current")
        DateTime.now().iso8601()
      end
    end

    # Checks whether the maximum number of entries has been reached.
    def toobig?
      @entries.length >= ENTRIES_MAX
    end

    def <<(t)
      if toobig? then
        Utils.warnx("reached max number of entries (#{ENTRIES_MAX})")
        return @entries
      end

      %w(updated author link).each { |i|
        fail "unable to extract '#{i}'" if ! t.key?(i.to_sym) || t[i.to_sym] == nil || t[i.to_sym].empty?
      }
      %w(title content).each { |i|
        fail "missing '#{i}'" if ! t.key?(i.to_sym) || t[i.to_sym] == nil
      }
      # this check is redundant unless the user has overridden the date() method
      if t[:updated] !~ /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\+\d{2}:\d{2}/ then
        fail "'#{t[:updated]}' isn't in iso8601 format"
      end
      @entries << t
    end

    private

    def clean(s)
      s.gsub(/\s+/, ' ').strip()
    end

    def get_meta()
      Utils.errx(1, "incomplete plugin: no #{self.class}::Meta module") if (! defined?(self.class::Meta) || ! self.class::Meta.is_a?(Module))
      self.class::Meta
    end

  end # class

end # module
@@ -0,0 +1,33 @@
# A simple plugin that parses the listing of bwk's articles from
# dailyprincetonian.com.

require 'nokogiri'

class Page < Bwkfanboy::Parse
  module Meta
    URI = 'http://www.dailyprincetonian.com/advanced_search/?author=Brian+Kernighan'
    URI_DEBUG = '/home/alex/lib/software/alex/bwkfanboy/test/semis/bwk.html'
    ENC = 'UTF-8'
    VERSION = 1
    COPYRIGHT = '(c) 2010 Alexander Gromnitsky'
    TITLE = "Brian Kernighan's articles from Daily Princetonian"
    CONTENT_TYPE = 'html'
  end

  def myparse()
    url = "http://www.dailyprincetonian.com"

    doc = Nokogiri::HTML(STDIN, nil, Meta::ENC)
    doc.xpath("//div[@class='article_item']").each {|i|
      t = clean(i.xpath("h2/a").children.text())
      link = clean(i.xpath("h2/a")[0].attributes['href'].value())
      fail 'unable to extract link' if link.empty?
      l = url + link + "print"
      u = date(i.xpath("h2").children[1].text())
      a = clean(i.xpath("div/span/a[1]").children.text())
      c = clean(i.xpath("div[@class='summary']").text())

      self << { title: t, link: l, updated: u, author: a, content: c }
    }
  end
end
@@ -0,0 +1,76 @@
require 'digest/md5'

class Page < Bwkfanboy::Parse
  module Meta
    URI = '/usr/ports/UPDATING'
    URI_DEBUG = URI
    ENC = 'ASCII'
    VERSION = 1
    COPYRIGHT = '(c) 2010 Alexander Gromnitsky'
    TITLE = "News from FreeBSD ports"
    CONTENT_TYPE = 'text'
  end

  def myadd(ready, t, l, u, a, c)
    return true if ! ready
    return false if toobig?
    self << { title: t, link: l, updated: u, author: a, content: c.rstrip }
    return true
  end

  def clean(t)
    t = t[2..-1] if t[0] != "\t"
    return '' if t == nil
    return t
  end

  def myparse()
    re_u = /^(\d{8}):$/
    re_t1 = /^ {2}AFFECTS:\s+(.+)$/
    re_t2 = /^\s+(.+)$/
    re_a = /^ {2}AUTHOR:\s+(.+)$/

    ready = false
    mode = nil
    t = l = u = a = c = nil
    while line = STDIN.gets
      line.rstrip!

      if line =~ re_u then
        # add a new entry
        break if ! myadd(ready, t, l, u, a, c)
        ready = true
        u = date($1)
        l = $1 # partial, see below
        t = a = c = nil
        next
      end

      if ready then
        if line =~ re_t1 then
          mode = 'title'
          t = $1
          c = clean($&) + "\n"
          # the link should be unique
          l = "file://#{Meta::URI}\##{l}-#{Digest::MD5.hexdigest($1)}"
        elsif line =~ re_a then
          mode = 'author'
          a = $1
          c += clean($&) + "\n"
        elsif line =~ re_t2 && mode == 'title' then
          t += ' ' + $1
          c += clean($&) + "\n"
        else
          # content
          c += clean(line) + "\n"
          mode = nil
        end
      end

      # skipping the preamble
    end

    # add the last entry
    myadd(ready, t, l, u, a, c)
  end
end
@@ -0,0 +1,39 @@
{
  "type": "object",
  "properties": {
    "channel": {
      "type": "object",
      "properties": {
        "updated": {
          "type": "string",
          "format": "date-time"
        },
        "id": { "type": "string" },
        "author": { "type": "string" },
        "title": { "type": "string" },
        "link": { "type": "string" },
        "x_entries_content_type": {
          "type": "string",
          "enum": ["text", "html", "xhtml"]
        }
      }
    },
    "x_entries": {
      "type": "array",
      "minItems": 1,
      "items": {
        "type": "object",
        "properties": {
          "title": { "type": "string" },
          "link": { "type": "string" },
          "updated": {
            "type": "string",
            "format": "date-time"
          },
          "author": { "type": "string" },
          "content": { "type": "string" }
        }
      }
    }
  }
}
@@ -0,0 +1,134 @@
require 'optparse'
require 'logger'

require 'open4'
require 'active_support/core_ext/module/attribute_accessors'

module Bwkfanboy
  module Meta
    NAME = 'bwkfanboy'
    VERSION = '0.0.1'
    USER_AGENT = "#{NAME}/#{VERSION} (#{RUBY_PLATFORM}; N; #{Encoding.default_external.name}; #{RUBY_ENGINE}; rv:#{RUBY_VERSION}.#{RUBY_PATCHLEVEL})"
    PLUGIN_CLASS = 'Page'
    DIR_TMP = "/tmp/#{Meta::NAME}/#{ENV['USER']}"
    DIR_LOG = "#{DIR_TMP}/log"
    LOG_MAXSIZE = 64*1024
    PLUGIN_NAME = /^[a-zA-Z0-9_-]+$/
  end

  module Utils
    mattr_accessor :cfg, :log

    self.cfg = Hash.new()
    cfg[:verbose] = 0
    cfg[:log] = "#{Meta::DIR_LOG}/general.log"

    def self.warnx(t)
      m = File.basename($0) + " warning: " + t + "\n"
      $stderr.print(m)
      log.warn(m.chomp) if log
    end

    def self.errx(ec, t)
      m = File.basename($0) + " error: " + t + "\n"
      $stderr.print(m)
      log.error(m.chomp) if log
      exit(ec)
    end

    def self.veputs(level, t)
      if cfg[:verbose] >= level then
        log.info(t.chomp) if log
        print(t)
      end
    end

    def self.vewarnx(level, t)
      warnx(t) if cfg[:verbose] >= level
    end

    # Logs, pidfiles and other temporary stuff sit here.
    def self.dir_tmp_create()
      if ! File.writable?(Meta::DIR_TMP) then
        begin
          t = '/'
          Meta::DIR_TMP.split('/')[1..-1].each {|i|
            t += i + '/'
            Dir.mkdir(t) if ! Dir.exists?(t)
          }
        rescue
          warnx("cannot create/open directory #{Meta::DIR_TMP} for writing")
        end
      end
    end

    def self.log_start()
      dir_tmp_create()
      begin
        Dir.mkdir(Meta::DIR_LOG) if ! File.writable?(Meta::DIR_LOG)
        log = Logger.new(cfg[:log], 2, Meta::LOG_MAXSIZE)
      rescue
        warnx("cannot open log #{cfg[:log]}")
        return nil
      end
      log.level = Logger::DEBUG
      log.datetime_format = "%H:%M:%S"
      log.info("#{$0} starting")
      log
    end
    self.log = log_start()

    # Loads (via <tt>require()</tt>) Ruby code from _path_ (the full
    # path to the file). <em>class_name</em> is the name of the class to
    # check for existence after the plugin loads successfully.
    def self.plugin_load(path, class_name)
      begin
        require(path)
        # TODO: get rid of eval()
        fail "class #{class_name} isn't defined" if (! eval("defined?#{class_name}") || ! eval(class_name).is_a?(Class))
      rescue LoadError
        errx(1, "cannot load plugin '#{path}'")
      rescue Exception
        errx(1, "plugin '#{path}' has errors: #{$!}\n\nBacktrace:\n\n#{$!.backtrace.join("\n")}")
      end
    end

    # Parses command line options. _arr_ is an array of options (usually
    # +ARGV+). _banner_ is a help string that describes what your
    # program does.
    #
    # If _o_ is nil, the function creates an +OptionParser+ object and
    # returns it without parsing (unless _simple_ is true); otherwise it
    # parses _arr_ immediately. See the <tt>bwkfanboy</tt> script for
    # examples.
    def self.cl_parse(arr, banner, o = nil, simple = false)
      if ! o then
        o = OptionParser.new
        o.banner = banner
        o.on('-v', 'Be more verbose.') { |i| Bwkfanboy::Utils.cfg[:verbose] += 1 }
        return o if ! simple
      end

      begin
        o.parse!(arr)
      rescue
        Bwkfanboy::Utils.errx(1, $!.to_s)
      end
    end

    # Used in the CGI and WEBrick examples.
    def self.cmd_run(cmd)
      pid, stdin, stdout, stderr = Open4::popen4(cmd)
      ignored, status = Process::waitpid2(pid)
      [status.exitstatus, stderr.read, stdout.read]
    end

    def self.gem_dir_system
      t = ["#{File.dirname(File.expand_path($0))}/../lib/#{Meta::NAME}",
           "#{Gem.dir}/gems/#{Meta::NAME}-#{Meta::VERSION}/lib/#{Meta::NAME}"]
      t.each {|i| return i if File.readable?(i) }
      raise "both paths are invalid: #{t}"
    end

  end # Utils
end
@@ -0,0 +1,29 @@
require 'nokogiri'

class Page < Bwkfanboy::Parse
  module Meta
    URI = "html/bwk.html"
    ENC = 'UTF-8'
    VERSION = 1
    COPYRIGHT = '(c) 2010 Alexander Gromnitsky'
    TITLE = "Brian Kernighan's articles from Daily Princetonian"
    CONTENT_TYPE = 'html'
  end

  def myparse()
    url = "http://www.dailyprincetonian.com"

    doc = Nokogiri::HTML(STDIN, nil, Meta::ENC)
    doc.xpath("//div[@class='article_item']").each {|i|
      t = clean(i.xpath("h2/a").children.text())
      link = clean(i.xpath("h2/a")[0].attributes['href'].value())
      fail 'unable to extract link' if link.empty?
      l = url + link + "print"
      u = date(i.xpath("h2").children[1].text())
      a = clean(i.xpath("div/span/a[1]").children.text())
      c = clean(i.xpath("div[@class='summary']").text())

      self << { title: t, link: l, updated: u, author: a, content: c }
    }
  end
end
data/test/popen4.sh ADDED
@@ -0,0 +1,4 @@
#!/bin/sh
echo this is stdin
echo this is stderr 1>&2
exit 32