bwkfanboy 0.0.1

data/doc/plugin.rdoc ADDED
@@ -0,0 +1,118 @@
1
+ = HOWTO Write a \Plugin
2
+
3
+ First of all, look at examples provided with bwkfanboy. They were
4
+ intended to be 100% working because I was writing them for myself.
5
+
6
+ Basically, all you need is to write a class named _Page_ that
7
+ inherits from Bwkfanboy::Parse, override the #myparse method in it,
8
+ and write a simple module named _Meta_ inside your _Page_
9
+ class.
10
+
11
+ == Skeleton
12
+
13
+ Here is a skeleton of a plugin:
14
+
15
+ require 'nokogiri'
16
+
17
+ class Page < Bwkfanboy::Parse
18
+ module Meta
19
+ URI = 'http://example.org/news'
20
+ ENC = 'UTF-8'
21
+ VERSION = 1
22
+ COPYRIGHT = '(c) 2010 John Doe'
23
+ TITLE = "News from example.org"
24
+ CONTENT_TYPE = 'html'
25
+ end
26
+
27
+ def myparse()
28
+ # read stdin and parse it
29
+ doc = Nokogiri::HTML(STDIN, nil, Meta::ENC)
30
+ doc.xpath("XPATH QUERY").each {|i|
31
+ t = clean(i.xpath("XPATH QUERY").text())
32
+ l = clean(i.xpath("XPATH QUERY").text())
33
+ u = date(i.xpath("XPATH QUERY").text())
34
+ a = clean(i.xpath("XPATH QUERY").text())
35
+ c = clean(i.xpath("XPATH QUERY").text())
36
+
37
+ self << { title: t, link: l, updated: u, author: a, content: c }
38
+ }
39
+ end
40
+ end
41
+
42
+ As you see, we are using Nokogiri for HTML parsing. You are not
43
+ required to use it too--pick whatever parser you like. Nokogiri
44
+ is nice because it is able to read broken HTML and search through
45
+ it via XPath. If you would like to use, for example, REXML, beware
46
+ that it accepts only well-formed XML--you may need to clean the HTML
47
+ with an external utility such as Tidy.
48
+
49
+ Bwkfanboy loads a plugin from a single file as ordinary Ruby code.
50
+ This means that the plugin can contain *any* Ruby code, but that
51
+ doesn't mean it should.
52
+
53
+ === \Meta
54
+
55
+ Module _Meta_ can have only constants--and *all* constants listed in
56
+ the skeleton are required.
57
+
58
+ * <tt>URI</tt>--can be a <tt>http(s)://</tt> or <tt>ftp://</tt> URL
59
+ or just a path to a file on your local machine, as
60
+ <tt>/home/bob/huzza.html</tt>. This is the source that
61
+ bwkfanboy will be transforming to the Atom feed.
62
+
63
+ * <tt>ENC</tt>--an encoding for URI.
64
+
65
+ * <tt>VERSION</tt>--a version of a plugin.
66
+
67
+ * <tt>COPYRIGHT</tt>--some boring string.
68
+
69
+ * <tt>TITLE</tt>--a short description of the future feed. It'll be
70
+ used later in the resulting XML.
71
+
72
+ * <tt>CONTENT_TYPE</tt>--one of +xhtml+, +html+ or +text+ values. This is
73
+ a very important constant because it says in what format entries
74
+ will be placed in the feed. Usually it's safe to use +html+.
75
+
76
+ === myparse
77
+
78
+ In the #myparse method, read stdin. Its contents are the raw
79
+ HTML you want to parse. The general idea:
80
+
81
+ * An Atom feed must contain at least 1 entry, so look in the HTML for
82
+ some crap which you can break into 5 pieces: the title of the entry,
83
+ a link to it, a date for the entry, the author of the entry and its
84
+ contents.
85
+
86
+ * After you scan and grab 1 entry, create a hash and add it to
87
+ _self_ as it was in the skeleton:
88
+
89
+ self << { title: t, link: l, updated: u, author: a, content: c }
90
+
91
+ Here the variables _t_, _l_, _u_, _a_ and _c_ contain the actual
92
+ values of the 5 pieces for the entry. The names of the keys in the
93
+ hash are important, of course--don't invent your own.
94
+
95
+ * There will probably be more crap in the HTML that you can use to
96
+ construct another entry. Keep parsing and adding entries.
97
+
98
+ * While you are scanning, use the 2 helper methods for cleaning each
99
+ piece: \#clean, which removes duplicate whitespace, and #date, which
100
+ parses a string and returns a date in ISO 8601 format. You may
101
+ override the #date method if you like.
102
+
103
+ == How to test all this
104
+
105
+ To test how well your plugin works, save the HTML page to a file
106
+ and type:
107
+
108
+ % bwkparser_parse -vv path/to/a/plugin.rb < saved_page.html
109
+
110
+ to see the result as plain text, or
111
+
112
+ % bwkparser_parse -v path/to/a/plugin.rb < saved_page.html
113
+
114
+ as pretty JSON.
115
+
116
+ <tt>bwkparser_parse</tt> returns 0 if no errors occurred or >= 1 if you
117
+ have errors in your plugin code. N.B.: the output from
118
+ <tt>bwkparser_parse</tt> is always in UTF-8.
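The plugin contract described in this HOWTO can be sketched without bwkfanboy itself. The block below is only an illustration under stated assumptions: `SketchParse` is a hypothetical stand-in for Bwkfanboy::Parse (the real class validates entries and reads stdin), the pipe-separated input format is invented, and input comes from a `StringIO` rather than stdin so the sketch runs on its own.

```ruby
require 'date'
require 'stringio'

# Hypothetical stand-in for Bwkfanboy::Parse; not the real base class.
class SketchParse
  attr_reader :entries

  def initialize(input)
    @input = input
    @entries = []
  end

  # Collects one entry hash, as `self << {...}` does in the skeleton.
  def <<(entry)
    @entries << entry
  end

  protected

  # Collapses runs of whitespace, like the #clean helper.
  def clean(s)
    s.gsub(/\s+/, ' ').strip
  end

  # Returns the date in ISO 8601 format, like the #date helper.
  def date(s)
    DateTime.parse(clean(s)).iso8601
  end
end

class Page < SketchParse
  module Meta
    URI = 'http://example.org/news'
    ENC = 'UTF-8'
    VERSION = 1
    COPYRIGHT = '(c) 2010 John Doe'
    TITLE = 'News from example.org'
    CONTENT_TYPE = 'text'
  end

  # Each input line is "date|author|link|title|content" (an invented format).
  def myparse
    @input.each_line do |line|
      u, a, l, t, c = line.chomp.split('|', 5)
      self << { title: clean(t), link: clean(l), updated: date(u),
                author: clean(a), content: clean(c) }
    end
  end
end

page = Page.new(StringIO.new(
  "2010-07-04 12:00:00 +03:00|Bob|http://example.org/1|Hello   world|First  post\n"
))
page.myparse
puts page.entries.first
```

The five keys of the hash are the ones the HOWTO names. Everything else in the sketch (class name, input format) is illustrative only.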
@@ -0,0 +1,143 @@
1
+ require 'json'
2
+ require 'date'
3
+
4
+ require_relative 'utils'
5
+
6
+ # :include: ../../doc/README.rdoc
7
+ module Bwkfanboy
8
+
9
+ # :include: ../../doc/plugin.rdoc
10
+ class Parse
11
+ ENTRIES_MAX = 64
12
+
13
+ def initialize()
14
+ @entries = []
15
+ end
16
+
17
+ # Invokes #myparse & checks if it has grabbed something.
18
+ def parse()
19
+ @entries = []
20
+ begin
21
+ myparse()
22
+ rescue
23
+ @entries = []
24
+ Utils.errx(1, "parsing failed: #{$!}\n\nBacktrace:\n\n#{$!.backtrace.join("\n")}")
25
+ end
26
+ Utils.errx(1, "plugin returned no output") if @entries.length == 0
27
+ end
28
+
29
+ # Prints entries in 'key: value' formatted strings. Intended for
30
+ # debugging.
31
+ def dump()
32
+ @entries.each {|i|
33
+ puts "title : " + i[:title]
34
+ puts "link : " + i[:link]
35
+ puts "updated : " + i[:updated]
36
+ puts "author : " + i[:author]
37
+ puts "content : " + i[:content]
38
+ puts ""
39
+ }
40
+ end
41
+
42
+ def to_json()
43
+ # guess the time of the most recent entry
44
+ u = DateTime.parse() # January 1, 4713 BCE
45
+ @entries.each {|i|
46
+ t = DateTime.parse(i[:updated])
47
+ u = t if t > u
48
+ }
49
+
50
+ m = get_meta()
51
+ j = {
52
+ channel: {
53
+ updated: u,
54
+ id: m::URI,
55
+ author: Meta::NAME, # just a placeholder
56
+ title: m::TITLE,
57
+ link: m::URI,
58
+ x_entries_content_type: m::CONTENT_TYPE
59
+ },
60
+ x_entries: @entries
61
+ }
62
+ Utils::cfg[:verbose] >= 1 ? JSON.pretty_generate(j) : JSON.generate(j)
63
+ end
64
+
65
+ # After loading a plugin, one can do basic validation of the
66
+ # plugin's class with the help of this method.
67
+ def check
68
+ m = get_meta()
69
+ begin
70
+ [:URI, :ENC, :VERSION, :COPYRIGHT, :TITLE, :CONTENT_TYPE].each {|i|
71
+ fail "#{m}::#{i} not defined or empty" if (! m.const_defined?(i) || m.const_get(i).to_s =~ /\A\s*\z/)
72
+ }
73
+ rescue
74
+ Utils.errx(1, "incomplete plugin: #{$!}")
75
+ end
76
+ end
77
+
78
+ # Prints plugin's meta information.
79
+ def dump_info()
80
+ m = get_meta()
81
+ puts "Version : #{m::VERSION}"
82
+ puts "Copyright : #{m::COPYRIGHT}"
83
+ puts "Title : #{m::TITLE}"
84
+ puts "URI : #{m::URI}"
85
+ end
86
+
87
+ protected
88
+
89
+ # This *must* be overridden in the child.
90
+ def myparse()
91
+ raise "plugin isn't finished yet"
92
+ end
93
+
94
+ # Tries to parse _s_ as a date string. Returns the result in ISO 8601
95
+ # format.
96
+ def date(s)
97
+ begin
98
+ DateTime.parse(clean(s)).iso8601()
99
+ rescue
100
+ Utils.vewarnx(2, "#{s} is unparsable; date is set to current")
101
+ DateTime.now().iso8601()
102
+ end
103
+ end
104
+
105
+ # Returns true when the number of entries has reached ENTRIES_MAX.
106
+ def toobig?
107
+ return true if @entries.length >= ENTRIES_MAX
108
+ return false
109
+ end
110
+
111
+ def <<(t)
112
+ if toobig? then
113
+ Utils.warnx("reached max number of entries (#{ENTRIES_MAX})")
114
+ return @entries
115
+ end
116
+
117
+ %w(updated author link).each { |i|
118
+ fail "unable to extract '#{i}'" if ! t.key?(i.to_sym) || t[i.to_sym] == nil || t[i.to_sym].empty?
119
+ }
120
+ %w(title content).each { |i|
121
+ fail "missing '#{i}'" if ! t.key?(i.to_sym) || t[i.to_sym] == nil
122
+ }
123
+ # a redundant check if user hasn't redefined date() method
124
+ if t[:updated] !~ /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2}/ then
125
+ fail "'#{t[:updated]}' isn't in iso8601 format"
126
+ end
127
+ @entries << t
128
+ end
129
+
130
+ private
131
+
132
+ def clean(s)
133
+ s.gsub(/\s+/, ' ').strip()
134
+ end
135
+
136
+ def get_meta()
137
+ Utils.errx(1, "incomplete plugin: no #{self.class}::Meta module") if (! defined?(self.class::Meta) || ! self.class::Meta.is_a?(Module))
138
+ self.class::Meta
139
+ end
140
+
141
+ end # class
142
+
143
+ end # module
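The checks that `<<` performs on each entry can be sketched as a standalone function. This is a sketch under assumptions, not the class's own code: `entry_problems` and `ISO8601_RE` are hypothetical names, the function collects every problem instead of failing on the first one, and the pattern accepts both `+` and `-` UTC offsets, since `DateTime#iso8601` emits either sign.

```ruby
# Hypothetical pattern: date, time, and a signed UTC offset.
ISO8601_RE = /\A\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2}\z/

# Returns a list of human-readable problems; an empty list means the
# entry would be accepted.
def entry_problems(entry)
  problems = []

  # These keys must be present and non-empty.
  %i[updated author link].each do |k|
    problems << "unable to extract '#{k}'" if entry[k].nil? || entry[k].empty?
  end

  # These keys must be present but may be empty strings.
  %i[title content].each do |k|
    problems << "missing '#{k}'" if entry[k].nil?
  end

  if entry[:updated] && entry[:updated] !~ ISO8601_RE
    problems << "'#{entry[:updated]}' isn't in iso8601 format"
  end

  problems
end

good = { title: 't', link: 'http://example.org/1',
         updated: '2010-07-04T12:00:00-05:00', author: 'bob', content: '' }
p entry_problems(good)
p entry_problems(good.merge(updated: '2010-07-04'))
```

Collecting all problems at once is a design choice of this sketch; it makes a broken plugin easier to debug in one run.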
@@ -0,0 +1,33 @@
1
+ # A simple plugin that parses the listing of bwk's articles from
2
+ # dailyprincetonian.com.
3
+
4
+ require 'nokogiri'
5
+
6
+ class Page < Bwkfanboy::Parse
7
+ module Meta
8
+ URI = 'http://www.dailyprincetonian.com/advanced_search/?author=Brian+Kernighan'
9
+ URI_DEBUG = '/home/alex/lib/software/alex/bwkfanboy/test/semis/bwk.html'
10
+ ENC = 'UTF-8'
11
+ VERSION = 1
12
+ COPYRIGHT = '(c) 2010 Alexander Gromnitsky'
13
+ TITLE = "Brian Kernighan's articles from Daily Princetonian"
14
+ CONTENT_TYPE = 'html'
15
+ end
16
+
17
+ def myparse()
18
+ url = "http://www.dailyprincetonian.com"
19
+
20
+ doc = Nokogiri::HTML(STDIN, nil, Meta::ENC)
21
+ doc.xpath("//div[@class='article_item']").each {|i|
22
+ t = clean(i.xpath("h2/a").children.text())
23
+ link = clean(i.xpath("h2/a")[0].attributes['href'].value())
24
+ fail 'unable to extract link' if link.empty?
25
+ l = url + link + "print"
26
+ u = date(i.xpath("h2").children[1].text())
27
+ a = clean(i.xpath("div/span/a[1]").children.text())
28
+ c = clean(i.xpath("div[@class='summary']").text())
29
+
30
+ self << { title: t, link: l, updated: u, author: a, content: c }
31
+ }
32
+ end
33
+ end
@@ -0,0 +1,76 @@
1
+ require 'digest/md5'
2
+
3
+ class Page < Bwkfanboy::Parse
4
+ module Meta
5
+ URI = '/usr/ports/UPDATING'
6
+ URI_DEBUG = URI
7
+ ENC = 'ASCII'
8
+ VERSION = 1
9
+ COPYRIGHT = '(c) 2010 Alexander Gromnitsky'
10
+ TITLE = "News from FreeBSD ports"
11
+ CONTENT_TYPE = 'text'
12
+ end
13
+
14
+ def myadd(ready, t, l, u, a, c)
15
+ return true if ! ready
16
+ return false if toobig?
17
+ self << { title: t, link: l, updated: u, author: a, content: c.rstrip } if ready
18
+ return true
19
+ end
20
+
21
+ def clean(t)
22
+ t = t[2..-1] if t[0] != "\t"
23
+ return '' if t == nil
24
+ return t
25
+ end
26
+
27
+ def myparse()
28
+ re_u = /^(\d{8}):$/
29
+ re_t1 = /^ {2}AFFECTS:\s+(.+)$/
30
+ re_t2 = /^\s+(.+)$/
31
+ re_a = /^ {2}AUTHOR:\s+(.+)$/
32
+
33
+ ready = false
34
+ mode = nil
35
+ t = l = u = a = c = nil
36
+ while line = STDIN.gets
37
+ line.rstrip!
38
+
39
+ if line =~ re_u then
40
+ # add a new entry
41
+ break if ! myadd(ready, t, l, u, a, c)
42
+ ready = true
43
+ u = date($1)
44
+ l = $1 # partial, see below
45
+ t = a = c = nil
46
+ next
47
+ end
48
+
49
+ if ready then
50
+ if line =~ re_t1 then
51
+ mode = 'title'
52
+ t = $1
53
+ c = clean($&) + "\n"
54
+ # link should be unique
55
+ l = "file://#{Meta::URI}\##{l}-#{Digest::MD5.hexdigest($1)}"
56
+ elsif line =~ re_a
57
+ mode = 'author'
58
+ a = $1
59
+ c += clean($&) + "\n"
60
+ elsif line =~ re_t2 && mode == 'title'
61
+ t += ' ' + $1
62
+ c += clean($&) + "\n"
63
+ else
64
+ # content
65
+ c += clean(line) + "\n"
66
+ mode = nil
67
+ end
68
+ end
69
+
70
+ # skipping the preamble
71
+ end
72
+
73
+ # add last entry
74
+ myadd(ready, t, l, u, a, c)
75
+ end
76
+ end
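The line-by-line scan in the plugin above can be tried on a small sample without bwkfanboy. The block below is a simplified sketch: `parse_updating` and `SAMPLE` are hypothetical names, the sample text is invented, multi-line `AFFECTS:` continuations are not handled, and the `AFFECTS:`/`AUTHOR:` lines are left out of the content, unlike in the plugin.

```ruby
require 'date'
require 'digest/md5'

# Invented sample in the shape of /usr/ports/UPDATING.
SAMPLE = <<~TEXT
  This preamble is skipped.

  20100704:
    AFFECTS: users of foo/bar
    AUTHOR: bob@example.org

    Rebuild foo/bar after updating.

  20100701:
    AFFECTS: users of baz
    AUTHOR: alice@example.org

    Nothing to do.
TEXT

def parse_updating(text, source = '/usr/ports/UPDATING')
  entries = []
  cur = nil
  text.each_line do |raw|
    line = raw.rstrip
    if line =~ /^(\d{8}):$/
      # A date line starts a new entry; store the previous one first.
      stamp = $1
      entries << cur if cur
      cur = { stamp: stamp, updated: DateTime.parse(stamp).iso8601, content: [] }
    elsif cur.nil?
      next # still in the preamble
    elsif line =~ /^ {2}AFFECTS:\s+(.+)$/
      title = $1
      cur[:title] = title
      # The digest of the title keeps links unique within one date.
      cur[:link] = "file://#{source}\##{cur[:stamp]}-#{Digest::MD5.hexdigest(title)}"
    elsif line =~ /^ {2}AUTHOR:\s+(.+)$/
      cur[:author] = $1
    else
      cur[:content] << line.strip
    end
  end
  entries << cur if cur

  entries.map do |e|
    { title: e[:title], link: e[:link], updated: e[:updated],
      author: e[:author], content: e[:content].join("\n").strip }
  end
end

parse_updating(SAMPLE).each { |e| puts "#{e[:updated]}  #{e[:title]}" }
```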
@@ -0,0 +1,39 @@
1
+ {
2
+ "type": "object",
3
+ "properties": {
4
+ "channel": {
5
+ "type": "object",
6
+ "properties": {
7
+ "updated": {
8
+ "type": "string",
9
+ "format": "date-time"
10
+ },
11
+ "id": { "type": "string" },
12
+ "author": { "type": "string" },
13
+ "title": { "type": "string" },
14
+ "link": { "type": "string" },
15
+ "x_entries_content_type": {
16
+ "type": "string",
17
+ "enum": ["text", "html", "xhtml"]
18
+ }
19
+ }
20
+ },
21
+ "x_entries": {
22
+ "type": "array",
23
+ "minItems": 1,
24
+ "items": {
25
+ "type": "object",
26
+ "properties": {
27
+ "title": { "type": "string" },
28
+ "link": { "type": "string" },
29
+ "updated": {
30
+ "type": "string",
31
+ "format": "date-time"
32
+ },
33
+ "author": { "type": "string" },
34
+ "content": { "type": "string" }
35
+ }
36
+ }
37
+ }
38
+ }
39
+ }
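A document of the shape this schema describes can be assembled and checked by hand. The sketch below makes several assumptions: `build_feed` and `conforms?` are hypothetical helpers, the sample entries are invented, and the check is stricter than the schema because it treats every listed property as required. It does not use a JSON Schema validator.

```ruby
require 'json'
require 'date'

# Builds the channel/x_entries structure; channel.updated is the
# timestamp of the most recent entry.
def build_feed(entries, meta)
  latest = entries.map { |e| DateTime.parse(e[:updated]) }.max
  {
    channel: {
      updated: latest.iso8601,
      id: meta[:uri],
      author: meta[:author],
      title: meta[:title],
      link: meta[:uri],
      x_entries_content_type: meta[:content_type]
    },
    x_entries: entries
  }
end

# Hand-written shape check on a parsed (string-keyed) document.
def conforms?(doc)
  ch = doc['channel']
  es = doc['x_entries']
  return false unless ch.is_a?(Hash) && es.is_a?(Array) && es.length >= 1
  return false unless %w[text html xhtml].include?(ch['x_entries_content_type'])
  return false unless %w[updated id author title link].all? { |k| ch[k].is_a?(String) }
  es.all? { |e| %w[title link updated author content].all? { |k| e[k].is_a?(String) } }
end

entries = [
  { title: 'A', link: 'http://example.org/a', updated: '2010-07-01T10:00:00+00:00',
    author: 'bob', content: 'first' },
  { title: 'B', link: 'http://example.org/b', updated: '2010-07-04T12:00:00+03:00',
    author: 'bob', content: 'second' }
]
meta = { uri: 'http://example.org/news', author: 'example',
         title: 'News from example.org', content_type: 'html' }

json = JSON.pretty_generate(build_feed(entries, meta))
puts json
puts conforms?(JSON.parse(json))
```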
@@ -0,0 +1,134 @@
1
+ require 'optparse'
2
+ require 'logger'
3
+
4
+ require 'open4'
5
+ require 'active_support/core_ext/module/attribute_accessors'
6
+
7
+ module Bwkfanboy
8
+ module Meta
9
+ NAME = 'bwkfanboy'
10
+ VERSION = '0.0.1'
11
+ USER_AGENT = "#{NAME}/#{VERSION} (#{RUBY_PLATFORM}; N; #{Encoding.default_external.name}; #{RUBY_ENGINE}; rv:#{RUBY_VERSION}.#{RUBY_PATCHLEVEL})"
12
+ PLUGIN_CLASS = 'Page'
13
+ DIR_TMP = "/tmp/#{Meta::NAME}/#{ENV['USER']}"
14
+ DIR_LOG = "#{DIR_TMP}/log"
15
+ LOG_MAXSIZE = 64*1024
16
+ PLUGIN_NAME = /^[a-zA-Z0-9_-]+$/
17
+ end
18
+
19
+ module Utils
20
+ mattr_accessor :cfg, :log
21
+
22
+ self.cfg = Hash.new()
23
+ cfg[:verbose] = 0
24
+ cfg[:log] = "#{Meta::DIR_LOG}/general.log"
25
+
26
+ def self.warnx(t)
27
+ m = File.basename($0) +" warning: "+ t + "\n";
28
+ $stderr.print(m);
29
+ log.warn(m.chomp) if log
30
+ end
31
+
32
+ def self.errx(ec, t)
33
+ m = File.basename($0) +" error: "+ t + "\n"
34
+ $stderr.print(m);
35
+ log.error(m.chomp) if log
36
+ exit(ec)
37
+ end
38
+
39
+ def self.veputs(level, t)
40
+ if cfg[:verbose] >= level then
41
+ # p log
42
+ log.info(t.chomp) if log
43
+ print(t)
44
+ end
45
+ end
46
+
47
+ def self.vewarnx(level, t)
48
+ warnx(t) if cfg[:verbose] >= level
49
+ end
50
+
51
+ # Logs, pidfiles and other temporary stuff live here.
52
+ def self.dir_tmp_create()
53
+ if ! File.writable?(Meta::DIR_TMP) then
54
+ begin
55
+ t = '/'
56
+ Meta::DIR_TMP.split('/')[1..-1].each {|i|
57
+ t += i + '/'
58
+ Dir.mkdir(t) if ! Dir.exist?(t)
59
+ }
60
+ rescue
61
+ warnx("cannot create/open directory #{Meta::DIR_TMP} for writing")
62
+ end
63
+ end
64
+ end
65
+
66
+ def self.log_start()
67
+ dir_tmp_create()
68
+ begin
69
+ Dir.mkdir(Meta::DIR_LOG) if ! File.writable?(Meta::DIR_LOG)
70
+ log = Logger.new(cfg[:log], 2, Meta::LOG_MAXSIZE)
71
+ rescue
72
+ warnx("cannot open log #{cfg[:log]}");
73
+ return nil
74
+ end
75
+ log.level = Logger::DEBUG
76
+ log.datetime_format = "%H:%M:%S"
77
+ log.info("#{$0} starting")
78
+ log
79
+ end
80
+ self.log = log_start()
81
+
82
+ # Loads (via <tt>require()</tt>) Ruby code from _path_ (the full path to
83
+ # the file). <em>class_name</em> is the name of the class to check
84
+ # for existence after successful plugin loading.
85
+ def self.plugin_load(path, class_name)
86
+ begin
87
+ require(path)
88
+ # look the class up by name instead of eval()
89
+ fail "class #{class_name} isn't defined" if (! Object.const_defined?(class_name) || ! Object.const_get(class_name).is_a?(Class))
90
+ rescue LoadError
91
+ errx(1, "cannot load plugin '#{path}'");
92
+ rescue Exception
93
+ errx(1, "plugin '#{path}' has errors: #{$!}\n\nBacktrace:\n\n#{$!.backtrace.join("\n")}")
94
+ end
95
+ end
96
+
97
+ # Parses command line options. _arr_ is an array of options (usually
98
+ # +ARGV+). _banner_ is a help string that describes what your
99
+ # program does.
100
+ #
101
+ # If _o_ is non-nil, the function parses _arr_ immediately; otherwise it
102
+ # only creates an +OptionParser+ object and returns it (if _simple_ is
103
+ # false). See the <tt>bwkfanboy</tt> script for examples.
104
+ def self.cl_parse(arr, banner, o = nil, simple = false)
105
+ if ! o then
106
+ o = OptionParser.new
107
+ o.banner = banner
108
+ o.on('-v', 'Be more verbose.') { |i| Bwkfanboy::Utils.cfg[:verbose] += 1 }
109
+ return o if ! simple
110
+ end
111
+
112
+ begin
113
+ o.parse!(arr)
114
+ rescue
115
+ Bwkfanboy::Utils.errx(1, $!.to_s)
116
+ end
117
+ end
118
+
119
+ # used in CGI and WEBrick examples
120
+ def self.cmd_run(cmd)
121
+ pid, stdin, stdout, stderr = Open4::popen4(cmd)
122
+ ignored, status = Process::waitpid2(pid)
123
+ [status.exitstatus, stderr.read, stdout.read]
124
+ end
125
+
126
+ def self.gem_dir_system
127
+ t = ["#{File.dirname(File.expand_path($0))}/../lib/#{Meta::NAME}",
128
+ "#{Gem.dir}/gems/#{Meta::NAME}-#{Meta::VERSION}/lib/#{Meta::NAME}"]
129
+ t.each {|i| return i if File.readable?(i) }
130
+ raise "both paths are invalid: #{t}"
131
+ end
132
+
133
+ end # utils
134
+ end
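The `-v` handling in `cl_parse` relies on OptionParser calling the block once per occurrence of the switch, so `-vv` raises the verbosity by two. Below is a minimal standalone sketch of that behaviour. `parse_verbosity` is a hypothetical helper, and it keeps the counter in a local variable rather than in `Utils.cfg`.

```ruby
require 'optparse'

# Returns [verbosity level, remaining arguments] without modifying argv.
def parse_verbosity(argv)
  verbose = 0
  parser = OptionParser.new
  parser.banner = 'Usage: example [-v] plugin.rb'
  # The block runs once for every -v, including bundled ones like -vv.
  parser.on('-v', 'Be more verbose.') { verbose += 1 }
  rest = parser.parse(argv) # non-destructive variant of parse!
  [verbose, rest]
end

p parse_verbosity(%w[-vv path/to/a/plugin.rb])
```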
@@ -0,0 +1,29 @@
1
+ require 'nokogiri'
2
+
3
+ class Page < Bwkfanboy::Parse
4
+ module Meta
5
+ URI = "html/bwk.html"
6
+ ENC = 'UTF-8'
7
+ VERSION = 1
8
+ COPYRIGHT = '(c) 2010 Alexander Gromnitsky'
9
+ TITLE = "Brian Kernighan's articles from Daily Princetonian"
10
+ CONTENT_TYPE = 'html'
11
+ end
12
+
13
+ def myparse()
14
+ url = "http://www.dailyprincetonian.com"
15
+
16
+ doc = Nokogiri::HTML(STDIN, nil, Meta::ENC)
17
+ doc.xpath("//div[@class='article_item']").each {|i|
18
+ t = clean(i.xpath("h2/a").children.text())
19
+ link = clean(i.xpath("h2/a")[0].attributes['href'].value())
20
+ fail 'unable to extract link' if link.empty?
21
+ l = url + link + "print"
22
+ u = date(i.xpath("h2").children[1].text())
23
+ a = clean(i.xpath("div/span/a[1]").children.text())
24
+ c = clean(i.xpath("div[@class='summary']").text())
25
+
26
+ self << { title: t, link: l, updated: u, author: a, content: c }
27
+ }
28
+ end
29
+ end
File without changes
data/test/popen4.sh ADDED
@@ -0,0 +1,4 @@
1
+ #!/bin/sh
2
+ echo this is stdin
3
+ echo this is stderr 1>&2
4
+ exit 32
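A fixture like this one exercises a helper that returns a command's exit status, stderr and stdout. The sketch below uses `Open3.capture3` from the standard library, which is a swapped-in technique: the package itself uses the open4 gem. It assumes a POSIX `sh` is available, and the inline command only mimics the fixture's shape.

```ruby
require 'open3'

# Runs cmd and returns [exit status, stderr, stdout].
# capture3 reads both streams while the child runs, so large output
# cannot fill a pipe and stall the child.
def cmd_run(cmd)
  stdout, stderr, status = Open3.capture3(cmd)
  [status.exitstatus, stderr, stdout]
end

p cmd_run("sh -c 'echo to stdout; echo to stderr 1>&2; exit 32'")
```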