bwkfanboy 0.0.1

data/doc/plugin.rdoc ADDED
@@ -0,0 +1,118 @@
= HOWTO Write a \Plugin

First of all, look at the examples provided with bwkfanboy. They are
intended to be 100% working, since I wrote them for myself.

Basically, all you need is to write a class named _Page_ that
inherits from Bwkfanboy::Parse, overrides the #myparse method in the
child, and contains a simple module named _Meta_ inside your _Page_
class.

== Skeleton

Here is a skeleton of a plugin:

  require 'nokogiri'

  class Page < Bwkfanboy::Parse
    module Meta
      URI = 'http://example.org/news'
      ENC = 'UTF-8'
      VERSION = 1
      COPYRIGHT = '(c) 2010 John Doe'
      TITLE = "News from example.org"
      CONTENT_TYPE = 'html'
    end

    def myparse()
      # read stdin and parse it
      doc = Nokogiri::HTML(STDIN, nil, Meta::ENC)
      doc.xpath("XPATH QUERY").each {|i|
        t = clean(i.xpath("XPATH QUERY").text())
        l = clean(i.xpath("XPATH QUERY").text())
        u = date(i.xpath("XPATH QUERY").text())
        a = clean(i.xpath("XPATH QUERY").text())
        c = clean(i.xpath("XPATH QUERY").text())

        self << { title: t, link: l, updated: u, author: a, content: c }
      }
    end
  end

As you see, we are using Nokogiri for HTML parsing. You are not
required to use it too--use whatever parser you like. Nokogiri is
nice because it's able to read broken HTML and search through it via
XPath. If you would like to use, for example, REXML, beware that it
loves only strict XML--you may need to clean the HTML first with an
external utility such as Tidy.

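To illustrate the difference, here is a small stdlib-only sketch (the
sample markup is invented): REXML happily parses well-formed XML but
raises on an unclosed tag, which is exactly the kind of input
Nokogiri would quietly repair.

```ruby
require 'rexml/document'

# Well-formed XML is fine for REXML...
doc = REXML::Document.new('<p><b>bold</b></p>')
puts doc.elements['//b'].text          # => "bold"

# ...but broken HTML (an unclosed <b> tag) makes it raise, which is
# why such input may need a cleanup pass before REXML sees it.
begin
  REXML::Document.new('<p>unclosed <b>bold')
rescue REXML::ParseException
  puts 'REXML refused the broken markup'
end
```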
Bwkfanboy loads a plugin from a single file as valid Ruby code. This
means that the plugin can contain *any* Ruby code, but it doesn't
mean that it should.

=== \Meta

Module _Meta_ may contain only constants--and *all* the constants
listed in the skeleton are required.

* <tt>URI</tt>--can be an <tt>http(s)://</tt> or <tt>ftp://</tt> URL
  or just a path to a file on your local machine, such as
  <tt>/home/bob/huzza.html</tt>. This is the source that bwkfanboy
  will transform into the Atom feed.

* <tt>ENC</tt>--the encoding of the URI.

* <tt>VERSION</tt>--the version of the plugin.

* <tt>COPYRIGHT</tt>--some boring string.

* <tt>TITLE</tt>--a short description of the future feed. It'll be
  used later in the resulting XML.

* <tt>CONTENT_TYPE</tt>--one of the +xhtml+, +html+ or +text+ values.
  This is a very important constant because it specifies the format
  in which entries will be placed in the feed. Usually it's safe to
  use +html+.

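bwkfanboy verifies these constants itself after loading a plugin (see
the check method in the Parse class); here is a rough standalone
sketch of that validation, using a made-up _Meta_ module:

```ruby
# A stand-in Meta module; the values are invented for illustration.
module Meta
  URI = 'http://example.org/news'
  ENC = 'UTF-8'
  VERSION = 1
  COPYRIGHT = '(c) 2010 John Doe'
  TITLE = 'News from example.org'
  CONTENT_TYPE = 'html'
end

# Every required constant must be defined and non-blank.
[:URI, :ENC, :VERSION, :COPYRIGHT, :TITLE, :CONTENT_TYPE].each do |i|
  fail "Meta::#{i} not defined or empty" if !Meta.const_defined?(i) ||
                                            Meta.const_get(i).to_s =~ /^\s*$/
end
puts 'Meta looks complete'
```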
=== myparse

In the #myparse method, read stdin. Its contents are the raw HTML
you want to parse. The general idea:

* An Atom feed must contain at least 1 entry, so look in the HTML for
  some crap which you break into 5 pieces: the title of the entry, a
  link for it, a date for the entry, the author of the entry and its
  contents.

* After you scan and grab 1 entry, create a hash and add it to _self_
  as in the skeleton:

    self << { title: t, link: l, updated: u, author: a, content: c }

  Here the variables _t_, _l_, _u_, _a_ and _c_ contain the actual
  values of the 5 pieces for the entry. The names of the keys in the
  hash are important, of course--don't invent your own.

* There will probably be more crap in the HTML that you can use to
  construct another entry. Keep parsing and adding entries.

* While you're scanning, use the 2 helper methods for cleaning each
  piece: \#clean, which removes duplicate spaces, and #date, which
  parses a string and returns a date in ISO 8601 format. You may
  override the #date method if you like.

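For reference, the two helpers behave roughly like this standalone
sketch (mirroring clean and date in the Parse class; the sample
inputs are invented):

```ruby
require 'date'

# Collapses runs of whitespace into single spaces and trims the ends,
# like Bwkfanboy::Parse#clean.
def clean(s)
  s.gsub(/\s+/, ' ').strip
end

# Parses a date string and returns it in ISO 8601 format, like
# Bwkfanboy::Parse#date (minus the fallback to the current time).
def date(s)
  DateTime.parse(clean(s)).iso8601
end

puts clean("  Brian   Kernighan\n")  # => "Brian Kernighan"
puts date('May 5, 2010')             # => "2010-05-05T00:00:00+00:00"
```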
== How to test all this

To test how nicely your plugin works, save the HTML page to a file
and type:

  % bwkparser_parse -vv path/to/a/plugin.rb < saved_page.html

to see the result as plain text, or

  % bwkparser_parse -v path/to/a/plugin.rb < saved_page.html

to see it as pretty JSON.

<tt>bwkparser_parse</tt> returns 0 if no errors occurred, or >= 1 if
you have errors in your plugin code. N.B.: the output from
<tt>bwkparser_parse</tt> is always in UTF-8.
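Between parsing and feed generation, bwkfanboy passes the entries
around as JSON with the shape sketched below (cf. to_json in the
Parse class and the bundled JSON schema); all the entry values here
are invented:

```ruby
require 'json'

# A hand-built example of the intermediate structure; one entry only.
entry = {
  title:   'An example entry',
  link:    'http://example.org/news/1',
  updated: '2010-05-05T00:00:00+00:00',
  author:  'John Doe',
  content: 'Hello.'
}

feed = {
  channel: {
    updated: entry[:updated],
    id:      'http://example.org/news',
    author:  'bwkfanboy',
    title:   'News from example.org',
    link:    'http://example.org/news',
    x_entries_content_type: 'html'
  },
  x_entries: [entry]
}

puts JSON.pretty_generate(feed)
```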
@@ -0,0 +1,143 @@
require 'json'
require 'date'

require_relative 'utils'

# :include: ../../doc/README.rdoc
module Bwkfanboy

  # :include: ../../doc/plugin.rdoc
  class Parse
    ENTRIES_MAX = 64

    def initialize()
      @entries = []
    end

    # Invokes #myparse & checks whether it has grabbed something.
    def parse()
      @entries = []
      begin
        myparse()
      rescue
        @entries = []
        Utils.errx(1, "parsing failed: #{$!}\n\nBacktrace:\n\n#{$!.backtrace.join("\n")}")
      end
      Utils.errx(1, "plugin returned no output") if @entries.length == 0
    end

    # Prints entries as 'key: value' formatted strings. Intended for
    # debugging.
    def dump()
      @entries.each {|i|
        puts "title   : " + i[:title]
        puts "link    : " + i[:link]
        puts "updated : " + i[:updated]
        puts "author  : " + i[:author]
        puts "content : " + i[:content]
        puts ""
      }
    end

    def to_json()
      # guess the time of the most recent entry
      u = DateTime.parse() # the no-arg default is January 1, 4713 BCE
      @entries.each {|i|
        t = DateTime.parse(i[:updated])
        u = t if t > u
      }

      m = get_meta()
      j = {
        channel: {
          updated: u,
          id: m::URI,
          author: Meta::NAME, # just a placeholder
          title: m::TITLE,
          link: m::URI,
          x_entries_content_type: m::CONTENT_TYPE
        },
        x_entries: @entries
      }
      Utils::cfg[:verbose] >= 1 ? JSON.pretty_generate(j) : JSON.generate(j)
    end

    # After loading a plugin, one can do basic validation of the
    # plugin's class with the help of this method.
    def check
      m = get_meta()
      begin
        [:URI, :ENC, :VERSION, :COPYRIGHT, :TITLE, :CONTENT_TYPE].each {|i|
          fail "#{m}::#{i} not defined or empty" if (! m.const_defined?(i) || m.const_get(i) =~ /^\s*$/)
        }
      rescue
        Utils.errx(1, "incomplete plugin: #{$!}")
      end
    end

    # Prints plugin's meta information.
    def dump_info()
      m = get_meta()
      puts "Version   : #{m::VERSION}"
      puts "Copyright : #{m::COPYRIGHT}"
      puts "Title     : #{m::TITLE}"
      puts "URI       : #{m::URI}"
    end

    protected

    # This *must* be overridden in the child.
    def myparse()
      raise "plugin isn't finished yet"
    end

    # Tries to parse _s_ as a date string. Returns the result in
    # ISO 8601 format.
    def date(s)
      begin
        DateTime.parse(clean(s)).iso8601()
      rescue
        Utils.vewarnx(2, "#{s} is unparsable; date is set to current")
        DateTime.now().iso8601()
      end
    end

    # Checks whether the maximum number of entries has been reached.
    def toobig?
      @entries.length >= ENTRIES_MAX
    end

    def <<(t)
      if toobig? then
        Utils.warnx("reached max number of entries (#{ENTRIES_MAX})")
        return @entries
      end

      %w(updated author link).each { |i|
        fail "unable to extract '#{i}'" if ! t.key?(i.to_sym) || t[i.to_sym] == nil || t[i.to_sym].empty?
      }
      %w(title content).each { |i|
        fail "missing '#{i}'" if ! t.key?(i.to_sym) || t[i.to_sym] == nil
      }
      # this check is redundant unless the user has overridden the date() method
      if t[:updated] !~ /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\+\d{2}:\d{2}/ then
        fail "'#{t[:updated]}' isn't in iso8601 format"
      end
      @entries << t
    end

    private

    def clean(s)
      s.gsub(/\s+/, ' ').strip()
    end

    def get_meta()
      Utils.errx(1, "incomplete plugin: no #{self.class}::Meta module") if (! defined?(self.class::Meta) || ! self.class::Meta.is_a?(Module))
      self.class::Meta
    end

  end # class

end # module
@@ -0,0 +1,33 @@
# A simple plugin that parses the listing of bwk's articles from
# dailyprincetonian.com.

require 'nokogiri'

class Page < Bwkfanboy::Parse
  module Meta
    URI = 'http://www.dailyprincetonian.com/advanced_search/?author=Brian+Kernighan'
    URI_DEBUG = '/home/alex/lib/software/alex/bwkfanboy/test/semis/bwk.html'
    ENC = 'UTF-8'
    VERSION = 1
    COPYRIGHT = '(c) 2010 Alexander Gromnitsky'
    TITLE = "Brian Kernighan's articles from Daily Princetonian"
    CONTENT_TYPE = 'html'
  end

  def myparse()
    url = "http://www.dailyprincetonian.com"

    doc = Nokogiri::HTML(STDIN, nil, Meta::ENC)
    doc.xpath("//div[@class='article_item']").each {|i|
      t = clean(i.xpath("h2/a").children.text())
      link = clean(i.xpath("h2/a")[0].attributes['href'].value())
      fail 'unable to extract link' if link.empty?
      l = url + link + "print"
      u = date(i.xpath("h2").children[1].text())
      a = clean(i.xpath("div/span/a[1]").children.text())
      c = clean(i.xpath("div[@class='summary']").text())

      self << { title: t, link: l, updated: u, author: a, content: c }
    }
  end
end
@@ -0,0 +1,76 @@
require 'digest/md5'

class Page < Bwkfanboy::Parse
  module Meta
    URI = '/usr/ports/UPDATING'
    URI_DEBUG = URI
    ENC = 'ASCII'
    VERSION = 1
    COPYRIGHT = '(c) 2010 Alexander Gromnitsky'
    TITLE = "News from FreeBSD ports"
    CONTENT_TYPE = 'text'
  end

  def myadd(ready, t, l, u, a, c)
    return true if ! ready
    return false if toobig?
    self << { title: t, link: l, updated: u, author: a, content: c.rstrip }
    return true
  end

  def clean(t)
    t = t[2..-1] if t[0] != "\t"
    return '' if t == nil
    return t
  end

  def myparse()
    re_u = /^(\d{8}):$/
    re_t1 = /^ {2}AFFECTS:\s+(.+)$/
    re_t2 = /^\s+(.+)$/
    re_a = /^ {2}AUTHOR:\s+(.+)$/

    ready = false
    mode = nil
    t = l = u = a = c = nil
    while line = STDIN.gets
      line.rstrip!

      if line =~ re_u then
        # add a new entry
        break if ! myadd(ready, t, l, u, a, c)
        ready = true
        u = date($1)
        l = $1 # partial, see below
        t = a = c = nil
        next
      end

      if ready then
        if line =~ re_t1 then
          mode = 'title'
          t = $1
          c = clean($&) + "\n"
          # the link should be unique
          l = "file://#{Meta::URI}\##{l}-#{Digest::MD5.hexdigest($1)}"
        elsif line =~ re_a then
          mode = 'author'
          a = $1
          c += clean($&) + "\n"
        elsif line =~ re_t2 && mode == 'title' then
          t += ' ' + $1
          c += clean($&) + "\n"
        else
          # content
          c += clean(line) + "\n"
          mode = nil
        end
      end

      # skipping the preamble
    end

    # add the last entry
    myadd(ready, t, l, u, a, c)
  end
end
@@ -0,0 +1,39 @@
{
  "type": "object",
  "properties": {
    "channel": {
      "type": "object",
      "properties": {
        "updated": {
          "type": "string",
          "format": "date-time"
        },
        "id": { "type": "string" },
        "author": { "type": "string" },
        "title": { "type": "string" },
        "link": { "type": "string" },
        "x_entries_content_type": {
          "type": "string",
          "enum": ["text", "html", "xhtml"]
        }
      }
    },
    "x_entries": {
      "type": "array",
      "minItems": 1,
      "items": {
        "type": "object",
        "properties": {
          "title": { "type": "string" },
          "link": { "type": "string" },
          "updated": {
            "type": "string",
            "format": "date-time"
          },
          "author": { "type": "string" },
          "content": { "type": "string" }
        }
      }
    }
  }
}
@@ -0,0 +1,134 @@
require 'optparse'
require 'logger'

require 'open4'
require 'active_support/core_ext/module/attribute_accessors'

module Bwkfanboy
  module Meta
    NAME = 'bwkfanboy'
    VERSION = '0.0.1'
    USER_AGENT = "#{NAME}/#{VERSION} (#{RUBY_PLATFORM}; N; #{Encoding.default_external.name}; #{RUBY_ENGINE}; rv:#{RUBY_VERSION}.#{RUBY_PATCHLEVEL})"
    PLUGIN_CLASS = 'Page'
    DIR_TMP = "/tmp/#{Meta::NAME}/#{ENV['USER']}"
    DIR_LOG = "#{DIR_TMP}/log"
    LOG_MAXSIZE = 64*1024
    PLUGIN_NAME = /^[a-zA-Z0-9_-]+$/
  end

  module Utils
    mattr_accessor :cfg, :log

    self.cfg = Hash.new()
    cfg[:verbose] = 0
    cfg[:log] = "#{Meta::DIR_LOG}/general.log"

    def self.warnx(t)
      m = File.basename($0) + " warning: " + t + "\n"
      $stderr.print(m)
      log.warn(m.chomp) if log
    end

    def self.errx(ec, t)
      m = File.basename($0) + " error: " + t + "\n"
      $stderr.print(m)
      log.error(m.chomp) if log
      exit(ec)
    end

    def self.veputs(level, t)
      if cfg[:verbose] >= level then
        log.info(t.chomp) if log
        print(t)
      end
    end

    def self.vewarnx(level, t)
      warnx(t) if cfg[:verbose] >= level
    end

    # Logs, pidfiles and other temporary stuff sit here.
    def self.dir_tmp_create()
      if ! File.writable?(Meta::DIR_TMP) then
        begin
          t = '/'
          Meta::DIR_TMP.split('/')[1..-1].each {|i|
            t += i + '/'
            Dir.mkdir(t) if ! Dir.exists?(t)
          }
        rescue
          warnx("cannot create/open directory #{Meta::DIR_TMP} for writing")
        end
      end
    end

    def self.log_start()
      dir_tmp_create()
      begin
        Dir.mkdir(Meta::DIR_LOG) if ! File.writable?(Meta::DIR_LOG)
        log = Logger.new(cfg[:log], 2, Meta::LOG_MAXSIZE)
      rescue
        warnx("cannot open log #{cfg[:log]}")
        return nil
      end
      log.level = Logger::DEBUG
      log.datetime_format = "%H:%M:%S"
      log.info("#{$0} starting")
      log
    end
    self.log = log_start()

    # Loads (via <tt>require()</tt>) Ruby code from _path_ (the full
    # path to the file). <em>class_name</em> is the name of the class to
    # check for existence after the plugin loads successfully.
    def self.plugin_load(path, class_name)
      begin
        require(path)
        # TODO: get rid of eval()
        fail "class #{class_name} isn't defined" if (! eval("defined?#{class_name}") || ! eval(class_name).is_a?(Class))
      rescue LoadError
        errx(1, "cannot load plugin '#{path}'")
      rescue Exception
        errx(1, "plugin '#{path}' has errors: #{$!}\n\nBacktrace:\n\n#{$!.backtrace.join("\n")}")
      end
    end

    # Parses command line options. _arr_ is an array of options (usually
    # +ARGV+). _banner_ is a help string that describes what your
    # program does.
    #
    # If _o_ is nil, the function creates an +OptionParser+ object and
    # returns it without parsing (unless _simple_ is true); otherwise it
    # parses _arr_ immediately. See the <tt>bwkfanboy</tt> script for
    # examples.
    def self.cl_parse(arr, banner, o = nil, simple = false)
      if ! o then
        o = OptionParser.new
        o.banner = banner
        o.on('-v', 'Be more verbose.') { |i| Bwkfanboy::Utils.cfg[:verbose] += 1 }
        return o if ! simple
      end

      begin
        o.parse!(arr)
      rescue
        Bwkfanboy::Utils.errx(1, $!.to_s)
      end
    end

    # Used in the CGI and WEBrick examples.
    def self.cmd_run(cmd)
      pid, stdin, stdout, stderr = Open4::popen4(cmd)
      ignored, status = Process::waitpid2(pid)
      [status.exitstatus, stderr.read, stdout.read]
    end

    def self.gem_dir_system
      t = ["#{File.dirname(File.expand_path($0))}/../lib/#{Meta::NAME}",
           "#{Gem.dir}/gems/#{Meta::NAME}-#{Meta::VERSION}/lib/#{Meta::NAME}"]
      t.each {|i| return i if File.readable?(i) }
      raise "both paths are invalid: #{t}"
    end

  end # Utils
end
@@ -0,0 +1,29 @@
require 'nokogiri'

class Page < Bwkfanboy::Parse
  module Meta
    URI = "html/bwk.html"
    ENC = 'UTF-8'
    VERSION = 1
    COPYRIGHT = '(c) 2010 Alexander Gromnitsky'
    TITLE = "Brian Kernighan's articles from Daily Princetonian"
    CONTENT_TYPE = 'html'
  end

  def myparse()
    url = "http://www.dailyprincetonian.com"

    doc = Nokogiri::HTML(STDIN, nil, Meta::ENC)
    doc.xpath("//div[@class='article_item']").each {|i|
      t = clean(i.xpath("h2/a").children.text())
      link = clean(i.xpath("h2/a")[0].attributes['href'].value())
      fail 'unable to extract link' if link.empty?
      l = url + link + "print"
      u = date(i.xpath("h2").children[1].text())
      a = clean(i.xpath("div/span/a[1]").children.text())
      c = clean(i.xpath("div[@class='summary']").text())

      self << { title: t, link: l, updated: u, author: a, content: c }
    }
  end
end
data/test/popen4.sh ADDED
@@ -0,0 +1,4 @@
#!/bin/sh
echo this is stdin
echo this is stderr 1>&2
exit 32