repub 0.3.2 → 0.3.3 (RubyGems)

Files changed (26)

data/History.txt +25 -9
data/README.rdoc +46 -40
data/Rakefile +1 -0
data/bin/repub +1 -1
data/lib/repub.rb +1 -1
data/lib/repub/app.rb +3 -3
data/lib/repub/app/builder.rb +84 -36
data/lib/repub/app/fetcher.rb +13 -11
data/lib/repub/app/options.rb +36 -5
data/lib/repub/app/parser.rb +1 -1
data/lib/repub/app/profile.rb +16 -15
data/lib/repub/epub/container.rb +28 -28
data/lib/repub/epub/content.rb +59 -34
data/lib/repub/epub/toc.rb +139 -139
data/repub.gemspec +3 -3
data/test/data/custom.css +3 -0
data/test/data/invisiblellama.png +0 -0
data/test/data/test.css +5 -0
data/test/data/test.html +60 -0
data/test/epub/test_container.rb +4 -4
data/test/epub/test_content.rb +42 -38
data/test/epub/test_toc.rb +19 -7
data/test/test_builder.rb +145 -1
data/test/test_fetcher.rb +79 -20
data/test/test_parser.rb +45 -32
metadata +6 -2

data/History.txt CHANGED

@@ -1,18 +1,34 @@
-== 0.2.1 / 2009-06-26
-* Initial release
+== 0.3.3 / 2009-07-05
-== 0.3.0 / 2009-06-28
+* New features
-* Switched to Nokogiri for HTML parsing
-* Better parsing for hierarchical TOCs
-* Many bug fixes
+    * Option to add external files to the generated ePub (e.g. cover images, logos etc)
+    * Option to insert HTML fragments before/after specific element
+    * It is now possible to instruct repub to remove all links to CSS and <style> elements from source doc
-== 0.3.1 / 2009-06-28
+* Bug fixes
-* Fixed App.data_path bug
+    * Metadata double namespace prefix
+    * Encoding autodetection now is done only once after download (as it was supposed to be)
+    * -e flag actually works
+    * Source doc content-type encoding now is always set to utf-8
+    * Fixed warnings in Profile helper under Ruby 1.9.1
 == 0.3.2 / 2009-06-30
 * Improved Win32 support
 * Updated documentation
+== 0.3.1 / 2009-06-28
+* Fixed App.data_path bug
+== 0.3.0 / 2009-06-28
+* Switched to Nokogiri for HTML parsing
+* Better parsing for hierarchical TOCs
+* Many bug fixes
+== 0.2.1 / 2009-06-26
+* Initial release

data/README.rdoc CHANGED

@@ -67,7 +67,7 @@ For example, if you later decide to regenerate Git Manual ePub without TOC at th
     repub -l git-manual -X '//div[@class="toc"]' http://www.kernel.org/pub/software/scm/git/docs/user-manual.html
-A few more examples:
+Few more examples:
 * GNU Wget Manual
@@ -81,47 +81,49 @@ A few more examples:
     repub -x 'title:body/h1' -x 'toc://table' -x 'toc_item://tr' -X '//pre' -X '//hr' -X '//body/h4' \
       http://www.gutenberg.org/files/11/11-h/11-h.htm
-* The Gelug-Kagyu Tradition of Mahamudra from Berzin Archives
-    repub http://www.berzinarchives.com/web/x/prn/p.html_680632258.html
 == SYNOPSIS:
-  Usage: repub [options] url
-  General options:
-    -D, --downloader NAME            Which downloader to use to get files (wget or httrack).
-                                     Default is wget.
-    -o, --output PATH                Output path for generated ePub file.
-                                     Default is /Users/dg/Projects/repub/<Parsed_Title>.epub
-    -w, --write-profile NAME         Save given options for later reuse as profile NAME.
-    -l, --load-profile NAME          Load options from saved profile NAME.
-    -W, --write-default              Save given options for later reuse as default profile.
-    -L, --list-profiles              List saved profiles.
-    -C, --cleanup                    Clean up download cache.
-    -v, --verbose                    Turn on verbose output.
-    -q, --quiet                      Turn off any output except errors.
-    -V, --version                    Show version.
-    -h, --help                       Show this help message.
-  Parser options:
-    -x, --selector NAME:VALUE        Set parser XPath selector NAME to VALUE.
-                                     Recognized selectors are: [title toc toc_item toc_section]
-    -m, --meta NAME:VALUE            Set publication information metadata NAME to VALUE.
-                                     Valid metadata names are: [creator date description
-                                     language publisher relation rights subject title]
-    -F, --no-fixup                   Do not attempt to make document meet XHTML 1.0 Strict.
-                                     Default is to try and fix things that are broken.
-    -e, --encoding NAME              Set source document encoding. Default is to auto detect.
-  Post-processing options:
-    -s, --stylesheet PATH            Use custom stylesheet at PATH to add or override existing
-                                     CSS references in the source document.
-    -X, --remove SELECTOR            Remove source element using XPath selector.
-                                     Use -X- to ignore stored profile.
-    -R, --rx /PATTERN/REPLACEMENT/   Edit source HTML using regular expressions.
-                                     Use -R- to ignore stored profile.
-    -B, --browse                     After processing, open resulting HTML in default browser.
+Repub is a simple HTML to ePub converter.
+Usage: repub [options] url
+General options:
+  -D, --downloader NAME            Which downloader to use to get files (wget or httrack).
+                                   Default is wget.
+  -o, --output PATH                Output path for generated ePub file.
+                                   Default is /Users/dg/Projects/repub/<Parsed_Title>.epub
+  -w, --write-profile NAME         Save given options for later reuse as profile NAME.
+  -l, --load-profile NAME          Load options from saved profile NAME.
+  -W, --write-default              Save given options for later reuse as default profile.
+  -L, --list-profiles              List saved profiles.
+  -C, --cleanup                    Clean up download cache.
+  -v, --verbose                    Turn on verbose output.
+  -q, --quiet                      Turn off any output except errors.
+  -V, --version                    Show version.
+  -h, --help                       Show this help message.
+Parser options:
+  -x, --selector NAME:VALUE        Set parser XPath selector NAME to VALUE.
+                                   Recognized selectors are: [title toc toc_item toc_section]
+  -m, --meta NAME:VALUE            Set publication information metadata NAME to VALUE.
+                                   Valid metadata names are: [creator date description
+                                   language publisher relation rights subject title]
+  -F, --no-fixup                   Do not attempt to make document meet XHTML 1.0 Strict.
+                                   Default is to try and fix things that are broken.
+  -e, --encoding NAME              Set source document encoding. Default is to autodetect.
+Post-processing options:
+  -s, --stylesheet PATH            Use custom stylesheet at PATH. Use -s- to remove
+                                   all links to stylesheets and <style> blocks from the source.
+  -a, --add PATH                   Add external file to the generated ePub.
+  -N, --new-fragment XHTML         Prepare document fragment for -A and -P operations.
+  -A, --after SELECTOR             Insert fragment after element with XPath selector.
+  -P, --before SELECTOR            Insert fragment before element with XPath selector.
+  -X, --remove SELECTOR            Remove source element using XPath selector.
+                                   Use -X- to ignore stored profile.
+  -R, --rx /PATTERN/REPLACEMENT/   Edit source HTML using regular expressions.
+                                   Use -R- to ignore stored profile.
+  -B, --browser                    After processing, open resulting HTML in default browser.
 == DEPENDENCIES:
@@ -140,6 +142,10 @@ Also, the following tools must be somewhere in $PATH:
 Currently, only "everything-on-one-page" HTML sources are supported. Repub will download and process all page requisites
 (stylesheets and images) but all actual content must be on one page.
+Encoding auto-detection is slow.
+Chardet 0.9.0 is broken under Ruby 1.9.
 Bugs: probably. If you find any, please report them to dg at invisiblellama dot net.
 == INSTALL:

data/Rakefile CHANGED

@@ -1,4 +1,5 @@
 begin
+  require 'rubygems'
   require 'bones'
   Bones.setup
 rescue LoadError

data/bin/repub CHANGED

@@ -1,4 +1,4 @@
-#!/usr/bin/env ruby -w
+#!/usr/bin/env ruby
 require File.expand_path(
     File.join(File.dirname(__FILE__), %w[.. lib repub]))

data/lib/repub.rb CHANGED

@@ -1,7 +1,7 @@
 module Repub
   # :stopdoc:
-  VERSION = '0.3.2'
+  VERSION = '0.3.3'
   LIBPATH = File.expand_path(File.dirname(__FILE__)) + File::SEPARATOR
   PATH = File.dirname(LIBPATH) + File::SEPARATOR
   # :startdoc:

data/lib/repub/app.rb CHANGED

@@ -31,10 +31,10 @@ module Repub
       log.level = options[:verbosity]
       log.info "Making ePub from #{options[:url]}"
-      res = build(parse(fetch))
-      log.info "Saved #{res.output_path}"
+      builder = build(parse(fetch))
+      log.info "Saved #{builder.output_path}"
-      Launchy::Browser.run(res.asset_path) if options[:browser]
+      Launchy::Browser.run(builder.document_path) if options[:browser]
     rescue RuntimeError => ex
       log.fatal "** ERROR: #{ex.to_s}"

data/lib/repub/app/builder.rb CHANGED

@@ -16,7 +16,7 @@ module Repub
         include Epub, Logger
         attr_reader :output_path
-        attr_reader :asset_path
+        attr_reader :document_path
         def initialize(options)
           @options = options
@@ -78,59 +78,69 @@ module Repub
         def copy_and_process_assets
           # Copy html
-          @parser.cache.assets[:documents].each do |asset|
-            log.debug "-- Processing document #{asset}"
+          @parser.cache.assets[:documents].each do |doc|
+            log.debug "-- Processing document #{doc}"
             # Copy asset from cache
-            FileUtils.cp(File.join(@parser.cache.path, asset), '.')
+            FileUtils.cp(File.join(@parser.cache.path, doc), '.')
             # Do post-processing
-            postprocess_file(asset)
-            postprocess_doc(asset)
-            @content.add_document(asset)
-            @asset_path = File.expand_path(asset)
+            postprocess_file(doc)
+            postprocess_doc(doc)
+            @content.add_item(doc)
+            @document_path = File.expand_path(doc)
           end
           # Copy css
           if @options[:css].nil? || @options[:css].empty?
             # No custom css, copy one from assets
             @parser.cache.assets[:stylesheets].each do |css|
               log.debug "-- Copying stylesheet #{css}"
               FileUtils.cp(File.join(@parser.cache.path, css), '.')
-              @content.add_stylesheet(css)
+              @content.add_item(css)
             end
-          else
+          elsif @options[:css] != '-'
             # Copy custom css
             log.debug "-- Using custom stylesheet #{@options[:css]}"
             FileUtils.cp(@options[:css], '.')
-            @content.add_stylesheet(File.basename(@options[:css]))
+            @content.add_item(File.basename(@options[:css]))
           end
           # Copy images
           @parser.cache.assets[:images].each do |image|
             log.debug "-- Copying image #{image}"
             FileUtils.cp(File.join(@parser.cache.path, image), '.')
-            @content.add_image(image)
+            @content.add_item(image)
           end
+          # Copy external custom files (-a option)
+          @options[:add].each do |file|
+            log.debug "-- Copying external file #{file}"
+            FileUtils.cp(file, '.')
+            @content.add_item(file)
+          end if @options[:add]
         end
         def postprocess_file(asset)
           source = IO.read(asset)
           # Do rx substitutions
-          if @options[:rx] && !@options[:rx].empty?
-            @options[:rx].each do |rx|
-              rx.strip!
-              delimiter = rx[0, 1]
-              rx = rx.gsub(/\\#{delimiter}/, "\n")
-              ra = rx.split(/#{delimiter}/).reject {|e| e.empty? }.each {|e| e.gsub!(/\n/, "#{delimiter}")}
-              raise ParserException, "Invalid regular expression" if ra.empty? || ra[0].nil? || ra.size > 2
-              pattern = ra[0]
-              replacement = ra[1] || ''
-              log.info "Replacing pattern /#{pattern.gsub(/#{delimiter}/, "\\#{delimiter}")}/ with \"#{replacement}\""
-              source.gsub!(Regexp.new(pattern), replacement)
-            end
-          end
+          @options[:rx].each do |rx|
+            rx.strip!
+            delimiter = rx[0, 1]
+            rx = rx.gsub(/\\#{delimiter}/, "\n")
+            ra = rx.split(/#{delimiter}/).reject {|e| e.empty? }.each {|e| e.gsub!(/\n/, "#{delimiter}")}
+            raise ParserException, "Invalid regular expression" if ra.empty? || ra[0].nil? || ra.size > 2
+            pattern = ra[0]
+            replacement = ra[1] || ''
+            log.info "Replacing pattern /#{pattern.gsub(/#{delimiter}/, "\\#{delimiter}")}/ with \"#{replacement}\""
+            source.gsub!(Regexp.new(pattern), replacement)
+          end if @options[:rx]
           # Add doctype if missing
           if source !~ /\s*<!DOCTYPE/
             log.debug "-- Adding missing doctype"
             source = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n" + source
           end
           # Save processed file
           File.open(asset, 'w') do |f|
             f.write(source)
@@ -139,23 +149,61 @@ module Repub
         def postprocess_doc(asset)
           doc = Nokogiri::HTML.parse(IO.read(asset), nil, 'UTF-8')
-          # Substitute custom CSS
-          if (@options[:css] && !@options[:css].empty?)
-            doc.xpath('//link[@rel="stylesheet"]').each do |link|
+          # Set Content-Type charset to UTF-8
+          doc.xpath('//head/meta[@http-equiv="Content-Type"]').each do |el|
+            el['content'] = 'text/html; charset=utf-8'
+          end
+          # Process styles
+          if @options[:css] && !@options[:css].empty?
+            # Remove all stylesheet links
+            doc.xpath('//head/link[@rel="stylesheet"]').remove
+            if @options[:css] == '-'
+              # Also remove all inline styles
+              doc.xpath('//head/style').remove
+              log.info "Removing all stylesheet links and style elements"
+            else
+              # Add custom stylesheet link
+              link = Nokogiri::XML::Node.new('link', doc)
+              link['rel'] = 'stylesheet'
+              link['type'] = 'text/css'
               link['href'] = File.basename(@options[:css])
-              log.debug "-- Replacing CSS refs with #{link[:href]}"
+              # Add as the last child so it has precedence over (possible) inline styles before
+              doc.at('//head').add_child(link)
+              log.info "Replacing CSS refs with \"#{link['href']}\""
             end
           end
-          # Remove elements
-          if @options[:remove] && !@options[:remove].empty?
-            @options[:remove].each do |selector|
-              log.info "Removing elements matching selector \"#{selector}\""
-              doc.search(selector).remove
+          # Insert elements after/before selector
+          @options[:after].each do |e|
+            selector = e.keys.first
+            fragment = e[selector]
+            element = doc.xpath(selector).first
+            if element
+              log.info "Inserting fragment \"#{fragment.to_html}\" after \"#{selector}\""
+              fragment.children.to_a.reverse.each {|node| element.add_next_sibling(node) }
             end
-          end
+          end if @options[:after]
+          @options[:before].each do |e|
+            selector = e.keys.first
+            fragment = e[selector]
+            element = doc.xpath(selector).first
+            if element
+              log.info "Inserting fragment \"#{fragment}\" before \"#{selector}\""
+              fragment.children.to_a.each {|node| element.add_previous_sibling(node) }
+            end
+          end if @options[:before]
+          # Remove elements
+          @options[:remove].each do |selector|
+            log.info "Removing elements \"#{selector}\""
+            doc.search(selector).remove
+          end if @options[:remove]
           # Save processed doc
           File.open(asset, 'w') do |f|
-            if @options[:fixup]
+            if @options[:fixup] || true
               # HACK: Nokogiri seems to ignore the fact that xmlns and other attrs aleady present
               # in html node and adds them anyway. Just remove them here to avoid duplicates.
               doc.root.attributes.each {|name, value| doc.root.remove_attribute(name) }

data/lib/repub/app/fetcher.rb CHANGED

@@ -4,6 +4,7 @@ require 'uri'
 require 'iconv'
 require 'rubygems'
+# Temporary disable warnings from chardet
 old_verbose = $VERBOSE
 $VERBOSE = false
 require 'UniversalDetector'
@@ -24,26 +25,27 @@ module Repub
         :stylesheets => %w[css],
         :images => %w[jpg jpeg png gif svg]
       }
       class Fetcher
         include Logger
         Downloaders = {
           :wget     => { :cmd => 'wget', :options => '-nv -E -H -k -p -nH -nd' },
-          :httrack  => { :cmd => 'httrack', :options => '-gB -r2 +*.css +*.jpg -*.xml -*.html' }
+          :httrack  => { :cmd => 'httrack', :options => '-gBqQ -r2 +*.css +*.jpg -*.xml -*.html' }
         }
         def initialize(options)
           @options = options
           @downloader_path, @downloader_options = ENV['REPUB_DOWNLOADER'], ENV['REPUB_DOWNLOADER_OPTIONS']
-          begin
-            downloader = Downloaders[@options[:helper].to_sym] rescue Downloaders[:wget]
-            log.debug "-- Using #{downloader[:cmd]} #{downloader[:options]}"
-            @downloader_path ||= which(downloader[:cmd])
-            @downloader_options ||= downloader[:options]
-          rescue RuntimeError
-            raise FetcherException, "unknown helper '#{@options[:helper]}'"
-          end
+          downloader =
+            begin
+              Downloaders[@options[:helper].to_sym] || Downloaders[:wget]
+            rescue
+              Downloaders[:wget]
+            end
+          log.debug "-- Using #{downloader[:cmd]} #{downloader[:options]}"
+          @downloader_path ||= which(downloader[:cmd])
+          @downloader_options ||= downloader[:options]
         end
         def fetch
@@ -82,7 +84,7 @@ module Repub
               encoding = UniversalDetector.chardet(s)['encoding']
             end
             if encoding.downcase != 'utf-8'
-              log.info "Source encoding is #{encoding}, converting to UTF-8"
+              log.info "Source encoding appears to be #{encoding}, converting to UTF-8"
               s = Iconv.conv('utf-8', encoding, IO.read(doc))
               File.open(doc, 'w') { |f| f.write(s) }
             end
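
The rewritten downloader lookup in `Fetcher#initialize` above falls back to wget in two cases: when the helper option is `nil` (so `to_sym` raises) and when the name is not a key in the table (so the hash lookup returns `nil`). A minimal sketch of that behaviour follows, assuming hypothetical names `downloader_table` and `pick_downloader` and rescuing only `NoMethodError` rather than using the bare `rescue` from the diff.

```ruby
# Table values copied from the diff above; the method wrapper is hypothetical.
def downloader_table
  {
    :wget    => { :cmd => 'wget',    :options => '-nv -E -H -k -p -nH -nd' },
    :httrack => { :cmd => 'httrack', :options => '-gBqQ -r2 +*.css +*.jpg -*.xml -*.html' }
  }
end

# Hypothetical standalone version of the lookup in Fetcher#initialize.
def pick_downloader(helper)
  table = downloader_table
  begin
    # Unknown names return nil from the hash, so || selects wget.
    table[helper.to_sym] || table[:wget]
  rescue NoMethodError
    # helper was nil (or otherwise has no to_sym): use wget.
    table[:wget]
  end
end

chosen   = pick_downloader('httrack')
fallback = pick_downloader(nil)
unknown  = pick_downloader('bogus')
```

One consequence visible in the diff is that the `FetcherException` for an unknown helper is gone, so a mistyped downloader name now silently selects wget.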

data/lib/repub/app/options.rb CHANGED

@@ -11,6 +11,9 @@ module Repub
         # Default options
         @options = {
+          :add            => [],
+          :after          => [],
+          :before         => [],
           :browser        => false,
           :css            => nil,
           :encoding       => nil,
@@ -129,10 +132,38 @@ module Repub
           opts.separator "  Post-processing options:"
           opts.on("-s", "--stylesheet PATH", String,
-            "Use custom stylesheet at PATH to add or override existing",
-            "CSS references in the source document."
-          ) { |value| options[:css] = File.expand_path(value) }
+            "Use custom stylesheet at PATH. Use -s- to remove",
+            "all links to stylesheets and <style> blocks from the source."
+          ) { |value| options[:css] = value == '-' ? value : File.expand_path(value) }
+          opts.on("-a", "--add PATH", String,
+            "Add external file to the generated ePub."
+          ) { |value| options[:add] << File.expand_path(value) }
+          opts.on("-N", "--new-fragment XHTML", String,
+            "Prepare document fragment for -A and -P operations."
+          ) do |value|
+            begin
+              @fragment = Nokogiri::HTML.fragment(value)
+            rescue Exception => ex
+              log.fatal "ERROR: invalid fragment: #{ex.to_s}"
+            end
+          end
+          opts.on("-A", "--after SELECTOR", String,
+            "Insert fragment after element with XPath selector."
+          ) do |value|
+            log.fatal "ERROR: -A requires a fragment. See '#{App.name} --help'." if !@fragment
+            @options[:after] << {value => @fragment.clone}
+          end
+          opts.on("-P", "--before SELECTOR", String,
+            "Insert fragment before element with XPath selector."
+          ) do |value|
+            log.fatal "ERROR: -P requires a fragment. See '#{App.name} --help'." if !@fragment
+            @options[:before] << {value => @fragment.clone}
+          end
           opts.on("-X", "--remove SELECTOR", String,
             "Remove source element using XPath selector.",
             "Use -X- to ignore stored profile."
@@ -143,7 +174,7 @@ module Repub
             "Use -R- to ignore stored profile."
           ) { |value| value == '-' ? options[:rx] = [] : options[:rx] << value }
-          opts.on("-B", "--browse",
+          opts.on("-B", "--browser",
             "After processing, open resulting HTML in default browser."
           ) { |value| options[:browser] = true }
@@ -177,4 +208,4 @@ module Repub
     end
   end
-end
+end
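
The new `-s` and `-a` handlers above can be tried in isolation with Ruby's standard `optparse`. This is a minimal sketch under stated assumptions: `parse_style_options` is a hypothetical helper, it covers only these two options, and the file names in the usage lines are made up for illustration.

```ruby
require 'optparse'

# Hypothetical helper: parses only the -s and -a options shown in the diff.
def parse_style_options(argv)
  options = { :css => nil, :add => [] }
  parser = OptionParser.new do |opts|
    opts.on("-s", "--stylesheet PATH", String) do |value|
      # "-" is kept verbatim as the "remove all styles" marker;
      # anything else is treated as a path and made absolute.
      options[:css] = value == '-' ? value : File.expand_path(value)
    end
    opts.on("-a", "--add PATH", String) do |value|
      # -a may be repeated; each file is collected in order.
      options[:add] << File.expand_path(value)
    end
  end
  parser.parse(argv)
  options
end

removed = parse_style_options(%w[-s-])
custom  = parse_style_options(%w[-s custom.css -a cover.png -a logo.png])
```

Keeping `-` unexpanded matters because `builder.rb` compares `@options[:css]` against the literal string `'-'`; expanding it would turn the marker into a path to a file named `-`.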