invisiblellama-repub 0.3.3 → 0.3.4

Files changed (28)

data/History.txt +11 -0
data/README.rdoc +14 -8
data/TODO +0 -2
data/lib/repub.rb +1 -1
data/lib/repub/app.rb +3 -0
data/lib/repub/app/builder.rb +151 -154
data/lib/repub/app/fetcher.rb +10 -23
data/lib/repub/app/filter.rb +30 -0
data/lib/repub/app/options.rb +0 -6
data/lib/repub/app/parser.rb +63 -73
data/lib/repub/app/post_filters.rb +135 -0
data/lib/repub/app/pre_filters.rb +50 -0
data/lib/repub/app/profile.rb +1 -1
data/lib/repub/epub.rb +4 -3
data/lib/repub/epub/container_item.rb +49 -0
data/lib/repub/epub/{toc.rb → ncx.rb} +137 -139
data/lib/repub/epub/ocf.rb +62 -0
data/lib/repub/epub/opf.rb +136 -0
data/repub.gemspec +4 -4
data/test/epub/{test_toc.rb → test_ncx.rb} +14 -12
data/test/epub/test_ocf.rb +28 -0
data/test/epub/{test_content.rb → test_opf.rb} +25 -19
data/test/test_filter.rb +28 -0
data/test/test_parser.rb +3 -4
metadata +17 -11
data/lib/repub/epub/container.rb +0 -28
data/lib/repub/epub/content.rb +0 -178
data/test/epub/test_container.rb +0 -15

data/History.txt CHANGED

@@ -1,3 +1,14 @@
+== 0.3.4 / 2009-07-17
+* Bug fixes
+    * Pre- and post processing filters moved to separate modules.
+    * Non-conformant element IDs are now fixed automatically
+    * Regardless of the source settings, doctype now is always set to XHTML 1.0 Transitional
+    * -F (disable fixups) option removed, fixups are always on
+    * Documentation updates
+    * More tests
 == 0.3.3 / 2009-07-05
 * New features
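
The changelog entry about non-conformant element IDs does not show the fix itself. As a rough illustration only, and not the gem's actual code, normalizing an ID could look like the sketch below. It assumes a simplified, ASCII-only reading of the XML name rules (first character a letter or underscore, then letters, digits, '.', '-' or '_'); `fix_id` is a hypothetical helper name.

```ruby
# Hypothetical helper, not taken from repub: coerce an arbitrary string into
# an ID that starts with a letter or underscore and contains only letters,
# digits, '.', '-' and '_' (an ASCII-only simplification of XML names).
def fix_id(id)
  # Replace every character outside the allowed set with an underscore.
  fixed = id.to_s.strip.gsub(/[^A-Za-z0-9._-]/, '_')
  # Prefix IDs whose first character is not a letter or underscore.
  fixed = "id_#{fixed}" unless fixed =~ /\A[A-Za-z_]/
  fixed
end

puts fix_id('1 intro')
puts fix_id('chapter-2')
```

An ID that already conforms passes through unchanged, so applying the helper twice gives the same result as applying it once.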

data/README.rdoc CHANGED

@@ -27,9 +27,9 @@ broken too bad) be readable but will be lacking any metadata or TOC.
 Few examples:
-* Project Gutenberg's THE ADVENTURES OF SHERLOCK HOLMES (with proper table of contents)
+* Project Gutenberg's The Adventures Of Sherlock Holmes (with proper table of contents)
-    repub -x 'title:div[@class='book']//h1' \
+    repub -x 'title:div[@class="book"]//h1' \
       -x 'toc://table' \
       -x 'toc_item://tr' \
       http://www.gutenberg.org/dirs/etext99/advsh12h.htm
@@ -38,7 +38,7 @@ This tells Repub to look for title in the first found H1 in the DIV of class "bo
 located in the first TABLE and TOC item can be found inside TR.
 The above will produce readable ePub which can be further enhanced by removing some "noise" content:
-    repub -x 'title:div[@class='book']//h1' \
+    repub -x 'title:div[@class="book"]//h1' \
       -x 'toc://table' \
       -x 'toc_item://tr' \
       -X '//pre' -X '//hr' -X '//body/h1' -X '//body/h2' \
@@ -69,6 +69,14 @@ For example, if you later decide to regenerate Git Manual ePub without TOC at th
 Few more examples:
+* Open Packaging Format (OPF) 2.0 (one of the ePub standards, in ePub)
+    repub -x 'title://p[@class="Title"]' \
+      -x 'toc://div[@class="TOC"]' \
+      -x 'toc_item:.//p' \
+      -x 'toc_section:.//div[@class="TOCSection"]' \
+      http://www.idpf.org/2007/opf/OPF_2.0_final_spec.html
 * GNU Wget Manual
     repub -m 'creator:gnu.org' \
@@ -76,7 +84,7 @@ Few more examples:
       -X '//div[@class="contents"]' \
       http://www.gnu.org/software/wget/manual/wget.html
-* Project Gutenberg's ALICE'S ADVENTURES IN WONDERLAND
+* And finally, the "Hello World" of e-books, Alice's Adventures In Wonderland
     repub -x 'title:body/h1' -x 'toc://table' -x 'toc_item://tr' -X '//pre' -X '//hr' -X '//body/h4' \
       http://www.gutenberg.org/files/11/11-h/11-h.htm
@@ -108,8 +116,6 @@ Parser options:
   -m, --meta NAME:VALUE            Set publication information metadata NAME to VALUE.
                                    Valid metadata names are: [creator date description
                                    language publisher relation rights subject title]
-  -F, --no-fixup                   Do not attempt to make document meet XHTML 1.0 Strict.
-                                   Default is to try and fix things that are broken.
   -e, --encoding NAME              Set source document encoding. Default is to autodetect.
 Post-processing options:
@@ -144,13 +150,13 @@ Currently, only "everything-on-one-page" HTML sources are supported. Repub will
 Encoding auto-detection is slow.
-Chardet 0.9.0 is broken under Ruby 1.9.
+Chardet 0.9.0 is broken under Ruby 1.9 so if you want to use Ruby 1.9 you have to set encoding manually with -e.
 Bugs: probably. If you find any, please report them to dg at invisiblellama dot net.
 == INSTALL:
-    gem install repub
+    sudo gem install repub
 == LICENSE:
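
The README hunks above pass metadata as `-m NAME:VALUE` and list the valid names. How repub parses these arguments is not part of this diff, so the following is only a sketch under that assumption: split on the first colon and reject unknown names. `parse_meta` and `VALID_META_NAMES` are hypothetical names.

```ruby
# Metadata names as listed in the README's --meta help text.
VALID_META_NAMES = %w[creator date description language publisher
                      relation rights subject title].freeze

# Hypothetical parser for one "-m NAME:VALUE" argument.
# Returns [name_symbol, value]; raises ArgumentError on malformed input.
def parse_meta(arg)
  # Split on the first colon only, so values may themselves contain colons.
  name, value = arg.split(':', 2)
  unless VALID_META_NAMES.include?(name) && value && !value.empty?
    raise ArgumentError, "invalid metadata argument: #{arg.inspect}"
  end
  [name.to_sym, value]
end

p parse_meta('creator:gnu.org')
```

Limiting the split to two parts keeps values such as timestamps or URLs intact.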

data/TODO CHANGED

@@ -1,3 +1 @@
-* add support for rx cleaning/modifying source doc
-* make -q/-v actually do something
   more parser tokens: author(s) etc ?

data/lib/repub.rb CHANGED

@@ -1,7 +1,7 @@
 module Repub
   # :stopdoc:
-  VERSION = '0.3.3'
+  VERSION = '0.3.4'
   LIBPATH = File.expand_path(File.dirname(__FILE__)) + File::SEPARATOR
   PATH = File.dirname(LIBPATH) + File::SEPARATOR
   # :startdoc:

data/lib/repub/app.rb CHANGED

@@ -5,6 +5,9 @@ require 'repub/app/utility'
 require 'repub/app/logger'
 require 'repub/app/options'
 require 'repub/app/profile'
+require 'repub/app/filter'
+require 'repub/app/pre_filters'
+require 'repub/app/post_filters'
 require 'repub/app/fetcher'
 require 'repub/app/parser'
 require 'repub/app/builder'
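
The new requires pull in a filter module plus pre- and post-filter sets, and the builder later calls `apply_filters(input, options)` on them. The body of filter.rb is not shown in this part of the diff, so the sketch below is only a guess at one way such a chain could be organized. `FilterSketch` and `DemoFileFilters` are hypothetical names, not the gem's.

```ruby
# Hypothetical sketch; the real repub/app/filter.rb may differ.
module FilterSketch
  # Ordered list of [name, block] pairs registered on the extending module.
  def filters
    @filters ||= []
  end

  # Register a named filter. The block receives (input, options)
  # and must return the (possibly modified) input.
  def filter(name, &block)
    filters << [name, block]
  end

  # Run every registered filter in registration order.
  def apply_filters(input, options = {})
    filters.inject(input) { |acc, (_name, block)| block.call(acc, options) }
  end
end

module DemoFileFilters
  extend FilterSketch

  # Drop a leading XML declaration, if present.
  filter :strip_xml_preamble do |source, _options|
    source.sub(/\A\s*<\?xml\s+[^>]+>\s*/mi, '')
  end

  # Prepend an optional marker line taken from the options hash.
  filter :add_marker do |source, options|
    options[:marker] ? "#{options[:marker]}\n#{source}" : source
  end
end

puts DemoFileFilters.apply_filters(%(<?xml version="1.0"?>\n<html/>), :marker => '<!-- demo -->')
```

Keeping each step as a small named block lets a caller apply the whole chain with one call, which matches the shape of the `apply_filters` calls in the builder.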

data/lib/repub/app/builder.rb CHANGED

@@ -13,7 +13,7 @@ module Repub
       end
       class Builder
-        include Epub, Logger
+        include Logger
         attr_reader :output_path
         attr_reader :document_path
@@ -25,45 +25,50 @@ module Repub
         def build(parser)
           @parser = parser
-          # Initialize content.opf
-          @content = Content.new(@parser.uid)
+          # Initialize Container
+          @ocf = Epub::OCF.new
+          # Initialize Package
+          @opf = Epub::OPF.new(@parser.uid)
+          @ocf << @opf
           # Default title is the parsed one
-          @content.metadata.title = @parser.title
+          @opf.metadata.title = @parser.title
           # Override metadata values specified in options
           if @options[:metadata]
-            @content.metadata.members.each do |m|
+            @opf.metadata.members.each do |m|
               m = m.to_sym
-              next if m == :identifier   # do not allow to override uid
+              # Do not allow to override uid
+              next if m == :identifier
               if @options[:metadata][m]
-                @content.metadata[m] = @options[:metadata][m]
-                log.debug "-- Setting metadata #{m} to \"#{@content.metadata[m]}\""
+                @opf.metadata[m] = @options[:metadata][m]
+                log.debug "-- Setting metadata #{m} to \"#{@opf.metadata[m]}\""
               end
             end
           end
-          # Initialize toc.ncx
-          @toc = Toc.new(@parser.uid)
-          # TOC title is the same as in content.opf
-          @toc.title = @content.metadata.title
+          # Initialize TOC
+          @ncx = Epub::NCX.new(@parser.uid)
+          @opf << @ncx
+          @ncx.title = @opf.metadata.title
+          @ncx.nav_map.points = @parser.toc
           # Setup output filename and path
           @output_path = File.expand_path(@options[:output_path].if_blank('.'))
           if File.exist?(@output_path) && File.directory?(@output_path)
-            @output_path = File.join(@output_path, @content.metadata.title.gsub(/\s/, '_'))
+            @output_path = File.join(@output_path, @opf.metadata.title.gsub(/\s/, '_'))
           end
           @output_path = @output_path +  '.epub'
-          log.debug "-- Setting output path to #{@output_path}"
+          log.debug "-- Output path is #{@output_path}"
           # Build EPUB
           tmpdir = Dir.mktmpdir(App::name)
           begin
             FileUtils.chdir(tmpdir) do
               copy_and_process_assets
-              write_meta_inf
-              write_mime_type
-              write_content
-              write_toc
-              write_epub
+              @ncx.save
+              @opf.save
+              @ocf.save
+              @ocf.zip(@output_path)
             end
           ensure
             # Keep tmp folder if we're going open processed doc in browser
@@ -74,19 +79,17 @@ module Repub
         private
-        MetaInf = 'META-INF'
         def copy_and_process_assets
           # Copy html
-          @parser.cache.assets[:documents].each do |doc|
-            log.debug "-- Processing document #{doc}"
+          @parser.cache.assets[:documents].each do |file|
+            log.debug "-- Processing document #{file}"
             # Copy asset from cache
-            FileUtils.cp(File.join(@parser.cache.path, doc), '.')
+            FileUtils.cp(File.join(@parser.cache.path, file), '.')
             # Do post-processing
-            postprocess_file(doc)
-            postprocess_doc(doc)
-            @content.add_item(doc)
-            @document_path = File.expand_path(doc)
+            apply_file_filters(file)
+            apply_document_filters(file)
+            @opf << file
+            @document_path = File.expand_path(file)
           end
           # Copy css
@@ -95,158 +98,152 @@ module Repub
             @parser.cache.assets[:stylesheets].each do |css|
               log.debug "-- Copying stylesheet #{css}"
               FileUtils.cp(File.join(@parser.cache.path, css), '.')
-              @content.add_item(css)
+              @opf << css
             end
           elsif @options[:css] != '-'
             # Copy custom css
             log.debug "-- Using custom stylesheet #{@options[:css]}"
             FileUtils.cp(@options[:css], '.')
-            @content.add_item(File.basename(@options[:css]))
+            @opf << File.basename(@options[:css])
           end
           # Copy images
           @parser.cache.assets[:images].each do |image|
             log.debug "-- Copying image #{image}"
             FileUtils.cp(File.join(@parser.cache.path, image), '.')
-            @content.add_item(image)
+            @opf << image
           end
           # Copy external custom files (-a option)
           @options[:add].each do |file|
             log.debug "-- Copying external file #{file}"
             FileUtils.cp(file, '.')
-            @content.add_item(file)
+            @opf << file
           end if @options[:add]
         end
-        def postprocess_file(asset)
-          source = IO.read(asset)
-          # Do rx substitutions
-          @options[:rx].each do |rx|
-            rx.strip!
-            delimiter = rx[0, 1]
-            rx = rx.gsub(/\\#{delimiter}/, "\n")
-            ra = rx.split(/#{delimiter}/).reject {|e| e.empty? }.each {|e| e.gsub!(/\n/, "#{delimiter}")}
-            raise ParserException, "Invalid regular expression" if ra.empty? || ra[0].nil? || ra.size > 2
-            pattern = ra[0]
-            replacement = ra[1] || ''
-            log.info "Replacing pattern /#{pattern.gsub(/#{delimiter}/, "\\#{delimiter}")}/ with \"#{replacement}\""
-            source.gsub!(Regexp.new(pattern), replacement)
-          end if @options[:rx]
-          # Add doctype if missing
-          if source !~ /\s*<!DOCTYPE/
-            log.debug "-- Adding missing doctype"
-            source = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n" + source
-          end
-          # Save processed file
-          File.open(asset, 'w') do |f|
-            f.write(source)
-          end
-        end
-        def postprocess_doc(asset)
-          doc = Nokogiri::HTML.parse(IO.read(asset), nil, 'UTF-8')
-          # Set Content-Type charset to UTF-8
-          doc.xpath('//head/meta[@http-equiv="Content-Type"]').each do |el|
-            el['content'] = 'text/html; charset=utf-8'
-          end
-          # Process styles
-          if @options[:css] && !@options[:css].empty?
-            # Remove all stylesheet links
-            doc.xpath('//head/link[@rel="stylesheet"]').remove
-            if @options[:css] == '-'
-              # Also remove all inline styles
-              doc.xpath('//head/style').remove
-              log.info "Removing all stylesheet links and style elements"
-            else
-              # Add custom stylesheet link
-              link = Nokogiri::XML::Node.new('link', doc)
-              link['rel'] = 'stylesheet'
-              link['type'] = 'text/css'
-              link['href'] = File.basename(@options[:css])
-              # Add as the last child so it has precedence over (possible) inline styles before
-              doc.at('//head').add_child(link)
-              log.info "Replacing CSS refs with \"#{link['href']}\""
-            end
-          end
-          # Insert elements after/before selector
-          @options[:after].each do |e|
-            selector = e.keys.first
-            fragment = e[selector]
-            element = doc.xpath(selector).first
-            if element
-              log.info "Inserting fragment \"#{fragment.to_html}\" after \"#{selector}\""
-              fragment.children.to_a.reverse.each {|node| element.add_next_sibling(node) }
-            end
-          end if @options[:after]
-          @options[:before].each do |e|
-            selector = e.keys.first
-            fragment = e[selector]
-            element = doc.xpath(selector).first
-            if element
-              log.info "Inserting fragment \"#{fragment}\" before \"#{selector}\""
-              fragment.children.to_a.each {|node| element.add_previous_sibling(node) }
-            end
-          end if @options[:before]
-          # Remove elements
-          @options[:remove].each do |selector|
-            log.info "Removing elements \"#{selector}\""
-            doc.search(selector).remove
-          end if @options[:remove]
-          # Save processed doc
-          File.open(asset, 'w') do |f|
-            if @options[:fixup] || true
-              # HACK: Nokogiri seems to ignore the fact that xmlns and other attrs aleady present
-              # in html node and adds them anyway. Just remove them here to avoid duplicates.
-              doc.root.attributes.each {|name, value| doc.root.remove_attribute(name) }
-              doc.write_xhtml_to(f, :encoding => 'UTF-8')
-            else
-              doc.write_html_to(f, :encoding => 'UTF-8')
-            end
-          end
+        def apply_file_filters(file)
+          s = PostFilters::FileFilters.apply_filters(IO.read(file), @options)
+          File.open(file, 'w') { |f| f.write(s) }
         end
-        def write_meta_inf
-          FileUtils.mkdir_p(MetaInf)
-          FileUtils.chdir(MetaInf) do
-            Epub::Container.new.save
+        def apply_document_filters(file)
+          doc = Nokogiri::HTML.parse(IO.read(file), nil, 'UTF-8')
+          doc = PostFilters::DocumentFilters.apply_filters(doc, @options)
+          File.open(file, 'w') do |f|
+            # HACK: Nokogiri seems to ignore the fact that xmlns and other attrs already present
+            # in html node and adds them anyway. Just remove them here to avoid duplicates.
+            doc.root.attributes.each {|name, value| doc.root.remove_attribute(name) }
+            doc.write_xhtml_to(f, :encoding => 'UTF-8')
           end
         end
-        def write_mime_type
-          File.open('mimetype', 'w') do |f|
-            f << 'application/epub+zip'
-          end
-        end
-        def write_content
-          @content.save
-        end
-        def write_toc
-          add_nav_points(@toc.nav_map, @parser.toc)
-          @toc.save
-        end
+        # def postprocess_file(asset)
+        #   source = IO.read(asset)
+        #
+        #   # Do rx substitutions
+        #   @options[:rx].each do |rx|
+        #     rx.strip!
+        #     delimiter = rx[0, 1]
+        #     rx = rx.gsub(/\\#{delimiter}/, "\n")
+        #     ra = rx.split(/#{delimiter}/).reject {|e| e.empty? }.each {|e| e.gsub!(/\n/, "#{delimiter}")}
+        #     raise ParserException, "Invalid regular expression" if ra.empty? || ra[0].nil? || ra.size > 2
+        #     pattern = ra[0]
+        #     replacement = ra[1] || ''
+        #     log.info "Replacing pattern /#{pattern.gsub(/#{delimiter}/, "\\#{delimiter}")}/ with \"#{replacement}\""
+        #     source.gsub!(Regexp.new(pattern), replacement)
+        #   end if @options[:rx]
+        #
+        #   # Remove xml preamble if any
+        #   preamble_rx = /^\s*<\?xml\s+[^>]+>\s*/mi
+        #   if source =~ preamble_rx
+        #     log.debug "-- Removing xml preamble"
+        #     source.sub!(preamble_rx, '')
+        #   end
+        #
+        #   # Replace doctype
+        #   doctype_rx = /^\s*<!DOCTYPE\s+[^>]+>\s*/mi
+        #   if source =~ doctype_rx
+        #     source.sub!(doctype_rx, '')
+        #   end
+        #   log.debug "-- Replacing doctype"
+        #   source = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n" + source
+        #
+        #   # Save processed file
+        #   File.open(asset, 'w') do |f|
+        #     f.write(source)
+        #   end
+        # end
-        def add_nav_points(nav_collection, toc)
-          toc.each do |t|
-            nav_point = nav_collection.add_nav_point(t.title, t.src)
-            add_nav_points(nav_point, t.subitems) if t.subitems
-          end
-        end
-        def write_epub
-          %x(zip -X9 \"#{@output_path}\" mimetype)
-          %x(zip -Xr9D \"#{@output_path}\" * -xi mimetype)
-        end
+        # def postprocess_doc(asset)
+        #   doc = Nokogiri::HTML.parse(IO.read(asset), nil, 'UTF-8')
+        #
+        # # Set Content-Type charset to UTF-8
+        # doc.xpath('//head/meta[@http-equiv="Content-Type"]').each do |el|
+        #   el['content'] = 'text/html; charset=utf-8'
+        # end
+        #
+        # # Process styles
+        # if @options[:css] && !@options[:css].empty?
+        #   # Remove all stylesheet links
+        #   doc.xpath('//head/link[@rel="stylesheet"]').remove
+        #   if @options[:css] == '-'
+        #     # Also remove all inline styles
+        #     doc.xpath('//head/style').remove
+        #     log.info "Removing all stylesheet links and style elements"
+        #   else
+        #     # Add custom stylesheet link
+        #     link = Nokogiri::XML::Node.new('link', doc)
+        #     link['rel'] = 'stylesheet'
+        #     link['type'] = 'text/css'
+        #     link['href'] = File.basename(@options[:css])
+        #     # Add as the last child so it has precedence over (possible) inline styles before
+        #     doc.at('//head').add_child(link)
+        #     log.info "Replacing CSS refs with \"#{link['href']}\""
+        #   end
+        # end
+        #
+        # # Insert elements after/before selector
+        # @options[:after].each do |e|
+        #   selector = e.keys.first
+        #   fragment = e[selector]
+        #   element = doc.xpath(selector).first
+        #   if element
+        #     log.info "Inserting fragment \"#{fragment.to_html}\" after \"#{selector}\""
+        #     fragment.children.to_a.reverse.each {|node| element.add_next_sibling(node) }
+        #   end
+        # end if @options[:after]
+        # @options[:before].each do |e|
+        #   selector = e.keys.first
+        #   fragment = e[selector]
+        #   element = doc.xpath(selector).first
+        #   if element
+        #     log.info "Inserting fragment \"#{fragment}\" before \"#{selector}\""
+        #     fragment.children.to_a.each {|node| element.add_previous_sibling(node) }
+        #   end
+        # end if @options[:before]
+        #
+        # # Remove elements
+        # @options[:remove].each do |selector|
+        #   log.info "Removing elements \"#{selector}\""
+        #   doc.search(selector).remove
+        # end if @options[:remove]
+        #
+        # # XXX
+        # # doc.xpath('//body/a').each do |a|
+        # #   wrapper = Nokogiri::XML::Node.new('p', doc)
+        # #   a.add_next_sibling(wrapper)
+        # #   wrapper << a
+        # # end
+        #
+        #   # Save processed doc
+        #   File.open(asset, 'w') do |f|
+        #     # HACK: Nokogiri seems to ignore the fact that xmlns and other attrs aleady present
+        #     # in html node and adds them anyway. Just remove them here to avoid duplicates.
+        #     doc.root.attributes.each {|name, value| doc.root.remove_attribute(name) }
+        #     doc.write_xhtml_to(f, :encoding => 'UTF-8')
+        #   end
+        # end
       end
     end