repub 0.3.2 → 0.3.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,18 +1,34 @@
1
- == 0.2.1 / 2009-06-26
2
-
3
- * Initial release
1
+ == 0.3.3 / 2009-07-05
4
2
 
5
- == 0.3.0 / 2009-06-28
3
+ * New features
6
4
 
7
- * Switched to Nokogiri for HTML parsing
8
- * Better parsing for hierarchical TOCs
9
- * Many bug fixes
5
+ * Option to add external files to the generated ePub (e.g. cover images, logos etc)
6
+ * Option to insert HTML fragments before/after specific element
7
+ * It is now possible to instruct repub to remove all links to CSS and <style> elements from source doc
10
8
 
11
- == 0.3.1 / 2009-06-28
9
+ * Bug fixes
12
10
 
13
- * Fixed App.data_path bug
11
+ * Metadata double namespace prefix
12
+ * Encoding autodetection now is done only once after download (as it was supposed to be)
13
+ * -e flag actually works
14
+ * Source doc content-type encoding now is always set to utf-8
15
+ * Fixed warnings in Profile helper under Ruby 1.9.1
14
16
 
15
17
  == 0.3.2 / 2009-06-30
16
18
 
17
19
  * Improved Win32 support
18
20
  * Updated documentation
21
+
22
+ == 0.3.1 / 2009-06-28
23
+
24
+ * Fixed App.data_path bug
25
+
26
+ == 0.3.0 / 2009-06-28
27
+
28
+ * Switched to Nokogiri for HTML parsing
29
+ * Better parsing for hierarchical TOCs
30
+ * Many bug fixes
31
+
32
+ == 0.2.1 / 2009-06-26
33
+
34
+ * Initial release
@@ -67,7 +67,7 @@ For example, if you later decide to regenerate Git Manual ePub without TOC at th
67
67
 
68
68
  repub -l git-manual -X '//div[@class="toc"]' http://www.kernel.org/pub/software/scm/git/docs/user-manual.html
69
69
 
70
- A few more examples:
70
+ Few more examples:
71
71
 
72
72
  * GNU Wget Manual
73
73
 
@@ -81,47 +81,49 @@ A few more examples:
81
81
  repub -x 'title:body/h1' -x 'toc://table' -x 'toc_item://tr' -X '//pre' -X '//hr' -X '//body/h4' \
82
82
  http://www.gutenberg.org/files/11/11-h/11-h.htm
83
83
 
84
- * The Gelug-Kagyu Tradition of Mahamudra from Berzin Archives
85
-
86
- repub http://www.berzinarchives.com/web/x/prn/p.html_680632258.html
87
-
88
84
  == SYNOPSIS:
89
85
 
90
- Usage: repub [options] url
91
-
92
- General options:
93
- -D, --downloader NAME Which downloader to use to get files (wget or httrack).
94
- Default is wget.
95
- -o, --output PATH Output path for generated ePub file.
96
- Default is /Users/dg/Projects/repub/<Parsed_Title>.epub
97
- -w, --write-profile NAME Save given options for later reuse as profile NAME.
98
- -l, --load-profile NAME Load options from saved profile NAME.
99
- -W, --write-default Save given options for later reuse as default profile.
100
- -L, --list-profiles List saved profiles.
101
- -C, --cleanup Clean up download cache.
102
- -v, --verbose Turn on verbose output.
103
- -q, --quiet Turn off any output except errors.
104
- -V, --version Show version.
105
- -h, --help Show this help message.
106
-
107
- Parser options:
108
- -x, --selector NAME:VALUE Set parser XPath selector NAME to VALUE.
109
- Recognized selectors are: [title toc toc_item toc_section]
110
- -m, --meta NAME:VALUE Set publication information metadata NAME to VALUE.
111
- Valid metadata names are: [creator date description
112
- language publisher relation rights subject title]
113
- -F, --no-fixup Do not attempt to make document meet XHTML 1.0 Strict.
114
- Default is to try and fix things that are broken.
115
- -e, --encoding NAME Set source document encoding. Default is to auto detect.
116
-
117
- Post-processing options:
118
- -s, --stylesheet PATH Use custom stylesheet at PATH to add or override existing
119
- CSS references in the source document.
120
- -X, --remove SELECTOR Remove source element using XPath selector.
121
- Use -X- to ignore stored profile.
122
- -R, --rx /PATTERN/REPLACEMENT/ Edit source HTML using regular expressions.
123
- Use -R- to ignore stored profile.
124
- -B, --browse After processing, open resulting HTML in default browser.
86
+ Repub is a simple HTML to ePub converter.
87
+
88
+ Usage: repub [options] url
89
+
90
+ General options:
91
+ -D, --downloader NAME Which downloader to use to get files (wget or httrack).
92
+ Default is wget.
93
+ -o, --output PATH Output path for generated ePub file.
94
+ Default is /Users/dg/Projects/repub/<Parsed_Title>.epub
95
+ -w, --write-profile NAME Save given options for later reuse as profile NAME.
96
+ -l, --load-profile NAME Load options from saved profile NAME.
97
+ -W, --write-default Save given options for later reuse as default profile.
98
+ -L, --list-profiles List saved profiles.
99
+ -C, --cleanup Clean up download cache.
100
+ -v, --verbose Turn on verbose output.
101
+ -q, --quiet Turn off any output except errors.
102
+ -V, --version Show version.
103
+ -h, --help Show this help message.
104
+
105
+ Parser options:
106
+ -x, --selector NAME:VALUE Set parser XPath selector NAME to VALUE.
107
+ Recognized selectors are: [title toc toc_item toc_section]
108
+ -m, --meta NAME:VALUE Set publication information metadata NAME to VALUE.
109
+ Valid metadata names are: [creator date description
110
+ language publisher relation rights subject title]
111
+ -F, --no-fixup Do not attempt to make document meet XHTML 1.0 Strict.
112
+ Default is to try and fix things that are broken.
113
+ -e, --encoding NAME Set source document encoding. Default is to autodetect.
114
+
115
+ Post-processing options:
116
+ -s, --stylesheet PATH Use custom stylesheet at PATH. Use -s- to remove
117
+ all links to stylesheets and <style> blocks from the source.
118
+ -a, --add PATH Add external file to the generated ePub.
119
+ -N, --new-fragment XHTML Prepare document fragment for -A and -P operations.
120
+ -A, --after SELECTOR Insert fragment after element with XPath selector.
121
+ -P, --before SELECTOR Insert fragment before element with XPath selector.
122
+ -X, --remove SELECTOR Remove source element using XPath selector.
123
+ Use -X- to ignore stored profile.
124
+ -R, --rx /PATTERN/REPLACEMENT/ Edit source HTML using regular expressions.
125
+ Use -R- to ignore stored profile.
126
+ -B, --browser After processing, open resulting HTML in default browser.
125
127
 
126
128
  == DEPENDENCIES:
127
129
 
@@ -140,6 +142,10 @@ Also, the following tools must be somewhere in $PATH:
140
142
  Currently, only "everything-on-one-page" HTML sources are supported. Repub will download and process all page requisites
141
143
  (stylesheets and images) but all actual content must be on one page.
142
144
 
145
+ Encoding auto-detection is slow.
146
+
147
+ Chardet 0.9.0 is broken under Ruby 1.9.
148
+
143
149
  Bugs: probably. If you find any, please report them to dg at invisiblellama dot net.
144
150
 
145
151
  == INSTALL:
data/Rakefile CHANGED
@@ -1,4 +1,5 @@
1
1
  begin
2
+ require 'rubygems'
2
3
  require 'bones'
3
4
  Bones.setup
4
5
  rescue LoadError
data/bin/repub CHANGED
@@ -1,4 +1,4 @@
1
- #!/usr/bin/env ruby -w
1
+ #!/usr/bin/env ruby
2
2
 
3
3
  require File.expand_path(
4
4
  File.join(File.dirname(__FILE__), %w[.. lib repub]))
@@ -1,7 +1,7 @@
1
1
  module Repub
2
2
 
3
3
  # :stopdoc:
4
- VERSION = '0.3.2'
4
+ VERSION = '0.3.3'
5
5
  LIBPATH = File.expand_path(File.dirname(__FILE__)) + File::SEPARATOR
6
6
  PATH = File.dirname(LIBPATH) + File::SEPARATOR
7
7
  # :startdoc:
@@ -31,10 +31,10 @@ module Repub
31
31
 
32
32
  log.level = options[:verbosity]
33
33
  log.info "Making ePub from #{options[:url]}"
34
- res = build(parse(fetch))
35
- log.info "Saved #{res.output_path}"
34
+ builder = build(parse(fetch))
35
+ log.info "Saved #{builder.output_path}"
36
36
 
37
- Launchy::Browser.run(res.asset_path) if options[:browser]
37
+ Launchy::Browser.run(builder.document_path) if options[:browser]
38
38
 
39
39
  rescue RuntimeError => ex
40
40
  log.fatal "** ERROR: #{ex.to_s}"
@@ -16,7 +16,7 @@ module Repub
16
16
  include Epub, Logger
17
17
 
18
18
  attr_reader :output_path
19
- attr_reader :asset_path
19
+ attr_reader :document_path
20
20
 
21
21
  def initialize(options)
22
22
  @options = options
@@ -78,59 +78,69 @@ module Repub
78
78
 
79
79
  def copy_and_process_assets
80
80
  # Copy html
81
- @parser.cache.assets[:documents].each do |asset|
82
- log.debug "-- Processing document #{asset}"
81
+ @parser.cache.assets[:documents].each do |doc|
82
+ log.debug "-- Processing document #{doc}"
83
83
  # Copy asset from cache
84
- FileUtils.cp(File.join(@parser.cache.path, asset), '.')
84
+ FileUtils.cp(File.join(@parser.cache.path, doc), '.')
85
85
  # Do post-processing
86
- postprocess_file(asset)
87
- postprocess_doc(asset)
88
- @content.add_document(asset)
89
- @asset_path = File.expand_path(asset)
86
+ postprocess_file(doc)
87
+ postprocess_doc(doc)
88
+ @content.add_item(doc)
89
+ @document_path = File.expand_path(doc)
90
90
  end
91
+
91
92
  # Copy css
92
93
  if @options[:css].nil? || @options[:css].empty?
93
94
  # No custom css, copy one from assets
94
95
  @parser.cache.assets[:stylesheets].each do |css|
95
96
  log.debug "-- Copying stylesheet #{css}"
96
97
  FileUtils.cp(File.join(@parser.cache.path, css), '.')
97
- @content.add_stylesheet(css)
98
+ @content.add_item(css)
98
99
  end
99
- else
100
+ elsif @options[:css] != '-'
100
101
  # Copy custom css
101
102
  log.debug "-- Using custom stylesheet #{@options[:css]}"
102
103
  FileUtils.cp(@options[:css], '.')
103
- @content.add_stylesheet(File.basename(@options[:css]))
104
+ @content.add_item(File.basename(@options[:css]))
104
105
  end
106
+
105
107
  # Copy images
106
108
  @parser.cache.assets[:images].each do |image|
107
109
  log.debug "-- Copying image #{image}"
108
110
  FileUtils.cp(File.join(@parser.cache.path, image), '.')
109
- @content.add_image(image)
111
+ @content.add_item(image)
110
112
  end
113
+
114
+ # Copy external custom files (-a option)
115
+ @options[:add].each do |file|
116
+ log.debug "-- Copying external file #{file}"
117
+ FileUtils.cp(file, '.')
118
+ @content.add_item(file)
119
+ end if @options[:add]
111
120
  end
112
121
 
113
122
  def postprocess_file(asset)
114
123
  source = IO.read(asset)
124
+
115
125
  # Do rx substitutions
116
- if @options[:rx] && !@options[:rx].empty?
117
- @options[:rx].each do |rx|
118
- rx.strip!
119
- delimiter = rx[0, 1]
120
- rx = rx.gsub(/\\#{delimiter}/, "\n")
121
- ra = rx.split(/#{delimiter}/).reject {|e| e.empty? }.each {|e| e.gsub!(/\n/, "#{delimiter}")}
122
- raise ParserException, "Invalid regular expression" if ra.empty? || ra[0].nil? || ra.size > 2
123
- pattern = ra[0]
124
- replacement = ra[1] || ''
125
- log.info "Replacing pattern /#{pattern.gsub(/#{delimiter}/, "\\#{delimiter}")}/ with \"#{replacement}\""
126
- source.gsub!(Regexp.new(pattern), replacement)
127
- end
128
- end
126
+ @options[:rx].each do |rx|
127
+ rx.strip!
128
+ delimiter = rx[0, 1]
129
+ rx = rx.gsub(/\\#{delimiter}/, "\n")
130
+ ra = rx.split(/#{delimiter}/).reject {|e| e.empty? }.each {|e| e.gsub!(/\n/, "#{delimiter}")}
131
+ raise ParserException, "Invalid regular expression" if ra.empty? || ra[0].nil? || ra.size > 2
132
+ pattern = ra[0]
133
+ replacement = ra[1] || ''
134
+ log.info "Replacing pattern /#{pattern.gsub(/#{delimiter}/, "\\#{delimiter}")}/ with \"#{replacement}\""
135
+ source.gsub!(Regexp.new(pattern), replacement)
136
+ end if @options[:rx]
137
+
129
138
  # Add doctype if missing
130
139
  if source !~ /\s*<!DOCTYPE/
131
140
  log.debug "-- Adding missing doctype"
132
141
  source = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n" + source
133
142
  end
143
+
134
144
  # Save processed file
135
145
  File.open(asset, 'w') do |f|
136
146
  f.write(source)
@@ -139,23 +149,61 @@ module Repub
139
149
 
140
150
  def postprocess_doc(asset)
141
151
  doc = Nokogiri::HTML.parse(IO.read(asset), nil, 'UTF-8')
142
- # Substitute custom CSS
143
- if (@options[:css] && !@options[:css].empty?)
144
- doc.xpath('//link[@rel="stylesheet"]').each do |link|
152
+
153
+ # Set Content-Type charset to UTF-8
154
+ doc.xpath('//head/meta[@http-equiv="Content-Type"]').each do |el|
155
+ el['content'] = 'text/html; charset=utf-8'
156
+ end
157
+
158
+ # Process styles
159
+ if @options[:css] && !@options[:css].empty?
160
+ # Remove all stylesheet links
161
+ doc.xpath('//head/link[@rel="stylesheet"]').remove
162
+ if @options[:css] == '-'
163
+ # Also remove all inline styles
164
+ doc.xpath('//head/style').remove
165
+ log.info "Removing all stylesheet links and style elements"
166
+ else
167
+ # Add custom stylesheet link
168
+ link = Nokogiri::XML::Node.new('link', doc)
169
+ link['rel'] = 'stylesheet'
170
+ link['type'] = 'text/css'
145
171
  link['href'] = File.basename(@options[:css])
146
- log.debug "-- Replacing CSS refs with #{link[:href]}"
172
+ # Add as the last child so it has precedence over (possible) inline styles before
173
+ doc.at('//head').add_child(link)
174
+ log.info "Replacing CSS refs with \"#{link['href']}\""
147
175
  end
148
176
  end
149
- # Remove elements
150
- if @options[:remove] && !@options[:remove].empty?
151
- @options[:remove].each do |selector|
152
- log.info "Removing elements matching selector \"#{selector}\""
153
- doc.search(selector).remove
177
+
178
+ # Insert elements after/before selector
179
+ @options[:after].each do |e|
180
+ selector = e.keys.first
181
+ fragment = e[selector]
182
+ element = doc.xpath(selector).first
183
+ if element
184
+ log.info "Inserting fragment \"#{fragment.to_html}\" after \"#{selector}\""
185
+ fragment.children.to_a.reverse.each {|node| element.add_next_sibling(node) }
154
186
  end
155
- end
187
+ end if @options[:after]
188
+ @options[:before].each do |e|
189
+ selector = e.keys.first
190
+ fragment = e[selector]
191
+ element = doc.xpath(selector).first
192
+ if element
193
+ log.info "Inserting fragment \"#{fragment}\" before \"#{selector}\""
194
+ fragment.children.to_a.each {|node| element.add_previous_sibling(node) }
195
+ end
196
+ end if @options[:before]
197
+
198
+ # Remove elements
199
+ @options[:remove].each do |selector|
200
+ log.info "Removing elements \"#{selector}\""
201
+ doc.search(selector).remove
202
+ end if @options[:remove]
203
+
156
204
  # Save processed doc
157
205
  File.open(asset, 'w') do |f|
158
- if @options[:fixup]
206
+ if @options[:fixup] || true
159
207
  # HACK: Nokogiri seems to ignore the fact that xmlns and other attrs aleady present
160
208
  # in html node and adds them anyway. Just remove them here to avoid duplicates.
161
209
  doc.root.attributes.each {|name, value| doc.root.remove_attribute(name) }
@@ -4,6 +4,7 @@ require 'uri'
4
4
  require 'iconv'
5
5
  require 'rubygems'
6
6
 
7
+ # Temporary disable warnings from chardet
7
8
  old_verbose = $VERBOSE
8
9
  $VERBOSE = false
9
10
  require 'UniversalDetector'
@@ -24,26 +25,27 @@ module Repub
24
25
  :stylesheets => %w[css],
25
26
  :images => %w[jpg jpeg png gif svg]
26
27
  }
27
-
28
+
28
29
  class Fetcher
29
30
  include Logger
30
31
 
31
32
  Downloaders = {
32
33
  :wget => { :cmd => 'wget', :options => '-nv -E -H -k -p -nH -nd' },
33
- :httrack => { :cmd => 'httrack', :options => '-gB -r2 +*.css +*.jpg -*.xml -*.html' }
34
+ :httrack => { :cmd => 'httrack', :options => '-gBqQ -r2 +*.css +*.jpg -*.xml -*.html' }
34
35
  }
35
36
 
36
37
  def initialize(options)
37
38
  @options = options
38
39
  @downloader_path, @downloader_options = ENV['REPUB_DOWNLOADER'], ENV['REPUB_DOWNLOADER_OPTIONS']
39
- begin
40
- downloader = Downloaders[@options[:helper].to_sym] rescue Downloaders[:wget]
41
- log.debug "-- Using #{downloader[:cmd]} #{downloader[:options]}"
42
- @downloader_path ||= which(downloader[:cmd])
43
- @downloader_options ||= downloader[:options]
44
- rescue RuntimeError
45
- raise FetcherException, "unknown helper '#{@options[:helper]}'"
46
- end
40
+ downloader =
41
+ begin
42
+ Downloaders[@options[:helper].to_sym] || Downloaders[:wget]
43
+ rescue
44
+ Downloaders[:wget]
45
+ end
46
+ log.debug "-- Using #{downloader[:cmd]} #{downloader[:options]}"
47
+ @downloader_path ||= which(downloader[:cmd])
48
+ @downloader_options ||= downloader[:options]
47
49
  end
48
50
 
49
51
  def fetch
@@ -82,7 +84,7 @@ module Repub
82
84
  encoding = UniversalDetector.chardet(s)['encoding']
83
85
  end
84
86
  if encoding.downcase != 'utf-8'
85
- log.info "Source encoding is #{encoding}, converting to UTF-8"
87
+ log.info "Source encoding appears to be #{encoding}, converting to UTF-8"
86
88
  s = Iconv.conv('utf-8', encoding, IO.read(doc))
87
89
  File.open(doc, 'w') { |f| f.write(s) }
88
90
  end
@@ -11,6 +11,9 @@ module Repub
11
11
 
12
12
  # Default options
13
13
  @options = {
14
+ :add => [],
15
+ :after => [],
16
+ :before => [],
14
17
  :browser => false,
15
18
  :css => nil,
16
19
  :encoding => nil,
@@ -129,10 +132,38 @@ module Repub
129
132
  opts.separator " Post-processing options:"
130
133
 
131
134
  opts.on("-s", "--stylesheet PATH", String,
132
- "Use custom stylesheet at PATH to add or override existing",
133
- "CSS references in the source document."
134
- ) { |value| options[:css] = File.expand_path(value) }
135
+ "Use custom stylesheet at PATH. Use -s- to remove",
136
+ "all links to stylesheets and <style> blocks from the source."
137
+ ) { |value| options[:css] = value == '-' ? value : File.expand_path(value) }
135
138
 
139
+ opts.on("-a", "--add PATH", String,
140
+ "Add external file to the generated ePub."
141
+ ) { |value| options[:add] << File.expand_path(value) }
142
+
143
+ opts.on("-N", "--new-fragment XHTML", String,
144
+ "Prepare document fragment for -A and -P operations."
145
+ ) do |value|
146
+ begin
147
+ @fragment = Nokogiri::HTML.fragment(value)
148
+ rescue Exception => ex
149
+ log.fatal "ERROR: invalid fragment: #{ex.to_s}"
150
+ end
151
+ end
152
+
153
+ opts.on("-A", "--after SELECTOR", String,
154
+ "Insert fragment after element with XPath selector."
155
+ ) do |value|
156
+ log.fatal "ERROR: -A requires a fragment. See '#{App.name} --help'." if !@fragment
157
+ @options[:after] << {value => @fragment.clone}
158
+ end
159
+
160
+ opts.on("-P", "--before SELECTOR", String,
161
+ "Insert fragment before element with XPath selector."
162
+ ) do |value|
163
+ log.fatal "ERROR: -P requires a fragment. See '#{App.name} --help'." if !@fragment
164
+ @options[:before] << {value => @fragment.clone}
165
+ end
166
+
136
167
  opts.on("-X", "--remove SELECTOR", String,
137
168
  "Remove source element using XPath selector.",
138
169
  "Use -X- to ignore stored profile."
@@ -143,7 +174,7 @@ module Repub
143
174
  "Use -R- to ignore stored profile."
144
175
  ) { |value| value == '-' ? options[:rx] = [] : options[:rx] << value }
145
176
 
146
- opts.on("-B", "--browse",
177
+ opts.on("-B", "--browser",
147
178
  "After processing, open resulting HTML in default browser."
148
179
  ) { |value| options[:browser] = true }
149
180
 
@@ -177,4 +208,4 @@ module Repub
177
208
 
178
209
  end
179
210
  end
180
- end
211
+ end