ebook_tools 0.0.1
- data/CHANGELOG +2 -0
- data/README +76 -0
- data/bin/ebook_tools +196 -0
- data/ebook_tools.gemspec +38 -0
- data/lib/ebook_tools.rb +248 -0
- data/lib/epub.rb +104 -0
- data/lib/extract_book_struct.rb +415 -0
- data/lib/header_detect.rb +161 -0
- data/lib/pdf.rb +265 -0
- data/lib/txt.rb +108 -0
- data/lib/utils.rb +224 -0
- metadata +170 -0
data/CHANGELOG
ADDED
data/README
ADDED
@@ -0,0 +1,76 @@
# encoding: UTF-8
# = ExtractBookStruct
# ExtractBookStruct extracts a book's structure from e-book content. It currently supports txt, epub, and html.
# When extracting structure from a TXT document, ExtractBookStruct expects the document to meet these requirements:
#   1. The document must be encoded as UTF-8 or GB2312; UTF-8 is recommended.
#   2. The document contains only the body of the book (the title, author, table of contents, and similar information should be left out).
#   3. Paragraphs should be intact (some documents converted from PDF contain broken sentences and need preprocessing).
#   4. The document must follow normal document flow (misplaced chapters or paragraphs interfere with structure extraction).
#   5. The document must contain structural markers (for example volume, part, chapter, or section headings, or consecutive sequence numbers).
#   6. Each structural marker should be on its own line.
#
# Analysis of document structure
# A book is laid out with its own structure, usually expressed through volumes, parts, chapters, and sections, or through sequence numbers. Broadly, there are three kinds:
#   1. Text description (text): volumes, parts, chapters, and sections are expressed in words.
#   2. Numeric description (digital): all structure is expressed with numeric sequence numbers, e.g. 1 xxxxx; 1.1 xxxxx
#   3. Hybrid description (hybrid): chapters are expressed in words and sections by sequence number, e.g. 1.1 xxxxxx
# Structure is extracted with a different approach for each kind.
#
# A valid heading should satisfy the following rules:
#   1. The heading should not contain a complete sentence (it should not contain sentence terminators such as "。" or "!").
#   2. It should contain a structural marker, as follows:
#      Text description:
#        Volume:   starts with "第xxx卷"
#                  starts with "卷" followed by an ordinal, e.g. "I", "Ⅱ", "1"
#                  starts with "volume" followed by an ordinal, e.g. "I", "Ⅱ", "1"
#        Part:     starts with "第xxx部" or "第xxx篇"
#                  starts with "part" followed by an ordinal, e.g. "I", "Ⅱ", "1"
#        Chapter:  starts with "第xxx章" or "第xxx回"
#                  starts with "chapter" followed by an ordinal, e.g. "I", "Ⅱ", "1"
#        Section:  starts with "第xxx节"
#        Preface:  starts with "前" and ends with "言", with optional whitespace in between, e.g. "前言", "前 言"
#                  starts with "序" and ends with "言", with optional whitespace in between, e.g. "序言", "序 言"
#                  a single "序"
#                  starts with "序" or "序言" followed by an ordinal, e.g. "I", "Ⅱ", "1"
#                  "preface"
#                  "foreword"
#                  starts with "preface" or "foreword" followed by an ordinal, e.g. "I", "Ⅱ", "1"
#        Index:    starts with "索" and ends with "引", with optional whitespace in between, e.g. "索引", "索 引"
#                  starts with "索引" followed by an ordinal, e.g. "I", "Ⅱ", "1"
#                  "index"
#                  starts with "index" followed by an ordinal, e.g. "I", "Ⅱ", "1"
#        Appendix: starts with "附" and ends with "录", with optional whitespace in between, e.g. "附录", "附 录"
#                  starts with "附录" followed by an ordinal, e.g. "I", "Ⅱ", "1"
#                  "appendix"
#                  starts with "appendix" followed by an ordinal, e.g. "I", "Ⅱ", "1"
#        Glossary: starts with "术" and ends with "语", with optional whitespace in between, e.g. "术语", "术 语"
#                  starts with "术语" followed by an ordinal, e.g. "I", "Ⅱ", "1"
#                  "glossary"
#                  starts with "glossary" followed by an ordinal, e.g. "I", "Ⅱ", "1"
#
#      Numeric description:
#        Levels are expressed with dotted sequence numbers, separated from the heading text by whitespace, e.g. "1 管理的概念", "1.1 定义", "1.1.1 管理".
#
# == API
#
# === ExtractBookStruct.from_txt
# Extracts the table-of-contents structure from a text file. Example:
#   ExtractBookStruct.from_txt('1.txt',{:title=>'title',:author=>'author'})
#
# === ExtractBookStruct.from_epub
# Extracts the table-of-contents structure from an EPUB file. Example:
#   ExtractBookStruct.from_epub('1.epub',{:title=>'title',:author=>'author'})
#
# === ExtractBookStruct.from_html
# Extracts the table-of-contents structure from HTML. Example:
#   ExtractBookStruct.from_html('1.html',{:title=>'title',:author=>'author'})
#
# == Command-line tool
# extract_book_struct. Example:
#   extract_book_struct '1.txt', '1.xml'
#
# == Dependencies
# ExtractBookStruct depends on the following tools and packages:
#   ebook-convert: calibre cli tools.
#   uuid: ruby gem.
#   iconv: ruby gem.
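The heading rules in the README above can be sketched as a small classifier. This is an illustration written from the README text only, not the gem's implementation in lib/extract_book_struct.rb: the module name `HeadingSketch`, the method `classify`, and the assumption that the "xxx" in "第xxx章" is an Arabic or Chinese numeral are all hypothetical. Only five of the text markers are covered; index, appendix, and glossary would follow the same shape as the preface pattern.

```ruby
# encoding: UTF-8
# Illustrative sketch only: HeadingSketch and its patterns are hypothetical,
# written from the README rules rather than copied from the gem.
module HeadingSketch
  # Assumed spelling of the "xxx" in "第xxx章": Arabic or Chinese numerals.
  CN_NUM = '[0-9〇零一二三四五六七八九十百千两]+'
  # Ordinal after a keyword such as "chapter": Roman letters, Ⅰ..Ⅻ, or digits,
  # not immediately followed by another ASCII letter.
  ORD = '(?:[IVXLC]+|[Ⅰ-Ⅻ]+|\d+)(?![A-Za-z])'
  # Optional whitespace; [[:space:]] also covers full-width spaces.
  SP = '[[:space:]]*'

  TEXT_PATTERNS = {
    :volume  => /\A(?:第#{CN_NUM}卷|卷#{SP}#{ORD}|volume#{SP}#{ORD})/i,
    :part    => /\A(?:第#{CN_NUM}[部篇]|part#{SP}#{ORD})/i,
    :chapter => /\A(?:第#{CN_NUM}[章回]|chapter#{SP}#{ORD})/i,
    :section => /\A第#{CN_NUM}节/,
    :preface => /\A(?:前#{SP}言|序(?:#{SP}言)?|preface|foreword)(?:#{SP}#{ORD})?#{SP}\z/i
  }

  # Dotted sequence number, whitespace, then the heading text.
  NUMERIC = /\A(\d+(?:\.\d+)*)[[:space:]]+\S/
  # Rule 1: a heading should not contain a sentence terminator.
  SENTENCE_END = /[。!!]/

  # Returns [kind, level] for a heading line, or nil for ordinary text.
  # level is the nesting depth for numeric headings and nil otherwise.
  def self.classify(line)
    text = line.strip
    return nil if text.empty? || text =~ SENTENCE_END
    TEXT_PATTERNS.each do |kind, pattern|
      return [kind, nil] if text =~ pattern
    end
    match = NUMERIC.match(text)
    match ? [:numeric, match[1].split('.').size] : nil
  end
end
```

For a numeric heading the depth is the number of dot-separated components, so "1.1.1 管理" would be reported at level 3. Checking the text patterns before the numeric one is a design choice of this sketch, not something the README specifies.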
data/bin/ebook_tools
ADDED
@@ -0,0 +1,196 @@
#!/usr/bin/env ruby
# encoding: UTF-8
require 'rubygems'
require 'optparse'
require File.join(File.expand_path('../../',__FILE__),'lib','ebook_tools')

# Prints the usage text for +command+ (or the general overview) and exits.
def help(command=nil)
  case command
  when :convert
    puts <<-EOF
usage:
  ebook_tools convert [options] source destination

  source: the source file
  destination: the output file

options:
  -K <keywords>, --keywords <keywords> : keywords to tag the epub with.
  -F, --fix : automatically repair sentences that were broken off abnormally.
  -L <length>, --length <length> : line length, used together with --fix when repairing broken sentences.
  -H <row_count>, --header <row_count> : PDF only; the number of header rows on each page.
  --footer <row_count> : PDF only; the number of footer rows on each page.
    EOF
  when :batch_convert
    puts <<-EOF
usage:
  ebook_tools batch_convert [options] source destination

  source: the directory containing the source files
  destination: the target output directory

options:
  -K <keywords>, --keywords <keywords> : keywords to tag the epub with.
  -F, --fix : automatically repair sentences that were broken off abnormally.
  -H <row_count>, --header <row_count> : PDF only; the number of header rows on each page.
  --footer <row_count> : PDF only; the number of footer rows on each page.
    EOF
  when :extract
    puts <<-EOF
usage:
  ebook_tools extract [options] source destination

  source: the book file to extract structure from
  destination: the file the extracted book structure is written to

options:
  -T <title>, --title <title> : the book title
  -A <author>, --author <author> : the book author
  --pubdate <pubdate> : the publication date
  --publisher <publisher> : the publisher
    EOF
  when :batch_extract
    puts <<-EOF
usage:
  ebook_tools batch_extract source destination

  source: the directory containing the source files
  destination: the target output directory
    EOF
  when :paras_repair
    puts <<-EOF
usage:
  ebook_tools paras_repair [options] source destination

  source: the file whose paragraphs need repair; must be a text file
  destination: the file the repaired text is written to

options:
  -L <length>, --length <length> : the minimum length at which a broken paragraph was cut off.
    EOF
  else
    puts <<-EOF
ebook_tools: a collection of e-book processing tools, including format conversion and structure extraction.
usage:
  ebook_tools command [options] source destination

command:
  convert: convert the source file to epub; supported source formats are txt, html, epub, and pdf
  extract: extract book structure from the source file
  paras_repair: repair paragraphs in a text file
  batch_convert: convert every file in a directory to epub and store the results in the target directory
  batch_extract: extract book structure from every file in a directory and store the generated DocBook files in the target directory

input requirements:
  files must be encoded as UTF-8

For more information on a specific command, run 'ebook_tools help <command>'.
    EOF
  end

  exit
end

# Splits the raw arguments into the command, the last two arguments
# (source and destination), and the remaining argument list.
def extract_argv(argv)
  argv = argv.dup
  command = argv.shift
  source = argv[-2]
  destination = argv[-1]
  [command,source,destination,argv]
end

command,source,destination,opt_args = extract_argv(ARGV)

options = {}
parser = OptionParser.new do |opts|
  opts.on('-F','--fix') do |fix|
    options[:fix] = fix
  end

  opts.on('-H row_count','--header row_count') do |row_count|
    options[:header_rows_count] = row_count.to_i
  end

  opts.on('--footer row_count') do |row_count|
    options[:footer_rows_count] = row_count.to_i
  end

  opts.on('-K keywords','--keywords keywords') do |keywords|
    options[:keywords] = keywords
  end

  opts.on('-L length','--length length') do |length|
    options[:length] = length.to_i
  end

  opts.on('-T title','--title title','title') do |title|
    options[:title] = title
  end

  opts.on('-A author','--author author','author') do |author|
    options[:author] = author
  end

  opts.on('--publisher publisher','publisher') do |publisher|
    options[:publisher] = publisher
  end

  opts.on('--pubdate pubdate','pubdate') do |pubdate|
    options[:pubdate] = pubdate
  end

  opts.on('-h','--help') do
    help
  end
end
parser.parse opt_args
command = command.to_sym if command

# 'ebook_tools help <command>' prints the usage for that command.
help(opt_args.first && opt_args.first.to_sym) if command == :help

if source.nil? || destination.nil?
  help(command)
end

unless Utils.source_exists?(source)
  puts "error: source #{source} not found"
  exit 1
end

begin
  Utils.make_destination_dir(destination)
rescue
  puts "error: destination #{destination} could not be created"
  exit 1
end

begin
  case command
  when :convert
    if EbookTools.convert(source,destination,options)
      puts "success: #{source} converted successfully!"
    else
      puts "error: only txt, html, pdf, and epub files can be converted"
    end
  when :batch_convert
    EbookTools.batch_convert(source,destination,options)
  when :extract
    if EbookTools.allow_extract_struct?(source)
      if EbookTools.extract_book_struct_to_file(source,destination,options)
        puts "success: book structure extracted successfully!"
      else
        puts "warning: no book structure information detected."
      end
    else
      puts "error: #{source} is not a supported file format (txt, html, epub)"
    end
  when :batch_extract
    EbookTools.batch_extract_from_dir(source,destination,options)
  when :paras_repair
    EbookTools.text_paras_repair(source,destination,options)
    puts "success: #{source} repaired successfully!"
  else
    help
  end
rescue => e
  # Report the error message as well as the backtrace.
  puts "error: #{source}: #{e.message}\n#{e.backtrace.join("\n")}"
end
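The script above picks the source and destination as the last two raw arguments before any option parsing, so an invocation that ends with a flag (for example `convert in.txt out.epub -F`) would be misread. A minimal sketch of one alternative follows, assuming it is acceptable to let `OptionParser#parse` separate options from positional arguments first. The helper name `split_cli` and the reduced option set are hypothetical and not part of ebook_tools.

```ruby
require 'optparse'

# Hypothetical helper, not part of ebook_tools: returns
# [command, source, destination, options] from a raw argument list.
def split_cli(argv)
  options = {}
  parser = OptionParser.new do |o|
    o.on('-F', '--fix') { |v| options[:fix] = v }
    # Integer makes OptionParser convert and validate the value.
    o.on('-L', '--length LENGTH', Integer) { |v| options[:length] = v }
    o.on('-T', '--title TITLE') { |v| options[:title] = v }
  end
  # parse returns the arguments that are not options, in their original order.
  command, source, destination = parser.parse(argv)
  [command && command.to_sym, source, destination, options]
end

# Options may appear before, between, or after the positional arguments.
p split_cli(%w[convert -F --length 40 in.txt out.epub])
```

With this shape a missing source or destination simply comes back as nil, which the caller can use to decide when to print the usage text.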
data/ebook_tools.gemspec
ADDED
@@ -0,0 +1,38 @@
# -*- encoding: utf-8 -*-

Gem::Specification.new do |s|
  s.name = %q{ebook_tools}
  s.version = '0.0.1'

  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
  s.authors = ["Aaron"]
  s.date = %q{2013-04-01}
  s.description = %q{A collection of e-book tools.}
  s.email = %q{aaron@nonobo.com}
  s.require_paths = ["lib"]
  s.requirements = ["none"]
  s.summary = %q{A collection of e-book tools.}
  s.has_rdoc = true
  s.rdoc_options = ["--charset=UTF-8"]
  s.executables << "ebook_tools"
  s.files = [
    "README",
    "CHANGELOG",
    "bin/ebook_tools",
    "lib/ebook_tools.rb",
    "lib/extract_book_struct.rb",
    "lib/header_detect.rb",
    "lib/pdf.rb",
    "lib/txt.rb",
    "lib/epub.rb",
    "lib/utils.rb",
    "ebook_tools.gemspec"
  ]
  s.add_dependency(%q<uuid>)
  s.add_dependency(%q<iconv>)
  s.add_dependency(%q<gepub>)
  s.add_dependency(%q<poppler>)
  s.add_dependency(%q<pdf-reader>)
  s.add_dependency(%q<nokogiri>)
  s.add_dependency(%q<levenshtein>)
end
data/lib/ebook_tools.rb
ADDED
@@ -0,0 +1,248 @@
#!/usr/bin/env ruby
# encoding: UTF-8
require 'fileutils' # FileUtils is used below for the temporary directories

['utils','epub','txt','pdf','header_detect','extract_book_struct'].each do |file|
  require File.join(File.dirname(__FILE__),file)
end

module EbookTools
  extend self

  # Converts +filename+ to EPUB by dispatching on its extension to the
  # matching "<ext>2epub" method. Returns true on success and nil when
  # the format is not supported.
  def convert(filename,epub_file,options={})
    method_name = "#{File.extname(filename).gsub('.','')}2epub"
    if EbookTools.respond_to?(method_name)
      EbookTools.send(method_name,filename,epub_file,options)
      return true
    else
      return nil
    end
  end

  # txt2epub
  # Converts a plain-text file to EPUB
  def txt2epub(filename,epub_file,options={})
    basename = File.basename(filename,'.txt')
    temp_dir = "#{basename}"
    FileUtils.mkdir(temp_dir) unless File.exist?(temp_dir)

    title,outlines, content = TXT.extract_book_part(filename)

    if options[:fix]
      content = Utils.fixed_page_break(content)
    end

    html_content = TXT.gen_html_from_txt_book(title,outlines,content)
    html = Utils.wrapper_html(html_content)

    html_file = File.join([temp_dir,"#{basename}.html"].compact)
    Utils.write_file(html,html_file)
    sections = Utils.detect_sections_from_html(html_file)

    nav_file = EPUB.gen_nav_file(html_file,sections)

    EPUB.write_epub(epub_file,options.merge(:files=>[nav_file,html_file]))
  ensure
    FileUtils.remove_dir(temp_dir,true)
  end

  # html2epub
  # Converts an HTML file to EPUB
  def html2epub(filename,epub_file,options={})
    basename = File.basename(filename,'.html')
    temp_dir = "#{basename}"
    FileUtils.mkdir(temp_dir) unless File.exist?(temp_dir)

    # File.read closes the file handle after reading.
    html = File.read(filename)

    html_file = File.join([temp_dir,"#{basename}.html"].compact)
    Utils.write_file(html,html_file)
    sections = Utils.detect_sections_from_html(html_file)

    nav_file = EPUB.gen_nav_file(html_file,sections)

    EPUB.write_epub(epub_file,options.merge(:files=>[nav_file,html_file]))
  ensure
    FileUtils.remove_dir(temp_dir,true)
  end

  # pdf2epub
  # Converts a PDF file to EPUB
  def pdf2epub(filename,epub_file,options={})
    basename = File.basename(filename,'.pdf')
    temp_dir = "#{basename}"
    FileUtils.mkdir(temp_dir) unless File.exist?(temp_dir)

    pages_text = PDF.extract_pdf_pages_text(filename)
    pages_text = PDF.sanitize_page_header_and_footer(pages_text,options)
    pages_text = PDF.fixed_break_with_pages_text(pages_text)

    sections = PDF.extract_sections(filename)

    illustrations = PDF.extract_illustrations(filename,{:dir=>temp_dir})

    html_content = PDF.gen_html_from_sections_and_page_texts(sections,pages_text,illustrations)
    html = Utils.wrapper_html(html_content)
    html_file = File.join([temp_dir,"#{basename}.html"].compact)
    Utils.write_file(html,html_file)

    illustrations_path = illustrations.map{|image_path| File.join(temp_dir,image_path)}

    nav_file = EPUB.gen_nav_file(html_file,sections)

    files = [html_file,nav_file,illustrations_path].flatten

    meta = PDF.extract_pdf_meta(filename)
    epub_options = options.merge(meta).merge(:files=>files)

    EPUB.write_epub(epub_file,epub_options)

  ensure
    FileUtils.remove_dir(temp_dir,true)
  end

  # batch_convert
  # Converts every supported file under +source+ to EPUB under
  # <destination>/epub. Scanned PDFs are moved to <destination>/scan,
  # converted sources to <destination>/backup, and files that raise an
  # error to <destination>/unknown.
  def batch_convert(source,destination,options={})
    log = File.open('batch.log','a')
    success_log = File.open('success.log','a')
    error_log = File.open('error.log','a')
    scan_log = File.open('scan.log','a')
    unknown_log = File.open('unknown.log','a')

    source_path = File.absolute_path(source)
    dest_path = File.join(File.absolute_path(destination),'epub')
    scan_path = File.join(File.absolute_path(destination),'scan')
    unknown_path = File.join(File.absolute_path(destination),'unknown')
    backup_path = File.join(File.absolute_path(destination),'backup')

    format = options[:format]

    files = Utils.scan_file_from_dir(source_path,:format=>format)

    total_count = files.count
    scan_count = 0
    success_count = 0
    error_count = 0
    unknown_count = 0

    puts "count: #{total_count} file "
    log.puts "****batch convert****** : #{Time.now}"
    log.puts "#{source_path} => #{dest_path} "
    log.puts "count: #{total_count} file "

    success_log.puts "****batch convert****** : #{Time.now}"
    success_log.puts "#{source_path} => #{dest_path} "

    error_log.puts "****batch convert****** : #{Time.now}"
    error_log.puts "#{source_path} => #{dest_path} "

    scan_log.puts "****batch convert****** : #{Time.now}"
    scan_log.puts "#{source_path} => #{dest_path} "

    unknown_log.puts "****batch convert****** : #{Time.now}"
    unknown_log.puts "#{source_path} => #{dest_path} "


    files.each do |file|
      dest_file = File.join(File.dirname(File.join(dest_path,file.gsub(source_path,''))),"#{File.basename(file,File.extname(file))}.epub")

      keywords = Utils.extract_keywords_from_path(File.dirname(file).gsub(source_path,''))
      puts "start convert #{file}"

      method_name = "#{File.extname(file).gsub('.','')}2epub"
      if EbookTools.respond_to?(method_name)
        begin
          if PDF.scan_pdf?(file)
            scan_file = File.join(scan_path,file.gsub(source_path,''))
            FileUtils.mkdir_p(File.dirname(scan_file)) unless Dir.exist?(File.dirname(scan_file))
            FileUtils.mv(file,scan_file,:force=>true)
            scan_count += 1
            scan_log.puts "warning: #{file} is a scanned pdf."
          else
            EbookTools.send(method_name,file,dest_file,{:keywords=>keywords})
            success_file = File.join(backup_path,file.gsub(source_path,''))
            FileUtils.mkdir_p(File.dirname(success_file)) unless Dir.exist?(File.dirname(success_file))
            FileUtils.mv(file,success_file,:force=>true)
            success_count += 1
            # Log the file that was converted, not the source directory.
            success_log.puts "success: #{file} converted successfully!"
          end
        rescue StandardError => e
          unknown_file = File.join(unknown_path,file.gsub(source_path,''))
          FileUtils.mkdir_p(File.dirname(unknown_file)) unless Dir.exist?(File.dirname(unknown_file))
          FileUtils.mv(file,unknown_file,:force=>true)
          error_count += 1
          error_log.puts "error: #{file}: #{e.message}\n#{e.backtrace.join("\n")}"
        end
      end
    end

    success_log.puts "count: #{success_count} Time: #{Time.now} \n"
    scan_log.puts "count: #{scan_count} Time: #{Time.now} \n"
    error_log.puts "count: #{error_count} Time: #{Time.now} \n"
    unknown_log.puts "unknown: #{unknown_count} Time: #{Time.now} \n"
    log.puts "success: #{success_count} scan: #{scan_count} error: #{error_count} Time: #{Time.now} \n"

  ensure
    success_log.close
    error_log.close
    scan_log.close
    unknown_log.close
    log.close
  end

  # Returns true when the file extension is one that structure
  # extraction supports (txt, html, epub).
  def allow_extract_struct?(file)
    extname = File.extname(file)
    ['.txt','.html','.epub'].include?(extname.downcase)
  end

  # Extracts the book structure from +source+ and writes it to
  # +destination+. Returns true when structure was found, nil otherwise.
  def extract_book_struct_to_file(source,destination,options={})
    # Downcase the extension so the lookup agrees with allow_extract_struct?.
    method_name = "from_#{File.extname(source).downcase.gsub('.','')}"
    if ExtractBookStruct.respond_to?(method_name)
      docbook_xml = ExtractBookStruct.send(method_name,source,options)
      if docbook_xml
        File.open(destination,'wb'){|file|file.write docbook_xml}
        return true
      else
        return nil
      end
    end
  end

  # batch_extract_from_dir
  # batch extract book struct from dir
  # parameters:
  #   +source+ source directory
  #   +destination+ output directory
  #   +options+ optional parameter.
  #      :format the file extension to extract from; for example, pass :format=>'.txt' to process only txt files
  def batch_extract_from_dir(source,destination,options={})
    format = options.delete(:format)
    files = Utils.scan_file_from_dir(source,{:format=>format})

    files.each do |file|
      extname = File.extname(file)
      basename = File.basename(file,extname)
      dest_file = File.join(File.dirname(File.join(destination,file.gsub(source,''))),"#{basename}.xml")
      if allow_extract_struct?(file)
        puts "start extract #{file} ..."
        begin
          if extract_book_struct_to_file(file,dest_file)
            puts "success: book structure extracted successfully!"
          else
            puts "warning: no book structure information detected."
          end
        rescue StandardError => e
          puts "error: #{file}: #{e.message}\n#{e.backtrace.join("\n")}"
        end
      else
        puts "error: #{file} is not a supported file format (txt, html, epub)"
      end
    end
  end

  # text_paras_repair
  # Repairs broken paragraphs in a plain-text file
  def text_paras_repair(source_file,target_file,options={})
    content = File.read(source_file)
    # The original call passed no argument; converting +content+ is the apparent intent.
    content = Utils.to_utf8(content) unless Utils.detect_utf8(content)
    content = Utils.fixed_page_break(content,options)
    File.open(target_file,'w'){|file| file.write content}
  end
end
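Two techniques in the module above are easier to see in isolation: dispatching to a "<ext>2epub" method by file extension, and mirroring a source tree under a destination directory. The sketch below is a stand-in written for illustration, not the gem's code. `ConverterSketch`, the placeholder converter bodies, and `dest_file_for` are hypothetical, and using `Pathname#relative_path_from` in place of the `gsub` on the source path is a substitution made here so that only the leading directory is stripped.

```ruby
require 'pathname'

# Hypothetical stand-in for EbookTools; the converter bodies are placeholders.
module ConverterSketch
  extend self

  def txt2epub(source, target, options = {})
    "txt:#{source}->#{target}"
  end

  def html2epub(source, target, options = {})
    "html:#{source}->#{target}"
  end

  # Builds "<ext>2epub" from the lowercased extension and calls it if the
  # module defines it; returns nil for unsupported or missing extensions.
  def convert(source, target, options = {})
    method_name = "#{File.extname(source).downcase.delete('.')}2epub"
    return nil unless respond_to?(method_name)
    public_send(method_name, source, target, options)
  end

  # Mirrors the file's position under source_root beneath dest_root and
  # swaps the extension for .epub. Both roots are assumed to be absolute.
  def dest_file_for(file, source_root, dest_root)
    relative = Pathname.new(file).relative_path_from(Pathname.new(source_root))
    File.join(dest_root, relative.sub_ext('.epub').to_s)
  end
end
```

Adding a new input format under this scheme only requires defining another "<ext>2epub" method; `public_send` keeps the dispatch limited to public methods.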