pdfbeads 1.0.9 → 1.1.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 577a47277a3e474a47b740ce93ca89041402c4fb
4
+ data.tar.gz: 88a1f950e31e41e47f79ef978d9e589e1b9724fb
5
+ SHA512:
6
+ metadata.gz: 16ab45bd0c4d63d7c6f567f3df5e2dea2ecc1e9237ac9b5acfc9cd92e31a72a8df7ad184e5c7fd779cb990f58ff64dbb3027464c7579fd79d58fc270ac01c841
7
+ data.tar.gz: b946c901407f06aeaf509988341759ba4f4cfc217067b00fb72ad5bc9ae60e4d1dd0729bb7a3761a6c06f71816dcb9398e2cde4335ee8a5620c1686e208fe901
data/ChangeLog CHANGED
@@ -55,3 +55,26 @@
55
55
  * Don't attempt to use 'ocrx_word' elements which contain no bounding box
56
56
  data (this should fix the problem with the hOCR output produced by some
57
57
  tesseract versions).
58
+
59
+ 2013 Mar 20 (Alexey Kryukov) Version 1.1.0
60
+
61
+ + It is now possible to take the text layer from another PDF document (normally
62
+ this would be a file produced by passing the same set of images to an
63
+ OCR application) and embed it into the pdfbeads output. Warning: this feature
64
+ has been tested so far only with files produced with ABBYY FineReader. It may or
65
+ may not work with PDF files generated by other OCR programs.
66
+
67
+ * The default PDF page layout is now "OneColumn".
68
+
69
+ + Make it possible to specify that the preferred reading direction for the
70
+ PDF document is left-to-right.
71
+
72
+ + In order to simplify debugging of resulting files I have added a special
73
+ flag allowing to make the hidden text layer visible and to disable
74
+ compression in page streams.
75
+
76
+ 2014 Jan 26 (Alexey Kryukov) Version 1.1.1
77
+
78
+ * hpricot is no longer developed, so switch to Nokagiri for hOCR processing.
79
+
80
+ + English HTML documentation added.
@@ -41,9 +41,12 @@ include PDFBeads
41
41
  pdfargs = Hash[
42
42
  :labels => nil,
43
43
  :toc => nil,
44
- :pagelayout => 'TwoPageRight',
44
+ :pagelayout => 'OneColumn',
45
45
  :meta => nil,
46
- :delfiles => false
46
+ :textpdf => nil,
47
+ :delfiles => false,
48
+ :debug => false,
49
+ :rtl => false
47
50
  ]
48
51
  pageargs = Hash[
49
52
  :threshold => 1,
@@ -87,6 +90,23 @@ OptionParser.new() do |opts|
87
90
 
88
91
  pdfargs[:pagelayout] = pagelayout
89
92
  end
93
+ opts.on("-R", "--right-to-left",
94
+ "Set the flag indicating that the preferred reading",
95
+ "direction for the resulting PDF file is right to left") do |rtl|
96
+ pdfargs[:rtl] = rtl
97
+ end
98
+ opts.on("-T", "--text-pdf PDFFILE",
99
+ "Specify a PDF file produced by passing the same set",
100
+ "of files to an OCR program. Pdfbeads will use that file",
101
+ "to generate the hidden text layer for its PDF output.") do |pdffile|
102
+
103
+ if $has_pdfreader
104
+ pdfargs[:textpdf] = pdffile
105
+ else
106
+ $stderr.puts( "Warning: the pdf/reader extension is not available." )
107
+ $stderr.puts( "\tthe -T/--text-pdf option is ignored." )
108
+ end
109
+ end
90
110
 
91
111
  opts.separator "\n"
92
112
  opts.separator "Image encoding and compression options:\n"
@@ -147,7 +167,7 @@ OptionParser.new() do |opts|
147
167
  "Compression method for background images. Acceptable",
148
168
  "values are JP2|JPX|JPEG2000, JPG|JPEG or PNG|LOSSLESS.",
149
169
  "JP2 is used by default, unless this format is not",
150
- "supported by the available version of ImageMagick" ) do |format|
170
+ "supported by the available ImageMagick version" ) do |format|
151
171
  case format.upcase
152
172
  when 'JP2', 'JPX', 'J2K', 'JPEG2000'
153
173
  pageargs[:bg_format] = 'JP2'
@@ -178,6 +198,11 @@ OptionParser.new() do |opts|
178
198
  "Print output to a file instead of STDERR") do |f|
179
199
  outpath = f
180
200
  end
201
+ opts.on("-D", "--debug",
202
+ "Simplify debugging the PDF output by making the hidden",
203
+ "text layer visible and using uncompressed page streams") do |dbg|
204
+ pdfargs[:debug] = dbg
205
+ end
181
206
  opts.on_tail("-h", "--help", "Show this message") do
182
207
  puts opts
183
208
  exit
@@ -0,0 +1,552 @@
1
+ <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
2
+ <html>
3
+ <head>
4
+
5
+ <title>PDFBeads -- convert scanned images to a single PDF file</title>
6
+
7
+ <meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
8
+
9
+ <meta name="Generator" content="Written directly in html">
10
+
11
+ <meta name="Description" content="pdfbeads v. 1.1 User's Manual">
12
+
13
+ <style type="text/css">
14
+ body {
15
+ font-family: Times New Roman, Times, serif;
16
+ text-align: justify;
17
+ }
18
+ a:link {
19
+ color: blue; text-decoration: underline
20
+ }
21
+ a:hover {
22
+ color: fuchsia
23
+ }
24
+ a:active {
25
+ color: fuchsia
26
+ }
27
+ a:visited {
28
+ color: purple
29
+ }
30
+ h1 {
31
+ font-size: 36px;
32
+ font-family: Times New Roman, Times, serif;
33
+ text-align: center;
34
+ font-style: normal;
35
+ font-weight: bold
36
+ }
37
+ h2 {
38
+ font-size: 20px;
39
+ font-family: Arial, Helvetica, sans-serif;
40
+ text-align: center;
41
+ font-style: normal;
42
+ font-weight: bold;
43
+ }
44
+ h3 {
45
+ font-size: 16px;
46
+ font-family: Arial, Helvetica, sans-serif;
47
+ text-align: left;
48
+ font-style: italic;
49
+ font-weight: bold;
50
+ }
51
+ dt {
52
+ font-weight: bold;
53
+ }
54
+ </style>
55
+
56
+ </head>
57
+
58
+ <body>
59
+
60
+ <h1>pdfbeads v. 1.1 User's Manual</h1>
61
+
62
+ <p>(c) Alexey Kryukov, 2013</p>
63
+
64
+ <p>pdfbeads is a small utility intended for creating e-books in PDF format
65
+ from specially prepared scanned pages. Unlike other similar utilities,
66
+ pdfbeads attempts to implement the approach more commonly used for DjVu files.
67
+ Its key feature is separating the scanned page into distinct layers, each
68
+ layer having its own compression format and resolution.</p>
69
+
70
+ <p>Here are some features of pdfbeads:</p>
71
+
72
+ <ul>
73
+
74
+ <li><p>JBIG2 and JPEG2000 graphical data compression;</p></li>
75
+
76
+ <li><p>separating out text and image layers from “mixed” files produced with
77
+ <a href="http://scantailor.sourceforge.net/">ScanTailor</a>;</p></li>
78
+
79
+ <li><p>adding halftone background images to text pages previously
80
+ converted to B&amp;W;</p></li>
81
+
82
+ <li><p>processing indexed images with a limitedо (small) number of colors
83
+ so that the colors are preserved and the content is placed into the foreground
84
+ layer;</p></li>
85
+
86
+ <li><p>separating out background and foreground data by masking a source
87
+ color image;</p></li>
88
+
89
+ <li><p>producing PDF files with a table of contents and metadata;</p></li>
90
+
91
+ <li><p>adding a hidden text layer produced from previously generated hOCR
92
+ files or transferring the text data from another PDF file.</p></li>
93
+
94
+ </ul>
95
+
96
+ <p>The program has been called pdfbeads because building an e-book from
97
+ separate graphical data seems to me a bit similar to threading beads on a
98
+ string. Moreover, this name seems appropriate for a Ruby script, since
99
+ it has something to do with jewelry where rubies (like other gems) are
100
+ used.</p>
101
+
102
+ <h2>Requirements</h2>
103
+
104
+ <p>In order to run the program you’ll need first the Ruby runtime
105
+ environment v. 1.8 or above, which is available by default on most Unix-like
106
+ systems. A Windows installable package is available at the
107
+ <a href="http://www.rubyinstaller.org/">RubyInstaller</a> site. You will
108
+ also need RubyGems (the standard Ruby package manager) and some extensions
109
+ available (like pdfbeads itself) via the RubyGems framework, namely RMagick,
110
+ Nokogiri and PDF::Reader. Note that pdfbeads will may even if the last two
111
+ packages are not available. However, without Nokogiri it will not be possible
112
+ to read optically recognized text from hOCR files, while PDF::Reader is needed
113
+ in order to be able to import the text layer from another PDF file.</p>
114
+
115
+ <p>If you are interested in creating PDF files using the JBIG2 data compression
116
+ format, then your system should have also the jbig2 utility from the
117
+ <a href="http://github.com/agl/jbig2enc">jbig2enc</a> package installed.</p>
118
+
119
+ <h2>Installation</h2>
120
+
121
+ <p>Downloading and installing the most recent pdfbeads version with the
122
+ RubyGems package manager is quite simple. Just type in your command line:</p>
123
+
124
+ <pre>
125
+ gem install pdfbeads
126
+ </pre>
127
+
128
+ <p>Before running the script you should ensure the RMagick extension
129
+ is installed and accessible to the Ruby runtime. <strong>Unfortunately,
130
+ it is not currently possible to automatically resolve this dependency</strong>,
131
+ since in some Linux distributions (Ubuntu in particular) the standard
132
+ installation of the RMagick package circumvents the RubyGems engine, so that
133
+ the <tt>gem</tt> utility knows nothing about it.</p>
134
+
135
+ <p>Ubuntu users should also take into account that in this distribution
136
+ executable files from gem packages are unpacked into the
137
+ <tt>/var/lib/gems/&lt;RUBY_VERSION&gt;/bin</tt> directory, which is not
138
+ included into the PATH environment variable by default. So in order to be
139
+ able to run pdfbeads without specifying the full path to the executable file
140
+ you should either modify the PATH variable as necessary, or move the
141
+ <tt>pdfbeads</tt> script into some directory normally used for executables
142
+ (<tt>/usr/local/bin</tt> for example).</p>
143
+
144
+ <h2>Basic principles</h2>
145
+
146
+ <p>pdfbeads workflow is based upon making a difference between a “main” image
147
+ representing a core of the PDF page and various auxiliary files related
148
+ with the current page.</p>
149
+
150
+ <p>Those files which contain scanned text supposed to be placed into the
151
+ foreground layer are considered “main”. Images used for this purpose
152
+ should normally be previously converted to bitonal. pdfbeads is also
153
+ able to process indexed images with a limited number of colors and a white
154
+ or transparent background, as well as “mixed” images where bitonal text is
155
+ combined with halftone pictures. The latter feature is most useful for
156
+ postprocessing files produced with
157
+ <a href="http://scantailor.sourceforge.net/">ScanTailor</a>.</p>
158
+
159
+ <p>A special treatment is applied to files with double extension, where
160
+ one of the following suffixes precedes an extension typical for a common
161
+ graphical format (TIF(F), PNG, JP(E)G, JP2 or JPX):</p>
162
+
163
+ <dl>
164
+
165
+ <dt>bg or sep</dt>
166
+ <dd><p>A background image (halftone or indexed);</p></dd>
167
+
168
+ <dt>fg</dt>
169
+ <dd><p>An image supposed to be used to color the foreground layer (like
170
+ a FG44 chunk in a DJVU file);</p></dd>
171
+
172
+ <dt>color</dt>
173
+ <dd><p>A color image supposed to be used as a source for producing
174
+ images with the <tt>*.bg.*</tt> and <tt>*.fg.*</tt> suffixes;</p></dd>
175
+
176
+ <dt>An RGB color specification (e.&nbsp;g. <tt>black</tt> or <tt>#ff00ff</tt>)</dt>
177
+ <dd><p>A bitonal image supposed to be displayed with the given color in the
178
+ target PDF file.</p></dd>
179
+
180
+ </dl>
181
+
182
+ <p>Furthermore, if the current directory contains any hOCR files with
183
+ recognized text (their extensions should be either HTM(L) or HOCR),
184
+ pdfbeads will attempt to use them for building hidden text layer
185
+ in its PDF output.</p>
186
+
187
+ <p>Some of the auxiliary files which have been mentioned above may be
188
+ produced by pdfbeads during an intermediate stage of its work. Since
189
+ processing images with the ImageMagick library (which pdfbeads is based on)
190
+ can take quite a long time, those files are not removed afterwards from
191
+ the hard drive and may be reused on subsequent runs for time saving.
192
+ In order to force pdfbeads to recreate those files you may run it with the
193
+ <tt>-f</tt> (or <tt>--force-update</tt>) option.</p>
194
+
195
+ <p>pdfbeads is supposed to be used for building PDF files from previously
196
+ processed scanned images, and this is the reason which explains some
197
+ of its features and limitations:</p>
198
+
199
+ <ul>
200
+
201
+ <li><p>it is not possible to somehow modify scanned images of text pages
202
+ (except forcing them to a specific DPI value), for they are supposed to
203
+ be created with the settings the user would like to get, so that there is
204
+ no need to additionally process them;</p></li>
205
+
206
+ <li><p>it is not possible as well to convert color or grayscale scanned images
207
+ to bitonal. The only exception is those situations (like splitting “mixed”
208
+ pages where bitonal text areas are combined with halftone pictures or separating out
209
+ background and foreground data from a source color image by applying a mask)
210
+ where pdfbeads just finishes the job started by some other applications;</p></li>
211
+
212
+ <li><p>any background images taken directly from user’s hard drive are
213
+ encoded “as is” without any additional processing.</p></li>
214
+
215
+ </ul>
216
+
217
+ <h2>Getting started</h2>
218
+
219
+ <p>The generic command line syntax is as follows:</p>
220
+
221
+ <pre>
222
+ pdfbeads [options] [files to process] [&gt; output_file.pdf]
223
+ </pre>
224
+
225
+ <p>The list of files to be processed may be either obtained from the
226
+ current directory listing or directly specified in the command line.
227
+ However in both cases <strong>pdfbeads accepts for processing only those
228
+ files whose names match a specific pattern:</strong> the extension should
229
+ be either TIF(F) or PNG (the case doesn’t matter) and there should be
230
+ no dots inside the base name (i.&nbsp;e. double extensions are not
231
+ allowed). The reason for this limitation is that the program, as explained
232
+ above, uses dot-separated file name suffixes to denote some types
233
+ of auxiliary files, accompanying the scanned text page itself.</p>
234
+
235
+ <p>Instead of writing the resulting PDF file to the standard output
236
+ stream one can use the <tt>-o</tt> (or <tt>--output</tt>) option
237
+ followed by the name of the file to be created.</p>
238
+
239
+ <h2>Processing bitonal images</h2>
240
+
241
+ <p>The foreground layer of a PDF page, or its “mask”, is created from the
242
+ “main” scanned page file passed to pdfbeads. The following rules are applied
243
+ here:</p>
244
+
245
+ <ul>
246
+
247
+ <li><p>TIFF or PNG images already converted to bitonal are used “as
248
+ is”;</p></li>
249
+
250
+ <li><p>pages with mixed content are cleared from any halftone pictures
251
+ (their processing is described in the next section), while the remaining
252
+ bitonal image is saved into a file with the same base name and the
253
+ <tt>black.tiff</tt> extension. That’s the <tt>black.tiff</tt> image file
254
+ which is further used to produce the foreground layer for such a page;</p></li>
255
+
256
+ <li><p>indexed images with a white or transparent background which contain
257
+ a small number of colors (4 by default; this value can be changed via the
258
+ <tt>-x</tt> (<tt>--max-colors</tt>) option) are splitted into several
259
+ bitonal image files according to the number of colors. Each of those
260
+ files is further encoded separately, so that the resulting PDF page
261
+ will have several foreground layers, each with its own color. NB: use an
262
+ indexed PNG image with a transparent background if you want to produce
263
+ a PDF page with a white-colored text.</p></li>
264
+
265
+ </ul>
266
+
267
+ <p>It is recommended to encode bitonal text pages as CCITT Group 4 fax
268
+ compressed TIFF files, since pdfbeads is usually able to read the image
269
+ data from such files without using the ImageMagick library, thus making
270
+ the processing speed significantly faster.</p>
271
+
272
+ <p>By default pdfbeads attempts to apply JBIG2 compression to the foreground
273
+ layer, using Adam Langley’s <a href="http://github.com/agl/jbig2enc">jbig2enc</a>
274
+ utility. You can run pdfbeads with the <tt>-p</tt> (<tt>--pages-per-dict</tt>)
275
+ option in order to directly specify the desired number of PDF document
276
+ pages using the common dictionary of shared symbols (15 by default).</p>
277
+
278
+ <p>If jbig2enc is not accessible to pdfbeads, then the CCITT Group 4 fax
279
+ compression method will be used instead. It is also possible to explicitly
280
+ request this compression type by specifying the <tt>-m</tt>
281
+ (<tt>--mask-compression</tt>) option with the `G4' parameter (or its synonyms:
282
+ `Group4', `CCITTFax').</p>
283
+
284
+ <h2>Processing halftone images</h2>
285
+
286
+ <p>Halftone images are placed into the background layer of a PDF page.
287
+ This layer is normally supposed to have a lower resolution than its mask.
288
+ pdfbeads can either take a background image directly from the hard drive
289
+ (i.&nbsp;e. from a file with a <strong>bg</strong> or <strong>sep</strong>
290
+ extension suffix), or produce it by splitting a mixed image file.</p>
291
+
292
+ <p>When processing mixed image files pdfbeads first separates pictures
293
+ from text areas. This is achieved by filling any black pixels with white.
294
+ The resulting image is saved into the hard drive by a such thay, that the
295
+ following commkand line options are taken into account:</p>
296
+
297
+ <dl>
298
+ <dt>-b, --bg-compression</dt>
299
+ <dd><p>The data compression format. The fololowing values are allowed:
300
+ `JPEG2000' (also `JP2' also `JPX'), `JPEG' (or `JPG') and `LOSSLESS'
301
+ (synonyms are `DEFLATE' and `PNG'). pdfbeads will attempt to use the JPEG2000
302
+ compression by default. However it falls back to JPEG if JPEG2000 format
303
+ is not supported by the currently used ImageMagick build (which is often
304
+ the case). If the option has been set to LOSSLESS, then pdfbeads will
305
+ compress background images with the deflate method. Of course this choice
306
+ would normally result into producing a much larger output file than with
307
+ JPEG2000 or JPEG.</p></dd>
308
+
309
+ <dt>-B, --bg-resolution DPI</dt>
310
+ <dd><p>The resolution for the background. Reasonable values usually lie
311
+ between 150 and 300&nbsp;dpi (300 by default).</p></dd>
312
+
313
+ <dt>-g, --grayscale</dt>
314
+ <dd><p>Forces pdfbeads to convert color images into grayscale. This option
315
+ would be useful for processing images which have been produced by scanning
316
+ pages with gray pictures in color mode and haven’t been previously converted
317
+ to grayscale. Such a situation may often occur, for example, when processing
318
+ digital photos with ScanTailor.</p></dd>
319
+
320
+ <p>When pdfbeads loads previously produced background image from the hard
321
+ drive, it doesn’t perform any additional processing. JPEG and JPEG2000
322
+ images are inserted into the resulting PDF file “as is”, while images
323
+ taken from TIFF and PNG files are compressed with the deflate method.
324
+ However if there are several <tt>*.bg.*</tt> or <tt>*.sep.*</tt> files
325
+ which have the same base name, but different extensions, then the graphical
326
+ format specified with the <tt>--bg-compression</tt> option will take
327
+ precedence.</p>
328
+
329
+ </dl>
330
+
331
+ <h2>Separating color images using a mask image</h2>
332
+
333
+ <p>Separating a scanned image into distinct layers is especially difficult
334
+ in case the text has been printed above a picture or texture. In order to
335
+ effectively package such a page into a pdf file one should prepare two
336
+ graphical files:</p>
337
+
338
+ <ul>
339
+
340
+ <li><p>a bitonal or indexed image containing just the scanned text or any
341
+ other elements supposed to be placed into the foreground layer;</p></li>
342
+
343
+ <li><p>a color scan of the same page (pdfbeads recognizes such images
344
+ by the <tt>*.color.*</tt> filename suffix).</p></li>
345
+
346
+ </ul>
347
+
348
+ <p>The first file will serve a stencil: basing on its shapes pdfbeads will
349
+ attempt to produce from the color scan two new images, so that the first
350
+ one (with the <tt>*.bg.*</tt> suffix) will contain just the color
351
+ background cleaned up from any text data, while on the second one (with the
352
+ <tt>*.fg.*</tt> suffix) just the mask elements with the corresponding
353
+ texture will remain. This procedure is very similar to one performed by
354
+ the <tt>djvumake</tt> when we run it with the <tt>PPM</tt> option.
355
+ In either case the purpose is to produce a 3-layered page where the first
356
+ color layer is responsible for the image background while the second one
357
+ is used to specify colors and textures for the mask which is placed
358
+ above.</p>
359
+
360
+ <p>In order to achieve the desired result it is necessary that the mask
361
+ can be placed above the color images without any shifts or distortions.
362
+ On the other hand, it is OK if the two images have different resolutions
363
+ (and thus different pixel sizes): in such a case pdfbeads will first resize
364
+ the stencil so that it matches the size of the color image. Note that, if
365
+ all the text at the page is black (or at least darker than the background),
366
+ it would be convenient to use ScanTailor for producing both the source
367
+ graphical files. In order to do that one should output the same page first
368
+ as “Black and White” and then as “Color/Grayscale”.</p>
369
+
370
+ <p>Also note that, if the stencil image is represented with an indexed
371
+ (but not bitolnal) image with the number of colors equal to or less than
372
+ the current value of the <tt>--max-colors</tt> options, then pdfbeads
373
+ will not create a <tt>*.fg.*</tt> file: instead it will just place
374
+ the stencil with the previously specified colors above the background
375
+ layer cleaned up from the text data.</p>
376
+
377
+ <p>To conclude this section I’d like to mention that the segmentation
378
+ algorithm used by pdfbeads has been inspired by
379
+ <a href="http://www.imagemagick.org/discourse-server/viewtopic.php?p=41498#p41498">a
380
+ thread at the ImageMagick forum</a>, where possible methods to remove text
381
+ from an image and then fill the resulting “gaps” basing on the values of the
382
+ neighboring pixels have been discussed.</p>
383
+
384
+ <h2>Additional features</h2>
385
+
386
+ <h3>Adding metadata</h3>
387
+
388
+ <p>In order to include some information about author, book title etc.
389
+ into the PDF file going to be produced by pdfbeads, one should first put
390
+ those data into a special ASCII or UTF-8 encoded text file. Each line
391
+ of the file should be formatted as follows:</p>
392
+
393
+ <pre>/&lt;KEYWORD&gt;: "Some text"
394
+ </pre>
395
+
396
+ <p>The following keyword strings are currently recognized by pdfbeads:
397
+ <tt>Title</tt>, <tt>Author</tt>, <tt>Subject</tt> and <tt>Keywords</tt>.
398
+ Any lines starting with the `#' character are considered comments and
399
+ ignored.</p>
400
+
401
+ <p>A reference to the metadata file can be passed to pdfbeads via the
402
+ <tt>-M</tt> (or <tt>--meta</tt>) option.</p>
403
+
404
+ <h3>Page labels</h3>
405
+
406
+ <p>pdfbeads allows to generate page labels which may be then displayed by
407
+ a PDF viewer instead of physical page numbers. Thus it is possible
408
+ to bring page numbering of the electronic document into accordance with
409
+ the pagination of the paper book. Page labels may be specified with the
410
+ <tt>-L</tt> (or <tt>--labels</tt>) command line key. This option takes
411
+ an argument which should be enclosed into quotation marks and may contain
412
+ one or more numbering range specifications, separated with semicolons.</p>
413
+
414
+ <p>A numbering range is constructed from the following components (each of
415
+ them being optional):</p>
416
+
417
+ <ul>
418
+ <li><p>The physical number of the first page of the given range in the
419
+ PDF document, separated with a colon from the rest of the specification.
420
+ Note that pages in PDF documents are numbered starting from zero, so for
421
+ the first range this value should always be zero.</p></li>
422
+
423
+ <li><p>An arbitrary numbering prefix (any characters, except a double
424
+ quotation mark, a colon, a semicolon, and a percent sign are allowed
425
+ here).</p></li>
426
+
427
+ <li><p>A numbering format description, which starts from a percent sign
428
+ followed by a single Latin letter corresponding to a particular numbering
429
+ style:</p></li>
430
+
431
+ <dl>
432
+
433
+ <dt>D</dt>
434
+ <dd><p>arabic digits;</p></dd>
435
+
436
+ <dt>R</dt>
437
+ <dd><p>uppercase Roman numerals;</p></dd>
438
+
439
+ <dt>r</dt>
440
+ <dd><p>lowercase Roman numerals;</p></dd>
441
+
442
+ <dt>A</dt>
443
+ <dd><p>uppercase Latin letters;</p></dd>
444
+
445
+ <dt>a</dt>
446
+ <dd><p>lowercase Latin letters.</p></dd>
447
+
448
+ </dl>
449
+
450
+ <p>Between the percent sign and the numbering format identifier it is
451
+ possible to put an arbitrary number, thus setting the number to be
452
+ displayed for the first page of the given range (1 by default).</p>
453
+
454
+ </ul>
455
+
456
+ <p>Suppose for example that a book starts from two unnumbered title pages
457
+ followed by 32 pages numbered with roman digits. Then goes an arabic
458
+ pagination, which, however, starts straight from 33. So the following argument
459
+ for the <tt>--labels</tt> option would be appropriate:</p>
460
+
461
+ <pre>
462
+ "0:Title %D;2:%R;34:%33D"
463
+ </pre>
464
+
465
+ <h3>Building table of contents</h3>
466
+
467
+ <p>pdfbeads allows to add table of contents (PDF bookmarks) to the PDF file
468
+ to be generated. This is done with the <tt>-C</tt> (or <tt>--toc</tt>) option,
469
+ which accepts as an argument a path to a text file.</p>
470
+
471
+ <p>The TOC file should be UTF-8 encoded and consist of lines formatted
472
+ as follows (lines beginning from the `#' character are considered comments
473
+ and ignored):</p>
474
+
475
+ <pre>
476
+ &lt;indent&gt;"Heading" "Page number" [0|-|1|+]
477
+ </pre>
478
+
479
+ <p>The heading level is determined by its indent (which may be formed either
480
+ from spaces or from tabs, but mixing both styles inside the same file is not
481
+ allowed). The indent is followed by the fields of heading and page number,
482
+ which are separated with any number of space characters and may be enclosed
483
+ into double quotation marks if mecessary. The last optional parameter
484
+ specifies, if this TOC entry should be displayed unfolded by default (the
485
+ characters `+' and `1' mean “yes”).</p>
486
+
487
+ <p>It is a good idea to use the <tt>--toc</tt> option together with
488
+ <tt>--labels</tt>. Thus it is possible to use in the TOC file the same
489
+ page numbers as in the paper book without taking care about any shifts
490
+ of the numbering.</p>
491
+
492
+ <h3>Adding text layer</h3>
493
+
494
+ <p>It is possible with pdfbeads to create PDF files with a hidden text
495
+ layer. The text for a hidden layer may be either obtained from
496
+ <a href="http://docs.google.com/View?docid=dfxcv4vc_67g844kf">hOCR</a>
497
+ files (hOCR is a HTML language extension, allowing to store information
498
+ about exact positioning of characters and markup elements on the page) or
499
+ imported from another PDF file.</p>
500
+
501
+ <p>Для создания файлов в формате hOCR необходимо воспользоваться программой
502
+ оптического распознавания символов, поддерживающей этот формат, например
503
+ <a href="https://launchpad.net/cuneiform-linux/">Cuneiform</a> или
504
+ <a href="http://code.google.com/p/tesseract-ocr/">Tesseract</a>.
505
+ Распознанный текст следует сохранить в той же директории, что и остальные
506
+ файлы, относящиеся к проекту. При этом каждой распознанной странице должен
507
+ соответствовать отдельный файл с тем же базовым именем, что и у исходного
508
+ изображения, при расширении HTM(L) или HOCR. Обработка файлов hOCR
509
+ осуществляется автоматически при условии, что интерпретатору Ruby доступно
510
+ расширение Nokogiri.</p>
511
+
512
+ <p>Иное возможное решение заключается в том, чтобы импортировать текстовый
513
+ слой из другого PDF-файла (естественно, последний должен быть получен путем
514
+ распознавания тех же самых изображений, которые предполагается затем обработать
515
+ с помощью pdfbeads). Имя полученного файла следует передать pdfbeads с помощью
516
+ ключа <tt>-T</tt> (полная форма&nbsp;&mdash; <tt>-text-pdf</tt>). Эта
517
+ возможность особенно важна в тех случаях, когда приходится использовать для
518
+ распознавания текста коммерческое приложение (например,
519
+ <a href="http://www.abbyy.ru/finereader/">ABBYY Finereader</a>), в котором
520
+ не предусмотрена поддержка формата hOCR. <strong>Внимание:</strong> при
521
+ создании промежуточного PDF-файла в ABBYY Finereader следует использовать
522
+ настройки &laquo;текст под изображением&raquo; или &laquo;текст поверх
523
+ изображения&raquo;, поскольку при иных настройках размещение символов на
524
+ странице может оказаться не вполне соответствующим исходному графическому
525
+ файлу.</p>
526
+
527
+ <h3>Processing files with the right-to-left text direction</h3>
528
+
529
+ <p>The <tt>-R</tt> (or <tt>--right-to-left</tt> option allows to mark the
530
+ PDF file produced by pdfbeads with a special flag indicating that the main
531
+ text direction for the given document is right-to-left. This flag will allow
532
+ Adobe Reader&trade; to correctly order pages when displaying them in the
533
+ side-by-side mode.</p>
534
+
535
+ <h2>License</h2>
536
+
537
+ <p>This program is free software; you can redistribute it and/or modify
538
+ it under the terms of the GNU General Public License as published by
539
+ the Free Software Foundation; either version 2 of the License, or
540
+ (at your option) any later version.</p>
541
+
542
+ <p>This program is distributed in the hope that it will be useful,
543
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
544
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
545
+ GNU General Public License for more details.</p>
546
+
547
+ <p>You should have received a copy of the GNU General Public License
548
+ along with this program; if not, write to the Free Software
549
+ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.</p>
550
+
551
+ </body>
552
+ </html>