RubyGems - pdf_paradise - Versions diffs - 0.1.66 - Mend

pdf_paradise 0.1.66

Potentially problematic release.

Files changed (110)

data/doc/README.gen ADDED

@@ -0,0 +1,662 @@
+ADD_RUBY_HEADER
+ADD_TIME_STAMP
+<img src="https://i.imgur.com/unhKNEw.png" style="margin-left: 2em">
+This project can help with pdf-related activities, such as extracting
+a .pdf page, converting .pdf page, merging .pdf files, splitting
+.pdf files, setting the title of a .pdf page and similar actions.
+The project has to remain quite flexible. We may use external
+programs such as **ghostscript** or **qpdf**, or we may use pure
+ruby solutions, such as via the gem called **combine_pdf**,
+**prawn** or **hexapdf**.
+The file here (README.gen, or rather the generated file called
+**README.md**) will describe some of the components that make
+up this gem.
+## Rationale for making use of separate pdf-related projects
+There are many pdf-related tools if you look on the www. For
+example, we have prawn, we have qpdf, we have calibre, we
+have hexapdf, we have ghostscript, and many more applications.
+Some of these have unique features; and some of them have overlapping
+functionality, such as reading the content of .pdf files in a
+simplified manner (number of pages, title, author and so forth).
+The PdfParadise project attempts to support as many different
+(open-source) projects as possible. It is also permissive to
+support closed source projects, provided that **the code remains
+simple** (and simple to change). The primary focus is on
+open-source projects, though.
+Why does the PdfParadise project attempt to support many different
+pdf-related projects?
+The answer to this question is rather simple: on Linux I have a lot of
+flexibility and can use literally any pdf-related project just fine. On
+Windows, however, I am more restricted in what I can use. Not all
+programs are available on windows or can be easily compiled there. Thus,
+in order to allow the pdf_paradise .gem to work on windows, we need
+this flexibility.
+The reason why I added this subsection here in June 2021 was that
+I am slowly changing the sinatra-related part of the PdfParadise
+project, in order to embed the functionality into my main controller
+which is handled by the **Roebe** namespace. In that controller
+I wanted to easily offer pdf-related functionality "out of the
+box" when I start the sinatra-application on windows. Because I
+want to be able to offer pdf-related modifications on windows
+as well, the PdfParadise project had to become more flexible,
+so that a simple toplevel route, such as **/pdf**, will work
+properly, and lead to entry points (subroutes) that allow
+us to tap into the features offered by the PdfParadise project.
+So, the **summary** is: the PdfParadise project must remain
+flexible in order to support a proper workflow on windows
+systems as well. (We could use WSL on windows, but not every
+computer has this available, so I am targeting "vanilla"
+windows really.)
+Note that one slight drawback is that the sinatra part of
+the PdfParadise project now has a dependency on the
+**cyberweb** project, so if you want to use that, you also
+have to install the cyberweb gem. This is a trade-off - for me
+the more important part is the long-term maintainability of
+the pdf_paradise project, so a unified code base had to be
+used in this regard.
+## Converting a .pdf file to text
+Sometimes you may wish to have a text-file describing the content
+of a .pdf file, rather than the .pdf file itself.
+Via class **PdfParadise::ConvertPdfToText**, residing in the file
+at **pdf_paradise/convert_pdf_to_text.rb**, you can convert a
+.pdf file to a text file.
+Usage example from ruby, for the file called **foobar.pdf**:
+    PdfParadise::ConvertPdfToText.new(ARGV)
+    PdfParadise::ConvertPdfToText.new('foobar.pdf')
+You can also use the bin/ file from the commandline:
+    convert_pdf_to_text
+    convert_pdf_to_text foobar.pdf
+There is also a ruby-gtk3 widget that offers the functionality
+from class **PdfParadise::ConvertPdfToText**, if the user
+has gtk3 installed and the ruby-bindings to it as well.
+You can start that ruby-gtk3 widget via:
+    convert_pdf_to_text --gui
+## Commandline usage
+You can use the **pdf_paradise** gem from the commandline, as
+the example above shows.
+For instance, if you wish to modify **the title of a .pdf
+file**, you can use a commandline invocation such as
+this one:
+    pdf_paradise --use-this-pdf-file=location_to_your_pdf_file.pdf --set_title="The title you want to use goes in here."
+You can also **shrink** a .pdf file, by using the commandline
+switch <b>--shrink-pdf-size-of=foobar.pdf</b> or just
+<b>--shrink</b>, such as:
+    pdf_paradise --shrink-pdf-size-of=foobar.pdf
+    pdf_paradise --shrink=foobar.pdf
+The <b>shrink</b> functionality is contained in the module-method
+<b>PdfParadise.reduce_size_of_this_pdf_file()</b>.
+## Storing the .pdf pages that are currently open
+If you need to store the .pdf files that are currently open,
+you can use the following commandline to do so:
+    pdfparadise --store-open-pdf-files
+This will attempt to store the full path to the .pdf files
+into a local file. That way you may also be able to batch-open
+these .pdf files at a later time, e. g. when you switch your
+window manager or after a reboot.
+## Deleting the last or the first page of a .pdf file
+You can use **class DeleteLastPageOfThisPdfFile**, more
+accurately called **class PdfParadise::DeleteLastPageOfThisPdfFile**,
+to ***delete the last page in a .pdf file***.
+In ruby code, you can invoke this like so:
+    require 'pdf_paradise'
+    PdfParadise::DeleteLastPageOfThisPdfFile.new('path_to_the_pdf_file/goes_in_here.pdf')
+or shorter:
+    require 'pdf_paradise'
+    PdfParadise.delete_last_page_of_this_pdf_file('foobar.pdf')
+A very similar API exists for deleting the first page of a given .pdf
+file, too.
+In ruby code, you can invoke this like so:
+    require 'pdf_paradise'
+    PdfParadise::DeleteFirstPageOfThisPdfFile.new('path_to_the_pdf_file/goes_in_here.pdf')
+or shorter:
+    require 'pdf_paradise'
+    PdfParadise.delete_first_page_of_this_pdf_file('foobar.pdf')
+## Converting markdown .md files to .pdf files
+If you use kramdown, prawn and kramdown-pdf-converter, then you
+can convert .md files on the commandline, via:
+    convert_markdown_to_pdf path_to_pdf_file_goes_here.pdf
+Install the necessary gems prior to using this commandline
+functionality.
+## sinatra interface
+As of April 2019 there is a minimal sinatra interface to the
+PdfParadise project. Consider this incomplete <b>work-in-progress</b>.
+To start it, try:
+    pdf_paradise --sinatra
+## Querying the title of a .pdf file
+<b>class PdfParadise::QueryPdfTitle</b> will report the title of
+any .pdf file that is passed into it, on the commandline.
+This currently depends on <b>exiftool</b> but at a later time,
+this may change to also allow a query via prawn or other tools.
+If you need to determine whether a given .pdf file has a title
+or whether it does not, you can use
+<b>PdfParadise.does_this_pdf_file_have_a_title?</b>, such
+as in:
+    PdfParadise.does_this_pdf_file_have_a_title? "foobar.pdf" # => true
+This method will return **true** if the .pdf file at hand has a
+title; and **false** otherwise.
+## Determining how many pages a given .pdf file has
+class **PdfParadise::PdfFileNTotalPages** can be used to query
+how many pages a given .pdf file has.
+The executable called **bin/n_pages** (thus, **n_pages**) can
+be used to query this, on the commandline.
+Example:
+    n_pages foobar.pdf
+Do note that the class requires the external program
+called **pdfinfo**.
+It is possible to query the number of pages in a given .pdf
+file without **pdfinfo**, but some .pdf files are a bit buggy,
+and **pdfinfo** is simply more reliable than the regex that
+was used until March 2020. So, since March 2020, the program
+**pdfinfo** has been used by default. Note that pdfinfo is
+part of the poppler software suite.
+You can also use the following toplevel API for this:
+    PdfParadise.n_pages? 'THE_PATH_TO_THE_PDF_FILE_GOES_IN_HERE.pdf'
+    PdfParadise.n_pages? 'foobar.pdf'
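To illustrate the idea of relying on **pdfinfo**, here is a minimal sketch that reads the page count from the text pdfinfo prints. This is not the code used by class PdfParadise::PdfFileNTotalPages; the method name `pages_from_pdfinfo_output` and the sample output are assumptions made for this example only.

```ruby
# Extract the page count from the text that pdfinfo prints.
# Returns an Integer, or nil when no "Pages:" line is present.
# (Hypothetical helper; not part of the pdf_paradise API.)
def pages_from_pdfinfo_output(output)
  match = output.match(/^Pages:\s+(\d+)\s*$/)
  match ? match[1].to_i : nil
end

# Shortened, made-up sample of what pdfinfo may print for a file.
sample_output = <<~TEXT
  Title:          Example
  Pages:          12
  Page size:      595 x 842 pts (A4)
TEXT

puts pages_from_pdfinfo_output(sample_output) # prints 12

# With poppler installed, the real output could be obtained like this:
#   output = IO.popen(['pdfinfo', 'foobar.pdf'], &:read)
```

The "Page size:" line is deliberately part of the sample, to show that only the "Pages:" line is matched.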
+## Adding page numbers to .pdf files
+Via the combine_pdf gem it is now possible to add page numbers
+to .pdf files. This has a few limitations for complex .pdf files,
+due to combine_pdf having limitations in turn - but for simple
+.pdf files this should work really well.
+How to use that functionality?
+Consider using the following toplevel API:
+    PdfParadise.number_pages('this_file.pdf')
+The file called **this_file.pdf** has to exist in order for
+this to work, of course.
+The current default is to display the page numbers on the bottom
+right side. This is hardcoded, but you could modify the code
+to adapt to your needs; see also how combine_pdf does this.
+(You have to pass an option-hash.)
+## Various GUI components of the PdfParadise project
+The **PdfParadise project** comes with some ruby-gtk3 specific
+GUIs, but a few ruby-gtk2 and ruby-tk bindings may exist
+as well. The **ruby-gtk3** components constitute the main GUI
+elements of this project, though.
+You can start, from the commandline, the gtk-wrapper
+over the **split_pdf_file** functionality.
+In order to do this, do either one of the following:
+    pdf_paradise --gui
+    pdf_paradise --gtk
+This will require the **gtk_paradise** project and the gtk
+bindings, so quite a lot. **gem install gtk3** and
+**gem install gtk_paradise** should help.
+The GUI for class SplitPdfFile is called **PdfParadise::Gtk::SplitPdfFile**.
+The idea behind it is to allow you to determine some of the parameters
+in a graphical fashion.
+As of **September 2019**, there is also a mini-widget for quickly
+removing the first page of a .pdf file. This is really minimal right
+now and not very elegant; it may be improved in the future, but for
+the time being it is what it is. It is more a proof-of-concept that
+it can work.
+You can start this via:
+    require 'pdf_paradise/gui/gtk2/remove_first_page_of_pdf_file.rb'
+    PdfParadise.start_gtk_gui_remove_first_page_of_pdf_file
+Note that as of **January 2021** the gtk bindings will default to
+**ruby-gtk3**. Support for ruby-gtk2 will be retained,
+but new code may not necessarily be written with ruby-gtk2 in
+mind. ruby-gtk3 is now the main GUI target for this project.
+I am slowly porting the individual widgets.
+The following widgets have been ported so far:
+    PdfParadise::GUI::Gtk::StatisticsWidget # can be found under pdf_paradise/gui/gtk3/statistics_widget/statistics_widget.rb
+## Specification of the .pdf format
+This subsection is a stub - I only needed it to gather information
+about the .pdf specification. This is NOT complete - it only shall
+contain some useful information and snippets about the .pdf
+specification.
+PDF is short for **Portable Document Format**.
+PDF has been standardized as **ISO-32000** in the year **2008**.
+In the pdf-specification we can distinguish these entities:
+    Objects: these are not objects in the OOP sense, but simply the
+    basic data type of the PDF standard. There are 9 types of objects:
+    null, boolean, integer, real, name, string, array, dictionary and
+    stream.
+    Dictionary: this is an unordered collection of key-value pairs.
+    It is denoted by << at the beginning and >> at the end.
+    Indirect Objects: these are objects that are referred to by
+    reference.
+    Direct Objects: these are objects that appear inline and are
+    obtained directly.
+    Conforming Reader: an application that parses a PDF
+    file according to the PDF Standard.
+A .pdf file is made up of a specific structure, usually a four-part
+layout.
+These four parts are:
+    Header
+    Body
+    Cross-reference table
+    Trailer
+### The .pdf Header tag
+The header may begin with an entry such as **%PDF-1.7**.
+The general format for the header is:
+    %PDF- followed by the version number in the form of 1.N.
+This is not valid for all .pdf files, though. From PDF version
+1.4 onwards, the **Version** entry in the document's catalog dictionary,
+which is within the **Root** entry of the **Trailer**, may be
+used instead of the Header - **if present**.
+If a .pdf file contains binary data - which most PDF files
+will do nowadays, such as **stream objects** - then the
+**Header** line shall be immediately followed by a line
+containing at least **four binary characters**. These
+are character codes of 128 or greater.
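The two header rules described above can be sketched in a few lines of plain ruby. This is only an illustration and not code from the pdf_paradise gem; the method names `pdf_header_version` and `binary_marker_line?` are made up for this example, and the sketch only looks at the first two lines of the data.

```ruby
# Read the version announced in a PDF header line such as "%PDF-1.7".
# Returns the version as a String ("1.7"), or nil if no header is found.
# (Hypothetical helper; not part of the pdf_paradise API.)
def pdf_header_version(data)
  first_line = data.b.lines.first.to_s
  match = first_line.match(/\A%PDF-(\d\.\d)/)
  match ? match[1] : nil
end

# Check whether the second line contains at least four bytes with a
# character code of 128 or greater.
def binary_marker_line?(data)
  second_line = data.b.lines[1].to_s
  second_line.bytes.count { |byte| byte >= 128 } >= 4
end

# Made-up start of a .pdf file, treated as binary data.
sample = "%PDF-1.7\n%\xE2\xE3\xCF\xD3\n1 0 obj\n".b

puts pdf_header_version(sample)  # prints 1.7
puts binary_marker_line?(sample) # prints true
```

Note that this only reads the Header; the **Version** entry of the catalog dictionary, which may override it, is not consulted here.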
+### The .pdf Body tag
+The body of a PDF File consists of the aforementioned **Indirect
+Objects** representing the contents of a document.
+**Indirect Objects** begin with a **unique object identifier**
+that allows other objects to refer to them.
+That identifier is made up of the following two components:
+    (1) Object Number:     a positive Integer; the numbers may appear in any order
+    (2) Generation Number: a non-negative Integer
+The **Indirect Objects** can be referred to from elsewhere by an
+Indirect Reference. This must consist of:
+    Object Number
+    Generation Number, and
+    keyword R # for instance: 4 0 R
+After the identifier comes the keyword **obj** (start of the object);
+the keyword **endobj** marks the end of the object. Anything in between
+is the content of the object, for instance a dictionary of key-value pairs.
+A simple example showing the use of **Indirect Objects** will be
+shown next:
+    1 0 obj % Object Number 1, Generation Number 0
+    <<
+    /Type /Pages % Describe type of object
+    /Kids [ 4 0 R ] % Kids Entry referring to an indirect reference (Object number 4, Generation number 0)
+    /Count 1
+    >>
+    endobj
+    2 0 obj % Object Number 2, Generation Number 0
+    <<
+    /Type /Catalog % Describe type of object
+    /Pages 1 0 R % Referring another object via unique object identifier
+    >>
+    endobj
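To make the notion of an Indirect Reference a bit more tangible, the following sketch collects all references of the form "object number, generation number, R" from a piece of object text. It is a simplification written for this README only: the names `IndirectReference` and `indirect_references_in` are hypothetical, and the naive comment stripping would also cut off a % that appears inside a PDF string.

```ruby
# One indirect reference, e. g. "4 0 R".
IndirectReference = Struct.new(:object_number, :generation_number)

# Collect indirect references from a piece of PDF object text.
# Comments introduced by % are dropped first (naively, line by line).
def indirect_references_in(text)
  without_comments = text.gsub(/%.*$/, '')
  without_comments.scan(/(\d+)\s+(\d+)\s+R\b/).map do |object_number, generation_number|
    IndirectReference.new(object_number.to_i, generation_number.to_i)
  end
end

catalog = <<~PDF
  2 0 obj % Object Number 2, Generation Number 0
  <<
  /Type /Catalog
  /Pages 1 0 R % Referring to another object
  >>
  endobj
PDF

indirect_references_in(catalog).each do |reference|
  puts "#{reference.object_number} #{reference.generation_number} R" # prints 1 0 R
end
```

The line "2 0 obj" is not reported, because it defines an object rather than referring to one.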
+The **Body** section of a .pdf file is thus a tree of objects that
+are linked together, ultimately coming down to the Root Object
+(defined by the **Root** entry in the **Trailer** section, as a
+catalog dictionary).
+The **Cross-Reference Table** is a table that contains a list of byte
+offsets pointing to the indirect objects.
+A pdf-conforming reader uses the Cross-Reference Table as a lookup
+table to access certain objects quickly when needed.
+The format for entries in the Cross-Reference Table can be summarized as
+follows:
+    - Each entry has the format nnnnnnnnnn ggggg n eol, a total of 20 bytes
+    - nnnnnnnnnn is a 10-digit byte offset in the decoded stream
+    - ggggg is a 5-digit generation number
+    - n is the keyword for an in-use entry, f the keyword for a free entry
+    - eol is a 2-character end-of-line sequence (like CR LF)
+The **Cross-Reference Table** always begins with the special entry
+**0000000000 65535** - see the following example:
+    0000000000 65535 f % special entry, f denoting it is a free entry
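The entry format summarized above is regular enough to be parsed with a single regex. The following is a small sketch under that assumption; `XrefEntry` and `parse_xref_entry` are names invented for this example and are not part of the pdf_paradise gem.

```ruby
# One cross-reference entry: byte offset, generation number, in-use flag.
XrefEntry = Struct.new(:offset, :generation, :in_use)

# Parse one entry of the form "nnnnnnnnnn ggggg n" or "nnnnnnnnnn ggggg f".
# Trailing end-of-line characters are ignored.
# Returns nil when the line does not have that shape.
def parse_xref_entry(line)
  match = line.match(/\A(\d{10}) (\d{5}) ([nf])\s*\z/)
  return nil unless match

  XrefEntry.new(match[1].to_i, match[2].to_i, match[3] == 'n')
end

entry = parse_xref_entry("0000000000 65535 f\r\n")
puts entry.offset     # prints 0
puts entry.generation # prints 65535
puts entry.in_use     # prints false
```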
+## Graphical User Interfaces (GUIs)
+The pdf_paradise gem comes with a few, small-ish widgets, primarily
+written in ruby-gtk. As of August 2021 I am also experimenting
+with libui but this is a slow process - stay tuned for more updates
+in the coming months in this regard.
+One big advantage of libui is that it works on windows out-of-the-box,
+so we can use GUIs on windows as well. \o/
+## Compressing a .pdf file (optimize the size of a .pdf file)
+Sometimes you may have to reduce the filesize of a given .pdf
+file, such as when you need to upload a .pdf file, and there
+is some file size limit otherwise. This happened to me a few
+times when using webmail-based email services, where an
+automatic notice was generated when the .pdf file was too
+large, e. g. above 25MB in size or something similar.
+So, let us now assume that you **do** have a use case such
+as described above, or any other use case - you want to
+reduce the file size of a given .pdf file at hand.
+How can this be done?
+Well, there are several ways. One is to use online-based
+tools, which tend to work surprisingly well; I verified
+this in February 2022. But, as far as the gem here is
+concerned, we will focus primarily on means that can be
+used by you on your own, without having to depend on
+external websites. Two methods will be described here -
+the first one requiring **ghostscript**, the second
+one requiring **hexapdf**.
+The important parameter with regard to **ghostscript** is
+the **dPDFSETTINGS** parameter. This one will determine
+the compression level, which ultimately will affect
+the quality of the compressed .pdf file.
+Available parameters to **dPDFSETTINGS** include
+**/screen**, **/ebook**, **/printer**, **/prepress**
+and **/default**.
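As a rough illustration of how such a ghostscript call can be put together from ruby, consider the sketch below. It only builds the argument list and does not claim to be the exact invocation used by the pdf_paradise gem; the method name `ghostscript_compress_command`, the constant and the output filename are assumptions for this example.

```ruby
# The presets named above for -dPDFSETTINGS.
GHOSTSCRIPT_PRESETS = %w[/screen /ebook /printer /prepress /default].freeze

# Build the argument list for a ghostscript run that rewrites input
# into output with the given preset. Nothing is executed here.
# (Hypothetical helper; not part of the pdf_paradise API.)
def ghostscript_compress_command(input, output, preset: '/ebook')
  unless GHOSTSCRIPT_PRESETS.include?(preset)
    raise ArgumentError, "unknown -dPDFSETTINGS value: #{preset}"
  end

  [
    'gs',
    '-sDEVICE=pdfwrite',
    "-dPDFSETTINGS=#{preset}",
    '-dNOPAUSE',
    '-dBATCH',
    '-dQUIET',
    "-sOutputFile=#{output}",
    input
  ]
end

command = ghostscript_compress_command('foobar.pdf', 'foobar_small.pdf', preset: '/screen')
puts command.join(' ')

# To actually run it (requires ghostscript to be installed):
#   system(*command)
```

Passing the arguments as an array to system() avoids shell quoting problems with file names that contain spaces.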
+class **PdfParadise::CompressThisPdfFile** can be of
+help here. Simply pass, as argument to .new(), the path
+of the local .pdf to that class.
+You can also use a toplevel method if you'd like to:
+    require 'pdf_paradise'
+    PdfParadise.compress_this_pdf_file
+    PdfParadise.compress_this_pdf_file('/foobar.pdf')
+The variant using hexapdf is called:
+    PdfParadise.compress_via_pdf
+    PdfParadise.compress_via_pdf('foobar.pdf')
+The API name may change at a later point in time; perhaps
+we will just add a toplevel API called **PdfParadise.compress()**,
+but for the time being the above APIs will be retained as they
+are.
+## Storing all open .pdf files in a yaml file
+In **February 2022** the yaml file working_on_these_pdf_files.yml
+was added at:
+    pdf_paradise/yaml/working_on_these_pdf_files.yml
+The idea here is that this yaml-file retains the local path
+to any .pdf file that the user (in this case me) is working
+on, aka reading right now.
+I needed this because I tend to work through .pdf files and
+remove page after page when I read it. The idea is that
+I do not lose that information when I reboot my computer
+or when said computer crashes; I needed to make this
+persistent information.
+Why is this yaml file part of the pdf_paradise gem, though?
+This is mostly due to convenience. I wanted to have this
+available in one of my ruby gems by default. In the long
+run I will add code that allows other users to adjust
+this to their own use case (and perhaps in their home
+directory rather than store this in the gem itself). As
+of February 2022 code for the latter is currently not
+part of the gem, but I may add code for this - either
+in the **pdf_paradise** gem or the **roebe** gem.
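The basic idea of keeping such a list in a yaml file can be sketched with ruby's standard library alone. This is not the code shipped in the gem: the method names `store_open_pdf_files` and `load_open_pdf_files` and the example paths are made up, and how the currently open .pdf files are detected is not covered here.

```ruby
require 'yaml'
require 'tmpdir'

# Write the given paths into a yaml file, without duplicates.
# (Hypothetical helper; not part of the pdf_paradise API.)
def store_open_pdf_files(paths, yaml_file)
  File.write(yaml_file, paths.map(&:to_s).uniq.to_yaml)
end

# Read the stored paths back; an absent or empty file yields [].
def load_open_pdf_files(yaml_file)
  return [] unless File.exist?(yaml_file)

  YAML.load_file(yaml_file) || []
end

# A temporary directory is used so that the example leaves no files behind.
Dir.mktmpdir do |dir|
  yaml_file = File.join(dir, 'working_on_these_pdf_files.yml')
  store_open_pdf_files(['/home/user/foobar.pdf', '/home/user/manual.pdf'], yaml_file)
  puts load_open_pdf_files(yaml_file)
end
```

A per-user location, such as a file in the home directory, could be passed in as yaml_file in the same way.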
+## Splitting a single pdf file into several individual .pdf files
+You can use the following toplevel API to split up a single
+.pdf file into several .pdf files:
+    PdfParadise.burst(ARGV)
+    PdfParadise.burst('foobar.pdf')
+## Merging pdf files
+<b>class PdfParadise::MergePdf.new(ARGV)</b> can be used for
+<b>merging .pdf files</b>. This functionality depends on
+external software, so you have to install this first.
+Currently <b>ghostscript</b> and <b>hexapdf</b> can be used for
+the <b>merging</b> step.
+Examples for how to use either of these two variants, as
+far as <b>class PdfParadise::MergePdf</b> is concerned,
+follow next:
+    mergepdf one.pdf two.pdf --use-ghostscript
+    mergepdf one.pdf two.pdf --use-hexapdf
+(The two leading hyphens, --, are mandatory for commandline arguments
+right now; otherwise the argument is assumed to be a locally existing
+.pdf file.)
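The rule about the two leading hyphens can be illustrated as follows. This is a sketch of the rule as described in this README, not the actual argument handling of class PdfParadise::MergePdf; the method name `split_merge_arguments` is hypothetical.

```ruby
# Separate commandline options (starting with --) from file names.
# Everything that does not start with -- is treated as a .pdf file.
# (Hypothetical helper; not part of the pdf_paradise API.)
def split_merge_arguments(arguments)
  options, files = arguments.partition { |argument| argument.start_with?('--') }
  { options: options, files: files }
end

result = split_merge_arguments(%w[one.pdf two.pdf --use-hexapdf])
puts result[:files].inspect   # prints ["one.pdf", "two.pdf"]
puts result[:options].inspect # prints ["--use-hexapdf"]
```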
+If you need to do this from within ruby code, consider
+using the following code:
+    require 'pdf_paradise'
+    merge_pdf = PdfParadise::MergePdf.new('one.pdf two.pdf')
+    merge_pdf.feedback_where_it_is_stored # Call it manually.
+## Combining individual pages from .pdf files into a new .pdf file via class PdfParadise::CombineThesePdfPages
+class **PdfParadise::CombineThesePdfPages** can be used to
+extract individual pdf pages from a given .pdf file and
+combine these into a new .pdf file.
+There is also an executable at **bin/combine_these_pdf_pages**
+which can be used on the commandline.
+This functionality depends on the **hexapdf** gem.
+Usage example:
+    combine_these_pdf_pages foobar.pdf 3,4,5
+This would retain the pages at 3, 4 and 5 and create a new
+.pdf file.
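A comma-separated page list such as **3,4,5** can be turned into integers with a few lines of ruby. The sketch below is an assumption about how such input may be handled and is not taken from class PdfParadise::CombineThesePdfPages; the method name `parse_page_list` is made up.

```ruby
# Turn a page list such as "3,4,5" into an array of integers.
# Raises ArgumentError for input that is not a positive page number.
# (Hypothetical helper; not part of the pdf_paradise API.)
def parse_page_list(input)
  pages = input.split(',').map { |token| Integer(token.strip, 10) }
  raise ArgumentError, 'page numbers start at 1' if pages.any? { |page| page < 1 }

  pages
end

puts parse_page_list('3,4,5').inspect # prints [3, 4, 5]
```

Integer() is used rather than to_i so that a typo such as "3,x,5" is reported instead of silently becoming page 0.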
+## Extracting all images from a .pdf file
+If you make use of <b>poppler</b> then you can extract
+all images from a given .pdf file.
+A small libui-GUI was added for this functionality - this
+is mostly for quick demo purposes. It does not work extremely
+well.
+On IceWM it looks like this right now:
+<img src="https://i.imgur.com/QXelVyy.png" style="margin:1em">
+Not pretty, but it took only about 20 minutes to write this.
+<b>pdfimages</b> from poppler must be installed. On Windows
+you can probably download an executable for poppler here:
+    https://blog.alivate.com.au/poppler-windows/
+I tested whether the above executables work on windows, and
+indeed, they still work fine. I also tested the libui
+variant on windows, and it works. The code is a bit
+brittle, so use with care, but I was able to use it
+successfully in <b>August 2022</b> to extract all images
+from a given .pdf file. At a later time I may add a
+to-image converter via libui, probably in the other
+gem called image_paradise. Stay tuned in this regard.
+To start the libui wrapper from the commandline, you can
+use the following:
+    /usr/bin/pdf_paradise --libui
+    bin/pdf_paradise --libui
+    pdf_paradise --libui # This variant should work, or try the other
+                         # variants; it is stored in bin/pdf_paradise
+                         # of this gem
+## Converting .jpg files to .pdf files
+If you have a use case to convert several .jpg files into .pdf
+files then the following commandline example should be
+helpful:
+    convert /path/to/image foobar.pdf
+    convert *.jpg foobar.pdf
+Note that this requires **ImageMagick**. **ImageMagick** is
+not always perfect; it has a few problems, unfortunately.
+For instance, in <b>April 2022</b> when I tried the above,
+the image was repeated three times on the x-axis. I do not
+know why, but that makes **absolutely no sense**. It is just
+a single image, so why is the resulting .pdf file repeated
+three times? Perhaps imagemagick's **convert** tool does
+this automatically, but then I question the default behaviour -
+**it makes no sense** for the use case I have. One image
+should be one image, not three images or fifty images.
+In the event that **ImageMagick** does not work very well
+for your use case, consider using another software suite,
+such as **img2pdf**.
+The syntax for **img2pdf** goes something like this:
+    img2pdf -o document.pdf *jpg
+I liked this, so in **April 2022** this was added to
+**ImageParadise**. The API for this is as follows:
+    ImageParadise.img2pdf('*.jpg') # If a '*' is part of the input Dir[] will be used.
+As that functionality may be useful on the commandline
+as well, an executable has been added at
+**bin/imageparadise_img2pdf**. Simply pass the image
+files that you want to convert.
+Usage example:
+    imageparadise_img2pdf *jpg
+If you need the images in a particular order, you may have
+to list the image files explicitly in that order, e. g. by
+the path to each of them.
+So for instance:
+    imageparadise_img2pdf image3.jpg image1.jpg image2.png
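The difference between a glob pattern and an explicit, ordered list of files can be sketched as follows. This is not the code behind ImageParadise.img2pdf; the method name `expand_image_arguments` is hypothetical, and sorting the glob results alphabetically is an assumption made for this example.

```ruby
# Expand arguments that contain a '*' via Dir[]; keep all other
# arguments exactly in the order in which they were given.
# (Hypothetical helper; not part of the ImageParadise API.)
def expand_image_arguments(arguments)
  arguments.flat_map do |argument|
    argument.include?('*') ? Dir[argument].sort : [argument]
  end
end

# Explicit file names keep their order:
puts expand_image_arguments(%w[image3.jpg image1.jpg]).inspect
# prints ["image3.jpg", "image1.jpg"]
```

The resulting list could then be handed to img2pdf in that order.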
+The only drawback I have found with <b>img2pdf</b> so far is
+that you can not easily add text to an image. This makes it
+hard to identify which image is named how. A workaround for
+this is to embed the filename into the image itself, e. g.
+create temporary images, and then pack them together via
+<b>img2pdf</b>.
+ADD_CONTACT_INFORMATION

data/doc/todo/todo.md ADDED

@@ -0,0 +1,7 @@
+- Add a converter-GUI.
+  From .docx to .pdf via libreoffice.
+    ^^^^ support this via that GUI.
+         ^^^ yeah
+  ^^ this works but has to be polished still.