PyPI - pdfalyzer - Versions diffs - 1.14.10__tar.gz → 1.15.0__tar.gz - Mend

pdfalyzer 1.14.10.tar.gz → 1.15.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of pdfalyzer might be problematic.

Files changed (45)

{pdfalyzer-1.14.10 → pdfalyzer-1.15.0}/CHANGELOG.md RENAMED

@@ -1,5 +1,9 @@
 # NEXT RELEASE
+# 1.15.0
+* Add `combine_pdfs` command line script to merge a bunch of PDFs into one
+* Remove unused `Deprecated` dependency
 ### 1.14.10
 * Add `malware_MaldocinPDF` YARA rule

{pdfalyzer-1.14.10 → pdfalyzer-1.15.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: pdfalyzer
-Version: 1.14.10
+Version: 1.15.0
 Summary: A PDF analysis toolkit. Scan a PDF with relevant YARA rules, visualize its inner tree-like data structure in living color (lots of colors), force decodes of suspicious font binaries, and more.
 Home-page: https://github.com/michelcrypt4d4mus/pdfalyzer
 License: GPL-3.0-or-later
@@ -16,7 +16,6 @@ Classifier: Programming Language :: Python :: 3.11
 Classifier: Topic :: Artistic Software
 Classifier: Topic :: Scientific/Engineering :: Visualization
 Classifier: Topic :: Security
-Requires-Dist: Deprecated (>=1.2.13,<2.0.0)
 Requires-Dist: PyPDF2 (>=2.10,<3.0)
 Requires-Dist: anytree (>=2.8,<3.0)
 Requires-Dist: chardet (>=5.0.0,<6.0.0)
@@ -63,25 +62,32 @@ If you're looking for one of these things this may be the tool for you.
 ### What It Don't Do
 This tool is mostly for examining/working with a PDF's data and logical structure. As such it doesn't have much to offer as far as extracting text, rendering[^3], writing, etc. etc.
+-------------
 # Installation
-Installation with [pipx](https://pypa.github.io/pipx/)[^4] is preferred though `pip3` should also work.
+Installation with [pipx](https://pypa.github.io/pipx/)[^4] is preferred though `pip3` / `pip` should also work.
 ```sh
 pipx install pdfalyzer
 ```
 See [PyPDF2 installation notes](https://github.com/py-pdf/PyPDF2#installation) about `PyCryptodome` if you plan to `pdfalyze` any files that use AES encryption.
-### Troubleshooting The Installation
+If you are on macOS someone out there was kind enough to make [The Pdfalyzer available via homebrew](https://formulae.brew.sh/formula/pdfalyzer) so `brew install pdfalyzer` should work.
+### Troubleshooting
 1. If you used `pip3` instead of `pipx` and have an issue you should try to install with `pipx`.
 1. If you run into an issue about missing YARA try to install [yara-python](https://pypi.org/project/yara-python/).
 1. If you encounter an error building the python `cryptography` package check your `pip` version (`pip --version`). If it's less than 22.0, upgrade `pip` with `pip install --upgrade pip`.
+1. If you get a YARA internal error number you can look up what it actually means [here](https://github.com/VirusTotal/yara/blob/master/libyara/include/yara/error.h).
+1. If you can't get the `pdfalyze` command to work try `python -m pdfalyzer`. It's an equivalent but more portable version of the same command that does not rely on your python script paths being set up in a sane way.
+1. While The Pdfalyzer has been tested on quite a few large and very complicated PDFs there are no doubt a bunch of edge cases that will trip up the code. Sifting through the various interconnected internal PDF objects and building the correct tree representation is much, much harder than it should be and requires multiple scans and a little bit of educated guessing. If a PDF fails to parse and you hit an error please open [a GitHub issue](https://github.com/michelcrypt4d4mus/pdfalyzer/issues) with the compressed (`.zip`, `.gz`, whatever) PDF that is causing the problem attached (if possible) and I'll take a look when I can. I will _not_ take a look at any uncompressed PDFs due to the security risks so make sure you zip it before you ship it.
 1. On Linux if you encounter an error building `wheel` or `cffi` you may need to install some packages:
    ```bash
    sudo apt-get install build-essential libssl-dev libffi-dev rustc
    ```
-1. If you get a YARA internal error number you can look up what it actually means [here](https://github.com/VirusTotal/yara/blob/master/libyara/include/yara/error.h).
+-------------
 # Usage
@@ -92,8 +98,8 @@ Run `pdfalyze --help` to see usage instructions. As of right now these are the o
 ## Runtime Options
 If you provide none of the flags in the `ANALYSIS SELECTION` section of the `--help` then all of the analyses will be done _except_ the `--streams`.  In other words, these two commands are equivalent:
-1. `pdfalyzer lacan_buys_the_dip.pdf`
-1. `pdfalyzer lacan_buys_the_dip.pdf -d -t -r -f -y -c`
+1. `pdfalyze lacan_buys_the_dip.pdf`
+1. `pdfalyze lacan_buys_the_dip.pdf -d -t -r -f -y -c`
 The `--streams` output is the one used to hunt for patterns in the embedded bytes and can be _extremely_ verbose depending on the `--quote-char` options chosen (or not chosen) and contents of the PDF. [The Yaralyzer](https://github.com/michelcrypt4d4mus/yaralyzer) handles this task; if you want to hunt for patterns in the bytes other than bytes surrounded by backticks/frontslashes/brackets/quotes/etc. you may want to use The Yaralyzer directly. As The Yaralyzer is a prequisite for The Pdfalyzer you may already have the `yaralyze` command installed and available.
@@ -106,15 +112,11 @@ Even if you don't configure your own `.pdfalyzer` file you may still glean some
 ### Colors And Themes
 Run `pdfalyzer_show_color_theme` to see the color theme employed.
-## Guarantees
+### Guarantees
 Warnings will be printed if any PDF object ID between 1 and the `/Size` reported by the PDF itself could not be successfully placed in the tree. If you do not get any warnings then all[^2] of the inner PDF objects should be seen in the output.
-## Troubleshooting
-1. If you can't get the `pdfalyze` command to work try `python -m pdfalyzer`. It's an equivalent but more portable version of the same command that does not rely on your python script paths being set up in a sane way.
-1. While The Pdfalyzer has been tested on quite a few large and very complicated PDFs there are no doubt a bunch of edge cases that will trip up the code. If that does happen and you hit an error, please open [a GitHub issue](https://github.com/michelcrypt4d4mus/pdfalyzer/issues) with the compressed (`.zip`, `.gz`, whatever) PDF that is causing the problem attached (if possible) and I'll take a look when I can. I will _not_ take a look at any uncompressed PDFs due to the security risks so make sure you zip it before you ship it.
+## Example Usage
+[BUFFERZONE Team](https://bufferzonesecurity.com) posted [an excellent example](https://bufferzonesecurity.com/the-beginners-guide-to-adobe-pdf-malware-reverse-engineering-part-1/) of how one might use The Pdfalyzer in tandem with [Didier Stevens' PDF tools](#installing-didier-stevenss-pdf-analysis-tools) to investigate a potentially malicious PDF (archived in [the `doc/` dir in this repo](./doc/) if the link rots).
 -------------
@@ -135,6 +137,7 @@ pdfalyzer = Pdfalyzer("/path/to/the/evil_or_non_evil.pdf")
 actual_pdf_tree: PdfTreeNode = pdfalyzer.pdf_tree
 # The PdfalyzerPresenter handles formatting/prettifying output
+from pdfalyzer.output.pdfalyzer_presenter import PdfalyzerPresenter
 PdfalyzerPresenter(pdfalyzer).print_everything()
 # Iterate over all nodes in the PDF tree
@@ -164,6 +167,7 @@ for backtick_quoted_string in font.binary_scanner.extract_backtick_quoted_bytes(
     do_stuff(backtick_quoted_string)
 ```
+-------------
 # Example Output
 The Pdfalyzer can export visualizations to HTML, ANSI colored text, and SVG images using the file export functionality that comes with [Rich](https://github.com/Textualize/rich). SVGs can be turned into `png` format images with a tool like Inkscape or `cairosvg` (Inkscape works a lot better in our experience). See `pdfalyze --help` for the specifics.
@@ -188,7 +192,7 @@ This image shows a more in-depth view of of the PDF tree for the same document s
 ## Fonts
-#### **Extract character mappings from ancient Adobe font formats:** It's actually `PyPDF2` doing the lifting here but we're happy to take the credit.
+#### **Extract character mappings from ancient Adobe font formats**. It's actually `PyPDF2` doing the lifting here but we're happy to take the credit.
 ![](https://github.com/michelcrypt4d4mus/pdfalyzer/raw/master/doc/svgs/rendered_images/font_character_mapping.png)
@@ -223,8 +227,11 @@ Things like, say, a hidden binary `/F` (PDF instruction meaning "URL") followed
 ![](https://github.com/michelcrypt4d4mus/pdfalyzer/raw/master/doc/svgs/rendered_images/decoding_and_chardet_table_2.png)
+-------------
 # PDF Resources
+## Included PDF Tools
+The Pdfalyzer ships with a command line tool `combine_pdfs` that combines multiple PDFs into a single PDF. Run `combine_pdfs --help` to see the options.
 ## 3rd Party PDF Tools
 ### Installing Didier Stevens's PDF Analysis Tools
@@ -247,7 +254,7 @@ There's [a script](scripts/install_t1utils.sh) to help you install the suite if
 scripts/install_t1utils.sh
 ```
-## Documentation
+## External Documentation
 ### Official Adobe Documentation
 * [Official Adobe PDF 1.7 Specification](https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf) - Indispensable map when navigating a PDF forest.
 * [Adobe Type 1 Font Format Specification](https://adobe-type-tools.github.io/font-tech-notes/pdfs/T1_SPEC.pdf) - Official spec for Adobe's original font description language and file format. Useful if you have suspicions about malicious fonts. Type1 seems to be the attack vector of choice recently which isn't so surprising when you consider that it's a 30 year old technology and the code that renders these fonts probably hasn't been extensively tested in decades because almost no one uses them anymore outside of people who want to use them as attack vectors.
@@ -270,6 +277,8 @@ This tool was built to fill a gap in the PDF assessment landscape following [my
 Thus I felt the world might be slightly improved if I strung together a couple of more stable/well known/actively maintained open source projects ([AnyTree](https://github.com/c0fec0de/anytree), [PyPDF2](https://github.com/py-pdf/PyPDF2), [Rich](https://github.com/Textualize/rich), and [YARA](https://github.com/VirusTotal/yara-python) via [The Yaralyzer](https://github.com/michelcrypt4d4mus/yaralyzer)) into this tool.
+-------------
 # Contributing
 One easy way of contributing is to run [the script to test against all the PDFs in your `~/Documents` folder](scripts/test_against_all_pdfs_in_Documents_folder.sh) and report any issues.
@@ -290,7 +299,12 @@ These are the naming conventions at play in The Pdfalyzer code base:
 | **`indeterminate_node`** | any node whose place in the tree cannot be decided until every node has been seen |
 | **`link_node`** | nodes like `/Dest` that just contain a pointer to another node |
+### Reference
+* [`PyPDF2 2.12.0` documentation](https://pypdf2.readthedocs.io/en/2.12.0/) (latest is 4.x or something so these are the relevant docs for `pdfalyze`)
 # TODO
+* Upgrade `PyPDF` to latest and expand `combine_pdfs` compression command line option
 * Highlight decodes with a lot of Javascript keywords
 * https://github.com/mandiant/flare-floss (https://github.com/mandiant/flare-floss/releases/download/v2.1.0/floss-v2.1.0-linux.zip)
 * https://github.com/1Project/Scanr/blob/master/emulator/emulator.py

{pdfalyzer-1.14.10 → pdfalyzer-1.15.0}/README.md RENAMED

@@ -33,25 +33,32 @@ If you're looking for one of these things this may be the tool for you.
 ### What It Don't Do
 This tool is mostly for examining/working with a PDF's data and logical structure. As such it doesn't have much to offer as far as extracting text, rendering[^3], writing, etc. etc.
+-------------
 # Installation
-Installation with [pipx](https://pypa.github.io/pipx/)[^4] is preferred though `pip3` should also work.
+Installation with [pipx](https://pypa.github.io/pipx/)[^4] is preferred though `pip3` / `pip` should also work.
 ```sh
 pipx install pdfalyzer
 ```
 See [PyPDF2 installation notes](https://github.com/py-pdf/PyPDF2#installation) about `PyCryptodome` if you plan to `pdfalyze` any files that use AES encryption.
-### Troubleshooting The Installation
+If you are on macOS someone out there was kind enough to make [The Pdfalyzer available via homebrew](https://formulae.brew.sh/formula/pdfalyzer) so `brew install pdfalyzer` should work.
+### Troubleshooting
 1. If you used `pip3` instead of `pipx` and have an issue you should try to install with `pipx`.
 1. If you run into an issue about missing YARA try to install [yara-python](https://pypi.org/project/yara-python/).
 1. If you encounter an error building the python `cryptography` package check your `pip` version (`pip --version`). If it's less than 22.0, upgrade `pip` with `pip install --upgrade pip`.
+1. If you get a YARA internal error number you can look up what it actually means [here](https://github.com/VirusTotal/yara/blob/master/libyara/include/yara/error.h).
+1. If you can't get the `pdfalyze` command to work try `python -m pdfalyzer`. It's an equivalent but more portable version of the same command that does not rely on your python script paths being set up in a sane way.
+1. While The Pdfalyzer has been tested on quite a few large and very complicated PDFs there are no doubt a bunch of edge cases that will trip up the code. Sifting through the various interconnected internal PDF objects and building the correct tree representation is much, much harder than it should be and requires multiple scans and a little bit of educated guessing. If a PDF fails to parse and you hit an error please open [a GitHub issue](https://github.com/michelcrypt4d4mus/pdfalyzer/issues) with the compressed (`.zip`, `.gz`, whatever) PDF that is causing the problem attached (if possible) and I'll take a look when I can. I will _not_ take a look at any uncompressed PDFs due to the security risks so make sure you zip it before you ship it.
 1. On Linux if you encounter an error building `wheel` or `cffi` you may need to install some packages:
    ```bash
    sudo apt-get install build-essential libssl-dev libffi-dev rustc
    ```
-1. If you get a YARA internal error number you can look up what it actually means [here](https://github.com/VirusTotal/yara/blob/master/libyara/include/yara/error.h).
+-------------
 # Usage
@@ -62,8 +69,8 @@ Run `pdfalyze --help` to see usage instructions. As of right now these are the o
 ## Runtime Options
 If you provide none of the flags in the `ANALYSIS SELECTION` section of the `--help` then all of the analyses will be done _except_ the `--streams`.  In other words, these two commands are equivalent:
-1. `pdfalyzer lacan_buys_the_dip.pdf`
-1. `pdfalyzer lacan_buys_the_dip.pdf -d -t -r -f -y -c`
+1. `pdfalyze lacan_buys_the_dip.pdf`
+1. `pdfalyze lacan_buys_the_dip.pdf -d -t -r -f -y -c`
 The `--streams` output is the one used to hunt for patterns in the embedded bytes and can be _extremely_ verbose depending on the `--quote-char` options chosen (or not chosen) and contents of the PDF. [The Yaralyzer](https://github.com/michelcrypt4d4mus/yaralyzer) handles this task; if you want to hunt for patterns in the bytes other than bytes surrounded by backticks/frontslashes/brackets/quotes/etc. you may want to use The Yaralyzer directly. As The Yaralyzer is a prequisite for The Pdfalyzer you may already have the `yaralyze` command installed and available.
@@ -76,15 +83,11 @@ Even if you don't configure your own `.pdfalyzer` file you may still glean some
 ### Colors And Themes
 Run `pdfalyzer_show_color_theme` to see the color theme employed.
-## Guarantees
+### Guarantees
 Warnings will be printed if any PDF object ID between 1 and the `/Size` reported by the PDF itself could not be successfully placed in the tree. If you do not get any warnings then all[^2] of the inner PDF objects should be seen in the output.
-## Troubleshooting
-1. If you can't get the `pdfalyze` command to work try `python -m pdfalyzer`. It's an equivalent but more portable version of the same command that does not rely on your python script paths being set up in a sane way.
-1. While The Pdfalyzer has been tested on quite a few large and very complicated PDFs there are no doubt a bunch of edge cases that will trip up the code. If that does happen and you hit an error, please open [a GitHub issue](https://github.com/michelcrypt4d4mus/pdfalyzer/issues) with the compressed (`.zip`, `.gz`, whatever) PDF that is causing the problem attached (if possible) and I'll take a look when I can. I will _not_ take a look at any uncompressed PDFs due to the security risks so make sure you zip it before you ship it.
+## Example Usage
+[BUFFERZONE Team](https://bufferzonesecurity.com) posted [an excellent example](https://bufferzonesecurity.com/the-beginners-guide-to-adobe-pdf-malware-reverse-engineering-part-1/) of how one might use The Pdfalyzer in tandem with [Didier Stevens' PDF tools](#installing-didier-stevenss-pdf-analysis-tools) to investigate a potentially malicious PDF (archived in [the `doc/` dir in this repo](./doc/) if the link rots).
 -------------
@@ -105,6 +108,7 @@ pdfalyzer = Pdfalyzer("/path/to/the/evil_or_non_evil.pdf")
 actual_pdf_tree: PdfTreeNode = pdfalyzer.pdf_tree
 # The PdfalyzerPresenter handles formatting/prettifying output
+from pdfalyzer.output.pdfalyzer_presenter import PdfalyzerPresenter
 PdfalyzerPresenter(pdfalyzer).print_everything()
 # Iterate over all nodes in the PDF tree
@@ -134,6 +138,7 @@ for backtick_quoted_string in font.binary_scanner.extract_backtick_quoted_bytes(
     do_stuff(backtick_quoted_string)
 ```
+-------------
 # Example Output
 The Pdfalyzer can export visualizations to HTML, ANSI colored text, and SVG images using the file export functionality that comes with [Rich](https://github.com/Textualize/rich). SVGs can be turned into `png` format images with a tool like Inkscape or `cairosvg` (Inkscape works a lot better in our experience). See `pdfalyze --help` for the specifics.
@@ -158,7 +163,7 @@ This image shows a more in-depth view of of the PDF tree for the same document s
 ## Fonts
-#### **Extract character mappings from ancient Adobe font formats:** It's actually `PyPDF2` doing the lifting here but we're happy to take the credit.
+#### **Extract character mappings from ancient Adobe font formats**. It's actually `PyPDF2` doing the lifting here but we're happy to take the credit.
 ![](https://github.com/michelcrypt4d4mus/pdfalyzer/raw/master/doc/svgs/rendered_images/font_character_mapping.png)
@@ -193,8 +198,11 @@ Things like, say, a hidden binary `/F` (PDF instruction meaning "URL") followed
 ![](https://github.com/michelcrypt4d4mus/pdfalyzer/raw/master/doc/svgs/rendered_images/decoding_and_chardet_table_2.png)
+-------------
 # PDF Resources
+## Included PDF Tools
+The Pdfalyzer ships with a command line tool `combine_pdfs` that combines multiple PDFs into a single PDF. Run `combine_pdfs --help` to see the options.
 ## 3rd Party PDF Tools
 ### Installing Didier Stevens's PDF Analysis Tools
@@ -217,7 +225,7 @@ There's [a script](scripts/install_t1utils.sh) to help you install the suite if
 scripts/install_t1utils.sh
 ```
-## Documentation
+## External Documentation
 ### Official Adobe Documentation
 * [Official Adobe PDF 1.7 Specification](https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf) - Indispensable map when navigating a PDF forest.
 * [Adobe Type 1 Font Format Specification](https://adobe-type-tools.github.io/font-tech-notes/pdfs/T1_SPEC.pdf) - Official spec for Adobe's original font description language and file format. Useful if you have suspicions about malicious fonts. Type1 seems to be the attack vector of choice recently which isn't so surprising when you consider that it's a 30 year old technology and the code that renders these fonts probably hasn't been extensively tested in decades because almost no one uses them anymore outside of people who want to use them as attack vectors.
@@ -240,6 +248,8 @@ This tool was built to fill a gap in the PDF assessment landscape following [my
 Thus I felt the world might be slightly improved if I strung together a couple of more stable/well known/actively maintained open source projects ([AnyTree](https://github.com/c0fec0de/anytree), [PyPDF2](https://github.com/py-pdf/PyPDF2), [Rich](https://github.com/Textualize/rich), and [YARA](https://github.com/VirusTotal/yara-python) via [The Yaralyzer](https://github.com/michelcrypt4d4mus/yaralyzer)) into this tool.
+-------------
 # Contributing
 One easy way of contributing is to run [the script to test against all the PDFs in your `~/Documents` folder](scripts/test_against_all_pdfs_in_Documents_folder.sh) and report any issues.
@@ -260,7 +270,12 @@ These are the naming conventions at play in The Pdfalyzer code base:
 | **`indeterminate_node`** | any node whose place in the tree cannot be decided until every node has been seen |
 | **`link_node`** | nodes like `/Dest` that just contain a pointer to another node |
+### Reference
+* [`PyPDF2 2.12.0` documentation](https://pypdf2.readthedocs.io/en/2.12.0/) (latest is 4.x or something so these are the relevant docs for `pdfalyze`)
 # TODO
+* Upgrade `PyPDF` to latest and expand `combine_pdfs` compression command line option
 * Highlight decodes with a lot of Javascript keywords
 * https://github.com/mandiant/flare-floss (https://github.com/mandiant/flare-floss/releases/download/v2.1.0/floss-v2.1.0-linux.zip)
 * https://github.com/1Project/Scanr/blob/master/emulator/emulator.py

{pdfalyzer-1.14.10 → pdfalyzer-1.15.0}/pdfalyzer/__init__.py RENAMED

@@ -1,10 +1,14 @@
 import code
-import logging
 import sys
 from os import environ, getcwd, path
+from pathlib import Path
 from dotenv import load_dotenv
+# TODO: PdfMerger is deprecated in favor of PdfWriter at v3.9.1 (see https://pypdf.readthedocs.io/en/latest/user/merging-pdfs.html#basic-example)
+from PyPDF2 import PdfMerger
+from PyPDF2.errors import PdfReadError
+# Should be first local import before load_dotenv() (or at least I think it needs to come first)
 from pdfalyzer.config import PdfalyzerConfig
 # load_dotenv() should be called as soon as possible (before parsing local classes) but not for pytest
@@ -16,16 +20,19 @@ if not environ.get('INVOKED_BY_PYTEST', False):
 from rich.columns import Columns
 from rich.panel import Panel
+from rich.text import Text
 from yaralyzer.helpers.rich_text_helper import prefix_with_plain_text_obj
 from yaralyzer.output.file_export import invoke_rich_export
 from yaralyzer.output.rich_console import console
 from yaralyzer.util.logging import log, log_and_print
+from pdfalyzer.helpers.filesystem_helper import file_size_in_mb, set_max_open_files
+from pdfalyzer.helpers.rich_text_helper import print_highlighted
 from pdfalyzer.output.pdfalyzer_presenter import PdfalyzerPresenter
 from pdfalyzer.output.styles.rich_theme import PDFALYZER_THEME_DICT
 from pdfalyzer.pdfalyzer import Pdfalyzer
+from pdfalyzer.util.argument_parser import ask_to_proceed, output_sections, parse_arguments, parse_combine_pdfs_args
 from pdfalyzer.util.pdf_parser_manager import PdfParserManager
-from pdfalyzer.util.argument_parser import output_sections, parse_arguments
 # For the table shown by running pdfalyzer_show_color_theme
 MAX_THEME_COL_SIZE = 35
@@ -82,3 +89,36 @@ def pdfalyzer_show_color_theme() -> None:
     ]
     console.print(Columns(colors, column_first=True, padding=(0,3)))
+def combine_pdfs():
+    """Utility method to combine multiple PDFs into one. Invocable with 'combine_pdfs PDF1 [PDF2...]'."""
+    args = parse_combine_pdfs_args()
+    set_max_open_files(args.number_of_pdfs)
+    merger = PdfMerger()
+    for pdf in args.pdfs:
+        try:
+            print_highlighted(f"  -> Merging '{pdf}'...", style='dim')
+            merger.append(pdf)
+        except PdfReadError as e:
+            print_highlighted(f"      -> Failed to merge '{pdf}'! {e}", style='red')
+            ask_to_proceed()
+    if args.compression_level == 0:
+        print_highlighted("\nSkipping content stream compression...")
+    else:
+        print_highlighted(f"\nCompressing content streams with zlib level {args.compression_level}...")
+        for i, page in enumerate(merger.pages):
+            # TODO: enable image quality reduction + zlib level once PyPDF is upgraded to 4.x and option is available
+            # See https://pypdf.readthedocs.io/en/latest/user/file-size.html#reducing-image-quality
+            print_highlighted(f"  -> Compressing page {i + 1}...", style='dim')
+            page.pagedata.compress_content_streams()  # This is CPU intensive!
+    print_highlighted(f"\nWriting '{args.output_file}'...", style='cyan')
+    merger.write(args.output_file)
+    merger.close()
+    txt = Text('').append(f"  -> Wrote ")
+    txt.append(str(file_size_in_mb(args.output_file)), style='cyan').append(" megabytes\n")
+    print_highlighted(txt)
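
The `compression_level` branch in `combine_pdfs` above treats level 0 as "skip" and any other value as a zlib level. As a rough illustration of what a zlib level means, independent of PyPDF2, here is a minimal standard-library sketch. `fake_content_stream` and `compressed_sizes` are made-up names for this example and are not part of pdfalyzer.

```python
import zlib

# Hypothetical stand-in for a PDF page content stream (repetitive, so it compresses well).
fake_content_stream = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET\n" * 500


def compressed_sizes(data: bytes, levels=(0, 1, 6, 9)) -> dict:
    """Map each zlib level to the size in bytes of the compressed output."""
    return {level: len(zlib.compress(data, level)) for level in levels}


sizes = compressed_sizes(fake_content_stream)
for level, size in sizes.items():
    print(f"zlib level {level}: {size} bytes (original: {len(fake_content_stream)} bytes)")

# Compression is lossless at every level: decompressing returns the original bytes.
assert zlib.decompress(zlib.compress(fake_content_stream, 9)) == fake_content_stream
```

Level 0 stores the data uncompressed plus framing overhead, so its output is slightly larger than the input. Per the TODO in the hunk above, the zlib level is not yet passed through to PyPDF2, so this only illustrates the concept and not what `compress_content_streams()` does internally.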

{pdfalyzer-1.14.10 → pdfalyzer-1.15.0}/pdfalyzer/binary/binary_scanner.py RENAMED

@@ -20,9 +20,8 @@ from yaralyzer.util.logging import log
 from pdfalyzer.config import PdfalyzerConfig
 from pdfalyzer.decorators.pdf_tree_node import PdfTreeNode
-from pdfalyzer.detection.constants.binary_regexes import (BACKTICK,
-     DANGEROUS_PDF_KEYS_TO_HUNT_ONLY_IN_FONTS, DANGEROUS_STRINGS, FRONTSLASH, GUILLEMET,
-     QUOTE_PATTERNS)
+from pdfalyzer.detection.constants.binary_regexes import (BACKTICK, DANGEROUS_PDF_KEYS_TO_HUNT_ONLY_IN_FONTS,
+     DANGEROUS_PDF_KEYS_TO_HUNT_ONLY_IN_FONTS, DANGEROUS_STRINGS, FRONTSLASH, GUILLEMET, QUOTE_PATTERNS)
 from pdfalyzer.helpers.string_helper import generate_hyphen_line
 from pdfalyzer.output.layout import print_headline_panel, print_section_sub_subheader
 from pdfalyzer.util.adobe_strings import CONTENTS, CURRENTFILE_EEXEC, FONT_FILE_KEYS

{pdfalyzer-1.14.10 → pdfalyzer-1.15.0}/pdfalyzer/decorators/document_model_printer.py RENAMED

@@ -1,5 +1,5 @@
 """
-Deprecated old, pre-tree, more rawformat reader.
+Deprecated old, pre-tree, more rawformat reader. Only used for debugging these days.
 """
 from io import StringIO

{pdfalyzer-1.14.10 → pdfalyzer-1.15.0}/pdfalyzer/decorators/indeterminate_node.py RENAMED

@@ -22,6 +22,7 @@ class IndeterminateNode:
         self.node = node
     def place_node(self) -> None:
+        """Attempt to find the appropriate parent/child relationships for this node."""
         log.debug(f"Attempting to resolve indeterminate node: {self.node}")
         if self._check_for_common_ancestor():
@@ -34,7 +35,7 @@ class IndeterminateNode:
         parent = self.find_node_with_most_descendants()
         parent_str = escape(str(parent))
-        # Any branch that doesn't return or raise will end with parent being node w/most descendants
+        # Any if/else branch that doesn't return or raise will decide parent to be the node w/most descendants
         if self._has_only_similar_relationships():
             log.info(f"  Fuzzy match addresses or labels; placing under node w/most descendants: {parent_str}")
         elif self._make_parent_if_one_remains(lambda r: r.from_node.type in PAGE_AND_PAGES):
@@ -43,7 +44,8 @@ class IndeterminateNode:
         elif self.node.type == COLOR_SPACE:
             log.info(f"  Color space node found; placing under node w/most descendants: {parent_str}")
         elif set(self.node.unique_labels_of_referring_nodes()) == set(PAGE_AND_PAGES):
-            # An edge case seen in the wild involving a PDF that doesn't conform to the PDF spec
+            # Handle an edge case seen in the wild involving a PDF that doesn't conform to the PDF spec
+            # in a particular way.
             log.warning(f"  {self.node} seems to be a loose {PAGE}. Linking to first {PAGES}")
             pages_nodes = [n for n in self.node.nodes_with_here_references() if self.node.type == PAGES]
             self.node.set_parent(self.find_node_with_most_descendants(pages_nodes))
@@ -63,7 +65,7 @@ class IndeterminateNode:
     def _has_only_similar_relationships(self) -> bool:
         """
         Returns True if all the nodes w/references to this one have the same type or if all the
-        reference_keys that point to this node are the same
+        reference_keys that point to this node are the same.
         """
         unique_refferer_labels = self.node.unique_labels_of_referring_nodes()
         unique_addresses = self.node.unique_addresses()
@@ -99,7 +101,7 @@ class IndeterminateNode:
                 log.info(f"{possible_ancestor} is the common ancestor of {other_nodes_str}")
                 return possible_ancestor
-    def _check_single_relation_rules(self):
+    def _check_single_relation_rules(self) -> bool:
         """Check various ways of narrowing down the list of potential parents to one node."""
         if self._make_parent_if_one_remains(lambda r: r.reference_key in [K, KIDS]):
             log.info("  Found single explicit /K or /Kids ref")
@@ -111,7 +113,7 @@ class IndeterminateNode:
         return True
     def _make_parent_if_one_remains(self, is_possible_parent: Callable) -> bool:
-        """Relationships are filtered w/filter_parents(). If only one remains it's made the parent"""
+        """Relationships are filtered w/is_possible_parent(); if there's only one possibility it's made the parent."""
         remaining_relationships = [r for r in self.node.non_tree_relationships if is_possible_parent(r)]
         if len(remaining_relationships) == 1:
@@ -123,6 +125,6 @@ class IndeterminateNode:
 def find_node_with_lowest_id(list_of_nodes: List[PdfTreeNode]) -> PdfTreeNode:
-    """Find node in list_of_nodes_with_lowest ID"""
+    """Find node in list_of_nodes_with_lowest ID."""
     lowest_idnum = min([n.idnum for n in list_of_nodes])
     return next(n for n in list_of_nodes if n.idnum == lowest_idnum)
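
The reworded docstring for `_make_parent_if_one_remains` describes a simple pattern: filter the candidate relationships with a predicate and accept a parent only when exactly one candidate survives. A minimal sketch of that pattern follows. `FakeRelationship` and `single_match` are hypothetical stand-ins written for this example, not pdfalyzer classes or functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class FakeRelationship:
    """Hypothetical stand-in for a relationship between two PDF nodes."""
    from_node: str
    reference_key: str


def single_match(
    relationships: List[FakeRelationship],
    is_possible_parent: Callable[[FakeRelationship], bool],
) -> Optional[FakeRelationship]:
    """Return the only relationship passing the predicate, or None if zero or several pass."""
    remaining = [r for r in relationships if is_possible_parent(r)]
    return remaining[0] if len(remaining) == 1 else None


relationships = [
    FakeRelationship(from_node="Pages", reference_key="/Kids"),
    FakeRelationship(from_node="Annot", reference_key="/Dest"),
    FakeRelationship(from_node="Outline", reference_key="/Dest"),
]

# Exactly one relationship uses /K or /Kids, so it is an unambiguous parent candidate.
kids_parent = single_match(relationships, lambda r: r.reference_key in ["/K", "/Kids"])

# Two relationships use /Dest, so this predicate cannot decide and yields None.
dest_parent = single_match(relationships, lambda r: r.reference_key == "/Dest")
```

Returning nothing when more than one candidate remains lets the caller fall through to the next rule, which matches the chain of `elif` checks visible in `place_node`.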

{pdfalyzer-1.14.10 → pdfalyzer-1.15.0}/pdfalyzer/decorators/pdf_tree_node.py RENAMED

@@ -104,10 +104,11 @@ class PdfTreeNode(NodeMixin, PdfObjectProperties):
             self.non_tree_relationships.remove(relationship)
     def nodes_with_here_references(self) -> List['PdfTreeNode']:
-        """Return a list of nodes that contain this nodes PDF object as an IndirectObject reference."""
+        """Return a list of nodes that contain this node's PDF object as an IndirectObject reference."""
         return [r.from_node for r in self.non_tree_relationships if r.from_node]
     def non_tree_relationship_count(self) -> int:
+        """Number of non parent/child relationships containing this node."""
         return len(self.non_tree_relationships)
     def unique_addresses(self) -> List[str]:
@@ -128,7 +129,7 @@ class PdfTreeNode(NodeMixin, PdfObjectProperties):
         return isinstance(self.obj, StreamObject)
     def tree_address(self, max_length: Optional[int] = DEFAULT_MAX_ADDRESS_LENGTH) -> str:
-        """Creates a string like '/Catalog/Pages/Resources[2]/Font' truncated to max_length (if given)"""
+        """Creates a string like '/Catalog/Pages/Resources[2]/Font' truncated to max_length (if given)."""
         if self.label == TRAILER:
             return '/'
         elif self.parent is None:
@@ -163,7 +164,7 @@ class PdfTreeNode(NodeMixin, PdfObjectProperties):
         else:
             address = refs_to_this_node[0].address
             # If other node's label doesn't start with a NON_STANDARD_ADDRESS string
-            #   and any of the relationships pointing at this nod use something other than a
+            #   and any of the relationships pointing at this node use something other than a
             #       NON_STANDARD_ADDRESS_NODES string to refer here, print a warning about multiple refs.
             if not (is_prefixed_by_any(from_node.label, NON_STANDARD_ADDRESS_NODES) or \
                         all(ref.address in NON_STANDARD_ADDRESS_NODES for ref in refs_to_this_node)):
@@ -193,6 +194,7 @@ class PdfTreeNode(NodeMixin, PdfObjectProperties):
         return len(self.children) + sum([child.descendants_count() for child in self.children])
     def unique_labels_of_referring_nodes(self) -> List[str]:
+        """Unique label strings of nodes referring here outside the parent/child hierarchy."""
         return list(set([r.from_node.label for r in self.non_tree_relationships]))
     def print_non_tree_relationships(self) -> None:

{pdfalyzer-1.14.10 → pdfalyzer-1.15.0}/pdfalyzer/detection/constants/binary_regexes.py RENAMED Viewed

@@ -1,13 +1,7 @@
 """
 Configuration of what to scan for in binary data. Regexes here will be matched against binary streams
-and then force decoded
+and then force decoded.
 """
-import re
-from typing import Union
-from deprecated import deprecated
 from pdfalyzer.util.adobe_strings import DANGEROUS_PDF_KEYS
 DANGEROUS_JAVASCRIPT_INSTRUCTIONS = ['eval']

pdfalyzer-1.15.0/pdfalyzer/helpers/filesystem_helper.py ADDED Viewed

@@ -0,0 +1,102 @@
+"""
+Some helpers for stuff with the local filesystem.
+"""
+import re
+from pathlib import Path
+from typing import Union
+from yaralyzer.output.rich_console import console
+from pdfalyzer.helpers.rich_text_helper import print_highlighted
+NUMBERED_PAGE_REGEX = re.compile(r'.*_(\d+)\.\w{3,4}$')
+DEFAULT_MAX_OPEN_FILES = 256  # macOS default
+OPEN_FILES_BUFFER = 30        # we might have some files open already so we need to go beyond DEFAULT_MAX_OPEN_FILES
+PDF_EXT = '.pdf'
+# TODO: this kind of type alias is not supported until Python 3.12
+#type StrOrPath = Union[str, Path]
+def with_pdf_extension(file_path: Union[str, Path]) -> str:
+    """Append '.pdf' to 'file_path' if it doesn't already end with '.pdf'."""
+    return str(file_path) + ('' if is_pdf(file_path) else PDF_EXT)
+def is_pdf(file_path: Union[str, Path]) -> bool:
+    """Return True if 'file_path' ends with '.pdf'."""
+    return str(file_path).endswith(PDF_EXT)
+def file_exists(file_path: Union[str, Path]) -> bool:
+    """Return True if 'file_path' exists."""
+    return Path(file_path).exists()
+def do_all_files_exist(file_paths: list[Union[str, Path]]) -> bool:
+    """Print an error for each element of 'file_paths' that's not a file. Return True if all 'file_paths' exist."""
+    all_files_exist = True
+    for file_path in file_paths:
+        if not file_exists(file_path):
+            console.print(f"File not found: '{file_path}'", style='error')
+            all_files_exist = False
+    return all_files_exist
+def extract_page_number(file_path: Union[str, Path]) -> Union[int, None]:
+    """Extract the page number from the end of a filename if it exists."""
+    match = NUMBERED_PAGE_REGEX.match(str(file_path))
+    return int(match.group(1)) if match else None
+def file_size_in_mb(file_path: Union[str, Path], decimal_places: int = 2) -> float:
+    """Return the size of 'file_path' in MB rounded to 'decimal_places' decimal places (default 2)."""
+    return round(Path(file_path).stat().st_size / 1024.0 / 1024.0, decimal_places)
+def set_max_open_files(num_filehandles: int = DEFAULT_MAX_OPEN_FILES) -> tuple[Union[int, None], Union[int, None]]:
+    """
+    Sets the OS level max open files to at least 'num_filehandles'. Current value can be seen with 'ulimit -a'.
+    Required when you might be opening more than DEFAULT_MAX_OPEN_FILES file handles simultaneously
+    (e.g. when you are merging a lot of small images or PDFs). Equivalent of something like
+    'default ulimit -n 1024' on macOS.
+    NOTE: Does nothing on Windows (I think).
+    NOTE: This mostly came from somewhere on stackoverflow but I lost the link.
+    """
+    try:
+        import resource  # Windows doesn't have this package / doesn't need to bump up the ulimit (??)
+    except ImportError:
+        resource = None
+    if resource is None:
+        print_highlighted("No resource module; cannot set max open files on this platform...", style='yellow')
+        return (None, None)
+    elif num_filehandles <= DEFAULT_MAX_OPEN_FILES:
+        # Then the OS max open files value is already sufficient.
+        return (DEFAULT_MAX_OPEN_FILES, DEFAULT_MAX_OPEN_FILES)
+    # %% (0) what is current ulimit -n setting?
+    (soft, hard) = resource.getrlimit(resource.RLIMIT_NOFILE)
+    num_filehandles = num_filehandles + OPEN_FILES_BUFFER
+    # %% (1) increase limit (soft and even hard) if needed
+    if soft < num_filehandles:
+        soft = num_filehandles
+        hard = max(soft, hard)
+        print_highlighted(f"Increasing max open files soft & hard 'ulimit -n {soft} {hard}'...")
+        try:
+            resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
+        except (ValueError, resource.error):
+            try:
+                hard = soft
+                print_highlighted(f"Retrying setting max open files (soft, hard)=({soft}, {hard})", style='yellow')
+                resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
+            except Exception:
+                print_highlighted('Failed to set max open files / ulimit, giving up!', style='error')
+                soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
+    return (soft, hard)
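
For readers unfamiliar with the `resource` calls used by `set_max_open_files` above, here is a minimal sketch of the same get/set pattern. It is a more conservative variation, not the release's code: the function name `raise_soft_open_file_limit` is hypothetical, and unlike the diff it only raises the soft limit and never attempts to raise the hard limit.

```python
from typing import Optional, Tuple

try:
    import resource  # POSIX only; not available on Windows
except ImportError:
    resource = None


def raise_soft_open_file_limit(wanted: int) -> Optional[Tuple[int, int]]:
    """Try to raise the soft RLIMIT_NOFILE to 'wanted' without touching the hard limit.

    Returns the (soft, hard) limits in effect afterwards, or None when the
    'resource' module is unavailable.
    """
    if resource is None:
        return None

    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY

    # An unprivileged process may not exceed the hard limit, so clamp to it.
    target = wanted if hard == unlimited else min(wanted, hard)

    if soft != unlimited and soft < target:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        except (ValueError, OSError):
            pass  # Keep whatever limit the OS left in place.

    # Report what the OS actually granted rather than what was requested.
    return resource.getrlimit(resource.RLIMIT_NOFILE)


limits = raise_soft_open_file_limit(512)
print(limits)
```

Re-reading the limits at the end means the caller sees the values actually in force, which matters on platforms that silently cap or reject the request.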

{pdfalyzer-1.14.10 → pdfalyzer-1.15.0}/pdfalyzer/helpers/rich_text_helper.py RENAMED Viewed

@@ -1,14 +1,26 @@
 """
 Functions for miscellaneous Rich text/string operations.
 """
+from functools import partial
 from typing import List
 from PyPDF2.generic import PdfObject
+from rich.console import Console
+from rich.highlighter import RegexHighlighter, JSONHighlighter
 from rich.text import Text
+from yaralyzer.output.rich_console import console
 from pdfalyzer.helpers.pdf_object_helper import pypdf_class_name
 from pdfalyzer.output.styles.node_colors import get_label_style, get_class_style_italic
+# Usually we use the yaralyzer console but that has no highlighter
+pdfalyzer_console = Console(color_system='256')
+def print_highlighted(msg: 'str|Text', **kwargs) -> None:  # Quoted: 'X|Y' unions need Python 3.10+ at runtime
+    """Print 'msg' with Rich highlighting."""
+    pdfalyzer_console.print(msg, highlight=True, **kwargs)
 def quoted_text(
         _string: str,

{pdfalyzer-1.14.10 → pdfalyzer-1.15.0}/pdfalyzer/util/adobe_strings.py RENAMED Viewed

@@ -79,7 +79,8 @@ XREF_STREAM     = '/XRefStm'
 FONT_LENGTHS = [f'/Length{i + 1}' for i in range(3)]
 FONT_FILE_KEYS = [FONT_FILE, FONT_FILE2, FONT_FILE3]
-# Instructions to flag when scanning stream data for malicious content.
+# Instructions to flag when scanning stream data for malicious content. The leading
+# front slash will be removed when pattern matching.
 DANGEROUS_PDF_KEYS = [
     # AA,  # AA is too generic; can't afford to remove the frontslash
     ACRO_FORM,

{pdfalyzer-1.14.10 → pdfalyzer-1.15.0}/pdfalyzer/util/argument_parser.py RENAMED Viewed

@@ -1,5 +1,5 @@
 import sys
-from argparse import ArgumentError, ArgumentParser
+from argparse import ArgumentError, ArgumentParser, Namespace
 from collections import namedtuple
 from functools import partial, update_wrapper
 from importlib.metadata import version
@@ -7,11 +7,16 @@ from os import getcwd, path
 from typing import List
 from rich_argparse_plus import RichHelpFormatterPlus
+from rich.prompt import Confirm
+from rich.text import Text
 from yaralyzer.util.argument_parser import export, parser, parse_arguments as parse_yaralyzer_args
 from yaralyzer.util.logging import log, log_and_print, log_argparse_result, log_current_config, log_invocation
 from pdfalyzer.config import ALL_STREAMS, PdfalyzerConfig
 from pdfalyzer.detection.constants.binary_regexes import QUOTE_PATTERNS
+from pdfalyzer.helpers.filesystem_helper import (do_all_files_exist, extract_page_number, file_exists, is_pdf,
+     with_pdf_extension)
+from pdfalyzer.helpers.rich_text_helper import print_highlighted
 # NamedTuple to keep our argument selection orderly
 OutputSection = namedtuple('OutputSection', ['argument', 'method'])
@@ -25,7 +30,7 @@ DESCRIPTION = "Explore PDF's inner data structure with absurdly large and in dep
 EPILOG = "Values for various config options can be set permanently by a .pdfalyzer file in your home directory; " + \
          "see the documentation for details. " + \
-         f"A registry of previous pdfalyzer invocations will be incribed to a file if the " + \
+         f"A registry of previous pdfalyzer invocations will be inscribed to a file if the " + \
          "{YaralyzerConfig.LOG_DIR_ENV_VAR} environment variable is configured."
 # Analysis selection sections
@@ -107,7 +112,9 @@ select.add_argument('--preview-stream-length',
 parser._action_groups = parser._action_groups[:2] + [parser._action_groups[-1]] + parser._action_groups[2:-1]
-# The Parsening Begins
+################################
+# Main argument parsing begins #
+################################
 def parse_arguments():
     """Parse command line args. Most settings are communicated to the app by setting env vars"""
     if '--version' in sys.argv:
@@ -175,3 +182,71 @@ def output_sections(args, pdfalyzer) -> List[OutputSection]:
 def all_sections_chosen(args):
     """Returns true if all flags are set or no flags are set."""
     return len([s for s in ALL_SECTIONS if vars(args)[s]]) == len(ALL_SECTIONS)
+###############################################
+# Separate arg parser for combine_pdfs script #
+###############################################
+combine_pdfs_parser = ArgumentParser(
+    description="Combine multiple PDFs into one.",
+    epilog="If all PDFs end in a number (e.g. 'xyz_1.pdf', 'xyz_2.pdf', etc.), the files are sorted as if those" \
+           " numbers were page numbers prior to merging.",
+    formatter_class=RichHelpFormatterPlus)
+combine_pdfs_parser.add_argument('pdfs',
+                                 help='two or more PDFs to combine',
+                                 metavar='PDF_PATH',
+                                 nargs='+')
+combine_pdfs_parser.add_argument('-c', '--compression-level',
+                                 help='zlib image compression level (0=none, max=1 until PyPDF is upgraded)',
+                                 choices=range(0, 2),
+                                 default=1,
+                                 type=int)
+combine_pdfs_parser.add_argument('-o', '--output-file',
+                                 help='path to write the combined PDFs to',
+                                 required=True)
+def parse_combine_pdfs_args() -> Namespace:
+    """Parse command line args for combine_pdfs script."""
+    args = combine_pdfs_parser.parse_args()
+    args.output_file = with_pdf_extension(args.output_file)
+    confirm_overwrite_txt = Text("Overwrite '").append(args.output_file, style='cyan').append("'?")
+    args.number_of_pdfs = len(args.pdfs)
+    if args.number_of_pdfs < 2:
+        exit_with_error("Need at least 2 PDFs to merge.")
+    elif not do_all_files_exist(args.pdfs):
+        exit_with_error()
+    elif file_exists(args.output_file) and not Confirm.ask(confirm_overwrite_txt):
+        exit_with_error()
+    if all(is_pdf(pdf) for pdf in args.pdfs):
+        if all(extract_page_number(pdf) is not None for pdf in args.pdfs):  # 'is not None' so page 0 counts
+            print_highlighted("PDFs appear to have page number suffixes so sorting numerically...")
+            args.pdfs.sort(key=lambda pdf: extract_page_number(pdf))
+        else:
+            print_highlighted("PDFs don't seem to end in page numbers so using provided order...", style='yellow')
+    else:
+        print_highlighted("WARNING: At least one of the PDF args doesn't end in '.pdf'", style='bright_yellow')
+        ask_to_proceed()
+    print_highlighted(f"\nMerging {args.number_of_pdfs} individual PDFs into '{args.output_file}'...")
+    return args
+def ask_to_proceed() -> None:
+    """Exit if user doesn't confirm they want to proceed."""
+    if not Confirm.ask(Text("Proceed anyway?")):
+        exit_with_error()
+def exit_with_error(error_message: 'str|None' = None) -> None:  # Quoted: 'X|None' needs Python 3.10+ at runtime
+    """Print 'error_message' and exit with status code 1."""
+    if error_message:
+        print_highlighted(error_message, style='bold red')
+    print_highlighted('Exiting...', style='red')
+    sys.exit(1)
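
The page-number sorting performed in `parse_combine_pdfs_args` above can be sketched in isolation as follows. The regex and `extract_page_number` are copied from the `filesystem_helper.py` hunk; `sort_by_page_number` is a hypothetical helper written for this illustration, and it tests for `None` explicitly so that a file ending in `_0` is still treated as numbered.

```python
import re
from typing import List, Optional

# Same pattern as in filesystem_helper.py: '_<digits>.<3-4 char extension>' at the end.
NUMBERED_PAGE_REGEX = re.compile(r'.*_(\d+)\.\w{3,4}$')


def extract_page_number(file_path: str) -> Optional[int]:
    """Extract the page number from the end of a filename if it exists."""
    match = NUMBERED_PAGE_REGEX.match(str(file_path))
    return int(match.group(1)) if match else None


def sort_by_page_number(pdfs: List[str]) -> List[str]:
    """Sort numerically by page suffix; keep the given order if any file lacks one."""
    page_numbers = [extract_page_number(pdf) for pdf in pdfs]

    if any(number is None for number in page_numbers):
        return list(pdfs)

    # sorted() is stable, so files with equal page numbers keep their relative order.
    pairs = sorted(zip(page_numbers, pdfs), key=lambda pair: pair[0])
    return [pdf for _, pdf in pairs]


numbered = sort_by_page_number(['scan_10.pdf', 'scan_2.pdf', 'scan_0.pdf'])
unnumbered = sort_by_page_number(['scan_2.pdf', 'cover.pdf', 'scan_1.pdf'])
print(numbered)
print(unnumbered)
```

Sorting on the extracted integer rather than the filename is what puts `scan_10.pdf` after `scan_2.pdf`; a plain string sort would order them the other way.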

{pdfalyzer-1.14.10 → pdfalyzer-1.15.0}/pdfalyzer/yara_rules/PDF.yara RENAMED Viewed

@@ -1026,7 +1026,7 @@ rule malware_MaldocinPDF {
         author         = "Yuma Masubuchi and Kota Kino"
         description    = "Search for embeddings of malicious Word files into a PDF file."
         created_date   = "2023-08-15"
-        blog_reference = "https://malware.news/t/maldoc-in-pdf-detection-bypass-by-embedding-a-malicious-word-file-into-a-pdf-file/72815"
+        blog_reference = "https://blogs.jpcert.or.jp/en/2023/08/maldocinpdf.html"
         labs_reference = "N/A"
         labs_pivot     = "N/A"
         samples        = "ef59d7038cfd565fd65bae12588810d5361df938244ebad33b71882dcf683058"

{pdfalyzer-1.14.10 → pdfalyzer-1.15.0}/pyproject.toml RENAMED Viewed

@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "pdfalyzer"
-version = "1.14.10"
+version = "1.15.0"
 description = "A PDF analysis toolkit. Scan a PDF with relevant YARA rules, visualize its inner tree-like data structure in living color (lots of colors), force decodes of suspicious font binaries, and more."
 authors = ["Michel de Cryptadamus <michel@cryptadamus.com>"]
 license = "GPL-3.0-or-later"
@@ -42,6 +42,7 @@ packages = [
 [tool.poetry.scripts]
+combine_pdfs = 'pdfalyzer:combine_pdfs'
 pdfalyze = 'pdfalyzer:pdfalyze'
 pdfalyzer_show_color_theme = 'pdfalyzer:pdfalyzer_show_color_theme'
@@ -50,7 +51,6 @@ pdfalyzer_show_color_theme = 'pdfalyzer:pdfalyzer_show_color_theme'
 python = "^3.9"
 anytree = "~=2.8"
 chardet = ">=5.0.0,<6.0.0"
-Deprecated = "^1.2.13"
 PyPDF2 = "^2.10"
 python-dotenv = "^0.21.0"
 rich = "^12.5.1"