PyPI - pdfalyzer - Versions diffs - 1.15.0__tar.gz → 1.16.0__tar.gz - Mend

pdfalyzer 1.15.0.tar.gz → 1.16.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of pdfalyzer might be problematic.

Files changed (45)

{pdfalyzer-1.15.0 → pdfalyzer-1.16.0}/CHANGELOG.md RENAMED

@@ -1,5 +1,12 @@
 # NEXT RELEASE
+# 1.16.0
+* Upgrade `PyPDF2` 2.x to `pypdf` 5.0.1 (new name, same package)
+* Add `--image-quality` option to `combine_pdfs` tool
+### 1.15.1
+* Add `--no-default-yara-rules` command line option so users can use _only_ their own custom YARA rules files if they want. Previously you could only use custom YARA rules _in addition to_ the default rules; now you can just skip the default rules.
 # 1.15.0
 * Add `combine_pdfs` command line script to merge a bunch of PDFs into one
 * Remove unused `Deprecated` dependency
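
The 1.15.1 entry above describes a flag that skips the bundled YARA rules. A minimal stand-alone sketch of that behavior with plain `argparse` follows. The names `build_parser` and `parse` are hypothetical, and making `--yara-file` repeatable is an assumption. The real package instead adds the option to yaralyzer's existing `source` argument group and reports the failure through `exit_with_error`, as the `argument_parser.py` hunk shows.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a tiny parser with only the two YARA-related options (hypothetical)."""
    parser = argparse.ArgumentParser(prog='pdfalyze-sketch')
    # Assumption: custom rules files may be passed more than once.
    parser.add_argument('--yara-file', action='append', dest='yara_rules_files', metavar='FILE',
                        help='path to a custom YARA rules file')
    parser.add_argument('--no-default-yara-rules', action='store_true',
                        help='use only the custom rules from --yara-file')
    return parser


def parse(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv and reject --no-default-yara-rules when no custom rules were given."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.no_default_yara_rules and not args.yara_rules_files:
        # parser.error() prints a usage message and exits with status 2
        parser.error("--no-default-yara-rules requires at least one --yara-file argument")
    return args
```

Calling `parse(['--yara-file', 'custom.yara', '--no-default-yara-rules'])` returns a namespace with one custom rules file, while `parse(['--no-default-yara-rules'])` exits with an error because there would be no rules left to scan with.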

{pdfalyzer-1.15.0 → pdfalyzer-1.16.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: pdfalyzer
-Version: 1.15.0
+Version: 1.16.0
 Summary: A PDF analysis toolkit. Scan a PDF with relevant YARA rules, visualize its inner tree-like data structure in living color (lots of colors), force decodes of suspicious font binaries, and more.
 Home-page: https://github.com/michelcrypt4d4mus/pdfalyzer
 License: GPL-3.0-or-later
@@ -16,9 +16,9 @@ Classifier: Programming Language :: Python :: 3.11
 Classifier: Topic :: Artistic Software
 Classifier: Topic :: Scientific/Engineering :: Visualization
 Classifier: Topic :: Security
-Requires-Dist: PyPDF2 (>=2.10,<3.0)
 Requires-Dist: anytree (>=2.8,<3.0)
 Requires-Dist: chardet (>=5.0.0,<6.0.0)
+Requires-Dist: pypdf (>=5.0.1,<6.0.0)
 Requires-Dist: python-dotenv (>=0.21.0,<0.22.0)
 Requires-Dist: rich (>=12.5.1,<13.0.0)
 Requires-Dist: rich-argparse-plus (>=0.3.1,<0.4.0)
@@ -71,7 +71,7 @@ Installation with [pipx](https://pypa.github.io/pipx/)[^4] is preferred though `
 pipx install pdfalyzer
 ```
-See [PyPDF2 installation notes](https://github.com/py-pdf/PyPDF2#installation) about `PyCryptodome` if you plan to `pdfalyze` any files that use AES encryption.
+See [PyPDF installation notes](https://github.com/py-pdf/pypdf#installation) about `PyCryptodome` if you plan to `pdfalyze` any files that use AES encryption.
 If you are on macOS someone out there was kind enough to make [The Pdfalyzer available via homebrew](https://formulae.brew.sh/formula/pdfalyzer) so `brew install pdfalyzer` should work.
@@ -123,7 +123,7 @@ Warnings will be printed if any PDF object ID between 1 and the `/Size` reported
 ## Use As A Code Library
 For info about setting up a dev environment see [Contributing](#contributing) below.
-At its core The Pdfalyzer is taking PDF internal objects gathered by [PyPDF2](https://github.com/py-pdf/PyPDF2) and wrapping them in [AnyTree](https://github.com/c0fec0de/anytree)'s `NodeMixin` class.  Given that things like searching the tree or accessing internal PDF properties will be done through those packages' code it may be helpful to review their documentation.
+At its core The Pdfalyzer is taking PDF internal objects gathered by [PyPDF](https://github.com/py-pdf/pypdf) and wrapping them in [AnyTree](https://github.com/c0fec0de/anytree)'s `NodeMixin` class.  Given that things like searching the tree or accessing internal PDF properties will be done through those packages' code it may be helpful to review their documentation.
 As far as The Pdfalyzer's unique functionality goes, [`Pdfalyzer`](pdfalyzer/pdfalyzer.py) is the class at the heart of the operation. It holds the PDF's logical tree as well as a few other data structures. Chief among these are the [`FontInfo`](pdfalyzer/font_info.py) class which pulls together various properties of a font strewn across 3 or 4 different PDF objects and the [`BinaryScanner`](pdfalyzer/binary/binary_scanner.py) class which lets you dig through the embedded streams' bytes looking for suspicious patterns.
@@ -192,7 +192,7 @@ This image shows a more in-depth view of of the PDF tree for the same document s
 ## Fonts
-#### **Extract character mappings from ancient Adobe font formats**. It's actually `PyPDF2` doing the lifting here but we're happy to take the credit.
+#### **Extract character mappings from ancient Adobe font formats**. It's actually `PyPDF` doing the lifting here but we're happy to take the credit.
 ![](https://github.com/michelcrypt4d4mus/pdfalyzer/raw/master/doc/svgs/rendered_images/font_character_mapping.png)
@@ -275,7 +275,7 @@ scripts/install_t1utils.sh
 ## Did The World Really Need Another PDF Tool?
 This tool was built to fill a gap in the PDF assessment landscape following [my own recent experience trying to find malicious content in a PDF file](https://twitter.com/Cryptadamist/status/1570167937381826560). Didier Stevens's [pdfid.py](https://github.com/DidierStevens/DidierStevensSuite/blob/master/pdfid.py) and [pdf-parser.py](https://github.com/DidierStevens/DidierStevensSuite/blob/master/pdf-parser.py) are still the best game in town when it comes to PDF analysis tools but they lack in the visualization department and also don't give you much to work with as far as giving you a data model you can write your own code around. [Peepdf](https://github.com/jesparza/peepdf) seemed promising but turned out to be in a buggy, out of date, and more or less unfixable state. And neither of them offered much in the way of tooling for embedded binary analysis.
-Thus I felt the world might be slightly improved if I strung together a couple of more stable/well known/actively maintained open source projects ([AnyTree](https://github.com/c0fec0de/anytree), [PyPDF2](https://github.com/py-pdf/PyPDF2), [Rich](https://github.com/Textualize/rich), and [YARA](https://github.com/VirusTotal/yara-python) via [The Yaralyzer](https://github.com/michelcrypt4d4mus/yaralyzer)) into this tool.
+Thus I felt the world might be slightly improved if I strung together a couple of more stable/well known/actively maintained open source projects ([AnyTree](https://github.com/c0fec0de/anytree), [PyPDF](https://github.com/py-pdf/pypdf), [Rich](https://github.com/Textualize/rich), and [YARA](https://github.com/VirusTotal/yara-python) via [The Yaralyzer](https://github.com/michelcrypt4d4mus/yaralyzer)) into this tool.
 -------------
@@ -289,7 +289,7 @@ These are the naming conventions at play in The Pdfalyzer code base:
 | Term  | Meaning |
 | ----------------- | ---------------- |
-| **`PDF Object`** | Instance of a `PyPDF2` class that represents the information stored in the PDF binary between open and close guillemet quotes (« and ») |
+| **`PDF Object`** | Instance of a `PyPDF` class that represents the information stored in the PDF binary between open and close guillemet quotes (« and ») |
 | **`reference_key`** | String found in a PDF object that names a property (e.g. `/BaseFont` or `/Subtype`) |
 | **`reference`** | Link _from_ a PDF object _to_ another node. Outward facing relationships, basically. |
 | **`address`** | `reference_key` plus a hash key or numerical array index if that's how the reference works. e.g. if node A has a reference key `/Resources` pointing to a dict `{'/Font2': [IndirectObject(55), IndirectObject(2)]}` the address of `IndirectObject(55)` from node A would be `/Resources[/Font2][0]` |
@@ -300,11 +300,10 @@ These are the naming conventions at play in The Pdfalyzer code base:
 | **`link_node`** | nodes like `/Dest` that just contain a pointer to another node |
 ### Reference
-* [`PyPDF2 2.12.0` documentation](https://pypdf2.readthedocs.io/en/2.12.0/) (latest is 4.x or something so these are the relevant docs for `pdfalyze`)
+* [`PyPDF` documentation](https://pypdf.readthedocs.io/en/stable/) (latest is 4.x or something so these are the relevant docs for `pdfalyze`)
 # TODO
-* Upgrade `PyPDF` to latest and expand `combine_pdfs` compression command line option
 * Highlight decodes with a lot of Javascript keywords
 * https://github.com/mandiant/flare-floss (https://github.com/mandiant/flare-floss/releases/download/v2.1.0/floss-v2.1.0-linux.zip)
 * https://github.com/1Project/Scanr/blob/master/emulator/emulator.py

{pdfalyzer-1.15.0 → pdfalyzer-1.16.0}/README.md RENAMED

@@ -42,7 +42,7 @@ Installation with [pipx](https://pypa.github.io/pipx/)[^4] is preferred though `
 pipx install pdfalyzer
 ```
-See [PyPDF2 installation notes](https://github.com/py-pdf/PyPDF2#installation) about `PyCryptodome` if you plan to `pdfalyze` any files that use AES encryption.
+See [PyPDF installation notes](https://github.com/py-pdf/pypdf#installation) about `PyCryptodome` if you plan to `pdfalyze` any files that use AES encryption.
 If you are on macOS someone out there was kind enough to make [The Pdfalyzer available via homebrew](https://formulae.brew.sh/formula/pdfalyzer) so `brew install pdfalyzer` should work.
@@ -94,7 +94,7 @@ Warnings will be printed if any PDF object ID between 1 and the `/Size` reported
 ## Use As A Code Library
 For info about setting up a dev environment see [Contributing](#contributing) below.
-At its core The Pdfalyzer is taking PDF internal objects gathered by [PyPDF2](https://github.com/py-pdf/PyPDF2) and wrapping them in [AnyTree](https://github.com/c0fec0de/anytree)'s `NodeMixin` class.  Given that things like searching the tree or accessing internal PDF properties will be done through those packages' code it may be helpful to review their documentation.
+At its core The Pdfalyzer is taking PDF internal objects gathered by [PyPDF](https://github.com/py-pdf/pypdf) and wrapping them in [AnyTree](https://github.com/c0fec0de/anytree)'s `NodeMixin` class.  Given that things like searching the tree or accessing internal PDF properties will be done through those packages' code it may be helpful to review their documentation.
 As far as The Pdfalyzer's unique functionality goes, [`Pdfalyzer`](pdfalyzer/pdfalyzer.py) is the class at the heart of the operation. It holds the PDF's logical tree as well as a few other data structures. Chief among these are the [`FontInfo`](pdfalyzer/font_info.py) class which pulls together various properties of a font strewn across 3 or 4 different PDF objects and the [`BinaryScanner`](pdfalyzer/binary/binary_scanner.py) class which lets you dig through the embedded streams' bytes looking for suspicious patterns.
@@ -163,7 +163,7 @@ This image shows a more in-depth view of of the PDF tree for the same document s
 ## Fonts
-#### **Extract character mappings from ancient Adobe font formats**. It's actually `PyPDF2` doing the lifting here but we're happy to take the credit.
+#### **Extract character mappings from ancient Adobe font formats**. It's actually `PyPDF` doing the lifting here but we're happy to take the credit.
 ![](https://github.com/michelcrypt4d4mus/pdfalyzer/raw/master/doc/svgs/rendered_images/font_character_mapping.png)
@@ -246,7 +246,7 @@ scripts/install_t1utils.sh
 ## Did The World Really Need Another PDF Tool?
 This tool was built to fill a gap in the PDF assessment landscape following [my own recent experience trying to find malicious content in a PDF file](https://twitter.com/Cryptadamist/status/1570167937381826560). Didier Stevens's [pdfid.py](https://github.com/DidierStevens/DidierStevensSuite/blob/master/pdfid.py) and [pdf-parser.py](https://github.com/DidierStevens/DidierStevensSuite/blob/master/pdf-parser.py) are still the best game in town when it comes to PDF analysis tools but they lack in the visualization department and also don't give you much to work with as far as giving you a data model you can write your own code around. [Peepdf](https://github.com/jesparza/peepdf) seemed promising but turned out to be in a buggy, out of date, and more or less unfixable state. And neither of them offered much in the way of tooling for embedded binary analysis.
-Thus I felt the world might be slightly improved if I strung together a couple of more stable/well known/actively maintained open source projects ([AnyTree](https://github.com/c0fec0de/anytree), [PyPDF2](https://github.com/py-pdf/PyPDF2), [Rich](https://github.com/Textualize/rich), and [YARA](https://github.com/VirusTotal/yara-python) via [The Yaralyzer](https://github.com/michelcrypt4d4mus/yaralyzer)) into this tool.
+Thus I felt the world might be slightly improved if I strung together a couple of more stable/well known/actively maintained open source projects ([AnyTree](https://github.com/c0fec0de/anytree), [PyPDF](https://github.com/py-pdf/pypdf), [Rich](https://github.com/Textualize/rich), and [YARA](https://github.com/VirusTotal/yara-python) via [The Yaralyzer](https://github.com/michelcrypt4d4mus/yaralyzer)) into this tool.
 -------------
@@ -260,7 +260,7 @@ These are the naming conventions at play in The Pdfalyzer code base:
 | Term  | Meaning |
 | ----------------- | ---------------- |
-| **`PDF Object`** | Instance of a `PyPDF2` class that represents the information stored in the PDF binary between open and close guillemet quotes (« and ») |
+| **`PDF Object`** | Instance of a `PyPDF` class that represents the information stored in the PDF binary between open and close guillemet quotes (« and ») |
 | **`reference_key`** | String found in a PDF object that names a property (e.g. `/BaseFont` or `/Subtype`) |
 | **`reference`** | Link _from_ a PDF object _to_ another node. Outward facing relationships, basically. |
 | **`address`** | `reference_key` plus a hash key or numerical array index if that's how the reference works. e.g. if node A has a reference key `/Resources` pointing to a dict `{'/Font2': [IndirectObject(55), IndirectObject(2)]}` the address of `IndirectObject(55)` from node A would be `/Resources[/Font2][0]` |
@@ -271,11 +271,10 @@ These are the naming conventions at play in The Pdfalyzer code base:
 | **`link_node`** | nodes like `/Dest` that just contain a pointer to another node |
 ### Reference
-* [`PyPDF2 2.12.0` documentation](https://pypdf2.readthedocs.io/en/2.12.0/) (latest is 4.x or something so these are the relevant docs for `pdfalyze`)
+* [`PyPDF` documentation](https://pypdf.readthedocs.io/en/stable/) (latest is 4.x or something so these are the relevant docs for `pdfalyze`)
 # TODO
-* Upgrade `PyPDF` to latest and expand `combine_pdfs` compression command line option
 * Highlight decodes with a lot of Javascript keywords
 * https://github.com/mandiant/flare-floss (https://github.com/mandiant/flare-floss/releases/download/v2.1.0/floss-v2.1.0-linux.zip)
 * https://github.com/1Project/Scanr/blob/master/emulator/emulator.py
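
The `address` convention in the README's naming table (a `reference_key` plus a hash key or array index) can be illustrated with a short sketch. This is not the package's own implementation: the function name `addresses` is hypothetical, and plain strings stand in for the `IndirectObject` instances a real PDF would contain.

```python
from typing import Any, Iterator, Tuple


def addresses(reference_key: str, value: Any) -> Iterator[Tuple[str, Any]]:
    """Yield (address, leaf) pairs for every leaf reachable from reference_key."""
    if isinstance(value, dict):
        # Dictionary keys are appended in brackets, e.g. /Resources[/Font2]
        for key, child in value.items():
            yield from addresses(f"{reference_key}[{key}]", child)
    elif isinstance(value, (list, tuple)):
        # Array positions are appended as numeric indexes, e.g. [0]
        for index, child in enumerate(value):
            yield from addresses(f"{reference_key}[{index}]", child)
    else:
        yield reference_key, value


# Same shape as the example in the table; strings replace real IndirectObjects.
resources = {'/Font2': ['IndirectObject(55)', 'IndirectObject(2)']}
found = dict(addresses('/Resources', resources))
```

With this input, `found` maps `/Resources[/Font2][0]` to the stand-in for `IndirectObject(55)`, matching the address given in the table.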

{pdfalyzer-1.15.0 → pdfalyzer-1.16.0}/pdfalyzer/__init__.py RENAMED

@@ -4,9 +4,8 @@ from os import environ, getcwd, path
 from pathlib import Path
 from dotenv import load_dotenv
-# TODO: PdfMerger is deprecated in favor of PdfWriter at v3.9.1 (see https://pypdf.readthedocs.io/en/latest/user/merging-pdfs.html#basic-example)
-from PyPDF2 import PdfMerger
-from PyPDF2.errors import PdfReadError
+from pypdf import PdfWriter
+from pypdf.errors import PdfReadError
 # Should be first local import before load_dotenv() (or at least I think it needs to come first)
 from pdfalyzer.config import PdfalyzerConfig
@@ -31,7 +30,8 @@ from pdfalyzer.helpers.rich_text_helper import print_highlighted
 from pdfalyzer.output.pdfalyzer_presenter import PdfalyzerPresenter
 from pdfalyzer.output.styles.rich_theme import PDFALYZER_THEME_DICT
 from pdfalyzer.pdfalyzer import Pdfalyzer
-from pdfalyzer.util.argument_parser import ask_to_proceed, output_sections, parse_arguments, parse_combine_pdfs_args
+from pdfalyzer.util.argument_parser import (MAX_QUALITY, ask_to_proceed, output_sections, parse_arguments,
+     parse_combine_pdfs_args)
 from pdfalyzer.util.pdf_parser_manager import PdfParserManager
 # For the table shown by running pdfalyzer_show_color_theme
@@ -51,6 +51,7 @@ def pdfalyze():
         log_and_print(f"Binary stream extraction complete, files written to '{args.output_dir}'.\nExiting.\n")
         sys.exit()
+    # The method that gets called is related to the argument name. See 'possible_output_sections' list in argument_parser.py
     # Analysis exports wrap themselves around the methods that actually generate the analyses
     for (arg, method) in output_sections(args, pdfalyzer):
         if args.output_dir:
@@ -92,10 +93,13 @@ def pdfalyzer_show_color_theme() -> None:
 def combine_pdfs():
-    """Utility method to combine multiple PDFs into one. Invocable with 'combine_pdfs PDF1 [PDF2...]'."""
+    """
+    Utility method to combine multiple PDFs into one. Invocable with 'combine_pdfs PDF1 [PDF2...]'.
+    Example: https://github.com/py-pdf/pypdf/blob/main/docs/user/merging-pdfs.md
+    """
     args = parse_combine_pdfs_args()
     set_max_open_files(args.number_of_pdfs)
-    merger = PdfMerger()
+    merger = PdfWriter()
     for pdf in args.pdfs:
         try:
@@ -105,18 +109,19 @@ def combine_pdfs():
             print_highlighted(f"      -> Failed to merge '{pdf}'! {e}", style='red')
             ask_to_proceed()
-    if args.compression_level == 0:
-        print_highlighted("\nSkipping content stream compression...")
-    else:
-        print_highlighted(f"\nCompressing content streams with zlib level {args.compression_level}...")
+    # Iterate through pages and compress, lowering image quality if requested
+    # See https://pypdf.readthedocs.io/en/latest/user/file-size.html#reducing-image-quality
+    for i, page in enumerate(merger.pages):
+        if args.image_quality < MAX_QUALITY:
+            for j, img in enumerate(page.images):
+                print_highlighted(f"  -> Reducing image #{j + 1} quality on page {i + 1} to {args.image_quality}...", style='dim')
+                img.replace(img.image, quality=args.image_quality)
-        for i, page in enumerate(merger.pages):
-            # TODO: enable image quality reduction + zlib level once PyPDF is upgraded to 4.x and option is available
-            # See https://pypdf.readthedocs.io/en/latest/user/file-size.html#reducing-image-quality
-            print_highlighted(f"  -> Compressing page {i + 1}...", style='dim')
-            page.pagedata.compress_content_streams()  # This is CPU intensive!
+        print_highlighted(f"  -> Compressing page {i + 1}...", style='dim')
+        page.compress_content_streams()  # This is CPU intensive!
     print_highlighted(f"\nWriting '{args.output_file}'...", style='cyan')
+    merger.compress_identical_objects(remove_identicals=True, remove_orphans=True)
     merger.write(args.output_file)
     merger.close()
     txt = Text('').append(f"  -> Wrote ")
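
The hunk above replaces `PdfMerger` with `PdfWriter` and adds optional image quality reduction. The same flow can be sketched without the command line and logging code. The names `merge_pdfs` and `validate_quality` are hypothetical, and `MAX_QUALITY = 100` is an assumption, since the real constant lives in `argument_parser.py` and its value is not shown in this diff. The `pypdf` calls are the ones used in the hunk. Replacing images also needs Pillow installed.

```python
from typing import Sequence

MAX_QUALITY = 100  # Assumption: the real value is defined in argument_parser.py


def validate_quality(quality: int) -> int:
    """Return quality unchanged if it lies in 1..MAX_QUALITY, else raise ValueError."""
    if not 1 <= quality <= MAX_QUALITY:
        raise ValueError(f"image quality must be between 1 and {MAX_QUALITY}, got {quality}")
    return quality


def merge_pdfs(pdf_paths: Sequence[str], output_path: str, image_quality: int = MAX_QUALITY) -> None:
    """Merge pdf_paths into output_path, optionally lowering embedded image quality."""
    # Imported lazily so this module can be loaded even where pypdf is not installed.
    from pypdf import PdfWriter

    image_quality = validate_quality(image_quality)
    writer = PdfWriter()

    for pdf_path in pdf_paths:
        writer.append(pdf_path)

    for page in writer.pages:
        if image_quality < MAX_QUALITY:
            for img in page.images:  # Requires Pillow
                img.replace(img.image, quality=image_quality)

        page.compress_content_streams()  # CPU intensive, as the original comment warns

    # Deduplicate identical objects and drop orphans before writing
    writer.compress_identical_objects(remove_identicals=True, remove_orphans=True)
    writer.write(output_path)
    writer.close()
```

A call such as `merge_pdfs(['a.pdf', 'b.pdf'], 'combined.pdf', image_quality=75)` would merge two files and re-encode their images at quality 75. Leaving `image_quality` at its default skips the image step entirely.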

{pdfalyzer-1.15.0 → pdfalyzer-1.16.0}/pdfalyzer/decorators/document_model_printer.py RENAMED

@@ -3,7 +3,7 @@ Deprecated old, pre-tree, more rawformat reader. Only used for debugging these d
 """
 from io import StringIO
-from PyPDF2.generic import ArrayObject, DictionaryObject, IndirectObject
+from pypdf.generic import ArrayObject, DictionaryObject, IndirectObject
 from rich.console import Console
 from rich.markup import escape

{pdfalyzer-1.15.0 → pdfalyzer-1.16.0}/pdfalyzer/decorators/pdf_object_properties.py RENAMED

@@ -1,9 +1,9 @@
 """
-Decorator for PyPDF2 PdfObject that extracts a couple of properties (type, label, etc).
+Decorator for PyPDF PdfObject that extracts a couple of properties (type, label, etc).
 """
 from typing import Any, List, Optional, Union
-from PyPDF2.generic import DictionaryObject, IndirectObject, NumberObject, PdfObject
+from pypdf.generic import DictionaryObject, IndirectObject, NumberObject, PdfObject
 from rich.text import Text
 from yaralyzer.util.logging import log

{pdfalyzer-1.15.0 → pdfalyzer-1.16.0}/pdfalyzer/decorators/pdf_tree_node.py RENAMED

@@ -9,8 +9,8 @@ hooks)
 from typing import Callable, List, Optional, Set
 from anytree import NodeMixin, SymlinkNode
-from PyPDF2.errors import PdfReadError
-from PyPDF2.generic import IndirectObject, PdfObject, StreamObject
+from pypdf.errors import PdfReadError
+from pypdf.generic import IndirectObject, PdfObject, StreamObject
 from rich.markup import escape
 from rich.text import Text
 from yaralyzer.output.rich_console import console
@@ -41,7 +41,7 @@ class PdfTreeNode(NodeMixin, PdfObjectProperties):
                 self.stream_data = self.obj.get_data()
                 self.stream_length = len(self.stream_data)
             except (NotImplementedError, PdfReadError) as e:
-                msg = f"PyPDF2 failed to decode stream in {self}: {e}.\n" + \
+                msg = f"PyPDF failed to decode stream in {self}: {e}.\n" + \
                        "Trees will be unaffected but scans/extractions will not be able to check this stream."
                 console.print_exception()
                 log.warning(msg)

{pdfalyzer-1.15.0 → pdfalyzer-1.16.0}/pdfalyzer/decorators/pdf_tree_verifier.py RENAMED

@@ -1,8 +1,8 @@
 """
 Verify that the PDF tree is complete/contains all the nodes in the PDF file.
 """
-from PyPDF2.errors import PdfReadError
-from PyPDF2.generic import IndirectObject, NameObject, NumberObject
+from pypdf.errors import PdfReadError
+from pypdf.generic import IndirectObject, NameObject, NumberObject
 from rich.markup import escape
 from yaralyzer.output.rich_console import console
 from yaralyzer.util.logging import log

{pdfalyzer-1.15.0 → pdfalyzer-1.16.0}/pdfalyzer/detection/yaralyzer_helper.py RENAMED

@@ -8,6 +8,8 @@ from typing import Optional, Union
 from yaralyzer.config import YaralyzerConfig
 from yaralyzer.yaralyzer import Yaralyzer
+from pdfalyzer.config import PdfalyzerConfig
 YARA_RULES_DIR = files('pdfalyzer').joinpath('yara_rules')
 YARA_RULES_FILES = [
@@ -32,8 +34,11 @@ def _build_yaralyzer(scannable: Union[bytes, str], label: Optional[str] = None)
     with as_file(YARA_RULES_DIR.joinpath(YARA_RULES_FILES[0])) as yara0:
         with as_file(YARA_RULES_DIR.joinpath(YARA_RULES_FILES[1])) as yara1:
             with as_file(YARA_RULES_DIR.joinpath(YARA_RULES_FILES[2])) as yara2:
-                rules_paths = [str(y) for y in [yara0, yara1, yara2]]
-                rules_paths += YaralyzerConfig.args.yara_rules_files or []
+                # If there is a custom yara_rules argument file use that instead of the files in the yara_rules/ dir
+                rules_paths = YaralyzerConfig.args.yara_rules_files or []
+                if not YaralyzerConfig.args.no_default_yara_rules:
+                    rules_paths += [str(y) for y in [yara0, yara1, yara2]]
                 try:
                     return Yaralyzer.for_rules_files(rules_paths, scannable, label)
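
The rule selection in this hunk (custom rules first, bundled rules appended unless `--no-default-yara-rules` is set) can be isolated into a small pure function. This is a sketch under assumptions, not the package's code: `build_rules_paths` is a hypothetical name and the file names in the usage lines are made up. The sketch copies the custom list before extending it, because `+=` on a Python list extends that list in place and would otherwise modify the caller's argument.

```python
from typing import Iterable, List, Optional


def build_rules_paths(
    custom_rules: Optional[List[str]],
    default_rules: Iterable[object],
    no_default_yara_rules: bool = False,
) -> List[str]:
    """Return custom rules followed by the default rules unless defaults are disabled."""
    # Copy so the caller's list is never mutated by the += below
    rules_paths = list(custom_rules or [])

    if not no_default_yara_rules:
        rules_paths += [str(path) for path in default_rules]

    return rules_paths


# Hypothetical file names, for illustration only
custom = ['mine.yara']
defaults = ['default_a.yara', 'default_b.yara']
combined = build_rules_paths(custom, defaults)
custom_only = build_rules_paths(custom, defaults, no_default_yara_rules=True)
```

Here `combined` holds the custom file followed by both defaults, `custom_only` holds just the custom file, and `custom` itself is left unchanged after both calls.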

{pdfalyzer-1.15.0 → pdfalyzer-1.16.0}/pdfalyzer/font_info.py RENAMED

@@ -3,8 +3,8 @@ Unify font information spread across a bunch of PdfObjects (Font, FontDescriptor
 and FontFile) into a single class.
 """
-from PyPDF2._cmap import build_char_map, prepare_cm
-from PyPDF2.generic import IndirectObject, PdfObject
+from pypdf._cmap import build_char_map, prepare_cm
+from pypdf.generic import IndirectObject, PdfObject
 from rich.text import Text
 from yaralyzer.output.rich_console import console
 from yaralyzer.util.logging import log

{pdfalyzer-1.15.0 → pdfalyzer-1.16.0}/pdfalyzer/helpers/pdf_object_helper.py RENAMED

@@ -1,9 +1,9 @@
 """
-Some methods to help with the direct manipulation/processing of PyPDF2's PdfObjects
+Some methods to help with the direct manipulation/processing of PyPDF's PdfObjects
 """
 from typing import List, Optional
-from PyPDF2.generic import IndirectObject, PdfObject
+from pypdf.generic import IndirectObject, PdfObject
 from pdfalyzer.pdf_object_relationship import PdfObjectRelationship
 from pdfalyzer.util.adobe_strings import *
@@ -24,7 +24,7 @@ def _sort_pdf_object_refs(refs: List[PdfObjectRelationship]) -> List[PdfObjectRe
 def pypdf_class_name(obj: PdfObject) -> str:
-    """Shortened name of type(obj), e.g. PyPDF2.generic._data_structures.ArrayObject becomes Array"""
+    """Shortened name of type(obj), e.g. PyPDF.generic._data_structures.ArrayObject becomes Array"""
     class_pkgs = type(obj).__name__.split('.')
     class_pkgs.reverse()
     return class_pkgs[0].removesuffix('Object')
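
What `pypdf_class_name` computes can be shown with stand-in classes. The classes below are hypothetical placeholders that only share names with `pypdf.generic` types, and `short_class_name` is a hypothetical name as well. Since `type(obj).__name__` is already the unqualified class name, the sketch omits the split and reverse steps and relies on `str.removesuffix`, which needs Python 3.9 or later.

```python
class ArrayObject:
    """Hypothetical stand-in sharing a name with pypdf.generic.ArrayObject."""


class IndirectObject:
    """Hypothetical stand-in sharing a name with pypdf.generic.IndirectObject."""


def short_class_name(obj: object) -> str:
    """Return the class name of obj without a trailing 'Object' suffix."""
    # __name__ carries no module path, so only the suffix needs trimming
    return type(obj).__name__.removesuffix('Object')


array_label = short_class_name(ArrayObject())
indirect_label = short_class_name(IndirectObject())
```

For the two stand-ins, `array_label` is `'Array'` and `indirect_label` is `'Indirect'`. A class whose name does not end in `Object` is returned unchanged.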

{pdfalyzer-1.15.0 → pdfalyzer-1.16.0}/pdfalyzer/helpers/rich_text_helper.py RENAMED

@@ -4,7 +4,7 @@ Functions for miscellaneous Rich text/string operations.
 from functools import partial
 from typing import List
-from PyPDF2.generic import PdfObject
+from pypdf.generic import PdfObject
 from rich.console import Console
 from rich.highlighter import RegexHighlighter, JSONHighlighter
 from rich.text import Text

{pdfalyzer-1.15.0 → pdfalyzer-1.16.0}/pdfalyzer/output/character_mapping.py RENAMED

@@ -12,13 +12,13 @@ from pdfalyzer.helpers.rich_text_helper import quoted_text
 from pdfalyzer.helpers.string_helper import pp
 from pdfalyzer.output.layout import print_headline_panel, subheading_width
-CHARMAP_TITLE = 'Character Mapping (As Extracted By PyPDF2)'
+CHARMAP_TITLE = 'Character Mapping (As Extracted By PyPDF)'
 CHARMAP_TITLE_PADDING = (1, 0, 0, 2)
 CHARMAP_PADDING = (0, 2, 0, 10)
 def print_character_mapping(font: 'FontInfo') -> None:
-    """Prints the character mapping extracted by PyPDF2._charmap in tidy columns"""
+    """Prints the character mapping extracted by PyPDF._charmap in tidy columns"""
     if font.character_mapping is None or len(font.character_mapping) == 0:
         log.info(f"No character map found in {font}")
         return
@@ -38,12 +38,12 @@ def print_character_mapping(font: 'FontInfo') -> None:
 def print_prepared_charmap(font: 'FontInfo'):
-    """Prints the prepared_charmap returned by PyPDF2"""
+    """Prints the prepared_charmap returned by PyPDF."""
     if font.prepared_char_map is None:
         log.info(f"No prepared_charmap found in {font}")
         return
-    headline = f"{font} Adobe PostScript charmap prepared by PyPDF2"
+    headline = f"{font} Adobe PostScript charmap prepared by PyPDF"
     print_headline_panel(headline, style='charmap.prepared_title')
     print_bytes(font.prepared_char_map, style='charmap.prepared')
     console.line()

{pdfalyzer-1.15.0 → pdfalyzer-1.16.0}/pdfalyzer/output/pdfalyzer_presenter.py RENAMED

@@ -47,7 +47,7 @@ class PdfalyzerPresenter:
     def print_document_info(self) -> None:
         """Print the embedded document info (author, timestamps, version, etc)."""
         print_section_header(f'Document Info for {self.pdfalyzer.pdf_basename}')
-        console.print(pp.pformat(self.pdfalyzer.pdf_reader.getDocumentInfo()))
+        console.print(pp.pformat(self.pdfalyzer.pdf_reader.metadata))
         console.line()
         console.print(bytes_hashes_table(self.pdfalyzer.pdf_bytes, self.pdfalyzer.pdf_basename))
         console.line()
@@ -124,7 +124,7 @@ class PdfalyzerPresenter:
                 console.print(build_decoding_stats_table(binary_scanner), justify='center')
     def print_yara_results(self) -> None:
-        """Scan the overall PDF and each individual binary stream in it with yara_rules/ files"""
+        """Scan the main PDF and each individual binary stream in it with yara_rules/*.yara files"""
         print_section_header(f"YARA Scan of PDF rules for '{self.pdfalyzer.pdf_basename}'")
         YaralyzerConfig.args.standalone_mode = True  # TODO: using 'standalone mode' like this kind of sucks

{pdfalyzer-1.15.0 → pdfalyzer-1.16.0}/pdfalyzer/output/styles/node_colors.py RENAMED

@@ -6,7 +6,7 @@ from collections import namedtuple
 from numbers import Number
 from typing import Any
-from PyPDF2.generic import (ArrayObject, ByteStringObject, EncodedStreamObject, IndirectObject,
+from pypdf.generic import (ArrayObject, ByteStringObject, EncodedStreamObject, IndirectObject,
      StreamObject, TextStringObject)
 from yaralyzer.output.rich_console import YARALYZER_THEME_DICT

{pdfalyzer-1.15.0 → pdfalyzer-1.16.0}/pdfalyzer/output/tables/pdf_node_rich_table.py RENAMED

@@ -5,7 +5,7 @@ from collections import namedtuple
 from typing import List, Optional
 from anytree import SymlinkNode
-from PyPDF2.generic import StreamObject
+from pypdf.generic import StreamObject
 from rich.markup import escape
 from rich.panel import Panel
 from rich.table import Table

{pdfalyzer-1.15.0 → pdfalyzer-1.16.0}/pdfalyzer/pdf_object_relationship.py RENAMED

@@ -3,7 +3,7 @@ Simple container class for information about a link between two PDF objects.
 """
 from typing import List, Optional, Union
-from PyPDF2.generic import IndirectObject, PdfObject
+from pypdf.generic import IndirectObject, PdfObject
 from yaralyzer.util.logging import log
 from pdfalyzer.helpers.string_helper import bracketed, is_prefixed_by_any

{pdfalyzer-1.15.0 → pdfalyzer-1.16.0}/pdfalyzer/pdfalyzer.py RENAMED

@@ -11,8 +11,8 @@ from typing import Dict, Iterator, List, Optional
 from anytree import LevelOrderIter, SymlinkNode
 from anytree.search import findall, findall_by_attr
-from PyPDF2 import PdfReader
-from PyPDF2.generic import IndirectObject
+from pypdf import PdfReader
+from pypdf.generic import IndirectObject
 from yaralyzer.helpers.file_helper import load_binary_data
 from yaralyzer.output.file_hashes_table import compute_file_hashes
 from yaralyzer.output.rich_console import console
@@ -36,7 +36,7 @@ class Pdfalyzer:
         self.pdf_basename = basename(pdf_path)
         self.pdf_bytes = load_binary_data(pdf_path)
         self.pdf_bytes_info = compute_file_hashes(self.pdf_bytes)
-        pdf_file = open(pdf_path, 'rb')  # Filehandle must be left open for PyPDF2 to perform seeks
+        pdf_file = open(pdf_path, 'rb')  # Filehandle must be left open for PyPDF to perform seeks
         self.pdf_reader = PdfReader(pdf_file)
         # Initialize tracking variables

{pdfalyzer-1.15.0 → pdfalyzer-1.16.0}/pdfalyzer/util/adobe_strings.py RENAMED

@@ -2,8 +2,8 @@
 String constants specified in the Adobe specs for PDFs, fonts, etc.
 """
-from PyPDF2.constants import (CatalogDictionary, ImageAttributes, PageAttributes,
-     PagesAttributes, Ressources as Resources)
+from pypdf.constants import (CatalogDictionary, ImageAttributes, PageAttributes,
+     PagesAttributes, Resources)
 from pdfalyzer.helpers.string_helper import is_prefixed_by_any

{pdfalyzer-1.15.0 → pdfalyzer-1.16.0}/pdfalyzer/util/argument_parser.py RENAMED

@@ -9,7 +9,7 @@ from typing import List
 from rich_argparse_plus import RichHelpFormatterPlus
 from rich.prompt import Confirm
 from rich.text import Text
-from yaralyzer.util.argument_parser import export, parser, parse_arguments as parse_yaralyzer_args
+from yaralyzer.util.argument_parser import export, parser, parse_arguments as parse_yaralyzer_args, source
 from yaralyzer.util.logging import log, log_and_print, log_argparse_result, log_current_config, log_invocation
 from pdfalyzer.config import ALL_STREAMS, PdfalyzerConfig
@@ -50,8 +50,13 @@ export.add_argument('-bin', '--extract-binary-streams',
                     const='bin',
                     help='extract all binary streams in the PDF to separate files (requires pdf-parser.py)')
+# Add one more option to the YARA rules section
+source.add_argument('--no-default-yara-rules',
+                    action='store_true',
+                    help='if --yara is selected use only custom rules from --yara-file arg and not the default included YARA rules')
-#  Note that we extend the yaralyzer's parser and export
+# Note that we extend the yaralyzer's parser and export
 parser = ArgumentParser(
     formatter_class=RichHelpFormatterPlus,
     description=DESCRIPTION,
@@ -78,7 +83,7 @@ select.add_argument('-f', '--fonts', action='store_true',
                     help="show info about fonts included character mappings for embedded font binaries")
 select.add_argument('-y', '--yara', action='store_true',
-                    help="scan the PDF with YARA rules")
+                    help="scan the PDF with the included malicious PDF YARA rules and/or your custom YARA rules")
 select.add_argument('-c', '--counts', action='store_true',
                     help='show counts of some of the properties of the objects in the PDF')
@@ -127,10 +132,13 @@ def parse_arguments():
     if not args.streams:
         if args.extract_quoteds:
-            raise ArgumentError(None, "--extract-quoted does nothing if --streams is not selected")
+            exit_with_error("--extract-quoted does nothing if --streams is not selected")
         if args.suppress_boms:
             log.warning("--suppress-boms has nothing to suppress if --streams is not selected")
+    if args.no_default_yara_rules and not args.yara_rules_files:
+        exit_with_error("--no-default-yara-rules requires at least one --yara-file argument")
     # File export options
     if args.export_svg or args.export_txt or args.export_html or args.extract_binary_streams:
         args.output_dir = args.output_dir or getcwd()
@@ -149,8 +157,8 @@ def parse_arguments():
 def output_sections(args, pdfalyzer) -> List[OutputSection]:
     """
-    Determine which of the tree visualizations, font scans, etc were requested.
-    If nothing was specified the default is to output all sections.
+    Determine which of the tree visualizations, font scans, etc should be run.
+    If nothing is specified output ALL sections other than --streams which is v. slow/verbose.
     """
     # Create a partial for print_font_info() because it's the only one that can take an argument
     # partials have no __name__ so update_wrapper() propagates the 'print_font_info' as this partial's name
@@ -158,7 +166,8 @@ def output_sections(args, pdfalyzer) -> List[OutputSection]:
     stream_scan = partial(pdfalyzer.print_streams_analysis, idnum=stream_id)
     update_wrapper(stream_scan, pdfalyzer.print_streams_analysis)
-    # The first element string matches the argument in 'select' group.
+    # 1st element string matches the argument in 'select' group
+    # 2nd is fxn to call if selected.
     # Top to bottom is the default order of output.
     possible_output_sections = [
         OutputSection(DOCINFO, pdfalyzer.print_document_info),
@@ -187,6 +196,8 @@ def all_sections_chosen(args):
 ###############################################
 # Separate arg parser for combine_pdfs script #
 ###############################################
+MAX_QUALITY = 10
 combine_pdfs_parser = ArgumentParser(
     description="Combine multiple PDFs into one.",
     epilog="If all PDFs end in a number (e.g. 'xyz_1.pdf', 'xyz_2.pdf', etc. sort the files as if those were" \
@@ -198,10 +209,10 @@ combine_pdfs_parser.add_argument('pdfs',
                                  metavar='PDF_PATH',
                                  nargs='+')
-combine_pdfs_parser.add_argument('-c', '--compression-level',
-                                 help='zlib image compression level (0=none, max=1 until PyPDF is upgraded)',
-                                 choices=range(0, 2),
-                                 default=1,
+combine_pdfs_parser.add_argument('-iq', '--image-quality',
+                                 help='image quality for embedded images (can compress PDF at loss of quality)',
+                                 choices=range(1, MAX_QUALITY + 1),
+                                 default=MAX_QUALITY,
                                  type=int)
 combine_pdfs_parser.add_argument('-o', '--output-file',
@@ -246,7 +257,7 @@ def ask_to_proceed() -> None:
 def exit_with_error(error_message: str|None = None) -> None:
     """Print 'error_message' and exit with status code 1."""
     if error_message:
-        print_highlighted(error_message, style='bold red')
+        print_highlighted(Text('').append('ERROR', style='bold red').append(f': {error_message}'))
-    print_highlighted('Exiting...', style='red')
+    print_highlighted('Exiting...', style='dim red')
     sys.exit(1)
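
The two new options in this file can be exercised in isolation with plain `argparse`. The sketch below is an approximation under stated assumptions: it puts both options on one parser (the diff adds `--image-quality` to the separate `combine_pdfs_parser`), it defines `--yara-file` itself (in pdfalyzer that option comes from yaralyzer, and only its `yara_rules_files` destination is visible in the diff), and it reports errors through `parser.error()`, which exits with status 2, instead of the package's `exit_with_error()`, which exits with status 1.

```python
import argparse

MAX_QUALITY = 10  # same constant name and value as in the diff


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="sketch")
    # Assumption: stand-in for yaralyzer's --yara-file option.
    parser.add_argument("--yara-file", dest="yara_rules_files", action="append")
    parser.add_argument("--no-default-yara-rules", action="store_true")
    # choices=range(1, MAX_QUALITY + 1) accepts the integers 1 through 10.
    parser.add_argument("-iq", "--image-quality",
                        type=int,
                        choices=range(1, MAX_QUALITY + 1),
                        default=MAX_QUALITY)
    return parser


def parse(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Same cross-option check as the diff: skipping the default rules
    # only makes sense if at least one custom rules file was supplied.
    if args.no_default_yara_rules and not args.yara_rules_files:
        parser.error("--no-default-yara-rules requires at least one --yara-file argument")
    return args
```

For example, `parse(["--no-default-yara-rules"])` exits with an error, while `parse(["--yara-file", "my.yara", "--no-default-yara-rules"])` succeeds and leaves `image_quality` at its default of `MAX_QUALITY`.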

{pdfalyzer-1.15.0 → pdfalyzer-1.16.0}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "pdfalyzer"
-version = "1.15.0"
+version = "1.16.0"
 description = "A PDF analysis toolkit. Scan a PDF with relevant YARA rules, visualize its inner tree-like data structure in living color (lots of colors), force decodes of suspicious font binaries, and more."
 authors = ["Michel de Cryptadamus <michel@cryptadamus.com>"]
 license = "GPL-3.0-or-later"
@@ -51,7 +51,7 @@ pdfalyzer_show_color_theme = 'pdfalyzer:pdfalyzer_show_color_theme'
 python = "^3.9"
 anytree = "~=2.8"
 chardet = ">=5.0.0,<6.0.0"
-PyPDF2 = "^2.10"
+pypdf = "^5.0.1"
 python-dotenv = "^0.21.0"
 rich = "^12.5.1"
 rich-argparse-plus = "^0.3.1"