PyPI - pywaybackup - Versions diffs - 3.4.1__tar.gz → 4.1.0__tar.gz - Mend

pywaybackup 3.4.1__tar.gz → 4.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (30)

{pywaybackup-3.4.1/pywaybackup.egg-info → pywaybackup-4.1.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: pywaybackup
-Version: 3.4.1
+Version: 4.1.0
 Summary: Query and download archive.org as simple as possible.
 Author-email: bitdruid <bitdruid@outlook.com>
 License: MIT License
@@ -29,8 +29,7 @@ Project-URL: homepage, https://github.com/bitdruid/python-wayback-machine-downlo
 Requires-Python: >=3.8
 Description-Content-Type: text/markdown
 License-File: LICENSE
-Requires-Dist: pysqlite3-binary==0.5.4; sys_platform == "linux"
-Requires-Dist: pysqlite-binary; sys_platform == "win32"
+Requires-Dist: SQLAlchemy==2.0.43
 Requires-Dist: requests==2.32.3
 Requires-Dist: tqdm==4.67.1
 Requires-Dist: python-magic==0.4.27; sys_platform == "linux"
@@ -49,6 +48,17 @@ Internet-archive is a nice source for several OSINT-information. This tool is a
 This tool allows you to download content from the Wayback Machine (archive.org). You can use it to download either the latest version or all versions of web page snapshots within a specified range.
+# Content
+➡️ [Installation](#installation) <br>
+➡️ [notes / issues / hints](#notes--issues--hints) <br>
+➡️ [import](#import) <br>
+➡️ [cli](#cli) <br>
+➡️ [Usage](#usage) <br>
+➡️ [Examples](#examples) <br>
+➡️ [Output](#output) <br>
+➡️ [Contributing](#contributing) <br>
 ## Installation
 ### Pip
@@ -81,8 +91,14 @@ This tool allows you to download content from the Wayback Machine (archive.org).
 You can import pywaybackup into your own scripts and run it. Args are the same as cli.
 Additional args:
-- `silent` (default True): If True, suppresses all output to the console.
-- `debug` (default False): If True, disables writing errors to the error log file.
+- `silent` (default False): If True, suppresses all output to the console.
+- `debug` (default True): If False, disables writing errors to the error log file.
+Use:
+- `run()`
+- `status()`
+- `paths()`
+- `stop()`
 ```python
 from pywaybackup import PyWayBackup
@@ -114,6 +130,29 @@ output:
 }
 ```
+... or run it asynchronously and print the current status or stop it whenever needed.
+```python
+import time
+from pywaybackup import PyWayBackup
+backup = PyWayBackup( ... )
+backup.run(daemon=True)
+print(backup.status())
+time.sleep(10)
+print(backup.status())
+backup.stop()
+```
+output:
+```bash
+{
+  'task': 'downloading snapshots',
+  'current': 15,
+  'total': 84,
+  'progress': '18%'
+}
+```
 ## cli
 - `-h`, `--help`: Show the help message and exit.
@@ -127,25 +166,24 @@ output:
 #### Mode Selection (Choose One)
 - **`-a`**, **`--all`**:<br>
-  Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
+  All timestamps. Gives one folder per timestamp.
 - **`-l`**, **`--last`**:<br>
-  Download the last version of each file snapshot. You will get one directory with a rebuild of the page. It contains the last version of each file of your specified `--range`.
+  Last Version. Gives one folder containing the last version of each file of specified `--range`.
 - **`-f`**, **`--first`**:<br>
-  Download the first version of each file snapshot. You will get one directory with a rebuild of the page. It contains the first version of each file of your specified `--range`.
-- **`-s`**, **`--save`**:<br>
-  Save a page to the Wayback Machine. (beta)
+  First Version. Gives one folder containing the first version of each file of specified `--range`.
 #### Optional query parameters
+Parameters for archive.org CDX query. No effect on snapshot download itself.
 - **`-e`**, **`--explicit`**:<br>
-  Only download the explicit given URL. No wildcard subdomains or paths. Use e.g. to get root-only snapshots. This is recommended for explicit files like `login.html` or `?query=this`.
+  Only the explicit URL. No wildcard subdomains or paths. For example get: root-only (`https://example.com`) or specific file (`login.html`, `?query=this`).
 - **`--limit`** `<count>`:<br>
-  Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
+  Limits the snapshots fetched from archive.org CDX. (Will have no effect on existing CDX files)
 - **Range Selection:**<br>
-  Specify the range in years or a specific timestamp either start, end, or both. If you specify the `range`, the `start` and `end` will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.<br>
-  (year 2019, year+month+day 20190101, year+month+day+hour 2019010112)
+  Set the query range in years (`range`) or a timestamp (`start` and/or `end`). If `range` then ignores `start` and `end`. Format for timestamps: YYYYMMDDhhmmss. Timestamp can as specific as needed (year 2019, year+month+day 20190101, ...).
   - **`-r`**, **`--range`**:<br>
     Specify the range in years for which to search and download snapshots.
@@ -155,57 +193,56 @@ output:
     Timestamp to end searching.
 - **Filtering:**<br>
-  A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter.
   - **`--filetype`** `<filetype>`:<br>
-    Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
+    Specify filetypes to download. Example: `--filetype jpg,css,js`. You can only filter filetypes which are stored by archive.org (.html mostly not)
   - **`--statuscode`** `<statuscode>`:<br>
-    Specify HTTP status codes to download. Default is all statuscodes. Separate multiple status codes with a comma. Example: `--statuscode 200,301`. Pywaybackup will try to download any snapshot regardless of it's statuscode. For 404 of course this means logged errors and corresponding entries in the csv. However, you may want to get a csv that includes these negative attempts for your needs.<br>
+    Specify HTTP status codes to download. Example: `--statuscode 200,301`. PyWayBackup will always skip `404` and `301`.<br>
     Common status codes you may want to handle/filter:
       - `200` (OK)
-      - `301` (Moved Permanently - will redirect snapshot)
+      - `301` (Moved Permanently)
       - `404` (Not Found - snapshot seems to be empty)
       - `500` (Internal Server Error - snapshot is at least for now not available)
-### Optional
+#### Optional Behavior Manipulation
-#### Behavior Manipulation
+Parameters will change the download behavior for snapshots.
 - **`-o`**, **`--output`**:<br>
   Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
 - **`-m`**, **`--metadata`**<br>
-  Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
+  Folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). If you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
 - **`--verbose`**:<br>
   Increase output verbosity.
 - **`--log`** <!-- `<path>` -->:<br>
-  Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
+  Saves a log file into the output-dir. `waybackup_<sanitized_url>.log`.
 - **`--progress`**:<br>
   Shows a progress bar instead of the default output.
 - **`--workers`** `<count>`:<br>
-  Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
+  Number of simultaneous download workers. Default is 1, safe range is about 10. Too many workers may lead to refused connections by archive.org.
 - **`--no-redirect`**:<br>
-  Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
+  Disables following redirects of snapshots. Can prevent timestamp-folder mismatches caused by redirects.
 - **`--retry`** `<attempts>`:<br>
-  Specifies number of retry attempts for failed downloads.
+  Retry attempts for failed downloads.
 - **`--delay`** `<seconds>`:<br>
-  Specifies delay between download requests in seconds. Default is no delay (0).
+  Delay between download requests in seconds. Default is no delay (0).
 #### Job Handling:
 - **`--reset`**:
-  If set, the job will be reset, and any existing `cdx`, `db`, `csv` files will be **deleted**. This allows you to start the job from scratch without considering previously downloaded data.
+  If set, the job will be reset, and `cdx`, `db`, `csv` files will be **deleted**. This allows you to start the job from scratch.
 - **`--keep`**:
-  If set, all files will be kept after the job is finished. This includes the `cdx` and `db` file. Without this argument, they will be deleted if the job finished successfully.
+  If set, `cdx` and `db` files will be kept after the job is finished. Otherwise they will be deleted.
 <br>
 <br>
@@ -216,23 +253,11 @@ output:
 `pywaybackup` resumes interrupted jobs. The tool automatically continues from where it left off.
-- Detects existing `.cdx` and `.db` files in an `output dir` to resume downloading from the last successful point.
-- Compares `URL`, `mode`, and `optional query parameters` to ensure automatic resumption.
-- Skips previously downloaded files to save time.
+Only resumes queries if:
+- existing `.cdx` and `.db` files in an `output dir`
+- command is identical by `URL`, `mode`, and `optional query parameters`
   > **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
-#### Resetting a Job (`--reset`)
-- Deletes `.cdx` and `.db` files and restarts the process from scratch.
-- Does **not** remove already downloaded files.
-- `waybackup -u https://example.com -a --reset`
-#### Keeping Job Data (`--keep`)
-- Normally, `.cdx` and `.db` files are deleted after a successful job.
-- `--keep` preserves them for future re-analysis or extending the query.
-- `waybackup -u https://example.com -a --keep`
 <br>
 <br>
@@ -338,6 +363,11 @@ Exceptions will be written into `waybackup_error.log` (each run overwrites the f
 <br>
 <br>
+## Future ideas (long run)
+- More module functionality
+- Docker UI
 ## Contributing
 I'm always happy for some feature requests to improve the usability of this tool.

{pywaybackup-3.4.1 → pywaybackup-4.1.0}/README.md RENAMED

@@ -11,6 +11,17 @@ Internet-archive is a nice source for several OSINT-information. This tool is a
 This tool allows you to download content from the Wayback Machine (archive.org). You can use it to download either the latest version or all versions of web page snapshots within a specified range.
+# Content
+➡️ [Installation](#installation) <br>
+➡️ [notes / issues / hints](#notes--issues--hints) <br>
+➡️ [import](#import) <br>
+➡️ [cli](#cli) <br>
+➡️ [Usage](#usage) <br>
+➡️ [Examples](#examples) <br>
+➡️ [Output](#output) <br>
+➡️ [Contributing](#contributing) <br>
 ## Installation
 ### Pip
@@ -43,8 +54,14 @@ This tool allows you to download content from the Wayback Machine (archive.org).
 You can import pywaybackup into your own scripts and run it. Args are the same as cli.
 Additional args:
-- `silent` (default True): If True, suppresses all output to the console.
-- `debug` (default False): If True, disables writing errors to the error log file.
+- `silent` (default False): If True, suppresses all output to the console.
+- `debug` (default True): If False, disables writing errors to the error log file.
+Use:
+- `run()`
+- `status()`
+- `paths()`
+- `stop()`
 ```python
 from pywaybackup import PyWayBackup
@@ -76,6 +93,29 @@ output:
 }
 ```
+... or run it asynchronously and print the current status or stop it whenever needed.
+```python
+import time
+from pywaybackup import PyWayBackup
+backup = PyWayBackup( ... )
+backup.run(daemon=True)
+print(backup.status())
+time.sleep(10)
+print(backup.status())
+backup.stop()
+```
+output:
+```bash
+{
+  'task': 'downloading snapshots',
+  'current': 15,
+  'total': 84,
+  'progress': '18%'
+}
+```
 ## cli
 - `-h`, `--help`: Show the help message and exit.
@@ -89,25 +129,24 @@ output:
 #### Mode Selection (Choose One)
 - **`-a`**, **`--all`**:<br>
-  Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
+  All timestamps. Gives one folder per timestamp.
 - **`-l`**, **`--last`**:<br>
-  Download the last version of each file snapshot. You will get one directory with a rebuild of the page. It contains the last version of each file of your specified `--range`.
+  Last Version. Gives one folder containing the last version of each file of specified `--range`.
 - **`-f`**, **`--first`**:<br>
-  Download the first version of each file snapshot. You will get one directory with a rebuild of the page. It contains the first version of each file of your specified `--range`.
-- **`-s`**, **`--save`**:<br>
-  Save a page to the Wayback Machine. (beta)
+  First Version. Gives one folder containing the first version of each file of specified `--range`.
 #### Optional query parameters
+Parameters for archive.org CDX query. No effect on snapshot download itself.
 - **`-e`**, **`--explicit`**:<br>
-  Only download the explicit given URL. No wildcard subdomains or paths. Use e.g. to get root-only snapshots. This is recommended for explicit files like `login.html` or `?query=this`.
+  Only the explicit URL. No wildcard subdomains or paths. For example get: root-only (`https://example.com`) or specific file (`login.html`, `?query=this`).
 - **`--limit`** `<count>`:<br>
-  Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
+  Limits the snapshots fetched from archive.org CDX. (Will have no effect on existing CDX files)
 - **Range Selection:**<br>
-  Specify the range in years or a specific timestamp either start, end, or both. If you specify the `range`, the `start` and `end` will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.<br>
-  (year 2019, year+month+day 20190101, year+month+day+hour 2019010112)
+  Set the query range in years (`range`) or a timestamp (`start` and/or `end`). If `range` then ignores `start` and `end`. Format for timestamps: YYYYMMDDhhmmss. Timestamp can as specific as needed (year 2019, year+month+day 20190101, ...).
   - **`-r`**, **`--range`**:<br>
     Specify the range in years for which to search and download snapshots.
@@ -117,57 +156,56 @@ output:
     Timestamp to end searching.
 - **Filtering:**<br>
-  A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter.
   - **`--filetype`** `<filetype>`:<br>
-    Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
+    Specify filetypes to download. Example: `--filetype jpg,css,js`. You can only filter filetypes which are stored by archive.org (.html mostly not)
   - **`--statuscode`** `<statuscode>`:<br>
-    Specify HTTP status codes to download. Default is all statuscodes. Separate multiple status codes with a comma. Example: `--statuscode 200,301`. Pywaybackup will try to download any snapshot regardless of it's statuscode. For 404 of course this means logged errors and corresponding entries in the csv. However, you may want to get a csv that includes these negative attempts for your needs.<br>
+    Specify HTTP status codes to download. Example: `--statuscode 200,301`. PyWayBackup will always skip `404` and `301`.<br>
     Common status codes you may want to handle/filter:
       - `200` (OK)
-      - `301` (Moved Permanently - will redirect snapshot)
+      - `301` (Moved Permanently)
       - `404` (Not Found - snapshot seems to be empty)
       - `500` (Internal Server Error - snapshot is at least for now not available)
-### Optional
+#### Optional Behavior Manipulation
-#### Behavior Manipulation
+Parameters will change the download behavior for snapshots.
 - **`-o`**, **`--output`**:<br>
   Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
 - **`-m`**, **`--metadata`**<br>
-  Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
+  Folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). If you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
 - **`--verbose`**:<br>
   Increase output verbosity.
 - **`--log`** <!-- `<path>` -->:<br>
-  Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
+  Saves a log file into the output-dir. `waybackup_<sanitized_url>.log`.
 - **`--progress`**:<br>
   Shows a progress bar instead of the default output.
 - **`--workers`** `<count>`:<br>
-  Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
+  Number of simultaneous download workers. Default is 1, safe range is about 10. Too many workers may lead to refused connections by archive.org.
 - **`--no-redirect`**:<br>
-  Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
+  Disables following redirects of snapshots. Can prevent timestamp-folder mismatches caused by redirects.
 - **`--retry`** `<attempts>`:<br>
-  Specifies number of retry attempts for failed downloads.
+  Retry attempts for failed downloads.
 - **`--delay`** `<seconds>`:<br>
-  Specifies delay between download requests in seconds. Default is no delay (0).
+  Delay between download requests in seconds. Default is no delay (0).
 #### Job Handling:
 - **`--reset`**:
-  If set, the job will be reset, and any existing `cdx`, `db`, `csv` files will be **deleted**. This allows you to start the job from scratch without considering previously downloaded data.
+  If set, the job will be reset, and `cdx`, `db`, `csv` files will be **deleted**. This allows you to start the job from scratch.
 - **`--keep`**:
-  If set, all files will be kept after the job is finished. This includes the `cdx` and `db` file. Without this argument, they will be deleted if the job finished successfully.
+  If set, `cdx` and `db` files will be kept after the job is finished. Otherwise they will be deleted.
 <br>
 <br>
@@ -178,23 +216,11 @@ output:
 `pywaybackup` resumes interrupted jobs. The tool automatically continues from where it left off.
-- Detects existing `.cdx` and `.db` files in an `output dir` to resume downloading from the last successful point.
-- Compares `URL`, `mode`, and `optional query parameters` to ensure automatic resumption.
-- Skips previously downloaded files to save time.
+Only resumes queries if:
+- existing `.cdx` and `.db` files in an `output dir`
+- command is identical by `URL`, `mode`, and `optional query parameters`
   > **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
-#### Resetting a Job (`--reset`)
-- Deletes `.cdx` and `.db` files and restarts the process from scratch.
-- Does **not** remove already downloaded files.
-- `waybackup -u https://example.com -a --reset`
-#### Keeping Job Data (`--keep`)
-- Normally, `.cdx` and `.db` files are deleted after a successful job.
-- `--keep` preserves them for future re-analysis or extending the query.
-- `waybackup -u https://example.com -a --keep`
 <br>
 <br>
@@ -300,6 +326,11 @@ Exceptions will be written into `waybackup_error.log` (each run overwrites the f
 <br>
 <br>
+## Future ideas (long run)
+- More module functionality
+- Docker UI
 ## Contributing
 I'm always happy for some feature requests to improve the usability of this tool.
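
Note on the `run(daemon=True)` / `status()` / `stop()` usage added in the README hunks above: the following is a minimal sketch of how a caller might sample the status a few times and always stop the job afterwards. `sample_status` and its parameters are hypothetical and not part of pywaybackup. The only calls assumed on the job object are the `status()` and `stop()` methods named in the README, and the job is expected to have been started already with `run(daemon=True)`.

```python
import time


def sample_status(job, samples=3, interval=1.0):
    """Collect `samples` status snapshots from a running job, then stop it.

    `job` is any object exposing status() and stop(), as the README describes
    for PyWayBackup. This helper is hypothetical and only illustrates the flow.
    """
    seen = []
    try:
        for i in range(samples):
            seen.append(job.status())
            # Sleep between samples, but not after the last one.
            if i < samples - 1:
                time.sleep(interval)
    finally:
        # Stop the job even if status() raised.
        job.stop()
    return seen
```

With a real job, the call would follow `backup.run(daemon=True)` from the README example, for instance `sample_status(backup, samples=2, interval=10)`. The constructor arguments stay elided here, as they are in the README.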

{pywaybackup-3.4.1 → pywaybackup-4.1.0}/pyproject.toml RENAMED

@@ -7,7 +7,7 @@ packages = ["pywaybackup"]
 [project]
 name = "pywaybackup"
-version = "3.4.1"
+version = "4.1.0"
 description = "Query and download archive.org as simple as possible."
 authors = [
     { name = "bitdruid", email = "bitdruid@outlook.com" }
@@ -16,8 +16,7 @@ license = { file = "LICENSE" }
 readme = "README.md"
 requires-python = ">=3.8"
 dependencies = [
-    "pysqlite3-binary==0.5.4; sys_platform == 'linux'",
-    "pysqlite-binary; sys_platform == 'win32'",
+    "SQLAlchemy==2.0.43",
     "requests==2.32.3",
     "tqdm==4.67.1",
     "python-magic==0.4.27; sys_platform == 'linux'",

{pywaybackup-3.4.1 → pywaybackup-4.1.0}/pywaybackup/Arguments.py RENAMED

@@ -24,8 +24,8 @@ class Arguments:
         optional = parser.add_argument_group("optional query parameters")
         optional.add_argument("-e", "--explicit", action="store_true", help="search only for the explicit given url")
         optional.add_argument("-r", "--range", type=int, metavar="", help="range in years to search")
-        optional.add_argument("--start", type=int, metavar="", help="start timestamp format: YYYYMMDDhhmmss")
-        optional.add_argument("--end", type=int, metavar="", help="end timestamp format: YYYYMMDDhhmmss")
+        optional.add_argument("--start", type=int, metavar="", help="start timestamp format: YYYYMMDDHHMMSS")
+        optional.add_argument("--end", type=int, metavar="", help="end timestamp format: YYYYMMDDHHMMSS")
         optional.add_argument("--limit", type=int, nargs="?", const=True, metavar="int", help="limit the number of snapshots to download")
         optional.add_argument("--filetype", type=str, metavar="", help="filetypes to download comma separated (js,css,...)")
         optional.add_argument("--statuscode", type=str, metavar="", help="statuscodes to download comma separated (200,404,...)")
@@ -55,3 +55,4 @@ class Arguments:
     def get_args(self) -> dict:
         """Returns the parsed arguments as a dictionary."""
         return vars(self.args)
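
Note on the timestamp arguments: `--start` and `--end` are declared above with `type=int`, and the README hunks describe timestamps that can be as specific as needed (year only, year+month+day, and so on). As a rough illustration only, the sketch below shows one way such a shortened value could be validated and expanded to a full date. `parse_partial_timestamp` and `_FORMATS` are hypothetical names, and nothing here is claimed to match how pywaybackup handles these arguments internally.

```python
from datetime import datetime

# strptime formats for each allowed prefix length of YYYYMMDDhhmmss.
_FORMATS = {
    4: "%Y",
    6: "%Y%m",
    8: "%Y%m%d",
    10: "%Y%m%d%H",
    12: "%Y%m%d%H%M",
    14: "%Y%m%d%H%M%S",
}


def parse_partial_timestamp(value: int) -> datetime:
    """Expand a left-anchored timestamp prefix to a datetime.

    Missing parts default to the earliest value (January, day 1, 00:00:00).
    """
    text = str(value)
    fmt = _FORMATS.get(len(text))
    if fmt is None or not text.isdigit():
        raise ValueError(f"not a YYYYMMDDhhmmss prefix: {value!r}")
    # strptime also rejects impossible values such as month 13.
    return datetime.strptime(text, fmt)
```

Rejecting odd lengths up front means a value such as `201` fails with a clear message rather than being read as a three-digit year.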

{pywaybackup-3.4.1 → pywaybackup-4.1.0}/pywaybackup/Exception.py RENAMED

@@ -14,9 +14,9 @@ class Exception:
     command = None
     @classmethod
-    def init(cls, debug=None, output=None, command=None):
+    def init(cls, debugfile=None, output=None, command=None):
         sys.excepthook = cls.exception_handler  # set custom exception handler (uncaught exceptions)
-        cls.debug = debug
+        cls.debugfile = debugfile
         cls.output = output
         cls.command = command
@@ -45,18 +45,18 @@ class Exception:
             exception_message += "!-- Traceback is None\n"
         exception_message += f"!-- Description: {e}\n-------------------------"
         print(exception_message)
-        if cls.debug:
-            print(f"Exception log: {cls.debug}")
+        if cls.debugfile:
+            print(f"Exception log: {cls.debugfile}")
             if cls.new_debug:  # new run, overwrite file
                 cls.new_debug = False
-                f = open(cls.debug, "w", encoding="utf-8")
+                f = open(cls.debugfile, "w", encoding="utf-8")
                 f.write("-------------------------\n")
                 f.write(f"Version: {version('pywaybackup')}\n")
                 f.write("-------------------------\n")
                 f.write(f"Command: {cls.command}\n")
                 f.write("-------------------------\n\n")
             else:  # current run, append to file
-                f = open(cls.debug, "a", encoding="utf-8")
+                f = open(cls.debugfile, "a", encoding="utf-8")
             f.write(datetime.now().strftime("%Y-%m-%d %H:%M:%S") + "\n")
             f.write(exception_message + "\n")
             f.write("!-- Local Variables:\n")
@@ -96,4 +96,4 @@ class Exception:
         if issubclass(exception_type, KeyboardInterrupt):
             sys.__excepthook__(exception_type, exception, traceback)
             return
-        Exception.exception("UNCAUGHT EXCEPTION", exception, traceback)  # uncaught exceptions also with custom scheme
+        Exception.exception('UNCAUGHT EXCEPTION', exception, traceback)  # uncaught exceptions also with custom scheme
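
Note on the log handling: the hunks above rename `debug` to `debugfile` but keep the existing pattern of overwriting the log on the first exception of a run (`new_debug`) and appending afterwards. A minimal standalone sketch of that pattern follows. `RunLog` is a hypothetical class written for illustration, not code from the package, and it uses a `with` block so the file handle is closed after every write.

```python
class RunLog:
    """Log file that is truncated on the first write of a run, then appended to."""

    def __init__(self, path):
        self.path = path
        self._fresh = True  # no write has happened yet in this run

    def write(self, message):
        # First write of the run overwrites old content; later writes append.
        mode = "w" if self._fresh else "a"
        self._fresh = False
        with open(self.path, mode, encoding="utf-8") as f:
            f.write(message + "\n")
```

Creating a new `RunLog` for the same path starts a new run, so its first write replaces whatever the previous run left behind.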
