PyPI - pywaybackup - Versions diffs - 3.2.1__tar.gz → 3.3.1__tar.gz - Mend

pywaybackup 3.2.1 (tar.gz) → 3.3.1 (tar.gz)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (23)

{pywaybackup-3.2.1/pywaybackup.egg-info → pywaybackup-3.3.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: pywaybackup
-Version: 3.2.1
+Version: 3.3.1
 Summary: Query and download archive.org as simple as possible.
 Author-email: bitdruid <bitdruid@outlook.com>
 License: MIT License
@@ -55,16 +55,16 @@ This tool allows you to download content from the Wayback Machine (archive.org).
 ### Pip
 1. Install the package <br>
-   ```pip install pywaybackup```
+   `pip install pywaybackup`
 2. Run the tool <br>
-   ```waybackup -h```
+   `waybackup -h`
 ### Manual
 1. Clone the repository <br>
-   ```git clone https://github.com/bitdruid/python-wayback-machine-downloader.git```
+   `git clone https://github.com/bitdruid/python-wayback-machine-downloader.git`
 2. Install <br>
-   ```pip install .```
+   `pip install .`
    - in a virtual env or use `--break-system-package`
 ## notes / issues / hints
@@ -88,6 +88,7 @@ This tool allows you to download content from the Wayback Machine (archive.org).
   The URL of the web page to download. This argument is required.
 #### Mode Selection (Choose One)
 - **`-a`**, **`--all`**:<br>
   Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
 - **`-l`**, **`--last`**:<br>
@@ -102,66 +103,67 @@ This tool allows you to download content from the Wayback Machine (archive.org).
 - **`-e`**, **`--explicit`**:<br>
   Only download the explicit given URL. No wildcard subdomains or paths. Use e.g. to get root-only snapshots. This is recommended for explicit files like `login.html` or `?query=this`.
-- **`--filetype`** `<filetype>`:<br>
-  Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
 - **`--limit`** `<count>`:<br>
-Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
+  Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
 - **Range Selection:**<br>
   Specify the range in years or a specific timestamp either start, end, or both. If you specify the `range` argument, the `start` and `end` arguments will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.<br>
   (year 2019, year 2019, year+month+day 20190101, year+month+day+hour 2019010112)
-   - **`-r`**, **`--range`**:<br>
-     Specify the range in years for which to search and download snapshots.
-   - **`--start`**:<br>
-     Timestamp to start searching.
-   - **`--end`**:<br>
-     Timestamp to end searching.
+  - **`-r`**, **`--range`**:<br>
+    Specify the range in years for which to search and download snapshots.
+  - **`--start`**:<br>
+    Timestamp to start searching.
+  - **`--end`**:<br>
+    Timestamp to end searching.
+- **Filtering:**<br>
+  A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter.
+  - **`--filetype`** `<filetype>`:<br>
+    Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
+  - **`--statuscode`** `<statuscode>`:<br>
+    Specify HTTP status codes to download. Default is all statuscodes. Separate multiple status codes with a comma. Example: `--statuscode 200,301`. Pywaybackup will try to download any snapshot regardless of it's statuscode. For 404 of course this means logged errors and corresponding entries in the csv. However, you may want to get a csv that includes these negative attempts for your needs.<br>
+    Common status codes you may want to handle/filter:
+      - `200` (OK)
+      - `301` (Moved Permanently - will redirect snapshot)
+      - `404` (Not Found - snapshot seems to be empty)
+      - `500` (Internal Server Error - snapshot is at least for now not available)
 ### Optional
 #### Behavior Manipulation
 - **`-o`**, **`--output`**:<br>
-Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
+  Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
 - **`-m`**, **`--metadata`**<br>
-Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
+  Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
+- **`--verbose`**:<br>
+  Increase output verbosity.
 <!-- - **`--verbosity`** `<level>`:<br>
 Sets verbosity level. Options are `info`and `trace`. Default is `info`. -->
 - **`--log`** <!-- `<path>` -->:<br>
-Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
+  Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
 - **`--progress`**:<br>
-Shows a progress bar instead of the default output.
+  Shows a progress bar instead of the default output.
 - **`--workers`** `<count>`:<br>
-Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
+  Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
 - **`--no-redirect`**:<br>
-Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
+  Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
 - **`--retry`** `<attempts>`:<br>
-Specifies number of retry attempts for failed downloads.
+  Specifies number of retry attempts for failed downloads.
 - **`--delay`** `<seconds>`:<br>
-Specifies delay between download requests in seconds. Default is no delay (0).
-- **`--verbose`**:<br>
-Increase output verbosity.
-  - verbose:
-  ```
-  -----> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
-  SUCCESS   -> 200 OK
-            -> URL:  https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
-            -> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css
-  ```
-  - non-verbose:
-  ```
-  55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
-  ```
+  Specifies delay between download requests in seconds. Default is no delay (0).
 <!-- - **`--convert-links`**:<br>
 If set, all links in the downloaded files will be converted to local links. This is useful for offline browsing. The links are converted to the local path structure. Show output with `--verbosity trace`. -->
@@ -186,14 +188,16 @@ If set, all links in the downloaded files will be converted to local links. This
 - Detects existing `.cdx` and `.db` files in an `output dir` to resume downloading from the last successful point.
 - Compares `URL`, `mode`, and `optional query parameters` to ensure automatic resumption.
 - Skips previously downloaded files to save time.
-> **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
+  > **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
 #### Resetting a Job (`--reset`)
 - Deletes `.cdx` and `.db` files and restarts the process from scratch.
 - Does **not** remove already downloaded files.
 - `waybackup -u https://example.com -a --reset`
 #### Keeping Job Data (`--keep`)
 - Normally, `.cdx` and `.db` files are deleted after a successful job.
 - `--keep` preserves them for future re-analysis or extending the query.
 - `waybackup -u https://example.com -a --keep`
@@ -204,13 +208,13 @@ If set, all links in the downloaded files will be converted to local links. This
 ## Examples
 1. Download a specific single snapshot of all available files (starting from root):<br>
-`waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000`
+   `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000`
 2. Download a specific single snapshot of all available files (starting from a subdirectory):<br>
-`waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000`
+   `waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000`
 3. Download a specific single snapshot of the exact given URL (no subdirs):<br>
-`waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit`
+   `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit`
 4. Download all snapshots of all available files in the given range:<br>
-`waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000`
+   `waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000`
 <br>
 <br>
@@ -223,7 +227,9 @@ The output path is currently structured as follows by an example for the query:<
 `http://example.com/subdir1/subdir2/assets/`
 <br><br>
 For the first and last version (`-f` or `-l`):
 - Will only include all files/folders starting from your query-path.
 ```
 your/path/waybackup_snapshots/
 └── the_root_of_your_query/ (example.com/)
@@ -234,8 +240,11 @@ your/path/waybackup_snapshots/
                 ├── style.css
                 ...
 ```
 For all versions (`-a`):
 - Will create a folder named as the root of your query. Inside this folder, you will find all timestamps and per timestamp the path you requested.
 ```
 your/path/waybackup_snapshots/
 └── the_root_of_your_query/ (example.com/)
@@ -276,6 +285,23 @@ For download queries:
 ]
 ```
+### Log
+Verbose:
+```
+-----> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
+SUCCESS   -> 200 OK
+          -> URL:  https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
+          -> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css
+```
+Non-verbose:
+```
+55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
+```
 ### Debugging
 Exceptions will be written into `waybackup_error.log` (each run overwrites the file).
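
The timestamp rule described in the range-selection text above (a left-anchored prefix of `YYYYMMDDhhmmss`) can be sketched as follows. This is only an illustration: `is_timestamp_prefix` and `pad_start` are hypothetical helpers, not part of pywaybackup, and the assumption that every even-length prefix (including year+month) is accepted is mine, not confirmed by the diff.

```python
# Hypothetical helpers for illustration only; pywaybackup may validate
# --start/--end differently (the diff shows them parsed with type=int).

# Assumed valid prefix lengths: YYYY, YYYYMM, YYYYMMDD, ... YYYYMMDDhhmmss
VALID_LENGTHS = {4, 6, 8, 10, 12, 14}

# Earliest possible value for each missing field: month 01, day 01, 00:00:00
EARLIEST = "00000101000000"


def is_timestamp_prefix(value: str) -> bool:
    """Return True if value is a digit-only, left-anchored timestamp prefix."""
    return value.isascii() and value.isdigit() and len(value) in VALID_LENGTHS


def pad_start(value: str) -> str:
    """Expand a partial timestamp to 14 digits using the earliest values."""
    if not is_timestamp_prefix(value):
        raise ValueError(f"not a timestamp prefix: {value!r}")
    return value + EARLIEST[len(value):]


print(pad_start("2019"))        # year only
print(pad_start("2019010112"))  # year+month+day+hour
```

Padding toward the earliest value suits a start bound; an end bound would need the latest valid value per field, which depends on the month and is left out of this sketch.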

{pywaybackup-3.2.1 → pywaybackup-3.3.1}/README.md RENAMED

@@ -16,16 +16,16 @@ This tool allows you to download content from the Wayback Machine (archive.org).
 ### Pip
 1. Install the package <br>
-   ```pip install pywaybackup```
+   `pip install pywaybackup`
 2. Run the tool <br>
-   ```waybackup -h```
+   `waybackup -h`
 ### Manual
 1. Clone the repository <br>
-   ```git clone https://github.com/bitdruid/python-wayback-machine-downloader.git```
+   `git clone https://github.com/bitdruid/python-wayback-machine-downloader.git`
 2. Install <br>
-   ```pip install .```
+   `pip install .`
    - in a virtual env or use `--break-system-package`
 ## notes / issues / hints
@@ -49,6 +49,7 @@ This tool allows you to download content from the Wayback Machine (archive.org).
   The URL of the web page to download. This argument is required.
 #### Mode Selection (Choose One)
 - **`-a`**, **`--all`**:<br>
   Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
 - **`-l`**, **`--last`**:<br>
@@ -63,66 +64,67 @@ This tool allows you to download content from the Wayback Machine (archive.org).
 - **`-e`**, **`--explicit`**:<br>
   Only download the explicit given URL. No wildcard subdomains or paths. Use e.g. to get root-only snapshots. This is recommended for explicit files like `login.html` or `?query=this`.
-- **`--filetype`** `<filetype>`:<br>
-  Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
 - **`--limit`** `<count>`:<br>
-Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
+  Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
 - **Range Selection:**<br>
   Specify the range in years or a specific timestamp either start, end, or both. If you specify the `range` argument, the `start` and `end` arguments will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.<br>
   (year 2019, year 2019, year+month+day 20190101, year+month+day+hour 2019010112)
-   - **`-r`**, **`--range`**:<br>
-     Specify the range in years for which to search and download snapshots.
-   - **`--start`**:<br>
-     Timestamp to start searching.
-   - **`--end`**:<br>
-     Timestamp to end searching.
+  - **`-r`**, **`--range`**:<br>
+    Specify the range in years for which to search and download snapshots.
+  - **`--start`**:<br>
+    Timestamp to start searching.
+  - **`--end`**:<br>
+    Timestamp to end searching.
+- **Filtering:**<br>
+  A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter.
+  - **`--filetype`** `<filetype>`:<br>
+    Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
+  - **`--statuscode`** `<statuscode>`:<br>
+    Specify HTTP status codes to download. Default is all statuscodes. Separate multiple status codes with a comma. Example: `--statuscode 200,301`. Pywaybackup will try to download any snapshot regardless of it's statuscode. For 404 of course this means logged errors and corresponding entries in the csv. However, you may want to get a csv that includes these negative attempts for your needs.<br>
+    Common status codes you may want to handle/filter:
+      - `200` (OK)
+      - `301` (Moved Permanently - will redirect snapshot)
+      - `404` (Not Found - snapshot seems to be empty)
+      - `500` (Internal Server Error - snapshot is at least for now not available)
 ### Optional
 #### Behavior Manipulation
 - **`-o`**, **`--output`**:<br>
-Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
+  Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
 - **`-m`**, **`--metadata`**<br>
-Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
+  Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
+- **`--verbose`**:<br>
+  Increase output verbosity.
 <!-- - **`--verbosity`** `<level>`:<br>
 Sets verbosity level. Options are `info`and `trace`. Default is `info`. -->
 - **`--log`** <!-- `<path>` -->:<br>
-Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
+  Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
 - **`--progress`**:<br>
-Shows a progress bar instead of the default output.
+  Shows a progress bar instead of the default output.
 - **`--workers`** `<count>`:<br>
-Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
+  Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
 - **`--no-redirect`**:<br>
-Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
+  Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
 - **`--retry`** `<attempts>`:<br>
-Specifies number of retry attempts for failed downloads.
+  Specifies number of retry attempts for failed downloads.
 - **`--delay`** `<seconds>`:<br>
-Specifies delay between download requests in seconds. Default is no delay (0).
-- **`--verbose`**:<br>
-Increase output verbosity.
-  - verbose:
-  ```
-  -----> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
-  SUCCESS   -> 200 OK
-            -> URL:  https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
-            -> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css
-  ```
-  - non-verbose:
-  ```
-  55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
-  ```
+  Specifies delay between download requests in seconds. Default is no delay (0).
 <!-- - **`--convert-links`**:<br>
 If set, all links in the downloaded files will be converted to local links. This is useful for offline browsing. The links are converted to the local path structure. Show output with `--verbosity trace`. -->
@@ -147,14 +149,16 @@ If set, all links in the downloaded files will be converted to local links. This
 - Detects existing `.cdx` and `.db` files in an `output dir` to resume downloading from the last successful point.
 - Compares `URL`, `mode`, and `optional query parameters` to ensure automatic resumption.
 - Skips previously downloaded files to save time.
-> **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
+  > **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
 #### Resetting a Job (`--reset`)
 - Deletes `.cdx` and `.db` files and restarts the process from scratch.
 - Does **not** remove already downloaded files.
 - `waybackup -u https://example.com -a --reset`
 #### Keeping Job Data (`--keep`)
 - Normally, `.cdx` and `.db` files are deleted after a successful job.
 - `--keep` preserves them for future re-analysis or extending the query.
 - `waybackup -u https://example.com -a --keep`
@@ -165,13 +169,13 @@ If set, all links in the downloaded files will be converted to local links. This
 ## Examples
 1. Download a specific single snapshot of all available files (starting from root):<br>
-`waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000`
+   `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000`
 2. Download a specific single snapshot of all available files (starting from a subdirectory):<br>
-`waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000`
+   `waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000`
 3. Download a specific single snapshot of the exact given URL (no subdirs):<br>
-`waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit`
+   `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit`
 4. Download all snapshots of all available files in the given range:<br>
-`waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000`
+   `waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000`
 <br>
 <br>
@@ -184,7 +188,9 @@ The output path is currently structured as follows by an example for the query:<
 `http://example.com/subdir1/subdir2/assets/`
 <br><br>
 For the first and last version (`-f` or `-l`):
 - Will only include all files/folders starting from your query-path.
 ```
 your/path/waybackup_snapshots/
 └── the_root_of_your_query/ (example.com/)
@@ -195,8 +201,11 @@ your/path/waybackup_snapshots/
                 ├── style.css
                 ...
 ```
 For all versions (`-a`):
 - Will create a folder named as the root of your query. Inside this folder, you will find all timestamps and per timestamp the path you requested.
 ```
 your/path/waybackup_snapshots/
 └── the_root_of_your_query/ (example.com/)
@@ -237,6 +246,23 @@ For download queries:
 ]
 ```
+### Log
+Verbose:
+```
+-----> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
+SUCCESS   -> 200 OK
+          -> URL:  https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
+          -> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css
+```
+Non-verbose:
+```
+55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
+```
 ### Debugging
 Exceptions will be written into `waybackup_error.log` (each run overwrites the file).
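
To make the new `--statuscode` filter more concrete, here is a minimal sketch of filtering CDX rows by status code and building the archive URL seen in the log example. It assumes each row is a JSON array in the order `timestamp, digest, mimetype, statuscode, origin`, which is the order used by `_parse_line` in the `SnapshotCollection.py` hunk of this diff. The functions `archive_url` and `filter_rows` and the sample rows are hypothetical, and the package itself may apply the filter elsewhere (for example in the CDX query).

```python
import json


def archive_url(timestamp, origin):
    """Build the snapshot URL in the <timestamp>id_/<origin> form."""
    return f"https://web.archive.org/web/{timestamp}id_/{origin}"


def filter_rows(lines, wanted):
    """Yield (timestamp, url_archive, statuscode) for rows to keep.

    An empty `wanted` set means no filter, matching the documented
    default of downloading all status codes.
    """
    for line in lines:
        timestamp, _digest, _mimetype, statuscode, origin = json.loads(line)
        if not wanted or statuscode in wanted:
            yield timestamp, archive_url(timestamp, origin), statuscode


# Made-up sample rows for illustration only.
sample = [
    '["20240225193302", "ABC", "text/css", "200", "https://example.com/assets/css/custom-styles.css"]',
    '["20240225193303", "DEF", "text/html", "404", "https://example.com/missing.html"]',
]

kept = list(filter_rows(sample, {"200", "301"}))
for timestamp, url, code in kept:
    print(code, url)
```

Status codes are compared as strings here because the diff splits the `--statuscode` value into a list of strings.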

{pywaybackup-3.2.1 → pywaybackup-3.3.1}/pyproject.toml RENAMED

@@ -7,7 +7,7 @@ packages = ["pywaybackup"]
 [project]
 name = "pywaybackup"
-version = "3.2.1"
+version = "3.3.1"
 description = "Query and download archive.org as simple as possible."
 authors = [
     { name = "bitdruid", email = "bitdruid@outlook.com" }

{pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/Arguments.py RENAMED

@@ -3,6 +3,8 @@ import sys
 import os
 import argparse
+from argparse import RawTextHelpFormatter
 from importlib.metadata import version
 from pywaybackup.helper import url_split, sanitize_filename
@@ -10,9 +12,10 @@ from pywaybackup.helper import url_split, sanitize_filename
 class Arguments:
     def __init__(self):
-        parser = argparse.ArgumentParser(description='Download from wayback machine (archive.org)')
-        parser.add_argument('-v', '--version', action='version', version='%(prog)s ' + version("pywaybackup") + ' by @bitdruid -> https://github.com/bitdruid')
+        parser = argparse.ArgumentParser(
+            description=f"<<< python-wayback-machine-downloader v{version('pywaybackup')} >>>\nby @bitdruid -> https://github.com/bitdruid",
+            formatter_class=RawTextHelpFormatter,
+        )
         required = parser.add_argument_group('required (one exclusive)')
         required.add_argument('-u', '--url', type=str, metavar="", help='url (with subdir/subdomain) to download')
@@ -27,12 +30,14 @@ class Arguments:
         optional.add_argument('-r', '--range', type=int, metavar="", help='range in years to search')
         optional.add_argument('--start', type=int, metavar="", help='start timestamp format: YYYYMMDDhhmmss')
         optional.add_argument('--end', type=int, metavar="", help='end timestamp format: YYYYMMDDhhmmss')
-        optional.add_argument('--filetype', type=str, metavar="", help='filetypes to download comma separated (e.g. "html,css")')
         optional.add_argument('--limit', type=int, nargs='?', const=True, metavar='int', help='limit the number of snapshots to download')
+        optional.add_argument('--filetype', type=str, metavar="", help='filetypes to download comma separated (js,css,...)')
+        optional.add_argument('--statuscode', type=str, metavar="", help='statuscodes to download comma separated (200,404,...)')
         behavior = parser.add_argument_group('manipulate behavior')
         behavior.add_argument('-o', '--output', type=str, metavar="", help='output for all files - defaults to current directory')
         behavior.add_argument('-m', '--metadata', type=str, metavar="", help='change directory for db/cdx/csv/log files')
+        behavior.add_argument('-v', '--verbose', action='store_true', help='overwritten by progress - gives detailed output')
         behavior.add_argument('--log', action='store_true', help='save a log file into the output folder')
         behavior.add_argument('--progress', action='store_true', help='show a progress bar')
         behavior.add_argument('--no-redirect', action='store_true', help='do not follow redirects by archive.org')
@@ -40,7 +45,6 @@ class Arguments:
         behavior.add_argument('--workers', type=int, default=1, metavar="", help='number of workers (simultaneous downloads)')
         # behavior.add_argument('--convert-links', action='store_true', help='Convert all links in the files to local paths. Requires -c/--current')
         behavior.add_argument('--delay', type=int, default=0, metavar="", help='delay between each download in seconds')
-        behavior.add_argument('--verbose', action='store_true', help='overwritten by progress - gives detailed output')
         special = parser.add_argument_group('special')
         special.add_argument('--reset', action='store_true', help='reset the job and ignore existing cdx/db/csv files')
@@ -61,6 +65,52 @@ class Arguments:
         return self.args
 class Configuration:
+    # def __init__(self):
+    #     self.args = Arguments().get_args()
+    #     for key, value in vars(self.args).items():
+    #         setattr(Configuration, key, value)
+    #     self.set_config()
+    # def set_config(self):
+    #             # args now attributes of Configuration // Configuration.output, ...
+    #     self.command = ' '.join(sys.argv[1:])
+    #     self.domain, self.subdir, self.filename = url_split(self.url)
+    #     if self.output is None:
+    #         self.output = os.path.join(os.getcwd(), "waybackup_snapshots")
+    #     if self.metadata is None:
+    #         self.metadata = self.output
+    #     os.makedirs(self.output, exist_ok=True) if not self.save else None
+    #     os.makedirs(self.metadata, exist_ok=True) if not self.save else None
+    #     if self.all:
+    #         self.mode = "all"
+    #     if self.last:
+    #         self.mode = "last"
+    #     if self.first:
+    #         self.mode = "first"
+    #     if self.save:
+    #         self.mode = "save"
+    #     if self.filetype:
+    #         self.filetype = [f.lower().strip() for f in self.filetype.split(",")]
+    #     if self.statuscode:
+    #         self.statuscode = [s.lower().strip() for s in self.statuscode.split(",")]
+    #     base_path = self.metadata
+    #     base_name = f"waybackup_{sanitize_filename(self.url)}"
+    #     self.cdxfile = os.path.join(base_path, f"{base_name}.cdx")
+    #     self.dbfile = os.path.join(base_path, f"{base_name}.db")
+    #     self.csvfile = os.path.join(base_path, f"{base_name}.csv")
+    #     self.log = os.path.join(base_path, f"{base_name}.log") if self.log else None
+    #     if self.reset:
+    #         os.remove(self.cdxfile) if os.path.isfile(self.cdxfile) else None
+    #         os.remove(self.dbfile) if os.path.isfile(self.dbfile) else None
+    #         os.remove(self.csvfile) if os.path.isfile(self.csvfile) else None
     @classmethod
     def init(cls):
@@ -90,7 +140,9 @@ class Configuration:
             cls.mode = "save"
         if cls.filetype:
-            cls.filetype = [ft.lower().strip() for ft in cls.filetype.split(",")]
+            cls.filetype = [f.lower().strip() for f in cls.filetype.split(",")]
+        if cls.statuscode:
+            cls.statuscode = [s.lower().strip() for s in cls.statuscode.split(",")]
         base_path = cls.metadata
         base_name = f"waybackup_{sanitize_filename(cls.url)}"
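
The argument changes in this file can be tried in isolation with a reduced parser. The sketch below is a standalone approximation, not the package's `Arguments` class: `build_parser` and `split_csv` are hypothetical names, and only the options touched by this diff are included. The option definitions and the `lower().strip()` normalization mirror the added lines above.

```python
import argparse
from argparse import RawTextHelpFormatter


def split_csv(raw):
    """Normalize a comma-separated option value into a list, or None if unset."""
    if not raw:
        return None
    return [item.lower().strip() for item in raw.split(",")]


def build_parser():
    # RawTextHelpFormatter keeps the newline in the description as written.
    parser = argparse.ArgumentParser(
        prog="demo",
        description="<<< demo parser >>>\nsecond line kept as-is",
        formatter_class=RawTextHelpFormatter,
    )
    parser.add_argument("--filetype", type=str, metavar="",
                        help="filetypes to download comma separated (js,css,...)")
    parser.add_argument("--statuscode", type=str, metavar="",
                        help="statuscodes to download comma separated (200,404,...)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="gives detailed output")
    return parser


args = build_parser().parse_args(
    ["--statuscode", "200, 301", "--filetype", "JPG,css", "-v"]
)
statuscodes = split_csv(args.statuscode)
filetypes = split_csv(args.filetype)
print(statuscodes, filetypes, args.verbose)
```

One behavior change is visible in the hunks above: in 3.2.1 the short flag `-v` belonged to `--version`, while in 3.3.1 it is attached to `--verbose`, so scripts that relied on `-v` to print the version may need adjusting.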

{pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/SnapshotCollection.py RENAMED

@@ -22,11 +22,10 @@ class SnapshotCollection:
     SNAPSHOT_UNHANDLED = 0  # all unhandled snapshots in the db (without response)
     SNAPSHOT_HANDLED = 0    # snapshots with a response
-    SNAPSHOT_REMOVALS = 0   # not to be utilized (total - unhandled - skip)
-    SNAPSHOT_FAULTY = 0     # snapshots which could not be loaded from cdx file into db
     FILTER_DUPLICATES = 0   # with identical url_archive
     FILTER_MODE = 0         # all snapshots filtered by the MODE (last or first)
     FILTER_SKIP = 0         # content of the csv file
+    FILTER_RESPONSE = 0     # snapshots which could not be loaded from cdx file into db or 404
     @classmethod
     def init(cls, mode):
@@ -71,35 +70,40 @@ class SnapshotCollection:
             cls.db.set_index_complete()
         else:
             vb.write(verbose=True, content="\nAlready indexed snapshots")
-        if cls.MODE_LAST or cls.MODE_FIRST:
-            if not cls.db.get_filter_complete():
-                vb.write(content="\nFiltering snapshots (last or first version)...")
-                cls.filter_snapshots() # filter: keep newest or oldest based on MODE
-                cls.db.set_filter_complete()
-            else:
-                vb.write(verbose=True, content="\nAlready filtered snapshots (last or first version)")
+        if not cls.db.get_filter_complete():
+            vb.write(content="\nFiltering snapshots (last or first version)...")
+            cls.filter_snapshots() # filter: keep newest or oldest based on MODE
+            cls.db.set_filter_complete()
+        else:
+            vb.write(verbose=True, content="\nAlready filtered snapshots (last or first version)")
         cls.skip_set(csvfile)  # set response to NULL or read csv file and write values into db
+    @classmethod
+    def calculate(cls):
         cls.SNAPSHOT_UNHANDLED = cls.count_totals(unhandled=True)  # count all unhandled in db
         cls.SNAPSHOT_HANDLED = cls.count_totals(handled=True)  # count all handled in db
         cls.SNAPSHOT_TOTAL = cls.count_totals(total=True)  # count all in db
-        cls.SNAPSHOT_REMOVALS = cls.CDX_TOTAL - cls.SNAPSHOT_UNHANDLED - cls.FILTER_SKIP  # count all removals
         vb.write(content="\nSnapshot calculation:")
         vb.write(content=f"-----> {'in CDX file'.ljust(18)}: {cls.CDX_TOTAL:,}")
-        if cls.FILTER_DUPLICATES == 0 and cls.FILTER_MODE == 0:
-            vb.write(content=f"-----> {'total removals'.ljust(18)}: {cls.SNAPSHOT_REMOVALS:,}")
-        if cls.SNAPSHOT_FAULTY > 0:
-            vb.write(content=f"-----> {'removed faulty'.ljust(18)}: {cls.SNAPSHOT_FAULTY}")
         if cls.FILTER_DUPLICATES > 0:
             vb.write(content=f"-----> {'removed duplicates'.ljust(18)}: {cls.FILTER_DUPLICATES:,}")
         if cls.FILTER_MODE > 0:
             vb.write(content=f"-----> {'removed versions'.ljust(18)}: {cls.FILTER_MODE:,}")
         if cls.FILTER_SKIP > 0:
-            vb.write(content=f"-----> {'skipped existing'.ljust(18)}: {cls.FILTER_SKIP:,}")
+            vb.write(content=f"-----> {'skip existing'.ljust(18)}: {cls.FILTER_SKIP:,}")
+        if cls.FILTER_RESPONSE > 0:
+            vb.write(content=f"-----> {'skip statuscode'.ljust(18)}: {cls.FILTER_RESPONSE}")
-        vb.write(content=f"\n-----> {'to utilize'.ljust(18)}: {cls.SNAPSHOT_UNHANDLED:,}")
+        if cls.SNAPSHOT_UNHANDLED > 0:
+            vb.write(content=f"\n-----> {'to utilize'.ljust(18)}: {cls.SNAPSHOT_UNHANDLED:,}")
@@ -112,7 +116,23 @@ class SnapshotCollection:
         - Removes duplicates by url_archive (same timestamp and url_origin)
         - Filters the snapshots by the given mode (last or first)
         """
+        def _parse_line(line):
+            line = json.loads(line)
+            line = {
+                "timestamp": line[0],
+                "digest": line[1],
+                "mimetype": line[2],
+                "statuscode": line[3],
+                "origin": line[4],
+            }
+            url_archive = f"https://web.archive.org/web/{line['timestamp']}id_/{line['origin']}"
+            statuscode = line["statuscode"] if line["statuscode"] in ("301", "404") else None
+            return (line["timestamp"], url_archive, line["origin"], statuscode)
         vb.write(verbose=None, content="\nInserting CDX data into database...")
         with open(cdxfile, "r", encoding="utf-8") as f, tqdm(
             unit=" lines",
             total=cls.CDX_TOTAL,
@@ -123,11 +143,9 @@ class SnapshotCollection:
             line_batchsize = 2500
             line_batch = []
             total_inserted = 0
-            faulty_lines = 0
-            query_duplicates = (
-                """INSERT OR IGNORE INTO snapshot_tbl (timestamp, url_archive, url_origin) VALUES (?, ?, ?)"""
-            )
+            query_duplicates = """INSERT OR IGNORE INTO snapshot_tbl (timestamp, url_archive, url_origin, response) VALUES (?, ?, ?, ?)"""
             first_line = True
             for line in f:
                 if first_line:
                     first_line = False
@@ -137,29 +155,15 @@ class SnapshotCollection:
                     line = line.rsplit("]", 1)[0]
                 if line.endswith(","):
                     line = line.rsplit(",", 1)[0]
-                try:
-                    line = json.loads(line)
-                    line = {
-                        "timestamp": line[0],
-                        "digest": line[1],
-                        "mimetype": line[2],
-                        "status": line[3],
-                        "url": line[4],
-                    }
-                    url_archive = f"https://web.archive.org/web/{line['timestamp']}id_/{line['url']}"
-                    line_batch.append((line["timestamp"], url_archive, line["url"]))
-                    if len(line_batch) >= line_batchsize:
-                        total_inserted += len(line_batch)
-                        cls.db.cursor.executemany(query_duplicates, line_batch)
-                        line_batch = []
-                        pbar.update(line_batchsize)
-                except json.JSONDecodeError as e:
-                    faulty_lines += 1
-                    vb.write(
-                        verbose=None,
-                        content=f"JSONDecodeError: {e} on line {cls.CDX_TOTAL}",
-                    )
-                    continue
+                line_batch.append(_parse_line(line))
+                if len(line_batch) >= line_batchsize:
+                    total_inserted += len(line_batch)
+                    cls.db.cursor.executemany(query_duplicates, line_batch)
+                    line_batch = []
+                    pbar.update(line_batchsize)
             if line_batch:
                 total_inserted += len(line_batch)
                 cls.db.cursor.executemany(query_duplicates, line_batch)
@@ -167,8 +171,7 @@ class SnapshotCollection:
         cls.db.conn.commit()
-        cls.SNAPSHOT_FAULTY = faulty_lines
-        cls.FILTER_DUPLICATES = cls.CDX_TOTAL - cls.count_totals(unhandled=True) + cls.SNAPSHOT_FAULTY
+        cls.FILTER_DUPLICATES = cls.CDX_TOTAL - cls.count_totals(total=True)
@@ -181,7 +184,7 @@ class SnapshotCollection:
         """
         row_batchsize = 2500
         cls.db.cursor.execute("UPDATE snapshot_tbl SET response = NULL WHERE response = 'LOCK'") # reset locked to unprocessed
-        cls.db.cursor.execute("SELECT * FROM snapshot_tbl WHERE response IS NOT NULL") # only write processed snapshots
+        cls.db.cursor.execute("SELECT * FROM csv_view WHERE response IS NOT NULL") # only write processed snapshots
         headers = [description[0] for description in cls.db.cursor.description]
         with open(csvfile, "w", encoding="utf-8") as f:
             writer = csv.writer(f)
@@ -203,13 +206,15 @@ class SnapshotCollection:
         Create indexes for the snapshot table.
         """
         # index for filtering last snapshots
-        cls.db.cursor.execute(
-            "CREATE INDEX IF NOT EXISTS idx_snapshot_tbl_url_origin_timestamp_desc ON snapshot_tbl(url_origin, timestamp DESC);"
-        )
+        if cls.MODE_LAST:
+            cls.db.cursor.execute(
+                "CREATE INDEX IF NOT EXISTS idx_snapshot_tbl_url_origin_timestamp_desc ON snapshot_tbl(url_origin, timestamp DESC);"
+            )
         # index for filtering first snapshots
-        cls.db.cursor.execute(
-            "CREATE INDEX IF NOT EXISTS idx_snapshot_tbl_url_origin_timestamp_asc ON snapshot_tbl(url_origin, timestamp ASC);"
-        )
+        if cls.MODE_FIRST:
+            cls.db.cursor.execute(
+                "CREATE INDEX IF NOT EXISTS idx_snapshot_tbl_url_origin_timestamp_asc ON snapshot_tbl(url_origin, timestamp ASC);"
+            )
         # index for skippable snapshots
         cls.db.cursor.execute(
             "CREATE INDEX IF NOT EXISTS idx_snapshot_tbl_timestamp_url_origin_response ON snapshot_tbl(timestamp, url_origin);"
@@ -247,6 +252,26 @@ class SnapshotCollection:
                 """
             )
             cls.FILTER_MODE = cls.db.cursor.rowcount
+        cls.db.cursor.execute(
+        """
+        SELECT COUNT(*) FROM snapshot_tbl WHERE response IN ('404', '301')
+        """
+        )
+        cls.FILTER_RESPONSE = cls.db.cursor.fetchone()[0]
+        cls.db.cursor.execute(
+            """
+            WITH numbered AS (
+                SELECT rowid, ROW_NUMBER() OVER (ORDER BY rowid) AS rn
+                FROM snapshot_tbl
+            )
+            UPDATE snapshot_tbl
+            SET counter = (
+                SELECT rn FROM numbered WHERE numbered.rowid = snapshot_tbl.rowid
+            );
+            """
+        )
         cls.db.conn.commit()
@@ -259,13 +284,6 @@ class SnapshotCollection:
         """
         If an existing csv-file for the job exists, the responses will be overwritten by the csv-content.
         """
-        cls.db.cursor.execute(
-            """
-            UPDATE snapshot_tbl
-            SET response = NULL
-            """
-        )
-        cls.db.conn.commit()
         if not os.path.isfile(csvfile):
             return
         else:
@@ -327,16 +345,16 @@ class SnapshotCollection:
         if unhandled:
             return cls.db.cursor.execute("SELECT COUNT(rowid) FROM snapshot_tbl WHERE response IS NULL").fetchone()[0]
         if success:
-            return cls.db.cursor.execute("SELECT COUNT(rowid) FROM snapshot_tbl WHERE file IS NOT NULL").fetchone()[0]
+            return cls.db.cursor.execute("SELECT COUNT(rowid) FROM snapshot_tbl WHERE file IS NOT NULL AND file != ''").fetchone()[0]
         if fail:
-            return cls.db.cursor.execute("SELECT COUNT(rowid) FROM snapshot_tbl WHERE file IS NULL").fetchone()[0]
+            return cls.db.cursor.execute("SELECT COUNT(rowid) FROM snapshot_tbl WHERE file IS NULL OR file = ''").fetchone()[0]
     @staticmethod
     def modify_snapshot(connection, snapshot_id, column, value):
         """
         Modify a snapshot-row in the snapshot table.
         """
-        query = f"UPDATE snapshot_tbl SET {column} = ? WHERE rowid = ?"
+        query = f"UPDATE snapshot_tbl SET {column} = ? WHERE counter = ?"
         connection.cursor.execute(query, (value, snapshot_id))
         connection.conn.commit()
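
Reviewer note (not part of the package): the hunk at `@@ -247,6 +252,26 @@` writes a sequential `counter` into every remaining row once filtering is done, and `modify_snapshot` now addresses rows by `counter` instead of `rowid`. The sketch below replays that renumbering on an in-memory SQLite database. It assumes a reduced, hypothetical `snapshot_tbl` with only the columns needed here, and an SQLite build with window-function support (3.25 or newer). The free-standing `modify_snapshot` helper mirrors the changed query but takes a cursor directly, which is a simplification.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Reduced, hypothetical schema: only the columns this sketch needs.
cur.execute(
    "CREATE TABLE snapshot_tbl (counter INT, url_archive TEXT, response TEXT, UNIQUE (url_archive))"
)
cur.executemany(
    "INSERT OR IGNORE INTO snapshot_tbl (url_archive) VALUES (?)",
    [(f"https://web.archive.org/web/2024010100000{i}id_/https://example.com/",) for i in range(5)],
)

# Simulate a filter step: deleting rows leaves gaps in rowid (1, 3, 5 remain).
cur.execute("DELETE FROM snapshot_tbl WHERE rowid IN (2, 4)")

# Same statement shape as in the diff: number the surviving rows 1..n.
cur.execute(
    """
    WITH numbered AS (
        SELECT rowid, ROW_NUMBER() OVER (ORDER BY rowid) AS rn
        FROM snapshot_tbl
    )
    UPDATE snapshot_tbl
    SET counter = (
        SELECT rn FROM numbered WHERE numbered.rowid = snapshot_tbl.rowid
    );
    """
)
conn.commit()

rows = cur.execute("SELECT rowid, counter FROM snapshot_tbl ORDER BY rowid").fetchall()
print(rows)


def modify_snapshot(cursor, snapshot_id, column, value):
    # Simplified stand-in for SnapshotCollection.modify_snapshot: look up by counter.
    cursor.execute(f"UPDATE snapshot_tbl SET {column} = ? WHERE counter = ?", (value, snapshot_id))


modify_snapshot(cur, 2, "response", "200")
conn.commit()
```

With gaps in `rowid`, the `counter` values stay contiguous, which presumably keeps progress output such as `counter/SNAPSHOT_TOTAL` from exceeding the total.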

{pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/Verbosity.py RENAMED

@@ -3,6 +3,7 @@ from typing import Union
 class Verbosity:
     """
     A class to manage verbosity levels, logging, progress and output.

{pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/Worker.py RENAMED

@@ -37,7 +37,7 @@ class Worker:
         self.snapshot = sc.get_snapshot(self.db)
         if not self.snapshot:
             return
-        self.rowid = self.snapshot["rowid"]
+        self.counter = self.snapshot["counter"]
         self.timestamp = self.snapshot["timestamp"]
         self.url_archive = self.snapshot["url_archive"]
         self.url_origin = self.snapshot["url_origin"]
@@ -64,7 +64,7 @@ class Worker:
         if self.redirect_timestamp is None and value is None:
             return
         self._redirect_url = value
-        sc.modify_snapshot(self.db, self.rowid, "redirect_url", value)
+        sc.modify_snapshot(self.db, self.counter, "redirect_url", value)
     @property
     def redirect_timestamp(self):
@@ -75,7 +75,7 @@ class Worker:
         if self.redirect_url is None and value is None:
             return
         self._redirect_timestamp = value
-        sc.modify_snapshot(self.db, self.rowid, "redirect_timestamp", value)
+        sc.modify_snapshot(self.db, self.counter, "redirect_timestamp", value)
     @property
     def response(self):
@@ -86,7 +86,7 @@ class Worker:
         if self.redirect_url is None and value is None:
             return
         self._response = value
-        sc.modify_snapshot(self.db, self.rowid, "response", value)
+        sc.modify_snapshot(self.db, self.counter, "response", value)
     @property
     def file(self):
@@ -97,7 +97,7 @@ class Worker:
         if self.redirect_url is None and value is None:
             return
         self._file = value
-        sc.modify_snapshot(self.db, self.rowid, "file", value)
+        sc.modify_snapshot(self.db, self.counter, "file", value)
 class Message(Worker):
@@ -141,14 +141,15 @@ class Message(Worker):
                 "verbose": True,
                 "content": _format_verbose({"result": result, "info": info, "content": content}),
             }
+            self.buffer.append(self.message)
         if verbose is False or verbose is None:
             result = result + " - " if result else ""
             content = content + " - " if content else ""
             self.message = {
             "verbose": False,
-            "content": f"{self.worker.rowid}/{sc.SNAPSHOT_TOTAL} - W:{self.worker.id} - {result}{content}{self.worker.timestamp} - {self.worker.url_origin}",
+            "content": f"{self.worker.counter}/{sc.SNAPSHOT_TOTAL} - W:{self.worker.id} - {result}{content}{self.worker.timestamp} - {self.worker.url_origin}",
             }
-        self.buffer.append(self.message)
+            self.buffer.append(self.message)
     def write(self):
         """

{pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/archive_download.py RENAMED

@@ -47,7 +47,7 @@ def startup():
-def query_list(csvfile: str, cdxfile: str,queryrange: int,limit: int,start: int,end: int,explicit: bool,filter_filetype: list):
+def query_list(csvfile: str, cdxfile: str,queryrange: int,limit: int,start: int,end: int,explicit: bool,filter_filetype: list,filter_statuscode: list):
     def inject(cdxinject: str) -> bool:
         if os.path.isfile(cdxinject):
@@ -60,7 +60,7 @@ def query_list(csvfile: str, cdxfile: str,queryrange: int,limit: int,start: int,
             )
             return False
-    def create_query(queryrange: int, limit: int, filter_filetype: list, start: int, end: int, explicit: bool) -> str:
+    def create_query(queryrange: int, limit: int, filter_filetype: list, filter_statuscode: list, start: int, end: int, explicit: bool) -> str:
         if queryrange:
             query_range = f"&from={datetime.now().year - queryrange}"
         else:
@@ -81,9 +81,10 @@ def query_list(csvfile: str, cdxfile: str,queryrange: int,limit: int,start: int,
         limit = f"&limit={limit}" if limit else ""
+        filter_statuscode = (f"&filter=statuscode:({'|'.join(filter_statuscode)})$" if filter_statuscode else "")
         filter_filetype = (f"&filter=original:.*\\.({'|'.join(filter_filetype)})$" if filter_filetype else "")
-        cdxquery = f"https://web.archive.org/cdx/search/cdx?output=json&url={cdx_url}{query_range}&fl=timestamp,digest,mimetype,statuscode,original{limit}{filter_filetype}"
+        cdxquery = f"https://web.archive.org/cdx/search/cdx?output=json&url={cdx_url}{query_range}&fl=timestamp,digest,mimetype,statuscode,original{limit}{filter_filetype}{filter_statuscode}"
         return cdxquery
@@ -111,9 +112,10 @@ def query_list(csvfile: str, cdxfile: str,queryrange: int,limit: int,start: int,
     cdxinject = inject(cdxfile)
     if not cdxinject:
-        cdxquery = create_query(queryrange, limit, filter_filetype, start, end, explicit)
+        cdxquery = create_query(queryrange, limit, filter_filetype, filter_statuscode, start, end, explicit)
         cdxfile =  run_query(cdxfile, cdxquery)
     sc.process_cdx(cdxfile, csvfile)
+    sc.calculate()
@@ -131,7 +133,7 @@ def download_list(output, retry, no_redirect, delay, workers):
     threads = []
     for i in range(workers):
         worker = Worker(id=i + 1)
-        vb.write(verbose=True, content=f"\n-----> Starting worker: {worker.id}")
+        vb.write(verbose=True, content=f"\n-----> Starting Worker: {worker.id}")
         thread = threading.Thread(target=download_loop, args=(worker, output, retry, no_redirect, delay))
         threads.append(thread)
         thread.start()
@@ -163,7 +165,7 @@ def download_loop(worker, output, retry, no_redirect, delay):
             while worker.attempt <= retry_max_attempt: # retry as given by user
-                worker.message.store(verbose=True, content=f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.rowid}/{sc.SNAPSHOT_TOTAL}]")
+                worker.message.store(verbose=True, content=f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.counter}/{sc.SNAPSHOT_TOTAL}]")
                 download_attempt = 1
                 download_max_attempt = 3
@@ -180,11 +182,11 @@ def download_loop(worker, output, retry, no_redirect, delay):
                                 download_attempt += 1  # try again 2x with same connection
                                 vb.write(
                                     verbose=True,
-                                    content=f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.rowid}/{sc.SNAPSHOT_TOTAL}] - {e.__class__.__name__} - requesting again in 50 seconds...",
+                                    content=f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.counter}/{sc.SNAPSHOT_TOTAL}] - {e.__class__.__name__} - requesting again in 50 seconds...",
                                 )
                                 vb.write(
                                     verbose=False,
-                                    content=f"Worker: {worker.id} - Snapshot {worker.rowid}/{sc.SNAPSHOT_TOTAL} - requesting again in 50 seconds...",
+                                    content=f"Worker: {worker.id} - Snapshot {worker.counter}/{sc.SNAPSHOT_TOTAL} - requesting again in 50 seconds...",
                                 )
                                 time.sleep(50)
                                 continue
@@ -195,17 +197,17 @@ def download_loop(worker, output, retry, no_redirect, delay):
                                 download_attempt = download_max_attempt  # try again 1x with new connection
                                 vb.write(
                                     verbose=True,
-                                    content=f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.rowid}/{sc.SNAPSHOT_TOTAL}] - {e.__class__.__name__} - renewing connection in 15 seconds...",
+                                    content=f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.counter}/{sc.SNAPSHOT_TOTAL}] - {e.__class__.__name__} - renewing connection in 15 seconds...",
                                 )
                                 vb.write(
                                     verbose=False,
-                                    content=f"Worker: {worker.id} - Snapshot {worker.rowid}/{sc.SNAPSHOT_TOTAL} - renewing connection in 15 seconds...",
+                                    content=f"Worker: {worker.id} - Snapshot {worker.counter}/{sc.SNAPSHOT_TOTAL} - renewing connection in 15 seconds...",
                                 )
                                 time.sleep(15)
                                 worker.refresh_connection()
                                 continue
                         else:
-                            ex.exception(f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.rowid}/{sc.SNAPSHOT_TOTAL}] - EXCEPTION - {e}", e=e)
+                            ex.exception(f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.counter}/{sc.SNAPSHOT_TOTAL}] - EXCEPTION - {e}", e=e)
                             worker.attempt = retry_max_attempt
                             break
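
Reviewer note (not part of the package): the `create_query` hunk appends a second `&filter=` parameter for status codes next to the existing filetype filter. The sketch below isolates just that string building so the resulting query fragment can be inspected. The helper name `build_filters` is hypothetical, and it assumes both arguments arrive as lists of strings (how `config.statuscode` is parsed is not shown in this diff).

```python
def build_filters(filter_filetype, filter_statuscode):
    """Return the CDX filter fragment built the same way as in create_query."""
    # Status codes are joined into a regex alternation, e.g. (200|301).
    statuscode = (
        f"&filter=statuscode:({'|'.join(filter_statuscode)})$" if filter_statuscode else ""
    )
    # Filetypes match on the file extension at the end of the original URL.
    filetype = (
        f"&filter=original:.*\\.({'|'.join(filter_filetype)})$" if filter_filetype else ""
    )
    # Same order as in the diff: filetype filter first, then statuscode filter.
    return f"{filetype}{statuscode}"


fragment = build_filters(["jpg", "css"], ["200", "301"])
print(fragment)
```

An empty or missing list yields an empty string for that filter, so a query without `--statuscode` is unchanged from 3.2.1.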

{pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/db.py RENAMED

@@ -19,6 +19,7 @@ class Database:
         filter_complete INTEGER
     )"""
     snapshot_table = """CREATE TABLE IF NOT EXISTS snapshot_tbl (
+        counter INT,
         timestamp TEXT,
         url_archive TEXT,
         url_origin TEXT,
@@ -28,6 +29,18 @@ class Database:
         file TEXT,
         UNIQUE (url_archive)
     )"""
+    csv_view = """CREATE VIEW IF NOT EXISTS csv_view
+        AS
+            SELECT
+                timestamp AS timestamp,
+                url_archive AS url_archive,
+                url_origin AS url_origin,
+                redirect_url AS redirect_url,
+                redirect_timestamp AS redirect_timestamp,
+                response AS response,
+                file AS file
+        FROM snapshot_tbl;
+    """
     QUERY_EXIST = False
     QUERY_PROGRESS = "0 / 0"
@@ -38,6 +51,7 @@ class Database:
         db = Database()
         db.cursor.execute(cls.waybackup_table)
         db.cursor.execute(cls.snapshot_table)
+        db.cursor.execute(cls.csv_view)
         db.cursor.execute("SELECT query_identifier FROM waybackup_table WHERE query_identifier = ?", (query_identifier,))
         if db.cursor.fetchone():
             cls.QUERY_EXIST = True
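
Reviewer note (not part of the package): `csv_create` now selects from the new `csv_view` instead of `snapshot_tbl`, and the view lists every column except `counter`. The sketch below checks the effect on the CSV header row. The table definition is an assumption: the diff elides the columns between `url_origin` and `file`, so they are inferred here from the view's column list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Assumed schema: the middle columns are inferred from csv_view, not shown in the diff.
cur.execute(
    """CREATE TABLE IF NOT EXISTS snapshot_tbl (
        counter INT,
        timestamp TEXT,
        url_archive TEXT,
        url_origin TEXT,
        redirect_url TEXT,
        redirect_timestamp TEXT,
        response TEXT,
        file TEXT,
        UNIQUE (url_archive)
    )"""
)
# View as added in db.py (aliases omitted, since they repeat the column names).
cur.execute(
    """CREATE VIEW IF NOT EXISTS csv_view AS
        SELECT timestamp, url_archive, url_origin,
               redirect_url, redirect_timestamp, response, file
        FROM snapshot_tbl"""
)

cur.executemany(
    "INSERT INTO snapshot_tbl (counter, timestamp, url_archive, url_origin, response) VALUES (?, ?, ?, ?, ?)",
    [
        (1, "20240101000000", "https://web.archive.org/web/20240101000000id_/https://example.com/", "https://example.com/", "200"),
        (2, "20240102000000", "https://web.archive.org/web/20240102000000id_/https://example.com/", "https://example.com/", None),
    ],
)
conn.commit()

# Same query as csv_create: only processed snapshots are exported.
cur.execute("SELECT * FROM csv_view WHERE response IS NOT NULL")
headers = [description[0] for description in cur.description]
rows = cur.fetchall()
print(headers)
```

Under these assumptions the exported header row has no `counter` column, so the CSV layout matches what 3.2.1 produced from the table directly.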

{pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/main.py RENAMED

@@ -29,7 +29,7 @@ def main():
         archive_download.startup()
         try:
-            archive_download.query_list(config.csvfile, config.cdxfile, config.range, config.limit, config.start, config.end, config.explicit, config.filetype)
+            archive_download.query_list(config.csvfile, config.cdxfile, config.range, config.limit, config.start, config.end, config.explicit, config.filetype, config.statuscode)
             archive_download.download_list(config.output, config.retry, config.no_redirect, config.delay, config.workers)
         except KeyboardInterrupt:
             print("\nInterrupted by user\n")
@@ -38,7 +38,7 @@ def main():
         except Exception as e:
             config.keep = True
-            ex.exception(content="", e=e)
+            ex.exception(message="", e=e)
         finally:
             sc.csv_create(config.csvfile)

{pywaybackup-3.2.1 → pywaybackup-3.3.1/pywaybackup.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: pywaybackup
-Version: 3.2.1
+Version: 3.3.1
 Summary: Query and download archive.org as simple as possible.
 Author-email: bitdruid <bitdruid@outlook.com>
 License: MIT License
@@ -55,16 +55,16 @@ This tool allows you to download content from the Wayback Machine (archive.org).
 ### Pip
 1. Install the package <br>
-   ```pip install pywaybackup```
+   `pip install pywaybackup`
 2. Run the tool <br>
-   ```waybackup -h```
+   `waybackup -h`
 ### Manual
 1. Clone the repository <br>
-   ```git clone https://github.com/bitdruid/python-wayback-machine-downloader.git```
+   `git clone https://github.com/bitdruid/python-wayback-machine-downloader.git`
 2. Install <br>
-   ```pip install .```
+   `pip install .`
    - in a virtual env or use `--break-system-package`
 ## notes / issues / hints
@@ -88,6 +88,7 @@ This tool allows you to download content from the Wayback Machine (archive.org).
   The URL of the web page to download. This argument is required.
 #### Mode Selection (Choose One)
 - **`-a`**, **`--all`**:<br>
   Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
 - **`-l`**, **`--last`**:<br>
@@ -102,66 +103,67 @@ This tool allows you to download content from the Wayback Machine (archive.org).
 - **`-e`**, **`--explicit`**:<br>
   Only download the explicit given URL. No wildcard subdomains or paths. Use e.g. to get root-only snapshots. This is recommended for explicit files like `login.html` or `?query=this`.
-- **`--filetype`** `<filetype>`:<br>
-  Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
 - **`--limit`** `<count>`:<br>
-Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
+  Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
 - **Range Selection:**<br>
   Specify the range in years or a specific timestamp either start, end, or both. If you specify the `range` argument, the `start` and `end` arguments will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.<br>
   (year 2019, year 2019, year+month+day 20190101, year+month+day+hour 2019010112)
-   - **`-r`**, **`--range`**:<br>
-     Specify the range in years for which to search and download snapshots.
-   - **`--start`**:<br>
-     Timestamp to start searching.
-   - **`--end`**:<br>
-     Timestamp to end searching.
+  - **`-r`**, **`--range`**:<br>
+    Specify the range in years for which to search and download snapshots.
+  - **`--start`**:<br>
+    Timestamp to start searching.
+  - **`--end`**:<br>
+    Timestamp to end searching.
+- **Filtering:**<br>
+  A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter.
+  - **`--filetype`** `<filetype>`:<br>
+    Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
+  - **`--statuscode`** `<statuscode>`:<br>
+    Specify HTTP status codes to download. Default is all statuscodes. Separate multiple status codes with a comma. Example: `--statuscode 200,301`. Pywaybackup will try to download any snapshot regardless of it's statuscode. For 404 of course this means logged errors and corresponding entries in the csv. However, you may want to get a csv that includes these negative attempts for your needs.<br>
+    Common status codes you may want to handle/filter:
+      - `200` (OK)
+      - `301` (Moved Permanently - will redirect snapshot)
+      - `404` (Not Found - snapshot seems to be empty)
+      - `500` (Internal Server Error - snapshot is at least for now not available)
 ### Optional
 #### Behavior Manipulation
 - **`-o`**, **`--output`**:<br>
-Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
+  Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
 - **`-m`**, **`--metadata`**<br>
-Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
+  Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
+- **`--verbose`**:<br>
+  Increase output verbosity.
 <!-- - **`--verbosity`** `<level>`:<br>
 Sets verbosity level. Options are `info`and `trace`. Default is `info`. -->
 - **`--log`** <!-- `<path>` -->:<br>
-Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
+  Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
 - **`--progress`**:<br>
-Shows a progress bar instead of the default output.
+  Shows a progress bar instead of the default output.
 - **`--workers`** `<count>`:<br>
-Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
+  Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
 - **`--no-redirect`**:<br>
-Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
+  Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
 - **`--retry`** `<attempts>`:<br>
-Specifies number of retry attempts for failed downloads.
+  Specifies number of retry attempts for failed downloads.
 - **`--delay`** `<seconds>`:<br>
-Specifies delay between download requests in seconds. Default is no delay (0).
-- **`--verbose`**:<br>
-Increase output verbosity.
-  - verbose:
-  ```
-  -----> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
-  SUCCESS   -> 200 OK
-            -> URL:  https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
-            -> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css
-  ```
-  - non-verbose:
-  ```
-  55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
-  ```
+  Specifies delay between download requests in seconds. Default is no delay (0).
 <!-- - **`--convert-links`**:<br>
 If set, all links in the downloaded files will be converted to local links. This is useful for offline browsing. The links are converted to the local path structure. Show output with `--verbosity trace`. -->
@@ -186,14 +188,16 @@ If set, all links in the downloaded files will be converted to local links. This
 - Detects existing `.cdx` and `.db` files in an `output dir` to resume downloading from the last successful point.
 - Compares `URL`, `mode`, and `optional query parameters` to ensure automatic resumption.
 - Skips previously downloaded files to save time.
-> **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
+  > **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
 #### Resetting a Job (`--reset`)
 - Deletes `.cdx` and `.db` files and restarts the process from scratch.
 - Does **not** remove already downloaded files.
 - `waybackup -u https://example.com -a --reset`
 #### Keeping Job Data (`--keep`)
 - Normally, `.cdx` and `.db` files are deleted after a successful job.
 - `--keep` preserves them for future re-analysis or extending the query.
 - `waybackup -u https://example.com -a --keep`
@@ -204,13 +208,13 @@ If set, all links in the downloaded files will be converted to local links. This
 ## Examples
 1. Download a specific single snapshot of all available files (starting from root):<br>
-`waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000`
+   `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000`
 2. Download a specific single snapshot of all available files (starting from a subdirectory):<br>
-`waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000`
+   `waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000`
 3. Download a specific single snapshot of the exact given URL (no subdirs):<br>
-`waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit`
+   `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit`
 4. Download all snapshots of all available files in the given range:<br>
-`waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000`
+   `waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000`
 <br>
 <br>
@@ -223,7 +227,9 @@ The output path is currently structured as follows by an example for the query:<
 `http://example.com/subdir1/subdir2/assets/`
 <br><br>
 For the first and last version (`-f` or `-l`):
 - Will only include all files/folders starting from your query-path.
 ```
 your/path/waybackup_snapshots/
 └── the_root_of_your_query/ (example.com/)
@@ -234,8 +240,11 @@ your/path/waybackup_snapshots/
                 ├── style.css
                 ...
 ```
 For all versions (`-a`):
 - Will create a folder named as the root of your query. Inside this folder, you will find all timestamps and per timestamp the path you requested.
 ```
 your/path/waybackup_snapshots/
 └── the_root_of_your_query/ (example.com/)
@@ -276,6 +285,23 @@ For download queries:
 ]
 ```
+### Log
+Verbose:
+```
+-----> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
+SUCCESS   -> 200 OK
+          -> URL:  https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
+          -> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css
+```
+Non-verbose:
+```
+55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
+```
 ### Debugging
 Exceptions will be written into `waybackup_error.log` (each run overwrites the file).