pywaybackup 3.2.1__tar.gz → 3.3.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (23)
  1. {pywaybackup-3.2.1/pywaybackup.egg-info → pywaybackup-3.3.1}/PKG-INFO +68 -42
  2. {pywaybackup-3.2.1 → pywaybackup-3.3.1}/README.md +67 -41
  3. {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pyproject.toml +1 -1
  4. {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/Arguments.py +58 -6
  5. {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/SnapshotCollection.py +80 -62
  6. {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/Verbosity.py +1 -0
  7. {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/Worker.py +8 -7
  8. {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/archive_download.py +13 -11
  9. {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/db.py +14 -0
  10. {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/main.py +2 -2
  11. {pywaybackup-3.2.1 → pywaybackup-3.3.1/pywaybackup.egg-info}/PKG-INFO +68 -42
  12. {pywaybackup-3.2.1 → pywaybackup-3.3.1}/LICENSE +0 -0
  13. {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/Converter.py +0 -0
  14. {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/Exception.py +0 -0
  15. {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/__init__.py +0 -0
  16. {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/archive_save.py +0 -0
  17. {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/helper.py +0 -0
  18. {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup.egg-info/SOURCES.txt +0 -0
  19. {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup.egg-info/dependency_links.txt +0 -0
  20. {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup.egg-info/entry_points.txt +0 -0
  21. {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup.egg-info/requires.txt +0 -0
  22. {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup.egg-info/top_level.txt +0 -0
  23. {pywaybackup-3.2.1 → pywaybackup-3.3.1}/setup.cfg +0 -0
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: pywaybackup
3
- Version: 3.2.1
3
+ Version: 3.3.1
4
4
  Summary: Query and download archive.org as simple as possible.
5
5
  Author-email: bitdruid <bitdruid@outlook.com>
6
6
  License: MIT License
@@ -55,16 +55,16 @@ This tool allows you to download content from the Wayback Machine (archive.org).
55
55
  ### Pip
56
56
 
57
57
  1. Install the package <br>
58
- ```pip install pywaybackup```
58
+ `pip install pywaybackup`
59
59
  2. Run the tool <br>
60
- ```waybackup -h```
60
+ `waybackup -h`
61
61
 
62
62
  ### Manual
63
63
 
64
64
  1. Clone the repository <br>
65
- ```git clone https://github.com/bitdruid/python-wayback-machine-downloader.git```
65
+ `git clone https://github.com/bitdruid/python-wayback-machine-downloader.git`
66
66
  2. Install <br>
67
- ```pip install .```
67
+ `pip install .`
68
68
  - in a virtual env or use `--break-system-package`
69
69
 
70
70
  ## notes / issues / hints
@@ -88,6 +88,7 @@ This tool allows you to download content from the Wayback Machine (archive.org).
88
88
  The URL of the web page to download. This argument is required.
89
89
 
90
90
  #### Mode Selection (Choose One)
91
+
91
92
  - **`-a`**, **`--all`**:<br>
92
93
  Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
93
94
  - **`-l`**, **`--last`**:<br>
@@ -102,66 +103,67 @@ This tool allows you to download content from the Wayback Machine (archive.org).
102
103
  - **`-e`**, **`--explicit`**:<br>
103
104
  Only download the explicit given URL. No wildcard subdomains or paths. Use e.g. to get root-only snapshots. This is recommended for explicit files like `login.html` or `?query=this`.
104
105
 
105
- - **`--filetype`** `<filetype>`:<br>
106
- Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
107
-
108
106
  - **`--limit`** `<count>`:<br>
109
- Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
107
+ Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
110
108
 
111
109
  - **Range Selection:**<br>
112
110
  Specify the range in years or a specific timestamp either start, end, or both. If you specify the `range` argument, the `start` and `end` arguments will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.<br>
113
111
  (year 2019, year 2019, year+month+day 20190101, year+month+day+hour 2019010112)
114
- - **`-r`**, **`--range`**:<br>
115
- Specify the range in years for which to search and download snapshots.
116
- - **`--start`**:<br>
117
- Timestamp to start searching.
118
- - **`--end`**:<br>
119
- Timestamp to end searching.
112
+
113
+ - **`-r`**, **`--range`**:<br>
114
+ Specify the range in years for which to search and download snapshots.
115
+ - **`--start`**:<br>
116
+ Timestamp to start searching.
117
+ - **`--end`**:<br>
118
+ Timestamp to end searching.
119
+
120
+ - **Filtering:**<br>
121
+ A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter.
122
+
123
+ - **`--filetype`** `<filetype>`:<br>
124
+ Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
125
+
126
+ - **`--statuscode`** `<statuscode>`:<br>
127
+ Specify HTTP status codes to download. Default is all statuscodes. Separate multiple status codes with a comma. Example: `--statuscode 200,301`. Pywaybackup will try to download any snapshot regardless of it's statuscode. For 404 of course this means logged errors and corresponding entries in the csv. However, you may want to get a csv that includes these negative attempts for your needs.<br>
128
+ Common status codes you may want to handle/filter:
129
+ - `200` (OK)
130
+ - `301` (Moved Permanently - will redirect snapshot)
131
+ - `404` (Not Found - snapshot seems to be empty)
132
+ - `500` (Internal Server Error - snapshot is at least for now not available)
120
133
 
121
134
  ### Optional
122
135
 
123
136
  #### Behavior Manipulation
124
137
 
125
138
  - **`-o`**, **`--output`**:<br>
126
- Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
139
+ Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
127
140
 
128
141
  - **`-m`**, **`--metadata`**<br>
129
- Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
142
+ Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
143
+
144
+ - **`--verbose`**:<br>
145
+ Increase output verbosity.
130
146
 
131
147
  <!-- - **`--verbosity`** `<level>`:<br>
132
148
  Sets verbosity level. Options are `info`and `trace`. Default is `info`. -->
133
149
 
134
150
  - **`--log`** <!-- `<path>` -->:<br>
135
- Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
151
+ Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
136
152
 
137
153
  - **`--progress`**:<br>
138
- Shows a progress bar instead of the default output.
154
+ Shows a progress bar instead of the default output.
139
155
 
140
156
  - **`--workers`** `<count>`:<br>
141
- Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
157
+ Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
142
158
 
143
159
  - **`--no-redirect`**:<br>
144
- Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
160
+ Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
145
161
 
146
162
  - **`--retry`** `<attempts>`:<br>
147
- Specifies number of retry attempts for failed downloads.
163
+ Specifies number of retry attempts for failed downloads.
148
164
 
149
165
  - **`--delay`** `<seconds>`:<br>
150
- Specifies delay between download requests in seconds. Default is no delay (0).
151
-
152
- - **`--verbose`**:<br>
153
- Increase output verbosity.
154
- - verbose:
155
- ```
156
- -----> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
157
- SUCCESS -> 200 OK
158
- -> URL: https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
159
- -> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css
160
- ```
161
- - non-verbose:
162
- ```
163
- 55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
164
- ```
166
+ Specifies delay between download requests in seconds. Default is no delay (0).
165
167
 
166
168
  <!-- - **`--convert-links`**:<br>
167
169
  If set, all links in the downloaded files will be converted to local links. This is useful for offline browsing. The links are converted to the local path structure. Show output with `--verbosity trace`. -->
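Reviewer note on the hunk above: the README says a filter produces a filtered cdx-file, and that range timestamps can be shortened from the right. The sketch below shows one way such options could be turned into a query against the public Wayback CDX server. It is not taken from pywaybackup: `build_cdx_query` and its parameters are hypothetical, and the regex alternation inside `filter` is an assumption based on the server's `filter=field:regex` form.

```python
from urllib.parse import urlencode

# Public Wayback Machine CDX endpoint.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, start=None, end=None, statuscodes=None, limit=None):
    """Build a CDX query URL (hypothetical helper, not part of pywaybackup)."""
    params = [("url", url), ("output", "json")]
    if start:
        # Partial timestamps are allowed, e.g. 2019 or 20190101.
        params.append(("from", str(start)))
    if end:
        params.append(("to", str(end)))
    if statuscodes:
        # Assumption: a single regex alternation on the statuscode field.
        params.append(("filter", "statuscode:(" + "|".join(statuscodes) + ")"))
    if limit:
        params.append(("limit", str(limit)))
    return CDX_ENDPOINT + "?" + urlencode(params)


query = build_cdx_query(
    "example.com/*", start=2019, end=20190101, statuscodes=["200", "301"], limit=5
)
print(query)
```

Because the filter is applied at query time in this sketch, a later unfiltered download would need a fresh query, which matches the README's remark about querying again without the filter.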
@@ -186,14 +188,16 @@ If set, all links in the downloaded files will be converted to local links. This
186
188
  - Detects existing `.cdx` and `.db` files in an `output dir` to resume downloading from the last successful point.
187
189
  - Compares `URL`, `mode`, and `optional query parameters` to ensure automatic resumption.
188
190
  - Skips previously downloaded files to save time.
189
- > **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
191
+ > **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
190
192
 
191
193
  #### Resetting a Job (`--reset`)
194
+
192
195
  - Deletes `.cdx` and `.db` files and restarts the process from scratch.
193
196
  - Does **not** remove already downloaded files.
194
197
  - `waybackup -u https://example.com -a --reset`
195
198
 
196
199
  #### Keeping Job Data (`--keep`)
200
+
197
201
  - Normally, `.cdx` and `.db` files are deleted after a successful job.
198
202
  - `--keep` preserves them for future re-analysis or extending the query.
199
203
  - `waybackup -u https://example.com -a --keep`
@@ -204,13 +208,13 @@ If set, all links in the downloaded files will be converted to local links. This
204
208
  ## Examples
205
209
 
206
210
  1. Download a specific single snapshot of all available files (starting from root):<br>
207
- `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000`
211
+ `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000`
208
212
  2. Download a specific single snapshot of all available files (starting from a subdirectory):<br>
209
- `waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000`
213
+ `waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000`
210
214
  3. Download a specific single snapshot of the exact given URL (no subdirs):<br>
211
- `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit`
215
+ `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit`
212
216
  4. Download all snapshots of all available files in the given range:<br>
213
- `waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000`
217
+ `waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000`
214
218
 
215
219
  <br>
216
220
  <br>
@@ -223,7 +227,9 @@ The output path is currently structured as follows by an example for the query:<
223
227
  `http://example.com/subdir1/subdir2/assets/`
224
228
  <br><br>
225
229
  For the first and last version (`-f` or `-l`):
230
+
226
231
  - Will only include all files/folders starting from your query-path.
232
+
227
233
  ```
228
234
  your/path/waybackup_snapshots/
229
235
  └── the_root_of_your_query/ (example.com/)
@@ -234,8 +240,11 @@ your/path/waybackup_snapshots/
234
240
  ├── style.css
235
241
  ...
236
242
  ```
243
+
237
244
  For all versions (`-a`):
245
+
238
246
  - Will create a folder named as the root of your query. Inside this folder, you will find all timestamps and per timestamp the path you requested.
247
+
239
248
  ```
240
249
  your/path/waybackup_snapshots/
241
250
  └── the_root_of_your_query/ (example.com/)
@@ -276,6 +285,23 @@ For download queries:
276
285
  ]
277
286
  ```
278
287
 
288
+ ### Log
289
+
290
+ Verbose:
291
+
292
+ ```
293
+ -----> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
294
+ SUCCESS -> 200 OK
295
+ -> URL: https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
296
+ -> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css
297
+ ```
298
+
299
+ Non-verbose:
300
+
301
+ ```
302
+ 55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
303
+ ```
304
+
279
305
  ### Debugging
280
306
 
281
307
  Exceptions will be written into `waybackup_error.log` (each run overwrites the file).
@@ -16,16 +16,16 @@ This tool allows you to download content from the Wayback Machine (archive.org).
16
16
  ### Pip
17
17
 
18
18
  1. Install the package <br>
19
- ```pip install pywaybackup```
19
+ `pip install pywaybackup`
20
20
  2. Run the tool <br>
21
- ```waybackup -h```
21
+ `waybackup -h`
22
22
 
23
23
  ### Manual
24
24
 
25
25
  1. Clone the repository <br>
26
- ```git clone https://github.com/bitdruid/python-wayback-machine-downloader.git```
26
+ `git clone https://github.com/bitdruid/python-wayback-machine-downloader.git`
27
27
  2. Install <br>
28
- ```pip install .```
28
+ `pip install .`
29
29
  - in a virtual env or use `--break-system-package`
30
30
 
31
31
  ## notes / issues / hints
@@ -49,6 +49,7 @@ This tool allows you to download content from the Wayback Machine (archive.org).
49
49
  The URL of the web page to download. This argument is required.
50
50
 
51
51
  #### Mode Selection (Choose One)
52
+
52
53
  - **`-a`**, **`--all`**:<br>
53
54
  Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
54
55
  - **`-l`**, **`--last`**:<br>
@@ -63,66 +64,67 @@ This tool allows you to download content from the Wayback Machine (archive.org).
63
64
  - **`-e`**, **`--explicit`**:<br>
64
65
  Only download the explicit given URL. No wildcard subdomains or paths. Use e.g. to get root-only snapshots. This is recommended for explicit files like `login.html` or `?query=this`.
65
66
 
66
- - **`--filetype`** `<filetype>`:<br>
67
- Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
68
-
69
67
  - **`--limit`** `<count>`:<br>
70
- Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
68
+ Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
71
69
 
72
70
  - **Range Selection:**<br>
73
71
  Specify the range in years or a specific timestamp either start, end, or both. If you specify the `range` argument, the `start` and `end` arguments will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.<br>
74
72
  (year 2019, year 2019, year+month+day 20190101, year+month+day+hour 2019010112)
75
- - **`-r`**, **`--range`**:<br>
76
- Specify the range in years for which to search and download snapshots.
77
- - **`--start`**:<br>
78
- Timestamp to start searching.
79
- - **`--end`**:<br>
80
- Timestamp to end searching.
73
+
74
+ - **`-r`**, **`--range`**:<br>
75
+ Specify the range in years for which to search and download snapshots.
76
+ - **`--start`**:<br>
77
+ Timestamp to start searching.
78
+ - **`--end`**:<br>
79
+ Timestamp to end searching.
80
+
81
+ - **Filtering:**<br>
82
+ A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter.
83
+
84
+ - **`--filetype`** `<filetype>`:<br>
85
+ Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
86
+
87
+ - **`--statuscode`** `<statuscode>`:<br>
88
+ Specify HTTP status codes to download. Default is all statuscodes. Separate multiple status codes with a comma. Example: `--statuscode 200,301`. Pywaybackup will try to download any snapshot regardless of it's statuscode. For 404 of course this means logged errors and corresponding entries in the csv. However, you may want to get a csv that includes these negative attempts for your needs.<br>
89
+ Common status codes you may want to handle/filter:
90
+ - `200` (OK)
91
+ - `301` (Moved Permanently - will redirect snapshot)
92
+ - `404` (Not Found - snapshot seems to be empty)
93
+ - `500` (Internal Server Error - snapshot is at least for now not available)
81
94
 
82
95
  ### Optional
83
96
 
84
97
  #### Behavior Manipulation
85
98
 
86
99
  - **`-o`**, **`--output`**:<br>
87
- Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
100
+ Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
88
101
 
89
102
  - **`-m`**, **`--metadata`**<br>
90
- Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
103
+ Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
104
+
105
+ - **`--verbose`**:<br>
106
+ Increase output verbosity.
91
107
 
92
108
  <!-- - **`--verbosity`** `<level>`:<br>
93
109
  Sets verbosity level. Options are `info`and `trace`. Default is `info`. -->
94
110
 
95
111
  - **`--log`** <!-- `<path>` -->:<br>
96
- Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
112
+ Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
97
113
 
98
114
  - **`--progress`**:<br>
99
- Shows a progress bar instead of the default output.
115
+ Shows a progress bar instead of the default output.
100
116
 
101
117
  - **`--workers`** `<count>`:<br>
102
- Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
118
+ Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
103
119
 
104
120
  - **`--no-redirect`**:<br>
105
- Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
121
+ Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
106
122
 
107
123
  - **`--retry`** `<attempts>`:<br>
108
- Specifies number of retry attempts for failed downloads.
124
+ Specifies number of retry attempts for failed downloads.
109
125
 
110
126
  - **`--delay`** `<seconds>`:<br>
111
- Specifies delay between download requests in seconds. Default is no delay (0).
112
-
113
- - **`--verbose`**:<br>
114
- Increase output verbosity.
115
- - verbose:
116
- ```
117
- -----> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
118
- SUCCESS -> 200 OK
119
- -> URL: https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
120
- -> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css
121
- ```
122
- - non-verbose:
123
- ```
124
- 55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
125
- ```
127
+ Specifies delay between download requests in seconds. Default is no delay (0).
126
128
 
127
129
  <!-- - **`--convert-links`**:<br>
128
130
  If set, all links in the downloaded files will be converted to local links. This is useful for offline browsing. The links are converted to the local path structure. Show output with `--verbosity trace`. -->
@@ -147,14 +149,16 @@ If set, all links in the downloaded files will be converted to local links. This
147
149
  - Detects existing `.cdx` and `.db` files in an `output dir` to resume downloading from the last successful point.
148
150
  - Compares `URL`, `mode`, and `optional query parameters` to ensure automatic resumption.
149
151
  - Skips previously downloaded files to save time.
150
- > **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
152
+ > **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
151
153
 
152
154
  #### Resetting a Job (`--reset`)
155
+
153
156
  - Deletes `.cdx` and `.db` files and restarts the process from scratch.
154
157
  - Does **not** remove already downloaded files.
155
158
  - `waybackup -u https://example.com -a --reset`
156
159
 
157
160
  #### Keeping Job Data (`--keep`)
161
+
158
162
  - Normally, `.cdx` and `.db` files are deleted after a successful job.
159
163
  - `--keep` preserves them for future re-analysis or extending the query.
160
164
  - `waybackup -u https://example.com -a --keep`
@@ -165,13 +169,13 @@ If set, all links in the downloaded files will be converted to local links. This
165
169
  ## Examples
166
170
 
167
171
  1. Download a specific single snapshot of all available files (starting from root):<br>
168
- `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000`
172
+ `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000`
169
173
  2. Download a specific single snapshot of all available files (starting from a subdirectory):<br>
170
- `waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000`
174
+ `waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000`
171
175
  3. Download a specific single snapshot of the exact given URL (no subdirs):<br>
172
- `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit`
176
+ `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit`
173
177
  4. Download all snapshots of all available files in the given range:<br>
174
- `waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000`
178
+ `waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000`
175
179
 
176
180
  <br>
177
181
  <br>
@@ -184,7 +188,9 @@ The output path is currently structured as follows by an example for the query:<
184
188
  `http://example.com/subdir1/subdir2/assets/`
185
189
  <br><br>
186
190
  For the first and last version (`-f` or `-l`):
191
+
187
192
  - Will only include all files/folders starting from your query-path.
193
+
188
194
  ```
189
195
  your/path/waybackup_snapshots/
190
196
  └── the_root_of_your_query/ (example.com/)
@@ -195,8 +201,11 @@ your/path/waybackup_snapshots/
195
201
  ├── style.css
196
202
  ...
197
203
  ```
204
+
198
205
  For all versions (`-a`):
206
+
199
207
  - Will create a folder named as the root of your query. Inside this folder, you will find all timestamps and per timestamp the path you requested.
208
+
200
209
  ```
201
210
  your/path/waybackup_snapshots/
202
211
  └── the_root_of_your_query/ (example.com/)
@@ -237,6 +246,23 @@ For download queries:
237
246
  ]
238
247
  ```
239
248
 
249
+ ### Log
250
+
251
+ Verbose:
252
+
253
+ ```
254
+ -----> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
255
+ SUCCESS -> 200 OK
256
+ -> URL: https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
257
+ -> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css
258
+ ```
259
+
260
+ Non-verbose:
261
+
262
+ ```
263
+ 55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
264
+ ```
265
+
240
266
  ### Debugging
241
267
 
242
268
  Exceptions will be written into `waybackup_error.log` (each run overwrites the file).
@@ -7,7 +7,7 @@ packages = ["pywaybackup"]
7
7
 
8
8
  [project]
9
9
  name = "pywaybackup"
10
- version = "3.2.1"
10
+ version = "3.3.1"
11
11
  description = "Query and download archive.org as simple as possible."
12
12
  authors = [
13
13
  { name = "bitdruid", email = "bitdruid@outlook.com" }
@@ -3,6 +3,8 @@ import sys
3
3
  import os
4
4
  import argparse
5
5
 
6
+ from argparse import RawTextHelpFormatter
7
+
6
8
  from importlib.metadata import version
7
9
 
8
10
  from pywaybackup.helper import url_split, sanitize_filename
@@ -10,9 +12,10 @@ from pywaybackup.helper import url_split, sanitize_filename
10
12
  class Arguments:
11
13
 
12
14
  def __init__(self):
13
-
14
- parser = argparse.ArgumentParser(description='Download from wayback machine (archive.org)')
15
- parser.add_argument('-v', '--version', action='version', version='%(prog)s ' + version("pywaybackup") + ' by @bitdruid -> https://github.com/bitdruid')
15
+ parser = argparse.ArgumentParser(
16
+ description=f"<<< python-wayback-machine-downloader v{version('pywaybackup')} >>>\nby @bitdruid -> https://github.com/bitdruid",
17
+ formatter_class=RawTextHelpFormatter,
18
+ )
16
19
 
17
20
  required = parser.add_argument_group('required (one exclusive)')
18
21
  required.add_argument('-u', '--url', type=str, metavar="", help='url (with subdir/subdomain) to download')
@@ -27,12 +30,14 @@ class Arguments:
27
30
  optional.add_argument('-r', '--range', type=int, metavar="", help='range in years to search')
28
31
  optional.add_argument('--start', type=int, metavar="", help='start timestamp format: YYYYMMDDhhmmss')
29
32
  optional.add_argument('--end', type=int, metavar="", help='end timestamp format: YYYYMMDDhhmmss')
30
- optional.add_argument('--filetype', type=str, metavar="", help='filetypes to download comma separated (e.g. "html,css")')
31
33
  optional.add_argument('--limit', type=int, nargs='?', const=True, metavar='int', help='limit the number of snapshots to download')
34
+ optional.add_argument('--filetype', type=str, metavar="", help='filetypes to download comma separated (js,css,...)')
35
+ optional.add_argument('--statuscode', type=str, metavar="", help='statuscodes to download comma separated (200,404,...)')
32
36
 
33
37
  behavior = parser.add_argument_group('manipulate behavior')
34
38
  behavior.add_argument('-o', '--output', type=str, metavar="", help='output for all files - defaults to current directory')
35
39
  behavior.add_argument('-m', '--metadata', type=str, metavar="", help='change directory for db/cdx/csv/log files')
40
+ behavior.add_argument('-v', '--verbose', action='store_true', help='overwritten by progress - gives detailed output')
36
41
  behavior.add_argument('--log', action='store_true', help='save a log file into the output folder')
37
42
  behavior.add_argument('--progress', action='store_true', help='show a progress bar')
38
43
  behavior.add_argument('--no-redirect', action='store_true', help='do not follow redirects by archive.org')
@@ -40,7 +45,6 @@ class Arguments:
40
45
  behavior.add_argument('--workers', type=int, default=1, metavar="", help='number of workers (simultaneous downloads)')
41
46
  # behavior.add_argument('--convert-links', action='store_true', help='Convert all links in the files to local paths. Requires -c/--current')
42
47
  behavior.add_argument('--delay', type=int, default=0, metavar="", help='delay between each download in seconds')
43
- behavior.add_argument('--verbose', action='store_true', help='overwritten by progress - gives detailed output')
44
48
 
45
49
  special = parser.add_argument_group('special')
46
50
  special.add_argument('--reset', action='store_true', help='reset the job and ignore existing cdx/db/csv files')
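Reviewer note on the `Arguments.py` hunks above: in 3.2.1 `-v` was the short form of `--version`; in 3.3.1 the version string moves into the parser description and `-v` becomes the short form of `--verbose`. A minimal, self-contained sketch of the new parser shape follows. `build_parser`, the `pkg_version` parameter, the `prog` name and the `"optional query parameters"` group title are assumptions for illustration; the real code reads the version via `importlib.metadata.version`.

```python
import argparse
from argparse import RawTextHelpFormatter


def build_parser(pkg_version="0.0.0"):
    """Reduced parser mirroring the added lines (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="waybackup",
        description=(
            f"<<< python-wayback-machine-downloader v{pkg_version} >>>\n"
            "by @bitdruid -> https://github.com/bitdruid"
        ),
        # Keeps the newline in the description instead of re-wrapping it.
        formatter_class=RawTextHelpFormatter,
    )
    optional = parser.add_argument_group("optional query parameters")
    optional.add_argument("--filetype", type=str, metavar="",
                          help="filetypes to download comma separated (js,css,...)")
    optional.add_argument("--statuscode", type=str, metavar="",
                          help="statuscodes to download comma separated (200,404,...)")
    behavior = parser.add_argument_group("manipulate behavior")
    behavior.add_argument("-v", "--verbose", action="store_true",
                          help="overwritten by progress - gives detailed output")
    return parser


args = build_parser().parse_args(["-v", "--statuscode", "200,301"])
print(args.verbose, args.statuscode, args.filetype)
```

Scripts that relied on `waybackup -v` to print the version would, under this change, enable verbose output instead.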
@@ -61,6 +65,52 @@ class Arguments:
61
65
  return self.args
62
66
 
63
67
  class Configuration:
68
+
69
+ # def __init__(self):
70
+ # self.args = Arguments().get_args()
71
+ # for key, value in vars(self.args).items():
72
+ # setattr(Configuration, key, value)
73
+
74
+ # self.set_config()
75
+
76
+ # def set_config(self):
77
+ # # args now attributes of Configuration // Configuration.output, ...
78
+ # self.command = ' '.join(sys.argv[1:])
79
+ # self.domain, self.subdir, self.filename = url_split(self.url)
80
+
81
+ # if self.output is None:
82
+ # self.output = os.path.join(os.getcwd(), "waybackup_snapshots")
83
+ # if self.metadata is None:
84
+ # self.metadata = self.output
85
+ # os.makedirs(self.output, exist_ok=True) if not self.save else None
86
+ # os.makedirs(self.metadata, exist_ok=True) if not self.save else None
87
+
88
+ # if self.all:
89
+ # self.mode = "all"
90
+ # if self.last:
91
+ # self.mode = "last"
92
+ # if self.first:
93
+ # self.mode = "first"
94
+ # if self.save:
95
+ # self.mode = "save"
96
+
97
+ # if self.filetype:
98
+ # self.filetype = [f.lower().strip() for f in self.filetype.split(",")]
99
+ # if self.statuscode:
100
+ # self.statuscode = [s.lower().strip() for s in self.statuscode.split(",")]
101
+
102
+ # base_path = self.metadata
103
+ # base_name = f"waybackup_{sanitize_filename(self.url)}"
104
+ # self.cdxfile = os.path.join(base_path, f"{base_name}.cdx")
105
+ # self.dbfile = os.path.join(base_path, f"{base_name}.db")
106
+ # self.csvfile = os.path.join(base_path, f"{base_name}.csv")
107
+ # self.log = os.path.join(base_path, f"{base_name}.log") if self.log else None
108
+
109
+ # if self.reset:
110
+ # os.remove(self.cdxfile) if os.path.isfile(self.cdxfile) else None
111
+ # os.remove(self.dbfile) if os.path.isfile(self.dbfile) else None
112
+ # os.remove(self.csvfile) if os.path.isfile(self.csvfile) else None
113
+
64
114
 
65
115
  @classmethod
66
116
  def init(cls):
@@ -90,7 +140,9 @@ class Configuration:
90
140
  cls.mode = "save"
91
141
 
92
142
  if cls.filetype:
93
- cls.filetype = [ft.lower().strip() for ft in cls.filetype.split(",")]
143
+ cls.filetype = [f.lower().strip() for f in cls.filetype.split(",")]
144
+ if cls.statuscode:
145
+ cls.statuscode = [s.lower().strip() for s in cls.statuscode.split(",")]
94
146
 
95
147
  base_path = cls.metadata
96
148
  base_name = f"waybackup_{sanitize_filename(cls.url)}"
@@ -22,11 +22,10 @@ class SnapshotCollection:
22
22
  SNAPSHOT_UNHANDLED = 0 # all unhandled snapshots in the db (without response)
23
23
  SNAPSHOT_HANDLED = 0 # snapshots with a response
24
24
 
25
- SNAPSHOT_REMOVALS = 0 # not to be utilized (total - unhandled - skip)
26
- SNAPSHOT_FAULTY = 0 # snapshots which could not be loaded from cdx file into db
27
25
  FILTER_DUPLICATES = 0 # with identical url_archive
28
26
  FILTER_MODE = 0 # all snapshots filtered by the MODE (last or first)
29
27
  FILTER_SKIP = 0 # content of the csv file
28
+ FILTER_RESPONSE = 0 # snapshots which could not be loaded from cdx file into db or 404
30
29
 
31
30
  @classmethod
32
31
  def init(cls, mode):
@@ -71,35 +70,40 @@ class SnapshotCollection:
71
70
  cls.db.set_index_complete()
72
71
  else:
73
72
  vb.write(verbose=True, content="\nAlready indexed snapshots")
74
- if cls.MODE_LAST or cls.MODE_FIRST:
75
- if not cls.db.get_filter_complete():
76
- vb.write(content="\nFiltering snapshots (last or first version)...")
77
- cls.filter_snapshots() # filter: keep newest or oldest based on MODE
78
- cls.db.set_filter_complete()
79
- else:
80
- vb.write(verbose=True, content="\nAlready filtered snapshots (last or first version)")
73
+ if not cls.db.get_filter_complete():
74
+ vb.write(content="\nFiltering snapshots (last or first version)...")
75
+ cls.filter_snapshots() # filter: keep newest or oldest based on MODE
76
+ cls.db.set_filter_complete()
77
+ else:
78
+ vb.write(verbose=True, content="\nAlready filtered snapshots (last or first version)")
81
79
 
82
80
  cls.skip_set(csvfile) # set response to NULL or read csv file and write values into db
81
+
82
+
83
+
84
+
85
+
86
+ @classmethod
87
+ def calculate(cls):
83
88
  cls.SNAPSHOT_UNHANDLED = cls.count_totals(unhandled=True) # count all unhandled in db
84
89
  cls.SNAPSHOT_HANDLED = cls.count_totals(handled=True) # count all handled in db
85
90
  cls.SNAPSHOT_TOTAL = cls.count_totals(total=True) # count all in db
86
- cls.SNAPSHOT_REMOVALS = cls.CDX_TOTAL - cls.SNAPSHOT_UNHANDLED - cls.FILTER_SKIP # count all removals
87
91
 
88
92
  vb.write(content="\nSnapshot calculation:")
89
93
  vb.write(content=f"-----> {'in CDX file'.ljust(18)}: {cls.CDX_TOTAL:,}")
90
94
 
91
- if cls.FILTER_DUPLICATES == 0 and cls.FILTER_MODE == 0:
92
- vb.write(content=f"-----> {'total removals'.ljust(18)}: {cls.SNAPSHOT_REMOVALS:,}")
93
- if cls.SNAPSHOT_FAULTY > 0:
94
- vb.write(content=f"-----> {'removed faulty'.ljust(18)}: {cls.SNAPSHOT_FAULTY}")
95
95
  if cls.FILTER_DUPLICATES > 0:
96
96
  vb.write(content=f"-----> {'removed duplicates'.ljust(18)}: {cls.FILTER_DUPLICATES:,}")
97
97
  if cls.FILTER_MODE > 0:
98
98
  vb.write(content=f"-----> {'removed versions'.ljust(18)}: {cls.FILTER_MODE:,}")
99
+
99
100
  if cls.FILTER_SKIP > 0:
100
- vb.write(content=f"-----> {'skipped existing'.ljust(18)}: {cls.FILTER_SKIP:,}")
101
+ vb.write(content=f"-----> {'skip existing'.ljust(18)}: {cls.FILTER_SKIP:,}")
102
+ if cls.FILTER_RESPONSE > 0:
103
+ vb.write(content=f"-----> {'skip statuscode'.ljust(18)}: {cls.FILTER_RESPONSE}")
101
104
 
102
- vb.write(content=f"\n-----> {'to utilize'.ljust(18)}: {cls.SNAPSHOT_UNHANDLED:,}")
105
+ if cls.SNAPSHOT_UNHANDLED > 0:
106
+ vb.write(content=f"\n-----> {'to utilize'.ljust(18)}: {cls.SNAPSHOT_UNHANDLED:,}")
103
107
 
104
108
 
105
109
 
@@ -112,7 +116,23 @@ class SnapshotCollection:
112
116
  - Removes duplicates by url_archive (same timestamp and url_origin)
113
117
  - Filters the snapshots by the given mode (last or first)
114
118
  """
119
+
120
+ def _parse_line(line):
121
+ line = json.loads(line)
122
+ line = {
123
+ "timestamp": line[0],
124
+ "digest": line[1],
125
+ "mimetype": line[2],
126
+ "statuscode": line[3],
127
+ "origin": line[4],
128
+ }
129
+ url_archive = f"https://web.archive.org/web/{line['timestamp']}id_/{line['origin']}"
130
+ statuscode = line["statuscode"] if line["statuscode"] in ("301", "404") else None
131
+ return (line["timestamp"], url_archive, line["origin"], statuscode)
132
+
133
+
115
134
  vb.write(verbose=None, content="\nInserting CDX data into database...")
135
+
116
136
  with open(cdxfile, "r", encoding="utf-8") as f, tqdm(
117
137
  unit=" lines",
118
138
  total=cls.CDX_TOTAL,
@@ -123,11 +143,9 @@ class SnapshotCollection:
123
143
  line_batchsize = 2500
124
144
  line_batch = []
125
145
  total_inserted = 0
126
- faulty_lines = 0
127
- query_duplicates = (
128
- """INSERT OR IGNORE INTO snapshot_tbl (timestamp, url_archive, url_origin) VALUES (?, ?, ?)"""
129
- )
146
+ query_duplicates = """INSERT OR IGNORE INTO snapshot_tbl (timestamp, url_archive, url_origin, response) VALUES (?, ?, ?, ?)"""
130
147
  first_line = True
148
+
131
149
  for line in f:
132
150
  if first_line:
133
151
  first_line = False
@@ -137,29 +155,15 @@ class SnapshotCollection:
137
155
  line = line.rsplit("]", 1)[0]
138
156
  if line.endswith(","):
139
157
  line = line.rsplit(",", 1)[0]
140
- try:
141
- line = json.loads(line)
142
- line = {
143
- "timestamp": line[0],
144
- "digest": line[1],
145
- "mimetype": line[2],
146
- "status": line[3],
147
- "url": line[4],
148
- }
149
- url_archive = f"https://web.archive.org/web/{line['timestamp']}id_/{line['url']}"
150
- line_batch.append((line["timestamp"], url_archive, line["url"]))
151
- if len(line_batch) >= line_batchsize:
152
- total_inserted += len(line_batch)
153
- cls.db.cursor.executemany(query_duplicates, line_batch)
154
- line_batch = []
155
- pbar.update(line_batchsize)
156
- except json.JSONDecodeError as e:
157
- faulty_lines += 1
158
- vb.write(
159
- verbose=None,
160
- content=f"JSONDecodeError: {e} on line {cls.CDX_TOTAL}",
161
- )
162
- continue
158
+
159
+ line_batch.append(_parse_line(line))
160
+
161
+ if len(line_batch) >= line_batchsize:
162
+ total_inserted += len(line_batch)
163
+ cls.db.cursor.executemany(query_duplicates, line_batch)
164
+ line_batch = []
165
+ pbar.update(line_batchsize)
166
+
163
167
  if line_batch:
164
168
  total_inserted += len(line_batch)
165
169
  cls.db.cursor.executemany(query_duplicates, line_batch)
@@ -167,8 +171,7 @@ class SnapshotCollection:
167
171
 
168
172
  cls.db.conn.commit()
169
173
 
170
- cls.SNAPSHOT_FAULTY = faulty_lines
171
- cls.FILTER_DUPLICATES = cls.CDX_TOTAL - cls.count_totals(unhandled=True) + cls.SNAPSHOT_FAULTY
174
+ cls.FILTER_DUPLICATES = cls.CDX_TOTAL - cls.count_totals(total=True)
172
175
 
173
176
 
174
177
 
@@ -181,7 +184,7 @@ class SnapshotCollection:
181
184
  """
182
185
  row_batchsize = 2500
183
186
  cls.db.cursor.execute("UPDATE snapshot_tbl SET response = NULL WHERE response = 'LOCK'") # reset locked to unprocessed
184
- cls.db.cursor.execute("SELECT * FROM snapshot_tbl WHERE response IS NOT NULL") # only write processed snapshots
187
+ cls.db.cursor.execute("SELECT * FROM csv_view WHERE response IS NOT NULL") # only write processed snapshots
185
188
  headers = [description[0] for description in cls.db.cursor.description]
186
189
  with open(csvfile, "w", encoding="utf-8") as f:
187
190
  writer = csv.writer(f)
@@ -203,13 +206,15 @@ class SnapshotCollection:
203
206
  Create indexes for the snapshot table.
204
207
  """
205
208
  # index for filtering last snapshots
206
- cls.db.cursor.execute(
207
- "CREATE INDEX IF NOT EXISTS idx_snapshot_tbl_url_origin_timestamp_desc ON snapshot_tbl(url_origin, timestamp DESC);"
208
- )
209
+ if cls.MODE_LAST:
210
+ cls.db.cursor.execute(
211
+ "CREATE INDEX IF NOT EXISTS idx_snapshot_tbl_url_origin_timestamp_desc ON snapshot_tbl(url_origin, timestamp DESC);"
212
+ )
209
213
  # index for filtering first snapshots
210
- cls.db.cursor.execute(
211
- "CREATE INDEX IF NOT EXISTS idx_snapshot_tbl_url_origin_timestamp_asc ON snapshot_tbl(url_origin, timestamp ASC);"
212
- )
214
+ if cls.MODE_FIRST:
215
+ cls.db.cursor.execute(
216
+ "CREATE INDEX IF NOT EXISTS idx_snapshot_tbl_url_origin_timestamp_asc ON snapshot_tbl(url_origin, timestamp ASC);"
217
+ )
213
218
  # index for skippable snapshots
214
219
  cls.db.cursor.execute(
215
220
  "CREATE INDEX IF NOT EXISTS idx_snapshot_tbl_timestamp_url_origin_response ON snapshot_tbl(timestamp, url_origin);"
@@ -247,6 +252,26 @@ class SnapshotCollection:
247
252
  """
248
253
  )
249
254
  cls.FILTER_MODE = cls.db.cursor.rowcount
255
+
256
+ cls.db.cursor.execute(
257
+ """
258
+ SELECT COUNT(*) FROM snapshot_tbl WHERE response IN ('404', '301')
259
+ """
260
+ )
261
+ cls.FILTER_RESPONSE = cls.db.cursor.fetchone()[0]
262
+
263
+ cls.db.cursor.execute(
264
+ """
265
+ WITH numbered AS (
266
+ SELECT rowid, ROW_NUMBER() OVER (ORDER BY rowid) AS rn
267
+ FROM snapshot_tbl
268
+ )
269
+ UPDATE snapshot_tbl
270
+ SET counter = (
271
+ SELECT rn FROM numbered WHERE numbered.rowid = snapshot_tbl.rowid
272
+ );
273
+ """
274
+ )
250
275
 
251
276
  cls.db.conn.commit()
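The renumbering query added in this hunk can be tried in isolation. The sketch below is a minimal reproduction, assuming an in-memory database, a reduced two-column `snapshot_tbl`, and an SQLite build with window-function support (3.25 or newer). The alias `rid` is introduced here for clarity and does not appear in the diff. It shows that `counter` stays consecutive even when deleted rows leave gaps in `rowid`, which is presumably why the workers now report `counter` instead of `rowid`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Reduced stand-in for the real table: only the columns needed for the demo.
cur.execute("CREATE TABLE snapshot_tbl (counter INT, url_archive TEXT, UNIQUE (url_archive))")
cur.executemany(
    "INSERT OR IGNORE INTO snapshot_tbl (url_archive) VALUES (?)",
    [("a",), ("b",), ("c",), ("d",)],
)

# Simulate a filter step that deletes a row, leaving rowids 1, 3, 4.
cur.execute("DELETE FROM snapshot_tbl WHERE url_archive = 'b'")

# Assign a gap-free running number in rowid order.
cur.execute(
    """
    WITH numbered AS (
        SELECT rowid AS rid, ROW_NUMBER() OVER (ORDER BY rowid) AS rn
        FROM snapshot_tbl
    )
    UPDATE snapshot_tbl
    SET counter = (SELECT rn FROM numbered WHERE numbered.rid = snapshot_tbl.rowid)
    """
)
conn.commit()

rows = cur.execute(
    "SELECT rowid, counter, url_archive FROM snapshot_tbl ORDER BY rowid"
).fetchall()
print(rows)
```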
252
277
 
@@ -259,13 +284,6 @@ class SnapshotCollection:
259
284
  """
260
285
  If an existing csv-file for the job exists, the responses will be overwritten by the csv-content.
261
286
  """
262
- cls.db.cursor.execute(
263
- """
264
- UPDATE snapshot_tbl
265
- SET response = NULL
266
- """
267
- )
268
- cls.db.conn.commit()
269
287
  if not os.path.isfile(csvfile):
270
288
  return
271
289
  else:
@@ -327,16 +345,16 @@ class SnapshotCollection:
327
345
  if unhandled:
328
346
  return cls.db.cursor.execute("SELECT COUNT(rowid) FROM snapshot_tbl WHERE response IS NULL").fetchone()[0]
329
347
  if success:
330
- return cls.db.cursor.execute("SELECT COUNT(rowid) FROM snapshot_tbl WHERE file IS NOT NULL").fetchone()[0]
348
+ return cls.db.cursor.execute("SELECT COUNT(rowid) FROM snapshot_tbl WHERE file IS NOT NULL AND file != ''").fetchone()[0]
331
349
  if fail:
332
- return cls.db.cursor.execute("SELECT COUNT(rowid) FROM snapshot_tbl WHERE file IS NULL").fetchone()[0]
350
+ return cls.db.cursor.execute("SELECT COUNT(rowid) FROM snapshot_tbl WHERE file IS NULL OR file = ''").fetchone()[0]
333
351
 
334
352
  @staticmethod
335
353
  def modify_snapshot(connection, snapshot_id, column, value):
336
354
  """
337
355
  Modify a snapshot-row in the snapshot table.
338
356
  """
339
- query = f"UPDATE snapshot_tbl SET {column} = ? WHERE rowid = ?"
357
+ query = f"UPDATE snapshot_tbl SET {column} = ? WHERE counter = ?"
340
358
  connection.cursor.execute(query, (value, snapshot_id))
341
359
  connection.conn.commit()
342
360
 
@@ -3,6 +3,7 @@ from typing import Union
3
3
 
4
4
 
5
5
 
6
+
6
7
  class Verbosity:
7
8
  """
8
9
  A class to manage verbosity levels, logging, progress and output.
@@ -37,7 +37,7 @@ class Worker:
37
37
  self.snapshot = sc.get_snapshot(self.db)
38
38
  if not self.snapshot:
39
39
  return
40
- self.rowid = self.snapshot["rowid"]
40
+ self.counter = self.snapshot["counter"]
41
41
  self.timestamp = self.snapshot["timestamp"]
42
42
  self.url_archive = self.snapshot["url_archive"]
43
43
  self.url_origin = self.snapshot["url_origin"]
@@ -64,7 +64,7 @@ class Worker:
64
64
  if self.redirect_timestamp is None and value is None:
65
65
  return
66
66
  self._redirect_url = value
67
- sc.modify_snapshot(self.db, self.rowid, "redirect_url", value)
67
+ sc.modify_snapshot(self.db, self.counter, "redirect_url", value)
68
68
 
69
69
  @property
70
70
  def redirect_timestamp(self):
@@ -75,7 +75,7 @@ class Worker:
75
75
  if self.redirect_url is None and value is None:
76
76
  return
77
77
  self._redirect_timestamp = value
78
- sc.modify_snapshot(self.db, self.rowid, "redirect_timestamp", value)
78
+ sc.modify_snapshot(self.db, self.counter, "redirect_timestamp", value)
79
79
 
80
80
  @property
81
81
  def response(self):
@@ -86,7 +86,7 @@ class Worker:
86
86
  if self.redirect_url is None and value is None:
87
87
  return
88
88
  self._response = value
89
- sc.modify_snapshot(self.db, self.rowid, "response", value)
89
+ sc.modify_snapshot(self.db, self.counter, "response", value)
90
90
 
91
91
  @property
92
92
  def file(self):
@@ -97,7 +97,7 @@ class Worker:
97
97
  if self.redirect_url is None and value is None:
98
98
  return
99
99
  self._file = value
100
- sc.modify_snapshot(self.db, self.rowid, "file", value)
100
+ sc.modify_snapshot(self.db, self.counter, "file", value)
101
101
 
102
102
 
103
103
  class Message(Worker):
@@ -141,14 +141,15 @@ class Message(Worker):
141
141
  "verbose": True,
142
142
  "content": _format_verbose({"result": result, "info": info, "content": content}),
143
143
  }
144
+ self.buffer.append(self.message)
144
145
  if verbose is False or verbose is None:
145
146
  result = result + " - " if result else ""
146
147
  content = content + " - " if content else ""
147
148
  self.message = {
148
149
  "verbose": False,
149
- "content": f"{self.worker.rowid}/{sc.SNAPSHOT_TOTAL} - W:{self.worker.id} - {result}{content}{self.worker.timestamp} - {self.worker.url_origin}",
150
+ "content": f"{self.worker.counter}/{sc.SNAPSHOT_TOTAL} - W:{self.worker.id} - {result}{content}{self.worker.timestamp} - {self.worker.url_origin}",
150
151
  }
151
- self.buffer.append(self.message)
152
+ self.buffer.append(self.message)
152
153
 
153
154
  def write(self):
154
155
  """
@@ -47,7 +47,7 @@ def startup():
47
47
 
48
48
 
49
49
 
50
- def query_list(csvfile: str, cdxfile: str,queryrange: int,limit: int,start: int,end: int,explicit: bool,filter_filetype: list):
50
+ def query_list(csvfile: str, cdxfile: str,queryrange: int,limit: int,start: int,end: int,explicit: bool,filter_filetype: list,filter_statuscode: list):
51
51
 
52
52
  def inject(cdxinject: str) -> bool:
53
53
  if os.path.isfile(cdxinject):
@@ -60,7 +60,7 @@ def query_list(csvfile: str, cdxfile: str,queryrange: int,limit: int,start: int,
60
60
  )
61
61
  return False
62
62
 
63
- def create_query(queryrange: int, limit: int, filter_filetype: list, start: int, end: int, explicit: bool) -> str:
63
+ def create_query(queryrange: int, limit: int, filter_filetype: list, filter_statuscode: list, start: int, end: int, explicit: bool) -> str:
64
64
  if queryrange:
65
65
  query_range = f"&from={datetime.now().year - queryrange}"
66
66
  else:
@@ -81,9 +81,10 @@ def query_list(csvfile: str, cdxfile: str,queryrange: int,limit: int,start: int,
81
81
 
82
82
  limit = f"&limit={limit}" if limit else ""
83
83
 
84
+ filter_statuscode = (f"&filter=statuscode:({'|'.join(filter_statuscode)})$" if filter_statuscode else "")
84
85
  filter_filetype = (f"&filter=original:.*\\.({'|'.join(filter_filetype)})$" if filter_filetype else "")
85
86
 
86
- cdxquery = f"https://web.archive.org/cdx/search/cdx?output=json&url={cdx_url}{query_range}&fl=timestamp,digest,mimetype,statuscode,original{limit}{filter_filetype}"
87
+ cdxquery = f"https://web.archive.org/cdx/search/cdx?output=json&url={cdx_url}{query_range}&fl=timestamp,digest,mimetype,statuscode,original{limit}{filter_filetype}{filter_statuscode}"
87
88
 
88
89
  return cdxquery
89
90
 
@@ -111,9 +112,10 @@ def query_list(csvfile: str, cdxfile: str,queryrange: int,limit: int,start: int,
111
112
 
112
113
  cdxinject = inject(cdxfile)
113
114
  if not cdxinject:
114
- cdxquery = create_query(queryrange, limit, filter_filetype, start, end, explicit)
115
+ cdxquery = create_query(queryrange, limit, filter_filetype, filter_statuscode, start, end, explicit)
115
116
  cdxfile = run_query(cdxfile, cdxquery)
116
117
  sc.process_cdx(cdxfile, csvfile)
118
+ sc.calculate()
117
119
 
118
120
 
119
121
 
@@ -131,7 +133,7 @@ def download_list(output, retry, no_redirect, delay, workers):
131
133
  threads = []
132
134
  for i in range(workers):
133
135
  worker = Worker(id=i + 1)
134
- vb.write(verbose=True, content=f"\n-----> Starting worker: {worker.id}")
136
+ vb.write(verbose=True, content=f"\n-----> Starting Worker: {worker.id}")
135
137
  thread = threading.Thread(target=download_loop, args=(worker, output, retry, no_redirect, delay))
136
138
  threads.append(thread)
137
139
  thread.start()
@@ -163,7 +165,7 @@ def download_loop(worker, output, retry, no_redirect, delay):
163
165
 
164
166
  while worker.attempt <= retry_max_attempt: # retry as given by user
165
167
 
166
- worker.message.store(verbose=True, content=f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.rowid}/{sc.SNAPSHOT_TOTAL}]")
168
+ worker.message.store(verbose=True, content=f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.counter}/{sc.SNAPSHOT_TOTAL}]")
167
169
  download_attempt = 1
168
170
  download_max_attempt = 3
169
171
 
@@ -180,11 +182,11 @@ def download_loop(worker, output, retry, no_redirect, delay):
180
182
  download_attempt += 1 # try again 2x with same connection
181
183
  vb.write(
182
184
  verbose=True,
183
- content=f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.rowid}/{sc.SNAPSHOT_TOTAL}] - {e.__class__.__name__} - requesting again in 50 seconds...",
185
+ content=f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.counter}/{sc.SNAPSHOT_TOTAL}] - {e.__class__.__name__} - requesting again in 50 seconds...",
184
186
  )
185
187
  vb.write(
186
188
  verbose=False,
187
- content=f"Worker: {worker.id} - Snapshot {worker.rowid}/{sc.SNAPSHOT_TOTAL} - requesting again in 50 seconds...",
189
+ content=f"Worker: {worker.id} - Snapshot {worker.counter}/{sc.SNAPSHOT_TOTAL} - requesting again in 50 seconds...",
188
190
  )
189
191
  time.sleep(50)
190
192
  continue
@@ -195,17 +197,17 @@ def download_loop(worker, output, retry, no_redirect, delay):
195
197
  download_attempt = download_max_attempt # try again 1x with new connection
196
198
  vb.write(
197
199
  verbose=True,
198
- content=f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.rowid}/{sc.SNAPSHOT_TOTAL}] - {e.__class__.__name__} - renewing connection in 15 seconds...",
200
+ content=f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.counter}/{sc.SNAPSHOT_TOTAL}] - {e.__class__.__name__} - renewing connection in 15 seconds...",
199
201
  )
200
202
  vb.write(
201
203
  verbose=False,
202
- content=f"Worker: {worker.id} - Snapshot {worker.rowid}/{sc.SNAPSHOT_TOTAL} - renewing connection in 15 seconds...",
204
+ content=f"Worker: {worker.id} - Snapshot {worker.counter}/{sc.SNAPSHOT_TOTAL} - renewing connection in 15 seconds...",
203
205
  )
204
206
  time.sleep(15)
205
207
  worker.refresh_connection()
206
208
  continue
207
209
  else:
208
- ex.exception(f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.rowid}/{sc.SNAPSHOT_TOTAL}] - EXCEPTION - {e}", e=e)
210
+ ex.exception(f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.counter}/{sc.SNAPSHOT_TOTAL}] - EXCEPTION - {e}", e=e)
209
211
  worker.attempt = retry_max_attempt
210
212
  break
211
213
 
@@ -19,6 +19,7 @@ class Database:
19
19
  filter_complete INTEGER
20
20
  )"""
21
21
  snapshot_table = """CREATE TABLE IF NOT EXISTS snapshot_tbl (
22
+ counter INT,
22
23
  timestamp TEXT,
23
24
  url_archive TEXT,
24
25
  url_origin TEXT,
@@ -28,6 +29,18 @@ class Database:
28
29
  file TEXT,
29
30
  UNIQUE (url_archive)
30
31
  )"""
32
+ csv_view = """CREATE VIEW IF NOT EXISTS csv_view
33
+ AS
34
+ SELECT
35
+ timestamp AS timestamp,
36
+ url_archive AS url_archive,
37
+ url_origin AS url_origin,
38
+ redirect_url AS redirect_url,
39
+ redirect_timestamp AS redirect_timestamp,
40
+ response AS response,
41
+ file AS file
42
+ FROM snapshot_tbl;
43
+ """
31
44
 
32
45
  QUERY_EXIST = False
33
46
  QUERY_PROGRESS = "0 / 0"
@@ -38,6 +51,7 @@ class Database:
38
51
  db = Database()
39
52
  db.cursor.execute(cls.waybackup_table)
40
53
  db.cursor.execute(cls.snapshot_table)
54
+ db.cursor.execute(cls.csv_view)
41
55
  db.cursor.execute("SELECT query_identifier FROM waybackup_table WHERE query_identifier = ?", (query_identifier,))
42
56
  if db.cursor.fetchone():
43
57
  cls.QUERY_EXIST = True
@@ -29,7 +29,7 @@ def main():
29
29
  archive_download.startup()
30
30
 
31
31
  try:
32
- archive_download.query_list(config.csvfile, config.cdxfile, config.range, config.limit, config.start, config.end, config.explicit, config.filetype)
32
+ archive_download.query_list(config.csvfile, config.cdxfile, config.range, config.limit, config.start, config.end, config.explicit, config.filetype, config.statuscode)
33
33
  archive_download.download_list(config.output, config.retry, config.no_redirect, config.delay, config.workers)
34
34
  except KeyboardInterrupt:
35
35
  print("\nInterrupted by user\n")
@@ -38,7 +38,7 @@ def main():
38
38
 
39
39
  except Exception as e:
40
40
  config.keep = True
41
- ex.exception(content="", e=e)
41
+ ex.exception(message="", e=e)
42
42
 
43
43
  finally:
44
44
  sc.csv_create(config.csvfile)
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: pywaybackup
3
- Version: 3.2.1
3
+ Version: 3.3.1
4
4
  Summary: Query and download archive.org as simple as possible.
5
5
  Author-email: bitdruid <bitdruid@outlook.com>
6
6
  License: MIT License
@@ -55,16 +55,16 @@ This tool allows you to download content from the Wayback Machine (archive.org).
55
55
  ### Pip
56
56
 
57
57
  1. Install the package <br>
58
- ```pip install pywaybackup```
58
+ `pip install pywaybackup`
59
59
  2. Run the tool <br>
60
- ```waybackup -h```
60
+ `waybackup -h`
61
61
 
62
62
  ### Manual
63
63
 
64
64
  1. Clone the repository <br>
65
- ```git clone https://github.com/bitdruid/python-wayback-machine-downloader.git```
65
+ `git clone https://github.com/bitdruid/python-wayback-machine-downloader.git`
66
66
  2. Install <br>
67
- ```pip install .```
67
+ `pip install .`
68
68
  - in a virtual env or use `--break-system-package`
69
69
 
70
70
  ## notes / issues / hints
@@ -88,6 +88,7 @@ This tool allows you to download content from the Wayback Machine (archive.org).
88
88
  The URL of the web page to download. This argument is required.
89
89
 
90
90
  #### Mode Selection (Choose One)
91
+
91
92
  - **`-a`**, **`--all`**:<br>
92
93
  Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
93
94
  - **`-l`**, **`--last`**:<br>
@@ -102,66 +103,67 @@ This tool allows you to download content from the Wayback Machine (archive.org).
102
103
  - **`-e`**, **`--explicit`**:<br>
103
104
  Only download the explicit given URL. No wildcard subdomains or paths. Use e.g. to get root-only snapshots. This is recommended for explicit files like `login.html` or `?query=this`.
104
105
 
105
- - **`--filetype`** `<filetype>`:<br>
106
- Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
107
-
108
106
  - **`--limit`** `<count>`:<br>
109
- Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
107
+ Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
110
108
 
111
109
  - **Range Selection:**<br>
112
110
  Specify the range in years or a specific timestamp either start, end, or both. If you specify the `range` argument, the `start` and `end` arguments will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.<br>
113
111
  (year 2019, year 2019, year+month+day 20190101, year+month+day+hour 2019010112)
114
- - **`-r`**, **`--range`**:<br>
115
- Specify the range in years for which to search and download snapshots.
116
- - **`--start`**:<br>
117
- Timestamp to start searching.
118
- - **`--end`**:<br>
119
- Timestamp to end searching.
112
+
113
+ - **`-r`**, **`--range`**:<br>
114
+ Specify the range in years for which to search and download snapshots.
115
+ - **`--start`**:<br>
116
+ Timestamp to start searching.
117
+ - **`--end`**:<br>
118
+ Timestamp to end searching.
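The partial-timestamp rule described above can be expressed as a small check. This is only a sketch: the helper name `is_partial_timestamp` is hypothetical, and the set of accepted lengths (whole fields only, from year down to second) is an assumption inferred from the README examples, not taken from the pywaybackup source.

```python
# Assumption: a timestamp is a left-anchored prefix of YYYYMMDDhhmmss
# that ends on a field boundary (YYYY, YYYYMM, YYYYMMDD, ...).
VALID_LENGTHS = {4, 6, 8, 10, 12, 14}


def is_partial_timestamp(value: str) -> bool:
    """Return True if value looks like a (partial) Wayback timestamp."""
    return value.isascii() and value.isdigit() and len(value) in VALID_LENGTHS


for candidate in ("2019", "20190101", "2019010112", "201", "2019-01"):
    print(candidate, is_partial_timestamp(candidate))
```

The check is purely syntactic; it does not verify that the month, day, or hour values are in range.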
119
+
120
+ - **Filtering:**<br>
121
+ A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter.
122
+
123
+ - **`--filetype`** `<filetype>`:<br>
124
+ Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. Filetypes are filtered as they appear in the snapshot path. So if the path contains no explicit `html` file (common practice), you can't filter for it.
125
+
126
+ - **`--statuscode`** `<statuscode>`:<br>
127
+ Specify HTTP status codes to download. Default is all status codes. Separate multiple status codes with a comma. Example: `--statuscode 200,301`. Pywaybackup will try to download any snapshot regardless of its status code. For 404 this of course means logged errors and corresponding entries in the csv. However, you may want a csv that includes these failed attempts.<br>
128
+ Common status codes you may want to handle/filter:
129
+ - `200` (OK)
130
+ - `301` (Moved Permanently - will redirect snapshot)
131
+ - `404` (Not Found - snapshot seems to be empty)
132
+ - `500` (Internal Server Error - snapshot is at least for now not available)
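To illustrate how a comma-separated `--statuscode` value might end up in the CDX request, the sketch below normalizes the value and builds a `filter=statuscode:` fragment in the same shape as the `create_query` change in this diff. The helper name `build_statuscode_filter` is hypothetical and not part of pywaybackup; the real code splits the value in `Configuration` and joins it in `create_query`.

```python
def build_statuscode_filter(raw):
    """Turn a value such as '200, 301' into a CDX filter query fragment."""
    if not raw:
        return ""  # no filter requested: all status codes are queried
    # Same normalization as in the diff: split on commas, strip, lowercase.
    codes = [code.lower().strip() for code in raw.split(",")]
    return f"&filter=statuscode:({'|'.join(codes)})$"


print(build_statuscode_filter("200, 301"))
```

Because the filter is applied on the CDX server, the resulting cdx-file only contains matching snapshots, which is why a later unfiltered download needs a fresh query.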
120
133
 
121
134
  ### Optional
122
135
 
123
136
  #### Behavior Manipulation
124
137
 
125
138
  - **`-o`**, **`--output`**:<br>
126
- Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
139
+ Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
127
140
 
128
141
  - **`-m`**, **`--metadata`**<br>
129
- Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
142
+ Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
143
+
144
+ - **`--verbose`**:<br>
145
+ Increase output verbosity.
130
146
 
131
147
  <!-- - **`--verbosity`** `<level>`:<br>
132
148
  Sets verbosity level. Options are `info`and `trace`. Default is `info`. -->
133
149
 
134
150
  - **`--log`** <!-- `<path>` -->:<br>
135
- Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
151
+ Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
136
152
 
137
153
  - **`--progress`**:<br>
138
- Shows a progress bar instead of the default output.
154
+ Shows a progress bar instead of the default output.
139
155
 
140
156
  - **`--workers`** `<count>`:<br>
141
- Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
157
+ Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
142
158
 
143
159
  - **`--no-redirect`**:<br>
144
- Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
160
+ Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
145
161
 
146
162
  - **`--retry`** `<attempts>`:<br>
147
- Specifies number of retry attempts for failed downloads.
163
+ Specifies number of retry attempts for failed downloads.
148
164
 
149
165
  - **`--delay`** `<seconds>`:<br>
150
- Specifies delay between download requests in seconds. Default is no delay (0).
151
-
152
- - **`--verbose`**:<br>
153
- Increase output verbosity.
154
- - verbose:
155
- ```
156
- -----> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
157
- SUCCESS -> 200 OK
158
- -> URL: https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
159
- -> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css
160
- ```
161
- - non-verbose:
162
- ```
163
- 55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
164
- ```
166
+ Specifies delay between download requests in seconds. Default is no delay (0).
165
167
 
166
168
  <!-- - **`--convert-links`**:<br>
167
169
  If set, all links in the downloaded files will be converted to local links. This is useful for offline browsing. The links are converted to the local path structure. Show output with `--verbosity trace`. -->
@@ -186,14 +188,16 @@ If set, all links in the downloaded files will be converted to local links. This
186
188
  - Detects existing `.cdx` and `.db` files in an `output dir` to resume downloading from the last successful point.
187
189
  - Compares `URL`, `mode`, and `optional query parameters` to ensure automatic resumption.
188
190
  - Skips previously downloaded files to save time.
189
- > **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
191
+ > **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
190
192
 
191
193
  #### Resetting a Job (`--reset`)
194
+
192
195
  - Deletes `.cdx` and `.db` files and restarts the process from scratch.
193
196
  - Does **not** remove already downloaded files.
194
197
  - `waybackup -u https://example.com -a --reset`
195
198
 
196
199
  #### Keeping Job Data (`--keep`)
200
+
197
201
  - Normally, `.cdx` and `.db` files are deleted after a successful job.
198
202
  - `--keep` preserves them for future re-analysis or extending the query.
199
203
  - `waybackup -u https://example.com -a --keep`
@@ -204,13 +208,13 @@ If set, all links in the downloaded files will be converted to local links. This
204
208
  ## Examples
205
209
 
206
210
  1. Download a specific single snapshot of all available files (starting from root):<br>
207
- `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000`
211
+ `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000`
208
212
  2. Download a specific single snapshot of all available files (starting from a subdirectory):<br>
209
- `waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000`
213
+ `waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000`
210
214
  3. Download a specific single snapshot of the exact given URL (no subdirs):<br>
211
- `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit`
215
+ `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit`
212
216
  4. Download all snapshots of all available files in the given range:<br>
213
- `waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000`
217
+ `waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000`
214
218
 
215
219
  <br>
216
220
  <br>
@@ -223,7 +227,9 @@ The output path is currently structured as follows by an example for the query:<
223
227
  `http://example.com/subdir1/subdir2/assets/`
224
228
  <br><br>
225
229
  For the first and last version (`-f` or `-l`):
230
+
226
231
  - Will only include all files/folders starting from your query-path.
232
+
227
233
  ```
228
234
  your/path/waybackup_snapshots/
229
235
  └── the_root_of_your_query/ (example.com/)
@@ -234,8 +240,11 @@ your/path/waybackup_snapshots/
234
240
  ├── style.css
235
241
  ...
236
242
  ```
243
+
237
244
  For all versions (`-a`):
245
+
238
246
  - Will create a folder named as the root of your query. Inside this folder, you will find all timestamps and per timestamp the path you requested.
247
+
239
248
  ```
240
249
  your/path/waybackup_snapshots/
241
250
  └── the_root_of_your_query/ (example.com/)
@@ -276,6 +285,23 @@ For download queries:
276
285
  ]
277
286
  ```
278
287
 
288
+ ### Log
289
+
290
+ Verbose:
291
+
292
+ ```
293
+ -----> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
294
+ SUCCESS -> 200 OK
295
+ -> URL: https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
296
+ -> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css
297
+ ```
298
+
299
+ Non-verbose:
300
+
301
+ ```
302
+ 55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
303
+ ```
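If the non-verbose log needs to be processed by a script, a line like the one above can be split with a regular expression. This is a sketch under assumptions: the field layout is inferred from the single example line and the f-string in `Worker.py`, timestamps are assumed to be 14 digits, and the function name `parse_log_line` is hypothetical.

```python
import re

# counter/total - W:<worker> - [status - ]<timestamp> - <url>
LOG_LINE = re.compile(
    r"^(?P<counter>\d+)/(?P<total>\d+) - W:(?P<worker>\d+) - "
    r"(?P<status>.*?)(?: - )?(?P<timestamp>\d{14}) - (?P<url>.+)$"
)


def parse_log_line(line):
    """Return the fields of a non-verbose log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("counter", "total", "worker"):
        fields[key] = int(fields[key])
    return fields


example = "55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css"
print(parse_log_line(example))
```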
304
+
279
305
  ### Debugging
280
306
 
281
307
  Exceptions will be written into `waybackup_error.log` (each run overwrites the file).
File without changes