pywaybackup 3.2.1__tar.gz → 3.3.0__tar.gz

Files changed (23)
  1. {pywaybackup-3.2.1/pywaybackup.egg-info → pywaybackup-3.3.0}/PKG-INFO +70 -42
  2. {pywaybackup-3.2.1 → pywaybackup-3.3.0}/README.md +70 -42
  3. {pywaybackup-3.2.1 → pywaybackup-3.3.0}/pyproject.toml +1 -1
  4. {pywaybackup-3.2.1 → pywaybackup-3.3.0}/pywaybackup/Arguments.py +58 -6
  5. {pywaybackup-3.2.1 → pywaybackup-3.3.0}/pywaybackup/SnapshotCollection.py +66 -52
  6. {pywaybackup-3.2.1 → pywaybackup-3.3.0}/pywaybackup/Verbosity.py +1 -0
  7. {pywaybackup-3.2.1 → pywaybackup-3.3.0}/pywaybackup/Worker.py +8 -7
  8. {pywaybackup-3.2.1 → pywaybackup-3.3.0}/pywaybackup/archive_download.py +12 -11
  9. {pywaybackup-3.2.1 → pywaybackup-3.3.0}/pywaybackup/db.py +14 -0
  10. {pywaybackup-3.2.1 → pywaybackup-3.3.0}/pywaybackup/main.py +2 -2
  11. {pywaybackup-3.2.1 → pywaybackup-3.3.0/pywaybackup.egg-info}/PKG-INFO +70 -42
  12. {pywaybackup-3.2.1 → pywaybackup-3.3.0}/LICENSE +0 -0
  13. {pywaybackup-3.2.1 → pywaybackup-3.3.0}/pywaybackup/Converter.py +0 -0
  14. {pywaybackup-3.2.1 → pywaybackup-3.3.0}/pywaybackup/Exception.py +0 -0
  15. {pywaybackup-3.2.1 → pywaybackup-3.3.0}/pywaybackup/__init__.py +0 -0
  16. {pywaybackup-3.2.1 → pywaybackup-3.3.0}/pywaybackup/archive_save.py +0 -0
  17. {pywaybackup-3.2.1 → pywaybackup-3.3.0}/pywaybackup/helper.py +0 -0
  18. {pywaybackup-3.2.1 → pywaybackup-3.3.0}/pywaybackup.egg-info/SOURCES.txt +0 -0
  19. {pywaybackup-3.2.1 → pywaybackup-3.3.0}/pywaybackup.egg-info/dependency_links.txt +0 -0
  20. {pywaybackup-3.2.1 → pywaybackup-3.3.0}/pywaybackup.egg-info/entry_points.txt +0 -0
  21. {pywaybackup-3.2.1 → pywaybackup-3.3.0}/pywaybackup.egg-info/requires.txt +0 -0
  22. {pywaybackup-3.2.1 → pywaybackup-3.3.0}/pywaybackup.egg-info/top_level.txt +0 -0
  23. {pywaybackup-3.2.1 → pywaybackup-3.3.0}/setup.cfg +0 -0
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: pywaybackup
- Version: 3.2.1
+ Version: 3.3.0
  Summary: Query and download archive.org as simple as possible.
  Author-email: bitdruid <bitdruid@outlook.com>
  License: MIT License
@@ -55,16 +55,16 @@ This tool allows you to download content from the Wayback Machine (archive.org).
  ### Pip

  1. Install the package <br>
- ```pip install pywaybackup```
+ `pip install pywaybackup`
  2. Run the tool <br>
- ```waybackup -h```
+ `waybackup -h`

  ### Manual

  1. Clone the repository <br>
- ```git clone https://github.com/bitdruid/python-wayback-machine-downloader.git```
+ `git clone https://github.com/bitdruid/python-wayback-machine-downloader.git`
  2. Install <br>
- ```pip install .```
+ `pip install .`
  - in a virtual env or use `--break-system-package`

  ## notes / issues / hints
@@ -88,6 +88,7 @@ This tool allows you to download content from the Wayback Machine (archive.org).
  The URL of the web page to download. This argument is required.

  #### Mode Selection (Choose One)
+
  - **`-a`**, **`--all`**:<br>
  Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
  - **`-l`**, **`--last`**:<br>
@@ -102,66 +103,67 @@ This tool allows you to download content from the Wayback Machine (archive.org).
  - **`-e`**, **`--explicit`**:<br>
  Only download the explicit given URL. No wildcard subdomains or paths. Use e.g. to get root-only snapshots. This is recommended for explicit files like `login.html` or `?query=this`.

- - **`--filetype`** `<filetype>`:<br>
- Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
-
  - **`--limit`** `<count>`:<br>
- Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
+ Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.

  - **Range Selection:**<br>
  Specify the range in years or a specific timestamp either start, end, or both. If you specify the `range` argument, the `start` and `end` arguments will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.<br>
  (year 2019, year 2019, year+month+day 20190101, year+month+day+hour 2019010112)
- - **`-r`**, **`--range`**:<br>
- Specify the range in years for which to search and download snapshots.
- - **`--start`**:<br>
- Timestamp to start searching.
- - **`--end`**:<br>
- Timestamp to end searching.
+
+ - **`-r`**, **`--range`**:<br>
+ Specify the range in years for which to search and download snapshots.
+ - **`--start`**:<br>
+ Timestamp to start searching.
+ - **`--end`**:<br>
+ Timestamp to end searching.
+
+ - **Filtering:**<br>
+ A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter.
+
+ - **`--filetype`** `<filetype>`:<br>
+ Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
+
+ - **`--statuscode`** `<statuscode>`:<br>
+ Specify HTTP status codes to download. Default is all statuscodes. Separate multiple status codes with a comma. Example: `--statuscode 200,301`. Pywaybackup will try to download any snapshot regardless of it's statuscode. For 404 of course this means logged errors and corresponding entries in the csv. However, you may want to get a csv that includes these negative attempts for your needs.<br>
+ Common status codes you may want to handle/filter:
+ - `200` (OK)
+ - `301` (Moved Permanently - will redirect snapshot)
+ - `404` (Not Found - snapshot seems to be empty)
+ - `500` (Internal Server Error - snapshot is at least for now not available)

  ### Optional

  #### Behavior Manipulation

  - **`-o`**, **`--output`**:<br>
- Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
+ Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.

  - **`-m`**, **`--metadata`**<br>
- Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
+ Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
+
+ - **`--verbose`**:<br>
+ Increase output verbosity.

  <!-- - **`--verbosity`** `<level>`:<br>
  Sets verbosity level. Options are `info`and `trace`. Default is `info`. -->

  - **`--log`** <!-- `<path>` -->:<br>
- Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
+ Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.

  - **`--progress`**:<br>
- Shows a progress bar instead of the default output.
+ Shows a progress bar instead of the default output.

  - **`--workers`** `<count>`:<br>
- Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
+ Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.

  - **`--no-redirect`**:<br>
- Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
+ Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.

  - **`--retry`** `<attempts>`:<br>
- Specifies number of retry attempts for failed downloads.
+ Specifies number of retry attempts for failed downloads.

  - **`--delay`** `<seconds>`:<br>
- Specifies delay between download requests in seconds. Default is no delay (0).
-
- - **`--verbose`**:<br>
- Increase output verbosity.
- - verbose:
- ```
- -----> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
- SUCCESS -> 200 OK
- -> URL: https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
- -> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css
- ```
- - non-verbose:
- ```
- 55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
- ```
+ Specifies delay between download requests in seconds. Default is no delay (0).

  <!-- - **`--convert-links`**:<br>
  If set, all links in the downloaded files will be converted to local links. This is useful for offline browsing. The links are converted to the local path structure. Show output with `--verbosity trace`. -->
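The hunk above documents the new `--statuscode` filter. As a rough illustration of what such a filter means for a snapshot list, here is a minimal sketch. The record layout (`timestamp`, `url`, `statuscode` keys) and the helper `filter_by_statuscode` are assumptions made for this example; they are not taken from pywaybackup, which may apply the filter differently (for instance at query time).

```python
from typing import Dict, List, Optional

# Hypothetical snapshot records; the field names are assumptions for this sketch.
snapshots: List[Dict[str, str]] = [
    {"timestamp": "20240225193302", "url": "https://example.com/", "statuscode": "200"},
    {"timestamp": "20240226101500", "url": "https://example.com/old", "statuscode": "301"},
    {"timestamp": "20240227080000", "url": "https://example.com/gone", "statuscode": "404"},
]


def filter_by_statuscode(
    records: List[Dict[str, str]], wanted: Optional[List[str]]
) -> List[Dict[str, str]]:
    """Keep only records whose status code is in `wanted`.

    `None` mirrors the documented default (all status codes are kept).
    """
    if wanted is None:
        return list(records)
    return [record for record in records if record["statuscode"] in wanted]


# Equivalent in spirit to `--statuscode 200,301`.
kept = filter_by_statuscode(snapshots, ["200", "301"])
```

Status codes are compared as strings here because the diff stores the parsed option values as strings.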
@@ -186,14 +188,16 @@ If set, all links in the downloaded files will be converted to local links. This
  - Detects existing `.cdx` and `.db` files in an `output dir` to resume downloading from the last successful point.
  - Compares `URL`, `mode`, and `optional query parameters` to ensure automatic resumption.
  - Skips previously downloaded files to save time.
- > **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
+ > **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.

  #### Resetting a Job (`--reset`)
+
  - Deletes `.cdx` and `.db` files and restarts the process from scratch.
  - Does **not** remove already downloaded files.
  - `waybackup -u https://example.com -a --reset`

  #### Keeping Job Data (`--keep`)
+
  - Normally, `.cdx` and `.db` files are deleted after a successful job.
  - `--keep` preserves them for future re-analysis or extending the query.
  - `waybackup -u https://example.com -a --keep`
@@ -204,13 +208,13 @@ If set, all links in the downloaded files will be converted to local links. This
  ## Examples

  1. Download a specific single snapshot of all available files (starting from root):<br>
- `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000`
+ `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000`
  2. Download a specific single snapshot of all available files (starting from a subdirectory):<br>
- `waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000`
+ `waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000`
  3. Download a specific single snapshot of the exact given URL (no subdirs):<br>
- `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit`
+ `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit`
  4. Download all snapshots of all available files in the given range:<br>
- `waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000`
+ `waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000`

  <br>
  <br>
@@ -223,7 +227,9 @@ The output path is currently structured as follows by an example for the query:<
  `http://example.com/subdir1/subdir2/assets/`
  <br><br>
  For the first and last version (`-f` or `-l`):
+
  - Will only include all files/folders starting from your query-path.
+
  ```
  your/path/waybackup_snapshots/
  └── the_root_of_your_query/ (example.com/)
@@ -234,8 +240,11 @@ your/path/waybackup_snapshots/
  ├── style.css
  ...
  ```
+
  For all versions (`-a`):
+
  - Will create a folder named as the root of your query. Inside this folder, you will find all timestamps and per timestamp the path you requested.
+
  ```
  your/path/waybackup_snapshots/
  └── the_root_of_your_query/ (example.com/)
@@ -276,6 +285,23 @@ For download queries:
  ]
  ```

+ ### Log
+
+ Verbose:
+
+ ```
+ -----> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
+ SUCCESS -> 200 OK
+ -> URL: https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
+ -> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css
+ ```
+
+ Non-verbose:
+
+ ```
+ 55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
+ ```
+
  ### Debugging

  Exceptions will be written into `waybackup_error.log` (each run overwrites the file).
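The non-verbose log line shown in the new `### Log` section has a regular shape, so it can be picked apart with a small regular expression. The sketch below is inferred solely from the single sample line in this diff. The pattern, the field names and the helper `parse_log_line` are assumptions for illustration, and other statuses or message layouts may not match.

```python
import re
from datetime import datetime
from typing import Any, Dict, Optional

# Pattern inferred from the one sample line in the README; an assumption, not a spec.
LOG_LINE = re.compile(
    r"^(?P<done>\d+)/(?P<total>\d+) - W:(?P<worker>\d+) - "
    r"(?P<status>\w+) - (?P<timestamp>\d{14}) - (?P<url>\S+)$"
)


def parse_log_line(line: str) -> Optional[Dict[str, Any]]:
    """Return the fields of a non-verbose log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    entry: Dict[str, Any] = match.groupdict()
    for key in ("done", "total", "worker"):
        entry[key] = int(entry[key])
    # Wayback timestamps use the YYYYMMDDhhmmss layout described in the README.
    entry["timestamp"] = datetime.strptime(entry["timestamp"], "%Y%m%d%H%M%S")
    return entry


sample = (
    "55/81 - W:2 - SUCCESS - 20240225193302 - "
    "https://example.com/assets/css/custom-styles.css"
)
entry = parse_log_line(sample)
```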
@@ -287,3 +313,5 @@ Exceptions will be written into `waybackup_error.log` (each run overwrites the f

  I'm always happy for some feature requests to improve the usability of this tool.
  Feel free to give suggestions and report issues. Project is still far from being perfect.
+
+ > Please PR from dev into dev.
@@ -16,16 +16,16 @@ This tool allows you to download content from the Wayback Machine (archive.org).
  ### Pip

  1. Install the package <br>
- ```pip install pywaybackup```
+ `pip install pywaybackup`
  2. Run the tool <br>
- ```waybackup -h```
+ `waybackup -h`

  ### Manual

  1. Clone the repository <br>
- ```git clone https://github.com/bitdruid/python-wayback-machine-downloader.git```
+ `git clone https://github.com/bitdruid/python-wayback-machine-downloader.git`
  2. Install <br>
- ```pip install .```
+ `pip install .`
  - in a virtual env or use `--break-system-package`

  ## notes / issues / hints
@@ -49,6 +49,7 @@ This tool allows you to download content from the Wayback Machine (archive.org).
  The URL of the web page to download. This argument is required.

  #### Mode Selection (Choose One)
+
  - **`-a`**, **`--all`**:<br>
  Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
  - **`-l`**, **`--last`**:<br>
63
64
  - **`-e`**, **`--explicit`**:<br>
64
65
  Only download the explicit given URL. No wildcard subdomains or paths. Use e.g. to get root-only snapshots. This is recommended for explicit files like `login.html` or `?query=this`.
65
66
 
66
- - **`--filetype`** `<filetype>`:<br>
67
- Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
68
-
69
67
  - **`--limit`** `<count>`:<br>
70
- Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
68
+ Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
71
69
 
72
70
  - **Range Selection:**<br>
73
71
  Specify the range in years or a specific timestamp either start, end, or both. If you specify the `range` argument, the `start` and `end` arguments will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.<br>
74
72
  (year 2019, year 2019, year+month+day 20190101, year+month+day+hour 2019010112)
75
- - **`-r`**, **`--range`**:<br>
76
- Specify the range in years for which to search and download snapshots.
77
- - **`--start`**:<br>
78
- Timestamp to start searching.
79
- - **`--end`**:<br>
80
- Timestamp to end searching.
73
+
74
+ - **`-r`**, **`--range`**:<br>
75
+ Specify the range in years for which to search and download snapshots.
76
+ - **`--start`**:<br>
77
+ Timestamp to start searching.
78
+ - **`--end`**:<br>
79
+ Timestamp to end searching.
80
+
81
+ - **Filtering:**<br>
82
+ A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter.
83
+
84
+ - **`--filetype`** `<filetype>`:<br>
85
+ Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
86
+
87
+ - **`--statuscode`** `<statuscode>`:<br>
88
+ Specify HTTP status codes to download. Default is all statuscodes. Separate multiple status codes with a comma. Example: `--statuscode 200,301`. Pywaybackup will try to download any snapshot regardless of it's statuscode. For 404 of course this means logged errors and corresponding entries in the csv. However, you may want to get a csv that includes these negative attempts for your needs.<br>
89
+ Common status codes you may want to handle/filter:
90
+ - `200` (OK)
91
+ - `301` (Moved Permanently - will redirect snapshot)
92
+ - `404` (Not Found - snapshot seems to be empty)
93
+ - `500` (Internal Server Error - snapshot is at least for now not available)
81
94
 
82
95
  ### Optional
83
96
 
84
97
  #### Behavior Manipulation
85
98
 
86
99
  - **`-o`**, **`--output`**:<br>
87
- Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
100
+ Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
88
101
 
89
102
  - **`-m`**, **`--metadata`**<br>
90
- Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
103
+ Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
104
+
105
+ - **`--verbose`**:<br>
106
+ Increase output verbosity.
91
107
 
92
108
  <!-- - **`--verbosity`** `<level>`:<br>
93
109
  Sets verbosity level. Options are `info`and `trace`. Default is `info`. -->
94
110
 
95
111
  - **`--log`** <!-- `<path>` -->:<br>
96
- Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
112
+ Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
97
113
 
98
114
  - **`--progress`**:<br>
99
- Shows a progress bar instead of the default output.
115
+ Shows a progress bar instead of the default output.
100
116
 
101
117
  - **`--workers`** `<count>`:<br>
102
- Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
118
+ Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
103
119
 
104
120
  - **`--no-redirect`**:<br>
105
- Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
121
+ Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
106
122
 
107
123
  - **`--retry`** `<attempts>`:<br>
108
- Specifies number of retry attempts for failed downloads.
124
+ Specifies number of retry attempts for failed downloads.
109
125
 
110
126
  - **`--delay`** `<seconds>`:<br>
111
- Specifies delay between download requests in seconds. Default is no delay (0).
112
-
113
- - **`--verbose`**:<br>
114
- Increase output verbosity.
115
- - verbose:
116
- ```
117
- -----> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
118
- SUCCESS -> 200 OK
119
- -> URL: https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
120
- -> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css
121
- ```
122
- - non-verbose:
123
- ```
124
- 55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
125
- ```
127
+ Specifies delay between download requests in seconds. Default is no delay (0).
126
128
 
127
129
  <!-- - **`--convert-links`**:<br>
128
130
  If set, all links in the downloaded files will be converted to local links. This is useful for offline browsing. The links are converted to the local path structure. Show output with `--verbosity trace`. -->
@@ -147,14 +149,16 @@ If set, all links in the downloaded files will be converted to local links. This
  - Detects existing `.cdx` and `.db` files in an `output dir` to resume downloading from the last successful point.
  - Compares `URL`, `mode`, and `optional query parameters` to ensure automatic resumption.
  - Skips previously downloaded files to save time.
- > **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
+ > **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.

  #### Resetting a Job (`--reset`)
+
  - Deletes `.cdx` and `.db` files and restarts the process from scratch.
  - Does **not** remove already downloaded files.
  - `waybackup -u https://example.com -a --reset`

  #### Keeping Job Data (`--keep`)
+
  - Normally, `.cdx` and `.db` files are deleted after a successful job.
  - `--keep` preserves them for future re-analysis or extending the query.
  - `waybackup -u https://example.com -a --keep`
@@ -165,13 +169,13 @@ If set, all links in the downloaded files will be converted to local links. This
  ## Examples

  1. Download a specific single snapshot of all available files (starting from root):<br>
- `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000`
+ `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000`
  2. Download a specific single snapshot of all available files (starting from a subdirectory):<br>
- `waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000`
+ `waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000`
  3. Download a specific single snapshot of the exact given URL (no subdirs):<br>
- `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit`
+ `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit`
  4. Download all snapshots of all available files in the given range:<br>
- `waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000`
+ `waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000`

  <br>
  <br>
@@ -184,7 +188,9 @@ The output path is currently structured as follows by an example for the query:<
  `http://example.com/subdir1/subdir2/assets/`
  <br><br>
  For the first and last version (`-f` or `-l`):
+
  - Will only include all files/folders starting from your query-path.
+
  ```
  your/path/waybackup_snapshots/
  └── the_root_of_your_query/ (example.com/)
@@ -195,8 +201,11 @@ your/path/waybackup_snapshots/
  ├── style.css
  ...
  ```
+
  For all versions (`-a`):
+
  - Will create a folder named as the root of your query. Inside this folder, you will find all timestamps and per timestamp the path you requested.
+
  ```
  your/path/waybackup_snapshots/
  └── the_root_of_your_query/ (example.com/)
@@ -237,6 +246,23 @@ For download queries:
  ]
  ```

+ ### Log
+
+ Verbose:
+
+ ```
+ -----> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
+ SUCCESS -> 200 OK
+ -> URL: https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
+ -> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css
+ ```
+
+ Non-verbose:
+
+ ```
+ 55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
+ ```
+
  ### Debugging

  Exceptions will be written into `waybackup_error.log` (each run overwrites the file).
@@ -247,4 +273,6 @@ Exceptions will be written into `waybackup_error.log` (each run overwrites the f
  ## Contributing

  I'm always happy for some feature requests to improve the usability of this tool.
- Feel free to give suggestions and report issues. Project is still far from being perfect.
+ Feel free to give suggestions and report issues. Project is still far from being perfect.
+
+ > Please PR from dev into dev.
@@ -7,7 +7,7 @@ packages = ["pywaybackup"]

  [project]
  name = "pywaybackup"
- version = "3.2.1"
+ version = "3.3.0"
  description = "Query and download archive.org as simple as possible."
  authors = [
      { name = "bitdruid", email = "bitdruid@outlook.com" }
@@ -3,6 +3,8 @@ import sys
  import os
  import argparse

+ from argparse import RawTextHelpFormatter
+
  from importlib.metadata import version

  from pywaybackup.helper import url_split, sanitize_filename
@@ -10,9 +12,10 @@ from pywaybackup.helper import url_split, sanitize_filename
  class Arguments:

      def __init__(self):
-
-         parser = argparse.ArgumentParser(description='Download from wayback machine (archive.org)')
-         parser.add_argument('-v', '--version', action='version', version='%(prog)s ' + version("pywaybackup") + ' by @bitdruid -> https://github.com/bitdruid')
+         parser = argparse.ArgumentParser(
+             description=f"<<< python-wayback-machine-downloader v{version('pywaybackup')} >>>\nby @bitdruid -> https://github.com/bitdruid",
+             formatter_class=RawTextHelpFormatter,
+         )

          required = parser.add_argument_group('required (one exclusive)')
          required.add_argument('-u', '--url', type=str, metavar="", help='url (with subdir/subdomain) to download')
@@ -27,12 +30,14 @@ class Arguments:
          optional.add_argument('-r', '--range', type=int, metavar="", help='range in years to search')
          optional.add_argument('--start', type=int, metavar="", help='start timestamp format: YYYYMMDDhhmmss')
          optional.add_argument('--end', type=int, metavar="", help='end timestamp format: YYYYMMDDhhmmss')
-         optional.add_argument('--filetype', type=str, metavar="", help='filetypes to download comma separated (e.g. "html,css")')
          optional.add_argument('--limit', type=int, nargs='?', const=True, metavar='int', help='limit the number of snapshots to download')
+         optional.add_argument('--filetype', type=str, metavar="", help='filetypes to download comma separated (js,css,...)')
+         optional.add_argument('--statuscode', type=str, metavar="", help='statuscodes to download comma separated (200,404,...)')

          behavior = parser.add_argument_group('manipulate behavior')
          behavior.add_argument('-o', '--output', type=str, metavar="", help='output for all files - defaults to current directory')
          behavior.add_argument('-m', '--metadata', type=str, metavar="", help='change directory for db/cdx/csv/log files')
+         behavior.add_argument('-v', '--verbose', action='store_true', help='overwritten by progress - gives detailed output')
          behavior.add_argument('--log', action='store_true', help='save a log file into the output folder')
          behavior.add_argument('--progress', action='store_true', help='show a progress bar')
          behavior.add_argument('--no-redirect', action='store_true', help='do not follow redirects by archive.org')
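The parser changes in the hunks above (a multi-line description rendered with `RawTextHelpFormatter`, the new `--statuscode` option, and `-v` now bound to `--verbose`) can be tried in isolation. This is a minimal standalone sketch, not the project's `Arguments` class: the `prog` name, the shortened description, the group title `optional query parameters` and the function `build_parser` are placeholders chosen for this example.

```python
import argparse
from argparse import RawTextHelpFormatter


def build_parser() -> argparse.ArgumentParser:
    # RawTextHelpFormatter keeps the "\n" in the description instead of re-wrapping it.
    parser = argparse.ArgumentParser(
        prog="waybackup",  # placeholder program name for this sketch
        description="<<< demo downloader >>>\nby @example",
        formatter_class=RawTextHelpFormatter,
    )

    optional = parser.add_argument_group("optional query parameters")
    optional.add_argument("--filetype", type=str, metavar="",
                          help="filetypes to download comma separated (js,css,...)")
    optional.add_argument("--statuscode", type=str, metavar="",
                          help="statuscodes to download comma separated (200,404,...)")

    behavior = parser.add_argument_group("manipulate behavior")
    # In 3.3.0 the short flag -v belongs to --verbose (it was --version in 3.2.1).
    behavior.add_argument("-v", "--verbose", action="store_true",
                          help="overwritten by progress - gives detailed output")
    return parser


parser = build_parser()
args = parser.parse_args(["--statuscode", "200,404", "-v"])
help_text = parser.format_help()
```

Note that the option values arrive as plain strings (`"200,404"`). Splitting them into lists happens later, in the `Configuration` hunk.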
@@ -40,7 +45,6 @@ class Arguments:
          behavior.add_argument('--workers', type=int, default=1, metavar="", help='number of workers (simultaneous downloads)')
          # behavior.add_argument('--convert-links', action='store_true', help='Convert all links in the files to local paths. Requires -c/--current')
          behavior.add_argument('--delay', type=int, default=0, metavar="", help='delay between each download in seconds')
-         behavior.add_argument('--verbose', action='store_true', help='overwritten by progress - gives detailed output')

          special = parser.add_argument_group('special')
          special.add_argument('--reset', action='store_true', help='reset the job and ignore existing cdx/db/csv files')
@@ -61,6 +65,52 @@ class Arguments:
          return self.args

  class Configuration:
+
+     # def __init__(self):
+     #     self.args = Arguments().get_args()
+     #     for key, value in vars(self.args).items():
+     #         setattr(Configuration, key, value)
+
+     #     self.set_config()
+
+     # def set_config(self):
+     #     # args now attributes of Configuration // Configuration.output, ...
+     #     self.command = ' '.join(sys.argv[1:])
+     #     self.domain, self.subdir, self.filename = url_split(self.url)
+
+     #     if self.output is None:
+     #         self.output = os.path.join(os.getcwd(), "waybackup_snapshots")
+     #     if self.metadata is None:
+     #         self.metadata = self.output
+     #     os.makedirs(self.output, exist_ok=True) if not self.save else None
+     #     os.makedirs(self.metadata, exist_ok=True) if not self.save else None
+
+     #     if self.all:
+     #         self.mode = "all"
+     #     if self.last:
+     #         self.mode = "last"
+     #     if self.first:
+     #         self.mode = "first"
+     #     if self.save:
+     #         self.mode = "save"
+
+     #     if self.filetype:
+     #         self.filetype = [f.lower().strip() for f in self.filetype.split(",")]
+     #     if self.statuscode:
+     #         self.statuscode = [s.lower().strip() for s in self.statuscode.split(",")]
+
+     #     base_path = self.metadata
+     #     base_name = f"waybackup_{sanitize_filename(self.url)}"
+     #     self.cdxfile = os.path.join(base_path, f"{base_name}.cdx")
+     #     self.dbfile = os.path.join(base_path, f"{base_name}.db")
+     #     self.csvfile = os.path.join(base_path, f"{base_name}.csv")
+     #     self.log = os.path.join(base_path, f"{base_name}.log") if self.log else None
+
+     #     if self.reset:
+     #         os.remove(self.cdxfile) if os.path.isfile(self.cdxfile) else None
+     #         os.remove(self.dbfile) if os.path.isfile(self.dbfile) else None
+     #         os.remove(self.csvfile) if os.path.isfile(self.csvfile) else None
+

      @classmethod
      def init(cls):
@@ -90,7 +140,9 @@ class Configuration:
90
140
  cls.mode = "save"
91
141
 
92
142
  if cls.filetype:
93
- cls.filetype = [ft.lower().strip() for ft in cls.filetype.split(",")]
143
+ cls.filetype = [f.lower().strip() for f in cls.filetype.split(",")]
144
+ if cls.statuscode:
145
+ cls.statuscode = [s.lower().strip() for s in cls.statuscode.split(",")]
94
146
 
95
147
  base_path = cls.metadata
96
148
  base_name = f"waybackup_{sanitize_filename(cls.url)}"
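
The hunk above applies the same normalization to `--statuscode` that `--filetype` already had: a comma-separated string becomes a list of lowercase, stripped tokens. A minimal sketch of that step, assuming a hypothetical helper name `split_csv_option` that does not exist in pywaybackup:

```python
def split_csv_option(value):
    """Turn a comma-separated CLI value into a list of lowercase, stripped tokens.

    Hypothetical helper for illustration only. Falsy input (None or "") is
    returned unchanged, mirroring the `if cls.statuscode:` guard in the hunk.
    """
    if not value:
        return value
    return [token.lower().strip() for token in value.split(",")]


print(split_csv_option("200, 301"))     # ['200', '301']
print(split_csv_option("JPG,css ,js"))  # ['jpg', 'css', 'js']
print(split_csv_option(None))           # None
```

The tokens stay strings, which is what the later `'|'.join(...)` in the CDX query builder needs.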
@@ -22,11 +22,10 @@ class SnapshotCollection:
22
22
  SNAPSHOT_UNHANDLED = 0 # all unhandled snapshots in the db (without response)
23
23
  SNAPSHOT_HANDLED = 0 # snapshots with a response
24
24
 
25
- SNAPSHOT_REMOVALS = 0 # not to be utilized (total - unhandled - skip)
26
- SNAPSHOT_FAULTY = 0 # snapshots which could not be loaded from cdx file into db
27
25
  FILTER_DUPLICATES = 0 # with identical url_archive
28
26
  FILTER_MODE = 0 # all snapshots filtered by the MODE (last or first)
29
27
  FILTER_SKIP = 0 # content of the csv file
28
+ FILTER_RESPONSE = 0 # snapshots which could not be loaded from cdx file into db or 404
30
29
 
31
30
  @classmethod
32
31
  def init(cls, mode):
@@ -83,21 +82,19 @@ class SnapshotCollection:
83
82
  cls.SNAPSHOT_UNHANDLED = cls.count_totals(unhandled=True) # count all unhandled in db
84
83
  cls.SNAPSHOT_HANDLED = cls.count_totals(handled=True) # count all handled in db
85
84
  cls.SNAPSHOT_TOTAL = cls.count_totals(total=True) # count all in db
86
- cls.SNAPSHOT_REMOVALS = cls.CDX_TOTAL - cls.SNAPSHOT_UNHANDLED - cls.FILTER_SKIP # count all removals
87
85
 
88
86
  vb.write(content="\nSnapshot calculation:")
89
87
  vb.write(content=f"-----> {'in CDX file'.ljust(18)}: {cls.CDX_TOTAL:,}")
90
88
 
91
- if cls.FILTER_DUPLICATES == 0 and cls.FILTER_MODE == 0:
92
- vb.write(content=f"-----> {'total removals'.ljust(18)}: {cls.SNAPSHOT_REMOVALS:,}")
93
- if cls.SNAPSHOT_FAULTY > 0:
94
- vb.write(content=f"-----> {'removed faulty'.ljust(18)}: {cls.SNAPSHOT_FAULTY}")
95
89
  if cls.FILTER_DUPLICATES > 0:
96
90
  vb.write(content=f"-----> {'removed duplicates'.ljust(18)}: {cls.FILTER_DUPLICATES:,}")
97
91
  if cls.FILTER_MODE > 0:
98
92
  vb.write(content=f"-----> {'removed versions'.ljust(18)}: {cls.FILTER_MODE:,}")
93
+
99
94
  if cls.FILTER_SKIP > 0:
100
- vb.write(content=f"-----> {'skipped existing'.ljust(18)}: {cls.FILTER_SKIP:,}")
95
+ vb.write(content=f"-----> {'skip existing'.ljust(18)}: {cls.FILTER_SKIP:,}")
96
+ if cls.FILTER_RESPONSE > 0:
97
+ vb.write(content=f"-----> {'skip statuscode'.ljust(18)}: {cls.FILTER_RESPONSE}")
101
98
 
102
99
  vb.write(content=f"\n-----> {'to utilize'.ljust(18)}: {cls.SNAPSHOT_UNHANDLED:,}")
103
100
 
@@ -112,7 +109,23 @@ class SnapshotCollection:
112
109
  - Removes duplicates by url_archive (same timestamp and url_origin)
113
110
  - Filters the snapshots by the given mode (last or first)
114
111
  """
112
+
113
+ def _parse_line(line):
114
+ line = json.loads(line)
115
+ line = {
116
+ "timestamp": line[0],
117
+ "digest": line[1],
118
+ "mimetype": line[2],
119
+ "statuscode": line[3],
120
+ "origin": line[4],
121
+ }
122
+ url_archive = f"https://web.archive.org/web/{line['timestamp']}id_/{line['origin']}"
123
+ statuscode = line["statuscode"] if line["statuscode"] in ("301", "404") else None
124
+ return (line["timestamp"], url_archive, line["origin"], statuscode)
125
+
126
+
115
127
  vb.write(verbose=None, content="\nInserting CDX data into database...")
128
+
116
129
  with open(cdxfile, "r", encoding="utf-8") as f, tqdm(
117
130
  unit=" lines",
118
131
  total=cls.CDX_TOTAL,
@@ -123,11 +136,9 @@ class SnapshotCollection:
123
136
  line_batchsize = 2500
124
137
  line_batch = []
125
138
  total_inserted = 0
126
- faulty_lines = 0
127
- query_duplicates = (
128
- """INSERT OR IGNORE INTO snapshot_tbl (timestamp, url_archive, url_origin) VALUES (?, ?, ?)"""
129
- )
139
+ query_duplicates = """INSERT OR IGNORE INTO snapshot_tbl (timestamp, url_archive, url_origin, response) VALUES (?, ?, ?, ?)"""
130
140
  first_line = True
141
+
131
142
  for line in f:
132
143
  if first_line:
133
144
  first_line = False
@@ -137,29 +148,15 @@ class SnapshotCollection:
137
148
  line = line.rsplit("]", 1)[0]
138
149
  if line.endswith(","):
139
150
  line = line.rsplit(",", 1)[0]
140
- try:
141
- line = json.loads(line)
142
- line = {
143
- "timestamp": line[0],
144
- "digest": line[1],
145
- "mimetype": line[2],
146
- "status": line[3],
147
- "url": line[4],
148
- }
149
- url_archive = f"https://web.archive.org/web/{line['timestamp']}id_/{line['url']}"
150
- line_batch.append((line["timestamp"], url_archive, line["url"]))
151
- if len(line_batch) >= line_batchsize:
152
- total_inserted += len(line_batch)
153
- cls.db.cursor.executemany(query_duplicates, line_batch)
154
- line_batch = []
155
- pbar.update(line_batchsize)
156
- except json.JSONDecodeError as e:
157
- faulty_lines += 1
158
- vb.write(
159
- verbose=None,
160
- content=f"JSONDecodeError: {e} on line {cls.CDX_TOTAL}",
161
- )
162
- continue
151
+
152
+ line_batch.append(_parse_line(line))
153
+
154
+ if len(line_batch) >= line_batchsize:
155
+ total_inserted += len(line_batch)
156
+ cls.db.cursor.executemany(query_duplicates, line_batch)
157
+ line_batch = []
158
+ pbar.update(line_batchsize)
159
+
163
160
  if line_batch:
164
161
  total_inserted += len(line_batch)
165
162
  cls.db.cursor.executemany(query_duplicates, line_batch)
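
The hunks above move row parsing into `_parse_line`, store `301`/`404` status codes in the `response` column at insert time, and derive the duplicate count from the number of rows that `INSERT OR IGNORE` actually kept. The following is a rough, self-contained sketch of that flow against an in-memory SQLite database. The function name `parse_cdx_row`, the sample rows, and the reduced table schema are assumptions for illustration; as the new hunk appears to do, the sketch lets a malformed line raise `json.JSONDecodeError` instead of counting it as faulty.

```python
import json
import sqlite3


def parse_cdx_row(raw):
    """Map one CDX JSON row to (timestamp, url_archive, url_origin, response)."""
    timestamp, _digest, _mimetype, statuscode, origin = json.loads(raw)
    url_archive = f"https://web.archive.org/web/{timestamp}id_/{origin}"
    # Only 301 and 404 are pre-recorded as a response; everything else stays NULL.
    response = statuscode if statuscode in ("301", "404") else None
    return (timestamp, url_archive, origin, response)


# Made-up sample rows; the second one duplicates the first.
rows = [
    '["20240225193302", "AAA", "text/css", "200", "https://example.com/a.css"]',
    '["20240225193302", "AAA", "text/css", "200", "https://example.com/a.css"]',
    '["20240101000000", "BBB", "text/html", "404", "https://example.com/gone"]',
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE snapshot_tbl ("
    "timestamp TEXT, url_archive TEXT, url_origin TEXT, response TEXT, "
    "UNIQUE (url_archive))"
)
query = (
    "INSERT OR IGNORE INTO snapshot_tbl "
    "(timestamp, url_archive, url_origin, response) VALUES (?, ?, ?, ?)"
)

batch_size = 2  # the diff uses 2500; kept small here so the flush path runs
batch = []
for raw in rows:
    batch.append(parse_cdx_row(raw))
    if len(batch) >= batch_size:
        conn.executemany(query, batch)
        batch = []
if batch:  # flush the remainder
    conn.executemany(query, batch)
conn.commit()

total = conn.execute("SELECT COUNT(*) FROM snapshot_tbl").fetchone()[0]
duplicates = len(rows) - total  # same idea as CDX_TOTAL - count_totals(total=True)
print(total, duplicates)  # 2 1
```

Counting against the total row count rather than the unhandled count matters once some rows are inserted with a non-NULL `response`, since those rows are no longer "unhandled" but are not duplicates either.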
@@ -167,8 +164,7 @@ class SnapshotCollection:
167
164
 
168
165
  cls.db.conn.commit()
169
166
 
170
- cls.SNAPSHOT_FAULTY = faulty_lines
171
- cls.FILTER_DUPLICATES = cls.CDX_TOTAL - cls.count_totals(unhandled=True) + cls.SNAPSHOT_FAULTY
167
+ cls.FILTER_DUPLICATES = cls.CDX_TOTAL - cls.count_totals(total=True)
172
168
 
173
169
 
174
170
 
@@ -181,8 +177,11 @@ class SnapshotCollection:
181
177
  """
182
178
  row_batchsize = 2500
183
179
  cls.db.cursor.execute("UPDATE snapshot_tbl SET response = NULL WHERE response = 'LOCK'") # reset locked to unprocessed
184
- cls.db.cursor.execute("SELECT * FROM snapshot_tbl WHERE response IS NOT NULL") # only write processed snapshots
180
+ cls.db.cursor.execute("SELECT * FROM csv_view WHERE response IS NOT NULL") # only write processed snapshots
185
181
  headers = [description[0] for description in cls.db.cursor.description]
182
+ if "snapshot_id" in headers:
183
+ snapshot_id_index = headers.index("snapshot_id")
184
+ headers.pop(snapshot_id_index)
186
185
  with open(csvfile, "w", encoding="utf-8") as f:
187
186
  writer = csv.writer(f)
188
187
  writer.writerow(headers)
@@ -203,13 +202,15 @@ class SnapshotCollection:
203
202
  Create indexes for the snapshot table.
204
203
  """
205
204
  # index for filtering last snapshots
206
- cls.db.cursor.execute(
207
- "CREATE INDEX IF NOT EXISTS idx_snapshot_tbl_url_origin_timestamp_desc ON snapshot_tbl(url_origin, timestamp DESC);"
208
- )
205
+ if cls.MODE_LAST:
206
+ cls.db.cursor.execute(
207
+ "CREATE INDEX IF NOT EXISTS idx_snapshot_tbl_url_origin_timestamp_desc ON snapshot_tbl(url_origin, timestamp DESC);"
208
+ )
209
209
  # index for filtering first snapshots
210
- cls.db.cursor.execute(
211
- "CREATE INDEX IF NOT EXISTS idx_snapshot_tbl_url_origin_timestamp_asc ON snapshot_tbl(url_origin, timestamp ASC);"
212
- )
210
+ if cls.MODE_FIRST:
211
+ cls.db.cursor.execute(
212
+ "CREATE INDEX IF NOT EXISTS idx_snapshot_tbl_url_origin_timestamp_asc ON snapshot_tbl(url_origin, timestamp ASC);"
213
+ )
213
214
  # index for skippable snapshots
214
215
  cls.db.cursor.execute(
215
216
  "CREATE INDEX IF NOT EXISTS idx_snapshot_tbl_timestamp_url_origin_response ON snapshot_tbl(timestamp, url_origin);"
@@ -247,6 +248,26 @@ class SnapshotCollection:
247
248
  """
248
249
  )
249
250
  cls.FILTER_MODE = cls.db.cursor.rowcount
251
+
252
+ cls.db.cursor.execute(
253
+ """
254
+ SELECT COUNT(*) FROM snapshot_tbl WHERE response IN ('404', '301')
255
+ """
256
+ )
257
+ cls.FILTER_RESPONSE = cls.db.cursor.fetchone()[0]
258
+
259
+ cls.db.cursor.execute(
260
+ """
261
+ WITH numbered AS (
262
+ SELECT rowid, ROW_NUMBER() OVER (ORDER BY rowid) AS rn
263
+ FROM snapshot_tbl
264
+ )
265
+ UPDATE snapshot_tbl
266
+ SET counter = (
267
+ SELECT rn FROM numbered WHERE numbered.rowid = snapshot_tbl.rowid
268
+ );
269
+ """
270
+ )
250
271
 
251
272
  cls.db.conn.commit()
252
273
 
@@ -259,13 +280,6 @@ class SnapshotCollection:
259
280
  """
260
281
  If an existing csv-file for the job exists, the responses will be overwritten by the csv-content.
261
282
  """
262
- cls.db.cursor.execute(
263
- """
264
- UPDATE snapshot_tbl
265
- SET response = NULL
266
- """
267
- )
268
- cls.db.conn.commit()
269
283
  if not os.path.isfile(csvfile):
270
284
  return
271
285
  else:
@@ -336,7 +350,7 @@ class SnapshotCollection:
336
350
  """
337
351
  Modify a snapshot-row in the snapshot table.
338
352
  """
339
- query = f"UPDATE snapshot_tbl SET {column} = ? WHERE rowid = ?"
353
+ query = f"UPDATE snapshot_tbl SET {column} = ? WHERE counter = ?"
340
354
  connection.cursor.execute(query, (value, snapshot_id))
341
355
  connection.conn.commit()
342
356
 
@@ -3,6 +3,7 @@ from typing import Union
3
3
 
4
4
 
5
5
 
6
+
6
7
  class Verbosity:
7
8
  """
8
9
  A class to manage verbosity levels, logging, progress and output.
@@ -37,7 +37,7 @@ class Worker:
37
37
  self.snapshot = sc.get_snapshot(self.db)
38
38
  if not self.snapshot:
39
39
  return
40
- self.rowid = self.snapshot["rowid"]
40
+ self.counter = self.snapshot["counter"]
41
41
  self.timestamp = self.snapshot["timestamp"]
42
42
  self.url_archive = self.snapshot["url_archive"]
43
43
  self.url_origin = self.snapshot["url_origin"]
@@ -64,7 +64,7 @@ class Worker:
64
64
  if self.redirect_timestamp is None and value is None:
65
65
  return
66
66
  self._redirect_url = value
67
- sc.modify_snapshot(self.db, self.rowid, "redirect_url", value)
67
+ sc.modify_snapshot(self.db, self.counter, "redirect_url", value)
68
68
 
69
69
  @property
70
70
  def redirect_timestamp(self):
@@ -75,7 +75,7 @@ class Worker:
75
75
  if self.redirect_url is None and value is None:
76
76
  return
77
77
  self._redirect_timestamp = value
78
- sc.modify_snapshot(self.db, self.rowid, "redirect_timestamp", value)
78
+ sc.modify_snapshot(self.db, self.counter, "redirect_timestamp", value)
79
79
 
80
80
  @property
81
81
  def response(self):
@@ -86,7 +86,7 @@ class Worker:
86
86
  if self.redirect_url is None and value is None:
87
87
  return
88
88
  self._response = value
89
- sc.modify_snapshot(self.db, self.rowid, "response", value)
89
+ sc.modify_snapshot(self.db, self.counter, "response", value)
90
90
 
91
91
  @property
92
92
  def file(self):
@@ -97,7 +97,7 @@ class Worker:
97
97
  if self.redirect_url is None and value is None:
98
98
  return
99
99
  self._file = value
100
- sc.modify_snapshot(self.db, self.rowid, "file", value)
100
+ sc.modify_snapshot(self.db, self.counter, "file", value)
101
101
 
102
102
 
103
103
  class Message(Worker):
@@ -141,14 +141,15 @@ class Message(Worker):
141
141
  "verbose": True,
142
142
  "content": _format_verbose({"result": result, "info": info, "content": content}),
143
143
  }
144
+ self.buffer.append(self.message)
144
145
  if verbose is False or verbose is None:
145
146
  result = result + " - " if result else ""
146
147
  content = content + " - " if content else ""
147
148
  self.message = {
148
149
  "verbose": False,
149
- "content": f"{self.worker.rowid}/{sc.SNAPSHOT_TOTAL} - W:{self.worker.id} - {result}{content}{self.worker.timestamp} - {self.worker.url_origin}",
150
+ "content": f"{self.worker.counter}/{sc.SNAPSHOT_TOTAL} - W:{self.worker.id} - {result}{content}{self.worker.timestamp} - {self.worker.url_origin}",
150
151
  }
151
- self.buffer.append(self.message)
152
+ self.buffer.append(self.message)
152
153
 
153
154
  def write(self):
154
155
  """
@@ -47,7 +47,7 @@ def startup():
47
47
 
48
48
 
49
49
 
50
- def query_list(csvfile: str, cdxfile: str,queryrange: int,limit: int,start: int,end: int,explicit: bool,filter_filetype: list):
50
+ def query_list(csvfile: str, cdxfile: str,queryrange: int,limit: int,start: int,end: int,explicit: bool,filter_filetype: list,filter_statuscode: list):
51
51
 
52
52
  def inject(cdxinject: str) -> bool:
53
53
  if os.path.isfile(cdxinject):
@@ -60,7 +60,7 @@ def query_list(csvfile: str, cdxfile: str,queryrange: int,limit: int,start: int,
60
60
  )
61
61
  return False
62
62
 
63
- def create_query(queryrange: int, limit: int, filter_filetype: list, start: int, end: int, explicit: bool) -> str:
63
+ def create_query(queryrange: int, limit: int, filter_filetype: list, filter_statuscode: list, start: int, end: int, explicit: bool) -> str:
64
64
  if queryrange:
65
65
  query_range = f"&from={datetime.now().year - queryrange}"
66
66
  else:
@@ -81,9 +81,10 @@ def query_list(csvfile: str, cdxfile: str,queryrange: int,limit: int,start: int,
81
81
 
82
82
  limit = f"&limit={limit}" if limit else ""
83
83
 
84
+ filter_statuscode = (f"&filter=statuscode:({'|'.join(filter_statuscode)})$" if filter_statuscode else "")
84
85
  filter_filetype = (f"&filter=original:.*\\.({'|'.join(filter_filetype)})$" if filter_filetype else "")
85
86
 
86
- cdxquery = f"https://web.archive.org/cdx/search/cdx?output=json&url={cdx_url}{query_range}&fl=timestamp,digest,mimetype,statuscode,original{limit}{filter_filetype}"
87
+ cdxquery = f"https://web.archive.org/cdx/search/cdx?output=json&url={cdx_url}{query_range}&fl=timestamp,digest,mimetype,statuscode,original{limit}{filter_filetype}{filter_statuscode}"
87
88
 
88
89
  return cdxquery
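
To make the new `filter_statuscode` fragment concrete, here is a small sketch that builds only the two `&filter=` parts of the query string in the same way as the hunk above. The function name `build_filters` is hypothetical, and the values are inserted unencoded, just as in the diff.

```python
def build_filters(filetypes=None, statuscodes=None):
    """Return the filetype and statuscode filter fragments of a CDX query."""
    filter_statuscode = (
        f"&filter=statuscode:({'|'.join(statuscodes)})$" if statuscodes else ""
    )
    filter_filetype = (
        f"&filter=original:.*\\.({'|'.join(filetypes)})$" if filetypes else ""
    )
    # Same order as in the diff: filetype filter first, then statuscode filter.
    return f"{filter_filetype}{filter_statuscode}"


print(build_filters(["jpg", "css"], ["200", "301"]))
# &filter=original:.*\.(jpg|css)$&filter=statuscode:(200|301)$
print(repr(build_filters()))
# ''
```

Each list is collapsed into a regular-expression alternation, so `--statuscode 200,301` becomes a single filter that matches either code.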
89
90
 
@@ -111,7 +112,7 @@ def query_list(csvfile: str, cdxfile: str,queryrange: int,limit: int,start: int,
111
112
 
112
113
  cdxinject = inject(cdxfile)
113
114
  if not cdxinject:
114
- cdxquery = create_query(queryrange, limit, filter_filetype, start, end, explicit)
115
+ cdxquery = create_query(queryrange, limit, filter_filetype, filter_statuscode, start, end, explicit)
115
116
  cdxfile = run_query(cdxfile, cdxquery)
116
117
  sc.process_cdx(cdxfile, csvfile)
117
118
 
@@ -131,7 +132,7 @@ def download_list(output, retry, no_redirect, delay, workers):
131
132
  threads = []
132
133
  for i in range(workers):
133
134
  worker = Worker(id=i + 1)
134
- vb.write(verbose=True, content=f"\n-----> Starting worker: {worker.id}")
135
+ vb.write(verbose=True, content=f"\n-----> Starting Worker: {worker.id}")
135
136
  thread = threading.Thread(target=download_loop, args=(worker, output, retry, no_redirect, delay))
136
137
  threads.append(thread)
137
138
  thread.start()
@@ -163,7 +164,7 @@ def download_loop(worker, output, retry, no_redirect, delay):
163
164
 
164
165
  while worker.attempt <= retry_max_attempt: # retry as given by user
165
166
 
166
- worker.message.store(verbose=True, content=f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.rowid}/{sc.SNAPSHOT_TOTAL}]")
167
+ worker.message.store(verbose=True, content=f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.counter}/{sc.SNAPSHOT_TOTAL}]")
167
168
  download_attempt = 1
168
169
  download_max_attempt = 3
169
170
 
@@ -180,11 +181,11 @@ def download_loop(worker, output, retry, no_redirect, delay):
180
181
  download_attempt += 1 # try again 2x with same connection
181
182
  vb.write(
182
183
  verbose=True,
183
- content=f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.rowid}/{sc.SNAPSHOT_TOTAL}] - {e.__class__.__name__} - requesting again in 50 seconds...",
184
+ content=f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.counter}/{sc.SNAPSHOT_TOTAL}] - {e.__class__.__name__} - requesting again in 50 seconds...",
184
185
  )
185
186
  vb.write(
186
187
  verbose=False,
187
- content=f"Worker: {worker.id} - Snapshot {worker.rowid}/{sc.SNAPSHOT_TOTAL} - requesting again in 50 seconds...",
188
+ content=f"Worker: {worker.id} - Snapshot {worker.counter}/{sc.SNAPSHOT_TOTAL} - requesting again in 50 seconds...",
188
189
  )
189
190
  time.sleep(50)
190
191
  continue
@@ -195,17 +196,17 @@ def download_loop(worker, output, retry, no_redirect, delay):
195
196
  download_attempt = download_max_attempt # try again 1x with new connection
196
197
  vb.write(
197
198
  verbose=True,
198
- content=f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.rowid}/{sc.SNAPSHOT_TOTAL}] - {e.__class__.__name__} - renewing connection in 15 seconds...",
199
+ content=f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.counter}/{sc.SNAPSHOT_TOTAL}] - {e.__class__.__name__} - renewing connection in 15 seconds...",
199
200
  )
200
201
  vb.write(
201
202
  verbose=False,
202
- content=f"Worker: {worker.id} - Snapshot {worker.rowid}/{sc.SNAPSHOT_TOTAL} - renewing connection in 15 seconds...",
203
+ content=f"Worker: {worker.id} - Snapshot {worker.counter}/{sc.SNAPSHOT_TOTAL} - renewing connection in 15 seconds...",
203
204
  )
204
205
  time.sleep(15)
205
206
  worker.refresh_connection()
206
207
  continue
207
208
  else:
208
- ex.exception(f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.rowid}/{sc.SNAPSHOT_TOTAL}] - EXCEPTION - {e}", e=e)
209
+ ex.exception(f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.counter}/{sc.SNAPSHOT_TOTAL}] - EXCEPTION - {e}", e=e)
209
210
  worker.attempt = retry_max_attempt
210
211
  break
211
212
 
@@ -19,6 +19,7 @@ class Database:
19
19
  filter_complete INTEGER
20
20
  )"""
21
21
  snapshot_table = """CREATE TABLE IF NOT EXISTS snapshot_tbl (
22
+ counter INT,
22
23
  timestamp TEXT,
23
24
  url_archive TEXT,
24
25
  url_origin TEXT,
@@ -28,6 +29,18 @@ class Database:
28
29
  file TEXT,
29
30
  UNIQUE (url_archive)
30
31
  )"""
32
+ csv_view = """CREATE VIEW IF NOT EXISTS csv_view
33
+ AS
34
+ SELECT
35
+ timestamp AS timestamp,
36
+ url_archive AS url_archive,
37
+ url_origin AS url_origin,
38
+ redirect_url AS redirect_url,
39
+ redirect_timestamp AS redirect_timestamp,
40
+ response AS response,
41
+ file AS file
42
+ FROM snapshot_tbl;
43
+ """
31
44
 
32
45
  QUERY_EXIST = False
33
46
  QUERY_PROGRESS = "0 / 0"
@@ -38,6 +51,7 @@ class Database:
38
51
  db = Database()
39
52
  db.cursor.execute(cls.waybackup_table)
40
53
  db.cursor.execute(cls.snapshot_table)
54
+ db.cursor.execute(cls.csv_view)
41
55
  db.cursor.execute("SELECT query_identifier FROM waybackup_table WHERE query_identifier = ?", (query_identifier,))
42
56
  if db.cursor.fetchone():
43
57
  cls.QUERY_EXIST = True
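
The schema changes above add a `counter` column and a `csv_view`, while the SnapshotCollection hunks renumber `counter` with `ROW_NUMBER()` and switch `modify_snapshot` from `rowid` to `counter`. The sketch below replays that combination on an in-memory database with a reduced, assumed schema and made-up rows. The likely motivation (a contiguous 1..N snapshot ID even after rows were filtered out) is an inference from the diff, not something it states. Window functions require SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE snapshot_tbl (
        counter INT, timestamp TEXT, url_archive TEXT, url_origin TEXT,
        response TEXT, UNIQUE (url_archive))"""
)
# The view exposes only the columns meant for the CSV (no counter).
conn.execute(
    """CREATE VIEW csv_view AS
       SELECT timestamp, url_archive, url_origin, response FROM snapshot_tbl"""
)
conn.executemany(
    "INSERT INTO snapshot_tbl (timestamp, url_archive, url_origin) VALUES (?, ?, ?)",
    [("1", "a1", "a"), ("2", "a2", "a"), ("3", "b3", "b")],
)
# Simulate a filter step removing a row, which leaves a gap in rowid.
conn.execute("DELETE FROM snapshot_tbl WHERE url_archive = 'a1'")

# Renumber the remaining rows 1..N, as in the diff.
conn.execute(
    """
    WITH numbered AS (
        SELECT rowid, ROW_NUMBER() OVER (ORDER BY rowid) AS rn FROM snapshot_tbl
    )
    UPDATE snapshot_tbl
    SET counter = (SELECT rn FROM numbered WHERE numbered.rowid = snapshot_tbl.rowid)
    """
)
numbered_rows = conn.execute(
    "SELECT rowid, counter, url_archive FROM snapshot_tbl ORDER BY rowid"
).fetchall()
print(numbered_rows)  # [(2, 1, 'a2'), (3, 2, 'b3')]

# Address a snapshot by counter, then export processed rows through the view.
conn.execute("UPDATE snapshot_tbl SET response = ? WHERE counter = ?", ("200", 1))
cursor = conn.execute("SELECT * FROM csv_view WHERE response IS NOT NULL")
headers = [description[0] for description in cursor.description]
exported = cursor.fetchall()
print(headers)   # ['timestamp', 'url_archive', 'url_origin', 'response']
print(exported)  # [('2', 'a2', 'a', '200')]
```

Reading from the view keeps the CSV layout stable even though the table gained an internal column.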
@@ -29,7 +29,7 @@ def main():
29
29
  archive_download.startup()
30
30
 
31
31
  try:
32
- archive_download.query_list(config.csvfile, config.cdxfile, config.range, config.limit, config.start, config.end, config.explicit, config.filetype)
32
+ archive_download.query_list(config.csvfile, config.cdxfile, config.range, config.limit, config.start, config.end, config.explicit, config.filetype, config.statuscode)
33
33
  archive_download.download_list(config.output, config.retry, config.no_redirect, config.delay, config.workers)
34
34
  except KeyboardInterrupt:
35
35
  print("\nInterrupted by user\n")
@@ -38,7 +38,7 @@ def main():
38
38
 
39
39
  except Exception as e:
40
40
  config.keep = True
41
- ex.exception(content="", e=e)
41
+ ex.exception(message="", e=e)
42
42
 
43
43
  finally:
44
44
  sc.csv_create(config.csvfile)
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: pywaybackup
3
- Version: 3.2.1
3
+ Version: 3.3.0
4
4
  Summary: Query and download archive.org as simple as possible.
5
5
  Author-email: bitdruid <bitdruid@outlook.com>
6
6
  License: MIT License
@@ -55,16 +55,16 @@ This tool allows you to download content from the Wayback Machine (archive.org).
55
55
  ### Pip
56
56
 
57
57
  1. Install the package <br>
58
- ```pip install pywaybackup```
58
+ `pip install pywaybackup`
59
59
  2. Run the tool <br>
60
- ```waybackup -h```
60
+ `waybackup -h`
61
61
 
62
62
  ### Manual
63
63
 
64
64
  1. Clone the repository <br>
65
- ```git clone https://github.com/bitdruid/python-wayback-machine-downloader.git```
65
+ `git clone https://github.com/bitdruid/python-wayback-machine-downloader.git`
66
66
  2. Install <br>
67
- ```pip install .```
67
+ `pip install .`
68
68
  - in a virtual env or use `--break-system-package`
69
69
 
70
70
  ## notes / issues / hints
@@ -88,6 +88,7 @@ This tool allows you to download content from the Wayback Machine (archive.org).
88
88
  The URL of the web page to download. This argument is required.
89
89
 
90
90
  #### Mode Selection (Choose One)
91
+
91
92
  - **`-a`**, **`--all`**:<br>
92
93
  Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
93
94
  - **`-l`**, **`--last`**:<br>
@@ -102,66 +103,67 @@ This tool allows you to download content from the Wayback Machine (archive.org).
102
103
  - **`-e`**, **`--explicit`**:<br>
103
104
  Only download the explicit given URL. No wildcard subdomains or paths. Use e.g. to get root-only snapshots. This is recommended for explicit files like `login.html` or `?query=this`.
104
105
 
105
- - **`--filetype`** `<filetype>`:<br>
106
- Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
107
-
108
106
  - **`--limit`** `<count>`:<br>
109
- Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
107
+ Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
110
108
 
111
109
  - **Range Selection:**<br>
112
110
  Specify the range in years or a specific timestamp either start, end, or both. If you specify the `range` argument, the `start` and `end` arguments will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.<br>
113
111
  (year 2019, year 2019, year+month+day 20190101, year+month+day+hour 2019010112)
114
- - **`-r`**, **`--range`**:<br>
115
- Specify the range in years for which to search and download snapshots.
116
- - **`--start`**:<br>
117
- Timestamp to start searching.
118
- - **`--end`**:<br>
119
- Timestamp to end searching.
112
+
113
+ - **`-r`**, **`--range`**:<br>
114
+ Specify the range in years for which to search and download snapshots.
115
+ - **`--start`**:<br>
116
+ Timestamp to start searching.
117
+ - **`--end`**:<br>
118
+ Timestamp to end searching.
119
+
120
+ - **Filtering:**<br>
121
+ A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter.
122
+
123
+ - **`--filetype`** `<filetype>`:<br>
124
+ Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you can't filter them.
125
+
126
+ - **`--statuscode`** `<statuscode>`:<br>
127
+ Specify HTTP status codes to download. Default is all statuscodes. Separate multiple status codes with a comma. Example: `--statuscode 200,301`. Pywaybackup will try to download any snapshot regardless of its statuscode. For 404 of course this means logged errors and corresponding entries in the csv. However, you may want to get a csv that includes these negative attempts for your needs.<br>
128
+ Common status codes you may want to handle/filter:
129
+ - `200` (OK)
130
+ - `301` (Moved Permanently - will redirect snapshot)
131
+ - `404` (Not Found - snapshot seems to be empty)
132
+ - `500` (Internal Server Error - snapshot is at least for now not available)
120
133
 
121
134
  ### Optional
122
135
 
123
136
  #### Behavior Manipulation
124
137
 
125
138
  - **`-o`**, **`--output`**:<br>
126
- Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
139
+ Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
127
140
 
128
141
  - **`-m`**, **`--metadata`**<br>
129
- Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
142
+ Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
143
+
144
+ - **`--verbose`**:<br>
145
+ Increase output verbosity.
130
146
 
131
147
  <!-- - **`--verbosity`** `<level>`:<br>
132
148
  Sets verbosity level. Options are `info`and `trace`. Default is `info`. -->
133
149
 
134
150
  - **`--log`** <!-- `<path>` -->:<br>
135
- Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
151
+ Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
136
152
 
137
153
  - **`--progress`**:<br>
138
- Shows a progress bar instead of the default output.
154
+ Shows a progress bar instead of the default output.
139
155
 
140
156
  - **`--workers`** `<count>`:<br>
141
- Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
157
+ Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
142
158
 
143
159
  - **`--no-redirect`**:<br>
144
- Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
160
+ Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
145
161
 
146
162
  - **`--retry`** `<attempts>`:<br>
147
- Specifies number of retry attempts for failed downloads.
163
+ Specifies number of retry attempts for failed downloads.
148
164
 
149
165
  - **`--delay`** `<seconds>`:<br>
150
- Specifies delay between download requests in seconds. Default is no delay (0).
151
-
152
- - **`--verbose`**:<br>
153
- Increase output verbosity.
154
- - verbose:
155
- ```
156
- -----> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
157
- SUCCESS -> 200 OK
158
- -> URL: https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
159
- -> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css
160
- ```
161
- - non-verbose:
162
- ```
163
- 55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
164
- ```
166
+ Specifies delay between download requests in seconds. Default is no delay (0).
165
167
 
166
168
  <!-- - **`--convert-links`**:<br>
167
169
  If set, all links in the downloaded files will be converted to local links. This is useful for offline browsing. The links are converted to the local path structure. Show output with `--verbosity trace`. -->
@@ -186,14 +188,16 @@ If set, all links in the downloaded files will be converted to local links. This
186
188
  - Detects existing `.cdx` and `.db` files in an `output dir` to resume downloading from the last successful point.
187
189
  - Compares `URL`, `mode`, and `optional query parameters` to ensure automatic resumption.
188
190
  - Skips previously downloaded files to save time.
189
- > **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
191
+ > **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
190
192
 
191
193
  #### Resetting a Job (`--reset`)
194
+
192
195
  - Deletes `.cdx` and `.db` files and restarts the process from scratch.
193
196
  - Does **not** remove already downloaded files.
194
197
  - `waybackup -u https://example.com -a --reset`
195
198
 
196
199
  #### Keeping Job Data (`--keep`)
200
+
197
201
  - Normally, `.cdx` and `.db` files are deleted after a successful job.
198
202
  - `--keep` preserves them for future re-analysis or extending the query.
199
203
  - `waybackup -u https://example.com -a --keep`
@@ -204,13 +208,13 @@ If set, all links in the downloaded files will be converted to local links. This
204
208
  ## Examples
205
209
 
206
210
  1. Download a specific single snapshot of all available files (starting from root):<br>
207
- `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000`
211
+ `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000`
208
212
  2. Download a specific single snapshot of all available files (starting from a subdirectory):<br>
209
- `waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000`
213
+ `waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000`
210
214
  3. Download a specific single snapshot of the exact given URL (no subdirs):<br>
211
- `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit`
215
+ `waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit`
212
216
  4. Download all snapshots of all available files in the given range:<br>
213
- `waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000`
217
+ `waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000`
214
218
 
215
219
  <br>
216
220
  <br>
@@ -223,7 +227,9 @@ The output path is currently structured as follows by an example for the query:<
223
227
  `http://example.com/subdir1/subdir2/assets/`
224
228
  <br><br>
225
229
  For the first and last version (`-f` or `-l`):
230
+
226
231
  - Will only include all files/folders starting from your query-path.
232
+
227
233
  ```
228
234
  your/path/waybackup_snapshots/
229
235
  └── the_root_of_your_query/ (example.com/)
@@ -234,8 +240,11 @@ your/path/waybackup_snapshots/
234
240
  ├── style.css
235
241
  ...
236
242
  ```
243
+
237
244
  For all versions (`-a`):
245
+
238
246
  - Will create a folder named as the root of your query. Inside this folder, you will find all timestamps and per timestamp the path you requested.
247
+
239
248
  ```
240
249
  your/path/waybackup_snapshots/
241
250
  └── the_root_of_your_query/ (example.com/)
@@ -276,6 +285,23 @@ For download queries:
276
285
  ]
277
286
  ```
278
287
 
288
+ ### Log
289
+
290
+ Verbose:
291
+
292
+ ```
293
+ -----> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
294
+ SUCCESS -> 200 OK
295
+ -> URL: https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
296
+ -> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css
297
+ ```
298
+
299
+ Non-verbose:
300
+
301
+ ```
302
+ 55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
303
+ ```
304
+
279
305
  ### Debugging
280
306
 
281
307
  Exceptions will be written into `waybackup_error.log` (each run overwrites the file).
@@ -287,3 +313,5 @@ Exceptions will be written into `waybackup_error.log` (each run overwrites the f
287
313
 
288
314
  I'm always happy for some feature requests to improve the usability of this tool.
289
315
  Feel free to give suggestions and report issues. Project is still far from being perfect.
316
+
317
+ > Please PR from dev into dev.
File without changes
File without changes