pywaybackup 3.2.1__tar.gz → 3.3.1__tar.gz
- {pywaybackup-3.2.1/pywaybackup.egg-info → pywaybackup-3.3.1}/PKG-INFO +68 -42
- {pywaybackup-3.2.1 → pywaybackup-3.3.1}/README.md +67 -41
- {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pyproject.toml +1 -1
- {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/Arguments.py +58 -6
- {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/SnapshotCollection.py +80 -62
- {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/Verbosity.py +1 -0
- {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/Worker.py +8 -7
- {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/archive_download.py +13 -11
- {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/db.py +14 -0
- {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/main.py +2 -2
- {pywaybackup-3.2.1 → pywaybackup-3.3.1/pywaybackup.egg-info}/PKG-INFO +68 -42
- {pywaybackup-3.2.1 → pywaybackup-3.3.1}/LICENSE +0 -0
- {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/Converter.py +0 -0
- {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/Exception.py +0 -0
- {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/__init__.py +0 -0
- {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/archive_save.py +0 -0
- {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup/helper.py +0 -0
- {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup.egg-info/SOURCES.txt +0 -0
- {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup.egg-info/dependency_links.txt +0 -0
- {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup.egg-info/entry_points.txt +0 -0
- {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup.egg-info/requires.txt +0 -0
- {pywaybackup-3.2.1 → pywaybackup-3.3.1}/pywaybackup.egg-info/top_level.txt +0 -0
- {pywaybackup-3.2.1 → pywaybackup-3.3.1}/setup.cfg +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: pywaybackup
-Version: 3.
+Version: 3.3.1
 Summary: Query and download archive.org as simple as possible.
 Author-email: bitdruid <bitdruid@outlook.com>
 License: MIT License
@@ -55,16 +55,16 @@ This tool allows you to download content from the Wayback Machine (archive.org).
 ### Pip
 
 1. Install the package <br>
-
+`pip install pywaybackup`
 2. Run the tool <br>
-
+`waybackup -h`
 
 ### Manual
 
 1. Clone the repository <br>
-
+`git clone https://github.com/bitdruid/python-wayback-machine-downloader.git`
 2. Install <br>
-
+`pip install .`
 - in a virtual env or use `--break-system-package`
 
 ## notes / issues / hints
@@ -88,6 +88,7 @@ This tool allows you to download content from the Wayback Machine (archive.org).
 The URL of the web page to download. This argument is required.
 
 #### Mode Selection (Choose One)
+
 - **`-a`**, **`--all`**:<br>
 Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
 - **`-l`**, **`--last`**:<br>
@@ -102,66 +103,67 @@ This tool allows you to download content from the Wayback Machine (archive.org).
 - **`-e`**, **`--explicit`**:<br>
 Only download the explicit given URL. No wildcard subdomains or paths. Use e.g. to get root-only snapshots. This is recommended for explicit files like `login.html` or `?query=this`.
 
-- **`--filetype`** `<filetype>`:<br>
-Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
-
 - **`--limit`** `<count>`:<br>
-Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
+Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
 
 - **Range Selection:**<br>
 Specify the range in years or a specific timestamp either start, end, or both. If you specify the `range` argument, the `start` and `end` arguments will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.<br>
 (year 2019, year 2019, year+month+day 20190101, year+month+day+hour 2019010112)
-
-
-
-
-
-
+
+- **`-r`**, **`--range`**:<br>
+Specify the range in years for which to search and download snapshots.
+- **`--start`**:<br>
+Timestamp to start searching.
+- **`--end`**:<br>
+Timestamp to end searching.
+
+- **Filtering:**<br>
+A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter.
+
+- **`--filetype`** `<filetype>`:<br>
+Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
+
+- **`--statuscode`** `<statuscode>`:<br>
+Specify HTTP status codes to download. Default is all statuscodes. Separate multiple status codes with a comma. Example: `--statuscode 200,301`. Pywaybackup will try to download any snapshot regardless of it's statuscode. For 404 of course this means logged errors and corresponding entries in the csv. However, you may want to get a csv that includes these negative attempts for your needs.<br>
+Common status codes you may want to handle/filter:
+- `200` (OK)
+- `301` (Moved Permanently - will redirect snapshot)
+- `404` (Not Found - snapshot seems to be empty)
+- `500` (Internal Server Error - snapshot is at least for now not available)
 
 ### Optional
 
 #### Behavior Manipulation
 
 - **`-o`**, **`--output`**:<br>
-Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
+Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
 
 - **`-m`**, **`--metadata`**<br>
-Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
+Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
+
+- **`--verbose`**:<br>
+Increase output verbosity.
 
 <!-- - **`--verbosity`** `<level>`:<br>
 Sets verbosity level. Options are `info`and `trace`. Default is `info`. -->
 
 - **`--log`** <!-- `<path>` -->:<br>
-Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
+Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
 
 - **`--progress`**:<br>
-Shows a progress bar instead of the default output.
+Shows a progress bar instead of the default output.
 
 - **`--workers`** `<count>`:<br>
-Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
+Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
 
 - **`--no-redirect`**:<br>
-Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
+Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
 
 - **`--retry`** `<attempts>`:<br>
-Specifies number of retry attempts for failed downloads.
+Specifies number of retry attempts for failed downloads.
 
 - **`--delay`** `<seconds>`:<br>
-Specifies delay between download requests in seconds. Default is no delay (0).
-
-- **`--verbose`**:<br>
-Increase output verbosity.
-- verbose:
-```
------> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
-SUCCESS -> 200 OK
--> URL: https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
--> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css
-```
-- non-verbose:
-```
-55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
-```
+Specifies delay between download requests in seconds. Default is no delay (0).
 
 <!-- - **`--convert-links`**:<br>
 If set, all links in the downloaded files will be converted to local links. This is useful for offline browsing. The links are converted to the local path structure. Show output with `--verbosity trace`. -->
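The range options in the hunk above take a `YYYYMMDDhhmmss` timestamp that may be shortened from the right. A minimal sketch of checking such a value before passing it to `--start` or `--end` follows. It assumes only whole components may be dropped (year, year+month, and so on). `is_valid_timestamp_prefix` and `_PREFIX_FORMATS` are hypothetical names and are not part of pywaybackup.

```python
from datetime import datetime

# Hypothetical lookup: accepted prefix length -> strptime format for that prefix.
_PREFIX_FORMATS = {
    4: "%Y",
    6: "%Y%m",
    8: "%Y%m%d",
    10: "%Y%m%d%H",
    12: "%Y%m%d%H%M",
    14: "%Y%m%d%H%M%S",
}


def is_valid_timestamp_prefix(value: str) -> bool:
    """Return True if value is a plausible YYYYMMDDhhmmss timestamp or a left prefix of one."""
    if not (value.isascii() and value.isdigit()):
        return False
    fmt = _PREFIX_FORMATS.get(len(value))
    if fmt is None:
        return False  # odd lengths such as "20190" cut a component in half
    try:
        datetime.strptime(value, fmt)
    except ValueError:
        return False  # e.g. month 13 or hour 25
    return True
```

The lookup table keeps the accepted lengths explicit, so a value that stops in the middle of a component is rejected rather than silently padded.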
@@ -186,14 +188,16 @@ If set, all links in the downloaded files will be converted to local links. This
 - Detects existing `.cdx` and `.db` files in an `output dir` to resume downloading from the last successful point.
 - Compares `URL`, `mode`, and `optional query parameters` to ensure automatic resumption.
 - Skips previously downloaded files to save time.
-> **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
+> **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
 
 #### Resetting a Job (`--reset`)
+
 - Deletes `.cdx` and `.db` files and restarts the process from scratch.
 - Does **not** remove already downloaded files.
 - `waybackup -u https://example.com -a --reset`
 
 #### Keeping Job Data (`--keep`)
+
 - Normally, `.cdx` and `.db` files are deleted after a successful job.
 - `--keep` preserves them for future re-analysis or extending the query.
 - `waybackup -u https://example.com -a --keep`
@@ -204,13 +208,13 @@ If set, all links in the downloaded files will be converted to local links. This
 ## Examples
 
 1. Download a specific single snapshot of all available files (starting from root):<br>
-`waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000`
+`waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000`
 2. Download a specific single snapshot of all available files (starting from a subdirectory):<br>
-`waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000`
+`waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000`
 3. Download a specific single snapshot of the exact given URL (no subdirs):<br>
-`waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit`
+`waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit`
 4. Download all snapshots of all available files in the given range:<br>
-`waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000`
+`waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000`
 
 <br>
 <br>
@@ -223,7 +227,9 @@ The output path is currently structured as follows by an example for the query:<
 `http://example.com/subdir1/subdir2/assets/`
 <br><br>
 For the first and last version (`-f` or `-l`):
+
 - Will only include all files/folders starting from your query-path.
+
 ```
 your/path/waybackup_snapshots/
 └── the_root_of_your_query/ (example.com/)
@@ -234,8 +240,11 @@ your/path/waybackup_snapshots/
 ├── style.css
 ...
 ```
+
 For all versions (`-a`):
+
 - Will create a folder named as the root of your query. Inside this folder, you will find all timestamps and per timestamp the path you requested.
+
 ```
 your/path/waybackup_snapshots/
 └── the_root_of_your_query/ (example.com/)
@@ -276,6 +285,23 @@ For download queries:
 ]
 ```
 
+### Log
+
+Verbose:
+
+```
+-----> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
+SUCCESS -> 200 OK
+-> URL: https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
+-> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css
+```
+
+Non-verbose:
+
+```
+55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
+```
+
 ### Debugging
 
 Exceptions will be written into `waybackup_error.log` (each run overwrites the file).
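The non-verbose sample under `### Log` in the hunk above is regular enough to parse when post-processing a saved log. The sketch below infers the pattern from that single sample line, so the real output may contain variants it does not cover. `LOG_LINE` and `parse_log_line` are hypothetical names and are not part of pywaybackup.

```python
import re

# Pattern inferred from the one README sample line; treat it as an assumption.
LOG_LINE = re.compile(
    r"^(?P<done>\d+)/(?P<total>\d+) - W:(?P<worker>\d+) - (?P<status>\w+) - "
    r"(?P<timestamp>\d{14}) - (?P<url>\S+)$"
)


def parse_log_line(line: str):
    """Split one non-verbose log line into named fields, or return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Counters are more useful as integers; timestamp and URL stay as strings.
    for key in ("done", "total", "worker"):
        fields[key] = int(fields[key])
    return fields


sample = "55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css"
parsed = parse_log_line(sample)
```

Returning `None` for non-matching lines lets a caller skip the multi-line verbose output without raising an error.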
@@ -16,16 +16,16 @@ This tool allows you to download content from the Wayback Machine (archive.org).
 ### Pip
 
 1. Install the package <br>
-
+`pip install pywaybackup`
 2. Run the tool <br>
-
+`waybackup -h`
 
 ### Manual
 
 1. Clone the repository <br>
-
+`git clone https://github.com/bitdruid/python-wayback-machine-downloader.git`
 2. Install <br>
-
+`pip install .`
 - in a virtual env or use `--break-system-package`
 
 ## notes / issues / hints
@@ -49,6 +49,7 @@ This tool allows you to download content from the Wayback Machine (archive.org).
 The URL of the web page to download. This argument is required.
 
 #### Mode Selection (Choose One)
+
 - **`-a`**, **`--all`**:<br>
 Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
 - **`-l`**, **`--last`**:<br>
@@ -63,66 +64,67 @@ This tool allows you to download content from the Wayback Machine (archive.org).
 - **`-e`**, **`--explicit`**:<br>
 Only download the explicit given URL. No wildcard subdomains or paths. Use e.g. to get root-only snapshots. This is recommended for explicit files like `login.html` or `?query=this`.
 
-- **`--filetype`** `<filetype>`:<br>
-Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
-
 - **`--limit`** `<count>`:<br>
-Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
+Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
 
 - **Range Selection:**<br>
 Specify the range in years or a specific timestamp either start, end, or both. If you specify the `range` argument, the `start` and `end` arguments will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.<br>
 (year 2019, year 2019, year+month+day 20190101, year+month+day+hour 2019010112)
-
-
-
-
-
-
+
+- **`-r`**, **`--range`**:<br>
+Specify the range in years for which to search and download snapshots.
+- **`--start`**:<br>
+Timestamp to start searching.
+- **`--end`**:<br>
+Timestamp to end searching.
+
+- **Filtering:**<br>
+A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter.
+
+- **`--filetype`** `<filetype>`:<br>
+Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
+
+- **`--statuscode`** `<statuscode>`:<br>
+Specify HTTP status codes to download. Default is all statuscodes. Separate multiple status codes with a comma. Example: `--statuscode 200,301`. Pywaybackup will try to download any snapshot regardless of it's statuscode. For 404 of course this means logged errors and corresponding entries in the csv. However, you may want to get a csv that includes these negative attempts for your needs.<br>
+Common status codes you may want to handle/filter:
+- `200` (OK)
+- `301` (Moved Permanently - will redirect snapshot)
+- `404` (Not Found - snapshot seems to be empty)
+- `500` (Internal Server Error - snapshot is at least for now not available)
 
 ### Optional
 
 #### Behavior Manipulation
 
 - **`-o`**, **`--output`**:<br>
-Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
+Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
 
 - **`-m`**, **`--metadata`**<br>
-Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
+Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
+
+- **`--verbose`**:<br>
+Increase output verbosity.
 
 <!-- - **`--verbosity`** `<level>`:<br>
 Sets verbosity level. Options are `info`and `trace`. Default is `info`. -->
 
 - **`--log`** <!-- `<path>` -->:<br>
-Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
+Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
 
 - **`--progress`**:<br>
-Shows a progress bar instead of the default output.
+Shows a progress bar instead of the default output.
 
 - **`--workers`** `<count>`:<br>
-Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
+Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
 
 - **`--no-redirect`**:<br>
-Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
+Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
 
 - **`--retry`** `<attempts>`:<br>
-Specifies number of retry attempts for failed downloads.
+Specifies number of retry attempts for failed downloads.
 
 - **`--delay`** `<seconds>`:<br>
-Specifies delay between download requests in seconds. Default is no delay (0).
-
-- **`--verbose`**:<br>
-Increase output verbosity.
-- verbose:
-```
------> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
-SUCCESS -> 200 OK
--> URL: https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
--> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css
-```
-- non-verbose:
-```
-55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
-```
+Specifies delay between download requests in seconds. Default is no delay (0).
 
 <!-- - **`--convert-links`**:<br>
 If set, all links in the downloaded files will be converted to local links. This is useful for offline browsing. The links are converted to the local path structure. Show output with `--verbosity trace`. -->
@@ -147,14 +149,16 @@ If set, all links in the downloaded files will be converted to local links. This
 - Detects existing `.cdx` and `.db` files in an `output dir` to resume downloading from the last successful point.
 - Compares `URL`, `mode`, and `optional query parameters` to ensure automatic resumption.
 - Skips previously downloaded files to save time.
-> **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
+> **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
 
 #### Resetting a Job (`--reset`)
+
 - Deletes `.cdx` and `.db` files and restarts the process from scratch.
 - Does **not** remove already downloaded files.
 - `waybackup -u https://example.com -a --reset`
 
 #### Keeping Job Data (`--keep`)
+
 - Normally, `.cdx` and `.db` files are deleted after a successful job.
 - `--keep` preserves them for future re-analysis or extending the query.
 - `waybackup -u https://example.com -a --keep`
@@ -165,13 +169,13 @@ If set, all links in the downloaded files will be converted to local links. This
 ## Examples
 
 1. Download a specific single snapshot of all available files (starting from root):<br>
-`waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000`
+`waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000`
 2. Download a specific single snapshot of all available files (starting from a subdirectory):<br>
-`waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000`
+`waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000`
 3. Download a specific single snapshot of the exact given URL (no subdirs):<br>
-`waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit`
+`waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit`
 4. Download all snapshots of all available files in the given range:<br>
-`waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000`
+`waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000`
 
 <br>
 <br>
@@ -184,7 +188,9 @@ The output path is currently structured as follows by an example for the query:<
 `http://example.com/subdir1/subdir2/assets/`
 <br><br>
 For the first and last version (`-f` or `-l`):
+
 - Will only include all files/folders starting from your query-path.
+
 ```
 your/path/waybackup_snapshots/
 └── the_root_of_your_query/ (example.com/)
@@ -195,8 +201,11 @@ your/path/waybackup_snapshots/
 ├── style.css
 ...
 ```
+
 For all versions (`-a`):
+
 - Will create a folder named as the root of your query. Inside this folder, you will find all timestamps and per timestamp the path you requested.
+
 ```
 your/path/waybackup_snapshots/
 └── the_root_of_your_query/ (example.com/)
@@ -237,6 +246,23 @@ For download queries:
 ]
 ```
 
+### Log
+
+Verbose:
+
+```
+-----> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
+SUCCESS -> 200 OK
+-> URL: https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
+-> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css
+```
+
+Non-verbose:
+
+```
+55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
+```
+
 ### Debugging
 
 Exceptions will be written into `waybackup_error.log` (each run overwrites the file).
@@ -3,6 +3,8 @@ import sys
 import os
 import argparse
 
+from argparse import RawTextHelpFormatter
+
 from importlib.metadata import version
 
 from pywaybackup.helper import url_split, sanitize_filename
@@ -10,9 +12,10 @@ from pywaybackup.helper import url_split, sanitize_filename
 class Arguments:
 
     def __init__(self):
-
-
-
+        parser = argparse.ArgumentParser(
+            description=f"<<< python-wayback-machine-downloader v{version('pywaybackup')} >>>\nby @bitdruid -> https://github.com/bitdruid",
+            formatter_class=RawTextHelpFormatter,
+        )
 
         required = parser.add_argument_group('required (one exclusive)')
         required.add_argument('-u', '--url', type=str, metavar="", help='url (with subdir/subdomain) to download')
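The hunk above switches the parser to `RawTextHelpFormatter`, which matters because the new description contains a `\n`. As a small illustration, the sketch below builds the same help text with argparse's default formatter and with the raw one. `DESCRIPTION`, `build_help` and the `demo` program name are placeholders, not the project's real banner or code.

```python
import argparse

# Placeholder banner with an embedded newline, standing in for the real description.
DESCRIPTION = "<<< demo tool >>>\nby @someone"


def build_help(formatter_class) -> str:
    """Render the help text of a minimal parser using the given formatter class."""
    parser = argparse.ArgumentParser(
        prog="demo",
        description=DESCRIPTION,
        formatter_class=formatter_class,
    )
    return parser.format_help()


# The default formatter re-wraps the description, so the newline is folded into a space.
default_help = build_help(argparse.HelpFormatter)
# RawTextHelpFormatter keeps the description (and argument help) exactly as written.
raw_help = build_help(argparse.RawTextHelpFormatter)
```

Printing both strings side by side shows the two-line banner surviving only in `raw_help`.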
@@ -27,12 +30,14 @@ class Arguments:
         optional.add_argument('-r', '--range', type=int, metavar="", help='range in years to search')
         optional.add_argument('--start', type=int, metavar="", help='start timestamp format: YYYYMMDDhhmmss')
         optional.add_argument('--end', type=int, metavar="", help='end timestamp format: YYYYMMDDhhmmss')
-        optional.add_argument('--filetype', type=str, metavar="", help='filetypes to download comma separated (e.g. "html,css")')
         optional.add_argument('--limit', type=int, nargs='?', const=True, metavar='int', help='limit the number of snapshots to download')
+        optional.add_argument('--filetype', type=str, metavar="", help='filetypes to download comma separated (js,css,...)')
+        optional.add_argument('--statuscode', type=str, metavar="", help='statuscodes to download comma separated (200,404,...)')
 
         behavior = parser.add_argument_group('manipulate behavior')
         behavior.add_argument('-o', '--output', type=str, metavar="", help='output for all files - defaults to current directory')
         behavior.add_argument('-m', '--metadata', type=str, metavar="", help='change directory for db/cdx/csv/log files')
+        behavior.add_argument('-v', '--verbose', action='store_true', help='overwritten by progress - gives detailed output')
         behavior.add_argument('--log', action='store_true', help='save a log file into the output folder')
         behavior.add_argument('--progress', action='store_true', help='show a progress bar')
         behavior.add_argument('--no-redirect', action='store_true', help='do not follow redirects by archive.org')
@@ -40,7 +45,6 @@ class Arguments:
         behavior.add_argument('--workers', type=int, default=1, metavar="", help='number of workers (simultaneous downloads)')
         # behavior.add_argument('--convert-links', action='store_true', help='Convert all links in the files to local paths. Requires -c/--current')
         behavior.add_argument('--delay', type=int, default=0, metavar="", help='delay between each download in seconds')
-        behavior.add_argument('--verbose', action='store_true', help='overwritten by progress - gives detailed output')
 
         special = parser.add_argument_group('special')
         special.add_argument('--reset', action='store_true', help='reset the job and ignore existing cdx/db/csv files')
@@ -61,6 +65,52 @@ class Arguments:
         return self.args
 
 class Configuration:
+
+    # def __init__(self):
+    # self.args = Arguments().get_args()
+    # for key, value in vars(self.args).items():
+    # setattr(Configuration, key, value)
+
+    # self.set_config()
+
+    # def set_config(self):
+    # # args now attributes of Configuration // Configuration.output, ...
+    # self.command = ' '.join(sys.argv[1:])
+    # self.domain, self.subdir, self.filename = url_split(self.url)
+
+    # if self.output is None:
+    # self.output = os.path.join(os.getcwd(), "waybackup_snapshots")
+    # if self.metadata is None:
+    # self.metadata = self.output
+    # os.makedirs(self.output, exist_ok=True) if not self.save else None
+    # os.makedirs(self.metadata, exist_ok=True) if not self.save else None
+
+    # if self.all:
+    # self.mode = "all"
+    # if self.last:
+    # self.mode = "last"
+    # if self.first:
+    # self.mode = "first"
+    # if self.save:
+    # self.mode = "save"
+
+    # if self.filetype:
+    # self.filetype = [f.lower().strip() for f in self.filetype.split(",")]
+    # if self.statuscode:
+    # self.statuscode = [s.lower().strip() for s in self.statuscode.split(",")]
+
+    # base_path = self.metadata
+    # base_name = f"waybackup_{sanitize_filename(self.url)}"
+    # self.cdxfile = os.path.join(base_path, f"{base_name}.cdx")
+    # self.dbfile = os.path.join(base_path, f"{base_name}.db")
+    # self.csvfile = os.path.join(base_path, f"{base_name}.csv")
+    # self.log = os.path.join(base_path, f"{base_name}.log") if self.log else None
+
+    # if self.reset:
+    # os.remove(self.cdxfile) if os.path.isfile(self.cdxfile) else None
+    # os.remove(self.dbfile) if os.path.isfile(self.dbfile) else None
+    # os.remove(self.csvfile) if os.path.isfile(self.csvfile) else None
+
 
     @classmethod
     def init(cls):
@@ -90,7 +140,9 @@ class Configuration:
|
|
|
90
140
|
cls.mode = "save"
|
|
91
141
|
|
|
92
142
|
if cls.filetype:
|
|
93
|
-
cls.filetype = [
|
|
143
|
+
cls.filetype = [f.lower().strip() for f in cls.filetype.split(",")]
|
|
144
|
+
if cls.statuscode:
|
|
145
|
+
cls.statuscode = [s.lower().strip() for s in cls.statuscode.split(",")]
|
|
94
146
|
|
|
95
147
|
base_path = cls.metadata
|
|
96
148
|
base_name = f"waybackup_{sanitize_filename(cls.url)}"
|
|
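Note on the hunk above: `--filetype` and the new `--statuscode` option are both normalized with the same comma-split, lower-case, strip pattern. A minimal sketch of that normalization follows; the helper name `split_option` is hypothetical and does not exist in the package.

```python
def split_option(value):
    """Turn 'JPG, css ,Js' into ['jpg', 'css', 'js'].

    Hypothetical helper: the package inlines this list comprehension
    in Configuration.init instead of using a function.
    """
    if not value:
        # None or an empty string means "no filter requested"
        return value
    return [item.lower().strip() for item in value.split(",")]


filetypes = split_option(" JPG, css ,Js")
statuscodes = split_option("200,301")
```

The status codes stay strings after splitting, which matters later because they are joined with `|` into the CDX filter expression.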
@@ -22,11 +22,10 @@ class SnapshotCollection:
|
|
|
22
22
|
SNAPSHOT_UNHANDLED = 0 # all unhandled snapshots in the db (without response)
|
|
23
23
|
SNAPSHOT_HANDLED = 0 # snapshots with a response
|
|
24
24
|
|
|
25
|
-
SNAPSHOT_REMOVALS = 0 # not to be utilized (total - unhandled - skip)
|
|
26
|
-
SNAPSHOT_FAULTY = 0 # snapshots which could not be loaded from cdx file into db
|
|
27
25
|
FILTER_DUPLICATES = 0 # with identical url_archive
|
|
28
26
|
FILTER_MODE = 0 # all snapshots filtered by the MODE (last or first)
|
|
29
27
|
FILTER_SKIP = 0 # content of the csv file
|
|
28
|
+
FILTER_RESPONSE = 0 # snapshots which could not be loaded from cdx file into db or 404
|
|
30
29
|
|
|
31
30
|
@classmethod
|
|
32
31
|
def init(cls, mode):
|
|
@@ -71,35 +70,40 @@ class SnapshotCollection:
|
|
|
71
70
|
cls.db.set_index_complete()
|
|
72
71
|
else:
|
|
73
72
|
vb.write(verbose=True, content="\nAlready indexed snapshots")
|
|
74
|
-
if
|
|
75
|
-
|
|
76
|
-
|
|
77
|
-
|
|
78
|
-
|
|
79
|
-
|
|
80
|
-
vb.write(verbose=True, content="\nAlready filtered snapshots (last or first version)")
|
|
73
|
+
if not cls.db.get_filter_complete():
|
|
74
|
+
vb.write(content="\nFiltering snapshots (last or first version)...")
|
|
75
|
+
cls.filter_snapshots() # filter: keep newest or oldest based on MODE
|
|
76
|
+
cls.db.set_filter_complete()
|
|
77
|
+
else:
|
|
78
|
+
vb.write(verbose=True, content="\nAlready filtered snapshots (last or first version)")
|
|
81
79
|
|
|
82
80
|
cls.skip_set(csvfile) # set response to NULL or read csv file and write values into db
|
|
81
|
+
|
|
82
|
+
|
|
83
|
+
|
|
84
|
+
|
|
85
|
+
|
|
86
|
+
@classmethod
|
|
87
|
+
def calculate(cls):
|
|
83
88
|
cls.SNAPSHOT_UNHANDLED = cls.count_totals(unhandled=True) # count all unhandled in db
|
|
84
89
|
cls.SNAPSHOT_HANDLED = cls.count_totals(handled=True) # count all handled in db
|
|
85
90
|
cls.SNAPSHOT_TOTAL = cls.count_totals(total=True) # count all in db
|
|
86
|
-
cls.SNAPSHOT_REMOVALS = cls.CDX_TOTAL - cls.SNAPSHOT_UNHANDLED - cls.FILTER_SKIP # count all removals
|
|
87
91
|
|
|
88
92
|
vb.write(content="\nSnapshot calculation:")
|
|
89
93
|
vb.write(content=f"-----> {'in CDX file'.ljust(18)}: {cls.CDX_TOTAL:,}")
|
|
90
94
|
|
|
91
|
-
if cls.FILTER_DUPLICATES == 0 and cls.FILTER_MODE == 0:
|
|
92
|
-
vb.write(content=f"-----> {'total removals'.ljust(18)}: {cls.SNAPSHOT_REMOVALS:,}")
|
|
93
|
-
if cls.SNAPSHOT_FAULTY > 0:
|
|
94
|
-
vb.write(content=f"-----> {'removed faulty'.ljust(18)}: {cls.SNAPSHOT_FAULTY}")
|
|
95
95
|
if cls.FILTER_DUPLICATES > 0:
|
|
96
96
|
vb.write(content=f"-----> {'removed duplicates'.ljust(18)}: {cls.FILTER_DUPLICATES:,}")
|
|
97
97
|
if cls.FILTER_MODE > 0:
|
|
98
98
|
vb.write(content=f"-----> {'removed versions'.ljust(18)}: {cls.FILTER_MODE:,}")
|
|
99
|
+
|
|
99
100
|
if cls.FILTER_SKIP > 0:
|
|
100
|
-
vb.write(content=f"-----> {'
|
|
101
|
+
vb.write(content=f"-----> {'skip existing'.ljust(18)}: {cls.FILTER_SKIP:,}")
|
|
102
|
+
if cls.FILTER_RESPONSE > 0:
|
|
103
|
+
vb.write(content=f"-----> {'skip statuscode'.ljust(18)}: {cls.FILTER_RESPONSE}")
|
|
101
104
|
|
|
102
|
-
|
|
105
|
+
if cls.SNAPSHOT_UNHANDLED > 0:
|
|
106
|
+
vb.write(content=f"\n-----> {'to utilize'.ljust(18)}: {cls.SNAPSHOT_UNHANDLED:,}")
|
|
103
107
|
|
|
104
108
|
|
|
105
109
|
|
|
@@ -112,7 +116,23 @@ class SnapshotCollection:
|
|
|
112
116
|
- Removes duplicates by url_archive (same timestamp and url_origin)
|
|
113
117
|
- Filters the snapshots by the given mode (last or first)
|
|
114
118
|
"""
|
|
119
|
+
|
|
120
|
+
def _parse_line(line):
|
|
121
|
+
line = json.loads(line)
|
|
122
|
+
line = {
|
|
123
|
+
"timestamp": line[0],
|
|
124
|
+
"digest": line[1],
|
|
125
|
+
"mimetype": line[2],
|
|
126
|
+
"statuscode": line[3],
|
|
127
|
+
"origin": line[4],
|
|
128
|
+
}
|
|
129
|
+
url_archive = f"https://web.archive.org/web/{line['timestamp']}id_/{line['origin']}"
|
|
130
|
+
statuscode = line["statuscode"] if line["statuscode"] in ("301", "404") else None
|
|
131
|
+
return (line["timestamp"], url_archive, line["origin"], statuscode)
|
|
132
|
+
|
|
133
|
+
|
|
115
134
|
vb.write(verbose=None, content="\nInserting CDX data into database...")
|
|
135
|
+
|
|
116
136
|
with open(cdxfile, "r", encoding="utf-8") as f, tqdm(
|
|
117
137
|
unit=" lines",
|
|
118
138
|
total=cls.CDX_TOTAL,
|
|
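Note on the hunk above: the new `_parse_line` helper turns one CDX JSON row into the 4-tuple that is inserted into `snapshot_tbl`. The sketch below restates it as a standalone function; the name `parse_cdx_row` and the sample row (including its digest value) are made up for illustration.

```python
import json


def parse_cdx_row(raw_line):
    """Map one CDX row (timestamp, digest, mimetype, statuscode, original) to a db tuple."""
    timestamp, digest, mimetype, statuscode, origin = json.loads(raw_line)
    url_archive = f"https://web.archive.org/web/{timestamp}id_/{origin}"
    # Only 301 and 404 are stored up front; every other row starts with response = NULL
    response = statuscode if statuscode in ("301", "404") else None
    return (timestamp, url_archive, origin, response)


# Hypothetical sample rows in the field order requested by the CDX query
ok_row = '["20240225193302", "SAMPLEDIGEST", "text/css", "200", "https://example.com/a.css"]'
missing_row = '["20240225193302", "SAMPLEDIGEST", "text/html", "404", "https://example.com/gone"]'

parsed_ok = parse_cdx_row(ok_row)
parsed_missing = parse_cdx_row(missing_row)
```

Because the status code is written into `response` at insert time, rows with 301 or 404 no longer count as unhandled (`response IS NULL`).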
@@ -123,11 +143,9 @@ class SnapshotCollection:
|
|
|
123
143
|
line_batchsize = 2500
|
|
124
144
|
line_batch = []
|
|
125
145
|
total_inserted = 0
|
|
126
|
-
|
|
127
|
-
query_duplicates = (
|
|
128
|
-
"""INSERT OR IGNORE INTO snapshot_tbl (timestamp, url_archive, url_origin) VALUES (?, ?, ?)"""
|
|
129
|
-
)
|
|
146
|
+
query_duplicates = """INSERT OR IGNORE INTO snapshot_tbl (timestamp, url_archive, url_origin, response) VALUES (?, ?, ?, ?)"""
|
|
130
147
|
first_line = True
|
|
148
|
+
|
|
131
149
|
for line in f:
|
|
132
150
|
if first_line:
|
|
133
151
|
first_line = False
|
|
@@ -137,29 +155,15 @@ class SnapshotCollection:
|
|
|
137
155
|
line = line.rsplit("]", 1)[0]
|
|
138
156
|
if line.endswith(","):
|
|
139
157
|
line = line.rsplit(",", 1)[0]
|
|
140
|
-
|
|
141
|
-
|
|
142
|
-
|
|
143
|
-
|
|
144
|
-
|
|
145
|
-
|
|
146
|
-
|
|
147
|
-
|
|
148
|
-
|
|
149
|
-
url_archive = f"https://web.archive.org/web/{line['timestamp']}id_/{line['url']}"
|
|
150
|
-
line_batch.append((line["timestamp"], url_archive, line["url"]))
|
|
151
|
-
if len(line_batch) >= line_batchsize:
|
|
152
|
-
total_inserted += len(line_batch)
|
|
153
|
-
cls.db.cursor.executemany(query_duplicates, line_batch)
|
|
154
|
-
line_batch = []
|
|
155
|
-
pbar.update(line_batchsize)
|
|
156
|
-
except json.JSONDecodeError as e:
|
|
157
|
-
faulty_lines += 1
|
|
158
|
-
vb.write(
|
|
159
|
-
verbose=None,
|
|
160
|
-
content=f"JSONDecodeError: {e} on line {cls.CDX_TOTAL}",
|
|
161
|
-
)
|
|
162
|
-
continue
|
|
158
|
+
|
|
159
|
+
line_batch.append(_parse_line(line))
|
|
160
|
+
|
|
161
|
+
if len(line_batch) >= line_batchsize:
|
|
162
|
+
total_inserted += len(line_batch)
|
|
163
|
+
cls.db.cursor.executemany(query_duplicates, line_batch)
|
|
164
|
+
line_batch = []
|
|
165
|
+
pbar.update(line_batchsize)
|
|
166
|
+
|
|
163
167
|
if line_batch:
|
|
164
168
|
total_inserted += len(line_batch)
|
|
165
169
|
cls.db.cursor.executemany(query_duplicates, line_batch)
|
|
@@ -167,8 +171,7 @@ class SnapshotCollection:
|
|
|
167
171
|
|
|
168
172
|
cls.db.conn.commit()
|
|
169
173
|
|
|
170
|
-
cls.
|
|
171
|
-
cls.FILTER_DUPLICATES = cls.CDX_TOTAL - cls.count_totals(unhandled=True) + cls.SNAPSHOT_FAULTY
|
|
174
|
+
cls.FILTER_DUPLICATES = cls.CDX_TOTAL - cls.count_totals(total=True)
|
|
172
175
|
|
|
173
176
|
|
|
174
177
|
|
|
@@ -181,7 +184,7 @@ class SnapshotCollection:
|
|
|
181
184
|
"""
|
|
182
185
|
row_batchsize = 2500
|
|
183
186
|
cls.db.cursor.execute("UPDATE snapshot_tbl SET response = NULL WHERE response = 'LOCK'") # reset locked to unprocessed
|
|
184
|
-
cls.db.cursor.execute("SELECT * FROM
|
|
187
|
+
cls.db.cursor.execute("SELECT * FROM csv_view WHERE response IS NOT NULL") # only write processed snapshots
|
|
185
188
|
headers = [description[0] for description in cls.db.cursor.description]
|
|
186
189
|
with open(csvfile, "w", encoding="utf-8") as f:
|
|
187
190
|
writer = csv.writer(f)
|
|
@@ -203,13 +206,15 @@ class SnapshotCollection:
|
|
|
203
206
|
Create indexes for the snapshot table.
|
|
204
207
|
"""
|
|
205
208
|
# index for filtering last snapshots
|
|
206
|
-
cls.
|
|
207
|
-
|
|
208
|
-
|
|
209
|
+
if cls.MODE_LAST:
|
|
210
|
+
cls.db.cursor.execute(
|
|
211
|
+
"CREATE INDEX IF NOT EXISTS idx_snapshot_tbl_url_origin_timestamp_desc ON snapshot_tbl(url_origin, timestamp DESC);"
|
|
212
|
+
)
|
|
209
213
|
# index for filtering first snapshots
|
|
210
|
-
cls.
|
|
211
|
-
|
|
212
|
-
|
|
214
|
+
if cls.MODE_FIRST:
|
|
215
|
+
cls.db.cursor.execute(
|
|
216
|
+
"CREATE INDEX IF NOT EXISTS idx_snapshot_tbl_url_origin_timestamp_asc ON snapshot_tbl(url_origin, timestamp ASC);"
|
|
217
|
+
)
|
|
213
218
|
# index for skippable snapshots
|
|
214
219
|
cls.db.cursor.execute(
|
|
215
220
|
"CREATE INDEX IF NOT EXISTS idx_snapshot_tbl_timestamp_url_origin_response ON snapshot_tbl(timestamp, url_origin);"
|
|
@@ -247,6 +252,26 @@ class SnapshotCollection:
|
|
|
247
252
|
"""
|
|
248
253
|
)
|
|
249
254
|
cls.FILTER_MODE = cls.db.cursor.rowcount
|
|
255
|
+
|
|
256
|
+
cls.db.cursor.execute(
|
|
257
|
+
"""
|
|
258
|
+
SELECT COUNT(*) FROM snapshot_tbl WHERE response IN ('404', '301')
|
|
259
|
+
"""
|
|
260
|
+
)
|
|
261
|
+
cls.FILTER_RESPONSE = cls.db.cursor.fetchone()[0]
|
|
262
|
+
|
|
263
|
+
cls.db.cursor.execute(
|
|
264
|
+
"""
|
|
265
|
+
WITH numbered AS (
|
|
266
|
+
SELECT rowid, ROW_NUMBER() OVER (ORDER BY rowid) AS rn
|
|
267
|
+
FROM snapshot_tbl
|
|
268
|
+
)
|
|
269
|
+
UPDATE snapshot_tbl
|
|
270
|
+
SET counter = (
|
|
271
|
+
SELECT rn FROM numbered WHERE numbered.rowid = snapshot_tbl.rowid
|
|
272
|
+
);
|
|
273
|
+
"""
|
|
274
|
+
)
|
|
250
275
|
|
|
251
276
|
cls.db.conn.commit()
|
|
252
277
|
|
|
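Note on the hunk above: after filtering, the rows that remain are renumbered into the new `counter` column and the 301/404 rows are counted into `FILTER_RESPONSE`. The sketch below reproduces both statements against an in-memory database. The table is reduced to three columns, the CTE column is aliased as `rid` for clarity, and SQLite 3.25 or newer is assumed for `ROW_NUMBER()`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Reduced, assumed schema: only the columns this example needs
cur.execute("CREATE TABLE snapshot_tbl (counter INT, url_archive TEXT, response TEXT, UNIQUE (url_archive))")
cur.executemany(
    "INSERT INTO snapshot_tbl (url_archive, response) VALUES (?, ?)",
    [("a", None), ("b", "404"), ("c", None), ("d", "301")],
)

# Simulate a row removed by the mode filter, leaving a gap in rowid
cur.execute("DELETE FROM snapshot_tbl WHERE url_archive = 'a'")

# Count rows that were pre-marked with a status code
cur.execute("SELECT COUNT(*) FROM snapshot_tbl WHERE response IN ('404', '301')")
filter_response = cur.fetchone()[0]

# Assign a gap-free 1..n counter in rowid order
cur.execute(
    """
    WITH numbered AS (
        SELECT rowid AS rid, ROW_NUMBER() OVER (ORDER BY rowid) AS rn
        FROM snapshot_tbl
    )
    UPDATE snapshot_tbl
    SET counter = (SELECT rn FROM numbered WHERE numbered.rid = snapshot_tbl.rowid)
    """
)
conn.commit()

counters = [row[0] for row in cur.execute("SELECT counter FROM snapshot_tbl ORDER BY rowid")]
```

A contiguous counter is what lets the worker messages print `counter/SNAPSHOT_TOTAL` without gaps, which is why `modify_snapshot` now matches on `counter`.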
@@ -259,13 +284,6 @@ class SnapshotCollection:
|
|
|
259
284
|
"""
|
|
260
285
|
If an existing csv-file for the job exists, the responses will be overwritten by the csv-content.
|
|
261
286
|
"""
|
|
262
|
-
cls.db.cursor.execute(
|
|
263
|
-
"""
|
|
264
|
-
UPDATE snapshot_tbl
|
|
265
|
-
SET response = NULL
|
|
266
|
-
"""
|
|
267
|
-
)
|
|
268
|
-
cls.db.conn.commit()
|
|
269
287
|
if not os.path.isfile(csvfile):
|
|
270
288
|
return
|
|
271
289
|
else:
|
|
@@ -327,16 +345,16 @@ class SnapshotCollection:
|
|
|
327
345
|
if unhandled:
|
|
328
346
|
return cls.db.cursor.execute("SELECT COUNT(rowid) FROM snapshot_tbl WHERE response IS NULL").fetchone()[0]
|
|
329
347
|
if success:
|
|
330
|
-
return cls.db.cursor.execute("SELECT COUNT(rowid) FROM snapshot_tbl WHERE file IS NOT NULL").fetchone()[0]
|
|
348
|
+
return cls.db.cursor.execute("SELECT COUNT(rowid) FROM snapshot_tbl WHERE file IS NOT NULL AND file != ''").fetchone()[0]
|
|
331
349
|
if fail:
|
|
332
|
-
return cls.db.cursor.execute("SELECT COUNT(rowid) FROM snapshot_tbl WHERE file IS NULL").fetchone()[0]
|
|
350
|
+
return cls.db.cursor.execute("SELECT COUNT(rowid) FROM snapshot_tbl WHERE file IS NULL OR file = ''").fetchone()[0]
|
|
333
351
|
|
|
334
352
|
@staticmethod
|
|
335
353
|
def modify_snapshot(connection, snapshot_id, column, value):
|
|
336
354
|
"""
|
|
337
355
|
Modify a snapshot-row in the snapshot table.
|
|
338
356
|
"""
|
|
339
|
-
query = f"UPDATE snapshot_tbl SET {column} = ? WHERE
|
|
357
|
+
query = f"UPDATE snapshot_tbl SET {column} = ? WHERE counter = ?"
|
|
340
358
|
connection.cursor.execute(query, (value, snapshot_id))
|
|
341
359
|
connection.conn.commit()
|
|
342
360
|
|
|
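Note on the hunk above: the `success` and `fail` counts now treat an empty string in `file` the same as `NULL`. A small sketch of the difference, using a reduced two-column table that is an assumption for this example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE snapshot_tbl (url_archive TEXT, file TEXT)")
cur.executemany(
    "INSERT INTO snapshot_tbl VALUES (?, ?)",
    [("a", "/tmp/a.css"), ("b", None), ("c", "")],  # downloaded, untouched, empty path
)

# 3.2.1 behavior: an empty string still counted as a success
old_success = cur.execute(
    "SELECT COUNT(rowid) FROM snapshot_tbl WHERE file IS NOT NULL"
).fetchone()[0]

# 3.3.1 behavior: empty strings are counted as failures
success = cur.execute(
    "SELECT COUNT(rowid) FROM snapshot_tbl WHERE file IS NOT NULL AND file != ''"
).fetchone()[0]
fail = cur.execute(
    "SELECT COUNT(rowid) FROM snapshot_tbl WHERE file IS NULL OR file = ''"
).fetchone()[0]
```

With the new predicates, `success + fail` always equals the number of rows, since every row matches exactly one of the two conditions.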
@@ -37,7 +37,7 @@ class Worker:
|
|
|
37
37
|
self.snapshot = sc.get_snapshot(self.db)
|
|
38
38
|
if not self.snapshot:
|
|
39
39
|
return
|
|
40
|
-
self.
|
|
40
|
+
self.counter = self.snapshot["counter"]
|
|
41
41
|
self.timestamp = self.snapshot["timestamp"]
|
|
42
42
|
self.url_archive = self.snapshot["url_archive"]
|
|
43
43
|
self.url_origin = self.snapshot["url_origin"]
|
|
@@ -64,7 +64,7 @@ class Worker:
|
|
|
64
64
|
if self.redirect_timestamp is None and value is None:
|
|
65
65
|
return
|
|
66
66
|
self._redirect_url = value
|
|
67
|
-
sc.modify_snapshot(self.db, self.
|
|
67
|
+
sc.modify_snapshot(self.db, self.counter, "redirect_url", value)
|
|
68
68
|
|
|
69
69
|
@property
|
|
70
70
|
def redirect_timestamp(self):
|
|
@@ -75,7 +75,7 @@ class Worker:
|
|
|
75
75
|
if self.redirect_url is None and value is None:
|
|
76
76
|
return
|
|
77
77
|
self._redirect_timestamp = value
|
|
78
|
-
sc.modify_snapshot(self.db, self.
|
|
78
|
+
sc.modify_snapshot(self.db, self.counter, "redirect_timestamp", value)
|
|
79
79
|
|
|
80
80
|
@property
|
|
81
81
|
def response(self):
|
|
@@ -86,7 +86,7 @@ class Worker:
|
|
|
86
86
|
if self.redirect_url is None and value is None:
|
|
87
87
|
return
|
|
88
88
|
self._response = value
|
|
89
|
-
sc.modify_snapshot(self.db, self.
|
|
89
|
+
sc.modify_snapshot(self.db, self.counter, "response", value)
|
|
90
90
|
|
|
91
91
|
@property
|
|
92
92
|
def file(self):
|
|
@@ -97,7 +97,7 @@ class Worker:
|
|
|
97
97
|
if self.redirect_url is None and value is None:
|
|
98
98
|
return
|
|
99
99
|
self._file = value
|
|
100
|
-
sc.modify_snapshot(self.db, self.
|
|
100
|
+
sc.modify_snapshot(self.db, self.counter, "file", value)
|
|
101
101
|
|
|
102
102
|
|
|
103
103
|
class Message(Worker):
|
|
@@ -141,14 +141,15 @@ class Message(Worker):
|
|
|
141
141
|
"verbose": True,
|
|
142
142
|
"content": _format_verbose({"result": result, "info": info, "content": content}),
|
|
143
143
|
}
|
|
144
|
+
self.buffer.append(self.message)
|
|
144
145
|
if verbose is False or verbose is None:
|
|
145
146
|
result = result + " - " if result else ""
|
|
146
147
|
content = content + " - " if content else ""
|
|
147
148
|
self.message = {
|
|
148
149
|
"verbose": False,
|
|
149
|
-
"content": f"{self.worker.
|
|
150
|
+
"content": f"{self.worker.counter}/{sc.SNAPSHOT_TOTAL} - W:{self.worker.id} - {result}{content}{self.worker.timestamp} - {self.worker.url_origin}",
|
|
150
151
|
}
|
|
151
|
-
|
|
152
|
+
self.buffer.append(self.message)
|
|
152
153
|
|
|
153
154
|
def write(self):
|
|
154
155
|
"""
|
|
@@ -47,7 +47,7 @@ def startup():
|
|
|
47
47
|
|
|
48
48
|
|
|
49
49
|
|
|
50
|
-
def query_list(csvfile: str, cdxfile: str,queryrange: int,limit: int,start: int,end: int,explicit: bool,filter_filetype: list):
|
|
50
|
+
def query_list(csvfile: str, cdxfile: str,queryrange: int,limit: int,start: int,end: int,explicit: bool,filter_filetype: list,filter_statuscode: list):
|
|
51
51
|
|
|
52
52
|
def inject(cdxinject: str) -> bool:
|
|
53
53
|
if os.path.isfile(cdxinject):
|
|
@@ -60,7 +60,7 @@ def query_list(csvfile: str, cdxfile: str,queryrange: int,limit: int,start: int,
|
|
|
60
60
|
)
|
|
61
61
|
return False
|
|
62
62
|
|
|
63
|
-
def create_query(queryrange: int, limit: int, filter_filetype: list, start: int, end: int, explicit: bool) -> str:
|
|
63
|
+
def create_query(queryrange: int, limit: int, filter_filetype: list, filter_statuscode: list, start: int, end: int, explicit: bool) -> str:
|
|
64
64
|
if queryrange:
|
|
65
65
|
query_range = f"&from={datetime.now().year - queryrange}"
|
|
66
66
|
else:
|
|
@@ -81,9 +81,10 @@ def query_list(csvfile: str, cdxfile: str,queryrange: int,limit: int,start: int,
|
|
|
81
81
|
|
|
82
82
|
limit = f"&limit={limit}" if limit else ""
|
|
83
83
|
|
|
84
|
+
filter_statuscode = (f"&filter=statuscode:({'|'.join(filter_statuscode)})$" if filter_statuscode else "")
|
|
84
85
|
filter_filetype = (f"&filter=original:.*\\.({'|'.join(filter_filetype)})$" if filter_filetype else "")
|
|
85
86
|
|
|
86
|
-
cdxquery = f"https://web.archive.org/cdx/search/cdx?output=json&url={cdx_url}{query_range}&fl=timestamp,digest,mimetype,statuscode,original{limit}{filter_filetype}"
|
|
87
|
+
cdxquery = f"https://web.archive.org/cdx/search/cdx?output=json&url={cdx_url}{query_range}&fl=timestamp,digest,mimetype,statuscode,original{limit}{filter_filetype}{filter_statuscode}"
|
|
87
88
|
|
|
88
89
|
return cdxquery
|
|
89
90
|
|
|
@@ -111,9 +112,10 @@ def query_list(csvfile: str, cdxfile: str,queryrange: int,limit: int,start: int,
|
|
|
111
112
|
|
|
112
113
|
cdxinject = inject(cdxfile)
|
|
113
114
|
if not cdxinject:
|
|
114
|
-
cdxquery = create_query(queryrange, limit, filter_filetype, start, end, explicit)
|
|
115
|
+
cdxquery = create_query(queryrange, limit, filter_filetype, filter_statuscode, start, end, explicit)
|
|
115
116
|
cdxfile = run_query(cdxfile, cdxquery)
|
|
116
117
|
sc.process_cdx(cdxfile, csvfile)
|
|
118
|
+
sc.calculate()
|
|
117
119
|
|
|
118
120
|
|
|
119
121
|
|
|
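Note on the hunk above: `create_query` now appends a second `&filter=` clause for status codes after the existing filetype clause. The sketch below isolates only the filter part of the URL; the function name `build_cdx_filters` is hypothetical, while the two format strings are copied from the diff.

```python
def build_cdx_filters(filter_filetype=None, filter_statuscode=None):
    """Return the filter suffix that is appended to the CDX query URL."""
    statuscode_part = (
        f"&filter=statuscode:({'|'.join(filter_statuscode)})$" if filter_statuscode else ""
    )
    filetype_part = (
        f"&filter=original:.*\\.({'|'.join(filter_filetype)})$" if filter_filetype else ""
    )
    # Same order as in the diff: filetype first, then statuscode
    return f"{filetype_part}{statuscode_part}"


both = build_cdx_filters(["jpg", "css"], ["200", "301"])
status_only = build_cdx_filters(filter_statuscode=["404"])
nothing = build_cdx_filters()
```

Both inputs are expected to be lists of strings, which is what the comma-splitting in `Configuration.init` produces.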
@@ -131,7 +133,7 @@ def download_list(output, retry, no_redirect, delay, workers):
|
|
|
131
133
|
threads = []
|
|
132
134
|
for i in range(workers):
|
|
133
135
|
worker = Worker(id=i + 1)
|
|
134
|
-
vb.write(verbose=True, content=f"\n-----> Starting
|
|
136
|
+
vb.write(verbose=True, content=f"\n-----> Starting Worker: {worker.id}")
|
|
135
137
|
thread = threading.Thread(target=download_loop, args=(worker, output, retry, no_redirect, delay))
|
|
136
138
|
threads.append(thread)
|
|
137
139
|
thread.start()
|
|
@@ -163,7 +165,7 @@ def download_loop(worker, output, retry, no_redirect, delay):
|
|
|
163
165
|
|
|
164
166
|
while worker.attempt <= retry_max_attempt: # retry as given by user
|
|
165
167
|
|
|
166
|
-
worker.message.store(verbose=True, content=f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.
|
|
168
|
+
worker.message.store(verbose=True, content=f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.counter}/{sc.SNAPSHOT_TOTAL}]")
|
|
167
169
|
download_attempt = 1
|
|
168
170
|
download_max_attempt = 3
|
|
169
171
|
|
|
@@ -180,11 +182,11 @@ def download_loop(worker, output, retry, no_redirect, delay):
|
|
|
180
182
|
download_attempt += 1 # try again 2x with same connection
|
|
181
183
|
vb.write(
|
|
182
184
|
verbose=True,
|
|
183
|
-
content=f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.
|
|
185
|
+
content=f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.counter}/{sc.SNAPSHOT_TOTAL}] - {e.__class__.__name__} - requesting again in 50 seconds...",
|
|
184
186
|
)
|
|
185
187
|
vb.write(
|
|
186
188
|
verbose=False,
|
|
187
|
-
content=f"Worker: {worker.id} - Snapshot {worker.
|
|
189
|
+
content=f"Worker: {worker.id} - Snapshot {worker.counter}/{sc.SNAPSHOT_TOTAL} - requesting again in 50 seconds...",
|
|
188
190
|
)
|
|
189
191
|
time.sleep(50)
|
|
190
192
|
continue
|
|
@@ -195,17 +197,17 @@ def download_loop(worker, output, retry, no_redirect, delay):
|
|
|
195
197
|
download_attempt = download_max_attempt # try again 1x with new connection
|
|
196
198
|
vb.write(
|
|
197
199
|
verbose=True,
|
|
198
|
-
content=f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.
|
|
200
|
+
content=f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.counter}/{sc.SNAPSHOT_TOTAL}] - {e.__class__.__name__} - renewing connection in 15 seconds...",
|
|
199
201
|
)
|
|
200
202
|
vb.write(
|
|
201
203
|
verbose=False,
|
|
202
|
-
content=f"Worker: {worker.id} - Snapshot {worker.
|
|
204
|
+
content=f"Worker: {worker.id} - Snapshot {worker.counter}/{sc.SNAPSHOT_TOTAL} - renewing connection in 15 seconds...",
|
|
203
205
|
)
|
|
204
206
|
time.sleep(15)
|
|
205
207
|
worker.refresh_connection()
|
|
206
208
|
continue
|
|
207
209
|
else:
|
|
208
|
-
ex.exception(f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.
|
|
210
|
+
ex.exception(f"\n-----> Worker: {worker.id} - Attempt: [{worker.attempt}/{retry_max_attempt}] Snapshot ID: [{worker.counter}/{sc.SNAPSHOT_TOTAL}] - EXCEPTION - {e}", e=e)
|
|
209
211
|
worker.attempt = retry_max_attempt
|
|
210
212
|
break
|
|
211
213
|
|
|
@@ -19,6 +19,7 @@ class Database:
|
|
|
19
19
|
filter_complete INTEGER
|
|
20
20
|
)"""
|
|
21
21
|
snapshot_table = """CREATE TABLE IF NOT EXISTS snapshot_tbl (
|
|
22
|
+
counter INT,
|
|
22
23
|
timestamp TEXT,
|
|
23
24
|
url_archive TEXT,
|
|
24
25
|
url_origin TEXT,
|
|
@@ -28,6 +29,18 @@ class Database:
|
|
|
28
29
|
file TEXT,
|
|
29
30
|
UNIQUE (url_archive)
|
|
30
31
|
)"""
|
|
32
|
+
csv_view = """CREATE VIEW IF NOT EXISTS csv_view
|
|
33
|
+
AS
|
|
34
|
+
SELECT
|
|
35
|
+
timestamp AS timestamp,
|
|
36
|
+
url_archive AS url_archive,
|
|
37
|
+
url_origin AS url_origin,
|
|
38
|
+
redirect_url AS redirect_url,
|
|
39
|
+
redirect_timestamp AS redirect_timestamp,
|
|
40
|
+
response AS response,
|
|
41
|
+
file AS file
|
|
42
|
+
FROM snapshot_tbl;
|
|
43
|
+
"""
|
|
31
44
|
|
|
32
45
|
QUERY_EXIST = False
|
|
33
46
|
QUERY_PROGRESS = "0 / 0"
|
|
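Note on the hunk above: the new `csv_view` exposes every `snapshot_tbl` column except `counter`, and `csv_create` now selects from it with `response IS NOT NULL`. The sketch below shows the effect on the CSV header and rows. The column types for the columns elided in the diff are assumptions, and the CSV is written to an in-memory buffer instead of a file.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    """CREATE TABLE snapshot_tbl (
        counter INT, timestamp TEXT, url_archive TEXT, url_origin TEXT,
        redirect_url TEXT, redirect_timestamp TEXT, response TEXT, file TEXT,
        UNIQUE (url_archive))"""
)
cur.execute(
    """CREATE VIEW IF NOT EXISTS csv_view AS
       SELECT timestamp, url_archive, url_origin, redirect_url,
              redirect_timestamp, response, file
       FROM snapshot_tbl"""
)
cur.executemany(
    "INSERT INTO snapshot_tbl (counter, timestamp, url_archive, url_origin, response, file) VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "20240225193302", "archive-a", "https://example.com/a", "200", "/tmp/a"),
        (2, "20240225193302", "archive-b", "https://example.com/b", None, None),  # not processed yet
    ],
)

# Only processed snapshots end up in the CSV
cur.execute("SELECT * FROM csv_view WHERE response IS NOT NULL")
headers = [description[0] for description in cur.description]
rows = cur.fetchall()

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(headers)
writer.writerows(rows)
```

Going through the view keeps the internal `counter` column out of the exported CSV without listing the columns again in `csv_create`.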
@@ -38,6 +51,7 @@ class Database:
|
|
|
38
51
|
db = Database()
|
|
39
52
|
db.cursor.execute(cls.waybackup_table)
|
|
40
53
|
db.cursor.execute(cls.snapshot_table)
|
|
54
|
+
db.cursor.execute(cls.csv_view)
|
|
41
55
|
db.cursor.execute("SELECT query_identifier FROM waybackup_table WHERE query_identifier = ?", (query_identifier,))
|
|
42
56
|
if db.cursor.fetchone():
|
|
43
57
|
cls.QUERY_EXIST = True
|
|
@@ -29,7 +29,7 @@ def main():
|
|
|
29
29
|
archive_download.startup()
|
|
30
30
|
|
|
31
31
|
try:
|
|
32
|
-
archive_download.query_list(config.csvfile, config.cdxfile, config.range, config.limit, config.start, config.end, config.explicit, config.filetype)
|
|
32
|
+
archive_download.query_list(config.csvfile, config.cdxfile, config.range, config.limit, config.start, config.end, config.explicit, config.filetype, config.statuscode)
|
|
33
33
|
archive_download.download_list(config.output, config.retry, config.no_redirect, config.delay, config.workers)
|
|
34
34
|
except KeyboardInterrupt:
|
|
35
35
|
print("\nInterrupted by user\n")
|
|
@@ -38,7 +38,7 @@ def main():
|
|
|
38
38
|
|
|
39
39
|
except Exception as e:
|
|
40
40
|
config.keep = True
|
|
41
|
-
ex.exception(
|
|
41
|
+
ex.exception(message="", e=e)
|
|
42
42
|
|
|
43
43
|
finally:
|
|
44
44
|
sc.csv_create(config.csvfile)
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: pywaybackup
|
|
3
|
-
Version: 3.
|
|
3
|
+
Version: 3.3.1
|
|
4
4
|
Summary: Query and download archive.org as simple as possible.
|
|
5
5
|
Author-email: bitdruid <bitdruid@outlook.com>
|
|
6
6
|
License: MIT License
|
|
@@ -55,16 +55,16 @@ This tool allows you to download content from the Wayback Machine (archive.org).
|
|
|
55
55
|
### Pip
|
|
56
56
|
|
|
57
57
|
1. Install the package <br>
|
|
58
|
-
|
|
58
|
+
`pip install pywaybackup`
|
|
59
59
|
2. Run the tool <br>
|
|
60
|
-
|
|
60
|
+
`waybackup -h`
|
|
61
61
|
|
|
62
62
|
### Manual
|
|
63
63
|
|
|
64
64
|
1. Clone the repository <br>
|
|
65
|
-
|
|
65
|
+
`git clone https://github.com/bitdruid/python-wayback-machine-downloader.git`
|
|
66
66
|
2. Install <br>
|
|
67
|
-
|
|
67
|
+
`pip install .`
|
|
68
68
|
- in a virtual env or use `--break-system-package`
|
|
69
69
|
|
|
70
70
|
## notes / issues / hints
|
|
@@ -88,6 +88,7 @@ This tool allows you to download content from the Wayback Machine (archive.org).
|
|
|
88
88
|
The URL of the web page to download. This argument is required.
|
|
89
89
|
|
|
90
90
|
#### Mode Selection (Choose One)
|
|
91
|
+
|
|
91
92
|
- **`-a`**, **`--all`**:<br>
|
|
92
93
|
Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
|
|
93
94
|
- **`-l`**, **`--last`**:<br>
|
|
@@ -102,66 +103,67 @@ This tool allows you to download content from the Wayback Machine (archive.org).
|
|
|
102
103
|
- **`-e`**, **`--explicit`**:<br>
|
|
103
104
|
Only download the explicit given URL. No wildcard subdomains or paths. Use e.g. to get root-only snapshots. This is recommended for explicit files like `login.html` or `?query=this`.
|
|
104
105
|
|
|
105
|
-
- **`--filetype`** `<filetype>`:<br>
|
|
106
|
-
Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
|
|
107
|
-
|
|
108
106
|
- **`--limit`** `<count>`:<br>
|
|
109
|
-
Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
|
|
107
|
+
Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
|
|
110
108
|
|
|
111
109
|
- **Range Selection:**<br>
|
|
112
110
|
Specify the range in years or a specific timestamp either start, end, or both. If you specify the `range` argument, the `start` and `end` arguments will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.<br>
|
|
113
111
|
(year 2019, year 2019, year+month+day 20190101, year+month+day+hour 2019010112)
|
|
114
|
-
|
|
115
|
-
|
|
116
|
-
|
|
117
|
-
|
|
118
|
-
|
|
119
|
-
|
|
112
|
+
|
|
113
|
+
- **`-r`**, **`--range`**:<br>
|
|
114
|
+
Specify the range in years for which to search and download snapshots.
|
|
115
|
+
- **`--start`**:<br>
|
|
116
|
+
Timestamp to start searching.
|
|
117
|
+
- **`--end`**:<br>
|
|
118
|
+
Timestamp to end searching.
|
|
119
|
+
|
|
120
|
+
- **Filtering:**<br>
|
|
121
|
+
A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter.
|
|
122
|
+
|
|
123
|
+
- **`--filetype`** `<filetype>`:<br>
|
|
124
|
+
Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
|
|
125
|
+
|
|
126
|
+
- **`--statuscode`** `<statuscode>`:<br>
|
|
127
|
+
Specify HTTP status codes to download. Default is all statuscodes. Separate multiple status codes with a comma. Example: `--statuscode 200,301`. Pywaybackup will try to download any snapshot regardless of it's statuscode. For 404 of course this means logged errors and corresponding entries in the csv. However, you may want to get a csv that includes these negative attempts for your needs.<br>
|
|
128
|
+
Common status codes you may want to handle/filter:
|
|
129
|
+
- `200` (OK)
|
|
130
|
+
- `301` (Moved Permanently - will redirect snapshot)
|
|
131
|
+
- `404` (Not Found - snapshot seems to be empty)
|
|
132
|
+
- `500` (Internal Server Error - snapshot is at least for now not available)
|
|
120
133
|
|
|
121
134
|
### Optional
|
|
122
135
|
|
|
123
136
|
#### Behavior Manipulation
|
|
124
137
|
|
|
125
138
|
- **`-o`**, **`--output`**:<br>
|
|
126
|
-
Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
|
|
139
|
+
Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
|
|
127
140
|
|
|
128
141
|
- **`-m`**, **`--metadata`**<br>
|
|
129
|
-
Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
|
|
142
|
+
Change the folder where metadata will be saved (`cdx`/`db`/`csv`/`log`). Especially if you are downloading into a network share, you SHOULD set this to a local path because sqlite locking mechanism may cause issues with network shares.
|
|
143
|
+
|
|
144
|
+
- **`--verbose`**:<br>
|
|
145
|
+
Increase output verbosity.
|
|
130
146
|
|
|
131
147
|
<!-- - **`--verbosity`** `<level>`:<br>
|
|
132
148
|
Sets verbosity level. Options are `info`and `trace`. Default is `info`. -->
|
|
133
149
|
|
|
134
150
|
- **`--log`** <!-- `<path>` -->:<br>
|
|
135
|
-
Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
|
|
151
|
+
Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
|
|
136
152
|
|
|
137
153
|
- **`--progress`**:<br>
|
|
138
|
-
Shows a progress bar instead of the default output.
|
|
154
|
+
Shows a progress bar instead of the default output.
|
|
139
155
|
|
|
140
156
|
- **`--workers`** `<count>`:<br>
|
|
141
|
-
Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
|
|
157
|
+
Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
|
|
142
158
|
|
|
143
159
|
- **`--no-redirect`**:<br>
|
|
144
|
-
Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
|
|
160
|
+
Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
|
|
145
161
|
|
|
146
162
|
- **`--retry`** `<attempts>`:<br>
|
|
147
|
-
Specifies number of retry attempts for failed downloads.
|
|
163
|
+
Specifies number of retry attempts for failed downloads.
|
|
148
164
|
|
|
149
165
|
- **`--delay`** `<seconds>`:<br>
|
|
150
|
-
Specifies delay between download requests in seconds. Default is no delay (0).
|
|
151
|
-
|
|
152
|
-
- **`--verbose`**:<br>
|
|
153
|
-
Increase output verbosity.
|
|
154
|
-
- verbose:
|
|
155
|
-
```
|
|
156
|
-
-----> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
|
|
157
|
-
SUCCESS -> 200 OK
|
|
158
|
-
-> URL: https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
|
|
159
|
-
-> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css
|
|
160
|
-
```
|
|
161
|
-
- non-verbose:
|
|
162
|
-
```
|
|
163
|
-
55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
|
|
164
|
-
```
|
|
166
|
+
Specifies delay between download requests in seconds. Default is no delay (0).
|
|
165
167
|
|
|
166
168
|
<!-- - **`--convert-links`**:<br>
|
|
167
169
|
If set, all links in the downloaded files will be converted to local links. This is useful for offline browsing. The links are converted to the local path structure. Show output with `--verbosity trace`. -->
|
|
@@ -186,14 +188,16 @@ If set, all links in the downloaded files will be converted to local links. This
 - Detects existing `.cdx` and `.db` files in an `output dir` to resume downloading from the last successful point.
 - Compares `URL`, `mode`, and `optional query parameters` to ensure automatic resumption.
 - Skips previously downloaded files to save time.
-> **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
+> **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
 
 #### Resetting a Job (`--reset`)
+
 - Deletes `.cdx` and `.db` files and restarts the process from scratch.
 - Does **not** remove already downloaded files.
 - `waybackup -u https://example.com -a --reset`
 
 #### Keeping Job Data (`--keep`)
+
 - Normally, `.cdx` and `.db` files are deleted after a successful job.
 - `--keep` preserves them for future re-analysis or extending the query.
 - `waybackup -u https://example.com -a --keep`
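The `--reset` behaviour described in this hunk (delete the `.cdx` and `.db` job files, keep the downloaded files) can be illustrated with a minimal sketch. This is not pywaybackup's implementation: the function name `reset_job` is hypothetical, and the sketch assumes the job files sit directly inside the output directory.

```python
from pathlib import Path

# File types the README names as job state; everything else is left alone.
JOB_SUFFIXES = {".cdx", ".db"}


def reset_job(output_dir: Path) -> list[Path]:
    """Delete job-state files directly inside output_dir (assumed layout).

    Downloaded snapshot files and subdirectories are not touched.
    Returns the paths that were removed.
    """
    removed = []
    # Materialize the listing first so deletion does not disturb iteration.
    for entry in sorted(output_dir.iterdir()):
        if entry.is_file() and entry.suffix in JOB_SUFFIXES:
            entry.unlink()
            removed.append(entry)
    return removed
```

Calling `reset_job(Path("your/path/waybackup_snapshots"))` would then leave only the downloaded content behind, which matches the "does not remove already downloaded files" rule above.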
@@ -204,13 +208,13 @@ If set, all links in the downloaded files will be converted to local links. This
 ## Examples
 
 1. Download a specific single snapshot of all available files (starting from root):<br>
-`waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000`
+`waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000`
 2. Download a specific single snapshot of all available files (starting from a subdirectory):<br>
-`waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000`
+`waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000`
 3. Download a specific single snapshot of the exact given URL (no subdirs):<br>
-`waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit`
+`waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit`
 4. Download all snapshots of all available files in the given range:<br>
-`waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000`
+`waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000`
 
 <br>
 <br>
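The `--start` and `--end` values in these examples are 14-digit Wayback Machine timestamps (`YYYYMMDDhhmmss`). A small sketch of how such values could be validated before building a command line follows. The helper names are hypothetical and not part of pywaybackup, and the sketch only handles the full 14-digit form shown in the examples.

```python
from datetime import datetime


def parse_wayback_timestamp(ts: str) -> datetime:
    """Parse a 14-digit YYYYMMDDhhmmss timestamp; raise ValueError otherwise."""
    if len(ts) != 14 or not ts.isdigit():
        raise ValueError(f"expected 14 digits, got {ts!r}")
    # strptime also rejects impossible dates such as month 13.
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def is_single_snapshot_range(start: str, end: str) -> bool:
    """True when start and end name the same instant, as in examples 1 to 3."""
    return parse_wayback_timestamp(start) == parse_wayback_timestamp(end)
```

With the values above, `is_single_snapshot_range("20210101000000", "20210101000000")` is true, while the range in example 4 spans several years.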
@@ -223,7 +227,9 @@ The output path is currently structured as follows by an example for the query:<
 `http://example.com/subdir1/subdir2/assets/`
 <br><br>
 For the first and last version (`-f` or `-l`):
+
 - Will only include all files/folders starting from your query-path.
+
 ```
 your/path/waybackup_snapshots/
 └── the_root_of_your_query/ (example.com/)
@@ -234,8 +240,11 @@ your/path/waybackup_snapshots/
 ├── style.css
 ...
 ```
+
 For all versions (`-a`):
+
 - Will create a folder named as the root of your query. Inside this folder, you will find all timestamps and per timestamp the path you requested.
+
 ```
 your/path/waybackup_snapshots/
 └── the_root_of_your_query/ (example.com/)
@@ -276,6 +285,23 @@ For download queries:
 ]
 ```
 
+### Log
+
+Verbose:
+
+```
+-----> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
+SUCCESS -> 200 OK
+-> URL: https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
+-> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css
+```
+
+Non-verbose:
+
+```
+55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
+```
+
 ### Debugging
 
 Exceptions will be written into `waybackup_error.log` (each run overwrites the file).
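The relocated `### Log` section shows one non-verbose line and the matching verbose `URL` and `FILE` entries. The sketch below parses a non-verbose line and rebuilds the snapshot URL and relative output path in the shape of those samples. It is inferred only from the two log examples: the names `LogRecord`, `parse_log_line`, `snapshot_url` and `local_path` are hypothetical, and status values other than `SUCCESS` are an assumption.

```python
import re
from pathlib import PurePosixPath
from typing import NamedTuple
from urllib.parse import urlsplit

# Shape inferred from the sample: "55/81 - W:2 - SUCCESS - <timestamp> - <url>"
LINE_RE = re.compile(
    r"^(?P<done>\d+)/(?P<total>\d+) - W:(?P<worker>\d+) - "
    r"(?P<status>[A-Z]+) - (?P<timestamp>\d{14}) - (?P<url>\S+)$"
)


class LogRecord(NamedTuple):
    done: int
    total: int
    worker: int
    status: str
    timestamp: str
    url: str


def parse_log_line(line: str) -> LogRecord:
    """Parse one non-verbose log line; raise ValueError if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    return LogRecord(int(m["done"]), int(m["total"]), int(m["worker"]),
                     m["status"], m["timestamp"], m["url"])


def snapshot_url(rec: LogRecord) -> str:
    """Archive URL in the form shown by the verbose '-> URL:' sample."""
    return f"https://web.archive.org/web/{rec.timestamp}id_/{rec.url}"


def local_path(rec: LogRecord, root: str = "waybackup_snapshots") -> PurePosixPath:
    """Relative path in the form shown by the verbose '-> FILE:' sample."""
    parts = urlsplit(rec.url)
    return PurePosixPath(root) / parts.netloc / f"{rec.timestamp}id_" / parts.path.lstrip("/")
```

Feeding the non-verbose sample line through `parse_log_line` and then `snapshot_url` reproduces the URL from the verbose sample, which makes the two formats easy to cross-check when reading logs.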
File without changes