pywaybackup 3.1.0__tar.gz → 3.3.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {pywaybackup-3.1.0/pywaybackup.egg-info → pywaybackup-3.3.0}/PKG-INFO +134 -74
- {pywaybackup-3.1.0 → pywaybackup-3.3.0}/README.md +127 -69
- {pywaybackup-3.1.0 → pywaybackup-3.3.0}/pyproject.toml +5 -4
- {pywaybackup-3.1.0 → pywaybackup-3.3.0}/pywaybackup/Arguments.py +74 -19
- {pywaybackup-3.1.0 → pywaybackup-3.3.0}/pywaybackup/Converter.py +10 -10
- {pywaybackup-3.1.0 → pywaybackup-3.3.0}/pywaybackup/Exception.py +13 -18
- {pywaybackup-3.1.0 → pywaybackup-3.3.0}/pywaybackup/SnapshotCollection.py +159 -87
- pywaybackup-3.3.0/pywaybackup/Verbosity.py +93 -0
- pywaybackup-3.3.0/pywaybackup/Worker.py +159 -0
- pywaybackup-3.3.0/pywaybackup/archive_download.py +336 -0
- {pywaybackup-3.1.0 → pywaybackup-3.3.0}/pywaybackup/archive_save.py +19 -19
- {pywaybackup-3.1.0 → pywaybackup-3.3.0}/pywaybackup/db.py +14 -0
- {pywaybackup-3.1.0 → pywaybackup-3.3.0}/pywaybackup/helper.py +7 -7
- {pywaybackup-3.1.0 → pywaybackup-3.3.0}/pywaybackup/main.py +2 -2
- {pywaybackup-3.1.0 → pywaybackup-3.3.0/pywaybackup.egg-info}/PKG-INFO +134 -74
- {pywaybackup-3.1.0 → pywaybackup-3.3.0}/pywaybackup.egg-info/SOURCES.txt +1 -0
- {pywaybackup-3.1.0 → pywaybackup-3.3.0}/pywaybackup.egg-info/requires.txt +4 -3
- pywaybackup-3.1.0/pywaybackup/Verbosity.py +0 -121
- pywaybackup-3.1.0/pywaybackup/archive_download.py +0 -332
- {pywaybackup-3.1.0 → pywaybackup-3.3.0}/LICENSE +0 -0
- {pywaybackup-3.1.0 → pywaybackup-3.3.0}/pywaybackup/__init__.py +0 -0
- {pywaybackup-3.1.0 → pywaybackup-3.3.0}/pywaybackup.egg-info/dependency_links.txt +0 -0
- {pywaybackup-3.1.0 → pywaybackup-3.3.0}/pywaybackup.egg-info/entry_points.txt +0 -0
- {pywaybackup-3.1.0 → pywaybackup-3.3.0}/pywaybackup.egg-info/top_level.txt +0 -0
- {pywaybackup-3.1.0 → pywaybackup-3.3.0}/setup.cfg +0 -0
|
@@ -1,6 +1,6 @@
|
|
|
1
|
-
Metadata-Version: 2.
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
2
|
Name: pywaybackup
|
|
3
|
-
Version: 3.
|
|
3
|
+
Version: 3.3.0
|
|
4
4
|
Summary: Query and download archive.org as simple as possible.
|
|
5
5
|
Author-email: bitdruid <bitdruid@outlook.com>
|
|
6
6
|
License: MIT License
|
|
@@ -29,18 +29,19 @@ Project-URL: homepage, https://github.com/bitdruid/python-wayback-machine-downlo
|
|
|
29
29
|
Requires-Python: >=3.8
|
|
30
30
|
Description-Content-Type: text/markdown
|
|
31
31
|
License-File: LICENSE
|
|
32
|
-
Requires-Dist: pysqlite3-binary==0.5.4
|
|
33
|
-
Requires-Dist:
|
|
34
|
-
Requires-Dist:
|
|
32
|
+
Requires-Dist: pysqlite3-binary==0.5.4; sys_platform == "linux"
|
|
33
|
+
Requires-Dist: pysqlite-binary; sys_platform == "win32"
|
|
34
|
+
Requires-Dist: requests==2.32.3
|
|
35
|
+
Requires-Dist: tqdm==4.67.1
|
|
35
36
|
Requires-Dist: python-magic==0.4.27; sys_platform == "linux"
|
|
36
37
|
Requires-Dist: python-magic-bin==0.4.14; sys_platform == "win32"
|
|
38
|
+
Dynamic: license-file
|
|
37
39
|
|
|
38
40
|
# python wayback machine downloader
|
|
39
41
|
|
|
40
42
|
[](https://pypi.org/project/pywaybackup/)
|
|
41
43
|
[](https://pypi.org/project/pywaybackup/)
|
|
42
44
|

|
|
43
|
-
<!--  -->
|
|
44
45
|
[](https://opensource.org/licenses/MIT)
|
|
45
46
|
|
|
46
47
|
Downloading archived web pages from the [Wayback Machine](https://archive.org/web/).
|
|
@@ -54,23 +55,27 @@ This tool allows you to download content from the Wayback Machine (archive.org).
|
|
|
54
55
|
### Pip
|
|
55
56
|
|
|
56
57
|
1. Install the package <br>
|
|
57
|
-
|
|
58
|
+
`pip install pywaybackup`
|
|
58
59
|
2. Run the tool <br>
|
|
59
|
-
|
|
60
|
+
`waybackup -h`
|
|
60
61
|
|
|
61
62
|
### Manual
|
|
62
63
|
|
|
63
64
|
1. Clone the repository <br>
|
|
64
|
-
|
|
65
|
+
`git clone https://github.com/bitdruid/python-wayback-machine-downloader.git`
|
|
65
66
|
2. Install <br>
|
|
66
|
-
|
|
67
|
+
`pip install .`
|
|
67
68
|
- in a virtual env or use `--break-system-packages`
|
|
68
69
|
|
|
69
|
-
##
|
|
70
|
+
## notes / issues / hints
|
|
70
71
|
|
|
71
|
-
- Linux recommended: On Windows machines, the path length is limited.
|
|
72
|
-
- If you query an explicit file (e.g. a query-string `?query=this` or `login.html`), the `--explicit`-argument is recommended as a wildcard query may lead to an empty result.
|
|
72
|
+
- Linux recommended: On Windows machines, the path length is limited. Files that exceed the path length will not be downloaded.
|
|
73
73
|
- The tool uses a sqlite database to handle snapshots. The database will only persist while the download is running.
|
|
74
|
+
- If you query an explicit file (e.g. a query-string `?query=this` or `login.html`), the `--explicit`-argument is recommended as a wildcard query may lead to an empty result.
|
|
75
|
+
- Downloading directly into a network share is not recommended. The sqlite locking mechanism may cause issues. If you need to download into a network share, set the `--metadata` argument to a local path.
|
|
76
|
+
|
|
77
|
+
<br>
|
|
78
|
+
<br>
|
|
74
79
|
|
|
75
80
|
## Arguments
|
|
76
81
|
|
|
@@ -83,6 +88,7 @@ This tool allows you to download content from the Wayback Machine (archive.org).
|
|
|
83
88
|
The URL of the web page to download. This argument is required.
|
|
84
89
|
|
|
85
90
|
#### Mode Selection (Choose One)
|
|
91
|
+
|
|
86
92
|
- **`-a`**, **`--all`**:<br>
|
|
87
93
|
Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
|
|
88
94
|
- **`-l`**, **`--last`**:<br>
|
|
@@ -92,57 +98,77 @@ This tool allows you to download content from the Wayback Machine (archive.org).
|
|
|
92
98
|
- **`-s`**, **`--save`**:<br>
|
|
93
99
|
Save a page to the Wayback Machine. (beta)
|
|
94
100
|
|
|
95
|
-
|
|
101
|
+
#### Optional query parameters
|
|
96
102
|
|
|
97
103
|
- **`-e`**, **`--explicit`**:<br>
|
|
98
104
|
Only download the explicit given URL. No wildcard subdomains or paths. Use e.g. to get root-only snapshots. This is recommended for explicit files like `login.html` or `?query=this`.
|
|
99
105
|
|
|
100
|
-
- **`--filetype`** `<filetype>`:<br>
|
|
101
|
-
Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
|
|
102
|
-
|
|
103
106
|
- **`--limit`** `<count>`:<br>
|
|
104
|
-
Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
|
|
107
|
+
Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
|
|
105
108
|
|
|
106
109
|
- **Range Selection:**<br>
|
|
107
110
|
Specify the range in years or a specific timestamp either start, end, or both. If you specify the `range` argument, the `start` and `end` arguments will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.<br>
|
|
108
111
|
(year 2019, year 2019, year+month+day 20190101, year+month+day+hour 2019010112)
|
|
109
|
-
- **`-r`**, **`--range`**:<br>
|
|
110
|
-
Specify the range in years for which to search and download snapshots.
|
|
111
|
-
- **`--start`**:<br>
|
|
112
|
-
Timestamp to start searching.
|
|
113
|
-
- **`--end`**:<br>
|
|
114
|
-
Timestamp to end searching.
|
|
115
112
|
|
|
116
|
-
|
|
113
|
+
- **`-r`**, **`--range`**:<br>
|
|
114
|
+
Specify the range in years for which to search and download snapshots.
|
|
115
|
+
- **`--start`**:<br>
|
|
116
|
+
Timestamp to start searching.
|
|
117
|
+
- **`--end`**:<br>
|
|
118
|
+
Timestamp to end searching.
|
|
119
|
+
|
|
120
|
+
- **Filtering:**<br>
|
|
121
|
+
A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter.
|
|
122
|
+
|
|
123
|
+
- **`--filetype`** `<filetype>`:<br>
|
|
124
|
+
Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. Filetypes are filtered as they appear in the snapshot, so if there is no explicit `html` file in the path (common practice), you can't filter for it.
|
|
125
|
+
|
|
126
|
+
- **`--statuscode`** `<statuscode>`:<br>
|
|
127
|
+
Specify HTTP status codes to download. Default is all status codes. Separate multiple status codes with a comma. Example: `--statuscode 200,301`. Pywaybackup will try to download any snapshot regardless of its status code. For a 404 this means logged errors and corresponding entries in the CSV. However, you may want a CSV that includes these failed attempts.<br>
|
|
128
|
+
Common status codes you may want to handle/filter:
|
|
129
|
+
- `200` (OK)
|
|
130
|
+
- `301` (Moved Permanently - will redirect snapshot)
|
|
131
|
+
- `404` (Not Found - snapshot seems to be empty)
|
|
132
|
+
- `500` (Internal Server Error - snapshot is at least for now not available)
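As a rough illustration of how comma-separated filter values such as `--statuscode 200,301` or `--filetype jpg,css,js` could be applied to snapshot records, here is a minimal Python sketch. The `Snapshot` record and the helpers `parse_csv_arg`, `file_extension`, and `filter_snapshots` are hypothetical names chosen for this example; they are not taken from pywaybackup's code, which may filter at query time instead.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Iterable, List, Optional, Set
from urllib.parse import urlsplit


@dataclass
class Snapshot:
    # Hypothetical record for illustration only.
    url: str
    statuscode: str


def parse_csv_arg(value: Optional[str]) -> Optional[Set[str]]:
    """Split a comma-separated argument such as '200,301' into a set.

    Returns None when no filter was given, meaning "accept everything".
    """
    if not value:
        return None
    return {item.strip().lower() for item in value.split(",") if item.strip()}


def file_extension(url: str) -> str:
    """Return the extension found in the URL path, or '' if there is none."""
    return PurePosixPath(urlsplit(url).path).suffix.lstrip(".").lower()


def filter_snapshots(
    snapshots: Iterable[Snapshot],
    statuscodes: Optional[Set[str]] = None,
    filetypes: Optional[Set[str]] = None,
) -> List[Snapshot]:
    """Keep only the snapshots that match every filter that is set."""
    kept = []
    for snap in snapshots:
        if statuscodes is not None and snap.statuscode not in statuscodes:
            continue
        if filetypes is not None and file_extension(snap.url) not in filetypes:
            continue
        kept.append(snap)
    return kept
```

Note that a URL such as `https://example.com/` has no extension in its path, which is why a page served without an explicit `html` file cannot be selected by a filetype filter in this sketch.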
|
|
133
|
+
|
|
134
|
+
### Optional
|
|
135
|
+
|
|
136
|
+
#### Behavior Manipulation
|
|
117
137
|
|
|
118
138
|
- **`-o`**, **`--output`**:<br>
|
|
119
|
-
Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
|
|
139
|
+
Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
|
|
140
|
+
|
|
141
|
+
- **`-m`**, **`--metadata`**<br>
|
|
142
|
+
Change the folder where metadata (`cdx`/`db`/`csv`/`log`) will be saved. If you are downloading into a network share, you SHOULD set this to a local path, because the sqlite locking mechanism may cause issues with network shares.
|
|
143
|
+
|
|
144
|
+
- **`--verbose`**:<br>
|
|
145
|
+
Increase output verbosity.
|
|
120
146
|
|
|
121
147
|
<!-- - **`--verbosity`** `<level>`:<br>
|
|
122
148
|
Sets verbosity level. Options are `info`and `trace`. Default is `info`. -->
|
|
123
149
|
|
|
124
150
|
- **`--log`** <!-- `<path>` -->:<br>
|
|
125
|
-
Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
|
|
151
|
+
Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
|
|
126
152
|
|
|
127
153
|
- **`--progress`**:<br>
|
|
128
|
-
Shows a progress bar instead of the default output.
|
|
154
|
+
Shows a progress bar instead of the default output.
|
|
129
155
|
|
|
130
156
|
- **`--workers`** `<count>`:<br>
|
|
131
|
-
Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
|
|
157
|
+
Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
|
|
132
158
|
|
|
133
159
|
- **`--no-redirect`**:<br>
|
|
134
|
-
Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
|
|
135
|
-
|
|
160
|
+
Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
|
|
161
|
+
|
|
136
162
|
- **`--retry`** `<attempts>`:<br>
|
|
137
|
-
Specifies number of retry attempts for failed downloads.
|
|
163
|
+
Specifies number of retry attempts for failed downloads.
|
|
138
164
|
|
|
139
165
|
- **`--delay`** `<seconds>`:<br>
|
|
140
|
-
Specifies delay between download requests in seconds. Default is no delay (0).
|
|
166
|
+
Specifies delay between download requests in seconds. Default is no delay (0).
|
|
141
167
|
|
|
142
168
|
<!-- - **`--convert-links`**:<br>
|
|
143
169
|
If set, all links in the downloaded files will be converted to local links. This is useful for offline browsing. The links are converted to the local path structure. Show output with `--verbosity trace`. -->
|
|
144
170
|
|
|
145
|
-
|
|
171
|
+
#### Job Handling:
|
|
146
172
|
|
|
147
173
|
- **`--reset`**:
|
|
148
174
|
If set, the job will be reset, and any existing `cdx`, `db`, `csv` files will be **deleted**. This allows you to start the job from scratch without considering previously downloaded data.
|
|
@@ -150,47 +176,60 @@ If set, all links in the downloaded files will be converted to local links. This
|
|
|
150
176
|
- **`--keep`**:
|
|
151
177
|
If set, all files will be kept after the job is finished. This includes the `cdx` and `db` file. Without this argument, they will be deleted if the job finished successfully.
|
|
152
178
|
|
|
153
|
-
|
|
179
|
+
<br>
|
|
180
|
+
<br>
|
|
181
|
+
|
|
182
|
+
## Usage
|
|
154
183
|
|
|
155
184
|
### Handling Interrupted Jobs
|
|
156
|
-
|
|
157
|
-
|
|
158
|
-
|
|
159
|
-
|
|
160
|
-
|
|
161
|
-
|
|
162
|
-
|
|
163
|
-
|
|
164
|
-
|
|
165
|
-
|
|
166
|
-
|
|
167
|
-
|
|
168
|
-
|
|
169
|
-
|
|
170
|
-
|
|
171
|
-
|
|
172
|
-
|
|
173
|
-
|
|
174
|
-
|
|
175
|
-
|
|
176
|
-
>
|
|
177
|
-
|
|
178
|
-
|
|
179
|
-
|
|
180
|
-
|
|
181
|
-
|
|
182
|
-
|
|
183
|
-
|
|
184
|
-
|
|
185
|
-
|
|
186
|
-
|
|
187
|
-
|
|
185
|
+
|
|
186
|
+
`pywaybackup` resumes interrupted jobs. The tool automatically continues from where it left off.
|
|
187
|
+
|
|
188
|
+
- Detects existing `.cdx` and `.db` files in an `output dir` to resume downloading from the last successful point.
|
|
189
|
+
- Compares `URL`, `mode`, and `optional query parameters` with the existing job to decide whether it can be resumed automatically.
|
|
190
|
+
- Skips previously downloaded files to save time.
|
|
191
|
+
> **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
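The resumption rule above can be pictured as a fingerprint comparison: a job may continue only if the settings that identify it are unchanged. The following Python sketch illustrates that idea under assumed names (`job_fingerprint`, `can_resume`); it does not reproduce how pywaybackup actually identifies or stores a job.

```python
import hashlib
import json


def job_fingerprint(url, mode, query_params, output):
    """Return a stable hash over the settings that identify a job.

    Hypothetical helper: the fields mirror the ones named in the note
    (URL, mode selection, query parameters, output).
    """
    payload = json.dumps(
        {"url": url, "mode": mode, "query": query_params, "output": output},
        sort_keys=True,  # make the hash independent of dict key order
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def can_resume(previous_fingerprint, url, mode, query_params, output):
    """A job is resumable only if none of its identifying settings changed."""
    return previous_fingerprint == job_fingerprint(url, mode, query_params, output)
```

In this sketch, changing any single field (for example switching the mode from all to last) yields a different fingerprint, so the run would start as a new job.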
|
|
192
|
+
|
|
193
|
+
#### Resetting a Job (`--reset`)
|
|
194
|
+
|
|
195
|
+
- Deletes `.cdx` and `.db` files and restarts the process from scratch.
|
|
196
|
+
- Does **not** remove already downloaded files.
|
|
197
|
+
- `waybackup -u https://example.com -a --reset`
|
|
198
|
+
|
|
199
|
+
#### Keeping Job Data (`--keep`)
|
|
200
|
+
|
|
201
|
+
- Normally, `.cdx` and `.db` files are deleted after a successful job.
|
|
202
|
+
- `--keep` preserves them for future re-analysis or extending the query.
|
|
203
|
+
- `waybackup -u https://example.com -a --keep`
|
|
204
|
+
|
|
205
|
+
<br>
|
|
206
|
+
<br>
|
|
207
|
+
|
|
208
|
+
## Examples
|
|
209
|
+
|
|
210
|
+
1. Download a specific single snapshot of all available files (starting from root):<br>
|
|
211
|
+
`waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000`
|
|
212
|
+
2. Download a specific single snapshot of all available files (starting from a subdirectory):<br>
|
|
213
|
+
`waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000`
|
|
214
|
+
3. Download a specific single snapshot of the exact given URL (no subdirs):<br>
|
|
215
|
+
`waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit`
|
|
216
|
+
4. Download all snapshots of all available files in the given range:<br>
|
|
217
|
+
`waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000`
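The `--start` and `--end` values in the examples above follow the `YYYYMMDDhhmmss` format, which may be shortened from the right. A minimal Python sketch for checking such values before running a query is shown below. The helpers `pad_timestamp` and `is_valid_timestamp` are hypothetical, and the restriction to whole fields (4, 6, 8, 10, 12, or 14 digits) is an assumption made for this example rather than a documented rule of the tool.

```python
import re
from datetime import datetime

# Assumed: a timestamp may stop after any complete field.
_VALID_LENGTHS = {4, 6, 8, 10, 12, 14}  # YYYY, +MM, +DD, +hh, +mm, +ss
# Lowest value for every field, used to fill in the missing right-hand part.
_LOWEST = "00000101000000"


def pad_timestamp(value: str) -> str:
    """Expand a partial timestamp to 14 digits using the lowest field values."""
    return value + _LOWEST[len(value):]


def is_valid_timestamp(value: str) -> bool:
    """Check that a full or partial timestamp describes a real date and time."""
    if not re.fullmatch(r"[0-9]+", value) or len(value) not in _VALID_LENGTHS:
        return False
    try:
        datetime.strptime(pad_timestamp(value), "%Y%m%d%H%M%S")
    except ValueError:
        return False
    return True
```

For instance, `2019`, `20190101`, and `2019010112` pass this check, while `20191301` (month 13) does not.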
|
|
218
|
+
|
|
219
|
+
<br>
|
|
220
|
+
<br>
|
|
221
|
+
|
|
222
|
+
## Output
|
|
223
|
+
|
|
224
|
+
### Path Structure
|
|
188
225
|
|
|
189
226
|
The output path is currently structured as follows, using this query as an example:<br>
|
|
190
|
-
`http://example.com/subdir1/subdir2/assets
|
|
227
|
+
`http://example.com/subdir1/subdir2/assets/`
|
|
191
228
|
<br><br>
|
|
192
229
|
For the first and last version (`-f` or `-l`):
|
|
193
|
-
|
|
230
|
+
|
|
231
|
+
- Includes only the files/folders starting from your query path.
|
|
232
|
+
|
|
194
233
|
```
|
|
195
234
|
your/path/waybackup_snapshots/
|
|
196
235
|
└── the_root_of_your_query/ (example.com/)
|
|
@@ -201,8 +240,11 @@ your/path/waybackup_snapshots/
|
|
|
201
240
|
├── style.css
|
|
202
241
|
...
|
|
203
242
|
```
|
|
243
|
+
|
|
204
244
|
For all versions (`-a`):
|
|
205
|
-
|
|
245
|
+
|
|
246
|
+
- Creates a folder named after the root of your query. Inside it, you will find one folder per timestamp, each containing the path you requested.
|
|
247
|
+
|
|
206
248
|
```
|
|
207
249
|
your/path/waybackup_snapshots/
|
|
208
250
|
└── the_root_of_your_query/ (example.com/)
|
|
@@ -221,7 +263,7 @@ your/path/waybackup_snapshots/
|
|
|
221
263
|
...
|
|
222
264
|
```
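To make the two layouts concrete, the following Python sketch maps a snapshot URL to a local path. It is an approximation inferred from the trees in this section and from the `FILE:` line in the verbose log example, where the timestamp folder carries an `id_` suffix. The function `local_path` is a hypothetical name, and the real tool may sanitize names or handle index files differently.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def local_path(output_dir, snapshot_url, timestamp=None):
    """Build the assumed target path for one snapshot.

    With a timestamp (all versions, `-a`), files go into a per-timestamp
    folder below the query root; without one (`-f` / `-l`), they are placed
    directly below the query root.
    """
    parts = urlsplit(snapshot_url)
    base = PurePosixPath(output_dir) / parts.netloc
    if timestamp is not None:
        base = base / f"{timestamp}id_"
    relative = parts.path.lstrip("/")
    return base / relative if relative else base
```

Called with `"waybackup_snapshots"`, the CSS URL from the log example, and the timestamp `"20240225193302"`, this returns the same relative layout as the `FILE:` entry shown in the log.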
|
|
223
265
|
|
|
224
|
-
|
|
266
|
+
### CSV
|
|
225
267
|
|
|
226
268
|
Each snapshot is stored with the following keys/values. These are either stored in a sqlite database while the download is running or saved into a CSV file after the download is finished.
|
|
227
269
|
|
|
@@ -243,15 +285,33 @@ For download queries:
|
|
|
243
285
|
]
|
|
244
286
|
```
|
|
245
287
|
|
|
288
|
+
### Log
|
|
289
|
+
|
|
290
|
+
Verbose:
|
|
291
|
+
|
|
292
|
+
```
|
|
293
|
+
-----> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
|
|
294
|
+
SUCCESS -> 200 OK
|
|
295
|
+
-> URL: https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
|
|
296
|
+
-> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css
|
|
297
|
+
```
|
|
298
|
+
|
|
299
|
+
Non-verbose:
|
|
300
|
+
|
|
301
|
+
```
|
|
302
|
+
55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
|
|
303
|
+
```
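If you want to post-process a log file, the non-verbose line format shown above can be parsed with a regular expression. The sketch below is based only on the single example line in this section; the `LogEntry` type and `parse_log_line` function are hypothetical, and other status words or line shapes may exist that this pattern does not cover.

```python
import re
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    done: int        # position of this snapshot within the job
    total: int       # total number of snapshots in the job
    worker: int      # worker number from the "W:<n>" field
    status: str      # e.g. "SUCCESS"
    timestamp: str   # 14-digit snapshot timestamp
    url: str         # original URL of the snapshot


_LINE = re.compile(
    r"^(?P<done>\d+)/(?P<total>\d+) - W:(?P<worker>\d+) - "
    r"(?P<status>\S+) - (?P<timestamp>\d{14}) - (?P<url>\S+)$"
)


def parse_log_line(line: str) -> Optional[LogEntry]:
    """Parse one non-verbose log line; return None if it does not match."""
    match = _LINE.match(line.strip())
    if match is None:
        return None
    return LogEntry(
        int(match["done"]),
        int(match["total"]),
        int(match["worker"]),
        match["status"],
        match["timestamp"],
        match["url"],
    )
```

Lines that do not match the assumed pattern yield `None`, so unexpected entries can be skipped or inspected separately.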
|
|
304
|
+
|
|
246
305
|
### Debugging
|
|
247
306
|
|
|
248
307
|
Exceptions will be written into `waybackup_error.log` (each run overwrites the file).
|
|
249
308
|
|
|
250
|
-
|
|
251
|
-
|
|
252
|
-
- [ ] currently there is no logic to handle if both a http and https version of a page is available
|
|
309
|
+
<br>
|
|
310
|
+
<br>
|
|
253
311
|
|
|
254
312
|
## Contributing
|
|
255
313
|
|
|
256
314
|
I'm always happy to receive feature requests that improve the usability of this tool.
|
|
257
315
|
Feel free to give suggestions and report issues. Project is still far from being perfect.
|
|
316
|
+
|
|
317
|
+
> Please PR from dev into dev.
|
|
@@ -3,7 +3,6 @@
|
|
|
3
3
|
[](https://pypi.org/project/pywaybackup/)
|
|
4
4
|
[](https://pypi.org/project/pywaybackup/)
|
|
5
5
|

|
|
6
|
-
<!--  -->
|
|
7
6
|
[](https://opensource.org/licenses/MIT)
|
|
8
7
|
|
|
9
8
|
Downloading archived web pages from the [Wayback Machine](https://archive.org/web/).
|
|
@@ -17,23 +16,27 @@ This tool allows you to download content from the Wayback Machine (archive.org).
|
|
|
17
16
|
### Pip
|
|
18
17
|
|
|
19
18
|
1. Install the package <br>
|
|
20
|
-
|
|
19
|
+
`pip install pywaybackup`
|
|
21
20
|
2. Run the tool <br>
|
|
22
|
-
|
|
21
|
+
`waybackup -h`
|
|
23
22
|
|
|
24
23
|
### Manual
|
|
25
24
|
|
|
26
25
|
1. Clone the repository <br>
|
|
27
|
-
|
|
26
|
+
`git clone https://github.com/bitdruid/python-wayback-machine-downloader.git`
|
|
28
27
|
2. Install <br>
|
|
29
|
-
|
|
28
|
+
`pip install .`
|
|
30
29
|
- in a virtual env or use `--break-system-packages`
|
|
31
30
|
|
|
32
|
-
##
|
|
31
|
+
## notes / issues / hints
|
|
33
32
|
|
|
34
|
-
- Linux recommended: On Windows machines, the path length is limited.
|
|
35
|
-
- If you query an explicit file (e.g. a query-string `?query=this` or `login.html`), the `--explicit`-argument is recommended as a wildcard query may lead to an empty result.
|
|
33
|
+
- Linux recommended: On Windows machines, the path length is limited. Files that exceed the path length will not be downloaded.
|
|
36
34
|
- The tool uses a sqlite database to handle snapshots. The database will only persist while the download is running.
|
|
35
|
+
- If you query an explicit file (e.g. a query-string `?query=this` or `login.html`), the `--explicit`-argument is recommended as a wildcard query may lead to an empty result.
|
|
36
|
+
- Downloading directly into a network share is not recommended. The sqlite locking mechanism may cause issues. If you need to download into a network share, set the `--metadata` argument to a local path.
|
|
37
|
+
|
|
38
|
+
<br>
|
|
39
|
+
<br>
|
|
37
40
|
|
|
38
41
|
## Arguments
|
|
39
42
|
|
|
@@ -46,6 +49,7 @@ This tool allows you to download content from the Wayback Machine (archive.org).
|
|
|
46
49
|
The URL of the web page to download. This argument is required.
|
|
47
50
|
|
|
48
51
|
#### Mode Selection (Choose One)
|
|
52
|
+
|
|
49
53
|
- **`-a`**, **`--all`**:<br>
|
|
50
54
|
Download snapshots of all timestamps. You will get a folder per timestamp with the files available at that time.
|
|
51
55
|
- **`-l`**, **`--last`**:<br>
|
|
@@ -55,57 +59,77 @@ This tool allows you to download content from the Wayback Machine (archive.org).
|
|
|
55
59
|
- **`-s`**, **`--save`**:<br>
|
|
56
60
|
Save a page to the Wayback Machine. (beta)
|
|
57
61
|
|
|
58
|
-
|
|
62
|
+
#### Optional query parameters
|
|
59
63
|
|
|
60
64
|
- **`-e`**, **`--explicit`**:<br>
|
|
61
65
|
Only download the explicit given URL. No wildcard subdomains or paths. Use e.g. to get root-only snapshots. This is recommended for explicit files like `login.html` or `?query=this`.
|
|
62
66
|
|
|
63
|
-
- **`--filetype`** `<filetype>`:<br>
|
|
64
|
-
Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter. Filetypes are filtered as they are in the snapshot. So if there is no explicit `html` file in the path (common practice) then you cant filter them.
|
|
65
|
-
|
|
66
67
|
- **`--limit`** `<count>`:<br>
|
|
67
|
-
Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
|
|
68
|
+
Limits the amount of snapshots to query from the CDX server. If an existing CDX file is injected, the limit will have no effect. So you would need to set `--keep`.
|
|
68
69
|
|
|
69
70
|
- **Range Selection:**<br>
|
|
70
71
|
Specify the range in years or a specific timestamp either start, end, or both. If you specify the `range` argument, the `start` and `end` arguments will be ignored. Format for timestamps: YYYYMMDDhhmmss. You can only give a year or increase specificity by going through the timestamp starting on the left.<br>
|
|
71
72
|
(year 2019, year 2019, year+month+day 20190101, year+month+day+hour 2019010112)
|
|
72
|
-
- **`-r`**, **`--range`**:<br>
|
|
73
|
-
Specify the range in years for which to search and download snapshots.
|
|
74
|
-
- **`--start`**:<br>
|
|
75
|
-
Timestamp to start searching.
|
|
76
|
-
- **`--end`**:<br>
|
|
77
|
-
Timestamp to end searching.
|
|
78
73
|
|
|
79
|
-
|
|
74
|
+
- **`-r`**, **`--range`**:<br>
|
|
75
|
+
Specify the range in years for which to search and download snapshots.
|
|
76
|
+
- **`--start`**:<br>
|
|
77
|
+
Timestamp to start searching.
|
|
78
|
+
- **`--end`**:<br>
|
|
79
|
+
Timestamp to end searching.
|
|
80
|
+
|
|
81
|
+
- **Filtering:**<br>
|
|
82
|
+
A filter will result in a filtered cdx-file. So if you want to download all files later, you need to query again without the filter.
|
|
83
|
+
|
|
84
|
+
- **`--filetype`** `<filetype>`:<br>
|
|
85
|
+
Specify filetypes to download. Default is all filetypes. Separate multiple filetypes with a comma. Example: `--filetype jpg,css,js`. Filetypes are filtered as they appear in the snapshot, so if there is no explicit `html` file in the path (common practice), you can't filter for it.
|
|
86
|
+
|
|
87
|
+
- **`--statuscode`** `<statuscode>`:<br>
|
|
88
|
+
Specify HTTP status codes to download. Default is all status codes. Separate multiple status codes with a comma. Example: `--statuscode 200,301`. Pywaybackup will try to download any snapshot regardless of its status code. For a 404 this means logged errors and corresponding entries in the CSV. However, you may want a CSV that includes these failed attempts.<br>
|
|
89
|
+
Common status codes you may want to handle/filter:
|
|
90
|
+
- `200` (OK)
|
|
91
|
+
- `301` (Moved Permanently - will redirect snapshot)
|
|
92
|
+
- `404` (Not Found - snapshot seems to be empty)
|
|
93
|
+
- `500` (Internal Server Error - snapshot is at least for now not available)
|
|
94
|
+
|
|
95
|
+
### Optional
|
|
96
|
+
|
|
97
|
+
#### Behavior Manipulation
|
|
80
98
|
|
|
81
99
|
- **`-o`**, **`--output`**:<br>
|
|
82
|
-
Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
|
|
100
|
+
Defaults to `waybackup_snapshots` in the current directory. The folder where downloaded files will be saved.
|
|
101
|
+
|
|
102
|
+
- **`-m`**, **`--metadata`**<br>
|
|
103
|
+
Change the folder where metadata (`cdx`/`db`/`csv`/`log`) will be saved. If you are downloading into a network share, you SHOULD set this to a local path, because the sqlite locking mechanism may cause issues with network shares.
|
|
104
|
+
|
|
105
|
+
- **`--verbose`**:<br>
|
|
106
|
+
Increase output verbosity.
|
|
83
107
|
|
|
84
108
|
<!-- - **`--verbosity`** `<level>`:<br>
|
|
85
109
|
Sets verbosity level. Options are `info`and `trace`. Default is `info`. -->
|
|
86
110
|
|
|
87
111
|
- **`--log`** <!-- `<path>` -->:<br>
|
|
88
|
-
Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
|
|
112
|
+
Saves a log file into the output-dir. Named as `waybackup_<sanitized_url>.log`.
|
|
89
113
|
|
|
90
114
|
- **`--progress`**:<br>
|
|
91
|
-
Shows a progress bar instead of the default output.
|
|
115
|
+
Shows a progress bar instead of the default output.
|
|
92
116
|
|
|
93
117
|
- **`--workers`** `<count>`:<br>
|
|
94
|
-
Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
|
|
118
|
+
Sets the number of simultaneous download workers. Default is 1, safe range is about 10. Be cautious as too many workers may lead to refused connections from the Wayback Machine.
|
|
95
119
|
|
|
96
120
|
- **`--no-redirect`**:<br>
|
|
97
|
-
Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
|
|
98
|
-
|
|
121
|
+
Disables following redirects of snapshots. Useful for preventing timestamp-folder mismatches caused by Archive.org redirects.
|
|
122
|
+
|
|
99
123
|
- **`--retry`** `<attempts>`:<br>
|
|
100
|
-
Specifies number of retry attempts for failed downloads.
|
|
124
|
+
Specifies number of retry attempts for failed downloads.
|
|
101
125
|
|
|
102
126
|
- **`--delay`** `<seconds>`:<br>
|
|
103
|
-
Specifies delay between download requests in seconds. Default is no delay (0).
|
|
127
|
+
Specifies delay between download requests in seconds. Default is no delay (0).
|
|
104
128
|
|
|
105
129
|
<!-- - **`--convert-links`**:<br>
|
|
106
130
|
If set, all links in the downloaded files will be converted to local links. This is useful for offline browsing. The links are converted to the local path structure. Show output with `--verbosity trace`. -->
|
|
107
131
|
|
|
108
|
-
|
|
132
|
+
#### Job Handling:
|
|
109
133
|
|
|
110
134
|
- **`--reset`**:
|
|
111
135
|
If set, the job will be reset, and any existing `cdx`, `db`, `csv` files will be **deleted**. This allows you to start the job from scratch without considering previously downloaded data.
|
|
@@ -113,47 +137,60 @@ If set, all links in the downloaded files will be converted to local links. This
|
|
|
113
137
|
- **`--keep`**:
|
|
114
138
|
If set, all files will be kept after the job is finished. This includes the `cdx` and `db` file. Without this argument, they will be deleted if the job finished successfully.
|
|
115
139
|
|
|
116
|
-
|
|
140
|
+
<br>
|
|
141
|
+
<br>
|
|
142
|
+
|
|
143
|
+
## Usage
|
|
117
144
|
|
|
118
145
|
### Handling Interrupted Jobs
|
|
119
|
-
|
|
120
|
-
|
|
121
|
-
|
|
122
|
-
|
|
123
|
-
|
|
124
|
-
|
|
125
|
-
|
|
126
|
-
|
|
127
|
-
|
|
128
|
-
|
|
129
|
-
|
|
130
|
-
|
|
131
|
-
|
|
132
|
-
|
|
133
|
-
|
|
134
|
-
|
|
135
|
-
|
|
136
|
-
|
|
137
|
-
|
|
138
|
-
|
|
139
|
-
>
|
|
140
|
-
|
|
141
|
-
|
|
142
|
-
|
|
143
|
-
|
|
144
|
-
|
|
145
|
-
|
|
146
|
-
|
|
147
|
-
|
|
148
|
-
|
|
149
|
-
|
|
150
|
-
|
|
146
|
+
|
|
147
|
+
`pywaybackup` resumes interrupted jobs. The tool automatically continues from where it left off.
|
|
148
|
+
|
|
149
|
+
- Detects existing `.cdx` and `.db` files in an `output dir` to resume downloading from the last successful point.
|
|
150
|
+
- Compares `URL`, `mode`, and `optional query parameters` with the existing job to decide whether it can be resumed automatically.
|
|
151
|
+
- Skips previously downloaded files to save time.
|
|
152
|
+
> **Note:** Changing URL, mode selection, query parameters or output prevents automatic resumption.
|
|
153
|
+
|
|
154
|
+
#### Resetting a Job (`--reset`)
|
|
155
|
+
|
|
156
|
+
- Deletes `.cdx` and `.db` files and restarts the process from scratch.
|
|
157
|
+
- Does **not** remove already downloaded files.
|
|
158
|
+
- `waybackup -u https://example.com -a --reset`
|
|
159
|
+
|
|
160
|
+
#### Keeping Job Data (`--keep`)
|
|
161
|
+
|
|
162
|
+
- Normally, `.cdx` and `.db` files are deleted after a successful job.
|
|
163
|
+
- `--keep` preserves them for future re-analysis or extending the query.
|
|
164
|
+
- `waybackup -u https://example.com -a --keep`
|
|
165
|
+
|
|
166
|
+
<br>
|
|
167
|
+
<br>
|
|
168
|
+
|
|
169
|
+
## Examples
|
|
170
|
+
|
|
171
|
+
1. Download a specific single snapshot of all available files (starting from root):<br>
|
|
172
|
+
`waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000`
|
|
173
|
+
2. Download a specific single snapshot of all available files (starting from a subdirectory):<br>
|
|
174
|
+
`waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000`
|
|
175
|
+
3. Download a specific single snapshot of the exact given URL (no subdirs):<br>
|
|
176
|
+
`waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit`
|
|
177
|
+
4. Download all snapshots of all available files in the given range:<br>
|
|
178
|
+
`waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000`
|
|
179
|
+
|
|
180
|
+
<br>
|
|
181
|
+
<br>
|
|
182
|
+
|
|
183
|
+
## Output
|
|
184
|
+
|
|
185
|
+
### Path Structure
|
|
151
186
|
|
|
152
187
|
The output path is currently structured as follows, using this query as an example:<br>
|
|
153
|
-
`http://example.com/subdir1/subdir2/assets
|
|
188
|
+
`http://example.com/subdir1/subdir2/assets/`
|
|
154
189
|
<br><br>
|
|
155
190
|
For the first and last version (`-f` or `-l`):
|
|
156
|
-
|
|
191
|
+
|
|
192
|
+
- Includes only the files/folders starting from your query path.
|
|
193
|
+
|
|
157
194
|
```
|
|
158
195
|
your/path/waybackup_snapshots/
|
|
159
196
|
└── the_root_of_your_query/ (example.com/)
|
|
@@ -164,8 +201,11 @@ your/path/waybackup_snapshots/
|
|
|
164
201
|
├── style.css
|
|
165
202
|
...
|
|
166
203
|
```
|
|
204
|
+
|
|
167
205
|
For all versions (`-a`):
|
|
168
|
-
|
|
206
|
+
|
|
207
|
+
- Creates a folder named after the root of your query. Inside it, you will find one folder per timestamp, each containing the path you requested.
|
|
208
|
+
|
|
169
209
|
```
|
|
170
210
|
your/path/waybackup_snapshots/
|
|
171
211
|
└── the_root_of_your_query/ (example.com/)
|
|
@@ -184,7 +224,7 @@ your/path/waybackup_snapshots/
|
|
|
184
224
|
...
|
|
185
225
|
```
|
|
186
226
|
|
|
187
|
-
|
|
227
|
+
### CSV
|
|
188
228
|
|
|
189
229
|
Each snapshot is stored with the following keys/values. These are either stored in a sqlite database while the download is running or saved into a CSV file after the download is finished.
|
|
190
230
|
|
|
@@ -206,15 +246,33 @@ For download queries:
|
|
|
206
246
|
]
|
|
207
247
|
```
|
|
208
248
|
|
|
249
|
+
### Log
|
|
250
|
+
|
|
251
|
+
Verbose:
|
|
252
|
+
|
|
253
|
+
```
|
|
254
|
+
-----> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
|
|
255
|
+
SUCCESS -> 200 OK
|
|
256
|
+
-> URL: https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
|
|
257
|
+
-> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css
|
|
258
|
+
```
|
|
259
|
+
|
|
260
|
+
Non-verbose:
|
|
261
|
+
|
|
262
|
+
```
|
|
263
|
+
55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
|
|
264
|
+
```
|
|
265
|
+
|
|
209
266
|
### Debugging
|
|
210
267
|
|
|
211
268
|
Exceptions will be written into `waybackup_error.log` (each run overwrites the file).
|
|
212
269
|
|
|
213
|
-
|
|
214
|
-
|
|
215
|
-
- [ ] currently there is no logic to handle if both a http and https version of a page is available
|
|
270
|
+
<br>
|
|
271
|
+
<br>
|
|
216
272
|
|
|
217
273
|
## Contributing
|
|
218
274
|
|
|
219
275
|
I'm always happy to receive feature requests that improve the usability of this tool.
|
|
220
276
|
Feel free to give suggestions and report issues. Project is still far from being perfect.
|
|
277
|
+
|
|
278
|
+
> Please PR from dev into dev.
|