waybackprov 0.0.9__tar.gz → 0.1.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,53 +1,74 @@
1
- Metadata-Version: 2.1
1
+ Metadata-Version: 2.3
2
2
  Name: waybackprov
3
- Version: 0.0.9
3
+ Version: 0.1.1
4
4
  Summary: Checks the provenance of a URL in the Wayback machine
5
- Home-page: https://github.com/edsu/waybackprov
6
5
  Author: Ed Summers
7
- Author-email: ehs@pobox.com
8
- License: UNKNOWN
9
- Platform: UNKNOWN
6
+ Author-email: Ed Summers <ehs@pobox.com>
10
7
  Requires-Python: >=3.0
8
+ Project-URL: repository, https://github.com/docnow/waybackprov
11
9
  Description-Content-Type: text/markdown
12
10
 
11
+ # waybackprov
12
+
13
+ [![Test](https://github.com/DocNow/waybackprov/actions/workflows/test.yml/badge.svg)](https://github.com/DocNow/waybackprov/actions/workflows/test.yml)
14
+
13
15
  Give *waybackprov* a URL and it will summarize which Internet Archive
14
16
  collections have archived the URL. This kind of information can sometimes
15
17
  provide insight into why a particular web resource or set of web resources was
16
18
  archived from the web.
17
19
 
18
- ## Install
20
+ ## Run
21
+
22
+ If you have [uv] installed, you can run `waybackprov` without installing it first:
23
+
24
+ ```
25
+ uvx waybackprov
26
+ ```
27
+
28
+ Otherwise, you'll probably want to install it with `pip`:
19
29
 
20
- pip install waybackprov
30
+ ```
31
+ pip install waybackprov
32
+ ```
21
33
 
22
34
  ## Basic Usage
23
35
 
24
36
  To check a particular URL, here's how it works:
25
37
 
26
- % waybackprov https://twitter.com/EPAScottPruitt
27
- 364 https://archive.org/details/focused_crawls
28
- 306 https://archive.org/details/edgi_monitor
29
- 151 https://archive.org/details/www3.epa.gov
30
- 60 https://archive.org/details/epa.gov4
31
- 47 https://archive.org/details/epa.gov5
32
- ...
38
+ ```shell
39
+ waybackprov https://twitter.com/EPAScottPruitt
40
+
41
+ crawls collections
42
+ 364 https://archive.org/details/focused_crawls
43
+ 306 https://archive.org/details/edgi_monitor
44
+ 151 https://archive.org/details/www3.epa.gov
45
+ 60 https://archive.org/details/epa.gov4
46
+ 47 https://archive.org/details/epa.gov5
47
+ ```
33
48
 
34
49
  The first column contains the number of crawls for a particular URL, and the
35
50
  second column contains the URL for the Internet Archive collection that added
36
51
  it.
37
52
 
53
+ When evaluating the counts, keep in mind that collections can contain other collections: `epa.gov4` in the example above, for instance, is part of the `edgi_monitor` collection.
54
+
38
55
  ## Time
39
56
 
40
57
  By default waybackprov only looks at the previous and current year. If you would
41
58
  like it to examine a different range of years, use the `--start` and `--end` options:
42
59
 
43
- % waybackprov --start 2016 --end 2018 https://twitter.com/EPAScottPruitt
60
+ ```shell
61
+ waybackprov --start 2016 --end 2018 https://twitter.com/EPAScottPruitt
62
+ ```
44
63
 
45
64
  ## Multiple Pages
46
65
 
47
66
  If you would like to look at all URLs under a particular URL prefix, you can use the
48
67
  `--prefix` option:
49
68
 
50
- % waybackprov --prefix https://twitter.com/EPAScottPruitt
69
+ ```shell
70
+ waybackprov --prefix https://twitter.com/EPAScottPruitt
71
+ ```
51
72
 
52
73
  This uses the Internet Archive's [CDX API](https://github.com/webrecorder/pywb/wiki/CDX-Server-API) to also include URLs that extend the URL you supply, so it would include, for example:
53
74
 
@@ -63,7 +84,9 @@ interested in is highly recommended since it prevents lots of lookups for CSS,
63
84
  JavaScript and image files that are components of the resource that was
64
85
  initially crawled.
65
86
 
66
- % waybackprov --prefix --match 'status/\d+$' https://twitter.com/EPAScottPruitt
87
+ ```
88
+ waybackprov --prefix --match 'status/\d+$' https://twitter.com/EPAScottPruitt
89
+ ```
67
90
 
68
91
  ## Collections
69
92
 
@@ -88,14 +111,15 @@ rather than a summary.
88
111
  If you would like to see detailed information about what *waybackprov* is doing
89
112
  use the `--log` option to supply a file path to log to:
90
113
 
91
- % waybackprov --log waybackprov.log https://example.com/
114
+ ```shell
115
+ waybackprov --log waybackprov.log https://example.com/
116
+ ```
92
117
 
93
118
  ## Test
94
119
 
95
120
  If you would like to run the tests, first install [pytest] and then:
96
121
 
97
- pytest test.py
122
+ uv run pytest test.py
98
123
 
99
124
  [pytest]: https://docs.pytest.org/en/latest/
100
-
101
-
125
+ [uv]: https://docs.astral.sh/uv/
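
As an aside on the `--match` example above: the filtering is a plain regular-expression check against each URL returned by the CDX lookup, the same `pattern.search()` test that the `cdx()` helper in `waybackprov.py` applies (see the source further down in this diff). A small standalone sketch, using the URLs from the README and an invented `candidate_urls` list:

```python
# Standalone illustration (not part of the package) of how a --match pattern
# narrows --prefix results; candidate_urls is made up for the example.
import re

pattern = re.compile(r"status/\d+$")

candidate_urls = [
    "https://twitter.com/EPAScottPruitt/status/1309839080398339",
    "https://twitter.com/EPAScottPruitt/status/1309839080398339/media/1",
]

for url in candidate_urls:
    # cdx() keeps a URL only when pattern.search(url) matches,
    # so the trailing /media/1 URL is skipped here
    print(url, "kept" if pattern.search(url) else "skipped")
```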
@@ -1,53 +1,64 @@
1
- Metadata-Version: 2.1
2
- Name: waybackprov
3
- Version: 0.0.9
4
- Summary: Checks the provenance of a URL in the Wayback machine
5
- Home-page: https://github.com/edsu/waybackprov
6
- Author: Ed Summers
7
- Author-email: ehs@pobox.com
8
- License: UNKNOWN
9
- Platform: UNKNOWN
10
- Requires-Python: >=3.0
11
- Description-Content-Type: text/markdown
1
+ # waybackprov
2
+
3
+ [![Test](https://github.com/DocNow/waybackprov/actions/workflows/test.yml/badge.svg)](https://github.com/DocNow/waybackprov/actions/workflows/test.yml)
12
4
 
13
5
  Give *waybackprov* a URL and it will summarize which Internet Archive
14
6
  collections have archived the URL. This kind of information can sometimes
15
7
  provide insight into why a particular web resource or set of web resources was
16
8
  archived from the web.
17
9
 
18
- ## Install
10
+ ## Run
11
+
12
+ If you have [uv] installed, you can run `waybackprov` without installing it first:
13
+
14
+ ```
15
+ uvx waybackprov
16
+ ```
17
+
18
+ Otherwise, you'll probably want to install it with `pip`:
19
19
 
20
- pip install waybackprov
20
+ ```
21
+ pip install waybackprov
22
+ ```
21
23
 
22
24
  ## Basic Usage
23
25
 
24
26
  To check a particular URL, here's how it works:
25
27
 
26
- % waybackprov https://twitter.com/EPAScottPruitt
27
- 364 https://archive.org/details/focused_crawls
28
- 306 https://archive.org/details/edgi_monitor
29
- 151 https://archive.org/details/www3.epa.gov
30
- 60 https://archive.org/details/epa.gov4
31
- 47 https://archive.org/details/epa.gov5
32
- ...
28
+ ```shell
29
+ waybackprov https://twitter.com/EPAScottPruitt
30
+
31
+ crawls collections
32
+ 364 https://archive.org/details/focused_crawls
33
+ 306 https://archive.org/details/edgi_monitor
34
+ 151 https://archive.org/details/www3.epa.gov
35
+ 60 https://archive.org/details/epa.gov4
36
+ 47 https://archive.org/details/epa.gov5
37
+ ```
33
38
 
34
39
  The first column contains the number of crawls for a particular URL, and the
35
40
  second column contains the URL for the Internet Archive collection that added
36
41
  it.
37
42
 
43
+ When evaluating the counts, keep in mind that collections can contain other collections: `epa.gov4` in the example above, for instance, is part of the `edgi_monitor` collection.
44
+
38
45
  ## Time
39
46
 
40
47
  By default waybackprov only looks at the previous and current year. If you would
41
48
  like it to examine a different range of years, use the `--start` and `--end` options:
42
49
 
43
- % waybackprov --start 2016 --end 2018 https://twitter.com/EPAScottPruitt
50
+ ```shell
51
+ waybackprov --start 2016 --end 2018 https://twitter.com/EPAScottPruitt
52
+ ```
44
53
 
45
54
  ## Multiple Pages
46
55
 
47
56
  If you would like to look at all URLs under a particular URL prefix, you can use the
48
57
  `--prefix` option:
49
58
 
50
- % waybackprov --prefix https://twitter.com/EPAScottPruitt
59
+ ```shell
60
+ waybackprov --prefix https://twitter.com/EPAScottPruitt
61
+ ```
51
62
 
52
63
  This uses the Internet Archive's [CDX API](https://github.com/webrecorder/pywb/wiki/CDX-Server-API) to also include URLs that extend the URL you supply, so it would include, for example:
53
64
 
@@ -63,7 +74,9 @@ interested in is highly recommended since it prevents lots of lookups for CSS,
63
74
  JavaScript and image files that are components of the resource that was
64
75
  initially crawled.
65
76
 
66
- % waybackprov --prefix --match 'status/\d+$' https://twitter.com/EPAScottPruitt
77
+ ```
78
+ waybackprov --prefix --match 'status/\d+$' https://twitter.com/EPAScottPruitt
79
+ ```
67
80
 
68
81
  ## Collections
69
82
 
@@ -88,14 +101,15 @@ rather than a summary.
88
101
  If you would like to see detailed information about what *waybackprov* is doing
89
102
  use the `--log` option to supply a file path to log to:
90
103
 
91
- % waybackprov --log waybackprov.log https://example.com/
104
+ ```shell
105
+ waybackprov --log waybackprov.log https://example.com/
106
+ ```
92
107
 
93
108
  ## Test
94
109
 
95
110
  If you would like to run the tests, first install [pytest] and then:
96
111
 
97
- pytest test.py
112
+ uv run pytest test.py
98
113
 
99
114
  [pytest]: https://docs.pytest.org/en/latest/
100
-
101
-
115
+ [uv]: https://docs.astral.sh/uv/
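
The README's `--format json` and `--format csv` options emit one record per crawl rather than the summary shown above. As a rough sketch of the record shape (field names taken from `get_crawls()` in this release; the values below are invented):

```python
# Hypothetical example of a single crawl record as emitted by --format json;
# the field names match get_crawls(), the values are made up.
example_crawl = {
    "status": "200",
    "timestamp": "20180401123456",
    "collections": ["edgi_monitor", "focused_crawls"],
    "url": "https://twitter.com/EPAScottPruitt",
    "wayback_url": "https://web.archive.org/web/20180401123456/https://twitter.com/EPAScottPruitt",
}

# --format csv writes the same fields, joining the collections list with commas
print(",".join(example_crawl["collections"]))
```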
@@ -0,0 +1,30 @@
1
+ [project]
2
+ name = "waybackprov"
3
+ version = "0.1.1"
4
+ description = "Checks the provenance of a URL in the Wayback machine"
5
+ readme = "README.md"
6
+ authors = [
7
+ { name = "Ed Summers", email = "ehs@pobox.com" }
8
+ ]
9
+ requires-python = ">=3.0"
10
+ dependencies = []
11
+
12
+ [project.urls]
13
+ repository = "https://github.com/docnow/waybackprov"
14
+
15
+ [project.scripts]
16
+ waybackprov = "waybackprov:main"
17
+
18
+ [build-system]
19
+ requires = ["uv_build>=0.9.8,<0.10.0"]
20
+ build-backend = "uv_build"
21
+
22
+ [dependency-groups]
23
+ dev = [
24
+ "pytest>=4.6.11",
25
+ ]
26
+
27
+ [tool.pytest.ini_options]
28
+ addopts = "-v -s"
29
+ log_file = "test.log"
30
+ log_file_level = "DEBUG"
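
For context on the `[project.scripts]` table above: installing the package creates a `waybackprov` command that simply calls `main()` in `waybackprov.py`, which parses `sys.argv` with `optparse`. A rough Python equivalent (assuming the package is installed and you are fine with it making live requests to the Wayback Machine):

```python
# Sketch of what the waybackprov console script does: call waybackprov.main()
# after setting sys.argv. The arguments below are only an example.
import sys

from waybackprov import main

sys.argv = [
    "waybackprov",
    "--start", "2016",
    "--end", "2018",
    "https://twitter.com/EPAScottPruitt",
]
main()
```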
@@ -0,0 +1,268 @@
1
+ #!/usr/bin/env python3
2
+
3
+ import re
4
+ import csv
5
+ import sys
6
+ import json
7
+ import time
8
+ import codecs
9
+ import logging
10
+ import datetime
11
+ import optparse
12
+ import collections
13
+
14
+ from urllib.parse import quote
15
+ from urllib.request import urlopen
16
+
17
+ colls = {}
18
+
19
+
20
+ def main():
21
+ now = datetime.datetime.now()
22
+
23
+ parser = optparse.OptionParser("waybackprov [options] <url>")
24
+ parser.add_option("--start", default=now.year - 1, help="start year")
25
+ parser.add_option("--end", default=now.year, help="end year")
26
+ parser.add_option(
27
+ "--format", choices=["text", "csv", "json"], default="text", help="output data"
28
+ )
29
+ parser.add_option(
30
+ "--collapse", action="store_true", help="only display most specific collection"
31
+ )
32
+ parser.add_option("--prefix", action="store_true", help="use url as a prefix")
33
+ parser.add_option("--match", help="limit to urls that match pattern")
34
+ parser.add_option("--log", help="where to log activity to")
35
+ opts, args = parser.parse_args()
36
+
37
+ if opts.log:
38
+ logging.basicConfig(
39
+ filename=opts.log,
40
+ format="%(asctime)s - %(levelname)s - %(message)s",
41
+ level=logging.INFO,
42
+ )
43
+ else:
44
+ logging.basicConfig(
45
+ format="%(asctime)s - %(levelname)s - %(message)s", level=logging.WARNING
46
+ )
47
+ if len(args) != 1:
48
+ parser.error("You must supply a URL to lookup")
49
+
50
+ url = args[0]
51
+
52
+ crawl_data = get_crawls(
53
+ url,
54
+ start_year=opts.start,
55
+ end_year=opts.end,
56
+ collapse=opts.collapse,
57
+ prefix=opts.prefix,
58
+ match=opts.match,
59
+ )
60
+
61
+ if opts.format == "text":
62
+ # coll_urls is a dictionary where the key is a collection id and the
63
+ # value is a set of URLs that have been crawled
64
+ coll_urls = {}
65
+
66
+ # coll_counter is a Counter that counts the number of crawls that are
67
+ # in a collection
68
+ coll_counter = collections.Counter()
69
+
70
+ for crawl in crawl_data:
71
+ coll_counter.update(crawl["collections"])
72
+
73
+ # a crawl can appear in multiple collections because of how
74
+ # collections can contain other collections
75
+ for coll in crawl["collections"]:
76
+ # keep track of urls in each collection
77
+ if coll not in coll_urls:
78
+ coll_urls[coll] = set()
79
+ coll_urls[coll].add(crawl["url"])
80
+
81
+ if len(coll_counter) == 0:
82
+ print(
83
+ "No results for %s-%s, consider using --start and --end to broaden."
84
+ % (opts.start, opts.end)
85
+ )
86
+ return
87
+
88
+ if opts.prefix:
89
+ str_format = "%6s %6s %s"
90
+ print(str_format % ("crawls", "urls", "collection"))
91
+ else:
92
+ str_format = "%6s %s"
93
+ print(str_format % ("crawls", "collection"))
94
+
95
+ for coll_id, count in coll_counter.most_common():
96
+ coll_url = f"https://archive.org/details/{coll_id}"
97
+ if opts.prefix:
98
+ print(str_format % (count, len(coll_urls[coll_id]), coll_url))
99
+ else:
100
+ print(str_format % (count, coll_url))
101
+
102
+ elif opts.format == "json":
103
+ data = list(crawl_data)
104
+ print(json.dumps(data, indent=2))
105
+
106
+ elif opts.format == "csv":
107
+ w = csv.DictWriter(
108
+ sys.stdout,
109
+ fieldnames=["timestamp", "status", "collections", "url", "wayback_url"],
110
+ )
111
+ for crawl in crawl_data:
112
+ crawl["collections"] = ",".join(crawl["collections"])
113
+ w.writerow(crawl)
114
+
115
+
116
+ def get_crawls(
117
+ url, start_year=None, end_year=None, collapse=False, prefix=False, match=None
118
+ ):
119
+ if prefix is True:
120
+ for year, sub_url in cdx(
121
+ url, match=match, start_year=start_year, end_year=end_year
122
+ ):
123
+ yield from get_crawls(sub_url, start_year=year, end_year=year)
124
+
125
+ if start_year is None:
126
+ start_year = datetime.datetime.now().year - 1
127
+ else:
128
+ start_year = int(start_year)
129
+ if end_year is None:
130
+ end_year = datetime.datetime.now().year
131
+ else:
132
+ end_year = int(end_year)
133
+
134
+ api = "https://web.archive.org/__wb/calendarcaptures?url=%s&selected_year=%s"
135
+ for year in range(start_year, end_year + 1):
136
+ # This calendar data structure reflects the layout of a calendar
137
+ # month, so some spots in the first and last rows are null, and a day
138
+ # has no data if the URL wasn't crawled then.
139
+ logging.info("getting calendar year %s for %s", year, url)
140
+ cal = get_json(api % (url, year))
141
+ for month in cal:
142
+ for week in month:
143
+ for day in week:
144
+ if day is None or day == {}:
145
+ continue
146
+ # note: we can't seem to rely on 'cnt' as a count
147
+ for i in range(0, len(day["st"])):
148
+ c = {
149
+ "status": day["st"][i],
150
+ "timestamp": day["ts"][i],
151
+ "collections": day["why"][i],
152
+ "url": url,
153
+ }
154
+ c["wayback_url"] = "https://web.archive.org/web/%s/%s" % (
155
+ c["timestamp"],
156
+ url,
157
+ )
158
+ if c["collections"] is None:
159
+ continue
160
+ if collapse and len(c["collections"]) > 0:
161
+ c["collections"] = [deepest_collection(c["collections"])]
162
+ logging.info("found crawl %s", c)
163
+ yield c
164
+
165
+
166
+ def deepest_collection(coll_ids):
167
+ return max(coll_ids, key=get_depth)
168
+
169
+
170
+ def get_collection(coll_id):
171
+ # no need to fetch twice
172
+ if coll_id in colls:
173
+ return colls[coll_id]
174
+
175
+ logging.info("fetching collection %s", coll_id)
176
+
177
+ # get the collection metadata
178
+ url = "https://archive.org/metadata/%s" % coll_id
179
+ data = get_json(url)["metadata"]
180
+
181
+ # make collection into reliable array
182
+ if "collection" in data:
183
+ if type(data["collection"]) is str:
184
+ data["collection"] = [data["collection"]]
185
+ else:
186
+ data["collection"] = []
187
+
188
+ # so we don't have to look it up again
189
+ colls[coll_id] = data
190
+
191
+ return data
192
+
193
+
194
+ def get_depth(coll_id, seen_colls=None):
195
+ coll = get_collection(coll_id)
196
+ if "depth" in coll:
197
+ return coll["depth"]
198
+
199
+ logging.info("calculating depth of %s", coll_id)
200
+
201
+ if len(coll["collection"]) == 0:
202
+ return 0
203
+
204
+ # prevent recursive loops
205
+ if seen_colls is None:
206
+ seen_colls = set()
207
+ if coll_id in seen_colls:
208
+ return 0
209
+ seen_colls.add(coll_id)
210
+
211
+ depth = max(map(lambda id: get_depth(id, seen_colls) + 1, coll["collection"]))
212
+
213
+ coll["depth"] = depth
214
+ logging.info("depth %s = %s", coll_id, depth)
215
+ return depth
216
+
217
+
218
+ def get_json(url):
219
+ count = 0
220
+ while True:
221
+ count += 1
222
+ if count >= 10:
223
+ logging.error("giving up on fetching JSON from %s", url)
224
+ try:
225
+ resp = urlopen(url)
226
+ reader = codecs.getreader("utf-8")
227
+ return json.load(reader(resp))
228
+ except Exception as e:
229
+ logging.debug("caught exception: %s", e)
230
+ logging.debug("sleeping for %s seconds", count * 10)
231
+ time.sleep(count * 10)
232
+ raise Exception("unable to get JSON for %s" % url)
233
+
234
+
235
+ def cdx(url, match=None, start_year=None, end_year=None):
236
+ logging.info("searching cdx for %s with regex %s", url, match)
237
+
238
+ if match:
239
+ try:
240
+ pattern = re.compile(match)
241
+ except Exception as e:
242
+ sys.exit("invalid regular expression: {}".format(e))
243
+ else:
244
+ pattern = None
245
+
246
+ cdx_url = "http://web.archive.org/cdx/search/cdx?url={}&matchType=prefix&from={}&to={}".format(
247
+ quote(url), start_year, end_year
248
+ )
249
+ seen = set()
250
+ results = codecs.decode(urlopen(cdx_url).read(), encoding="utf8")
251
+
252
+ for line in results.split("\n"):
253
+ parts = line.split(" ")
254
+ if len(parts) == 7:
255
+ year = int(parts[1][0:4])
256
+ url = parts[2]
257
+ seen_key = "{}:{}".format(year, url)
258
+ if seen_key in seen:
259
+ continue
260
+ if pattern and not pattern.search(url):
261
+ continue
262
+ seen.add(seen_key)
263
+ logging.info("cdx found %s", url)
264
+ yield (year, url)
265
+
266
+
267
+ if __name__ == "__main__":
268
+ main()
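
Because `waybackprov.py` is a single module, `get_crawls()` can also be called directly from Python instead of going through the command line. A minimal sketch that mirrors the counting done in `main()` (it makes live requests to web.archive.org, so it can be slow; the URL and years are just an example):

```python
# Minimal sketch (not part of the package) of using get_crawls() directly.
import collections

from waybackprov import get_crawls

counter = collections.Counter()
for crawl in get_crawls("https://example.com/", start_year=2018, end_year=2018):
    counter.update(crawl["collections"])

for coll_id, count in counter.most_common(5):
    print(count, f"https://archive.org/details/{coll_id}")
```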
@@ -1,87 +0,0 @@
1
- Give *waybackprov* a URL and it will summarize which Internet Archive
2
- collections have archived the URL. This kind of information can sometimes
3
- provide insight about why a particular web resource or set of web resources were
4
- archived from the web.
5
-
6
- ## Install
7
-
8
- pip install waybackprov
9
-
10
- ## Basic Usage
11
-
12
- To check a particular URL here's how it works:
13
-
14
- % waybackprov https://twitter.com/EPAScottPruitt
15
- 364 https://archive.org/details/focused_crawls
16
- 306 https://archive.org/details/edgi_monitor
17
- 151 https://archive.org/details/www3.epa.gov
18
- 60 https://archive.org/details/epa.gov4
19
- 47 https://archive.org/details/epa.gov5
20
- ...
21
-
22
- The first column contains the number of crawls for a particular URL, and the
23
- second column contains the URL for the Internet Archive collection that added
24
- it.
25
-
26
- ## Time
27
-
28
- By default waybackprov will only look at the current year. If you would like it
29
- to examine a range of years use the `--start` and `--end` options:
30
-
31
- % waybackprov --start 2016 --end 2018 https://twitter.com/EPAScottPruitt
32
-
33
- ## Multiple Pages
34
-
35
- If you would like to look at all URLs at a particular URL prefix you can use the
36
- `--prefix` option:
37
-
38
- % waybackprov --prefix https://twitter.com/EPAScottPruitt
39
-
40
- This will use the Internet Archive's [CDX API](https://github.com/webrecorder/pywb/wiki/CDX-Server-API) to also include URLs that are extensions of the URL you supply, so it would include for example:
41
-
42
- https://twitter.com/EPAScottPruitt/status/1309839080398339
43
-
44
- But it can also include things you may not want, such as:
45
-
46
- https://twitter.com/EPAScottPruitt/status/1309839080398339/media/1
47
-
48
- To further limit the URLs use the `--match` parameter to specify a regular
49
- expression only check particular URLs. Further specifying the URLs you are
50
- interested in is highly recommended since it prevents lots of lookups for CSS,
51
- JavaScript and image files that are components of the resource that was
52
- initially crawled.
53
-
54
- % waybackprov --prefix --match 'status/\d+$' https://twitter.com/EPAScottPruitt
55
-
56
- ## Collections
57
-
58
- One thing to remember when interpreting this data is that collections can
59
- contain other collections. For example the *edgi_monitor* collection is a
60
- sub-collection of *focused_crawls*.
61
-
62
- If you use the `--collapse` option only the most specific collection will be
63
- reported for a given crawl. So if *coll1* is part of *coll2* which is part of
64
- *coll3*, only *coll1* will be reported instead of *coll1*, *coll2* and *coll3*.
65
- This does involve collection metadata lookups at the Internet Archive API, so it
66
- does slow performance significantly.
67
-
68
- ## JSON and CSV
69
-
70
- If you would rather see the raw data as JSON or CSV use the `--format` option.
71
- When you use either of these formats you will see the metadata for each crawl,
72
- rather than a summary.
73
-
74
- ## Log
75
-
76
- If you would like to see detailed information about what *waybackprov* is doing
77
- use the `--log` option to supply the a file path to log to:
78
-
79
- % waybackprov --log waybackprov.log https://example.com/
80
-
81
- ## Test
82
-
83
- If you would like to test it first install [pytest] and then:
84
-
85
- pytest test.py
86
-
87
- [pytest]: https://docs.pytest.org/en/latest/
@@ -1,4 +0,0 @@
1
- [egg_info]
2
- tag_build =
3
- tag_date = 0
4
-
@@ -1,19 +0,0 @@
1
- from setuptools import setup
2
-
3
- with open("README.md") as f:
4
- long_description = f.read()
5
-
6
- if __name__ == "__main__":
7
- setup(
8
- name='waybackprov',
9
- version='0.0.9',
10
- url='https://github.com/edsu/waybackprov',
11
- author='Ed Summers',
12
- author_email='ehs@pobox.com',
13
- py_modules=['waybackprov', ],
14
- description='Checks the provenance of a URL in the Wayback machine',
15
- long_description=long_description,
16
- long_description_content_type="text/markdown",
17
- python_requires='>=3.0',
18
- entry_points={'console_scripts': ['waybackprov = waybackprov:main']}
19
- )
@@ -1,8 +0,0 @@
1
- README.md
2
- setup.py
3
- waybackprov.py
4
- waybackprov.egg-info/PKG-INFO
5
- waybackprov.egg-info/SOURCES.txt
6
- waybackprov.egg-info/dependency_links.txt
7
- waybackprov.egg-info/entry_points.txt
8
- waybackprov.egg-info/top_level.txt
@@ -1,3 +0,0 @@
1
- [console_scripts]
2
- waybackprov = waybackprov:main
3
-
@@ -1 +0,0 @@
1
- waybackprov
@@ -1,249 +0,0 @@
1
- #!/usr/bin/env python3
2
-
3
- import re
4
- import csv
5
- import sys
6
- import json
7
- import time
8
- import codecs
9
- import logging
10
- import operator
11
- import datetime
12
- import optparse
13
- import collections
14
-
15
- from functools import reduce
16
- from urllib.parse import quote
17
- from urllib.request import urlopen
18
-
19
- colls = {}
20
-
21
- def main():
22
- now = datetime.datetime.now()
23
-
24
- parser = optparse.OptionParser('waybackprov.py [options] <url>')
25
- parser.add_option('--start', default=now.year -1, help='start year')
26
- parser.add_option('--end', default=now.year, help='end year')
27
- parser.add_option('--format', choices=['text', 'csv', 'json'],
28
- default='text', help='output data')
29
- parser.add_option('--collapse', action='store_true',
30
- help='only display most specific collection')
31
- parser.add_option('--prefix', action='store_true',
32
- help='use url as a prefix')
33
- parser.add_option('--match', help='limit to urls that match pattern')
34
- parser.add_option('--log', help='where to log activity to')
35
- opts, args = parser.parse_args()
36
-
37
- if opts.log:
38
- logging.basicConfig(
39
- filename=opts.log,
40
- format='%(asctime)s - %(levelname)s - %(message)s',
41
- level=logging.INFO
42
- )
43
- else:
44
- logging.basicConfig(
45
- format='%(asctime)s - %(levelname)s - %(message)s',
46
- level=logging.WARNING
47
- )
48
- if len(args) != 1:
49
- parser.error('You must supply a URL to lookup')
50
-
51
- url = args[0]
52
-
53
- crawl_data = get_crawls(url,
54
- start_year=opts.start,
55
- end_year=opts.end,
56
- collapse=opts.collapse,
57
- prefix=opts.prefix,
58
- match=opts.match
59
- )
60
-
61
- if opts.format == 'text':
62
- crawls = 0
63
- coll_urls = {}
64
- coll_counter = collections.Counter()
65
- for crawl in crawl_data:
66
- crawls += 1
67
- coll_counter.update(crawl['collections'])
68
- for coll in crawl['collections']:
69
- # keep track of urls in each collection
70
- if coll not in coll_urls:
71
- coll_urls[coll] = set()
72
- coll_urls[coll].add(crawl['url'])
73
-
74
- if len(coll_counter) == 0:
75
- print('No results for %s-%s, consider using --start and --end to broaden.' % (opts.start, opts.end))
76
- return
77
-
78
- max_pos = str(len(str(coll_counter.most_common(1)[0][1])))
79
- if opts.prefix:
80
- str_format = '%' + max_pos + 'i %' + max_pos + 'i https://archive.org/details/%s'
81
- else:
82
- str_format = '%' + max_pos + 'i https://archive.org/details/%s'
83
-
84
- for coll_id, count in coll_counter.most_common():
85
- if opts.prefix:
86
- print(str_format % (count, len(coll_urls[coll_id]), coll_id))
87
- else:
88
- print(str_format % (count, coll_id))
89
-
90
- print('')
91
- print('total crawls %s-%s: %s' % (opts.start, opts.end, crawls))
92
- if (opts.prefix):
93
- total_urls = len(reduce(operator.or_, coll_urls.values()))
94
- print('total urls: %s' % total_urls)
95
-
96
- elif opts.format == 'json':
97
- data = list(crawl_data)
98
- print(json.dumps(data, indent=2))
99
-
100
- elif opts.format == 'csv':
101
- w = csv.DictWriter(sys.stdout,
102
- fieldnames=['timestamp', 'status', 'collections', 'url', 'wayback_url'])
103
- for crawl in crawl_data:
104
- crawl['collections'] = ','.join(crawl['collections'])
105
- w.writerow(crawl)
106
-
107
- def get_crawls(url, start_year=None, end_year=None, collapse=False,
108
- prefix=False, match=None):
109
-
110
- if prefix == True:
111
- for year, sub_url in cdx(url, match=match, start_year=start_year,
112
- end_year=end_year):
113
- yield from get_crawls(sub_url, start_year=year, end_year=year)
114
-
115
- if start_year is None:
116
- start_year = datetime.datetime.now().year - 1
117
- else:
118
- start_year = int(start_year)
119
- if end_year is None:
120
- end_year = datetime.datetime.now().year
121
- else:
122
- end_year = int(end_year)
123
-
124
- api = 'https://web.archive.org/__wb/calendarcaptures?url=%s&selected_year=%s'
125
- for year in range(start_year, end_year + 1):
126
- # This calendar data structure reflects the layout of a calendar
127
- # month. So some spots in the first and last row are null. Not
128
- # every day has any data if the URL wasn't crawled then.
129
- logging.info("getting calendar year %s for %s", year, url)
130
- cal = get_json(api % (url, year))
131
- found = False
132
- for month in cal:
133
- for week in month:
134
- for day in week:
135
- if day is None or day == {}:
136
- continue
137
- # note: we can't seem to rely on 'cnt' as a count
138
- for i in range(0, len(day['st'])):
139
- c = {
140
- 'status': day['st'][i],
141
- 'timestamp': day['ts'][i],
142
- 'collections': day['why'][i],
143
- 'url': url
144
- }
145
- c['wayback_url'] = 'https://web.archive.org/web/%s/%s' % (c['timestamp'], url)
146
- if c['collections'] is None:
147
- continue
148
- if collapse and len(c['collections']) > 0:
149
- c['collections'] = [deepest_collection(c['collections'])]
150
- logging.info('found crawl %s', c)
151
- found = True
152
- yield c
153
-
154
- def deepest_collection(coll_ids):
155
- return max(coll_ids, key=get_depth)
156
-
157
- def get_collection(coll_id):
158
- # no need to fetch twice
159
- if coll_id in colls:
160
- return colls[coll_id]
161
-
162
- logging.info('fetching collection %s', coll_id)
163
-
164
- # get the collection metadata
165
- url = 'https://archive.org/metadata/%s' % coll_id
166
- data = get_json(url)['metadata']
167
-
168
- # make collection into reliable array
169
- if 'collection' in data:
170
- if type(data['collection']) == str:
171
- data['collection'] = [data['collection']]
172
- else:
173
- data['collection'] = []
174
-
175
- # so we don't have to look it up again
176
- colls[coll_id] = data
177
-
178
- return data
179
-
180
- def get_depth(coll_id, seen_colls=None):
181
- coll = get_collection(coll_id)
182
- if 'depth' in coll:
183
- return coll['depth']
184
-
185
- logging.info('calculating depth of %s', coll_id)
186
-
187
- if len(coll['collection']) == 0:
188
- return 0
189
-
190
- # prevent recursive loops
191
- if seen_colls == None:
192
- seen_colls = set()
193
- if coll_id in seen_colls:
194
- return 0
195
- seen_colls.add(coll_id)
196
-
197
- depth = max(map(lambda id: get_depth(id, seen_colls) + 1, coll['collection']))
198
-
199
- coll['depth'] = depth
200
- logging.info('depth %s = %s', coll_id, depth)
201
- return depth
202
-
203
- def get_json(url):
204
- count = 0
205
- while True:
206
- count += 1
207
- if count >= 10:
208
- logging.error("giving up on fetching JSON from %s", url)
209
- try:
210
- resp = urlopen(url)
211
- reader = codecs.getreader('utf-8')
212
- return json.load(reader(resp))
213
- except Exception as e:
214
- logging.error('caught exception: %s', e)
215
- logging.info('sleeping for %s seconds', count * 10)
216
- time.sleep(count * 10)
217
- raise(Exception("unable to get JSON for %s", url))
218
-
219
- def cdx(url, match=None, start_year=None, end_year=None):
220
- logging.info('searching cdx for %s with regex %s', url, match)
221
-
222
- if match:
223
- try:
224
- pattern = re.compile(match)
225
- except Exception as e:
226
- sys.exit('invalid regular expression: {}'.format(e))
227
- else:
228
- pattern = None
229
-
230
- cdx_url = 'http://web.archive.org/cdx/search/cdx?url={}&matchType=prefix&from={}&to={}'.format(quote(url), start_year, end_year)
231
- seen = set()
232
- results = codecs.decode(urlopen(cdx_url).read(), encoding='utf8')
233
-
234
- for line in results.split('\n'):
235
- parts = line.split(' ')
236
- if len(parts) == 7:
237
- year = int(parts[1][0:4])
238
- url = parts[2]
239
- seen_key = '{}:{}'.format(year, url)
240
- if seen_key in seen:
241
- continue
242
- if pattern and not pattern.search(url):
243
- continue
244
- seen.add(seen_key)
245
- logging.info('cdx found %s', url)
246
- yield(year, url)
247
-
248
- if __name__ == "__main__":
249
- main()