waymore 5.0__tar.gz → 6.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {waymore-5.0/waymore.egg-info → waymore-6.0}/PKG-INFO +6 -4
- {waymore-5.0 → waymore-6.0}/README.md +5 -3
- waymore-6.0/waymore/__init__.py +1 -0
- {waymore-5.0 → waymore-6.0}/waymore/waymore.py +492 -68
- {waymore-5.0 → waymore-6.0/waymore.egg-info}/PKG-INFO +6 -4
- waymore-5.0/waymore/__init__.py +0 -1
- {waymore-5.0 → waymore-6.0}/LICENSE +0 -0
- {waymore-5.0 → waymore-6.0}/setup.cfg +0 -0
- {waymore-5.0 → waymore-6.0}/setup.py +0 -0
- {waymore-5.0 → waymore-6.0}/waymore.egg-info/SOURCES.txt +0 -0
- {waymore-5.0 → waymore-6.0}/waymore.egg-info/dependency_links.txt +0 -0
- {waymore-5.0 → waymore-6.0}/waymore.egg-info/entry_points.txt +0 -0
- {waymore-5.0 → waymore-6.0}/waymore.egg-info/requires.txt +0 -0
- {waymore-5.0 → waymore-6.0}/waymore.egg-info/top_level.txt +0 -0
{waymore-5.0/waymore.egg-info → waymore-6.0}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: waymore
-Version: 5.0
+Version: 6.0
 Summary: Find way more from the Wayback Machine, Common Crawl, Alien Vault OTX, URLScan & VirusTotal!
 Home-page: https://github.com/xnl-h4ck3r/waymore
 Author: @xnl-h4ck3r
@@ -15,7 +15,7 @@ Requires-Dist: tldextract
 
 <center><img src="https://github.com/xnl-h4ck3r/waymore/blob/main/waymore/images/title.png"></center>
 
-## About - v5.0
+## About - v6.0
 
 The idea behind **waymore** is to find even more links from the Wayback Machine than other existing tools.
 
@@ -112,7 +112,7 @@ pipx install git+https://github.com/xnl-h4ck3r/waymore.git
 | -c | --config | Path to the YML config file. If not passed, it looks for file `config.yml` in the default directory, typically `~/.config/waymore`. |
 | -wrlr | --wayback-rate-limit-retry | The number of minutes the user wants to wait for a rate limit pause on Wayback Machine (archive.org) instead of stopping with a `429` error (default: 3). |
 | -urlr | --urlscan-rate-limit-retry | The number of minutes the user wants to wait for a rate limit pause on URLScan.io instead of stopping with a `429` error (default: 1). |
-| -co | --check-only | This will make a few minimal requests to show you how many requests, and roughly how long it could take, to get URLs from the sources and downloaded responses from Wayback Machine. |
+| -co | --check-only | This will make a few minimal requests to show you how many requests, and roughly how long it could take, to get URLs from the sources and downloaded responses from Wayback Machine (unfortunately it isn't possible to check how long it will take to download responses from URLScan). |
 | -nd | --notify-discord | Whether to send a notification to Discord when waymore completes. It requires `WEBHOOK_DISCORD` to be provided in the `config.yml` file. |
 | -oijs | --output-inline-js | Whether to save combined inline javascript of all relevant files in the response directory when `-mode R` (or `-mode B`) has been used. The files are saved with the name `combinedInline{}.js` where `{}` is the number of the file, saving 1000 unique scripts per file. The file `combinedInlineSrc.txt` will also be created, containing the `src` value of all external scripts referenced in the files. |
 | -v | --verbose | Verbose output |
@@ -284,7 +284,9 @@ If you come across any problems at all, or have ideas for improvements, please f
 
 ## TODO
 
-- Add an `-oos` argument that accepts a file of Out Of Scope subdomains/URLs that will not be returned in the output, or have any responses downloaded
+- Add an `-oos` argument that accepts a file of Out Of Scope subdomains/URLs that will not be returned in the output, or have any responses downloaded.
+- The `waymore_index.txt` isn't de-duplicated if run multiple times for the same input with `-mode R` or `-mode B`.
+- Rewrite to get from sources in parallel. Currently they are run consecutively, sorry!
 
 ## References
 
{waymore-5.0 → waymore-6.0}/README.md

@@ -1,6 +1,6 @@
 <center><img src="https://github.com/xnl-h4ck3r/waymore/blob/main/waymore/images/title.png"></center>
 
-## About - v5.0
+## About - v6.0
 
 The idea behind **waymore** is to find even more links from the Wayback Machine than other existing tools.
 
@@ -97,7 +97,7 @@ pipx install git+https://github.com/xnl-h4ck3r/waymore.git
 | -c | --config | Path to the YML config file. If not passed, it looks for file `config.yml` in the default directory, typically `~/.config/waymore`. |
 | -wrlr | --wayback-rate-limit-retry | The number of minutes the user wants to wait for a rate limit pause on Wayback Machine (archive.org) instead of stopping with a `429` error (default: 3). |
 | -urlr | --urlscan-rate-limit-retry | The number of minutes the user wants to wait for a rate limit pause on URLScan.io instead of stopping with a `429` error (default: 1). |
-| -co | --check-only | This will make a few minimal requests to show you how many requests, and roughly how long it could take, to get URLs from the sources and downloaded responses from Wayback Machine. |
+| -co | --check-only | This will make a few minimal requests to show you how many requests, and roughly how long it could take, to get URLs from the sources and downloaded responses from Wayback Machine (unfortunately it isn't possible to check how long it will take to download responses from URLScan). |
 | -nd | --notify-discord | Whether to send a notification to Discord when waymore completes. It requires `WEBHOOK_DISCORD` to be provided in the `config.yml` file. |
 | -oijs | --output-inline-js | Whether to save combined inline javascript of all relevant files in the response directory when `-mode R` (or `-mode B`) has been used. The files are saved with the name `combinedInline{}.js` where `{}` is the number of the file, saving 1000 unique scripts per file. The file `combinedInlineSrc.txt` will also be created, containing the `src` value of all external scripts referenced in the files. |
 | -v | --verbose | Verbose output |
@@ -269,7 +269,9 @@ If you come across any problems at all, or have ideas for improvements, please f
 
 ## TODO
 
-- Add an `-oos` argument that accepts a file of Out Of Scope subdomains/URLs that will not be returned in the output, or have any responses downloaded
+- Add an `-oos` argument that accepts a file of Out Of Scope subdomains/URLs that will not be returned in the output, or have any responses downloaded.
+- The `waymore_index.txt` isn't de-duplicated if run multiple times for the same input with `-mode R` or `-mode B`.
+- Rewrite to get from sources in parallel. Currently they are run consecutively, sorry!
 
 ## References
 
waymore-6.0/waymore/__init__.py ADDED

@@ -0,0 +1 @@
+__version__="6.0"
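The new `waymore/__init__.py` holds the version string that `waymore.py` interpolates into its fixed User-Agent for urlscan.io requests (seen later in this diff). A minimal sketch of that relationship; the import path is an assumption, as the import itself is not part of this diff:

```python
# waymore/__init__.py
__version__ = "6.0"

# waymore/waymore.py (sketch, assumed import path)
from waymore import __version__

# The identifying User-Agent built from the package version
userAgent = "waymore v" + __version__ + " by xnl-h4ck3r"
```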
{waymore-5.0 → waymore-6.0}/waymore/waymore.py

@@ -58,10 +58,12 @@ stopSource = False
 successCount = 0
 failureCount = 0
 fileCount = 0
+totalFileCount = 0
 totalResponses = 0
 totalPages = 0
 indexFile = None
 continueRespFile = None
+continueRespFileURLScan = None
 inputIsDomainANDPath = False
 inputIsSubDomain = False
 subs = '*.'
@@ -82,12 +84,14 @@ checkVirusTotal = 0
 checkIntelx = 0
 argsInputHostname = ''
 responseOutputDirectory = ''
+urlscanRequestLinks = set()
 
 # Source Provider URLs
 WAYBACK_URL = 'https://web.archive.org/cdx/search/cdx?url={DOMAIN}{COLLAPSE}&fl=timestamp,original,mimetype,statuscode,digest'
 CCRAWL_INDEX_URL = 'https://index.commoncrawl.org/collinfo.json'
 ALIENVAULT_URL = 'https://otx.alienvault.com/api/v1/indicators/{TYPE}/{DOMAIN}/url_list?limit=500'
-URLSCAN_URL = 'https://urlscan.io/api/v1/search/?q=domain:{DOMAIN}&size=10000'
+URLSCAN_URL = 'https://urlscan.io/api/v1/search/?q=domain:{DOMAIN}{DATERANGE}&size=10000'
+URLSCAN_DOM_URL = 'https://urlscan.io/dom/'
 VIRUSTOTAL_URL = 'https://www.virustotal.com/vtapi/v2/domain/report?apikey={APIKEY}&domain={DOMAIN}'
 INTELX_SEARCH_URL = 'https://2.intelx.io/phonebook/search'
 INTELX_RESULTS_URL = 'https://2.intelx.io/phonebook/search/result?id='
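These provider endpoints are plain string templates, and placeholders such as `{DOMAIN}` and the new `{DATERANGE}` are filled in with `str.replace` by the code further down. A small sketch of the substitution, using an illustrative helper name that is not in the source:

```python
from urllib.parse import quote

URLSCAN_URL = 'https://urlscan.io/api/v1/search/?q=domain:{DOMAIN}{DATERANGE}&size=10000'

def build_urlscan_search_url(domain, date_range=''):
    # date_range is either '' or a pre-encoded '%20date:[...]' clause (see getURLScanUrls)
    return URLSCAN_URL.replace('{DOMAIN}', quote(domain)).replace('{DATERANGE}', date_range)

print(build_urlscan_search_url('example.com'))
# https://urlscan.io/api/v1/search/?q=domain:example.com&size=10000
```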
@@ -952,7 +956,7 @@ def processArchiveUrl(url):
                 indexFile.write(hashValue+','+archiveUrl+' ,'+timestamp+'\n')
                 indexFile.flush()
             except Exception as e:
-                writerr(colored(getSPACER('[ ERR ] Failed to write to
+                writerr(colored(getSPACER('[ ERR ] Failed to write to waymore_index.txt for "' + archiveUrl + '": '+ str(e)), 'red'))
 
         # FOR DEBUGGING PURPOSES
         try:
@@ -1188,27 +1192,6 @@ def processURLOutput():
         if verbose():
             writerr(colored("ERROR processURLOutput 1: " + str(e), "red"))
 
-def processResponsesOutput():
-    """
-    Show results of the archive responses saved
-    """
-    global successCount, failureCount, subs, fileCount, argsInput, DEFAULT_OUTPUT_DIR, responseOutputDirectory
-    try:
-
-        if failureCount > 0:
-            if verbose():
-                write(colored('\nResponses saved to ','cyan')+colored(responseOutputDirectory,'white') + colored(' for ' + subs + argsInput + ': ', 'cyan')+colored(str(fileCount) +' (' +str(successCount-fileCount) + ' empty responses) 🤘','white')+colored(' (' + str(failureCount) + ' failed)\n','red'))
-            else:
-                write(colored('\nResponses saved for ' + subs + argsInput + ': ', 'cyan')+colored(str(fileCount) +' (' +str(successCount-fileCount) + ' empty responses) 🤘','white')+colored(' (' + str(failureCount) + ' failed)\n','red'))
-        else:
-            if verbose():
-                write(colored('\nResponses saved to ','cyan')+colored(responseOutputDirectory,'white') + colored(' for ' + subs + argsInput + ': ', 'cyan')+colored(str(fileCount) +' (' +str(successCount-fileCount) + ' empty responses) 🤘\n','white'))
-            else:
-                write(colored('\nResponses saved for ' + subs + argsInput + ': ', 'cyan')+colored(str(fileCount) +' (' +str(successCount-fileCount) + ' empty responses) 🤘\n','white'))
-    except Exception as e:
-        if verbose():
-            writerr(colored("ERROR processResponsesOutput 1: " + str(e), "red"))
-
 def validateArgProcesses(x):
     """
     Validate the -p / --processes argument
@@ -1280,6 +1263,38 @@ def validateArgStatusCodes(x):
         raise argparse.ArgumentTypeError('Pass HTTP status codes separated by a comma')
     return x
 
+def validateArgDate(x):
+    """
+    Validate the -from and -to arguments
+    """
+    invalid = False
+
+    # Map string lengths to their corresponding datetime formats
+    formats = {
+        4: "%Y",
+        6: "%Y%m",
+        8: "%Y%m%d",
+        10: "%Y%m%d%H",
+        12: "%Y%m%d%H%M",
+        14: "%Y%m%d%H%M%S",
+    }
+
+    # Check if length matches any expected format
+    fmt = formats.get(len(x))
+    if not fmt:
+        invalid = True
+
+    # Try to parse with the matching format
+    try:
+        datetime.strptime(x, fmt)
+    except ValueError:
+        invalid = True
+
+    # If invalid then return an error
+    if invalid:
+        raise argparse.ArgumentTypeError('A valid date/time needs to be passed in YYYYMMDDhhmmss format (or part of, e.g. YYYYMM).')
+    return x
+
 def validateArgMimeTypes(x):
     """
     Validate the -ft and -mt arguments
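This validator is wired up as an argparse `type` near the end of the diff, where `-from`/`-to` change to `type=validateArgDate`. One caveat: when no format matches the input length, `fmt` is `None` and `datetime.strptime(x, None)` raises `TypeError`, which the `except ValueError` above does not catch. A guarded standalone sketch of the same idea:

```python
import argparse
from datetime import datetime

FORMATS = {4: "%Y", 6: "%Y%m", 8: "%Y%m%d", 10: "%Y%m%d%H", 12: "%Y%m%d%H%M", 14: "%Y%m%d%H%M%S"}
ERROR = 'A valid date/time needs to be passed in YYYYMMDDhhmmss format (or part of, e.g. YYYYMM).'

def validate_arg_date(x):
    # Only call strptime when a format actually matches the input length
    fmt = FORMATS.get(len(x))
    if fmt is None:
        raise argparse.ArgumentTypeError(ERROR)
    try:
        datetime.strptime(x, fmt)
    except ValueError:
        raise argparse.ArgumentTypeError(ERROR)
    return x

parser = argparse.ArgumentParser()
parser.add_argument('-from', '--from-date', type=validate_arg_date, metavar='<yyyyMMddhhmmss>')
print(parser.parse_args(['-from', '202201']).from_date)  # 202201
```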
@@ -1316,7 +1331,7 @@ def validateArgProviders(x):
             invalid = True
             break
     if invalid:
-        raise argparse.ArgumentTypeError('Pass providers separated by a comma, e.g. wayback,commoncrawl,otx,urlscan,virustotal')
+        raise argparse.ArgumentTypeError('Pass providers separated by a comma, e.g. wayback,commoncrawl,otx,urlscan,virustotal,intelx')
     return x
 
 def processAlienVaultPage(url):
@@ -1534,17 +1549,22 @@ def getAlienVaultUrls():
     except Exception as e:
         writerr(colored('ERROR getAlienVaultUrls 1: ' + str(e), 'red'))
 
-def processURLScanUrl(url, httpCode, mimeType):
+def processURLScanUrl(url, httpCode, mimeType, urlscanID=''):
     """
     Process a specific URL from urlscan.io to determine whether to save the link
     """
-    global argsInput, argsInputHostname
+    global argsInput, argsInputHostname, urlscanRequestLinks
 
     addLink = True
 
-    try:
+    try:
+        # If the input has a / in it, then a URL was passed, so the link will only be added if the URL matches
+        if url.contains('/'):
+            if argsInput not in url:
+                addLink = False
+
         # If filters are required then test them
-        if not args.filter_responses_only:
+        if addLink and not args.filter_responses_only:
 
             # If the user requested -n / --no-subs then we don't want to add it if it has a sub domain (www. will not be classed as a sub domain)
             if args.no_subs:
|
|
|
1593
1613
|
|
|
1594
1614
|
# Add MIME Types if --verbose option was selected
|
|
1595
1615
|
if verbose():
|
|
1596
|
-
|
|
1616
|
+
if mimeType.strip() != '':
|
|
1617
|
+
linkMimes.add(mimeType)
|
|
1597
1618
|
|
|
1598
1619
|
# Add link if it passed filters
|
|
1599
1620
|
if addLink:
|
|
@@ -1608,11 +1629,183 @@ def processURLScanUrl(url, httpCode, mimeType, urlscanID=''):
                 # Check the URL
                 match = re.search(r'(^|\.)'+re.escape(argsInputHostname)+'$', domainOnly, flags=re.IGNORECASE)
                 if match is not None:
-                    linksFoundAdd(url)
-
+                    if args.mode in ('U','B'):
+                        linksFoundAdd(url)
+                    # If Response mode is requested then add the DOM ID to try later, for the number of responses wanted
+                    if urlscanID != '' and args.mode in ('R','B'):
+                        if args.limit == 0 or len(urlscanRequestLinks) < args.limit:
+                            urlscanRequestLinks.add((url, URLSCAN_DOM_URL+urlscanID))
+
     except Exception as e:
         writerr(colored('ERROR processURLScanUrl 1: ' + str(e), 'red'))
+
+def getURLScanDOM(originalUrl, domUrl):
+    """
+    Get the DOM for the passed URLScan link
+    """
+    global stopProgram, successCount, failureCount, fileCount, DEFAULT_OUTPUT_DIR, totalResponses, indexFile, argsInput, continueRespFileURLScan, REGEX_404
+    try:
+        if stopProgram is None:
+
+            hashValue = ''
 
+            # Get memory usage every 100 responses
+            if (successCount + failureCount) % 100 == 0:
+                try:
+                    getMemory()
+                except:
+                    pass
+
+            # Make a request to URLScan
+            try:
+                try:
+                    # Choose a random user agent string to use for any requests
+                    userAgent = "waymore v"+__version__+" by xnl-h4ck3r"
+                    session = requests.Session()
+                    session.mount('https://', HTTP_ADAPTER)
+                    session.mount('http://', HTTP_ADAPTER)
+                    resp = session.get(domUrl, headers={"User-Agent":userAgent}, allow_redirects = True)
+                    archiveHtml = str(resp.text)
+
+                    # If there is a specific URLScan error in the response, raise an exception
+                    if archiveHtml.lower().strip() == 'not found!':
+                        raise WayBackException
+
+                    # Only create a file if there is a response
+                    if len(archiveHtml) != 0:
+
+                        # Add the URL as a comment at the start of the response
+                        if args.url_filename:
+                            archiveHtml = '/* Original URL: ' + originalUrl + ' */\n' + archiveHtml
+
+                        # Create file name based on url or hash value of the response, depending on selection. Ensure the file name isn't over 255 characters
+                        if args.url_filename:
+                            fileName = originalUrl.replace('/','-').replace(':','')
+                            fileName = fileName[0:254]
+                        else:
+                            hashValue = filehash(archiveHtml)
+                            fileName = hashValue
+
+                        # Determine extension of file from the content-type using the mimetypes library
+                        extension = ''
+                        try:
+                            # Get path extension
+                            targetUrl = 'https://' + originalUrl.split("://")[1]
+                            parsed = urlparse(targetUrl.strip())
+                            path = parsed.path
+                            extension = path[path.rindex('.')+1:]
+                            # If there is a / in the extension then it's not the extension at all, so reset to blank
+                            if '/' in extension:
+                                extension = ''
+                        except:
+                            pass
+
+                        # If the extension is blank, numeric, longer than 4 characters or not alphanumeric - then set to html if the content ends with HTML tag, otherwise set to unknown
+                        if extension == '':
+                            if archiveHtml.lower().strip().endswith('</html>') or archiveHtml.lower().strip().endswith('</body>') or archiveHtml.lower().strip().startswith('<!doctype html') or archiveHtml.lower().strip().startswith('<html') or archiveHtml.lower().strip().startswith('<head'):
+                                extension = 'html'
+                            else:
+                                extension = 'unknown'
+
+                        fileName = fileName + '.' + extension
+
+                        # If -oR / --output-responses was passed then add the file to that directory,
+                        # else add to the default "results/{target.domain}" directory in the same path as the .py file
+                        if args.output_responses != '':
+                            filePath = args.output_responses + '/' + f'{fileName}'
+                        else:
+                            filePath = (DEFAULT_OUTPUT_DIR + '/results/' + str(argsInput).replace('/','-') + '/' + f'{fileName}')
+
+                        # Write the file
+                        try:
+                            responseFile = open(filePath, 'w', encoding='utf8')
+                            responseFile.write(archiveHtml)
+                            responseFile.close()
+                            fileCount = fileCount + 1
+                        except Exception as e:
+                            writerr(colored(getSPACER('[ ERR ] Failed to write file ' + filePath + ': '+ str(e)), 'red'))
+
+                        # Write the hash value and URL to the index file
+                        if not args.url_filename:
+                            try:
+                                timestamp = str(datetime.now())
+                                indexFile.write(hashValue+','+domUrl+'/'+originalUrl+' ,'+timestamp+'\n')
+                                indexFile.flush()
+                            except Exception as e:
+                                writerr(colored(getSPACER('[ ERR ] Failed to write to waymore_index.txt for "' + domUrl + '": '+ str(e)), 'red'))
+
+                    successCount = successCount + 1
+
+                except WayBackException as wbe:
+                    failureCount = failureCount + 1
+
+                except Exception as e:
+                    failureCount = failureCount + 1
+                    if verbose():
+                        try:
+                            writerr(colored(getSPACER('[ ' + str(resp.status_code) +' ] Failed to get response for "' + domUrl + '"'), 'red'))
+                        except:
+                            writerr(colored(getSPACER('[ ERR ] Failed to get response for "' + domUrl + '": '+ str(e)), 'red'))
+
+                # Show progress bar
+                fillTest = (successCount + failureCount) % 2
+                fillChar = "o"
+                if fillTest == 0:
+                    fillChar = "O"
+                suffix="Complete "
+                # Show memory usage if -v option chosen, and check memory every 25 responses (or if it's the last)
+                if (successCount + failureCount) % 25 == 1 or (successCount + failureCount) == totalResponses:
+                    try:
+                        getMemory()
+                        if verbose():
+                            suffix = (
+                                "Complete (Mem Usage "
+                                + humanReadableSize(currentMemUsage)
+                                + ", Total Mem "
+                                + str(currentMemPercent)
+                                + "%) "
+                            )
+                    except:
+                        if verbose():
+                            suffix = 'Complete (To show mem use, run "pip install psutil")'
+                printProgressBar(
+                    successCount + failureCount,
+                    totalResponses,
+                    prefix="Downloading " + str(totalResponses) + " responses:",
+                    suffix=suffix,
+                    length=getProgressBarLength(),
+                    fill=fillChar
+                )
+
+                # Write the total count to the continueResp.URLScan.tmp file
+                try:
+                    continueRespFileURLScan.seek(0)
+                    continueRespFileURLScan.write(str(successCount + failureCount)+'\n')
+                except Exception as e:
+                    if verbose():
+                        writerr(colored(getSPACER('ERROR getURLScanDOM 2: ' + str(e)), 'red'))
+
+            except Exception as e:
+                if verbose():
+                    writerr(colored(getSPACER('Error for "'+domUrl+'": ' + str(e)), 'red'))
+
+    except Exception as e:
+        writerr(colored('ERROR getURLScanDOM 1: ' + str(e), 'red'))
+
+def format_date_for_urlscan(date_str):
+    # Handle different lengths of input
+    if len(date_str) == 4: # YYYY
+        date_str += "0101"
+    elif len(date_str) == 6: # YYYYMM
+        date_str += "01"
+
+    # Convert to YYYY-MM-DD format
+    try:
+        formatted_date = datetime.strptime(date_str, "%Y%m%d").strftime("%Y-%m-%d")
+        return formatted_date
+    except:
+        return ''
+
 def getURLScanUrls():
     """
     Get URLs from the URLScan API, urlscan.io
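For orientation: each urlscan.io search result has a `_id`, and `getURLScanDOM` downloads the stored DOM snapshot from `https://urlscan.io/dom/<id>`. A stripped-down sketch of just the fetch-and-check portion, without the file naming, index bookkeeping and progress bar (the `Not Found!` sentinel check mirrors the code above):

```python
import requests

URLSCAN_DOM_URL = 'https://urlscan.io/dom/'

def fetch_urlscan_dom(scan_id, version='6.0'):
    # Fixed, identifying User-Agent, matching the code above
    headers = {'User-Agent': 'waymore v' + version + ' by xnl-h4ck3r'}
    resp = requests.get(URLSCAN_DOM_URL + scan_id, headers=headers, allow_redirects=True)
    body = str(resp.text)
    # urlscan.io answers with the literal body 'Not Found!' when no DOM was stored
    if body.lower().strip() == 'not found!':
        return None
    return body if body else None
```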
@@ -1628,9 +1821,26 @@ def getURLScanUrls():
 
         # Set the URL to just the hostname
         url = URLSCAN_URL.replace('{DOMAIN}',quote(argsInputHostname))
+
+        # If the --from-date or --to-date parameters were passed then also add a date filter
+        if args.from_date or args.to_date:
+            if args.from_date:
+                fromDate = format_date_for_urlscan(str(args.from_date)[:8])
+            else:
+                fromDate = '2016-01-01' # The year URLScan started
+            if args.to_date:
+                toDate = format_date_for_urlscan(str(args.to_date)[:8])
+            else:
+                toDate = 'now'
+            url = url.replace('{DATERANGE}',f'%20date:[{fromDate}%20TO%20{toDate}]')
+        else:
+            url = url.replace('{DATERANGE}','')
 
         if verbose():
-
+            if args.mode == 'R':
+                write(colored('The URLScan URL requested to get links for responses: ','magenta')+colored(url+'\n','white'))
+            else:
+                write(colored('The URLScan URL requested to get links: ','magenta')+colored(url+'\n','white'))
 
         if not args.check_only:
             write(colored('\rGetting links from urlscan.io API (this can take a while for some domains)...\r','cyan'))
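Worked through with hypothetical values, the `{DATERANGE}` substitution appends a URL-encoded `date:[from TO to]` clause to the `q=` query (the `2016-01-01` and `now` fallbacks come from the code above):

```python
fromDate, toDate = '2022-01-01', 'now'
url = 'https://urlscan.io/api/v1/search/?q=domain:example.com{DATERANGE}&size=10000'
url = url.replace('{DATERANGE}', f'%20date:[{fromDate}%20TO%20{toDate}]')
print(url)
# https://urlscan.io/api/v1/search/?q=domain:example.com%20date:[2022-01-01%20TO%20now]&size=10000
```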
@@ -1638,7 +1848,10 @@ def getURLScanUrls():
         # Get the first page from urlscan.io
         try:
             # Choose a random user agent string to use for any requests
-            userAgent = random.choice(USER_AGENT)
+            # For other sources we would use `random.choice(USER_AGENT)` to assign a random user-agent, but it seems
+            # that there are a handful of those that ALWAYS return 429. Passing a specific one all the time seems to
+            # be consistently successful
+            userAgent = "waymore v"+__version__+" by xnl-h4ck3r"
             session = requests.Session()
             session.mount('https://', HTTP_ADAPTER)
             session.mount('http://', HTTP_ADAPTER)
@@ -1702,7 +1915,7 @@ def getURLScanUrls():
         totalUrls = 0
 
         # Carry on if something was found
-        if args.check_only:
+        if args.check_only and args.mode != 'R':
             try:
                 hasMore = jsonResp['has_more']
                 if hasMore:
@@ -1768,9 +1981,17 @@ def getURLScanUrls():
                 except:
                     mimeType = ''
 
+                # If we are going to be downloading responses, then get the unique ID to retrieve the DOM later
+                urlscanID = ''
+                if args.mode in ('R','B'):
+                    try:
+                        urlscanID = urlSection['_id']
+                    except:
+                        pass
+
                 # If a URL was found then process it
                 if foundUrl != '':
-                    processURLScanUrl(foundUrl, httpCode, mimeType)
+                    processURLScanUrl(foundUrl, httpCode, mimeType, urlscanID)
 
                 # If a pointer was found then process it
                 if pointer != '':
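So in `-mode R` or `-mode B`, each search result now contributes a `(url, dom_url)` pair, capped by `--limit`. A condensed sketch of that collection step; the `page.url` location of the result URL is an assumption, since the surrounding extraction loop is not shown in this diff:

```python
URLSCAN_DOM_URL = 'https://urlscan.io/dom/'
LIMIT = 5000  # stands in for args.limit; 0 means unlimited
urlscanRequestLinks = set()

def collect_dom_links(jsonResp):
    for urlSection in jsonResp.get('results', []):
        foundUrl = urlSection.get('page', {}).get('url', '')  # assumed field
        urlscanID = urlSection.get('_id', '')
        if foundUrl and urlscanID and (LIMIT == 0 or len(urlscanRequestLinks) < LIMIT):
            urlscanRequestLinks.add((foundUrl, URLSCAN_DOM_URL + urlscanID))
```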
@@ -1789,7 +2010,6 @@ def getURLScanUrls():
             # Get the next page from urlscan.io
             try:
                 # Choose a random user agent string to use for any requests
-                userAgent = random.choice(USER_AGENT)
                 session = requests.Session()
                 session.mount('https://', HTTP_ADAPTER)
                 session.mount('http://', HTTP_ADAPTER)
@@ -1829,19 +2049,20 @@ def getURLScanUrls():
                 jsonResp = json.loads(resp.text.strip())
 
                 # If there are no more results, or if the requests limit was specified and has been exceeded, then stop
-                if jsonResp['results'] is None or len(jsonResp['results']) == 0 or (args.limit_requests != 0 and requestsMade > args.limit_requests):
+                if jsonResp['results'] is None or len(jsonResp['results']) == 0 or (args.limit_requests != 0 and requestsMade > args.limit_requests) or (args.mode == 'R' and args.limit != 0 and requestsMade > args.limit):
                     stopSource = True
 
         # Show the MIME types found (in case user wants to exclude more)
-        if verbose() and len(linkMimes) > 0:
+        if verbose() and len(linkMimes) > 0 and args.mode != 'R':
             linkMimes.discard('warc/revisit')
             write(getSPACER(colored('MIME types found: ','magenta')+colored(str(linkMimes),'white'))+'\n')
 
         linkCount = len(linksFound) - originalLinkCount
-        if args.
-
-
-
+        if args.mode != 'R':
+            if args.xwm and args.xcc and args.xav:
+                write(getSPACER(colored('Links found on urlscan.io: ', 'cyan')+colored(str(linkCount),'white'))+'\n')
+            else:
+                write(getSPACER(colored('Extra links found on urlscan.io: ', 'cyan')+colored(str(linkCount),'white'))+'\n')
 
     except Exception as e:
         writerr(colored('ERROR getURLScanUrls 1: ' + str(e), 'red'))
@@ -1939,7 +2160,9 @@ def processWayBackPage(url):
                 # Only get MIME Types if --verbose option was selected
                 if verbose():
                     try:
-
+                        mimeType = str(results).split(' ')[2]
+                        if mimeType != '':
+                            linkMimes.add(mimeType)
                     except Exception as e:
                         if verbose():
                             writerr(colored(getSPACER('ERROR processWayBackPage 2: Cannot get MIME type from line: ' + str(line)),'red'))
@@ -2176,7 +2399,8 @@ def processCommonCrawlCollection(cdxApiUrl):
                 # Get MIME Types if --verbose option was selected
                 if verbose():
                     try:
-
+                        if data['mime'] != '':
+                            linkMimes.add(data['mime'])
                     except:
                         pass
                 linksFoundAdd(data['url'])
@@ -2713,12 +2937,200 @@ def getIntelxUrls():
     except Exception as e:
         writerr(colored('ERROR getIntelxUrls 1: ' + str(e), 'red'))
 
-def processResponses():
+def processResponses():
+    """
+    Get archived responses from all sources
+    """
+    global stopProgram, totalFileCount
+    try:
+
+        # Get responses from URLScan unless excluded
+        if stopProgram is None and not args.xus:
+            processResponsesURLScan()
+
+        # Get responses from wayback machine unless excluded
+        if stopProgram is None and not args.xwm:
+            processResponsesWayback()
+
+        # If requested, generate the combined inline JS files
+        if not args.check_only and stopProgram is None and totalFileCount > 0 and args.output_inline_js:
+            combineInlineJS()
+
+    except Exception as e:
+        writerr(colored(getSPACER('ERROR processResponses 1: ' + str(e)), 'red'))
+
+def processResponsesURLScan():
+    """
+    Get archived responses from URLScan (urlscan.io)
+    """
+    global subs, path, indexFile, totalResponses, stopProgram, argsInput, continueRespFileURLScan, successCount, fileCount, DEFAULT_OUTPUT_DIR, responseOutputDirectory, urlscanRequestLinks, failureCount, totalFileCount, checkURLScan
+    try:
+        fileCount = 0
+        failureCount = 0
+        if not args.check_only:
+            # Create 'results' and domain directory if needed
+            createDirs()
+
+        # Get the path of the files, depending on whether -oR / --output_responses was passed
+        try:
+            continuePath = responseOutputDirectory + 'continueRes.URLScan.tmp'
+            responsesPath = responseOutputDirectory + 'responses.URLScan.tmp'
+            indexPath = responseOutputDirectory + 'waymore_index.txt'
+        except Exception as e:
+            if verbose():
+                writerr(colored('ERROR processResponsesURLScan 4: ' + str(e), 'red'))
+
+        # Get URLs from URLScan.io if the DOM IDs haven't been retrieved yet
+        if args.mode == 'R' and stopProgram is None and not args.check_only:
+            write(colored('\rGetting list of response links (this can take a while for some domains)...\r','cyan'))
+            getURLScanUrls()
+
+        # Check if the continueResp.URLScan.tmp and responses.URLScan.tmp files exist
+        runPrevious = 'n'
+        if not args.check_only and os.path.exists(continuePath) and os.path.exists(responsesPath):
+
+            # Load the links into the set
+            with open(responsesPath,'rb') as fl:
+                linkRequests = pickle.load(fl)
+            totalPrevResponses = len(linkRequests)
+
+            # Get the previous end position to start again at this point
+            try:
+                with open(continuePath,'r') as fc:
+                    successCount = int(fc.readline().strip())
+            except Exception as e:
+                successCount = 0
+
+            # Ask the user if we should continue with previous run if the current starting position is greater than 0 and less than the total
+            if successCount > 0 and successCount < totalPrevResponses:
+                # If the program is not piped from or to another process, then ask whether to continue with previous run
+                if sys.stdout.isatty() and sys.stdin.isatty():
+                    write(colored('The previous run to get archived responses for ' + argsInput + ' was not completed.\nYou can start from response ' + str(successCount) + ' of ' + str(totalPrevResponses) + ' for the previous run, or you can start a new run with your specified arguments.', 'yellow'))
+                    runPrevious = input('Continue with previous run? y/n: ')
+                else:
+                    if CONTINUE_RESPONSES_IF_PIPED:
+                        runPrevious = 'y'
+                        writerr(colored('The previous run to get archived responses for ' + argsInput + ' was not completed. Starting from response ' + str(successCount) + ' of ' + str(totalPrevResponses) + '... ', 'yellow'))
+                    else:
+                        runPrevious = 'n'
+
+        # If we are going to run a new run
+        if runPrevious.lower() == 'n':
+
+            # Set start point
+            successCount = 0
+
+            # Get the URLScan DOM links
+            linkRequests = []
+            for originalUrl, domUrl in urlscanRequestLinks:
+                linkRequests.append((originalUrl,domUrl))
+
+            # Write the links to a temp file
+            if not args.check_only:
+                with open(responsesPath,'wb') as f:
+                    pickle.dump(linkRequests, f)
+
+        # Get the total number of responses we will try to get and set the current file count to the success count
+        totalResponses = len(linkRequests)
+        checkURLScan = checkURLScan + totalResponses
+
+        # If there are no responses to download, display an error and exit
+        if args.mode != 'R' and totalResponses == 0:
+            writerr(colored(getSPACER('Failed to get responses from URLScan (urlscan.io) - check input and try again.'), 'red'))
+            return
+
+        fileCount = successCount
+
+        if args.check_only:
+            writerr(colored('Downloading archived responses: ','cyan')+colored('UNKNOWN requests','cyan'))
+            writerr(colored('\n-> Downloading the responses can vary depending on the target and the rate limiting on URLScan','green'))
+            # if args.limit == 5000 and totalResponses == 5000:
+            #     writerr(colored('Downloading archived responses: ','cyan')+colored(str(totalResponses+1)+' requests (the --limit argument defaults to '+str(DEFAULT_LIMIT)+')','cyan'))
+            # else:
+            #     writerr(colored('Downloading archived responses: ','cyan')+colored(str(totalResponses+1)+' requests','white'))
+            # minutes = round(totalResponses*2.5 // 60)
+            # hours = minutes // 60
+            # days = hours // 24
+            # if minutes < 5:
+            #     write(colored('\n-> Downloading the responses (depending on their size) should be quite quick!','green'))
+            # elif hours < 2:
+            #     write(colored('\n-> Downloading the responses (depending on their size) could take more than '+str(minutes)+' minutes.','green'))
+            # elif hours < 6:
+            #     write(colored('\n-> Downloading the responses (depending on their size) could take more than '+str(hours)+' hours.','green'))
+            # elif hours < 24:
+            #     write(colored('\n-> Downloading the responses (depending on their size) could take more than '+str(hours)+' hours.','yellow'))
+            # elif days < 7:
+            #     write(colored('\n-> Downloading the responses (depending on their size) could take more than '+str(days)+' days. Consider using arguments -ko, -l, -ci, -from and -to wisely!','red'))
+            # else:
+            #     write(colored('\n-> Downloading the responses (depending on their size) could take more than '+str(days)+' days!!! Consider using arguments -ko, -l, -ci, -from and -to wisely!','red'))
+            write('')
+        else:
+            # If the limit has been set over the default, give a warning that this could take a long time!
+            if totalResponses - successCount > DEFAULT_LIMIT:
+                if successCount > 0:
+                    writerr(colored(getSPACER('WARNING: Downloading remaining ' + str(totalResponses - successCount) + ' responses may take a loooooooong time! Consider using arguments -ko, -l, -ci, -from and -to wisely!'),'yellow'))
+                else:
+                    writerr(colored(getSPACER('WARNING: Downloading ' + str(totalResponses) + ' responses may take a loooooooong time! Consider using arguments -ko, -l, -ci, -from and -to wisely!'),'yellow'))
+
+            # Open the index file if hash value is going to be used (not URL)
+            if not args.url_filename:
+                indexFile = open(indexPath,'a')
+
+            # Open the continue.URLScan.tmp file to store what record we are up to
+            continueRespFileURLScan = open(continuePath,'w+')
+
+            # Process the URLs from URLScan
+            if stopProgram is None:
+                p = mp.Pool(args.processes)
+                p.starmap(getURLScanDOM, linkRequests[successCount:])
+                p.close()
+                p.join()
+
+            # Delete the tmp files now it has run successfully
+            if stopProgram is None:
+                try:
+                    os.remove(continuePath)
+                    os.remove(responsesPath)
+                except:
+                    pass
+
+            # Close the index file if hash value is going to be used (not URL)
+            if not args.url_filename:
+                indexFile.close()
+
+            # Close the continueResp.URLScan.tmp file
+            continueRespFileURLScan.close()
+
+            if not args.check_only:
+                try:
+                    if failureCount > 0:
+                        if verbose():
+                            write(colored('\nURLScan responses saved to ','cyan')+colored(responseOutputDirectory,'white') + colored(' for ' + subs + argsInput + ': ', 'cyan')+colored(str(fileCount) +' (' +str(successCount-fileCount) + ' empty responses) 🤘','white')+colored(' (' + str(failureCount) + ' not found)\n','red'))
+                        else:
+                            write(colored('\nURLScan responses saved for ' + subs + argsInput + ': ', 'cyan')+colored(str(fileCount) +' (' +str(successCount-fileCount) + ' empty responses) 🤘','white')+colored(' (' + str(failureCount) + ' not found)\n','red'))
+                    else:
+                        if verbose():
+                            write(colored('\nURLScan responses saved to ','cyan')+colored(responseOutputDirectory,'white') + colored(' for ' + subs + argsInput + ': ', 'cyan')+colored(str(fileCount) +' (' +str(successCount-fileCount) + ' empty responses) 🤘\n','white'))
+                        else:
+                            write(colored('\nURLScan responses saved for ' + subs + argsInput + ': ', 'cyan')+colored(str(fileCount) +' (' +str(successCount-fileCount) + ' empty responses) 🤘\n','white'))
+                except Exception as e:
+                    if verbose():
+                        writerr(colored("ERROR processResponsesURLScan 5: " + str(e), "red"))
+
+        totalFileCount = totalFileCount + fileCount
+    except Exception as e:
+        writerr(colored(getSPACER('ERROR processResponsesURLScan 1: ' + str(e)), 'red'))
+    finally:
+        linkRequests = None
+
+def processResponsesWayback():
     """
     Get archived responses from Wayback Machine (archive.org)
     """
-    global linksFound, subs, path, indexFile, totalResponses, stopProgram, argsInput, continueRespFile, successCount, fileCount, DEFAULT_OUTPUT_DIR, responseOutputDirectory
+    global linksFound, subs, path, indexFile, totalResponses, stopProgram, argsInput, continueRespFile, successCount, fileCount, DEFAULT_OUTPUT_DIR, responseOutputDirectory, failureCount, totalFileCount
     try:
+        fileCount = 0
+        failureCount = 0
         if not args.check_only:
             # Create 'results' and domain directory if needed
             createDirs()
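`processResponsesURLScan` reuses the resume pattern the Wayback downloader already had: the work list is pickled to `responses.URLScan.tmp`, a completed-count is continually rewritten to the continue file, and a restarted run slices the reloaded list from that offset. A condensed, generic sketch of the pattern (`do_one` stands in for `getURLScanDOM`; the counter writing is elided here but happens inside the worker in the real code):

```python
import os
import pickle
import multiprocessing as mp

def do_one(originalUrl, domUrl):
    pass  # placeholder for getURLScanDOM

def run_with_resume(work, responsesPath, continuePath, processes=1):
    done = 0
    if os.path.exists(responsesPath) and os.path.exists(continuePath):
        # Unfinished previous run: reload its pickled work list and offset
        with open(responsesPath, 'rb') as f:
            work = pickle.load(f)
        with open(continuePath) as f:
            done = int(f.readline().strip() or 0)
    else:
        # New run: persist the work list so a later run can resume it
        with open(responsesPath, 'wb') as f:
            pickle.dump(work, f)
    with mp.Pool(processes) as p:
        p.starmap(do_one, work[done:])
    # Completed cleanly, so the resume files are no longer needed
    for tmp in (responsesPath, continuePath):
        if os.path.exists(tmp):
            os.remove(tmp)
```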
@@ -2727,11 +3139,11 @@ def processResponses():
         try:
             continuePath = responseOutputDirectory + 'continueResp.tmp'
             responsesPath = responseOutputDirectory + 'responses.tmp'
-            indexPath = responseOutputDirectory + '
+            indexPath = responseOutputDirectory + 'waymore_index.txt'
         except Exception as e:
             if verbose():
-                writerr(colored('ERROR
-
+                writerr(colored('ERROR processResponsesWayback 4: ' + str(e), 'red'))
+
         # Check if the continueResp.tmp and responses.tmp files exist
         runPrevious = 'n'
         if not args.check_only and os.path.exists(continuePath) and os.path.exists(responsesPath):
@@ -2820,7 +3232,7 @@
         url = WAYBACK_URL.replace('{DOMAIN}',subs + quote(argsInput) + path).replace('{COLLAPSE}',collapse) + filterMIME + filterCode + filterLimit + filterFrom + filterTo + filterKeywords
 
         if verbose():
-            write(colored('The
+            write(colored('The Wayback Machine URL requested to get responses: ','magenta')+colored(url+'\n','white'))
 
         if args.check_only:
             write(colored('\rChecking archived response requests...\r','cyan'))
@@ -2853,7 +3265,7 @@
             if resp.text == '':
                 writerr(colored(getSPACER('No archived responses were found on Wayback Machine (archive.org) for the given search parameters.'),'red'))
                 success = False
-            # If a status of 429, then stop processing
+            # If a status of 429, then stop processing
             if resp.status_code == 429:
                 writerr(colored(getSPACER('[ 429 ] Wayback Machine (archive.org) rate limit reached, so stopping. Links that have already been retrieved will be saved.'),'red'))
                 success = False
@@ -2890,7 +3302,7 @@
                             originalUrl = parts[1]
                             linksFoundResponseAdd(timestamp+'/'+originalUrl)
                         except Exception as e:
-                            writerr(colored(getSPACER('ERROR
+                            writerr(colored(getSPACER('ERROR processResponsesWayback 3: Cannot get link from line: '+str(line)), 'red'))
 
                 # Remove any links that have URL exclusions
                 linkRequests = []
@@ -2954,7 +3366,7 @@
         if not args.url_filename:
             indexFile = open(indexPath,'a')
 
-        # Open the
+        # Open the continueResp.tmp file to store what record we are up to
         continueRespFile = open(continuePath,'w+')
 
         # Process the URLs from web archive
@@ -2979,8 +3391,26 @@
         # Close the continueResp.tmp file
         continueRespFile.close()
 
+        # Output results if not just checking
+        if not args.check_only:
+            try:
+                if failureCount > 0:
+                    if verbose():
+                        write(colored('\nWayback responses saved to ','cyan')+colored(responseOutputDirectory,'white') + colored(' for ' + subs + argsInput + ': ', 'cyan')+colored(str(fileCount) +' (' +str(successCount-fileCount) + ' empty responses) 🤘','white')+colored(' (' + str(failureCount) + ' failed)\n','red'))
+                    else:
+                        write(colored('\nWayback responses saved for ' + subs + argsInput + ': ', 'cyan')+colored(str(fileCount) +' (' +str(successCount-fileCount) + ' empty responses) 🤘','white')+colored(' (' + str(failureCount) + ' failed)\n','red'))
+                else:
+                    if verbose():
+                        write(colored('\nWayback responses saved to ','cyan')+colored(responseOutputDirectory,'white') + colored(' for ' + subs + argsInput + ': ', 'cyan')+colored(str(fileCount) +' (' +str(successCount-fileCount) + ' empty responses) 🤘\n','white'))
+                    else:
+                        write(colored('\nWayback responses saved for ' + subs + argsInput + ': ', 'cyan')+colored(str(fileCount) +' (' +str(successCount-fileCount) + ' empty responses) 🤘\n','white'))
+            except Exception as e:
+                if verbose():
+                    writerr(colored("ERROR processResponsesWayback 5: " + str(e), "red"))
+
+        totalFileCount = totalFileCount + fileCount
     except Exception as e:
-        writerr(colored(getSPACER('ERROR
+        writerr(colored(getSPACER('ERROR processResponsesWayback 1: ' + str(e)), 'red'))
     finally:
         linkRequests = None
@@ -3096,7 +3526,7 @@ def combineInlineJS():
     try:
         write(colored('Creating combined inline JS files...', 'cyan'))
         outputFileTemplate = "combinedInline{}.js"
-        excludedNames = ['
+        excludedNames = ['waymore_index.txt', 'continueResp.tmp', 'continueResp.URLScan.tmp', 'responses.tmp', 'responses.URLScan.tmp']
         fileList = [name for name in os.listdir(responseOutputDirectory)
                     if os.path.isfile(os.path.join(responseOutputDirectory, name))
                     and not any(name.lower().endswith(ext) for ext in INLINE_JS_EXCLUDE)
@@ -3298,15 +3728,15 @@ def main():
         '-from',
         '--from-date',
         action='store',
-        type=
+        type=validateArgDate,
         help='What date to get responses from. If not specified it will get from the earliest possible results. A partial value can be passed, e.g. 2016, 201805, etc.',
-        metavar='<yyyyMMddhhmmss>'
+        metavar='<yyyyMMddhhmmss>',
     )
     parser.add_argument(
         '-to',
         '--to-date',
         action='store',
-        type=
+        type=validateArgDate,
         help='What date to get responses to. If not specified it will get to the latest possible results. A partial value can be passed, e.g. 2016, 201805, etc.',
         metavar='<yyyyMMddhhmmss>'
     )
@@ -3474,7 +3904,7 @@ def main():
         "-co",
         "--check-only",
         action="store_true",
-        help="This will make a few minimal requests to show you how many requests, and roughly how long it could take, to get URLs from the sources and downloaded responses from Wayback Machine.",
+        help="This will make a few minimal requests to show you how many requests, and roughly how long it could take, to get URLs from the sources and downloaded responses from Wayback Machine (unfortunately it isn't possible to check how long it will take to download responses from URLScan).",
     )
     parser.add_argument(
         "-nd",
@@ -3587,6 +4017,7 @@ def main():
     indexFile = None
     path = ''
     stopSource = False
+    urlscanRequestLinks = set()
 
     # Get the config settings from the config.yml file
     getConfig()
@@ -3638,16 +4069,9 @@ def main():
             responseOutputDirectory = args.output_responses + '/'
         else:
             responseOutputDirectory = str(DEFAULT_OUTPUT_DIR) + '/results/' + str(argsInput).replace('/','-') + '/'
-
+
+        # Get the responses
         processResponses()
-
-        # Output details of the responses downloaded
-        if not args.check_only:
-            processResponsesOutput()
-
-        # If requested, generate the combined inline JS files
-        if stopProgram is None and fileCount > 0 and args.output_inline_js:
-            combineInlineJS()
 
     if args.check_only:
         write(colored('NOTE: The time frames are a very rough guide and doesn\'t take into account additional time for rate limiting.','magenta'))
{waymore-5.0 → waymore-6.0/waymore.egg-info}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: waymore
-Version: 5.0
+Version: 6.0
 Summary: Find way more from the Wayback Machine, Common Crawl, Alien Vault OTX, URLScan & VirusTotal!
 Home-page: https://github.com/xnl-h4ck3r/waymore
 Author: @xnl-h4ck3r
@@ -15,7 +15,7 @@ Requires-Dist: tldextract
 
 <center><img src="https://github.com/xnl-h4ck3r/waymore/blob/main/waymore/images/title.png"></center>
 
-## About - v5.0
+## About - v6.0
 
 The idea behind **waymore** is to find even more links from the Wayback Machine than other existing tools.
 
@@ -112,7 +112,7 @@ pipx install git+https://github.com/xnl-h4ck3r/waymore.git
 | -c | --config | Path to the YML config file. If not passed, it looks for file `config.yml` in the default directory, typically `~/.config/waymore`. |
 | -wrlr | --wayback-rate-limit-retry | The number of minutes the user wants to wait for a rate limit pause on Wayback Machine (archive.org) instead of stopping with a `429` error (default: 3). |
 | -urlr | --urlscan-rate-limit-retry | The number of minutes the user wants to wait for a rate limit pause on URLScan.io instead of stopping with a `429` error (default: 1). |
-| -co | --check-only | This will make a few minimal requests to show you how many requests, and roughly how long it could take, to get URLs from the sources and downloaded responses from Wayback Machine. |
+| -co | --check-only | This will make a few minimal requests to show you how many requests, and roughly how long it could take, to get URLs from the sources and downloaded responses from Wayback Machine (unfortunately it isn't possible to check how long it will take to download responses from URLScan). |
 | -nd | --notify-discord | Whether to send a notification to Discord when waymore completes. It requires `WEBHOOK_DISCORD` to be provided in the `config.yml` file. |
 | -oijs | --output-inline-js | Whether to save combined inline javascript of all relevant files in the response directory when `-mode R` (or `-mode B`) has been used. The files are saved with the name `combinedInline{}.js` where `{}` is the number of the file, saving 1000 unique scripts per file. The file `combinedInlineSrc.txt` will also be created, containing the `src` value of all external scripts referenced in the files. |
 | -v | --verbose | Verbose output |
@@ -284,7 +284,9 @@ If you come across any problems at all, or have ideas for improvements, please f
 
 ## TODO
 
-- Add an `-oos` argument that accepts a file of Out Of Scope subdomains/URLs that will not be returned in the output, or have any responses downloaded
+- Add an `-oos` argument that accepts a file of Out Of Scope subdomains/URLs that will not be returned in the output, or have any responses downloaded.
+- The `waymore_index.txt` isn't de-duplicated if run multiple times for the same input with `-mode R` or `-mode B`.
+- Rewrite to get from sources in parallel. Currently they are run consecutively, sorry!
 
 ## References
 
waymore-5.0/waymore/__init__.py DELETED

@@ -1 +0,0 @@
-__version__="5.0"
The remaining files are unchanged between 5.0 and 6.0: LICENSE, setup.cfg, setup.py, and the waymore.egg-info files SOURCES.txt, dependency_links.txt, entry_points.txt, requires.txt and top_level.txt.