powerdlz23 1.2.3 → 1.2.4
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/Spider/README.md +19 -0
- package/Spider/domain.py +18 -0
- package/Spider/general.py +51 -0
- package/Spider/link_finder.py +25 -0
- package/Spider/main.py +50 -0
- package/Spider/spider.py +74 -0
- package/crawler/.formatter.exs +5 -0
- package/crawler/.github/workflows/ci.yml +29 -0
- package/crawler/.recode.exs +33 -0
- package/crawler/.tool-versions +2 -0
- package/crawler/CHANGELOG.md +82 -0
- package/crawler/README.md +198 -0
- package/crawler/architecture.svg +4 -0
- package/crawler/config/config.exs +9 -0
- package/crawler/config/dev.exs +5 -0
- package/crawler/config/test.exs +5 -0
- package/crawler/examples/google_search/scraper.ex +37 -0
- package/crawler/examples/google_search/url_filter.ex +11 -0
- package/crawler/examples/google_search.ex +77 -0
- package/crawler/lib/crawler/dispatcher/worker.ex +14 -0
- package/crawler/lib/crawler/dispatcher.ex +20 -0
- package/crawler/lib/crawler/fetcher/header_preparer.ex +60 -0
- package/crawler/lib/crawler/fetcher/modifier.ex +45 -0
- package/crawler/lib/crawler/fetcher/policer.ex +77 -0
- package/crawler/lib/crawler/fetcher/recorder.ex +55 -0
- package/crawler/lib/crawler/fetcher/requester.ex +32 -0
- package/crawler/lib/crawler/fetcher/retrier.ex +43 -0
- package/crawler/lib/crawler/fetcher/url_filter.ex +26 -0
- package/crawler/lib/crawler/fetcher.ex +81 -0
- package/crawler/lib/crawler/http.ex +7 -0
- package/crawler/lib/crawler/linker/path_builder.ex +71 -0
- package/crawler/lib/crawler/linker/path_expander.ex +59 -0
- package/crawler/lib/crawler/linker/path_finder.ex +106 -0
- package/crawler/lib/crawler/linker/path_offliner.ex +59 -0
- package/crawler/lib/crawler/linker/path_prefixer.ex +46 -0
- package/crawler/lib/crawler/linker.ex +173 -0
- package/crawler/lib/crawler/options.ex +127 -0
- package/crawler/lib/crawler/parser/css_parser.ex +37 -0
- package/crawler/lib/crawler/parser/guarder.ex +38 -0
- package/crawler/lib/crawler/parser/html_parser.ex +41 -0
- package/crawler/lib/crawler/parser/link_parser/link_expander.ex +32 -0
- package/crawler/lib/crawler/parser/link_parser.ex +50 -0
- package/crawler/lib/crawler/parser.ex +122 -0
- package/crawler/lib/crawler/queue_handler.ex +45 -0
- package/crawler/lib/crawler/scraper.ex +28 -0
- package/crawler/lib/crawler/snapper/dir_maker.ex +45 -0
- package/crawler/lib/crawler/snapper/link_replacer.ex +95 -0
- package/crawler/lib/crawler/snapper.ex +82 -0
- package/crawler/lib/crawler/store/counter.ex +19 -0
- package/crawler/lib/crawler/store/page.ex +7 -0
- package/crawler/lib/crawler/store.ex +87 -0
- package/crawler/lib/crawler/worker.ex +62 -0
- package/crawler/lib/crawler.ex +91 -0
- package/crawler/mix.exs +78 -0
- package/crawler/mix.lock +40 -0
- package/crawler/test/fixtures/introducing-elixir.jpg +0 -0
- package/crawler/test/integration_test.exs +135 -0
- package/crawler/test/lib/crawler/dispatcher/worker_test.exs +7 -0
- package/crawler/test/lib/crawler/dispatcher_test.exs +5 -0
- package/crawler/test/lib/crawler/fetcher/header_preparer_test.exs +7 -0
- package/crawler/test/lib/crawler/fetcher/policer_test.exs +71 -0
- package/crawler/test/lib/crawler/fetcher/recorder_test.exs +9 -0
- package/crawler/test/lib/crawler/fetcher/requester_test.exs +9 -0
- package/crawler/test/lib/crawler/fetcher/retrier_test.exs +7 -0
- package/crawler/test/lib/crawler/fetcher/url_filter_test.exs +7 -0
- package/crawler/test/lib/crawler/fetcher_test.exs +153 -0
- package/crawler/test/lib/crawler/http_test.exs +47 -0
- package/crawler/test/lib/crawler/linker/path_builder_test.exs +7 -0
- package/crawler/test/lib/crawler/linker/path_expander_test.exs +7 -0
- package/crawler/test/lib/crawler/linker/path_finder_test.exs +7 -0
- package/crawler/test/lib/crawler/linker/path_offliner_test.exs +7 -0
- package/crawler/test/lib/crawler/linker/path_prefixer_test.exs +7 -0
- package/crawler/test/lib/crawler/linker_test.exs +7 -0
- package/crawler/test/lib/crawler/options_test.exs +7 -0
- package/crawler/test/lib/crawler/parser/css_parser_test.exs +7 -0
- package/crawler/test/lib/crawler/parser/guarder_test.exs +7 -0
- package/crawler/test/lib/crawler/parser/html_parser_test.exs +7 -0
- package/crawler/test/lib/crawler/parser/link_parser/link_expander_test.exs +7 -0
- package/crawler/test/lib/crawler/parser/link_parser_test.exs +7 -0
- package/crawler/test/lib/crawler/parser_test.exs +8 -0
- package/crawler/test/lib/crawler/queue_handler_test.exs +7 -0
- package/crawler/test/lib/crawler/scraper_test.exs +7 -0
- package/crawler/test/lib/crawler/snapper/dir_maker_test.exs +7 -0
- package/crawler/test/lib/crawler/snapper/link_replacer_test.exs +7 -0
- package/crawler/test/lib/crawler/snapper_test.exs +9 -0
- package/crawler/test/lib/crawler/worker_test.exs +5 -0
- package/crawler/test/lib/crawler_test.exs +295 -0
- package/crawler/test/support/test_case.ex +24 -0
- package/crawler/test/support/test_helpers.ex +28 -0
- package/crawler/test/test_helper.exs +7 -0
- package/package.json +1 -1
- package/rubyretriever/.rspec +2 -0
- package/rubyretriever/.travis.yml +7 -0
- package/rubyretriever/Gemfile +3 -0
- package/rubyretriever/Gemfile.lock +64 -0
- package/rubyretriever/LICENSE +20 -0
- package/rubyretriever/Rakefile +7 -0
- package/rubyretriever/bin/rr +79 -0
- package/rubyretriever/lib/retriever/cli.rb +25 -0
- package/rubyretriever/lib/retriever/core_ext.rb +13 -0
- package/rubyretriever/lib/retriever/fetch.rb +268 -0
- package/rubyretriever/lib/retriever/fetchfiles.rb +71 -0
- package/rubyretriever/lib/retriever/fetchseo.rb +18 -0
- package/rubyretriever/lib/retriever/fetchsitemap.rb +43 -0
- package/rubyretriever/lib/retriever/link.rb +47 -0
- package/rubyretriever/lib/retriever/openuri_redirect_patch.rb +8 -0
- package/rubyretriever/lib/retriever/page.rb +104 -0
- package/rubyretriever/lib/retriever/page_iterator.rb +21 -0
- package/rubyretriever/lib/retriever/target.rb +47 -0
- package/rubyretriever/lib/retriever/version.rb +4 -0
- package/rubyretriever/lib/retriever.rb +15 -0
- package/rubyretriever/readme.md +166 -0
- package/rubyretriever/rubyretriever.gemspec +41 -0
- package/rubyretriever/spec/link_spec.rb +77 -0
- package/rubyretriever/spec/page_spec.rb +94 -0
- package/rubyretriever/spec/retriever_spec.rb +84 -0
- package/rubyretriever/spec/spec_helper.rb +17 -0
- package/rubyretriever/spec/target_spec.rb +55 -0
package/Spider/README.md
ADDED
@@ -0,0 +1,19 @@
+
+
+
+# Overview
+
+This is an open source, multi-threaded website crawler written in Python. There is still a lot of work to do, so feel free to help out with development.
+
+***
+
+Note: This is part of an open source search engine. The purpose of this tool is to gather links **only**. The analytics, data harvesting, and search algorithms are being created as separate programs.
+
+### Links
+
+- [Support thenewboston](https://www.patreon.com/thenewboston)
+- [thenewboston.com](https://thenewboston.com/)
+- [Facebook](https://www.facebook.com/TheNewBoston-464114846956315/)
+- [Twitter](https://twitter.com/bucky_roberts)
+- [Google+](https://plus.google.com/+BuckyRoberts)
+- [reddit](https://www.reddit.com/r/thenewboston/)
package/Spider/domain.py
ADDED
@@ -0,0 +1,18 @@
+from urllib.parse import urlparse
+
+
+# Get domain name (example.com)
+def get_domain_name(url):
+    try:
+        results = get_sub_domain_name(url).split('.')
+        return results[-2] + '.' + results[-1]
+    except:
+        return ''
+
+
+# Get sub domain name (name.example.com)
+def get_sub_domain_name(url):
+    try:
+        return urlparse(url).netloc
+    except:
+        return ''
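A minimal usage sketch (not part of the package) of what the two helpers above return, assuming `domain.py` is on the import path; the URL is a made-up example.

```python
# Illustration only: domain.py helpers on a hypothetical URL.
from domain import get_domain_name, get_sub_domain_name

url = 'https://blog.example.com/post/1'

print(get_sub_domain_name(url))  # 'blog.example.com' (the full netloc)
print(get_domain_name(url))      # 'example.com' (last two dot-separated labels)
```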
package/Spider/general.py
ADDED
@@ -0,0 +1,51 @@
+import os
+
+
+# Each website is a separate project (folder)
+def create_project_dir(directory):
+    if not os.path.exists(directory):
+        print('Creating directory ' + directory)
+        os.makedirs(directory)
+
+
+# Create queue and crawled files (if not created)
+def create_data_files(project_name, base_url):
+    queue = os.path.join(project_name, 'queue.txt')
+    crawled = os.path.join(project_name, 'crawled.txt')
+    if not os.path.isfile(queue):
+        write_file(queue, base_url)
+    if not os.path.isfile(crawled):
+        write_file(crawled, '')
+
+
+# Create a new file
+def write_file(path, data):
+    with open(path, 'w') as f:
+        f.write(data)
+
+
+# Add data onto an existing file
+def append_to_file(path, data):
+    with open(path, 'a') as file:
+        file.write(data + '\n')
+
+
+# Delete the contents of a file
+def delete_file_contents(path):
+    open(path, 'w').close()
+
+
+# Read a file and convert each line to set items
+def file_to_set(file_name):
+    results = set()
+    with open(file_name, 'rt') as f:
+        for line in f:
+            results.add(line.replace('\n', ''))
+    return results
+
+
+# Iterate through a set, each item will be a line in a file
+def set_to_file(links, file_name):
+    with open(file_name, 'w') as f:
+        for l in sorted(links):
+            f.write(l + '\n')
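A minimal sketch (not part of the package) of how these file helpers fit together, assuming `general.py` is on the import path; the project name and URLs are made up.

```python
# Illustration only: seed a project folder and round-trip links through the queue file.
from general import create_project_dir, create_data_files, set_to_file, file_to_set

create_project_dir('demo_project')                        # creates the folder if it is missing
create_data_files('demo_project', 'https://example.com')  # seeds queue.txt with the base URL

links = {'https://example.com', 'https://example.com/about'}
set_to_file(links, 'demo_project/queue.txt')              # writes one sorted link per line

print(file_to_set('demo_project/queue.txt'))              # reads the lines back into a set
# {'https://example.com', 'https://example.com/about'}
```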
package/Spider/link_finder.py
ADDED
@@ -0,0 +1,25 @@
+from html.parser import HTMLParser
+from urllib import parse
+
+
+class LinkFinder(HTMLParser):
+
+    def __init__(self, base_url, page_url):
+        super().__init__()
+        self.base_url = base_url
+        self.page_url = page_url
+        self.links = set()
+
+    # When we call HTMLParser feed() this function is called when it encounters an opening tag <a>
+    def handle_starttag(self, tag, attrs):
+        if tag == 'a':
+            for (attribute, value) in attrs:
+                if attribute == 'href':
+                    url = parse.urljoin(self.base_url, value)
+                    self.links.add(url)
+
+    def page_links(self):
+        return self.links
+
+    def error(self, message):
+        pass
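A minimal sketch (not part of the package) of `LinkFinder` in isolation, assuming `link_finder.py` is on the import path; the HTML snippet and URLs are made up.

```python
# Illustration only: feed() triggers handle_starttag(), which resolves and collects hrefs.
from link_finder import LinkFinder

html = '<html><body><a href="/docs">Docs</a> <a href="https://other.example.org/">Other</a></body></html>'

finder = LinkFinder('https://example.com', 'https://example.com/index.html')
finder.feed(html)

print(finder.page_links())
# {'https://example.com/docs', 'https://other.example.org/'}
```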
package/Spider/main.py
ADDED
@@ -0,0 +1,50 @@
+import threading
+from queue import Queue
+from spider import Spider
+from domain import *
+from general import *
+
+PROJECT_NAME = 'viper-seo'
+HOMEPAGE = 'http://viper-seo.com/'
+DOMAIN_NAME = get_domain_name(HOMEPAGE)
+QUEUE_FILE = PROJECT_NAME + '/queue.txt'
+CRAWLED_FILE = PROJECT_NAME + '/crawled.txt'
+NUMBER_OF_THREADS = 8
+queue = Queue()
+Spider(PROJECT_NAME, HOMEPAGE, DOMAIN_NAME)
+
+
+# Create worker threads (will die when main exits)
+def create_workers():
+    for _ in range(NUMBER_OF_THREADS):
+        t = threading.Thread(target=work)
+        t.daemon = True
+        t.start()
+
+
+# Do the next job in the queue
+def work():
+    while True:
+        url = queue.get()
+        Spider.crawl_page(threading.current_thread().name, url)
+        queue.task_done()
+
+
+# Each queued link is a new job
+def create_jobs():
+    for link in file_to_set(QUEUE_FILE):
+        queue.put(link)
+    queue.join()
+    crawl()
+
+
+# Check if there are items in the queue, if so crawl them
+def crawl():
+    queued_links = file_to_set(QUEUE_FILE)
+    if len(queued_links) > 0:
+        print(str(len(queued_links)) + ' links in the queue')
+        create_jobs()
+
+
+create_workers()
+crawl()
package/Spider/spider.py
ADDED
@@ -0,0 +1,74 @@
+from urllib.request import urlopen
+from link_finder import LinkFinder
+from domain import *
+from general import *
+
+
+class Spider:
+
+    project_name = ''
+    base_url = ''
+    domain_name = ''
+    queue_file = ''
+    crawled_file = ''
+    queue = set()
+    crawled = set()
+
+    def __init__(self, project_name, base_url, domain_name):
+        Spider.project_name = project_name
+        Spider.base_url = base_url
+        Spider.domain_name = domain_name
+        Spider.queue_file = Spider.project_name + '/queue.txt'
+        Spider.crawled_file = Spider.project_name + '/crawled.txt'
+        self.boot()
+        self.crawl_page('First spider', Spider.base_url)
+
+    # Creates directory and files for project on first run and starts the spider
+    @staticmethod
+    def boot():
+        create_project_dir(Spider.project_name)
+        create_data_files(Spider.project_name, Spider.base_url)
+        Spider.queue = file_to_set(Spider.queue_file)
+        Spider.crawled = file_to_set(Spider.crawled_file)
+
+    # Updates user display, fills queue and updates files
+    @staticmethod
+    def crawl_page(thread_name, page_url):
+        if page_url not in Spider.crawled:
+            print(thread_name + ' now crawling ' + page_url)
+            print('Queue ' + str(len(Spider.queue)) + ' | Crawled ' + str(len(Spider.crawled)))
+            Spider.add_links_to_queue(Spider.gather_links(page_url))
+            Spider.queue.remove(page_url)
+            Spider.crawled.add(page_url)
+            Spider.update_files()
+
+    # Converts raw response data into readable information and checks for proper html formatting
+    @staticmethod
+    def gather_links(page_url):
+        html_string = ''
+        try:
+            response = urlopen(page_url)
+            if 'text/html' in response.getheader('Content-Type'):
+                html_bytes = response.read()
+                html_string = html_bytes.decode("utf-8")
+            finder = LinkFinder(Spider.base_url, page_url)
+            finder.feed(html_string)
+        except Exception as e:
+            print(str(e))
+            return set()
+        return finder.page_links()
+
+    # Saves queue data to project files
+    @staticmethod
+    def add_links_to_queue(links):
+        for url in links:
+            if (url in Spider.queue) or (url in Spider.crawled):
+                continue
+            if Spider.domain_name != get_domain_name(url):
+                continue
+            Spider.queue.add(url)
+
+    @staticmethod
+    def update_files():
+        set_to_file(Spider.queue, Spider.queue_file)
+        set_to_file(Spider.crawled, Spider.crawled_file)
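A single-threaded sketch (not part of the package) of driving the `Spider` class directly, without the worker threads in `main.py`; the project name and URL are placeholders, and running it would perform real HTTP requests.

```python
# Illustration only: drive Spider without threads, using made-up project values.
from spider import Spider

Spider('demo_project', 'https://example.com/', 'example.com')  # boots files, crawls the homepage

# Crawl whatever the first pass queued up, one URL at a time.
for url in list(Spider.queue):
    Spider.crawl_page('single-thread', url)

print('Crawled', len(Spider.crawled), 'pages')
```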
package/crawler/.github/workflows/ci.yml
ADDED
@@ -0,0 +1,29 @@
+name: CI
+on: push
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up Elixir
+        uses: erlef/setup-beam@v1
+        with:
+          version-type: strict
+          version-file: .tool-versions
+      - name: Restore dependencies cache
+        id: mix-cache
+        uses: actions/cache@v3
+        with:
+          path: |
+            deps
+            _build
+          key: ${{ runner.os }}-mix-${{ hashFiles('**/mix.lock') }}
+          restore-keys: ${{ runner.os }}-mix-
+      - name: Install dependencies
+        if: steps.mix-cache.outputs.cache-hit != 'true'
+        run: |
+          mix local.rebar --force
+          mix local.hex --force
+          mix deps.get
+      - name: Run tests
+        run: mix test
package/crawler/.recode.exs
ADDED
@@ -0,0 +1,33 @@
+[
+  version: "0.6.4",
+  # Can also be set/reset with `--autocorrect`/`--no-autocorrect`.
+  autocorrect: true,
+  # With "--dry" no changes will be written to the files.
+  # Can also be set/reset with `--dry`/`--no-dry`.
+  # If dry is true then verbose is also active.
+  dry: false,
+  # Can also be set/reset with `--verbose`/`--no-verbose`.
+  verbose: false,
+  # Can be overwritten by calling `mix recode "lib/**/*.ex"`.
+  inputs: ["{mix,.formatter}.exs", "{apps,config,lib,test}/**/*.{ex,exs}"],
+  formatter: {Recode.Formatter, []},
+  tasks: [
+    # Tasks could be added by a tuple of the tasks module name and an options
+    # keyword list. A task can be deactivated by `active: false`. The execution of
+    # a deactivated task can be forced by calling `mix recode --task ModuleName`.
+    {Recode.Task.AliasExpansion, []},
+    {Recode.Task.AliasOrder, []},
+    {Recode.Task.Dbg, [autocorrect: false]},
+    {Recode.Task.EnforceLineLength, [active: false]},
+    {Recode.Task.FilterCount, []},
+    {Recode.Task.IOInspect, [autocorrect: false]},
+    {Recode.Task.Nesting, []},
+    {Recode.Task.PipeFunOne, []},
+    {Recode.Task.SinglePipe, []},
+    {Recode.Task.Specs, [active: false, exclude: "test/**/*.{ex,exs}", config: [only: :visible]]},
+    {Recode.Task.TagFIXME, [exit_code: 2]},
+    {Recode.Task.TagTODO, [exit_code: 4]},
+    {Recode.Task.TestFileExt, []},
+    {Recode.Task.UnusedVariable, [active: false]}
+  ]
+]
package/crawler/CHANGELOG.md
ADDED
@@ -0,0 +1,82 @@
+# Crawler Changelog
+
+## master
+
+- [Added] Add `:retries` option
+
+## v1.5.0 [2023-10-10]
+
+- [Added] Add `:force` option
+- [Added] Add `:scope` option
+
+## v1.4.0 [2023-10-07]
+
+- [Added] Allow multiple instances of Crawler sharing the same queue
+- [Improved] Logger will now log entries as `debug` or `warn`
+
+## v1.3.0 [2023-09-30]
+
+- [Added] `:store` option, defaults to `nil` to save memory usage
+- [Added] `:max_pages` option
+- [Added] `Crawler.running?/1` to check whether Crawler is running
+- [Improved] The queue is being supervised now
+
+## v1.2.0 [2023-09-29]
+
+- [Added] `Crawler.Store.all_urls/0` to find all scraped URLs
+- [Improved] Memory usage optimisations
+
+## v1.1.2 [2021-10-14]
+
+- [Improved] Documentation improvements (thanks @kianmeng)
+
+## v1.1.1 [2020-05-15]
+
+- [Improved] Updated `floki` and other dependencies
+
+## v1.1.0 [2019-02-25]
+
+- [Added] `:modifier` option
+- [Added] `:encode_uri` option
+- [Improved] Various small fixes and improvements
+
+## v1.0.0 [2017-08-31]
+
+- [Added] Pause / resume / stop Crawler
+- [Improved] Various small fixes and improvements
+
+## v0.4.0 [2017-08-28]
+
+- [Added] `:scraper` option to allow scraping content
+- [Improved] Various small fixes and improvements
+
+## v0.3.1 [2017-08-28]
+
+- [Improved] `Crawler.Store.DB` now stores the `opts` meta data
+- [Improved] Code documentation
+- [Improved] Various small fixes and improvements
+
+## v0.3.0 [2017-08-27]
+
+- [Added] `:retrier` option to allow custom fetch retrying logic
+- [Added] `:url_filter` option to allow custom url filtering logic
+- [Improved] Parser is now more stable and skips unparsable files
+- [Improved] Various small fixes and improvements
+
+## v0.2.0 [2017-08-21]
+
+- [Added] `:workers` option
+- [Added] `:interval` option
+- [Added] `:timeout` option
+- [Added] `:user_agent` option
+- [Added] `:save_to` option
+- [Added] `:assets` option
+- [Added] `:parser` option to allow custom parsing logic
+- [Improved] Renamed `:max_levels` to `:max_depths`
+- [Improved] Various small fixes and improvements
+
+## v0.1.0 [2017-07-30]
+
+- [Added] A semi-functioning prototype
+- [Added] Finished the very basic crawling function
+- [Added] `:max_levels` option
package/crawler/README.md
ADDED
@@ -0,0 +1,198 @@
+# Crawler
+
+[](https://github.com/fredwu/crawler/actions)
+[](https://codebeat.co/projects/github-com-fredwu-crawler-master)
+[](https://coveralls.io/github/fredwu/crawler?branch=master)
+[](https://hex.pm/packages/crawler)
+[](https://hexdocs.pm/crawler/)
+[](https://hex.pm/packages/crawler)
+[](https://github.com/fredwu/crawler/blob/master/LICENSE.md)
+[](https://github.com/fredwu/crawler/commits/master)
+
+A high performance web crawler / scraper in Elixir, with worker pooling and rate limiting via [OPQ](https://github.com/fredwu/opq).
+
+## Features
+
+- Crawl assets (javascript, css and images).
+- Save to disk.
+- Hook for scraping content.
+- Restrict crawlable domains, paths or content types.
+- Limit concurrent crawlers.
+- Limit rate of crawling.
+- Set the maximum crawl depth.
+- Set timeouts.
+- Set retries strategy.
+- Set crawler's user agent.
+- Manually pause/resume/stop the crawler.
+
+See [Hex documentation](https://hexdocs.pm/crawler/).
+
+## Architecture
+
+Below is a very high level architecture diagram demonstrating how Crawler works.
+
+
+
+## Usage
+
+```elixir
+Crawler.crawl("http://elixir-lang.org", max_depths: 2)
+```
+
+There are several ways to access the crawled page data:
+
+1. Use [`Crawler.Store`](https://hexdocs.pm/crawler/Crawler.Store.html)
+2. Tap into the registry([?](https://hexdocs.pm/elixir/Registry.html)) [`Crawler.Store.DB`](lib/crawler/store.ex)
+3. Use your own [scraper](#custom-modules)
+4. If the `:save_to` option is set, pages will be saved to disk in addition to the above mentioned places
+5. Provide your own [custom parser](#custom-modules) and manage how data is stored and accessed yourself
+
+## Configurations
+
+| Option | Type | Default Value | Description |
+| ------------- | ------- | --------------------------- | ----------- |
+| `:assets` | list | `[]` | Whether to fetch any asset files, available options: `"css"`, `"js"`, `"images"`. |
+| `:save_to` | string | `nil` | When provided, the path for saving crawled pages. |
+| `:workers` | integer | `10` | Maximum number of concurrent workers for crawling. |
+| `:interval` | integer | `0` | Rate limit control - number of milliseconds before crawling more pages, defaults to `0` which is effectively no rate limit. |
+| `:max_depths` | integer | `3` | Maximum nested depth of pages to crawl. |
+| `:max_pages` | integer | `:infinity` | Maximum amount of pages to crawl. |
+| `:timeout` | integer | `5000` | Timeout value for fetching a page, in ms. Can also be set to `:infinity`, useful when combined with `Crawler.pause/1`. |
+| `:retries` | integer | `2` | Number of times to retry a fetch. |
+| `:store` | module | `nil` | Module for storing the crawled page data and crawling metadata. You can set it to `Crawler.Store` or use your own module, see `Crawler.Store.add_page_data/3` for implementation details. |
+| `:force` | boolean | `false` | Force crawling URLs even if they have already been crawled, useful if you want to refresh the crawled data. |
+| `:scope` | term | `nil` | Similar to `:force`, but you can pass a custom `:scope` to determine how Crawler should perform on links already seen. |
+| `:user_agent` | string | `Crawler/x.x.x (...)` | User-Agent value sent by the fetch requests. |
+| `:url_filter` | module | `Crawler.Fetcher.UrlFilter` | Custom URL filter, useful for restricting crawlable domains, paths or content types. |
+| `:retrier` | module | `Crawler.Fetcher.Retrier` | Custom fetch retrier, useful for retrying failed crawls, nullifies the `:retries` option. |
+| `:modifier` | module | `Crawler.Fetcher.Modifier` | Custom modifier, useful for adding custom request headers or options. |
+| `:scraper` | module | `Crawler.Scraper` | Custom scraper, useful for scraping content as soon as the parser parses it. |
+| `:parser` | module | `Crawler.Parser` | Custom parser, useful for handling parsing differently or to add extra functionalities. |
+| `:encode_uri` | boolean | `false` | When set to `true` apply the `URI.encode` to the URL to be crawled. |
+| `:queue` | pid | `nil` | You can pass in an `OPQ` pid so that multiple crawlers can share the same queue. |
+
+## Custom Modules
+
+It is possible to swap in your custom logic as shown in the configurations section. Your custom modules need to conform to their respective behaviours:
+
+### Retrier
+
+See [`Crawler.Fetcher.Retrier`](lib/crawler/fetcher/retrier.ex).
+
+Crawler uses [ElixirRetry](https://github.com/safwank/ElixirRetry)'s exponential backoff strategy by default.
+
+```elixir
+defmodule CustomRetrier do
+  @behaviour Crawler.Fetcher.Retrier.Spec
+end
+```
+
+### URL Filter
+
+See [`Crawler.Fetcher.UrlFilter`](lib/crawler/fetcher/url_filter.ex).
+
+```elixir
+defmodule CustomUrlFilter do
+  @behaviour Crawler.Fetcher.UrlFilter.Spec
+end
+```
+
+### Scraper
+
+See [`Crawler.Scraper`](lib/crawler/scraper.ex).
+
+```elixir
+defmodule CustomScraper do
+  @behaviour Crawler.Scraper.Spec
+end
+```
+
+### Parser
+
+See [`Crawler.Parser`](lib/crawler/parser.ex).
+
+```elixir
+defmodule CustomParser do
+  @behaviour Crawler.Parser.Spec
+end
+```
+
+### Modifier
+
+See [`Crawler.Fetcher.Modifier`](lib/crawler/fetcher/modifier.ex).
+
+```elixir
+defmodule CustomModifier do
+  @behaviour Crawler.Fetcher.Modifier.Spec
+end
+```
+
+## Pause / Resume / Stop Crawler
+
+Crawler provides `pause/1`, `resume/1` and `stop/1`, see below.
+
+```elixir
+{:ok, opts} = Crawler.crawl("https://elixir-lang.org")
+
+Crawler.running?(opts) # => true
+
+Crawler.pause(opts)
+
+Crawler.running?(opts) # => false
+
+Crawler.resume(opts)
+
+Crawler.running?(opts) # => true
+
+Crawler.stop(opts)
+
+Crawler.running?(opts) # => false
+```
+
+Please note that when pausing Crawler, you need to set a large enough `:timeout` (or even set it to `:infinity`), otherwise the parser would time out due to unprocessed links.
+
+## Multiple Crawlers
+
+It is possible to start multiple crawlers sharing the same queue.
+
+```elixir
+{:ok, queue} = OPQ.init(worker: Crawler.Dispatcher.Worker, workers: 2)
+
+Crawler.crawl("https://elixir-lang.org", queue: queue)
+Crawler.crawl("https://github.com", queue: queue)
+```
+
+## Find All Scraped URLs
+
+```elixir
+Crawler.Store.all_urls() # => ["https://elixir-lang.org", "https://google.com", ...]
+```
+
+## Examples
+
+### Google Search + Github
+
+This example performs a Google search, then scrapes the results to find Github projects and outputs their name and description.
+
+See the [source code](examples/google_search.ex).
+
+You can run the example by cloning the repo and running the command:
+
+```shell
+mix run -e "Crawler.Example.GoogleSearch.run()"
+```
+
+## API Reference
+
+Please see https://hexdocs.pm/crawler.
+
+## Changelog
+
+Please see [CHANGELOG.md](CHANGELOG.md).
+
+## Copyright and License
+
+Copyright (c) 2016 Fred Wu
+
+This work is free. You can redistribute it and/or modify it under the
+terms of the [MIT License](http://fredwu.mit-license.org/).