powerdlz23 1.2.3 → 1.2.5

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (207)
  1. package/Spider/README.md +19 -0
  2. package/Spider/domain.py +18 -0
  3. package/Spider/general.py +51 -0
  4. package/Spider/link_finder.py +25 -0
  5. package/Spider/main.py +50 -0
  6. package/Spider/spider.py +74 -0
  7. package/crawler/.formatter.exs +5 -0
  8. package/crawler/.github/workflows/ci.yml +29 -0
  9. package/crawler/.recode.exs +33 -0
  10. package/crawler/.tool-versions +2 -0
  11. package/crawler/CHANGELOG.md +82 -0
  12. package/crawler/README.md +198 -0
  13. package/crawler/architecture.svg +4 -0
  14. package/crawler/config/config.exs +9 -0
  15. package/crawler/config/dev.exs +5 -0
  16. package/crawler/config/test.exs +5 -0
  17. package/crawler/examples/google_search/scraper.ex +37 -0
  18. package/crawler/examples/google_search/url_filter.ex +11 -0
  19. package/crawler/examples/google_search.ex +77 -0
  20. package/crawler/lib/crawler/dispatcher/worker.ex +14 -0
  21. package/crawler/lib/crawler/dispatcher.ex +20 -0
  22. package/crawler/lib/crawler/fetcher/header_preparer.ex +60 -0
  23. package/crawler/lib/crawler/fetcher/modifier.ex +45 -0
  24. package/crawler/lib/crawler/fetcher/policer.ex +77 -0
  25. package/crawler/lib/crawler/fetcher/recorder.ex +55 -0
  26. package/crawler/lib/crawler/fetcher/requester.ex +32 -0
  27. package/crawler/lib/crawler/fetcher/retrier.ex +43 -0
  28. package/crawler/lib/crawler/fetcher/url_filter.ex +26 -0
  29. package/crawler/lib/crawler/fetcher.ex +81 -0
  30. package/crawler/lib/crawler/http.ex +7 -0
  31. package/crawler/lib/crawler/linker/path_builder.ex +71 -0
  32. package/crawler/lib/crawler/linker/path_expander.ex +59 -0
  33. package/crawler/lib/crawler/linker/path_finder.ex +106 -0
  34. package/crawler/lib/crawler/linker/path_offliner.ex +59 -0
  35. package/crawler/lib/crawler/linker/path_prefixer.ex +46 -0
  36. package/crawler/lib/crawler/linker.ex +173 -0
  37. package/crawler/lib/crawler/options.ex +127 -0
  38. package/crawler/lib/crawler/parser/css_parser.ex +37 -0
  39. package/crawler/lib/crawler/parser/guarder.ex +38 -0
  40. package/crawler/lib/crawler/parser/html_parser.ex +41 -0
  41. package/crawler/lib/crawler/parser/link_parser/link_expander.ex +32 -0
  42. package/crawler/lib/crawler/parser/link_parser.ex +50 -0
  43. package/crawler/lib/crawler/parser.ex +122 -0
  44. package/crawler/lib/crawler/queue_handler.ex +45 -0
  45. package/crawler/lib/crawler/scraper.ex +28 -0
  46. package/crawler/lib/crawler/snapper/dir_maker.ex +45 -0
  47. package/crawler/lib/crawler/snapper/link_replacer.ex +95 -0
  48. package/crawler/lib/crawler/snapper.ex +82 -0
  49. package/crawler/lib/crawler/store/counter.ex +19 -0
  50. package/crawler/lib/crawler/store/page.ex +7 -0
  51. package/crawler/lib/crawler/store.ex +87 -0
  52. package/crawler/lib/crawler/worker.ex +62 -0
  53. package/crawler/lib/crawler.ex +91 -0
  54. package/crawler/mix.exs +78 -0
  55. package/crawler/mix.lock +40 -0
  56. package/crawler/test/fixtures/introducing-elixir.jpg +0 -0
  57. package/crawler/test/integration_test.exs +135 -0
  58. package/crawler/test/lib/crawler/dispatcher/worker_test.exs +7 -0
  59. package/crawler/test/lib/crawler/dispatcher_test.exs +5 -0
  60. package/crawler/test/lib/crawler/fetcher/header_preparer_test.exs +7 -0
  61. package/crawler/test/lib/crawler/fetcher/policer_test.exs +71 -0
  62. package/crawler/test/lib/crawler/fetcher/recorder_test.exs +9 -0
  63. package/crawler/test/lib/crawler/fetcher/requester_test.exs +9 -0
  64. package/crawler/test/lib/crawler/fetcher/retrier_test.exs +7 -0
  65. package/crawler/test/lib/crawler/fetcher/url_filter_test.exs +7 -0
  66. package/crawler/test/lib/crawler/fetcher_test.exs +153 -0
  67. package/crawler/test/lib/crawler/http_test.exs +47 -0
  68. package/crawler/test/lib/crawler/linker/path_builder_test.exs +7 -0
  69. package/crawler/test/lib/crawler/linker/path_expander_test.exs +7 -0
  70. package/crawler/test/lib/crawler/linker/path_finder_test.exs +7 -0
  71. package/crawler/test/lib/crawler/linker/path_offliner_test.exs +7 -0
  72. package/crawler/test/lib/crawler/linker/path_prefixer_test.exs +7 -0
  73. package/crawler/test/lib/crawler/linker_test.exs +7 -0
  74. package/crawler/test/lib/crawler/options_test.exs +7 -0
  75. package/crawler/test/lib/crawler/parser/css_parser_test.exs +7 -0
  76. package/crawler/test/lib/crawler/parser/guarder_test.exs +7 -0
  77. package/crawler/test/lib/crawler/parser/html_parser_test.exs +7 -0
  78. package/crawler/test/lib/crawler/parser/link_parser/link_expander_test.exs +7 -0
  79. package/crawler/test/lib/crawler/parser/link_parser_test.exs +7 -0
  80. package/crawler/test/lib/crawler/parser_test.exs +8 -0
  81. package/crawler/test/lib/crawler/queue_handler_test.exs +7 -0
  82. package/crawler/test/lib/crawler/scraper_test.exs +7 -0
  83. package/crawler/test/lib/crawler/snapper/dir_maker_test.exs +7 -0
  84. package/crawler/test/lib/crawler/snapper/link_replacer_test.exs +7 -0
  85. package/crawler/test/lib/crawler/snapper_test.exs +9 -0
  86. package/crawler/test/lib/crawler/worker_test.exs +5 -0
  87. package/crawler/test/lib/crawler_test.exs +295 -0
  88. package/crawler/test/support/test_case.ex +24 -0
  89. package/crawler/test/support/test_helpers.ex +28 -0
  90. package/crawler/test/test_helper.exs +7 -0
  91. package/grell/.rspec +2 -0
  92. package/grell/.travis.yml +28 -0
  93. package/grell/CHANGELOG.md +111 -0
  94. package/grell/Gemfile +7 -0
  95. package/grell/LICENSE.txt +22 -0
  96. package/grell/README.md +213 -0
  97. package/grell/Rakefile +2 -0
  98. package/grell/grell.gemspec +36 -0
  99. package/grell/lib/grell/capybara_driver.rb +44 -0
  100. package/grell/lib/grell/crawler.rb +83 -0
  101. package/grell/lib/grell/crawler_manager.rb +84 -0
  102. package/grell/lib/grell/grell_logger.rb +10 -0
  103. package/grell/lib/grell/page.rb +275 -0
  104. package/grell/lib/grell/page_collection.rb +62 -0
  105. package/grell/lib/grell/rawpage.rb +62 -0
  106. package/grell/lib/grell/reader.rb +18 -0
  107. package/grell/lib/grell/version.rb +3 -0
  108. package/grell/lib/grell.rb +11 -0
  109. package/grell/spec/lib/capybara_driver_spec.rb +38 -0
  110. package/grell/spec/lib/crawler_manager_spec.rb +174 -0
  111. package/grell/spec/lib/crawler_spec.rb +361 -0
  112. package/grell/spec/lib/page_collection_spec.rb +159 -0
  113. package/grell/spec/lib/page_spec.rb +418 -0
  114. package/grell/spec/lib/reader_spec.rb +43 -0
  115. package/grell/spec/spec_helper.rb +66 -0
  116. package/heartmagic/config.py +1 -0
  117. package/heartmagic/heart.py +3 -0
  118. package/heartmagic/pytransform/__init__.py +483 -0
  119. package/heartmagic/pytransform/_pytransform.dll +0 -0
  120. package/heartmagic/pytransform/_pytransform.so +0 -0
  121. package/httpStatusCode/README.md +2 -0
  122. package/httpStatusCode/httpStatusCode.js +4 -0
  123. package/httpStatusCode/reasonPhrases.js +344 -0
  124. package/httpStatusCode/statusCodes.js +344 -0
  125. package/package.json +1 -1
  126. package/rubyretriever/.rspec +2 -0
  127. package/rubyretriever/.travis.yml +7 -0
  128. package/rubyretriever/Gemfile +3 -0
  129. package/rubyretriever/Gemfile.lock +64 -0
  130. package/rubyretriever/LICENSE +20 -0
  131. package/rubyretriever/Rakefile +7 -0
  132. package/rubyretriever/bin/rr +79 -0
  133. package/rubyretriever/lib/retriever/cli.rb +25 -0
  134. package/rubyretriever/lib/retriever/core_ext.rb +13 -0
  135. package/rubyretriever/lib/retriever/fetch.rb +268 -0
  136. package/rubyretriever/lib/retriever/fetchfiles.rb +71 -0
  137. package/rubyretriever/lib/retriever/fetchseo.rb +18 -0
  138. package/rubyretriever/lib/retriever/fetchsitemap.rb +43 -0
  139. package/rubyretriever/lib/retriever/link.rb +47 -0
  140. package/rubyretriever/lib/retriever/openuri_redirect_patch.rb +8 -0
  141. package/rubyretriever/lib/retriever/page.rb +104 -0
  142. package/rubyretriever/lib/retriever/page_iterator.rb +21 -0
  143. package/rubyretriever/lib/retriever/target.rb +47 -0
  144. package/rubyretriever/lib/retriever/version.rb +4 -0
  145. package/rubyretriever/lib/retriever.rb +15 -0
  146. package/rubyretriever/readme.md +166 -0
  147. package/rubyretriever/rubyretriever.gemspec +41 -0
  148. package/rubyretriever/spec/link_spec.rb +77 -0
  149. package/rubyretriever/spec/page_spec.rb +94 -0
  150. package/rubyretriever/spec/retriever_spec.rb +84 -0
  151. package/rubyretriever/spec/spec_helper.rb +17 -0
  152. package/rubyretriever/spec/target_spec.rb +55 -0
  153. package/snapcrawl/.changelog.old.md +157 -0
  154. package/snapcrawl/.gitattributes +1 -0
  155. package/snapcrawl/.github/workflows/test.yml +41 -0
  156. package/snapcrawl/.rspec +3 -0
  157. package/snapcrawl/.rubocop.yml +23 -0
  158. package/snapcrawl/CHANGELOG.md +182 -0
  159. package/snapcrawl/Gemfile +15 -0
  160. package/snapcrawl/LICENSE +21 -0
  161. package/snapcrawl/README.md +135 -0
  162. package/snapcrawl/Runfile +35 -0
  163. package/snapcrawl/bin/snapcrawl +25 -0
  164. package/snapcrawl/lib/snapcrawl/cli.rb +52 -0
  165. package/snapcrawl/lib/snapcrawl/config.rb +60 -0
  166. package/snapcrawl/lib/snapcrawl/crawler.rb +98 -0
  167. package/snapcrawl/lib/snapcrawl/dependencies.rb +21 -0
  168. package/snapcrawl/lib/snapcrawl/exceptions.rb +5 -0
  169. package/snapcrawl/lib/snapcrawl/log_helpers.rb +36 -0
  170. package/snapcrawl/lib/snapcrawl/page.rb +118 -0
  171. package/snapcrawl/lib/snapcrawl/pretty_logger.rb +11 -0
  172. package/snapcrawl/lib/snapcrawl/refinements/pair_split.rb +26 -0
  173. package/snapcrawl/lib/snapcrawl/refinements/string_refinements.rb +13 -0
  174. package/snapcrawl/lib/snapcrawl/screenshot.rb +73 -0
  175. package/snapcrawl/lib/snapcrawl/templates/config.yml +49 -0
  176. package/snapcrawl/lib/snapcrawl/templates/docopt.txt +26 -0
  177. package/snapcrawl/lib/snapcrawl/version.rb +3 -0
  178. package/snapcrawl/lib/snapcrawl.rb +20 -0
  179. package/snapcrawl/snapcrawl.gemspec +27 -0
  180. package/snapcrawl/snapcrawl.yml +41 -0
  181. package/snapcrawl/spec/README.md +16 -0
  182. package/snapcrawl/spec/approvals/bin/help +26 -0
  183. package/snapcrawl/spec/approvals/bin/usage +4 -0
  184. package/snapcrawl/spec/approvals/cli/usage +4 -0
  185. package/snapcrawl/spec/approvals/config/defaults +15 -0
  186. package/snapcrawl/spec/approvals/config/minimal +15 -0
  187. package/snapcrawl/spec/approvals/integration/blacklist +14 -0
  188. package/snapcrawl/spec/approvals/integration/default-config +14 -0
  189. package/snapcrawl/spec/approvals/integration/depth-0 +6 -0
  190. package/snapcrawl/spec/approvals/integration/depth-3 +6 -0
  191. package/snapcrawl/spec/approvals/integration/log-color-no +6 -0
  192. package/snapcrawl/spec/approvals/integration/screenshot-error +3 -0
  193. package/snapcrawl/spec/approvals/integration/whitelist +14 -0
  194. package/snapcrawl/spec/approvals/models/pretty_logger/colors +1 -0
  195. package/snapcrawl/spec/fixtures/config/minimal.yml +4 -0
  196. package/snapcrawl/spec/server/config.ru +97 -0
  197. package/snapcrawl/spec/snapcrawl/bin_spec.rb +15 -0
  198. package/snapcrawl/spec/snapcrawl/cli_spec.rb +9 -0
  199. package/snapcrawl/spec/snapcrawl/config_spec.rb +26 -0
  200. package/snapcrawl/spec/snapcrawl/integration_spec.rb +65 -0
  201. package/snapcrawl/spec/snapcrawl/page_spec.rb +89 -0
  202. package/snapcrawl/spec/snapcrawl/pretty_logger_spec.rb +19 -0
  203. package/snapcrawl/spec/snapcrawl/refinements/pair_split_spec.rb +27 -0
  204. package/snapcrawl/spec/snapcrawl/refinements/string_refinements_spec.rb +29 -0
  205. package/snapcrawl/spec/snapcrawl/screenshot_spec.rb +62 -0
  206. package/snapcrawl/spec/spec_helper.rb +22 -0
  207. package/snapcrawl/spec/spec_mixin.rb +10 -0
package/Spider/README.md ADDED
@@ -0,0 +1,19 @@
+ ![](http://i.imgur.com/wYi2CkD.png)
+
+
+ # Overview
+
+ This is an open source, multi-threaded website crawler written in Python. There is still a lot of work to do, so feel free to help out with development.
+
+ ***
+
+ Note: This is part of an open source search engine. The purpose of this tool is to gather links **only**. The analytics, data harvesting, and search algorithms are being created as separate programs.
+
+ ### Links
+
+ - [Support thenewboston](https://www.patreon.com/thenewboston)
+ - [thenewboston.com](https://thenewboston.com/)
+ - [Facebook](https://www.facebook.com/TheNewBoston-464114846956315/)
+ - [Twitter](https://twitter.com/bucky_roberts)
+ - [Google+](https://plus.google.com/+BuckyRoberts)
+ - [reddit](https://www.reddit.com/r/thenewboston/)
package/Spider/domain.py ADDED
@@ -0,0 +1,18 @@
+ from urllib.parse import urlparse
+
+
+ # Get domain name (example.com)
+ def get_domain_name(url):
+     try:
+         results = get_sub_domain_name(url).split('.')
+         return results[-2] + '.' + results[-1]
+     except:
+         return ''
+
+
+ # Get sub domain name (name.example.com)
+ def get_sub_domain_name(url):
+     try:
+         return urlparse(url).netloc
+     except:
+         return ''
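
A minimal usage sketch of the two helpers above. The `domain` import path and the URLs are illustrative; the comments show the values these calls return given the code as written (errors are swallowed and yield an empty string).

```python
# Illustrative only: assumes the file above is importable as `domain`.
from domain import get_domain_name, get_sub_domain_name

print(get_sub_domain_name('https://blog.example.com/post/1'))  # blog.example.com
print(get_domain_name('https://blog.example.com/post/1'))      # example.com
print(get_domain_name('not a url'))                            # '' (bare except swallows the error)
```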
package/Spider/general.py ADDED
@@ -0,0 +1,51 @@
+ import os
+
+
+ # Each website is a separate project (folder)
+ def create_project_dir(directory):
+     if not os.path.exists(directory):
+         print('Creating directory ' + directory)
+         os.makedirs(directory)
+
+
+ # Create queue and crawled files (if not created)
+ def create_data_files(project_name, base_url):
+     queue = os.path.join(project_name , 'queue.txt')
+     crawled = os.path.join(project_name,"crawled.txt")
+     if not os.path.isfile(queue):
+         write_file(queue, base_url)
+     if not os.path.isfile(crawled):
+         write_file(crawled, '')
+
+
+ # Create a new file
+ def write_file(path, data):
+     with open(path, 'w') as f:
+         f.write(data)
+
+
+ # Add data onto an existing file
+ def append_to_file(path, data):
+     with open(path, 'a') as file:
+         file.write(data + '\n')
+
+
+ # Delete the contents of a file
+ def delete_file_contents(path):
+     open(path, 'w').close()
+
+
+ # Read a file and convert each line to set items
+ def file_to_set(file_name):
+     results = set()
+     with open(file_name, 'rt') as f:
+         for line in f:
+             results.add(line.replace('\n', ''))
+     return results
+
+
+ # Iterate through a set, each item will be a line in a file
+ def set_to_file(links, file_name):
+     with open(file_name,"w") as f:
+         for l in sorted(links):
+             f.write(l+"\n")
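
A short round-trip sketch of the file helpers above, showing how the project directory and queue file are seeded and then read back as a set. The project name and URLs are illustrative, not part of the package.

```python
# Illustrative only: assumes the file above is importable as `general`.
from general import create_project_dir, create_data_files, file_to_set, set_to_file

create_project_dir('demo_project')                        # creates ./demo_project if missing
create_data_files('demo_project', 'http://example.com/')  # seeds queue.txt with the base URL, crawled.txt empty

queue = file_to_set('demo_project/queue.txt')             # {'http://example.com/'}
queue.add('http://example.com/about')
set_to_file(queue, 'demo_project/queue.txt')              # rewrites the file, one sorted link per line
```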
package/Spider/link_finder.py ADDED
@@ -0,0 +1,25 @@
+ from html.parser import HTMLParser
+ from urllib import parse
+
+
+ class LinkFinder(HTMLParser):
+
+     def __init__(self, base_url, page_url):
+         super().__init__()
+         self.base_url = base_url
+         self.page_url = page_url
+         self.links = set()
+
+     # When we call HTMLParser feed() this function is called when it encounters an opening tag <a>
+     def handle_starttag(self, tag, attrs):
+         if tag == 'a':
+             for (attribute, value) in attrs:
+                 if attribute == 'href':
+                     url = parse.urljoin(self.base_url, value)
+                     self.links.add(url)
+
+     def page_links(self):
+         return self.links
+
+     def error(self, message):
+         pass
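
A minimal sketch of feeding HTML to the parser above and reading the collected links; the HTML snippet and URLs are made up for illustration.

```python
# Illustrative only: assumes the file above is importable as `link_finder`.
from link_finder import LinkFinder

html = '<html><body><a href="/docs">Docs</a><a href="https://other.example.org/">Other</a></body></html>'

finder = LinkFinder('http://example.com/', 'http://example.com/index.html')
finder.feed(html)
print(finder.page_links())
# {'http://example.com/docs', 'https://other.example.org/'}
```

Relative hrefs are resolved against `base_url` via `urljoin`, which is why the first link comes back as an absolute URL.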
package/Spider/main.py ADDED
@@ -0,0 +1,50 @@
+ import threading
+ from queue import Queue
+ from spider import Spider
+ from domain import *
+ from general import *
+
+ PROJECT_NAME = 'viper-seo'
+ HOMEPAGE = 'http://viper-seo.com/'
+ DOMAIN_NAME = get_domain_name(HOMEPAGE)
+ QUEUE_FILE = PROJECT_NAME + '/queue.txt'
+ CRAWLED_FILE = PROJECT_NAME + '/crawled.txt'
+ NUMBER_OF_THREADS = 8
+ queue = Queue()
+ Spider(PROJECT_NAME, HOMEPAGE, DOMAIN_NAME)
+
+
+ # Create worker threads (will die when main exits)
+ def create_workers():
+     for _ in range(NUMBER_OF_THREADS):
+         t = threading.Thread(target=work)
+         t.daemon = True
+         t.start()
+
+
+ # Do the next job in the queue
+ def work():
+     while True:
+         url = queue.get()
+         Spider.crawl_page(threading.current_thread().name, url)
+         queue.task_done()
+
+
+ # Each queued link is a new job
+ def create_jobs():
+     for link in file_to_set(QUEUE_FILE):
+         queue.put(link)
+     queue.join()
+     crawl()
+
+
+ # Check if there are items in the queue, if so crawl them
+ def crawl():
+     queued_links = file_to_set(QUEUE_FILE)
+     if len(queued_links) > 0:
+         print(str(len(queued_links)) + ' links in the queue')
+         create_jobs()
+
+
+ create_workers()
+ crawl()
package/Spider/spider.py ADDED
@@ -0,0 +1,74 @@
+ from urllib.request import urlopen
+ from link_finder import LinkFinder
+ from domain import *
+ from general import *
+
+
+ class Spider:
+
+     project_name = ''
+     base_url = ''
+     domain_name = ''
+     queue_file = ''
+     crawled_file = ''
+     queue = set()
+     crawled = set()
+
+     def __init__(self, project_name, base_url, domain_name):
+         Spider.project_name = project_name
+         Spider.base_url = base_url
+         Spider.domain_name = domain_name
+         Spider.queue_file = Spider.project_name + '/queue.txt'
+         Spider.crawled_file = Spider.project_name + '/crawled.txt'
+         self.boot()
+         self.crawl_page('First spider', Spider.base_url)
+
+     # Creates directory and files for project on first run and starts the spider
+     @staticmethod
+     def boot():
+         create_project_dir(Spider.project_name)
+         create_data_files(Spider.project_name, Spider.base_url)
+         Spider.queue = file_to_set(Spider.queue_file)
+         Spider.crawled = file_to_set(Spider.crawled_file)
+
+     # Updates user display, fills queue and updates files
+     @staticmethod
+     def crawl_page(thread_name, page_url):
+         if page_url not in Spider.crawled:
+             print(thread_name + ' now crawling ' + page_url)
+             print('Queue ' + str(len(Spider.queue)) + ' | Crawled ' + str(len(Spider.crawled)))
+             Spider.add_links_to_queue(Spider.gather_links(page_url))
+             Spider.queue.remove(page_url)
+             Spider.crawled.add(page_url)
+             Spider.update_files()
+
+     # Converts raw response data into readable information and checks for proper html formatting
+     @staticmethod
+     def gather_links(page_url):
+         html_string = ''
+         try:
+             response = urlopen(page_url)
+             if 'text/html' in response.getheader('Content-Type'):
+                 html_bytes = response.read()
+                 html_string = html_bytes.decode("utf-8")
+             finder = LinkFinder(Spider.base_url, page_url)
+             finder.feed(html_string)
+         except Exception as e:
+             print(str(e))
+             return set()
+         return finder.page_links()
+
+     # Saves queue data to project files
+     @staticmethod
+     def add_links_to_queue(links):
+         for url in links:
+             if (url in Spider.queue) or (url in Spider.crawled):
+                 continue
+             if Spider.domain_name != get_domain_name(url):
+                 continue
+             Spider.queue.add(url)
+
+     @staticmethod
+     def update_files():
+         set_to_file(Spider.queue, Spider.queue_file)
+         set_to_file(Spider.crawled, Spider.crawled_file)
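
Note that `Spider` keeps all of its state in class-level attributes, so the single instance constructed in `main.py` configures shared queue/crawled sets that every worker thread then operates on. A hedged sketch of that design, with an illustrative project name and URL (it performs real HTTP requests when run):

```python
# Illustrative only: assumes the files above are importable as `domain` and `spider`.
from domain import get_domain_name
from spider import Spider

homepage = 'http://example.com/'
Spider('demo_project', homepage, get_domain_name(homepage))  # boots project files, crawls the homepage

# Later calls (e.g. from worker threads) act on the same class-level sets:
for url in list(Spider.queue):
    Spider.crawl_page('worker-1', url)
```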
package/crawler/.formatter.exs ADDED
@@ -0,0 +1,5 @@
+ # Used by "mix format"
+ [
+   inputs: ["{mix,.formatter}.exs", "{config,lib,test}/**/*.{ex,exs}"],
+   plugins: [Recode.FormatterPlugin]
+ ]
package/crawler/.github/workflows/ci.yml ADDED
@@ -0,0 +1,29 @@
+ name: CI
+ on: push
+ jobs:
+   build:
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v3
+       - name: Set up Elixir
+         uses: erlef/setup-beam@v1
+         with:
+           version-type: strict
+           version-file: .tool-versions
+       - name: Restore dependencies cache
+         id: mix-cache
+         uses: actions/cache@v3
+         with:
+           path: |
+             deps
+             _build
+           key: ${{ runner.os }}-mix-${{ hashFiles('**/mix.lock') }}
+           restore-keys: ${{ runner.os }}-mix-
+       - name: Install dependencies
+         if: steps.mix-cache.outputs.cache-hit != 'true'
+         run: |
+           mix local.rebar --force
+           mix local.hex --force
+           mix deps.get
+       - name: Run tests
+         run: mix test
package/crawler/.recode.exs ADDED
@@ -0,0 +1,33 @@
+ [
+   version: "0.6.4",
+   # Can also be set/reset with `--autocorrect`/`--no-autocorrect`.
+   autocorrect: true,
+   # With "--dry" no changes will be written to the files.
+   # Can also be set/reset with `--dry`/`--no-dry`.
+   # If dry is true then verbose is also active.
+   dry: false,
+   # Can also be set/reset with `--verbose`/`--no-verbose`.
+   verbose: false,
+   # Can be overwritten by calling `mix recode "lib/**/*.ex"`.
+   inputs: ["{mix,.formatter}.exs", "{apps,config,lib,test}/**/*.{ex,exs}"],
+   formatter: {Recode.Formatter, []},
+   tasks: [
+     # Tasks could be added by a tuple of the tasks module name and an options
+     # keyword list. A task can be deactivated by `active: false`. The execution of
+     # a deactivated task can be forced by calling `mix recode --task ModuleName`.
+     {Recode.Task.AliasExpansion, []},
+     {Recode.Task.AliasOrder, []},
+     {Recode.Task.Dbg, [autocorrect: false]},
+     {Recode.Task.EnforceLineLength, [active: false]},
+     {Recode.Task.FilterCount, []},
+     {Recode.Task.IOInspect, [autocorrect: false]},
+     {Recode.Task.Nesting, []},
+     {Recode.Task.PipeFunOne, []},
+     {Recode.Task.SinglePipe, []},
+     {Recode.Task.Specs, [active: false, exclude: "test/**/*.{ex,exs}", config: [only: :visible]]},
+     {Recode.Task.TagFIXME, [exit_code: 2]},
+     {Recode.Task.TagTODO, [exit_code: 4]},
+     {Recode.Task.TestFileExt, []},
+     {Recode.Task.UnusedVariable, [active: false]}
+   ]
+ ]
package/crawler/.tool-versions ADDED
@@ -0,0 +1,2 @@
+ erlang 26.1.1
+ elixir 1.15.6
package/crawler/CHANGELOG.md ADDED
@@ -0,0 +1,82 @@
+ # Crawler Changelog
+
+ ## master
+
+ - [Added] Add `:retries` option
+
+ ## v1.5.0 [2023-10-10]
+
+ - [Added] Add `:force` option
+ - [Added] Add `:scope` option
+
+ ## v1.4.0 [2023-10-07]
+
+ - [Added] Allow multiple instances of Crawler sharing the same queue
+ - [Improved] Logger will now log entries as `debug` or `warn`
+
+ ## v1.3.0 [2023-09-30]
+
+ - [Added] `:store` option, defaults to `nil` to save memory usage
+ - [Added] `:max_pages` option
+ - [Added] `Crawler.running?/1` to check whether Crawler is running
+ - [Improved] The queue is being supervised now
+
+ ## v1.2.0 [2023-09-29]
+
+ - [Added] `Crawler.Store.all_urls/0` to find all scraped URLs
+ - [Improved] Memory usage optimisations
+
+ ## v1.1.2 [2021-10-14]
+
+ - [Improved] Documentation improvements (thanks @kianmeng)
+
+ ## v1.1.1 [2020-05-15]
+
+ - [Improved] Updated `floki` and other dependencies
+
+ ## v1.1.0 [2019-02-25]
+
+ - [Added] `:modifier` option
+ - [Added] `:encode_uri` option
+ - [Improved] Varies small fixes and improvements
+
+ ## v1.0.0 [2017-08-31]
+
+ - [Added] Pause / resume / stop Crawler
+ - [Improved] Varies small fixes and improvements
+
+ ## v0.4.0 [2017-08-28]
+
+ - [Added] `:scraper` option to allow scraping content
+ - [Improved] Varies small fixes and improvements
+
+ ## v0.3.1 [2017-08-28]
+
+ - [Improved] `Crawler.Store.DB` now stores the `opts` meta data
+ - [Improved] Code documentation
+ - [Improved] Varies small fixes and improvements
+
+ ## v0.3.0 [2017-08-27]
+
+ - [Added] `:retrier` option to allow custom fetch retrying logic
+ - [Added] `:url_filter` option to allow custom url filtering logic
+ - [Improved] Parser is now more stable and skips unparsable files
+ - [Improved] Varies small fixes and improvements
+
+ ## v0.2.0 [2017-08-21]
+
+ - [Added] `:workers` option
+ - [Added] `:interval` option
+ - [Added] `:timeout` option
+ - [Added] `:user_agent` option
+ - [Added] `:save_to` option
+ - [Added] `:assets` option
+ - [Added] `:parser` option to allow custom parsing logic
+ - [Improved] Renamed `:max_levels` to `:max_depths`
+ - [Improved] Varies small fixes and improvements
+
+ ## v0.1.0 [2017-07-30]
+
+ - [Added] A semi-functioning prototype
+ - [Added] Finished the very basic crawling function
+ - [Added] `:max_levels` option
@@ -0,0 +1,198 @@
1
+ # Crawler
2
+
3
+ [![Build Status](https://github.com/fredwu/crawler/actions/workflows/ci.yml/badge.svg)](https://github.com/fredwu/crawler/actions)
4
+ [![CodeBeat](https://codebeat.co/badges/76916047-5b66-466d-91d3-7131a269899a)](https://codebeat.co/projects/github-com-fredwu-crawler-master)
5
+ [![Coverage](https://img.shields.io/coveralls/fredwu/crawler.svg)](https://coveralls.io/github/fredwu/crawler?branch=master)
6
+ [![Module Version](https://img.shields.io/hexpm/v/crawler.svg)](https://hex.pm/packages/crawler)
7
+ [![Hex Docs](https://img.shields.io/badge/hex-docs-lightgreen.svg)](https://hexdocs.pm/crawler/)
8
+ [![Total Download](https://img.shields.io/hexpm/dt/crawler.svg)](https://hex.pm/packages/crawler)
9
+ [![License](https://img.shields.io/hexpm/l/crawler.svg)](https://github.com/fredwu/crawler/blob/master/LICENSE.md)
10
+ [![Last Updated](https://img.shields.io/github/last-commit/fredwu/crawler.svg)](https://github.com/fredwu/crawler/commits/master)
11
+
12
+ A high performance web crawler / scraper in Elixir, with worker pooling and rate limiting via [OPQ](https://github.com/fredwu/opq).
13
+
14
+ ## Features
15
+
16
+ - Crawl assets (javascript, css and images).
17
+ - Save to disk.
18
+ - Hook for scraping content.
19
+ - Restrict crawlable domains, paths or content types.
20
+ - Limit concurrent crawlers.
21
+ - Limit rate of crawling.
22
+ - Set the maximum crawl depth.
23
+ - Set timeouts.
24
+ - Set retries strategy.
25
+ - Set crawler's user agent.
26
+ - Manually pause/resume/stop the crawler.
27
+
28
+ See [Hex documentation](https://hexdocs.pm/crawler/).
29
+
30
+ ## Architecture
31
+
32
+ Below is a very high level architecture diagram demonstrating how Crawler works.
33
+
34
+ ![](architecture.svg)
35
+
36
+ ## Usage
37
+
38
+ ```elixir
39
+ Crawler.crawl("http://elixir-lang.org", max_depths: 2)
40
+ ```
41
+
42
+ There are several ways to access the crawled page data:
43
+
44
+ 1. Use [`Crawler.Store`](https://hexdocs.pm/crawler/Crawler.Store.html)
45
+ 2. Tap into the registry([?](https://hexdocs.pm/elixir/Registry.html)) [`Crawler.Store.DB`](lib/crawler/store.ex)
46
+ 3. Use your own [scraper](#custom-modules)
47
+ 4. If the `:save_to` option is set, pages will be saved to disk in addition to the above mentioned places
48
+ 5. Provide your own [custom parser](#custom-modules) and manage how data is stored and accessed yourself
49
+
50
+ ## Configurations
51
+
52
+ | Option | Type | Default Value | Description |
53
+ | ------------- | ------- | --------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
54
+ | `:assets` | list | `[]` | Whether to fetch any asset files, available options: `"css"`, `"js"`, `"images"`. |
55
+ | `:save_to` | string | `nil` | When provided, the path for saving crawled pages. |
56
+ | `:workers` | integer | `10` | Maximum number of concurrent workers for crawling. |
57
+ | `:interval` | integer | `0` | Rate limit control - number of milliseconds before crawling more pages, defaults to `0` which is effectively no rate limit. |
58
+ | `:max_depths` | integer | `3` | Maximum nested depth of pages to crawl. |
59
+ | `:max_pages` | integer | `:infinity` | Maximum amount of pages to crawl. |
60
+ | `:timeout` | integer | `5000` | Timeout value for fetching a page, in ms. Can also be set to `:infinity`, useful when combined with `Crawler.pause/1`. |
61
+ | `:retries` | integer | `2` | Number of times to retry a fetch. |
62
+ | `:store` | module | `nil` | Module for storing the crawled page data and crawling metadata. You can set it to `Crawler.Store` or use your own module, see `Crawler.Store.add_page_data/3` for implementation details. |
63
+ | `:force` | boolean | `false` | Force crawling URLs even if they have already been crawled, useful if you want to refresh the crawled data. |
64
+ | `:scope` | term | `nil` | Similar to `:force`, but you can pass a custom `:scope` to determine how Crawler should perform on links already seen. |
65
+ | `:user_agent` | string | `Crawler/x.x.x (...)` | User-Agent value sent by the fetch requests. |
66
+ | `:url_filter` | module | `Crawler.Fetcher.UrlFilter` | Custom URL filter, useful for restricting crawlable domains, paths or content types. |
67
+ | `:retrier` | module | `Crawler.Fetcher.Retrier` | Custom fetch retrier, useful for retrying failed crawls, nullifies the `:retries` option. |
68
+ | `:modifier` | module | `Crawler.Fetcher.Modifier` | Custom modifier, useful for adding custom request headers or options. |
69
+ | `:scraper` | module | `Crawler.Scraper` | Custom scraper, useful for scraping content as soon as the parser parses it. |
70
+ | `:parser` | module | `Crawler.Parser` | Custom parser, useful for handling parsing differently or to add extra functionalities. |
71
+ | `:encode_uri` | boolean | `false` | When set to `true` apply the `URI.encode` to the URL to be crawled. |
72
+ | `:queue` | pid | `nil` | You can pass in an `OPQ` pid so that multiple crawlers can share the same queue. |
73
+
74
+ ## Custom Modules
75
+
76
+ It is possible to swap in your custom logic as shown in the configurations section. Your custom modules need to conform to their respective behaviours:
77
+
78
+ ### Retrier
79
+
80
+ See [`Crawler.Fetcher.Retrier`](lib/crawler/fetcher/retrier.ex).
81
+
82
+ Crawler uses [ElixirRetry](https://github.com/safwank/ElixirRetry)'s exponential backoff strategy by default.
83
+
84
+ ```elixir
85
+ defmodule CustomRetrier do
86
+ @behaviour Crawler.Fetcher.Retrier.Spec
87
+ end
88
+ ```
89
+
90
+ ### URL Filter
91
+
92
+ See [`Crawler.Fetcher.UrlFilter`](lib/crawler/fetcher/url_filter.ex).
93
+
94
+ ```elixir
95
+ defmodule CustomUrlFilter do
96
+ @behaviour Crawler.Fetcher.UrlFilter.Spec
97
+ end
98
+ ```
99
+
100
+ ### Scraper
101
+
102
+ See [`Crawler.Scraper`](lib/crawler/scraper.ex).
103
+
104
+ ```elixir
105
+ defmodule CustomScraper do
106
+ @behaviour Crawler.Scraper.Spec
107
+ end
108
+ ```
109
+
110
+ ### Parser
111
+
112
+ See [`Crawler.Parser`](lib/crawler/parser.ex).
113
+
114
+ ```elixir
115
+ defmodule CustomParser do
116
+ @behaviour Crawler.Parser.Spec
117
+ end
118
+ ```
119
+
120
+ ### Modifier
121
+
122
+ See [`Crawler.Fetcher.Modifier`](lib/crawler/fetcher/modifier.ex).
123
+
124
+ ```elixir
125
+ defmodule CustomModifier do
126
+ @behaviour Crawler.Fetcher.Modifier.Spec
127
+ end
128
+ ```
129
+
130
+ ## Pause / Resume / Stop Crawler
131
+
132
+ Crawler provides `pause/1`, `resume/1` and `stop/1`, see below.
133
+
134
+ ```elixir
135
+ {:ok, opts} = Crawler.crawl("https://elixir-lang.org")
136
+
137
+ Crawler.running?(opts) # => true
138
+
139
+ Crawler.pause(opts)
140
+
141
+ Crawler.running?(opts) # => false
142
+
143
+ Crawler.resume(opts)
144
+
145
+ Crawler.running?(opts) # => true
146
+
147
+ Crawler.stop(opts)
148
+
149
+ Crawler.running?(opts) # => false
150
+ ```
151
+
152
+ Please note that when pausing Crawler, you would need to set a large enough `:timeout` (or even set it to `:infinity`) otherwise parser would timeout due to unprocessed links.
153
+
154
+ ## Multiple Crawlers
155
+
156
+ It is possible to start multiple crawlers sharing the same queue.
157
+
158
+ ```elixir
159
+ {:ok, queue} = OPQ.init(worker: Crawler.Dispatcher.Worker, workers: 2)
160
+
161
+ Crawler.crawl("https://elixir-lang.org", queue: queue)
162
+ Crawler.crawl("https://github.com", queue: queue)
163
+ ```
164
+
165
+ ## Find All Scraped URLs
166
+
167
+ ```elixir
168
+ Crawler.Store.all_urls() # => ["https://elixir-lang.org", "https://google.com", ...]
169
+ ```
170
+
171
+ ## Examples
172
+
173
+ ### Google Search + Github
174
+
175
+ This example performs a Google search, then scrapes the results to find Github projects and output their name and description.
176
+
177
+ See the [source code](examples/google_search.ex).
178
+
179
+ You can run the example by cloning the repo and run the command:
180
+
181
+ ```shell
182
+ mix run -e "Crawler.Example.GoogleSearch.run()"
183
+ ```
184
+
185
+ ## API Reference
186
+
187
+ Please see https://hexdocs.pm/crawler.
188
+
189
+ ## Changelog
190
+
191
+ Please see [CHANGELOG.md](CHANGELOG.md).
192
+
193
+ ## Copyright and License
194
+
195
+ Copyright (c) 2016 Fred Wu
196
+
197
+ This work is free. You can redistribute it and/or modify it under the
198
+ terms of the [MIT License](http://fredwu.mit-license.org/).