logstash-input-crawler 1.0.0

@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: 25f7f64d83707e4bc2584693fafae5a741eb21b32dbfcec3ecd109f4d47bd0dc
+   data.tar.gz: a48c227043804fbc72336c4a2acf1a5893e4d6646d7df7fb73b56bdf1d002161
+ SHA512:
+   metadata.gz: 3729fb3cb17afda7f9ca16fc20fe75d7ae3a4524f282927f40cb9284fdc399819e57d8887e49bf986e50c1977c97d38c6ad3c326bde02312a46c4cb4ac68e9ab
+   data.tar.gz: 0266a0c5c777c34885881eeda588aeab373a301f168faa92bf6e1e7771e84ae3d2aab52a3775bffc9e2f48d5edee3345caef58f2b1e651e045d37c0b3a79b302
@@ -0,0 +1,2 @@
+ ## 0.1.0
+ - Plugin created with the logstash plugin generator
@@ -0,0 +1,10 @@
+ The following is a list of people who have contributed ideas, code, bug
+ reports, or in general have helped logstash along its way.
+
+ Contributors:
+ * -
+
+ Note: If you've sent us patches, bug reports, or otherwise contributed to
+ Logstash, and you aren't on the list above and want to be, please let us know
+ and we'll make sure you're here. Contributions from folks like you are what make
+ open source awesome.
@@ -0,0 +1,2 @@
+ # logstash-input-crawler
+ Example input plugin. This should help bootstrap your effort to write your own input plugin!
data/Gemfile ADDED
@@ -0,0 +1,6 @@
+ source 'https://rubygems.org'
+
+ gemspec
+
+ gem 'mechanize'
+ gem 'nokogiri'
data/LICENSE ADDED
@@ -0,0 +1,11 @@
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
@@ -0,0 +1,86 @@
+ # Logstash Plugin
+
+ This is a plugin for [Logstash](https://github.com/elastic/logstash).
+
+ It is fully free and fully open source. The license is Apache 2.0, meaning you are pretty much free to use it however you want in whatever way.
+
+ ## Documentation
+
+ Logstash provides infrastructure to automatically generate documentation for this plugin. We use the asciidoc format to write documentation so any comments in the source code will be first converted into asciidoc and then into html. All plugin documentation is placed under one [central location](http://www.elastic.co/guide/en/logstash/current/).
+
+ - To format code or config examples, you can use the asciidoc `[source,ruby]` directive
+ - For more asciidoc formatting tips, see the excellent reference here https://github.com/elastic/docs#asciidoc-guide
+
+ ## Need Help?
+
+ Need help? Try #logstash on freenode IRC or the https://discuss.elastic.co/c/logstash discussion forum.
+
+ ## Developing
+
+ ### 1. Plugin Development and Testing
+
+ #### Code
+ - To get started, you'll need JRuby with the Bundler gem installed.
+
+ - Create a new plugin or clone an existing one from the GitHub [logstash-plugins](https://github.com/logstash-plugins) organization. We also provide [example plugins](https://github.com/logstash-plugins?query=example).
+
+ - Install dependencies
+ ```sh
+ bundle install
+ ```
+
+ #### Test
+
+ - Update your dependencies
+
+ ```sh
+ bundle install
+ ```
+
+ - Run tests
+
+ ```sh
+ bundle exec rspec
+ ```
+
+ ### 2. Running your unpublished Plugin in Logstash
+
+ #### 2.1 Run in a local Logstash clone
+
+ - Edit Logstash `Gemfile` and add the local plugin path, for example:
+ ```ruby
+ gem "logstash-filter-awesome", :path => "/your/local/logstash-filter-awesome"
+ ```
+ - Install plugin
+ ```sh
+ bin/logstash-plugin install --no-verify
+ ```
+ - Run Logstash with your plugin
+ ```sh
+ bin/logstash -e 'filter {awesome {}}'
+ ```
+ At this point any modifications to the plugin code will be applied to this local Logstash setup. After modifying the plugin, simply rerun Logstash.
+
+ #### 2.2 Run in an installed Logstash
+
+ You can use the same **2.1** method to run your plugin in an installed Logstash by editing its `Gemfile` and pointing the `:path` to your local plugin development directory, or you can build the gem and install it using:
+
+ - Build your plugin gem
+ ```sh
+ gem build logstash-filter-awesome.gemspec
+ ```
+ - Install the plugin from the Logstash home
+ ```sh
+ bin/logstash-plugin install /your/local/plugin/logstash-filter-awesome.gem
+ ```
+ - Start Logstash and proceed to test the plugin
+
+ ## Contributing
+
+ All contributions are welcome: ideas, patches, documentation, bug reports, complaints, and even something you drew up on a napkin.
+
+ Programming is not a required skill. Whatever you've seen about open source and maintainers or community members saying "send patches or die" - you will not see that here.
+
+ It is more important to the community that you are able to contribute.
+
+ For more information about contributing, see the [CONTRIBUTING](https://github.com/elastic/logstash/blob/master/CONTRIBUTING.md) file.
@@ -0,0 +1,103 @@
+ # encoding: utf-8
+ require "logstash/inputs/base"
+ require "logstash/namespace"
+ require "stud/interval"
+ require "net/http"
+ require "uri"
+ require "mechanize"
+
+ class LogStash::Inputs::Crawler < LogStash::Inputs::Base
+   config_name "crawler"
+
+   # If undefined, Logstash will complain, even if codec is unused.
+   default :codec, "plain"
+
+   # The URL of the page where the crawl starts.
+   config :url, :validate => :string, :required => true
+
+   # Set the interval, in seconds, between crawls (passed to stoppable_sleep)
+   config :interval, :validate => :number, :default => 86400
+
+   # Set the maximum number of URLs to crawl
+   config :url_max, :validate => :number, :default => 10
+
+   public
+   def register
+     @urls = []
+     @agent = Mechanize.new
+     @agent.agent.http.verify_mode = OpenSSL::SSL::VERIFY_NONE
+   end # def register
+
+
+   def run(queue)
+     # we can abort the loop if stop? becomes true
+     while !stop?
+       start_crawl(queue)
+       Stud.stoppable_sleep(@interval) { stop? }
+     end # loop
+   end # def run
+
+
+   def stop
+     # nothing to do in this case so it is not necessary to define stop
+     # examples of common "stop" tasks:
+     # * close sockets (unblocking blocking reads/accepts)
+     # * cleanup temporary files
+     # * terminate spawned threads
+   end
+
+   def start_crawl(queue)
+     begin
+       get_urls_for_page(@url,queue)
+     rescue StandardError => e
+       @logger.error("Crawl failed", :url => @url, :exception => e.message)
+     end
+   end
+
+   def get_urls_for_page(url,queue)
+     page_content = get_page_content url
+     # Regex to get all "links" in the page
+     urlsa = page_content.scan(/\<a href\=(\"(http|https)\:.*?\")/)
+     urlsa.each { |u|
+       sanitized_url = u.first.gsub(/\"/, '').strip
+       if (@urls.include?(sanitized_url) == false) && (@urls.length < @url_max)
+         @urls.push(sanitized_url)
+         pagina = @agent.get(sanitized_url)
+         content = pagina.body
+         evento = LogStash::Event.new("link" => sanitized_url , "contenido" => content)
+         decorate(evento)
+         queue << evento
+         #puts "/*******************************************************************************/"
+         #puts @urls.length
+         #puts "/*******************************************************************************/"
+         # If an unexpected error happens while fetching URLs, move on to the next URL
+         begin
+           get_urls_for_page(sanitized_url,queue)
+         rescue StandardError => e
+           #puts "/*******************************************************************************/"
+           #puts "Problem fetching the content of: " + sanitized_url
+           #puts "/*******************************************************************************/"
+           next
+         end
+       end
+     }
+     return @urls
+   end
+
+   def get_page_content url
+     uri = URI(url)
+     request = Net::HTTP::Get.new(uri)
+     http = Net::HTTP.new(uri.host, uri.port)
+     # Need to enable use of SSL if the URL protocol is HTTPS
+     http.use_ssl = (uri.scheme == "https")
+     response = http.request(request)
+     # Check if URL needs to be forwarded because of redirect
+     case response
+     when Net::HTTPSuccess
+       return response.body
+     when Net::HTTPRedirection
+       get_page_content response['location']
+     end
+   end
+
+ end # class LogStash::Inputs::Crawler
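A note on the redirect branch of `get_page_content`: Ruby evaluates an expression such as `A || B` before `case` compares anything, so a `when` clause written that way only ever tests `A`. Every 3xx response class in `net/http` inherits from `Net::HTTPRedirection`, so matching on that single class should be enough. The sketch below illustrates this with responses built by hand (no network access); `classify` is a hypothetical helper for illustration and is not part of the plugin.

```ruby
require "net/http"

# Hypothetical helper: mirrors the shape of the case statement in
# get_page_content without performing any request.
def classify(response)
  case response
  when Net::HTTPSuccess     then :success
  when Net::HTTPRedirection then :redirect
  else                           :other
  end
end

# Responses constructed by hand: (http_version, status_code, message)
ok        = Net::HTTPOK.new("1.1", "200", "OK")
moved     = Net::HTTPMovedPermanently.new("1.1", "301", "Moved Permanently")
found     = Net::HTTPFound.new("1.1", "302", "Found")
not_found = Net::HTTPNotFound.new("1.1", "404", "Not Found")

# A class object is always truthy, so `A || B` is simply A.
only_first = (Net::HTTPMovedPermanently || Net::HTTPRedirection)

puts classify(moved)      # redirect
puts classify(found)      # redirect
puts only_first === found # false: a 302 would slip past `when A || B`
```

A response that matches neither branch makes the `case` return `nil`, so callers of a method shaped like this may want to check for `nil` before scanning the body.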
@@ -0,0 +1,93 @@
+ # encoding: utf-8
+ require "logstash/inputs/base"
+ require "logstash/namespace"
+ require "stud/interval"
+ require "set"
+ require "uri"
+ require "nokogiri"
+ require "open-uri"
+
+ class LogStash::Inputs::Crawler < LogStash::Inputs::Base
+   config_name "crawler"
+
+   # If undefined, Logstash will complain, even if codec is unused.
+   default :codec, "plain"
+
+   # The URL of the page where the crawl starts.
+   config :url, :validate => :string, :required => true
+
+   # Set the interval, in seconds, between crawls (passed to stoppable_sleep)
+   config :interval, :validate => :number, :default => 86400
+
+   public
+   def register
+     @seen_pages = Set.new # Keep track of what we've seen
+   end # def register
+
+
+   def run(queue)
+     # we can abort the loop if stop? becomes true
+     while !stop?
+
+       crawl_site(@url) do |page,uri|
+         event = LogStash::Event.new("link" => uri.to_s)
+         decorate(event)
+         queue << event
+       end
+
+       evento = LogStash::Event.new("paginas_exploradas" => @seen_pages.length)
+       decorate(evento)
+       queue << evento
+
+       Stud.stoppable_sleep(@interval) { stop? }
+     end # loop
+   end # def run
+
+
+   def stop
+     # nothing to do in this case so it is not necessary to define stop
+     # examples of common "stop" tasks:
+     # * close sockets (unblocking blocking reads/accepts)
+     # * cleanup temporary files
+     # * terminate spawned threads
+   end
+
+   def crawl_site( starting_at, &each_page )
+     files = %w[png jpeg jpg gif svg txt js css zip gz asp PNG JPEG JPG GIF SVG TXT JS CSS ZIP GZ ASP]
+     starting_uri = URI.parse(starting_at)
+
+     crawl_page = ->(page_uri) do # A re-usable mini-function
+       unless @seen_pages.include?(page_uri)
+         @seen_pages << page_uri # Record that we've seen this
+         begin
+           doc = Nokogiri.HTML(open(page_uri)) # Get the page
+           each_page.call(doc,page_uri) # Yield page and URI to the block
+
+           # Find all the links on the page
+           hrefs = doc.css('a[href]').map{ |a| a['href'] }
+
+           # Make these URIs, throwing out problem ones like mailto:
+           uris = hrefs.map{ |href| URI.join( page_uri, href ) rescue nil }.compact.uniq
+
+           # Pare it down to only those pages that are on the same site
+           uris.select!{ |uri| uri.host == starting_uri.host }
+
+           # Throw out links to files (this could be more efficient with regex)
+           uris.reject!{ |uri| files.any?{ |ext| uri.path.end_with?(".#{ext}") } }
+
+           # Remove #foo fragments so that sub-page links aren't differentiated
+           uris.each{ |uri| uri.fragment = nil }
+
+           # Recursively crawl the child URIs
+           uris.each{ |uri| crawl_page.call(uri) }
+
+         rescue OpenURI::HTTPError # Guard against 404s
+           warn "Skipping invalid link #{page_uri}"
+         end
+       end
+     end
+     crawl_page.call( starting_uri ) # Kick it all off!
+   end
+
+
+ end # class LogStash::Inputs::Crawler
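The link-filtering steps inside `crawl_site` (resolve each href against the page, keep same-host links, drop file links, strip fragments) can be tried in isolation with the standard library alone. The following is a minimal sketch under stated assumptions: `same_site_links` and `SKIPPED_EXTENSIONS` are hypothetical names, the hrefs are passed in as plain strings instead of being read from a Nokogiri document, and extensions are compared case-insensitively rather than listing both cases.

```ruby
require "uri"

# Assumed extension list for illustration (lowercase only; paths are downcased).
SKIPPED_EXTENSIONS = %w[png jpeg jpg gif svg txt js css zip gz asp].freeze

# Hypothetical helper: returns the crawlable same-site URLs found on a page.
def same_site_links(page_url, hrefs)
  base = URI.parse(page_url)

  # Resolve relative hrefs; unparsable ones become nil and are dropped.
  uris = hrefs.map { |href| URI.join(base, href) rescue nil }.compact

  # Keep only links on the same host (this also discards mailto: links,
  # which have no host).
  uris = uris.select { |uri| uri.host == base.host }

  # Drop links that point at static files.
  uris = uris.reject do |uri|
    SKIPPED_EXTENSIONS.any? { |ext| uri.path.downcase.end_with?(".#{ext}") }
  end

  # Strip #fragments so /about and /about#team count as one page.
  uris.each { |uri| uri.fragment = nil }
  uris.map(&:to_s).uniq
end

links = same_site_links(
  "https://example.com/index.html",
  ["/about#team", "/about", "logo.PNG", "mailto:a@example.com",
   "https://other.example/x", "http://exa mple.com/bad"]
)
puts links.inspect # ["https://example.com/about"]
```

Deduplicating after the fragment is removed, as done here, avoids treating `/about` and `/about#team` as two distinct entries.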
@@ -0,0 +1,91 @@
+ # encoding: utf-8
+ require "logstash/inputs/base"
+ require "logstash/namespace"
+ require "stud/interval"
+ require "mechanize"
+
+ class LogStash::Inputs::Crawler < LogStash::Inputs::Base
+   config_name "crawler"
+
+   # If undefined, Logstash will complain, even if codec is unused.
+   default :codec, "plain"
+
+   # The URL of the page where the crawl starts.
+   config :url, :validate => :string, :required => true
+
+   # Set how deep the crawl should explore.
+   config :deep, :validate => :number, :default => 3
+
+   # Set the interval, in seconds, between crawls (passed to stoppable_sleep)
+   config :interval, :validate => :number, :default => 86400
+
+   public
+   def register
+     @prof = 1
+     @links = []
+     @cuenta = [1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
+     @agent = Mechanize.new
+     @agent.agent.http.verify_mode = OpenSSL::SSL::VERIFY_NONE
+     @agent.redirection_limit = 500
+     @cont = 0
+     @url_actual = @url
+     @cola = []
+     @cola << @url
+   end # def register
+
+
+   def run(queue)
+     # we can abort the loop if stop? becomes true
+     while !stop?
+
+       loop do
+         @url_actual = @cola.shift
+         if (!@links.include?(@url_actual))
+           begin
+             @page = @agent.get(@url_actual)
+           rescue Mechanize::ResponseCodeError => exception
+             if exception.response_code != '200'
+               @url_actual = @cola.shift
+             end
+             retry
+           end
+           @page.links_with(:href => /^https?/).each do |link|
+             @cola << link.href
+             @cuenta[@prof] = @cuenta[@prof] + 1
+           end
+           @links << @url_actual
+         end
+
+         if (@cuenta[@prof-1] == @links.length)
+           @prof = @prof + 1
+         end
+
+         break if @prof >= @deep
+       end
+
+       @links.each do |link|
+         pagina = @agent.get(link)
+         #content = pagina.body
+         event = LogStash::Event.new("link" => link)
+         decorate(event)
+         queue << event
+       end
+
+       event = LogStash::Event.new("numero_de_links" => @links.length)
+       decorate(event)
+       queue << event
+
+
+       Stud.stoppable_sleep(@interval) { stop? }
+     end # loop
+   end # def run
+
+
+   def stop
+     # nothing to do in this case so it is not necessary to define stop
+     # examples of common "stop" tasks:
+     # * close sockets (unblocking blocking reads/accepts)
+     # * cleanup temporary files
+     # * terminate spawned threads
+   end
+ end # class LogStash::Inputs::Crawler
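The queue-based loop above is a breadth-first crawl limited by `deep`. One common alternative to counting links per level is to store each URL's depth next to it in the queue. The sketch below is an illustration of that idea, not the author's method: `crawl_breadth_first` is a hypothetical helper, the in-memory `graph` stands in for real HTTP fetches, and it assumes that depth 1 is the start page and that pages at `max_depth` are visited but not expanded.

```ruby
require "set"

# Hypothetical helper: visits pages breadth-first up to max_depth.
# The block receives a URL and must return the links found on that page.
def crawl_breadth_first(start, max_depth)
  seen    = Set.new([start]) # URLs already queued, to avoid revisits
  queue   = [[start, 1]]     # pairs of [url, depth]
  visited = []

  until queue.empty?
    url, depth = queue.shift
    visited << url
    next if depth >= max_depth # do not expand pages at the depth limit

    yield(url).each do |link|
      # Set#add? returns nil when the link was already seen.
      queue << [link, depth + 1] if seen.add?(link)
    end
  end

  visited
end

# Assumed in-memory link graph used instead of network requests.
graph = {
  "a" => ["b", "c"],
  "b" => ["d", "a"],
  "c" => ["d"],
  "d" => ["e"]
}

puts crawl_breadth_first("a", 2) { |url| graph.fetch(url, []) }.inspect
# ["a", "b", "c"]
puts crawl_breadth_first("a", 3) { |url| graph.fetch(url, []) }.inspect
# ["a", "b", "c", "d"]
```

Carrying the depth with each queue entry removes the need for a fixed-size counter array and keeps the stop condition local to each page.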
@@ -0,0 +1,27 @@
+ Gem::Specification.new do |s|
+   s.name = 'logstash-input-crawler'
+   s.version = '1.0.0'
+   s.licenses = ['Apache-2.0']
+   s.summary = 'This plugin gets the links and the HTML content starting from an initial page.'
+   s.description = 'This plugin requires the initial URL to be set.'
+   s.homepage = 'https://github.com/felixramirezgarcia/logstash-input-crawler'
+   s.authors = ['Felix R G']
+   s.email = 'felixramirezgarcia@correo.ugr.es'
+   s.require_paths = ['lib']
+
+   # Files
+   s.files = Dir['lib/**/*','spec/**/*','vendor/**/*','*.gemspec','*.md','CONTRIBUTORS','Gemfile','LICENSE','NOTICE.TXT']
+   # Tests
+   s.test_files = s.files.grep(%r{^(test|spec|features)/})
+
+   # Special flag to let us know this is actually a logstash plugin
+   s.metadata = { "logstash_plugin" => "true", "logstash_group" => "input" }
+
+   # Gem dependencies
+   s.add_runtime_dependency "logstash-core"
+   s.add_runtime_dependency 'logstash-codec-plain'
+   s.add_runtime_dependency 'stud', '>= 0.0.22'
+   s.add_development_dependency 'logstash-devutils', '>= 0.0.16'
+   s.add_runtime_dependency "mechanize"
+   s.add_runtime_dependency "nokogiri"
+ end
@@ -0,0 +1,11 @@
+ # encoding: utf-8
+ require "logstash/devutils/rspec/spec_helper"
+ require "logstash/inputs/crawler"
+
+ describe LogStash::Inputs::Crawler do
+
+   it_behaves_like "an interruptible input plugin" do
+     let(:config) { { "interval" => 100 } }
+   end
+
+ end
metadata ADDED
@@ -0,0 +1,141 @@
+ --- !ruby/object:Gem::Specification
+ name: logstash-input-crawler
+ version: !ruby/object:Gem::Version
+   version: 1.0.0
+ platform: ruby
+ authors:
+ - Felix R G
+ autorequire:
+ bindir: bin
+ cert_chain: []
+ date: 2018-07-24 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   name: logstash-core
+   prerelease: false
+   type: :runtime
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   name: logstash-codec-plain
+   prerelease: false
+   type: :runtime
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: 0.0.22
+   name: stud
+   prerelease: false
+   type: :runtime
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: 0.0.22
+ - !ruby/object:Gem::Dependency
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: 0.0.16
+   name: logstash-devutils
+   prerelease: false
+   type: :development
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: 0.0.16
+ - !ruby/object:Gem::Dependency
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   name: mechanize
+   prerelease: false
+   type: :runtime
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   name: nokogiri
+   prerelease: false
+   type: :runtime
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ description: This plugin requires the initial URL to be set.
+ email: felixramirezgarcia@correo.ugr.es
+ executables: []
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - CHANGELOG.md
+ - CONTRIBUTORS
+ - DEVELOPER.md
+ - Gemfile
+ - LICENSE
+ - README.md
+ - lib/logstash/inputs/crawler.rb
+ - lib/logstash/inputs/crawler.rb.BK
+ - lib/logstash/inputs/crawler.rb.bk
+ - logstash-input-crawler.gemspec
+ - spec/inputs/crawler_spec.rb
+ homepage: https://github.com/felixramirezgarcia/logstash-input-crawler
+ licenses:
+ - Apache-2.0
+ metadata:
+   logstash_plugin: 'true'
+   logstash_group: input
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubyforge_project:
+ rubygems_version: 2.6.13
+ signing_key:
+ specification_version: 4
+ summary: This plugin gets the links and the HTML content starting from an initial page.
+ test_files:
+ - spec/inputs/crawler_spec.rb