tag_crawler 0.1.2 → 0.1.3
- checksums.yaml +4 -4
- data/README.md +7 -28
- data/lib/tag_crawler/version.rb +1 -1
- data/lib/web_scraper.rb +2 -1
- data/terminal_shot.png +0 -0
- metadata +2 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 13ed4f2f4d8ccf9a2372c8a480f0e233d47b86b5
+data.tar.gz: 5574e0f74fef547ef0fbf2b562d63ebdc2fa1a32
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: e375536e3a849ba3afc32e3d3ab305fb31987095bb13581b26f146491b22a750c3f9fc2ba7ccc99ab9505d527291bbd40c19be21a03f56c2c37d9e4c29966dfb
+data.tar.gz: d562440f17a682590677410c7a2b0a4e892d1b9a0261972eb8a6b970cb3126c8c2912581e43b30581363aa2fdc5af35b64d0c73b582881031ef92f34dadd37db
data/README.md
CHANGED
@@ -1,41 +1,20 @@
 # TagCrawler

-
-
-TODO: Delete this and the text above, and describe your gem
+Tagcrawler will crawl a URL and output all the links, HTML tags, and sequences on the page.
+Sequences are two or more words that have the first letter in each word capitalized.

 ## Installation

-Add this line to your application's Gemfile:
-
-```ruby
-gem 'tag_crawler'
-```
-
-And then execute:
-
-$ bundle
-
-Or install it yourself as:
-
 $ gem install tag_crawler

 ## Usage

-
-
-## Development
-
-After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
-
-To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
-
-## Contributing
-
-Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/tag_crawler. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
+The first argument is the URL to crawl. If the URL is missing the transport protocol, it will assume http://. Only http:// and https:// protocols are valid.

+The second argument is the OUTPUT file that the extracted features will be written to.

-
+$ tag_crawler https://github.com output.txt

-The
+The output file name will be in the form YYYYMMDDHHMMSS_output, a timestamp followed by the output file name provided.

+
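The argument handling described in the new README text can be sketched roughly as follows. This is only an illustration under stated assumptions: `normalize_url` and `timestamped_name` are hypothetical helper names that do not appear in the gem, and the gem's actual validation may differ.

```ruby
require "uri"

# Hypothetical helper (not part of the gem): default to http:// when the
# URL has no scheme, and accept only http:// and https:// URLs.
def normalize_url(raw)
  has_scheme = raw.match?(%r{\A[a-z][a-z0-9+.\-]*://}i)
  candidate = has_scheme ? raw : "http://#{raw}"
  uri = URI.parse(candidate)
  # URI::HTTPS is a subclass of URI::HTTP, so this check covers both schemes.
  unless uri.is_a?(URI::HTTP) && uri.host
    raise ArgumentError, "only http:// and https:// URLs are supported: #{raw}"
  end
  uri.to_s
end

# Hypothetical helper (not part of the gem): prefix the requested output
# name with a YYYYMMDDHHMMSS timestamp, as the README describes.
def timestamped_name(output, now = Time.now)
  "#{now.strftime('%Y%m%d%H%M%S')}_#{output}"
end
```

Under these assumptions, `normalize_url("github.com")` yields an `http://` URL, and `timestamped_name("output.txt")` yields a name such as `20160102030405_output.txt` for a run at 2016-01-02 03:04:05.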
data/lib/tag_crawler/version.rb
CHANGED
data/lib/web_scraper.rb
CHANGED
@@ -8,6 +8,7 @@ module TagCrawler
 OPENING_TAG = /\<(\w+)(\>|\s([^\/]+)\>)/
 CLOSING_TAG = /<\/(\w+)>/
 SELF_CLOSING_TAG = /\<(\w+)(\/\>|\s*(.*)\/\>)/
+CAPITAL_LETTER = /[A-Z]/

 def initialize(url)
   begin
@@ -61,7 +62,7 @@ module TagCrawler
 words = node.split(" ")
 current_sequence = []
 words.each_with_index do |word, idx|
-  if(word
+  if(word.length >= 2 && CAPITAL_LETTER.match(word[0]))
     current_sequence << word
   elsif(current_sequence.length >= 2)
     sequences << current_sequence.join(" ")
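For readers who want to see the effect of the new condition in isolation, here is a minimal, self-contained sketch of the sequence rule (runs of two or more words, each at least two characters long and starting with a capital letter). The method name `extract_sequences` and the reset and flush logic are assumptions; the hunk above shows only part of the gem's loop, so this is not the gem's actual implementation.

```ruby
# Hypothetical standalone version of the sequence rule; the inline regex
# mirrors the CAPITAL_LETTER constant added in this release.
def extract_sequences(text)
  sequences = []
  current_sequence = []
  text.split(" ").each do |word|
    if word.length >= 2 && word[0].match?(/[A-Z]/)
      # Word qualifies: extend the current run.
      current_sequence << word
    else
      # Run ended: keep it only if it spans two or more words.
      sequences << current_sequence.join(" ") if current_sequence.length >= 2
      current_sequence = []
    end
  end
  # Assumed behavior: also keep a run that reaches the end of the text.
  sequences << current_sequence.join(" ") if current_sequence.length >= 2
  sequences
end
```

With this sketch, single-letter words such as "A" break a run because of the `length >= 2` check, which is the behavior the new condition appears to aim for.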
data/terminal_shot.png
ADDED
Binary file
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: tag_crawler
 version: !ruby/object:Gem::Version
-version: 0.1.
+version: 0.1.3
 platform: ruby
 authors:
 - David Jiang
@@ -105,6 +105,7 @@ files:
 - lib/tag_crawler/version.rb
 - lib/web_scraper.rb
 - tag_crawler.gemspec
+- terminal_shot.png
 homepage: ''
 licenses:
 - MIT