wriggler 1.0.0 → 1.1.0
- checksums.yaml +4 -4
- data/README.md +18 -4
- data/bin/console +1 -0
- data/dirtest/nested_fldr/test5.xml +1 -0
- data/dirtest/tag_content.csv +5 -0
- data/dirtest/test1.xml +31 -0
- data/dirtest/test2.xml +30 -0
- data/dirtest/test3.xml +30 -0
- data/dirtest/test4.html +7 -0
- data/lib/wriggler/version.rb +1 -1
- data/lib/wriggler.rb +11 -28
- data/wriggler.gemspec +2 -1
- metadata +23 -4
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: a0bec94fbdd0a26ab8850e55099a46cee2a7a7a3
+  data.tar.gz: a3213bec797d9e3b9f6de806289f2f6e3008a36e
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 4dafba93323d6f876b9b80dcc103c908df0d8b87ca0bd40d3c7209d2dfb81d852d039f546976fe2cc2e23168a7ab87fc2c2511a6c179b83dbc1f7b39ad8f6589
+  data.tar.gz: 03f53045bc46845145ddc0642e3d76f3df98ed33d9a0d5ca8c4b76337ccef52d78c2e3ab69058517005ddb5d2a7ca8f9a02129f617be1f7dc759d93799e4f93d
data/README.md
CHANGED
@@ -1,6 +1,6 @@
 # Wriggler
 
-Wriggler was created to serve and the crawler for a search engine, moving its way through HTML and/or XML files and grabbing data based on pre determined tags then
+Wriggler was created to serve and the crawler for a search engine, moving its way through HTML and/or XML files and grabbing data based on pre determined tags then exporting it in a manipulatable format. Wriggler acts similarly to a spider, but was designed to be used with any number of local files, not as an actual web scraper.
 
 ## Installation
 
@@ -23,14 +23,28 @@ Or install it yourself as:
 You only need to run one command to use Wriggler, run:
 
 ```ruby
-Wriggler.crawl([array, of, HTML/XML, tags], directory)
+Wriggler.crawl(["array", "of", "HTML/XML", "tags"], directory)
 ```
 
-Note: The directory in this should be the top level directory that your HTML/XML files are in. Wriggler will account for any nested directories within this directory that also contain HTML/XML files
+Note: The directory in this should be the top level directory that your HTML/XML files are in. Wriggler will account for any nested directories within this directory that also contain HTML/XML files. At the end you will have a data structure that resembles this:
+
+```ruby
+===============
+Files Found: 2
+===============
+content = {
+tag1: ["Content", "Found", "in", "the", "First", "Opened", "File"], ["Content", "Found", "in", "the", "Second", "Opened", "File"]
+tag2: [], []
+tag3: ["Content", "Found", "in", "the", "First", "Opened", "File"], []
+tag4: [], ["Content", "Found", "in", "the", "Second", "Opened", "File"]
+}
+```
+
+Where tag2 has no content found between both files, tag3 only found content in the first of the two files, tag4 only found content in the second of two files, and tag1 found content in both.
 
 ## Contributing
 
-Bug reports and pull requests are welcome on GitHub at https://github.com/
+Bug reports and pull requests are welcome on GitHub at https://github.com/elliottayoung/wriggler. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](contributor-covenant.org) code of conduct.
 
 On top of that, please contribute. I built this for a very specific reason, but I would very much like to see it become something bigger, so if you can assist with that please do!
 
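Note on the README hunk: the sample output it adds is shorthand rather than valid Ruby. Judging from `@content.fetch(key) << arr` in lib/wriggler.rb, each tag appears to map to an array holding one inner array per crawled file. A minimal sketch of consuming a result of that shape follows; the string keys and the helper name `files_with_content` are assumptions for illustration and are not confirmed by the gem.

```ruby
# Hypothetical result shaped like the README example: one inner array per file.
content = {
  "tag1" => [["First", "File"], ["Second", "File"]],
  "tag2" => [[], []],
  "tag3" => [["First", "File"], []],
  "tag4" => [[], ["Second", "File"]]
}

# Returns the zero-based indexes of the files in which a tag had any content.
# (files_with_content is an illustrative name, not part of Wriggler.)
def files_with_content(content, tag)
  per_file = content.fetch(tag)
  per_file.each_index.select { |i| !per_file[i].empty? }
end

content.each_key do |tag|
  puts "#{tag}: found in files #{files_with_content(content, tag).inspect}"
end
```

Keeping an empty inner array for a file with no matches is what lets the position in the outer array identify the file.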
data/bin/console
CHANGED
data/dirtest/nested_fldr/test5.xml
ADDED
@@ -0,0 +1 @@
+<test>If this appears it works</test>
data/dirtest/tag_content.csv
ADDED
@@ -0,0 +1,5 @@
+character,test,name,sitcom
+"[""Al Bundy"", ""Bud Bundy"", ""Marcy Darcy"", ""Larry Appleton"", ""Balki Bartokomous"", ""John 'Hannibal' Smith"", ""Templeton 'Face' Peck"", ""'B.A.' Baracus"", ""'Howling Mad' Murdock""]","[""Al Bundy"", ""Bud Bundy"", ""Marcy Darcy"", ""Larry Appleton"", ""Balki Bartokomous"", ""John 'Hannibal' Smith"", ""Templeton 'Face' Peck"", ""'B.A.' Baracus"", ""'Howling Mad' Murdock""]","[""Al Bundy"", ""Bud Bundy"", ""Marcy Darcy"", ""Larry Appleton"", ""Balki Bartokomous"", ""John 'Hannibal' Smith"", ""Templeton 'Face' Peck"", ""'B.A.' Baracus"", ""'Howling Mad' Murdock""]"
+"[""If this appears it works""]","[""This is different""]"
+"[""Married with Children"", ""Perfect Strangers"", ""The A-Team""]","[""Married with Children"", ""Perfect Strangers"", ""The A-Team""]","[""Married with Children"", ""Perfect Strangers"", ""The A-Team""]"
+"[""This is different\n Married with Children\n \n Al Bundy\n Bud Bundy\n Marcy Darcy\n \n "", ""Perfect Strangers\n \n Larry Appleton\n Balki Bartokomous\n \n ""]","[""Married with Children\n \n Al Bundy\n Bud Bundy\n Marcy Darcy\n \n "", ""Perfect Strangers\n \n Larry Appleton\n Balki Bartokomous\n \n ""]","[""Married with Children\n \n Al Bundy\n Bud Bundy\n Marcy Darcy\n \n "", ""Perfect Strangers\n \n Larry Appleton\n Balki Bartokomous\n \n ""]"
|
data/dirtest/test1.xml
ADDED
@@ -0,0 +1,31 @@
+<root>
+  <sitcoms>
+    <sitcom>
+      <test>This is different</test>
+      <name>Married with Children</name>
+      <characters>
+        <character>Al Bundy</character>
+        <character>Bud Bundy</character>
+        <character>Marcy Darcy</character>
+      </characters>
+    </sitcom>
+    <sitcom>
+      <name>Perfect Strangers</name>
+      <characters>
+        <character>Larry Appleton</character>
+        <character>Balki Bartokomous</character>
+      </characters>
+    </sitcom>
+  </sitcoms>
+  <dramas>
+    <drama>
+      <name>The A-Team</name>
+      <characters>
+        <character>John "Hannibal" Smith</character>
+        <character>Templeton "Face" Peck</character>
+        <character>"B.A." Baracus</character>
+        <character>"Howling Mad" Murdock</character>
+      </characters>
+    </drama>
+  </dramas>
+</root>
data/dirtest/test2.xml
ADDED
@@ -0,0 +1,30 @@
+<root>
+  <sitcoms>
+    <sitcom>
+      <name>Married with Children</name>
+      <characters>
+        <character>Al Bundy</character>
+        <character>Bud Bundy</character>
+        <character>Marcy Darcy</character>
+      </characters>
+    </sitcom>
+    <sitcom>
+      <name>Perfect Strangers</name>
+      <characters>
+        <character>Larry Appleton</character>
+        <character>Balki Bartokomous</character>
+      </characters>
+    </sitcom>
+  </sitcoms>
+  <dramas>
+    <drama>
+      <name>The A-Team</name>
+      <characters>
+        <character>John "Hannibal" Smith</character>
+        <character>Templeton "Face" Peck</character>
+        <character>"B.A." Baracus</character>
+        <character>"Howling Mad" Murdock</character>
+      </characters>
+    </drama>
+  </dramas>
+</root>
data/dirtest/test3.xml
ADDED
@@ -0,0 +1,30 @@
+<root>
+  <sitcoms>
+    <sitcom>
+      <name>Married with Children</name>
+      <characters>
+        <character>Al Bundy</character>
+        <character>Bud Bundy</character>
+        <character>Marcy Darcy</character>
+      </characters>
+    </sitcom>
+    <sitcom>
+      <name>Perfect Strangers</name>
+      <characters>
+        <character>Larry Appleton</character>
+        <character>Balki Bartokomous</character>
+      </characters>
+    </sitcom>
+  </sitcoms>
+  <dramas>
+    <drama>
+      <name>The A-Team</name>
+      <characters>
+        <character>John "Hannibal" Smith</character>
+        <character>Templeton "Face" Peck</character>
+        <character>"B.A." Baracus</character>
+        <character>"Howling Mad" Murdock</character>
+      </characters>
+    </drama>
+  </dramas>
+</root>
data/dirtest/test4.html
ADDED
data/lib/wriggler/version.rb
CHANGED
data/lib/wriggler.rb
CHANGED
@@ -10,7 +10,7 @@ module Wriggler
     @directory = directory #Current top-level directory
 
     navigate_directory
-
+    @content
   end
 
   private
@@ -27,6 +27,9 @@ module Wriggler
     Find.find(@directory) do |file|
       file_array << file if file.match(/\.xml\Z/) || file.match(/\.html\Z/)
     end
+    puts "==============="
+    puts "Files Found: #{file_array.length}"
+    puts "==============="
     file_array
   end
 
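Note on the hunk above: the file discovery relies on `Find.find` from Ruby's standard library, which walks a directory tree recursively, so nested folders such as dirtest/nested_fldr are covered. A minimal standalone sketch of the same idea follows. The helper name `markup_files` and the temporary directory layout are assumptions for illustration; the sketch also uses `\z` where the gem uses `\Z`.

```ruby
require "find"
require "tmpdir"
require "fileutils"

# Hypothetical helper mirroring the hunk above: recursively collect
# the paths under `directory` that end in .xml or .html.
def markup_files(directory)
  Find.find(directory).select { |path| path.match?(/\.(xml|html)\z/) }
end

# Build a throwaway tree that loosely resembles dirtest/ and scan it.
Dir.mktmpdir do |dir|
  FileUtils.mkdir_p(File.join(dir, "nested_fldr"))
  %w[test1.xml test4.html notes.txt nested_fldr/test5.xml].each do |name|
    File.write(File.join(dir, name), "")
  end

  found = markup_files(dir)
  puts "Files Found: #{found.length}"
  puts found.map { |path| File.basename(path) }.sort
end
```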
@@ -71,14 +74,14 @@ module Wriggler
   end
 
   def self.crawl_file(doc)
-
-
+    #Crawl the Nokogiri Object for the file
+    @content.each_key do |key|
       arr = []
-
-
-
-
-
+      if !doc.xpath("//#{key}").empty? #Returns an empty array if tag is not present
+        doc.xpath("//#{key}").map{ |tag| arr << sanitize(tag.text) }
+      end
+      @content.fetch(key) << arr
+    end
   end
 
   def self.sanitize(text)
@@ -86,24 +89,4 @@ module Wriggler
     text.gsub(/"/, "'").lstrip.chomp
   end
 
-  def self.fill_content(arr, key)
-    #Doesn't shovel if there is no content found for the specific tag
-    !arr.empty? ? (@content.fetch(key) << arr) : nil
-  end
-end
-
-require 'CSV'
-
-module Writer
-  def self.write(content)
-    #Write to a CSV file now
-    column_names = content.keys
-    s = CSV.generate do |csv|
-      csv << column_names
-      content.keys.each do |key|
-        csv << content.fetch(key)
-      end
-    end
-    File.write('tag_content.csv', s)
-  end
 end
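Note on the removed `Writer` module: it called `require 'CSV'`, which tends to fail with a `LoadError` on case-sensitive filesystems because the standard library file is named `csv`. For anyone who still wants CSV output from the hash that the first hunk appears to return, a minimal sketch follows. The `to_csv` name and the one-row-per-tag layout are assumptions for illustration and differ from what the removed code wrote.

```ruby
require "csv"

# Hypothetical stand-in for the removed Writer: one row per tag, with the
# tag name first and each file's matches joined into a single cell.
def to_csv(content)
  CSV.generate do |csv|
    content.each do |tag, per_file|
      csv << [tag, *per_file.map { |texts| texts.join("; ") }]
    end
  end
end

# Sample data in the tag => [matches per file] shape.
sample = {
  "name" => [["Married with Children", "Perfect Strangers"], ["The A-Team"]],
  "test" => [["This is different"], []]
}

puts to_csv(sample)
```

Joining each file's matches into one cell avoids serializing a Ruby array literal into the CSV, which is what the committed dirtest/tag_content.csv appears to contain.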
data/wriggler.gemspec
CHANGED
@@ -10,7 +10,7 @@ Gem::Specification.new do |spec|
   spec.email = ["elliott.a.young@gmail.com"]
 
   spec.summary = "A Gem designed to crawl through a local directory of HTML/XML files and pull out content based on pre-specified tag"
-  spec.description = "A Gem designed to crawl through a local directory of HTML/XML files and pull out content based on pre-specified tag, which will
+  spec.description = "A Gem designed to crawl through a local directory of HTML/XML files and pull out content based on pre-specified tag, which will be exported as a manipulatable object"
   spec.homepage = "https://github.com/ElliottAYoung/wriggler"
   spec.license = "MIT"
 
@@ -31,4 +31,5 @@ Gem::Specification.new do |spec|
   spec.add_development_dependency "rake", "~> 10.0"
   spec.add_development_dependency "rspec"
   spec.add_development_dependency "nokogiri"
+  spec.add_development_dependency "awesome_print"
 end
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: wriggler
 version: !ruby/object:Gem::Version
-  version: 1.
+  version: 1.1.0
 platform: ruby
 authors:
 - Elliott Young
@@ -66,9 +66,23 @@ dependencies:
     - - '>='
       - !ruby/object:Gem::Version
         version: '0'
+- !ruby/object:Gem::Dependency
+  name: awesome_print
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
 description: A Gem designed to crawl through a local directory of HTML/XML files and
-  pull out content based on pre-specified tag, which will
-
+  pull out content based on pre-specified tag, which will be exported as a manipulatable
+  object
 email:
 - elliott.a.young@gmail.com
 executables: []
@@ -85,6 +99,12 @@ files:
 - Rakefile
 - bin/console
 - bin/setup
+- dirtest/nested_fldr/test5.xml
+- dirtest/tag_content.csv
+- dirtest/test1.xml
+- dirtest/test2.xml
+- dirtest/test3.xml
+- dirtest/test4.html
 - lib/wriggler.rb
 - lib/wriggler/version.rb
 - wriggler.gemspec
@@ -115,4 +135,3 @@ specification_version: 4
 summary: A Gem designed to crawl through a local directory of HTML/XML files and pull
   out content based on pre-specified tag
 test_files: []
-has_rdoc: