wombat 1.0.0 → 2.0.0
- data/README.md +13 -30
- data/Rakefile +1 -1
- data/VERSION +1 -1
- data/fixtures/vcr_cassettes/follow_links.yml +2143 -0
- data/lib/wombat/crawler.rb +7 -17
- data/lib/wombat/dsl/follower.rb +19 -0
- data/lib/wombat/dsl/iterator.rb +19 -0
- data/lib/wombat/dsl/metadata.rb +27 -0
- data/lib/wombat/dsl/property.rb +27 -0
- data/lib/wombat/dsl/property_group.rb +48 -0
- data/lib/wombat/processing/node_selector.rb +12 -0
- data/lib/wombat/processing/parser.rb +48 -0
- data/lib/wombat/property/locators/base.rb +33 -0
- data/lib/wombat/property/locators/factory.rb +39 -0
- data/lib/wombat/property/locators/follow.rb +25 -0
- data/lib/wombat/property/locators/html.rb +14 -0
- data/lib/wombat/property/locators/iterator.rb +23 -0
- data/lib/wombat/property/locators/list.rb +17 -0
- data/lib/wombat/property/locators/property_group.rb +20 -0
- data/lib/wombat/property/locators/text.rb +22 -0
- data/lib/wombat.rb +8 -4
- data/spec/crawler_spec.rb +38 -48
- data/spec/dsl/property_spec.rb +12 -0
- data/spec/helpers/sample_crawler.rb +2 -15
- data/spec/integration/integration_spec.rb +61 -33
- data/spec/processing/parser_spec.rb +32 -0
- data/spec/property/locators/factory_spec.rb +18 -0
- data/spec/property/locators/follow_spec.rb +4 -0
- data/spec/property/locators/html_spec.rb +15 -0
- data/spec/property/locators/iterator_spec.rb +4 -0
- data/spec/property/locators/list_spec.rb +13 -0
- data/spec/property/locators/text_spec.rb +49 -0
- data/spec/sample_crawler_spec.rb +7 -11
- data/spec/wombat_spec.rb +13 -1
- data/wombat.gemspec +27 -16
- metadata +27 -16
- data/lib/wombat/iterator.rb +0 -38
- data/lib/wombat/metadata.rb +0 -24
- data/lib/wombat/node_selector.rb +0 -10
- data/lib/wombat/parser.rb +0 -59
- data/lib/wombat/property.rb +0 -21
- data/lib/wombat/property_container.rb +0 -70
- data/lib/wombat/property_locator.rb +0 -20
- data/spec/iterator_spec.rb +0 -52
- data/spec/metadata_spec.rb +0 -20
- data/spec/parser_spec.rb +0 -125
- data/spec/property_container_spec.rb +0 -62
- data/spec/property_locator_spec.rb +0 -75
- data/spec/property_spec.rb +0 -16
data/README.md
CHANGED
@@ -1,11 +1,12 @@
 # Wombat
 
-[][travis] [][gemnasium]
+[][travis] [][gemnasium] [][codeclimate]
 
 [travis]: http://travis-ci.org/felipecsl/wombat
 [gemnasium]: https://gemnasium.com/felipecsl/wombat
+[codeclimate]: https://codeclimate.com/github/felipecsl/wombat
 
-
+Web scraper with an elegant DSL that parses structured data from web pages.
 
 ## Usage:
 
@@ -13,20 +14,20 @@ Generic Web crawler with an elegant DSL that parses structured data from web pag
 
 Obs: Requires ruby 1.9
 
-##
+## Scraping a page:
 
 The simplest way to use Wombat is by calling ``Wombat.crawl`` and passing it a block:
 
 ```ruby
 
-# =>
+# => github_scraper.rb
 
 #coding: utf-8
 require 'wombat'
 
 Wombat.crawl do
 base_url "http://www.github.com"
-
+path "/"
 
 headline "xpath=//h1"
 
@@ -36,11 +37,11 @@ Wombat.crawl do
 e.gsub(/Explore/, "LOVE")
 end
 
-benefits do
-
-
-
-
+benefits do
+first_benefit "css=.column.leftmost h3"
+second_benefir "css=.column.leftmid h3"
+third_benefit "css=.column.rightmid h3"
+fourth_benefit "css=.column.rightmost h3"
 end
 end
 ```
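The properties in this README example use selector strings of the form `xpath=...` and `css=...`. As a rough illustration of how such a string could be separated into a selector type and an expression, here is a minimal sketch. The helper `split_selector` is hypothetical and is not taken from wombat's source; this diff does not show how the gem actually parses selectors.

```ruby
# Hypothetical helper, not part of wombat's API: splits a selector string
# such as "xpath=//h1" into its type and its expression.
def split_selector(selector)
  # Limit the split to 2 parts so that "=" inside the expression
  # (for example in an XPath attribute test) is left untouched.
  type, expression = selector.split("=", 2)

  unless %w[xpath css].include?(type) && expression && !expression.empty?
    raise ArgumentError, "malformed selector: #{selector.inspect}"
  end

  [type.to_sym, expression]
end

headline = split_selector("xpath=//h1")
benefit  = split_selector("css=.column.leftmost h3")
wrapper  = split_selector('xpath=//*[@class="wrapper"]')
```

Splitting only on the first `=` matters because XPath expressions often contain `=` themselves. A string with no recognized prefix raises `ArgumentError` in this sketch instead of being silently accepted.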
@@ -62,7 +63,8 @@ end
 ```
 
 ### This is just a sneak peek of what Wombat can do. For the complete documentation, please check the [project Wiki](http://github.com/felipecsl/wombat/wiki).
-### [API Documentation](http://rubydoc.info/gems/wombat/0.
+### [API Documentation](http://rubydoc.info/gems/wombat/1.0.0/frames)
+### [Changelog](https://github.com/felipecsl/wombat/wiki/Changelog)
 
 
 ## Contributing to Wombat
@@ -81,25 +83,6 @@ end
 * Daniel Naves de Carvalho ([@danielnc](https://github.com/danielnc))
 * [@sigi](https://github.com/sigi)
 
-## Changelog
-
-### version 1.0.0
-
-* Breaking change: Metadata#format renamed to Metadata#document_format due to method name clash with [Kernel#format](http://www.ruby-doc.org/core-1.9.3/Kernel.html#method-i-format)
-
-### version 0.5.0
-
-* [Fixed a bug on malformed selectors](https://github.com/felipecsl/wombat/commit/e0f4eec20e1e2bb07a1813a1edd019933edeceaa)
-* [Fixed a bug where multiple calls to #crawl would not clean up previously iterated array results and yield repeated results](https://github.com/felipecsl/wombat/commit/40b09a5bf8b9ba08aa51b6f41f706b7c3c4e4252)
-
-### version 0.4.0
-
-* Added utility method ``Wombat.crawl`` that eliminates the need to have a ruby class instance to use Wombat. Now you can use just ``Wombat.crawl`` and start working. The class based format still works as before though.
-
-### version 0.3.1
-
-* Added the ability to provide a block to Crawler#crawl and override the default crawler properties for a one off run (thanks to @danielnc)
-
 ## Copyright
 
 Copyright (c) 2012 Felipe Lima. See LICENSE.txt for further details.
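The version 1.0.0 changelog entry in this README diff attributes the `Metadata#format` rename to a clash with `Kernel#format`. The diff does not show how wombat itself was affected, so the following is only a minimal sketch of one way such a clash can surface, assuming a DSL that records settings through `method_missing`. The class `SettingsRecorder` is hypothetical and is not wombat code.

```ruby
# Hypothetical recorder DSL, used only to illustrate the name clash.
class SettingsRecorder
  attr_reader :settings

  def initialize
    @settings = {}
  end

  # Any method name that Ruby cannot resolve is recorded as a setting.
  def method_missing(name, *args)
    @settings[name] = args.first
  end

  def respond_to_missing?(_name, _include_private = false)
    true
  end
end

recorder = SettingsRecorder.new
recorder.instance_eval do
  # No method with this name exists, so method_missing records it.
  document_format "html"
  # Kernel#format exists on every object, so this call is resolved there.
  # It returns the string "html" and method_missing never runs.
  format "html"
end

# Only :document_format ends up in recorder.settings.
```

Because `Kernel#format` is defined for every object, a bare `format` call inside the block is never routed to the DSL, and the setting is silently dropped. A name such as `document_format` avoids the collision.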
data/Rakefile
CHANGED
@@ -12,7 +12,7 @@ Jeweler::Tasks.new do |gem|
 gem.name = "wombat"
 gem.homepage = "http://github.com/felipecsl/wombat"
 gem.license = "MIT"
-gem.summary = %Q{Ruby DSL to
+gem.summary = %Q{Ruby DSL to scrape web pages}
 gem.description = %Q{Generic Web crawler with a DSL that parses structured data from web pages}
 gem.email = "felipe.lima@gmail.com"
 gem.authors = ["Felipe Lima"]
data/VERSION
CHANGED
@@ -1 +1 @@
-
+2.0.0