web_dump 0.0.1.0

data/.document ADDED
@@ -0,0 +1,5 @@
1
+ README.rdoc
2
+ lib/**/*.rb
3
+ bin/*
4
+ features/**/*.feature
5
+ LICENSE
data/LICENSE ADDED
@@ -0,0 +1,20 @@
1
+ Copyright (c) 2009 Marcel Massana
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.rdoc ADDED
@@ -0,0 +1,59 @@
1
+ = web_dump
2
+
3
+ A tiny class to easily save and retrieve web pages.
4
+
5
+ In web-related client applications, such as spiders, it is frequently necessary
6
+ to save pages into files with an adequate naming convention. WebDump comes to the
7
+ rescue. It manages the details of assigning unique, readable names and saves files
8
+ named after the URIs that have been visited. Additionally, saved data can be
9
+ compressed with gzip, which is convenient for deep web spidering. Compression only
10
+ requires giving the appropriate file extension ('.gz') when saving.
11
+
12
+ Conversely, files can be read back through convenient methods that take
13
+ either a pathname or a URI.
14
+
15
+ == Installation
16
+
17
+ $ sudo gem install web_dump
18
+
19
+ The main source repository is http://github.com/syborg/web_dump.
20
+
21
+ == Usage
22
+
23
+ First of all ...
24
+
25
+ require 'rubygems'
26
+ require 'web_dump'
27
+
28
+ Instantiate an object. You may pass some options through a
29
+ hash
30
+
31
+ wd = WebDump.new :base_dir => '~/mydir', :file_ext => '.gz'
32
+
33
+ `wd`, when asked to, will save all files inside the expanded directory '~/mydir',
34
+ appending the file extension '.gz' to each filename (unless overridden later).
35
+
36
+ Other options can be passed when instantiating an object.
37
+
38
+ * `:file_ext => extension` (String that will be appended to every filename unless overridden in the _save_ method)
39
+
40
+ Most of them are also passed along to the UriPathname object that is created internally.
41
+
42
+ * `:base_dir => dir_name` (directory where everything will be stored. Defaults to '~/web_dumps')
43
+ * `:pth_sep => psep` (String that will be used to substitute '/' inside URI's path and queries (defaults to UriPathname::PTH_SEP='_|_'))
44
+ * `:host_sep => hsep` (String that will be used to separate the URI's hostname and path when constructing the pathname. If '/' is used, the hostname will actually become a subdirectory -defaults to UriPathname::HOST_SEP='__|'-)
45
+ * `:no_path => nopath` (String that will be used as a path placeholder when no URI's path exists, -default UriPathname::NO_PTH = '_NOPATH_'-)
46
+
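To make the separator options above more concrete, here is a minimal sketch of how a URI might be turned into a filename. It is only an illustration under stated assumptions: the separator values are the defaults quoted in the list, but the mapping rules and the helper name `sketch_pathname` are hypothetical and are not taken from uri_pathname, whose actual behaviour may differ.

```ruby
require 'uri'

# Default separator values as quoted in the README; the logic below is assumed.
HOST_SEP = '__|'
PTH_SEP  = '_|_'
NO_PTH   = '_NOPATH_'

# Hypothetical helper, not part of web_dump or uri_pathname.
# Builds "<base_dir>/<host><HOST_SEP><path with '/' replaced><file_ext>".
def sketch_pathname(uri, base_dir, file_ext)
  parsed = URI.parse(uri)
  path = parsed.path.to_s.sub(%r{\A/}, '')    # drop the leading '/'
  path += "?#{parsed.query}" if parsed.query   # keep the query, if any
  name = path.empty? ? NO_PTH : path.gsub('/', PTH_SEP)
  File.join(base_dir, "#{parsed.host}#{HOST_SEP}#{name}#{file_ext}")
end

puts sketch_pathname('http://www.fake.fak/a/b', '/tmp/dumps', '.gz')
puts sketch_pathname('http://www.fake.fak', '/tmp/dumps', '.html')
```

Under these assumptions, setting the host separator to '/' would turn the hostname into a subdirectory of the base directory, which matches the note on `:host_sep` above.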
47
+ == Note on Patches/Pull Requests
48
+
49
+ * Fork the project.
50
+ * Make your feature addition or bug fix.
51
+ * Add tests for it. This is important so I don't break it in a
52
+ future version unintentionally.
53
+ * Commit, do not mess with rakefile, version, or history.
54
+ (if you want to have your own version, that is fine but bump version in a commit by itself I can ignore when I pull)
55
+ * Send me a pull request. Bonus points for topic branches.
56
+
57
+ == Copyright
58
+
59
+ Copyright (c) 2011 Marcel Massana. See LICENSE for details.
data/Rakefile ADDED
@@ -0,0 +1,58 @@
1
+ require 'rubygems'
2
+ require 'rake'
3
+
4
+ begin
5
+ require 'jeweler'
6
+ require './lib/web_dump/version'
7
+ Jeweler::Tasks.new do |gem|
8
+ gem.name = "web_dump"
9
+ gem.summary = %Q{Saves and Retrieves data in files given an URI}
10
+ gem.description = %Q{Saves and Retrieves data given an URI. The filename
11
+ will be automatically chosen using that URI, freeing the user from having to think
12
+ about it}.gsub(/\s+/,' ')
13
+ gem.email = "xaxaupua@gmail.com"
14
+ gem.homepage = "http://github.com/syborg/web_dump"
15
+ gem.authors = ["Marcel Massana"]
16
+ gem.add_dependency "uri_pathname", ">= 0"
17
+ # gem.add_development_dependency "thoughtbot-shoulda", ">= 0"
18
+ gem.version = WebDump::Version::STRING
19
+ # gem is a Gem::Specification... see http://www.rubygems.org/read/chapter/20 for additional settings
20
+ end
21
+ Jeweler::GemcutterTasks.new
22
+ rescue LoadError
23
+ puts "Jeweler (or a dependency) not available. Install it with: gem install jeweler"
24
+ end
25
+
26
+ require 'rake/testtask'
27
+ Rake::TestTask.new(:test) do |test|
28
+ test.libs << 'lib' << 'test'
29
+ test.pattern = 'test/**/test_*.rb'
30
+ test.verbose = true
31
+ end
32
+
33
+ begin
34
+ require 'rcov/rcovtask'
35
+ Rcov::RcovTask.new do |test|
36
+ test.libs << 'test'
37
+ test.pattern = 'test/**/test_*.rb'
38
+ test.verbose = true
39
+ end
40
+ rescue LoadError
41
+ task :rcov do
42
+ abort "RCov is not available. In order to run rcov, you must: sudo gem install spicycode-rcov"
43
+ end
44
+ end
45
+
46
+ task :test => :check_dependencies
47
+
48
+ task :default => :test
49
+
50
+ require 'rake/rdoctask'
51
+ Rake::RDocTask.new do |rdoc|
52
+ version = File.exist?('VERSION') ? File.read('VERSION') : ""
53
+
54
+ rdoc.rdoc_dir = 'rdoc'
55
+ rdoc.title = "web_dump #{version}"
56
+ rdoc.rdoc_files.include('README*')
57
+ rdoc.rdoc_files.include('lib/**/*.rb')
58
+ end
data/lib/web_dump/version.rb ADDED
@@ -0,0 +1,16 @@
1
+ # it's only a version number
2
+
3
+ class WebDump
4
+
5
+ module Version
6
+
7
+ MAJOR = 0
8
+ MINOR = 0
9
+ PATCH = 1
10
+ BUILD = 0
11
+
12
+ STRING = [MAJOR, MINOR, PATCH, BUILD].compact.join(".")
13
+
14
+ end
15
+
16
+ end
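The `compact.join(".")` idiom in `Version::STRING` drops any segment that is `nil` before joining, so a missing BUILD number would simply be left out. A small sketch of that behaviour, using a hypothetical helper `version_string` that is not part of the gem:

```ruby
# Hypothetical helper mirroring [MAJOR, MINOR, PATCH, BUILD].compact.join(".")
def version_string(*segments)
  segments.compact.join('.')   # nil segments are removed before joining
end

version_string(0, 0, 1, 0)     # => "0.0.1.0"
version_string(0, 0, 1, nil)   # => "0.0.1"
```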
data/lib/web_dump.rb ADDED
@@ -0,0 +1,104 @@
1
+ # WebDump
2
+ # MME 31/8/2011
3
+ #
4
+ # Allows saving and reading data related to URIs (i.e. pages)
5
+
6
+ require 'web_dump/version'
7
+ require 'uri'
8
+ require 'zlib'
9
+ require 'fileutils'
10
+ require 'rubygems'
11
+ require 'uri_pathname'
12
+
13
+ # Allows saving and reading data related to URIs (i.e. pages)
14
+ class WebDump
15
+
16
+ #default attributes
17
+ DEFAULT_ATTRS = {
18
+ :base_dir => '~/web_dumps',
19
+ :file_ext => '.html'
20
+ }
21
+
22
+ attr_accessor :up, *(DEFAULT_ATTRS.keys)
23
+
24
+ # initializes a WebDump object. +options+ should be a hash with options for
25
+ # the UriPathname object that is created internally. It accepts UriPathname's
26
+ # options and, additionally:
27
+ # :base_dir => directory where everything will be stored (def. '~/web_dumps')
28
+ # :file_ext => extension that will be appended to filenames (def. '.html')
29
+ def initialize(options = {})
30
+
31
+ attributes = DEFAULT_ATTRS.merge(options.is_a?(Hash) ? options : {}) # fall back to defaults if options is not a Hash
32
+ attributes.each { |k,v| instance_variable_set("@#{k}", v) if DEFAULT_ATTRS.key?(k) } # no string eval, so quotes in values are safe
33
+
34
+ @up=UriPathname.new attributes # any valid option passed will be delivered
35
+
36
+ end
37
+
38
+ # saves the +content+ (String) into a file named after
39
+ # UriPathname#uri_to_pathname(+uri+).
40
+ # If +extension+ is nil, the :file_ext option given to initialize is used:
41
+ # 'anything'+'.gz' -> gzipped (less storage requirements)
42
+ # other -> as is
43
+ # returns a String containing the complete pathname of the file if OK else nil
44
+ def save(uri, content, extension = nil)
45
+ extension = @file_ext unless extension
46
+ pathname = @up.uri_to_pathname(uri,nil,extension)
47
+ return nil unless pathname
48
+ mkdir_if_not_exists(File.dirname(pathname))
49
+ num_bytes = nil
50
+ case extension
51
+ when /\.gz$/ # ...gz
52
+ File.open(pathname, 'wb') do |f|
53
+ gz = Zlib::GzipWriter.new(f)
54
+ # gz.comment="#dumped with web_dump #{Version::STRING}: #{uri}" # doesn't seem to do anything
55
+ num_bytes = gz.write content
56
+ gz.close
57
+ end
58
+ else # any other
59
+ File.open(pathname, 'w') do |f|
60
+ num_bytes = f.write(content)
61
+ end
62
+ end
63
+ num_bytes ? pathname : nil
64
+ end
65
+
66
+ # returns the stored content corresponding to file +pathname+. In case there
67
+ # isn't any file it returns nil.
68
+ def read_pathname(pathname)
69
+ content = nil
70
+ arr = @up.parse pathname
71
+ complete_pathname = File.expand_path(pathname)
72
+ extension = arr[2]
73
+ case extension
74
+ when /\.gz$/ # same test as in save: a literal '.gz' at the end
75
+ File.open(complete_pathname, 'rb') do |f|
76
+ gz = Zlib::GzipReader.new(f)
77
+ content = gz.read
78
+ gz.close
79
+ end
80
+ else # others as is
81
+ File.open(complete_pathname, 'r') do |f|
82
+ content = f.read
83
+ end
84
+ end
85
+ content
86
+ end
87
+
88
+ # returns the stored content corresponding to +uri+ URI. In case there
89
+ # isn't any file it returns nil.
90
+ def read_uri(uri, filext=nil)
91
+ filext = @file_ext unless filext
92
+ pathname = @up.uri_to_pathname(uri,nil,filext)
93
+ read_pathname(pathname)
94
+ end
95
+
96
+ private
97
+
98
+ # creates +directory+ if it doesn't exist
99
+ def mkdir_if_not_exists(directory)
100
+ dir = File.expand_path(directory)
101
+ FileUtils.mkdir_p(dir) unless (File.exist?(dir) and File.directory?(dir))
102
+ end
103
+
104
+ end
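The save/read cycle above boils down to choosing between a gzip stream and a plain file based on the extension. The following is a minimal, self-contained sketch of that idea using only the standard library. The helper names `dump_content` and `load_content` are hypothetical and are not part of WebDump; unlike the class above, the reader here explicitly returns nil for a missing file.

```ruby
require 'zlib'
require 'tmpdir'

# Hypothetical helper: writes +content+ to +pathname+, gzipped if it ends in '.gz'.
def dump_content(pathname, content)
  if pathname.end_with?('.gz')
    Zlib::GzipWriter.open(pathname) { |gz| gz.write(content) }
  else
    File.open(pathname, 'wb') { |f| f.write(content) }
  end
  pathname
end

# Hypothetical helper: reads the content back, or returns nil if no such file exists.
def load_content(pathname)
  return nil unless File.file?(pathname)
  if pathname.end_with?('.gz')
    Zlib::GzipReader.open(pathname) { |gz| gz.read }
  else
    File.open(pathname, 'rb') { |f| f.read }
  end
end

# Round trip in a throwaway directory
Dir.mktmpdir do |dir|
  path = dump_content(File.join(dir, 'page.html.gz'), 'Hello World!')
  puts load_content(path)
end
```

Using the block forms of `Zlib::GzipWriter.open` and `Zlib::GzipReader.open` closes the stream and the underlying file even if an exception is raised inside the block.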
data/test/test_web_dump.rb ADDED
@@ -0,0 +1,47 @@
1
+ require 'test/unit'
2
+ require 'web_dump'
3
+ require 'fileutils'
4
+
5
+ class TC_WebDump < Test::Unit::TestCase
6
+
7
+ TEST_DIR = '~/tmp/web_dump'
8
+
9
+ # called before every test
10
+ def setup
11
+ FileUtils.remove_dir(File.expand_path(TEST_DIR), true)
12
+ @wd = WebDump.new :base_dir => TEST_DIR
13
+ end
14
+
15
+ # called after every test
16
+ def teardown
17
+ FileUtils.remove_dir(File.expand_path(TEST_DIR), true)
18
+ end
19
+
20
+ def test_automatic_dir_and_file_creation
21
+ wd = WebDump.new :base_dir => TEST_DIR, :host_sep => '/'
22
+ pathname = wd.save 'http://www.fake.fak/fakpath', 'Hello World!'
23
+ assert(File.exist?(pathname))
24
+ end
25
+
26
+ def test_raw_file_sr_cycle
27
+ input = 'Hello World!'
28
+ uri = 'http://www.com/prova'
29
+ pathname = @wd.save(uri,input)
30
+ output = @wd.read_pathname(pathname)
31
+ assert_equal input, output, "retrieved through pathname"
32
+ output = @wd.read_uri(uri)
33
+ assert_equal input, output, "retrieved through uri"
34
+ end
35
+
36
+ def test_gzipped_file_sr_cycle
37
+ input = 'Hello World!'
38
+ uri = 'http://www.com/prova'
39
+ pathname = @wd.save(uri,input,".gz")
40
+ output = @wd.read_pathname(pathname)
41
+ assert_equal input, output, "retrieved through pathname"
42
+ output = @wd.read_uri(uri,"gz")
43
+ assert_equal input, output, "retrieved through uri"
44
+ end
45
+
46
+
47
+ end
data/web_dump.gemspec ADDED
@@ -0,0 +1,50 @@
1
+ # Generated by jeweler
2
+ # DO NOT EDIT THIS FILE DIRECTLY
3
+ # Instead, edit Jeweler::Tasks in Rakefile, and run 'rake gemspec'
4
+ # -*- encoding: utf-8 -*-
5
+
6
+ Gem::Specification.new do |s|
7
+ s.name = "web_dump"
8
+ s.version = "0.0.1.0"
9
+
10
+ s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
11
+ s.authors = ["Marcel Massana"]
12
+ s.date = "2011-08-31"
13
+ s.description = "Saves and Retrieves data given an URI. The filename will be automatically chosen using that URI, freeing the user from having to think about it"
14
+ s.email = "xaxaupua@gmail.com"
15
+ s.extra_rdoc_files = [
16
+ "LICENSE",
17
+ "README.rdoc"
18
+ ]
19
+ s.files = [
20
+ ".document",
21
+ ".goutputstream-6QBL0V",
22
+ ".goutputstream-6X1P0V",
23
+ ".goutputstream-IR2O0V",
24
+ ".goutputstream-TK420V",
25
+ "LICENSE",
26
+ "README.rdoc",
27
+ "Rakefile",
28
+ "lib/web_dump.rb",
29
+ "lib/web_dump/version.rb",
30
+ "test/test_web_dump.rb",
31
+ "web_dump.gemspec"
32
+ ]
33
+ s.homepage = "http://github.com/syborg/web_dump"
34
+ s.require_paths = ["lib"]
35
+ s.rubygems_version = "1.8.10"
36
+ s.summary = "Saves and Retrieves data in files given an URI"
37
+
38
+ if s.respond_to? :specification_version then
39
+ s.specification_version = 3
40
+
41
+ if Gem::Version.new(Gem::VERSION) >= Gem::Version.new('1.2.0') then
42
+ s.add_runtime_dependency(%q<uri_pathname>, [">= 0"])
43
+ else
44
+ s.add_dependency(%q<uri_pathname>, [">= 0"])
45
+ end
46
+ else
47
+ s.add_dependency(%q<uri_pathname>, [">= 0"])
48
+ end
49
+ end
50
+
metadata ADDED
@@ -0,0 +1,87 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: web_dump
3
+ version: !ruby/object:Gem::Version
4
+ hash: 75
5
+ prerelease:
6
+ segments:
7
+ - 0
8
+ - 0
9
+ - 1
10
+ - 0
11
+ version: 0.0.1.0
12
+ platform: ruby
13
+ authors:
14
+ - Marcel Massana
15
+ autorequire:
16
+ bindir: bin
17
+ cert_chain: []
18
+
19
+ date: 2011-08-31 00:00:00 Z
20
+ dependencies:
21
+ - !ruby/object:Gem::Dependency
22
+ name: uri_pathname
23
+ prerelease: false
24
+ requirement: &id001 !ruby/object:Gem::Requirement
25
+ none: false
26
+ requirements:
27
+ - - ">="
28
+ - !ruby/object:Gem::Version
29
+ hash: 3
30
+ segments:
31
+ - 0
32
+ version: "0"
33
+ type: :runtime
34
+ version_requirements: *id001
35
+ description: Saves and Retrieves data given an URI. The filename will be automatically chosen using that URI, freeing the user from having to think about it
36
+ email: xaxaupua@gmail.com
37
+ executables: []
38
+
39
+ extensions: []
40
+
41
+ extra_rdoc_files:
42
+ - LICENSE
43
+ - README.rdoc
44
+ files:
45
+ - .document
46
+ - LICENSE
47
+ - README.rdoc
48
+ - Rakefile
49
+ - lib/web_dump.rb
50
+ - lib/web_dump/version.rb
51
+ - test/test_web_dump.rb
52
+ - web_dump.gemspec
53
+ homepage: http://github.com/syborg/web_dump
54
+ licenses: []
55
+
56
+ post_install_message:
57
+ rdoc_options: []
58
+
59
+ require_paths:
60
+ - lib
61
+ required_ruby_version: !ruby/object:Gem::Requirement
62
+ none: false
63
+ requirements:
64
+ - - ">="
65
+ - !ruby/object:Gem::Version
66
+ hash: 3
67
+ segments:
68
+ - 0
69
+ version: "0"
70
+ required_rubygems_version: !ruby/object:Gem::Requirement
71
+ none: false
72
+ requirements:
73
+ - - ">="
74
+ - !ruby/object:Gem::Version
75
+ hash: 3
76
+ segments:
77
+ - 0
78
+ version: "0"
79
+ requirements: []
80
+
81
+ rubyforge_project:
82
+ rubygems_version: 1.8.10
83
+ signing_key:
84
+ specification_version: 3
85
+ summary: Saves and Retrieves data in files given an URI
86
+ test_files: []
87
+