flatfish 0.3.1

@@ -0,0 +1,2 @@
1
+ flatfish*gem
2
+ *.swp
data/Gemfile ADDED
@@ -0,0 +1,7 @@
1
+ source 'http://rubygems.org'
2
+
3
+ gem 'nokogiri'
4
+ gem 'activerecord'
5
+ gem 'mysql2'
6
+
7
+ gem 'awesome_print'
@@ -0,0 +1,31 @@
1
+ GEM
2
+ remote: http://rubygems.org/
3
+ specs:
4
+ activemodel (3.2.6)
5
+ activesupport (= 3.2.6)
6
+ builder (~> 3.0.0)
7
+ activerecord (3.2.6)
8
+ activemodel (= 3.2.6)
9
+ activesupport (= 3.2.6)
10
+ arel (~> 3.0.2)
11
+ tzinfo (~> 0.3.29)
12
+ activesupport (3.2.6)
13
+ i18n (~> 0.6)
14
+ multi_json (~> 1.0)
15
+ arel (3.0.2)
16
+ awesome_print (1.0.2)
17
+ builder (3.0.0)
18
+ i18n (0.6.0)
19
+ multi_json (1.3.6)
20
+ mysql2 (0.3.11)
21
+ nokogiri (1.5.5)
22
+ tzinfo (0.3.33)
23
+
24
+ PLATFORMS
25
+ ruby
26
+
27
+ DEPENDENCIES
28
+ activerecord
29
+ awesome_print
30
+ mysql2
31
+ nokogiri
@@ -0,0 +1,25 @@
1
+ # HACKING FLATFISH
2
+
3
+ ## Hello Brave Soul
4
+ Thank you for your interest in contributing to Flatfish!
5
+
6
+ ### Background
7
+ Flatfish is built on Rails (Active Support + Active Record) and Nokogiri. The first thing to note is that many Rails helpers (but not all) are available, but Flatfish is not a webapp--the directory structure is different, there is no MVC, etc. So you will find that Flatfish is *Rails-like*, but does some hackery like creating dynamic models. Active Record enables Flatfish to parse a CSV and save a data model representing it on-the-fly.
8
+
9
+ Nokogiri is an XML/HTML parser that supports CSS Selectors--basically, you get jQuery in Ruby. It's awesome. Flatfish primarily uses Nokogiri for 1) specificity, as in grabbing only the HTML we want, and 2) updating the links, images, and files. Links, images, and files have their paths corrected and tokenized; additionally, files and images are saved as blobs in the database.
10
+
11
+ ### Basic Process Flow
12
+ A Flatfish object is instantiated with a YAML config file. The YAML includes DB info, some general options, and specifics on the Types of web pages to be processed. Each Type has a CSV and URL Host. A table for each Type is created if necessary and an Active Record subclass is generated dynamically. Each row of the CSV is parsed, allowing Flatfish to:
13
+ 1. Grab the HTML from a remote host or a local directory
14
+ 2. Attempt to handle both HTTPS Redirects and Basic Authentication
15
+ 3. Correct all link, image, and file paths
16
+ 4. Tokenize all links, images, and files
17
+ 5. Update or create all files and images, saving them to the DB--e.g., multiple copies of the same image are possible if each has a unique URL
18
+ 6. Update or create the HTML, saving it to the DB--again, the URL is the unique identifier
19
+
20
+ ### Odds and Ends
21
+ Although all Active Record tables have an ID column, the URL is the unique identifier and all business logic is tied to it.
22
+
23
+ If you are new to Ruby, we suggest using rbenv to download the latest MRI Ruby (1.9.3 at the moment) and manage your gems.
24
+
25
+ Please write tests for any areas lacking them and certainly for any new code. We are using Test::Unit.
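
The "update or create" behaviour described in the process flow, where the URL rather than the numeric ID identifies a record, can be illustrated without a database. The sketch below is only an illustration: `PageStore` and its methods are hypothetical names, and an in-memory hash stands in for the Active Record tables that Flatfish actually uses.

```ruby
# Hypothetical in-memory stand-in for a Flatfish table; not part of the gem.
class PageStore
  def initialize
    @rows = {}     # url => attributes hash
    @next_id = 1
  end

  # Update the row for this URL if it exists, otherwise create it.
  # The URL is the lookup key; the ID is only assigned on creation.
  def upsert(url, attributes)
    row = @rows[url]
    if row.nil?
      row = { 'id' => @next_id, 'url' => url }
      @next_id += 1
      @rows[url] = row
    end
    row.merge!(attributes)
    row
  end

  def size
    @rows.size
  end
end

store  = PageStore.new
first  = store.upsert('http://example.com/about', 'title' => 'About')
second = store.upsert('http://example.com/about', 'title' => 'About Us')
```

Running the same URL through `upsert` twice leaves a single row whose title is overwritten, which is the behaviour to expect from repeated Flatfish runs against the same CSV.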
@@ -0,0 +1,36 @@
1
+ # Flatfish
2
+ Bottom-feeding fun!
3
+
4
+ ## Description
5
+ Flatfish is a lib to scrape HTML based on a CSV w/ CSS selectors and configurable attributes (eg, page titles).
6
+ The ultimate goal of Flatfish is to prep and load the HTML into Drupal.
7
+
8
+ ## INSTALLATION
9
+ Flatfish is still in development, so it's not on Rubygems just yet. You'll need to build and install the gem manually, which is pretty easy. Assuming you're starting from scratch:
10
+
11
+ 1. We're using Ruby 1.9.3, so install that with RVM, rbenv+ruby-build, or on your own.
12
+ 2. Flatfish has a few dependencies, which are listed in the Gemfile. You can install the bundler gem and then use it to grab the rest of the gems at the versions specified in the Gemfile.lock--this is probably a good idea. The gems can also be installed by hand--there are only a few.
13
+ 3. We've set up a quick Rake task to build and install the Flatfish gem, so if you're using RVM (system-wide flavor) just run 'rake install\_gem'. Otherwise, you can just 'gem build flatfish.gemspec' and 'gem install flatfish-VERSION.gem' according to your setup.
14
+
15
+ ## NOTES
16
+ As Flatfish scrapes the HTML over-the-wire, it can be a bit slow (say 10 minutes for 500 pages). You can speed things up by pointing to a local copy of your site.
17
+
18
+ ## USAGE INSTRUCTIONS
19
+ 1. Create a MySQL database
20
+ 2. Make a directory for your CSV and configuration file
21
+ 3. Create CSVs of URLs w/ CSS selectors (see the example directory), one for each Drupal Content Type
22
+ 4. Configure your yaml w/ project specifics (see Pleuronectiformes.yaml as a sample or the example directory)
23
+ 5. Run 'flatfish' in your project directory (with the CSV and the config file)
24
+ 6. Additional Flatfish runs will update (i.e., overwrite) database content based on URL.
25
+
26
+ ## License
27
+
28
+ (The MIT License)
29
+
30
+ Copyright (c) 2012 Tim Loudon
31
+
32
+ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
33
+
34
+ The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
35
+
36
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,23 @@
1
+ require 'rake'
2
+ require 'rake/testtask'
3
+
4
+ task :default => [:test_units]
5
+
6
+ desc "Run basic tests"
7
+ Rake::TestTask.new("test_units") do |t|
8
+ t.pattern = 'test/*_test.rb'
9
+ t.verbose = false
10
+ t.warning = true
11
+ end
12
+
13
+ RUBY='1.9.3'
14
+
15
+ desc "Build gem"
16
+ task :build_gem do
17
+ system "rvm #{RUBY} do gem build flatfish.gemspec"
18
+ end
19
+
20
+ desc "Install gem"
21
+ task :install_gem => :build_gem do
22
+ system "sudo rvm #{RUBY} do gem install flatfish-*.gem"
23
+ end
data/TODO.md ADDED
@@ -0,0 +1,11 @@
1
+ # TODOS
2
+
3
+ ## Add tests
4
+ We need more tests; current coverage is weak and only covers the absolutification functionality.
5
+
6
+ ## Restructure the code
7
+ The overall design needs improvement, and the duplicated blocks for images and hrefs should be refactored.
8
+
9
+ ## Add fork
10
+
11
+ ## Add logging
@@ -0,0 +1,48 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require 'flatfish'
4
+ require 'getoptlong'
5
+
6
+ def banner error_code
7
+ puts "Flatfish scrapes HTML\n\n"
8
+ puts "Run flatfish in a directory with a config.yml file or pass one in"
9
+ puts "For more information see the README or visit the repo https://github.com/drupalstaffing/flatfish\n\n"
10
+ puts "Usage: flatfish myconfig.yml OR flatfish"
11
+ puts "Options:"
12
+ puts "-h|--help"
13
+ puts "-v|--version"
14
+ exit error_code
15
+ end
16
+
17
+ opts = GetoptLong.new(
18
+ [ "--help", "-h", GetoptLong::NO_ARGUMENT ],
19
+ [ "--version", "-v", GetoptLong::NO_ARGUMENT ]
20
+ )
21
+
22
+ begin
23
+ opts.each do |opt, arg|
24
+ case opt
25
+ when "--help"
26
+ banner 0
27
+
28
+ when "--version"
29
+ puts "VERSION " + Flatfish::VERSION
30
+ exit 0
31
+ end
32
+ end
33
+
34
+ rescue
35
+ banner 1
36
+ end
37
+
38
+ if !File.exist?('config.yml') && ARGV.empty? then
39
+ puts "ERROR: You need a configuration file.\n\n"
40
+ banner 1
41
+ end
42
+
43
+ puts "Running Flatfish..."
44
+ puts Time.now
45
+ pleuronectiforme = Flatfish.new ARGV[0]
46
+ pleuronectiforme.ingurgitate
47
+ puts "All done"
48
+ puts Time.now
@@ -0,0 +1,13 @@
1
+ url,path,title,body
2
+ http://drupalconnect.com/blog/e-commerce-problems-solutions-and-considerations,blog/ecommerce-overview,Ecommerce Overview,.region-content-inner
3
+ http://drupalconnect.com/blog/mobile-design-trends-5-top-tips-kathy-chavez-senior-themer,blog/mobile-tips,Mobile Design Tips,.region-content-inner
4
+ http://drupalconnect.com/blog/drupal-migration-migrate-module-best-practices,blog/migration-best-practices,Drupal Migration Best-Practices,.region-content-inner
5
+ http://drupalconnect.com/blog/interview-roger-soper-senior-developer-drupal-support,blog/interview-soper,Roger Soper Interview,.region-content-inner
6
+ http://drupalconnect.com/blog/interview-john-florez-ceo-drupal-connect,blog/interview-florez,John Florez Interview,.region-content-inner
7
+ http://drupalconnect.com/blog/interview-chris-boag-vp-professional-services,blog/interview-boag,Chris Boag Interview,.region-content-inner
8
+ http://drupalconnect.com/blog/responsive-web-design,blog/responsive-design,Responsive Design Overview,.region-content-inner
9
+ http://drupalconnect.com/blog/how-learn-drupal,blog/learning-drupal-tips,Learning Drupal Tips,.region-content-inner
10
+ http://drupalconnect.com/blog/replacing-drupal-static-navigation-apache-solr-facets,blog/solr-navigation,Using Apache for Navigation,.region-content-inner
11
+ http://drupalconnect.com/blog/creating-drupal-context-plugins,blog/context-plugins,Creating Context Plugins,.region-content-inner
12
+ http://drupalconnect.com/blog/stress-testing-your-servers-maxclients-directive,blog/stress-testing-apache,Stress Testing Apache,.region-content-inner
13
+ http://drupalconnect.com/blog/drupal-check-ubercart-check,blog/ubercart-tips,Ubercart Tips,.region-content-inner
@@ -0,0 +1,17 @@
1
+ local_source: '' #use the web
2
+
3
+ db_user: 'root'
4
+ db_pass: 'root'
5
+ db: 'flatfish_sample'
6
+
7
+ # NOTE: these map to Drupal content types and AR database tables
8
+ types:
9
+ Article:
10
+ csv: '/home/tloudon/workspace/flatfish/example/article.csv'
11
+ host: 'http://drupalconnect.com'
12
+ Page:
13
+ csv: '/home/tloudon/workspace/flatfish/example/page.csv'
14
+ host: 'http://drupalconnect.com'
15
+
16
+ development:
17
+ max_rows: 1000
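
A config like the one above could be sanity-checked before a run. The following is a minimal sketch, assuming the key names shown in the example config; `config_problems` is a hypothetical helper and not part of the gem. Note that Flatfish itself skips any type without a `csv` entry rather than reporting it.

```ruby
require 'yaml'

# Hypothetical helper, not part of the gem: returns a list of problems found
# in a parsed Flatfish config hash (an empty list means nothing obvious is wrong).
def config_problems(config)
  problems = []
  %w[db_user db_pass db].each do |key|
    problems << "missing #{key}" if config[key].nil?
  end
  types = config['types'] || {}
  problems << 'no types defined' if types.empty?
  types.each do |name, settings|
    settings ||= {}
    %w[csv host].each do |key|
      problems << "#{name}: missing #{key}" if settings[key].nil?
    end
  end
  problems
end

# Sample config with the Page type deliberately missing its csv entry.
sample = YAML.load(<<-YML)
db_user: 'root'
db_pass: 'root'
db: 'flatfish_sample'
types:
  Article:
    csv: 'example/article.csv'
    host: 'http://drupalconnect.com'
  Page:
    host: 'http://drupalconnect.com'
YML

problems = config_problems(sample)
```

Here `problems` contains a single entry for the `Page` type, since its `csv` key is absent.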
@@ -0,0 +1,8 @@
1
+ url,path,title,body,field_second_value
2
+ http://drupalconnect.com,front,Drupal Connect,#block-block-13|#block-views-portfolio-block-1,
3
+ http://drupalconnect.com/services,services,Drupal Services,.views-row-first.highlight-box,.views-row-last.highlight-box
4
+ http://drupalconnect.com/services/drupal-data-migration,services/migration,Drupal Migration,.region-content-inner,
5
+ http://drupalconnect.com/services/drupal-development,services/drupal-development,Drupal Development,.region-content-inner,
6
+ http://drupalconnect.com/projects/smosh,portfolio/smosh,Drupal Connect Portfolio: Smosh,.region-content-inner,
7
+ http://drupalconnect.com/projects/greenopoliscom,portfolio/thinkgreenrewards,Drupal Connect Portfolio: Think Green Rewards,.region-content-inner,
8
+ http://drupalconnect.com/projects/stanford-university,portfolio/stanford-school-engineering,Drupal Connect Portfolio: Stanford,.region-content-inner,
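
The columns after `title` hold CSS selectors, with `|` separating several selectors for one field and an empty cell meaning there is nothing to scrape for that field. A minimal sketch of how one row maps onto the header, using only the standard `csv` library (the variable names are illustrative, not taken from the gem):

```ruby
require 'csv'

header = CSV.parse_line('url,path,title,body,field_second_value')
row    = CSV.parse_line('http://drupalconnect.com,front,Drupal Connect,#block-block-13|#block-views-portfolio-block-1,')

schema = header[3..-1] # dynamic column names after url, path, title
selectors_by_field = {}
schema.each_with_index do |field, i|
  cell = row[3 + i]
  # An empty trailing cell is parsed as nil, meaning "nothing to scrape".
  selectors_by_field[field] = cell.nil? ? [] : cell.strip.split('|')
end
```

For this row, `body` ends up with two selectors and `field_second_value` with none.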
@@ -0,0 +1,19 @@
1
+ # -*- encoding: utf-8 -*-
2
+ $:.push File.expand_path("../lib", __FILE__)
3
+ require "flatfish"
4
+
5
+ Gem::Specification.new do |s|
6
+ s.name = 'flatfish'
7
+ s.version = Flatfish::VERSION
8
+ s.date = '2011-12-28'
9
+ s.summary = "Scrape web pages!"
10
+ s.description = "flatfish accepts a CSV of URLS with CSS selectors prepping them for insert into drupal"
11
+ s.authors = ["Tim Loudon", "Mike Crittenden"]
12
+ s.email = 'timl@drupalconnect.com'
13
+ s.homepage = 'https://github.com/drupalstaffing/flatfish'
14
+
15
+ s.files = `git ls-files`.split("\n")
16
+ s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
17
+ s.executables = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
18
+ s.require_paths = ["lib"]
19
+ end
@@ -0,0 +1,37 @@
1
+ require 'rubygems'
2
+ require 'yaml'
3
+ require 'csv'
4
+ require 'open-uri'
5
+ require 'uri'
6
+ require 'nokogiri'
7
+ require 'fileutils'
8
+ require 'active_record'
9
+
10
+ require 'awesome_print'
11
+
12
+ require 'flatfish/pleuronectiformes'
13
+ require 'flatfish/page'
14
+ require 'flatfish/media'
15
+ require 'flatfish/create_tables'
16
+ require 'flatfish/url'
17
+
18
+ #API
19
+ module Flatfish
20
+ class << self
21
+ #allow alias Flatfish.new etc
22
+ def new(yml = nil)
23
+ yml ||= './config.yml'
24
+ if !File.exist?(yml) then
25
+ puts "ERROR: No Config file"
26
+ exit 1
27
+ end
28
+ Flatfish::Pleuronectiformes.new(yml)
29
+ end
30
+
31
+ def ingurgitate
32
+ new.ingurgitate
33
+ end
34
+ end
35
+
36
+ VERSION = "0.3.1"
37
+ end
@@ -0,0 +1,29 @@
1
+ module Flatfish
2
+
3
+ class CreateKlass < ActiveRecord::Migration
4
+ # assume every klass has a URL, Path, Title
5
+ # pass in additional columns from CSV
6
+ def self.setup(schema, klass)
7
+ k = klass.tableize.to_sym
8
+ create_table(k) do |t|
9
+ t.string :url
10
+ t.string :path
11
+ t.string :title
12
+ end
13
+ schema.each do |column|
14
+ add_column(k, column.gsub(/\s+/, '_').downcase.to_sym, :text)
15
+ end
16
+ end
17
+ end
18
+
19
+ class CreateMedia < ActiveRecord::Migration
20
+ #create media table
21
+ def self.setup
22
+ create_table :media do |t|
23
+ t.string :url
24
+ t.binary :contents, :limit => 4294967295
25
+ end
26
+ end
27
+ end
28
+
29
+ end
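
The extra CSV headers become column names through the `gsub`/`downcase`/`to_sym` chain in `CreateKlass.setup`. A small standalone sketch of that expression follows; the method name `column_name` is hypothetical and used only for illustration.

```ruby
# Mirrors the expression used in CreateKlass.setup; the method name is hypothetical.
def column_name(header)
  header.gsub(/\s+/, '_').downcase.to_sym
end

plain  = column_name('field_second_value') # already a valid column name
spaced = column_name('Body Summary')       # whitespace becomes an underscore
padded = column_name(' Teaser')            # leading whitespace is not stripped
```

Because the header is not stripped first, stray leading or trailing whitespace in a CSV header turns into an underscore in the column name, so headers are best kept clean.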
@@ -0,0 +1,5 @@
1
+ module Flatfish
2
+ class Media < ActiveRecord::Base
3
+ # url and contents are table columns; Active Record supplies their accessors (attr_reader would shadow them)
4
+ end
5
+ end
@@ -0,0 +1,148 @@
1
+ # -*- coding: utf-8 -*- #specify UTF-8 (unicode) characters
2
+ require_relative 'url'
3
+
4
+ module Flatfish
5
+
6
+ class Page < ActiveRecord::Base
7
+ self.abstract_class = true
8
+ @columns = []
9
+ extend Flatfish::Url
10
+
11
+ attr_reader :url, :data
12
+ attr_accessor :cd
13
+
14
+ # Setup - unpack the vars for the web page to be scraped
15
+ #
16
+ # csv - an array w/ all of the page specific
17
+ # config - has some key deets, where to save images, etc that the page has to know
18
+ # schema - dynamic column headers
19
+ def setup(csv, config, schema, host)
20
+ #parse the csv
21
+ @url, @path, @title = csv[0], csv[1], csv[2]
22
+ @fields = []
23
+ csv[3..-1].each do |field|
24
+ unless field.nil?
25
+ @fields << (field.strip! || field)
26
+ else
27
+ @fields << -1 #flag
28
+ end
29
+ end
30
+
31
+ #current directory, we want http://example.com/about/ or http://example.com/home/
32
+ @cd = (@url[-1,1] == '/')? @url: @url.slice(0..@url.rindex('/'))
33
+ @schema = schema
34
+ @host = host
35
+ @local_source = config['local_source']
36
+
37
+ # handle url == host, fix mangled @cd
38
+ if @url == @host
39
+ @cd = @url + '/'
40
+ end
41
+ Flatfish::Url.creds = {:http_basic_authentication => [config['basic_auth_user'], config['basic_auth_pass']]}
42
+ end
43
+
44
+ def process
45
+ load_html
46
+ self.attributes = prep
47
+ end
48
+
49
+ # load html from local or web
50
+ def load_html
51
+ file = @local_source.to_s + @url.sub(@host, '') # to_s guards against a missing local_source
52
+ if (@url != @host) && !@local_source.nil? && File.exist?(file)
53
+ f = File.open(file)
54
+ @doc = Nokogiri::XML(f)
55
+ f.close
56
+ else
57
+ html = Flatfish::Url.open_url(@url)
58
+ @doc = Nokogiri::HTML(html)
59
+ end
60
+ end
61
+
62
+ def prep
63
+ #default to csv, fallback to title element
64
+ @title = @title.nil? ? @doc.title: @title
65
+
66
+ #build a hash of field => data
67
+ html = Hash.new
68
+ @fields.each_with_index do |selectors, i|
69
+ next if -1 == selectors
70
+ html[@schema[i]] = ''
71
+ selectors.split('|').each do |selector|
72
+ update_hrefs(selector)
73
+ update_imgs(selector)
74
+ if @doc.css(selector).nil? then
75
+ field = ''
76
+ else
77
+ # sub tokens and gnarly MS Quotes
78
+ field = @doc.css(selector).to_s.gsub("%5BFLATFISH", '[').gsub("FLATFISH%5D", ']').gsub(/[”“]/, '"').gsub(/[‘’]/, "'")
79
+ end
80
+ html[@schema[i]] += field
81
+ end
82
+ end
83
+ @data = {
84
+ 'url' => @url,
85
+ 'title' => @title,
86
+ 'path' => @path
87
+ }
88
+ @data.merge!(html)
89
+ end
90
+
91
+ # processes link tags
92
+ # absolutifies and passes media links on for tokenization
93
+ def update_hrefs(css_selector)
94
+ @doc.css(css_selector + ' a').each do |a|
95
+
96
+ #TODO finalize list of supported file types
97
+ href = Flatfish::Url.absolutify(a['href'], @cd)
98
+ valid_exts = ['.doc', '.docx', '.pdf', '.pptx', '.ppt', '.xls', '.xlsx']
99
+ if href =~ /#{@host}/ && valid_exts.include?(File.extname(href))
100
+ media = get_media(href)
101
+ href = "[FLATFISHmedia:#{media.id}FLATFISH]"
102
+ end
103
+ a['href'] = href
104
+ end
105
+ end
106
+
107
+ # processes image tags
108
+ # absolutifies images and passes internal ones on for tokenization
109
+ def update_imgs(css_selector)
110
+ @doc.css(css_selector + ' img').each do |img|
111
+ next if img['src'].nil?
112
+
113
+ # absolutify and tokenize our images
114
+ src = Flatfish::Url.absolutify(img['src'], @cd)
115
+ if src =~ /#{@host}/
116
+ # check to see if it already exists
117
+ media = get_media(src)
118
+ img['src'] = "[FLATFISHmedia:#{media.id}FLATFISH]"
119
+ end
120
+ end
121
+ end
122
+
123
+ #TODO replace w/ find_or_create
124
+ def get_media(url)
125
+ media = Flatfish::Media.find_by_url(url)
126
+ if media.nil?
127
+ media = Flatfish::Media.create(:url => url) do |m|
128
+ m.contents = read_in_blob(url)
129
+ end
130
+ end
131
+ media
132
+ end
133
+
134
+ # read in blob
135
+ def read_in_blob(url)
136
+ # assume local file
137
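+ file = url.sub(@host, @local_source.to_s) # to_s guards against a missing local_source
138
+
139
+ unless @local_source.nil? || !File.exist?(file)
140
+ blob = File.binread(file) # file is a path String, so read it via File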
141
+ else
142
+ blob = Flatfish::Url.open_url(URI.escape(url))
143
+ end
144
+ blob
145
+ end
146
+
147
+ end
148
+ end
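
The substitutions at the end of `Page#prep` can be tried in isolation. The sketch below is a hypothetical standalone version (`clean_markup` is not a method in the gem); it assumes the token brackets arrive percent-encoded, as the `%5B` and `%5D` patterns in the source suggest.

```ruby
# Hypothetical standalone version of the substitutions applied in Page#prep.
def clean_markup(html)
  html.gsub('%5BFLATFISH', '[')   # restore the opening token bracket
      .gsub('FLATFISH%5D', ']')   # restore the closing token bracket
      .gsub(/[”“]/, '"')          # curly double quotes to straight
      .gsub(/[‘’]/, "'")          # curly single quotes to straight
end

raw     = '<a href="%5BFLATFISHmedia:42FLATFISH%5D">“Annual report”</a>'
cleaned = clean_markup(raw)
```

After cleaning, the link target reads `[media:42]`, a token that a later import step can swap for the real file reference.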
@@ -0,0 +1,123 @@
1
+ require_relative 'page'
2
+
3
+ module Flatfish
4
+
5
+ class Pleuronectiformes
6
+ attr_reader :config, :schema, :klasses
7
+
8
+ # load in the config
9
+ def initialize(yml)
10
+ @config = YAML.load_file(yml)
11
+ db_conn() # establish AR conn
12
+ @klasses = Hash.new
13
+ end
14
+
15
+ # main loop for flatfish
16
+ def ingurgitate
17
+ create_media unless Flatfish::Media.table_exists?
18
+
19
+ @config["types"].each do |k,v|
20
+ next if v["csv"].nil?
21
+ @csv_file = v["csv"]
22
+ @host = v["host"]
23
+ create_klass(k)
24
+ parse(k)
25
+ end
26
+ output_schema
27
+ end
28
+
29
+ # Create the Klass
30
+ # create table if necessary: table must exist!
31
+ # create dynamic model
32
+ def create_klass(k)
33
+ # commence hackery
34
+ create_table(k) unless ActiveRecord::Base.connection.tables.include?(k.tableize)
35
+ @klass = Class.new(Page)
36
+ @klasses[k] = @klass
37
+ @klass.table_name = k.tableize
38
+ end
39
+
40
+ def create_table(klass)
41
+ load_csv
42
+ Flatfish::CreateKlass.setup(@schema, klass)
43
+ end
44
+
45
+ def create_media
46
+ Flatfish::CreateMedia.setup
47
+ end
48
+
49
+ #load csv, set schema
50
+ def load_csv
51
+ csv = CSV.read(@csv_file)
52
+ @schema = csv.shift[3..-1]
53
+ return csv
54
+ end
55
+
56
+ # loop thru csv
57
+ def parse(k)
58
+ csv = load_csv
59
+ @cnt = 0
60
+ csv.each do |row|
61
+ begin
62
+ break if @cnt == @config['max_rows']
63
+ @cnt += 1
64
+ page = @klass.find_or_create_by_url(row[0])
65
+ puts "Processing #{k}.#{page.id} with URL #{row[0]}"
66
+ page.setup(row, @config, @schema, @host)
67
+ page.process
68
+ page.save!
69
+ rescue Exception => e
70
+ if e.message =~ /(redirection forbidden|404 Not Found)/
71
+ ap "URL: #{row[0]} #{e}"
72
+ else
73
+ ap "URL: #{row[0]} ERROR: #{e} BACKTRACE: #{e.backtrace}"
74
+ end
75
+ end
76
+ end
77
+ end
78
+
79
+
80
+ # generate a dynamic schema.yml for Migrate mapping
81
+ def output_schema
82
+ # TODO REFACTOR THIS ISH
83
+ klasses = @klasses
84
+ File.open('schema.yml', 'w') do |out|
85
+ output = Hash.new
86
+ output["schema"] = Hash.new
87
+ klasses["Media"] = Flatfish::Media
88
+
89
+ klasses.each_pair do |k,v|
90
+ klass_attributes = Hash.new
91
+ v.new.attributes.each { |a| klass_attributes[a[0]] = split_type(v.columns_hash[a[0]].sql_type) }
92
+ output["schema"].merge!({k => {"machine_name" => k.tableize, "fields" => klass_attributes, "primary key" => ["id"]}})
93
+ end
94
+ out.write output.to_yaml
95
+ end
96
+ end
97
+
98
+ # helper function to convert AR sql_type to
99
+ # Drupal format;
100
+ # eg :type => varchar(255) to :type => varchar, :length => 255
101
+ def split_type type
102
+ if type =~ /\(/ then
103
+ x = type.split("(")
104
+ return {"type" => x[0], "length" => x[1].sub(")","").to_i }
105
+ else
106
+ return {"type" => type}
107
+ end
108
+ end
109
+
110
+
111
+ def db_conn
112
+ ActiveRecord::Base.establish_connection(
113
+ :adapter=> "mysql2",
114
+ :host => "localhost",
115
+ :username => @config['db_user'],
116
+ :password => @config['db_pass'],
117
+ :database=> @config['db']
118
+ )
119
+ end
120
+
121
+ end
122
+
123
+ end
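
The `split_type` helper is easy to exercise on its own. This is a standalone copy of its logic for illustration only, lightly restyled; the sample type strings are typical MySQL column types and are assumptions rather than values taken from a real run.

```ruby
# Standalone copy of the helper's logic, for illustration only.
def split_type(type)
  if type =~ /\(/
    name, length = type.split('(')
    { 'type' => name, 'length' => length.sub(')', '').to_i }
  else
    { 'type' => type }
  end
end

with_length    = split_type('varchar(255)') # type 'varchar', length 255
without_length = split_type('longblob')     # type only, no length key
```

Types without parentheses pass through unchanged, so the generated `schema.yml` only carries a `length` where the database reported one.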
@@ -0,0 +1,57 @@
1
+ module Flatfish
2
+ module Url
3
+ #methods for handling URLs
4
+ class << self
5
+ attr_accessor :creds
6
+
7
+
8
+ # Handle SSL Redirects + HTTP Auth
9
+ # to catch linked files @ runtime
10
+ def open_url url
11
+ begin
12
+ html = open(url).read
13
+ rescue Exception => e
14
+ redirect = URI.parse(url)
15
+ if e.message =~ /redirection forbidden/ && redirect.scheme == 'http'
16
+ html = open_url("https://" + redirect.host + redirect.path)
17
+ end
18
+ if e.message =~ /(Authorization Required|Unauthorized)/
19
+ html = open(url, @creds).read
20
+ end
21
+ end
22
+ return html
23
+ end
24
+
25
+ # take a URL, return an absolute URL
26
+ def absolutify url, cd
27
+ url = url.to_s
28
+ # deal w/ bad URLs, already absolute, etc
29
+ begin
30
+ u = URI.parse(url)
31
+ rescue
32
+ # GIGO, no need for alarm
33
+ return url
34
+ end
35
+
36
+ return url if u.absolute? # http://example.com/about
37
+ c = URI.parse(cd)
38
+ return c.scheme + "://" + c.host + url if url.index('/') == 0 # /about
39
+ return cd + url if url.match(/^[a-zA-Z]+/) # about*
40
+
41
+ # only relative from here on in; ../about, ./about, ../../about
42
+ u_dirs = u.path.split('/')
43
+ c_dirs = c.path.split('/')
44
+
45
+ # move up the directory until there are no more relative paths
46
+ u.path.split('/').each do |x|
47
+ break unless (x == '' || x == '..' || x == '.')
48
+ u_dirs.shift
49
+ c_dirs.pop unless x == '.'
50
+ end
51
+ return c.scheme + "://" + c.host + c_dirs.join('/') + '/' + u_dirs.join('/')
52
+ end
53
+
54
+ end
55
+ end
56
+
57
+ end
@@ -0,0 +1,37 @@
1
+ $:.push File.expand_path("../../lib", __FILE__)
2
+ require 'flatfish/url'
3
+ require 'uri'
4
+ require 'test/unit'
5
+
6
+ class URL_Tests < Test::Unit::TestCase
7
+
8
+ def setup
9
+ @cd = "http://example.com/test_dir1/test_dir2/"
10
+ end
11
+
12
+ def test_relative
13
+ tst = Flatfish::Url.absolutify("../images/test.png", @cd)
14
+ assert_equal("http://example.com/test_dir1/images/test.png", tst)
15
+ end
16
+
17
+ def test_double_relative
18
+ tst = Flatfish::Url.absolutify("../../images/test.png", @cd)
19
+ assert_equal("http://example.com/images/test.png", tst)
20
+ end
21
+
22
+ def test_absolute
23
+ tst = Flatfish::Url.absolutify("http://example.com/images/test.png", @cd)
24
+ assert_equal("http://example.com/images/test.png", tst)
25
+ end
26
+
27
+ def test_broken_root
28
+ tst = Flatfish::Url.absolutify("../../../../images/test.png", @cd)
29
+ assert_equal("http://example.com/images/test.png", tst)
30
+ end
31
+
32
+ def test_same_dir
33
+ tst = Flatfish::Url.absolutify("test.png", @cd)
34
+ assert_equal("http://example.com/test_dir1/test_dir2/test.png", tst)
35
+ end
36
+
37
+ end
metadata ADDED
@@ -0,0 +1,66 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: flatfish
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.3.1
5
+ prerelease:
6
+ platform: ruby
7
+ authors:
8
+ - Tim Loudon
9
+ - Mike Crittenden
10
+ autorequire:
11
+ bindir: bin
12
+ cert_chain: []
13
+ date: 2011-12-28 00:00:00.000000000 Z
14
+ dependencies: []
15
+ description: flatfish accepts a CSV of URLS with CSS selectors prepping them for insert
16
+ into drupal
17
+ email: timl@drupalconnect.com
18
+ executables:
19
+ - flatfish
20
+ extensions: []
21
+ extra_rdoc_files: []
22
+ files:
23
+ - .gitignore
24
+ - Gemfile
25
+ - Gemfile.lock
26
+ - HACKING.md
27
+ - README.md
28
+ - Rakefile
29
+ - TODO.md
30
+ - bin/flatfish
31
+ - example/article.csv
32
+ - example/config.yml
33
+ - example/page.csv
34
+ - flatfish.gemspec
35
+ - lib/flatfish.rb
36
+ - lib/flatfish/create_tables.rb
37
+ - lib/flatfish/media.rb
38
+ - lib/flatfish/page.rb
39
+ - lib/flatfish/pleuronectiformes.rb
40
+ - lib/flatfish/url.rb
41
+ - test/url_test.rb
42
+ homepage: https://github.com/drupalstaffing/flatfish
43
+ licenses: []
44
+ post_install_message:
45
+ rdoc_options: []
46
+ require_paths:
47
+ - lib
48
+ required_ruby_version: !ruby/object:Gem::Requirement
49
+ none: false
50
+ requirements:
51
+ - - ! '>='
52
+ - !ruby/object:Gem::Version
53
+ version: '0'
54
+ required_rubygems_version: !ruby/object:Gem::Requirement
55
+ none: false
56
+ requirements:
57
+ - - ! '>='
58
+ - !ruby/object:Gem::Version
59
+ version: '0'
60
+ requirements: []
61
+ rubyforge_project:
62
+ rubygems_version: 1.8.24
63
+ signing_key:
64
+ specification_version: 3
65
+ summary: Scrape web pages!
66
+ test_files: []