semantic-crawler 0.0.2 → 0.0.3

data/README.rdoc CHANGED
@@ -1,5 +1,4 @@
1
- [ATTENTION] This library is under heavy development and currently not working properly. Please be patient
2
- for a usable version.
1
+ [ATTENTION] This library is under heavy development. Please be patient for a usable version.
3
2
 
4
3
  = SemanticCrawler
5
4
 
@@ -12,27 +11,35 @@ to bypass complex NLP (natural language processing).
12
11
  == Supported Sources
13
12
 
14
13
  * CIA Factbook RDF Dump, see http://www4.wiwiss.fu-berlin.de/factbook/directory/countries
14
+ * [Started] LinkedGeoData - LGD (see http://linkedgeodata.org)
15
+ * [Started] GDACS (see http://gdacs.org)
16
+ * [Started] Freebase (see http://freebase.com)
15
17
 
16
18
  === TODO
17
19
 
18
- * LinkedGeoData - LGD(see http://linkedgeodata.org)
19
- * Freebase
20
20
  * Geonames
21
21
  * DBPedia
22
22
  * Different Government Sources
23
23
 
24
+
24
25
  == Installation
25
26
 
27
+ $ gem install semantic-crawler
28
+
29
+ Or from source:
30
+
26
31
  $ git clone git://github.com/obale/semantic_crawler.git
27
32
  $ cd semantic_crawler
28
33
  $ bundle install
29
34
  $ rake build
30
35
  $ rake install pkg/semantic-crawler-*.gem
31
36
 
32
- If the library is stable enough it will be published at rubygems.org. That simplifies the installation.
33
-
34
37
  You can also add this library as a dependency in your Gemfile:
35
38
 
39
+ gem "semantic-crawler"
40
+
41
+ Or from source:
42
+
36
43
  gem "semantic-crawler", :git => "git://github.com/obale/semantic_crawler.git" # for the master branch or
37
44
  gem "semantic-crawler", :git => "git://github.com/obale/semantic_crawler.git", :tag => "xyz" # for the xyz tag
38
45
 
@@ -56,6 +63,10 @@ see CHANGELOG.rdoc
56
63
  * Ruby 1.8.7-p358 and Rails 3.2.2
57
64
  * Ruby 1.9.3-p125 and Rails 3.2.2
58
65
 
66
+ === Development Branch Health
67
+
68
+ {<img src="https://secure.travis-ci.org/obale/semantic_crawler.png?branch=master"/>}[http://travis-ci.org/#!/obale/semantic_crawler]
69
+
59
70
 
60
71
  == License
61
72
 
data/Rakefile CHANGED
@@ -16,13 +16,12 @@ RDoc::Task.new(:rdoc) do |rdoc|
16
16
  rdoc.rdoc_dir = 'rdoc'
17
17
  rdoc.title = 'SemanticCrawler'
18
18
  rdoc.options << '--line-numbers'
19
+ rdoc.options << '--main=README.rdoc'
19
20
  rdoc.rdoc_files.include('README.rdoc')
21
+ rdoc.rdoc_files.include('CHANGELOG.rdoc')
20
22
  rdoc.rdoc_files.include('lib/**/*.rb')
21
23
  end
22
24
 
23
-
24
-
25
-
26
25
  Bundler::GemHelper.install_tasks
27
26
 
28
27
  require 'rake/testtask'
@@ -34,5 +33,4 @@ Rake::TestTask.new(:test) do |t|
34
33
  t.verbose = false
35
34
  end
36
35
 
37
-
38
36
  task :default => :test
@@ -1,13 +1,36 @@
1
+ # encoding: UTF-8
2
+
1
3
  require "httparty"
2
4
 
5
+ # The top level module contains the different data sources
6
+ # as sub-modules. Currently there are the following modules
7
+ # available:
8
+ #
9
+ # * Freebase
10
+ # * Factbook
11
+ # * LinkedGeoData
12
+ # * Gdacs
13
+ #
14
+ # The existing modules will be extended step by step, and
15
+ # additional sources will be added in the future.
3
16
  module SemanticCrawler
4
17
  end
5
18
 
6
- # Freebase.com
19
+ # Freebase.com - module: Freebase
7
20
  require "semantic_crawler/freebase/crawler"
8
21
  require "semantic_crawler/freebase/entity"
9
22
  require "semantic_crawler/freebase/country"
10
23
 
11
- # CIA Factbook RDF Dump
24
+ # CIA Factbook RDF Dump - module: Factbook
25
+ require "semantic_crawler/factbook"
12
26
  require "semantic_crawler/factbook/country"
13
27
 
28
+ # LinkedGeoData (http://linkedgeodata.org) - module: LinkedGeoData
29
+ require "semantic_crawler/linked_geo_data"
30
+ require "semantic_crawler/linked_geo_data/relevant_node"
31
+
32
+ # GDACS (http://gdacs.org) - module: Gdacs
33
+ require "semantic_crawler/gdacs"
34
+ require "semantic_crawler/gdacs/feed.rb"
35
+ require "semantic_crawler/gdacs/feed_item.rb"
36
+ require "semantic_crawler/gdacs/resource.rb"
@@ -0,0 +1,11 @@
1
+ # The RDF dump of the CIA Factbook contains country-related information. The
2
+ # data may be outdated, but for general-purpose information that never or
3
+ # rarely changes, this source works well.
4
+ #
5
+ # This module encapsulates access to the underlying RDF files and wraps
6
+ # the most important properties. Properties in the factbook namespace that
7
+ # are not wrapped can be accessed via the following method:
8
+ #
9
+ # * link:SemanticCrawler::Factbook::Country::get_factbook_property
10
+ module SemanticCrawler::Factbook
11
+ end
@@ -1,18 +1,28 @@
1
+ # encoding: UTF-8
2
+
1
3
  require 'nokogiri'
2
4
  require 'open-uri'
3
5
 
4
6
  module SemanticCrawler
5
7
  module Factbook
8
+ # Extracted from the RDF dump of the CIA Factbook. Contains all
9
+ # relevant, but possibly outdated, information about countries.
6
10
  class Country
11
+ # The URI prefix used for each country
7
12
  @@URI_PREFIX = "http://www4.wiwiss.fu-berlin.de/factbook/data/"
8
13
 
14
+ # Predefined RDFS/OWL namespaces used for RDF file parsing.
9
15
  @@NAMESPACES = {
10
16
  "factbook" => "http://www4.wiwiss.fu-berlin.de/factbook/ns#",
11
17
  "rdfs" => "http://www.w3.org/2000/01/rdf-schema#",
12
18
  "rdf" => "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
13
19
  }
14
20
 
21
+ # Country name given as input during the object creation.
15
22
  attr_reader :country_name
23
+
24
+ # The complete URL of the country. May be wrong
25
+ # if the country_name is not valid.
16
26
  attr_reader :url
17
27
 
18
28
  # Get Country Information from the CIA Factbook. see
@@ -23,15 +33,14 @@ module SemanticCrawler
23
33
  # >> puts austria.background
24
34
  #
25
35
  # Arguments:
26
- # name: (String)
36
+ # new_country_name: (String)
27
37
  def initialize(new_country_name)
28
- new_country_name.gsub!(" ", "_")
29
- @country_name = new_country_name.downcase
30
- @url = @@URI_PREFIX + @country_name
38
+ @country_name = new_country_name
39
+ @url = @@URI_PREFIX + @country_name.downcase.gsub(" ", "_").gsub("usa", "united_states")
31
40
  begin
32
41
  fetch_rdf
33
42
  rescue => e
34
- puts "Not able to get country information, through exception: " + e
43
+ puts "Not able to get country information, through exception: " + e.message
35
44
  end
36
45
  end
37
46
 
@@ -116,6 +125,7 @@ module SemanticCrawler
116
125
  end
117
126
 
118
127
  private
128
+ # Retrieves the RDF stream
119
129
  def fetch_rdf
120
130
  @doc = Nokogiri::XML(open(@url))
121
131
  end
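
The URL construction in the initializer above can be sketched in isolation. This is a minimal sketch, not the gem's code: the helper name `factbook_url` and the constant `URI_PREFIX` are hypothetical, and only the prefix string and the chained `downcase`/`gsub` calls are taken from the diff. It illustrates that the new version no longer mutates the caller's string (the removed `gsub!` did), and that `usa` is replaced wherever it occurs as a substring, which may be broader than intended.

```ruby
# Hypothetical helper mirroring the URL construction shown in the diff.
URI_PREFIX = "http://www4.wiwiss.fu-berlin.de/factbook/data/"

def factbook_url(country_name)
  # downcase and gsub return new strings, so the argument stays untouched.
  URI_PREFIX + country_name.downcase.gsub(" ", "_").gsub("usa", "united_states")
end

name = "United Kingdom"
puts factbook_url(name)   # URL built from the normalized name
puts name                 # the original string is unchanged
puts factbook_url("USA")  # the "usa" alias is rewritten
```

Because the replacement is a plain substring match, any input containing the letters `usa` would also be rewritten; an exact comparison against the whole normalized name would avoid that.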
@@ -0,0 +1,4 @@
1
+ # The module Freebase retrieves general information
2
+ # about different types.
3
+ module SemanticCrawler::Freebase
4
+ end
@@ -1,9 +1,14 @@
1
+ # [XXX] The current implementation outputs only an unreadable JSON object.
1
2
  class SemanticCrawler::Freebase::Country < SemanticCrawler::Freebase::Entity
3
+ # The Freebase object that should be retrieved
2
4
  attr_accessor :input_name
5
+ # The URL that points to the JSON object.
3
6
  attr_accessor :json_link
4
7
 
8
+ # The URL prefix of a Freebase experimental JSON entity.
5
9
  @@uri_prefix = "http://www.freebase.com/experimental/topic/standard/en/"
6
10
 
11
+ # Creates a new Freebase object (JSON)
7
12
  def initialize
8
13
  @input_name = nil
9
14
  @html_link = nil
@@ -1,5 +1,6 @@
1
1
  module SemanticCrawler
2
2
  module Freebase
3
+ # [XXX] This class should be deleted. The country object could be called directly.
3
4
  class Crawler
4
5
  # Returns a country object that contains all relevant
5
6
  # information that could be extracted from freebase.com
@@ -1,3 +1,4 @@
1
+ # [XXX] Delete this class. Not necessary.
1
2
  class SemanticCrawler::Freebase::Entity
2
3
  include HTTParty
3
4
  end
@@ -0,0 +1,6 @@
1
+ # encoding: UTF-8
2
+ #
3
+ # The Gdacs module is responsible for the parsing of the
4
+ # current crisis RSS feeds from http://gdacs.org
5
+ module SemanticCrawler::Gdacs
6
+ end
@@ -0,0 +1,100 @@
1
+ # encoding: UTF-8
2
+
3
+ module SemanticCrawler
4
+ module Gdacs
5
+ # The GDACS.org RSS feed contains the latest crisis information about
6
+ # the following crisis types:
7
+ #
8
+ # * Floods
9
+ # * Earthquakes
10
+ # * Tropical Cyclones
11
+ # * Volcanoes
12
+ class Feed
13
+
14
+ @@NAMESPACES = {
15
+ "atom" => "http://www.w3.org/2005/Atom"
16
+ }
17
+
18
+ # The gdacs.org RSS feed URL. (default:
19
+ # http://new.gdacs.org/xml/rss.xml)
20
+ attr_reader :url
21
+
22
+ # Initializes the gdacs.org feed URL. If not specified the default
23
+ # URL (http://new.gdacs.org/xml/rss.xml) is used. Normally the
24
+ # feed URL should not be changed.
25
+ def initialize(new_url = "http://new.gdacs.org/xml/rss.xml")
26
+ @url = new_url
27
+ @root_node = nil
28
+ begin
29
+ fetch_feed
30
+ rescue => e
31
+ puts "Not able to get country information, through exception: " + e.message
32
+ end
33
+ end
34
+
35
+ # Get rss/channel/title
36
+ def title
37
+ query_root_node("title/text()")
38
+ end
39
+
40
+ # Get rss/channel/link
41
+ def link
42
+ query_root_node("link/text()")
43
+ end
44
+
45
+ # Get rss/channel/description
46
+ def description
47
+ query_root_node("description/text()")
48
+ end
49
+
50
+ # Get rss/channel/pubDate
51
+ def pubDate
52
+ query_root_node("pubDate/text()")
53
+ end
54
+
55
+ # Get rss/channel/webMaster
56
+ def webMaster
57
+ query_root_node("webMaster/text()")
58
+ end
59
+
60
+ # Get rss/channel/managingEditor
61
+ def managingEditor
62
+ query_root_node("managingEditor/text()")
63
+ end
64
+
65
+ # Get rss/channel/atom:link
66
+ def atom_link
67
+ query_root_node("atom:link/@href", @@NAMESPACES)
68
+ end
69
+
70
+ # Get rss/channel/item*
71
+ def items
72
+ nodeset = query_root_node("item")
73
+ @items = []
74
+ nodeset.each do |item|
75
+ item_obj = SemanticCrawler::Gdacs::FeedItem.new(item)
76
+ @items << item_obj
77
+ end
78
+ @items
79
+ end
80
+
81
+ # Query the root_node
82
+ def query_root_node(xpath_query, namespaces = {})
83
+ if !@root_node.nil?
84
+ @root_node.xpath(xpath_query, namespaces)
85
+ end
86
+ end
87
+
88
+ def xml_document
89
+ @root_node.to_s
90
+ end
91
+
92
+ private
93
+ # Retrieves the RSS feed
94
+ def fetch_feed
95
+ @doc = Nokogiri::XML(open(@url))
96
+ @root_node = @doc.xpath("/rss/channel")
97
+ end
98
+ end
99
+ end
100
+ end
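
The error handling in the Feed class above has a consequence worth spelling out: when the fetch fails, the root node stays nil, `query_root_node` returns nil, and `items` would then call `each` on nil. The following is a minimal sketch of that behavior under stated assumptions: `SketchFeed` is a hypothetical stand-in, the fetcher is injected as a lambda so no network access is needed, and a plain Hash takes the place of the parsed XML node. The guarded `items` method is one possible fix, not something the diff contains.

```ruby
# Hypothetical stand-in for the Feed class; a Hash replaces the XML node.
class SketchFeed
  def initialize(fetcher)
    @root_node = nil
    begin
      @root_node = fetcher.call
    rescue => e
      # Same strategy as in the diff: report the error and carry on.
      warn "Not able to fetch feed: " + e.message
    end
  end

  # Returns nil when the fetch failed, like query_root_node in the diff.
  def query_root_node(key)
    @root_node[key] unless @root_node.nil?
  end

  # Guarded variant: falls back to an empty list instead of calling
  # each on nil.
  def items
    Array(query_root_node("item"))
  end
end

ok = SketchFeed.new(-> { { "title" => "GDACS", "item" => ["flood", "quake"] } })
broken = SketchFeed.new(-> { raise IOError, "connection refused" })

p ok.query_root_node("title")
p broken.query_root_node("title")
p broken.items
```

Callers of the real class would need a similar nil check, or the constructor could re-raise instead of printing the message.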
@@ -0,0 +1,172 @@
1
+ # encoding: UTF-8
2
+
3
+ module SemanticCrawler
4
+ module Gdacs
5
+ # One crisis entity with related resources. Could be one of the
6
+ # following crisis types:
7
+ #
8
+ # * Floods
9
+ # * Earthquakes
10
+ # * Tropical Cyclones
11
+ # * Volcanoes
12
+ class FeedItem
13
+ # XML namespaces used for the parsing process
14
+ @@NAMESPACES = {
15
+ "gdacs" => "http://www.gdacs.org",
16
+ "asgard" => "http://asgard.jrc.it",
17
+ "geo" => "http://www.w3.org/2003/01/geo/wgs84_pos#",
18
+ "dc" => "http://purl.org/dc/elements/1.1/"
19
+ }
20
+
21
+ def initialize(new_root_node)
22
+ @root_node = new_root_node
23
+ end
24
+
25
+ # Returns the crisis title
26
+ def title
27
+ query_root_node("title/text()")
28
+ end
29
+
30
+ # Returns the crisis description
31
+ def description
32
+ query_root_node("description/text()")
33
+ end
34
+
35
+ # Returns the enclosure URL
36
+ def enclosure_url
37
+ query_root_node("enclosure/@url")
38
+ end
39
+
40
+ # Returns the enclosure type, e.g. image/png
41
+ def enclosure_type
42
+ query_root_node("enclosure/@type")
43
+ end
44
+
45
+ # Returns the enclosure length, e.g. 1
46
+ def enclosure_length
47
+ query_root_node("enclosure/@length")
48
+ end
49
+
50
+ # Returns the crisis gdacs link
51
+ def link
52
+ query_root_node("link/text()")
53
+ end
54
+
55
+ # Returns the publication date
56
+ def pubDate
57
+ query_root_node("pubDate/text()")
58
+ end
59
+
60
+ # Returns the crisis start date
61
+ def fromdate
62
+ query_root_node("gdacs:fromdate/text()", @@NAMESPACES)
63
+ end
64
+
65
+ # Returns the crisis end date
66
+ def todate
67
+ query_root_node("gdacs:todate/text()", @@NAMESPACES)
68
+ end
69
+
70
+ # Returns the crisis subject abbreviation
71
+ def subject
72
+ query_root_node("dc:subject/text()", @@NAMESPACES)
73
+ end
74
+
75
+ # Returns a unique crisis identifier (may not be permanent)
76
+ def guid
77
+ query_root_node("guid/text()")
78
+ end
79
+
80
+ # Returns the latitude GPS coordinate where the crisis has occurred
81
+ def latitude
82
+ query_root_node("geo:Point/geo:lat/text()", @@NAMESPACES)
83
+ end
84
+
85
+ # Returns the longitude GPS coordinate where the crisis has occurred
86
+ def longitude
87
+ query_root_node("geo:Point/geo:long/text()", @@NAMESPACES)
88
+ end
89
+
90
+ # Returns the version
91
+ def version
92
+ query_root_node("gdacs:version/text()", @@NAMESPACES)
93
+ end
94
+
95
+ # Returns the event type abbreviation, e.g. VO (for volcanic
96
+ # eruption), EQ (for earthquake), FL (for flood), TC (for tropical
97
+ # cyclone)
98
+ def eventtype
99
+ query_root_node("gdacs:eventtype/text()", @@NAMESPACES)
100
+ end
101
+
102
+ # Returns the alert level, could be GREEN, ORANGE or RED
103
+ def alertlevel
104
+ query_root_node("gdacs:alertlevel/text()", @@NAMESPACES)
105
+ end
106
+
107
+ # Returns the event name if available
108
+ def eventname
109
+ query_root_node("gdacs:eventname/text()", @@NAMESPACES)
110
+ end
111
+
112
+ # Returns the event id
113
+ def eventid
114
+ query_root_node("gdacs:eventid/text()", @@NAMESPACES)
115
+ end
116
+
117
+ # Returns the episode id
118
+ def episodeid
119
+ query_root_node("gdacs:episodeid/text()", @@NAMESPACES)
120
+ end
121
+
122
+ # Returns the severity as human readable string
123
+ def severity
124
+ query_root_node("gdacs:severity/text()", @@NAMESPACES)
125
+ end
126
+
127
+ # Returns the population as human readable string
128
+ def population
129
+ query_root_node("gdacs:population/text()", @@NAMESPACES)
130
+ end
131
+
132
+ # Returns the vulnerability as human readable string
133
+ def vulnerability
134
+ query_root_node("gdacs:vulnerability/text()", @@NAMESPACES)
135
+ end
136
+
137
+ # Returns the country iso3 code if available
138
+ def iso3
139
+ query_root_node("gdacs:iso3/text()", @@NAMESPACES)
140
+ end
141
+
142
+ # Returns the country name(s)
143
+ def country
144
+ query_root_node("gdacs:country/text()", @@NAMESPACES)
145
+ end
146
+
147
+ # Returns ...
148
+ def glide
149
+ query_root_node("gdacs:glide/text()", @@NAMESPACES)
150
+ end
151
+
152
+ # Returns an array of SemanticCrawler::Gdacs::Resource objects
153
+ def resources
154
+ nodeset = query_root_node("gdacs:resources/gdacs:resource", @@NAMESPACES)
155
+ @items = []
156
+ nodeset.each do |item|
157
+ item_obj = SemanticCrawler::Gdacs::Resource.new(item)
158
+ @items << item_obj
159
+ end
160
+ @items
161
+ end
162
+
163
+ # Query the root_node
164
+ def query_root_node(xpath_query, namespaces = {})
165
+ if !@root_node.nil?
166
+ @root_node.xpath(xpath_query, namespaces)
167
+ end
168
+ end
169
+
170
+ end
171
+ end
172
+ end
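
The namespaced XPath accessors in FeedItem above can be tried out without network access. This is a minimal sketch under stated assumptions: it uses REXML rather than the gem's Nokogiri so that it needs no extra dependency, the XML snippet is made-up sample data shaped like a GDACS item, and the helper `query` is hypothetical. Note that Nokogiri's `xpath` returns a node set rather than a String, so callers of the real accessors would probably still need to convert the result.

```ruby
require "rexml/document"

# Made-up sample data shaped like one GDACS feed item.
SAMPLE_ITEM = <<~XML
  <item xmlns:gdacs="http://www.gdacs.org"
        xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
    <title>Sample earthquake</title>
    <gdacs:alertlevel>GREEN</gdacs:alertlevel>
    <gdacs:eventtype>EQ</gdacs:eventtype>
    <geo:Point>
      <geo:lat>1.5</geo:lat>
      <geo:long>2.5</geo:long>
    </geo:Point>
  </item>
XML

# Prefix-to-URI map, copied from the namespaces listed in the diff.
NAMESPACES = {
  "gdacs" => "http://www.gdacs.org",
  "geo"   => "http://www.w3.org/2003/01/geo/wgs84_pos#"
}

# Hypothetical helper: returns the first match as a String, or nil.
def query(node, xpath, namespaces = {})
  match = REXML::XPath.first(node, xpath, namespaces)
  match && match.to_s
end

item = REXML::Document.new(SAMPLE_ITEM).root

puts query(item, "title/text()")
puts query(item, "gdacs:alertlevel/text()", NAMESPACES)
puts query(item, "geo:Point/geo:lat/text()", NAMESPACES)
```

Elements that are absent from an item, such as an event name here, simply yield nil, which matches the "if available" wording in the accessor comments.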