datacatalog-importer 0.2.0 → 0.2.1

data/README.md CHANGED
@@ -0,0 +1,128 @@
1
+ # Introduction
2
+
3
+ So you want to write an importer for the National Data Catalog (NDC)? You could integrate with the NDC API directly, but we wouldn't recommend it. Instead, we recommend using the [NDC importer framework](http://github.com/sunlightlabs/datacatalog-importer). The framework serves two major purposes:
4
+
5
+ 1. It simplifies the task of writing an importer. In particular, the importer framework handles the API communication, so all an importer has to do is handle the external translation step (such as scraping of a Web site or integration with an API). It also provides utility functions that come in handy.
6
+
7
+ 2. It standardizes importers. This encourages the sharing of best practices
8
+ and it also makes coordination easier. The various importers are automated
9
+ through the use of the [National Data Catalog Importer System](http://github.com/sunlightlabs/datacatalog-imp-system).
10
+
11
+ The importer framework is good at doing a few things and delegating the rest. This document will help you get started. Before long, you'll have an importer ready to liberate government data.
12
+
13
+ # About the National Data Catalog
14
+
15
+ The National Data Catalog builds community around government data sets. At the core, it is a catalog of datasets, APIs, and interactive tools that provide data about government. By "government" we mean any branch of government and any level of government. By "catalog" we mean useful metadata -- how a data set was collected, how often it is updated, where to download the data, and so on.
16
+
17
+ The National Data Catalog (NDC) is powered by a decoupled collection of applications centered around a read-write API. All of the applications
18
+ communicate through the API.
19
+
20
+ # Walkthrough
21
+
22
+ Let's take a look at some example code in the example folder.
23
+
24
+ ## 1. Set Up the Rakefile
25
+
26
+ Begin by looking at [example/rakefile.rb](http://github.com/sunlightlabs/datacatalog-importer/blob/master/example/rakefile.rb). In this file, you set some configuration information and call out to the importer framework. It will define some rake tasks for you.
27
+
28
+ The importer framework handles quite a few things for you provided that you follow its design correctly. Your importer is responsible for providing a Puller class (as defined with `:puller => Puller` in rakefile.rb).
29
+
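+ To make this concrete, here is a purely illustrative sketch of a rakefile.rb. Only the `:puller => Puller` option comes from this README; the entry point (`DataCatalog::ImporterFramework`) and the other option names are assumptions, so treat [example/rakefile.rb](http://github.com/sunlightlabs/datacatalog-importer/blob/master/example/rakefile.rb) as the authoritative version.
+
+     # rakefile.rb -- illustrative sketch only; option names other than
+     # :puller, and the ImporterFramework entry point, are assumptions.
+     require 'rubygems'
+     require 'yaml'
+     require 'datacatalog-importer'
+     require File.expand_path('lib/puller', File.dirname(__FILE__))
+
+     config = YAML.load_file(File.expand_path('config.yml', File.dirname(__FILE__)))
+
+     DataCatalog::ImporterFramework.new(
+       :name    => "example",          # a short name for this importer
+       :api_key => config['api_key'],  # see step 2 below
+       :puller  => Puller              # the class defined in lib/puller.rb
+     )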
30
+ ## 2. Hide The Keys
31
+
32
+ National Data Catalog API keys are private, so please don't store them in code. Actually, don't even store them in source control at all. Separate them out and store them in `config.yml`. Make sure that your `.gitignore` file is set up to ignore `config.yml`. It is a good idea to include a `config.example.yml` that demonstrates the format of the file.
33
+
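+ For example, a minimal sketch of reading the key back out of `config.yml` at the top of your rakefile.rb (the `api_key` field name is only an illustration -- use whatever structure your importer expects):
+
+     require 'yaml'
+
+     config  = YAML.load_file(File.expand_path('config.yml', File.dirname(__FILE__)))
+     api_key = config['api_key']
+     raise "Missing api_key -- copy config.example.yml to config.yml" unless api_key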
34
+ ## 3. Make the Puller
35
+
36
+ Next, let's look at the [Puller class](http://github.com/sunlightlabs/datacatalog-importer/blob/master/example/lib/puller.rb). It is responsible for defining two methods: `initialize` and `run`. (The rake tasks constructed above rely on these methods.)
37
+
38
+ Please note that the example provided here is oversimplified. It is intended to demonstrate how to use the importer framework, but it is not a practical example to borrow heavily from. If you want to steal some importer code, please visit the [Sunlight Labs projects page](http://github.com/sunlightlabs) and filter the projects by 'datacatalog-imp-'.
39
+
40
+ As you would probably expect, `initialize` is called once. Its main purpose is to set up the callback handler (`@handler`) to refer back to the importer framework.
41
+
42
+ Put the main logic / algorithm / voodoo of your importer in the `run` method. The key responsibility of your importer is to call `@handler.source` or `@handler.organization` each time your importer finds a data source or organization, respectively. (Historical note: the 0.1.x version of the importer framework worked a little bit differently. This is a more flexible style.)
43
+
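+ For orientation, here is a stripped-down, hypothetical sketch of a Puller, using a few of the source keys documented below. The exact `initialize` signature is an assumption -- see [example/lib/puller.rb](http://github.com/sunlightlabs/datacatalog-importer/blob/master/example/lib/puller.rb) for the real one.
+
+     # lib/puller.rb -- hypothetical sketch, not the framework's example
+     class Puller
+       # We assume the framework passes in the callback handler here.
+       def initialize(handler, options = {})
+         @handler = handler
+         @options = options
+       end
+
+       # Walk the external catalog and report each data source back to
+       # the framework via the handler.
+       def run
+         scrape_catalog.each do |item|
+           @handler.source({
+             :title        => item[:title],
+             :url          => item[:url],
+             :source_type  => "dataset",
+             :organization => { :name => item[:agency] },
+             :catalog_name => "Example Catalog",
+             :catalog_url  => "http://example.gov/data"
+           })
+         end
+       end
+
+       private
+
+       # Stand-in for your scraping or API-integration code.
+       def scrape_catalog
+         []
+       end
+     end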
44
+ ### source parameter
45
+
46
+ `@handler.source()` expects a hash parameter of this shape:
47
+
48
+ {
49
+ :title => "Budget for...",
50
+ :description => "Congressional budget for...",
51
+ :source_type => "dataset",
52
+ :url => "http://...",
53
+ :documentation_url => "http://...",
54
+ :license => "...",
55
+ :license_url => "http://...",
56
+ :released => Kronos.parse("...").to_hash,
57
+ :frequency => "daily",
58
+ :period_start => Kronos.parse("...").to_hash,
59
+ :period_end => Kronos.parse("...").to_hash,
60
+ :organization => {
61
+ :name => "", # organization that provides data
62
+ },
63
+ :downloads => [{
64
+ :url => "http://...",
65
+ :format => "xml",
66
+ }], # include as many download formats as appropriate
67
+ :custom => {},
68
+ :raw => {},
69
+ :catalog_name => "...",
70
+ :catalog_url => "http://...",
71
+ }
72
+
73
+ Note that most of these parameters match up with the properties defined for a [Source in the National Data Catalog API](http://github.com/sunlightlabs/datacatalog-api/blob/master/resources/sources.rb). These parameters are just passed along to the API, which will validate the values.
74
+
75
+ The remaining parameters (`organization` and `downloads`) are handled by the importer framework:
76
+
77
+ * The organization sub-hash is used to look up or create the associated organization for the source. Then an `organization_id` key/value pair is sent to the API.
78
+
79
+ * The downloads array is used to look up or create the associated download formats for a data source.
80
+
81
+ You may have noticed the use of `Kronos.parse` above. We highly recommend the use of the [kronos library](http://github.com/djsun/kronos) for the parsing of dates.
82
+
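+ For instance, a short sketch (the date string is made up; Kronos is designed for dates of varying precision):
+
+     require 'rubygems'
+     require 'kronos'
+
+     released = Kronos.parse("June 2009")
+     puts released.to_hash.inspect   # ready to pass as :released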
83
+ ### organization parameter
84
+
85
+ `@handler.organization()` expects a hash parameter of this shape:
86
+
87
+ {
88
+ :name => "",
89
+ :acronym => "",
90
+ :url => "http://...",
91
+ :description => "",
92
+ :org_type => "governmental",
93
+ :organization => {
94
+ :name => "", # parent organization, if any
95
+ :url => "",
96
+ },
97
+ :catalog_name => "...",
98
+ :catalog_url => "http://...",
99
+ }
100
+
101
+ Note that most of these parameters match up with the properties defined for an [Organization in the National Data Catalog API](http://github.com/sunlightlabs/datacatalog-api/blob/master/resources/organizations.rb). These parameters are just passed along to the API, which will validate the values.
102
+
103
+ The remaining parameter, `organization`, is handled by the importer framework. The framework just looks up the parent organization using the name or url. It then sends `parent_id` with the associated parent organization id to the API.
104
+
105
+ ## 4. You're Done / Best Practices
106
+
107
+ That's it. But before you go hacking away, let me say a few words about best practices:
108
+
109
+ * If you are scraping a web site, we highly recommend caching the raw HTML files in your importer. Our production importers are queued up using the NDC Importer System, which integrates nicely with git. It keeps a record of the raw HTML files that correspond to each run. This makes it easier to debug when things go wrong. (See the caching sketch after this list.)
110
+
111
+ * Take advantage of the utility functions in [/lib/utility.rb](http://github.com/sunlightlabs/datacatalog-importer/blob/master/lib/utility.rb). If you want to make a suggestion regarding the utility functions, please let us know.
112
+
113
+ * It goes without saying, but please follow best Ruby practices and put some
114
+ thought into writing clean code. Follow the conventions of the community and
115
+ strive to make your code readable by other people.
116
+
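+ Here is a bare-bones sketch of the caching idea from the first bullet above; the file layout and naming are up to you:
+
+     require 'open-uri'
+     require 'fileutils'
+
+     # Fetch a page, keeping a copy of the raw HTML on disk so that each
+     # run leaves an auditable record (and re-runs can skip the network).
+     def fetch_with_cache(url, cache_dir = "cache")
+       FileUtils.mkdir_p(cache_dir)
+       path = File.join(cache_dir, File.basename(url))
+       unless File.exist?(path)
+         File.open(path, "w") { |f| f.write(open(url).read) }
+       end
+       File.read(path)
+     end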
117
+ ## 5. Talk to Us
118
+
119
+ Please reach out to us on our [National Data Catalog Google Group](http://groups.google.com/group/datacatalog). We can help you with your importer. Once it works reliably, we will want to add it to our importer system. The more up-to-date, relevant government data we bring in, the more useful our data catalog becomes.
120
+
121
+ # The Team
122
+
123
+ The National Data Catalog team includes:
124
+
125
+ * David James of Sunlight Labs
126
+ * Luigi Montanez of Sunlight Labs
127
+ * Ryan Wold, a Sunlight Labs intern
128
+ * Mike Dvorscak, a Google Summer of Code Student
data/VERSION CHANGED
@@ -1 +1 @@
1
- 0.2.0
1
+ 0.2.1
@@ -5,11 +5,11 @@
5
5
 
6
6
  Gem::Specification.new do |s|
7
7
  s.name = %q{datacatalog-importer}
8
- s.version = "0.2.0"
8
+ s.version = "0.2.1"
9
9
 
10
10
  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
11
11
  s.authors = ["David James"]
12
- s.date = %q{2010-07-08}
12
+ s.date = %q{2010-08-25}
13
13
  s.description = %q{This framework makes it easier to write importers for the National Data Catalog.}
14
14
  s.email = %q{djames@sunlightfoundation.com}
15
15
  s.extra_rdoc_files = [
@@ -37,6 +37,7 @@ Gem::Specification.new do |s|
37
37
  "lib/sort_yaml_hash.rb",
38
38
  "lib/tasks.rb",
39
39
  "lib/utility.rb",
40
+ "natdat_is_hungry.md",
40
41
  "spec/spec.opts",
41
42
  "spec/spec_helper.rb",
42
43
  "spec/utility_spec.rb"
data/lib/utility.rb CHANGED
@@ -56,7 +56,7 @@ module DataCatalog
56
56
 
57
57
  def self.headers
58
58
  {
59
- "UserAgent" => "National Data Catalog Importer/0.2.0",
59
+ "UserAgent" => "National Data Catalog Importer/0.2.1",
60
60
  }
61
61
  end
62
62
 
@@ -150,8 +150,8 @@ module DataCatalog
150
150
  end
151
151
 
152
152
  def self.parse_xml_from_uri(uri)
153
- puts "Fetching #{uri}..."
154
- Nokogiri::XML(open(uri))
153
+ data = fetch(uri)
154
+ Nokogiri::XML::Document.parse(data)
155
155
  end
156
156
 
157
157
  def self.parse_xml_from_file_or_uri(uri, file, options={})
@@ -0,0 +1,126 @@
1
+ # The National Data Catalog Is Hungry For Data
2
+
3
+ So you've found some government data on the web. Naturally, you are eager to share your findings with the world. Perfect! Sunlight Labs can help. Our [National Data Catalog (NatDatCat)](http://nationaldatacatalog.com) is hungry for government data, and we have to feed it regularly. Otherwise, it gets grumpy.
4
+
5
+ The first step is to assess what you've found. If it is just a few scattered files, [fill out a quick form and tell us about it](http://nationaldatacatalog.com/suggest). On the other hand, if it is a collection of data sets, you might consider writing an importer...
6
+
7
+ ### Writing an Importer for NatDatCat
8
+
9
+ Have Ruby and Git skills and a hankering for some Web spelunking? Then writing an importer for NatDatCat might be the perfect [civic hacking project](http://thechangelog.com/post/382418778/episode-0-1-3-civic-hacking-with-luigi-montanez-and-jere) for you!
10
+
11
+ Since the NatDatCat system is centered around a RESTful API, it is easy to write small standalone programs to work with the data. (Even the Web app is, more-or-less, a presentation layer that communicates through the API.) So, to write your importer, you could integrate with the NDC API directly. We have [API documentation](http://api.nationaldatacatalog.com/docs/) at your service to get you started.
12
+
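+ If you go that route, a request might look roughly like the sketch below. The endpoint path and the `api_key` parameter name are assumptions -- confirm both against the API documentation.
+
+     require 'rubygems'
+     require 'rest_client'   # gem install rest-client
+     require 'json'
+
+     # Illustrative only: list some sources from the catalog API.
+     response = RestClient.get("http://api.nationaldatacatalog.com/sources",
+                               :params => { :api_key => "YOUR-API-KEY" })
+     sources  = JSON.parse(response)
+     puts "Fetched #{sources.length} sources"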
13
+ But not so fast. There is a better way. We recommend using the [NDC importer framework](http://github.com/sunlightlabs/datacatalog-importer). The framework serves two major purposes:
14
+
15
+ 1. It simplifies the task of writing an importer. In particular, the importer framework handles the API communication, so all an importer has to do is handle the external translation step (such as scraping of a Web site or integration with an API). It also provides utility functions that come in handy.
16
+
17
+ 2. It standardizes importers. This encourages the sharing of best practices
18
+ and it also makes coordination easier. The various importers are automated
19
+ through the use of the [National Data Catalog Importer System](http://github.com/sunlightlabs/datacatalog-imp-system).
20
+
21
+ The importer framework is good at doing a few things and delegating the rest. This document will help you get started. Before long, you'll have an importer ready to liberate government data.
22
+
23
+ As a prerequisite, you'll need to install the [NatDatCat API](http://github.com/sunlightlabs/datacatalog-api) on your system. Doing so lets you test your importer locally in a controlled environment. (Once you get your importer working, let us know and we'll add it to our collection of importers that run against our production API.)
24
+
25
+ ### Importer Walkthrough
26
+
27
+ Let's take a look at some example code in the [example](http://github.com/sunlightlabs/datacatalog-importer/tree/master/example/) folder.
28
+
29
+ #### 1. Set Up the Rakefile
30
+
31
+ Begin by looking at [example/rakefile.rb](http://github.com/sunlightlabs/datacatalog-importer/blob/master/example/rakefile.rb). In this file, you set some configuration information and call out to the importer framework. It will define some rake tasks for you.
32
+
33
+ The importer framework handles quite a few things for you provided that you follow its design correctly. Your importer is responsible for providing a Puller class (as defined with `:puller => Puller` in rakefile.rb).
34
+
35
+ #### 2. Make Some Keys and Hide Them
36
+
37
+ Use the API to generate a key for your importer. Remember, API keys are private, so please don't store them in code. Actually, don't even store them in source control at all. Separate them out and store them in `config.yml`. Make sure that your `.gitignore` file is set up to ignore `config.yml`. It is a good idea to include a `config.example.yml` that demonstrates the format of the file.
38
+
39
+ #### 3. Make the Puller
40
+
41
+ Next, let's look at the [Puller class](http://github.com/sunlightlabs/datacatalog-importer/blob/master/example/lib/puller.rb). It is responsible for defining two methods: `initialize` and `run`. (The rake tasks constructed above rely on these methods.)
42
+
43
+ Please note that the example provided here is oversimplified. It is intended to demonstrate how to use the importer framework, not as a practical example to copy verbatim. If you want to steal some importer code, please visit the [Sunlight Labs projects page](http://github.com/sunlightlabs) and filter the projects by 'datacatalog-imp-'.
44
+
45
+ As you would probably expect, `initialize` is called once. Its main purpose is to set up the callback handler (`@handler`) to refer back to the importer framework.
46
+
47
+ Put the main logic / algorithm / secret recipe / voodoo of your importer in the `run` method. The key responsibility of your importer is to call `@handler.source` or `@handler.organization` each time your importer finds a data source or organization, respectively. (Historical note: the 0.1.x version of the importer framework worked a little bit differently. This is a more flexible style.)
48
+
49
+ **source parameter**
50
+
51
+ `@handler.source()` expects a hash parameter of this shape:
52
+
53
+ {
54
+ :title => "Budget for...",
55
+ :description => "Congressional budget for...",
56
+ :source_type => "dataset",
57
+ :url => "http://...",
58
+ :documentation_url => "http://...",
59
+ :license => "...",
60
+ :license_url => "http://...",
61
+ :released => Kronos.parse("...").to_hash,
62
+ :frequency => "daily",
63
+ :period_start => Kronos.parse("...").to_hash,
64
+ :period_end => Kronos.parse("...").to_hash,
65
+ :organization => {
66
+ :name => "", # organization that provides data
67
+ }
68
+ :downloads => [{
69
+ :url => "http://..."
70
+ :format => "xml",
71
+ }] # include as many download formats as appropiate
72
+ :custom => {},
73
+ :raw => {},
74
+ :catalog_name => "...",
75
+ :catalog_url => "http://...",
76
+ }
77
+
78
+ Note that most of these parameters match up with the properties defined for a [Source in the National Data Catalog API](http://github.com/sunlightlabs/datacatalog-api/blob/master/resources/sources.rb). These parameters are just passed along to the API, which will validate the values.
79
+
80
+ The remaining parameters (`organization` and `downloads`) are handled by the importer framework:
81
+
82
+ * The organization sub-hash is used to lookup or create the associated organization for the source. Then a `organization_id` key/value pair is sent to the API.
83
+
84
+ * The downloads array is used to lookup or create the associate download formats for a data source.
85
+
86
+ You may have noticed the use of `Kronos.parse` above. We highly recommend the use of the [kronos library](http://github.com/djsun/kronos) for the parsing of dates.
87
+
88
+ **organization parameter**
89
+
90
+ `@handler.organization()` expects a hash parameter of this shape:
91
+
92
+ {
93
+ :name => "",
94
+ :acronym => "",
95
+ :url => "http://...",
96
+ :description => "",
97
+ :org_type => "governmental",
98
+ :organization => {
99
+ :name => "", # parent organization, if any
100
+ :url => "",
101
+ }
102
+ :catalog_name => "...",
103
+ :catalog_url => "http://...",
104
+ }
105
+
106
+ Note that most of these parameters match up with the properties defined for an [Organization in the National Data Catalog API](http://github.com/sunlightlabs/datacatalog-api/blob/master/resources/organizations.rb). These parameters are just passed along to the API, which will validate the values.
107
+
108
+ The remaining parameter, `organization`, is handled by the importer framework. The framework just looks up the parent organization using the name or url. It then sends `parent_id` with the associated parent organization id to the API.
109
+
110
+ ### You're Done / Best Practices
111
+
112
+ That's it. But before you go hacking away, let me say a few words about best practices:
113
+
114
+ * If you are scraping a web site, we highly recommend caching the raw HTML files in your importer. Our production importers are queued up using the NDC Importer System, which integrates nicely with git. It keeps a record of the raw HTML files that correspond to each run. This makes it easier to debug if and when things go wrong.
115
+
116
+ * Take advantage of the utility functions in [/lib/utility.rb](http://github.com/sunlightlabs/datacatalog-importer/blob/master/lib/utility.rb). If you have suggestions about useful utility functions, please let us know.
117
+
118
+ * It goes without saying, but please follow best Ruby practices and make a good faith effort at writing clean code. Follow the conventions of the community and strive to make your code readable by other people.
119
+
120
+ And thanks for helping us feed the National Data Catalog!
121
+
122
+ ### Talk to Us / Stay Up To Date
123
+
124
+ Please reach out to us on our [National Data Catalog Google Group](http://groups.google.com/group/datacatalog). We can help you with your importer. Once it works reliably, we will want to add it to our importer system. The more up-to-date, relevant government data we bring in, the more useful our data catalog becomes.
125
+
126
+ This document is adapted from the README in the [datacatalog-importer source code repository](http://github.com/sunlightlabs/datacatalog-importer). You can find the latest version there.
metadata CHANGED
@@ -1,13 +1,13 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: datacatalog-importer
3
3
  version: !ruby/object:Gem::Version
4
- hash: 23
4
+ hash: 21
5
5
  prerelease: false
6
6
  segments:
7
7
  - 0
8
8
  - 2
9
- - 0
10
- version: 0.2.0
9
+ - 1
10
+ version: 0.2.1
11
11
  platform: ruby
12
12
  authors:
13
13
  - David James
@@ -15,7 +15,7 @@ autorequire:
15
15
  bindir: bin
16
16
  cert_chain: []
17
17
 
18
- date: 2010-07-08 00:00:00 -04:00
18
+ date: 2010-08-25 00:00:00 -04:00
19
19
  default_executable:
20
20
  dependencies:
21
21
  - !ruby/object:Gem::Dependency
@@ -96,6 +96,7 @@ files:
96
96
  - lib/sort_yaml_hash.rb
97
97
  - lib/tasks.rb
98
98
  - lib/utility.rb
99
+ - natdat_is_hungry.md
99
100
  - spec/spec.opts
100
101
  - spec/spec_helper.rb
101
102
  - spec/utility_spec.rb