datacatalog-importer 0.2.0 → 0.2.1

data/README.md CHANGED
@@ -0,0 +1,128 @@
1
+ # Introduction
2
+
3
+ So you want to write an importer for the National Data Catalog (NDC)? You could integrate with the NDC API directly, but we wouldn't recommend it. Instead, we recommend using the [NDC importer framework](http://github.com/sunlightlabs/datacatalog-importer). The framework serves two major purposes:
4
+
5
+ 1. It simplifies the task of writing an importer. In particular, the importer framework handles the API communication, so all an importer has to do is handle the external translation step (such as scraping of a Web site or integration with an API). It also provides utility functions that come in handy.
6
+
7
+ 2. It standardizes importers. This encourages the sharing of best practices
8
+ and it also makes coordination easier. The various importers are automated
9
+ through the use of the [National Data Catalog Importer System](http://github.com/sunlightlabs/datacatalog-imp-system).
10
+
11
+ The importer framework is good at doing a few things and delegating the rest. This document will help you get started. Before long, you'll have an importer ready to liberate government data.
12
+
13
+ # About the National Data Catalog
14
+
15
+ The National Data Catalog builds community around government data sets. At the core, it is a catalog of datasets, APIs, and interactive tools that provide data about government. By "government" we mean any branch of government and any level of government. By "catalog" we mean useful metadata -- how a data set was collected, how often it is updated, where to download the data, and so on.
16
+
17
+ The National Data Catalog (NDC) is powered by a decoupled collection of applications centered around a read-write API. All of the applications
18
+ communicate through the API.
19
+
20
+ # Walkthrough
21
+
22
+ Let's take a look at some example code in the example folder.
23
+
24
+ ## 1. Set Up the Rakefile
25
+
26
+ Begin by looking at [example/rakefile.rb](http://github.com/sunlightlabs/datacatalog-importer/blob/master/example/rakefile.rb). In this file, you set some configuration information and call out to the importer framework. It will define some rake tasks for you.
27
+
28
+ The importer framework handles quite a few things for you provided that you follow its design correctly. Your importer is responsible for providing a Puller class (as defined with `:puller => Puller` in rakefile.rb).
29
+
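+ To make this concrete, here is a purely illustrative sketch of a rakefile.rb. Only the `:puller => Puller` option comes from this README; the entry point (`DataCatalog::ImporterFramework`) and the other option names are assumptions, so treat [example/rakefile.rb](http://github.com/sunlightlabs/datacatalog-importer/blob/master/example/rakefile.rb) as the authoritative version.
+
+     # rakefile.rb -- illustrative sketch only; option names other than
+     # :puller, and the ImporterFramework entry point, are assumptions.
+     require 'rubygems'
+     require 'yaml'
+     require 'datacatalog-importer'
+     require File.expand_path('lib/puller', File.dirname(__FILE__))
+
+     config = YAML.load_file(File.expand_path('config.yml', File.dirname(__FILE__)))
+
+     DataCatalog::ImporterFramework.new(
+       :name    => "example",          # a short name for this importer
+       :api_key => config['api_key'],  # see step 2 below
+       :puller  => Puller              # the class defined in lib/puller.rb
+     )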
30
+ ## 2. Hide The Keys
31
+
32
+ National Data Catalog API keys are private, so please don't store them in code. Actually, don't even store them in source control at all. Separate them out and store them in `config.yml`. Make sure that your `.gitignore` file is set up to ignore `config.yml`. It is a good idea to include a `config.example.yml` that demonstrates the format of the file.
33
+
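+ For example, a minimal sketch of reading the key back out of `config.yml` at the top of your rakefile.rb (the `api_key` field name is only an illustration -- use whatever structure your importer expects):
+
+     require 'yaml'
+
+     config  = YAML.load_file(File.expand_path('config.yml', File.dirname(__FILE__)))
+     api_key = config['api_key']
+     raise "Missing api_key -- copy config.example.yml to config.yml" unless api_key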
34
+ ## 3. Make the Puller
35
+
36
+ Next, let's look at the [Puller class](http://github.com/sunlightlabs/datacatalog-importer/blob/master/example/lib/puller.rb). It is responsible for defining two methods: `initialize` and `run`. (The rake tasks constructed above rely on these methods.)
37
+
38
+ Please note that the example provided here is oversimplified. It is intended to demonstrate how to use the importer framework, but it is not a practical example to borrow heavily from. If you want to steal some importer code, please visit the [Sunlight Labs projects page](http://github.com/sunlightlabs) and filter the projects by 'datacatalog-imp-'.
39
+
40
+ As you would probably expect, `initialize` is called once. Its main purpose is to set up the callback handler (`@handler`) to refer back to the importer framework.
41
+
42
+ Put the main logic / algorithm / voodoo of your importer in the `run` method. The key responsibility of your importer is to call `@handler.source` or `@handler.organization` each time your importer finds a data source or organization, respectively. (Historical note: the 0.1.x version of the importer framework worked a little bit differently. This is a more flexible style.)
43
+
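+ For orientation, here is a stripped-down, hypothetical sketch of a Puller, using a few of the source keys documented below. The exact `initialize` signature is an assumption -- see [example/lib/puller.rb](http://github.com/sunlightlabs/datacatalog-importer/blob/master/example/lib/puller.rb) for the real one.
+
+     # lib/puller.rb -- hypothetical sketch, not the framework's example
+     class Puller
+       # We assume the framework passes in the callback handler here.
+       def initialize(handler, options = {})
+         @handler = handler
+         @options = options
+       end
+
+       # Walk the external catalog and report each data source back to
+       # the framework via the handler.
+       def run
+         scrape_catalog.each do |item|
+           @handler.source({
+             :title        => item[:title],
+             :url          => item[:url],
+             :source_type  => "dataset",
+             :organization => { :name => item[:agency] },
+             :catalog_name => "Example Catalog",
+             :catalog_url  => "http://example.gov/data"
+           })
+         end
+       end
+
+       private
+
+       # Stand-in for your scraping or API-integration code.
+       def scrape_catalog
+         []
+       end
+     end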
44
+ ### source parameter
45
+
46
+ `@handler.source()` expects a hash parameter of this shape:
47
+
48
+ {
49
+ :title => "Budget for...",
50
+ :description => "Congressional budget for...",
51
+ :source_type => "dataset",
52
+ :url => "http://...",
53
+ :documentation_url => "http://...",
54
+ :license => "...",
55
+ :license_url => "http://...",
56
+ :released => Kronos.parse("...").to_hash,
57
+ :frequency => "daily",
58
+ :period_start => Kronos.parse("...").to_hash,
59
+ :period_end => Kronos.parse("...").to_hash,
60
+ :organization => {
61
+ :name => "", # organization that provides data
62
+ },
63
+ :downloads => [{
64
+ :url => "http://...",
65
+ :format => "xml",
66
+ }], # include as many download formats as appropriate
67
+ :custom => {},
68
+ :raw => {},
69
+ :catalog_name => "...",
70
+ :catalog_url => "http://...",
71
+ }
72
+
73
+ Note that most of these parameters match up with the properties defined for a [Source in the National Data Catalog API](http://github.com/sunlightlabs/datacatalog-api/blob/master/resources/sources.rb). These parameters are just passed along to the API, which will validate the values.
74
+
75
+ The remaining parameters (`organization` and `downloads`) are handled by the importer framework:
76
+
77
+ * The organization sub-hash is used to look up or create the associated organization for the source. Then an `organization_id` key/value pair is sent to the API.
78
+
79
+ * The downloads array is used to look up or create the associated download formats for a data source.
80
+
81
+ You may have noticed the use of `Kronos.parse` above. We highly recommend the use of the [kronos library](http://github.com/djsun/kronos) for the parsing of dates.
82
+
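+ For instance, a short sketch (the date string is made up; Kronos is designed for dates of varying precision):
+
+     require 'rubygems'
+     require 'kronos'
+
+     released = Kronos.parse("June 2009")
+     puts released.to_hash.inspect   # ready to pass as :released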
83
+ ### organization parameter
84
+
85
+ `@handler.organization()` expects a hash parameter of this shape:
86
+
87
+ {
88
+ :name => "",
89
+ :acronym => "",
90
+ :url => "http://...",
91
+ :description => "",
92
+ :org_type => "governmental",
93
+ :organization => {
94
+ :name => "", # parent organization, if any
95
+ :url => "",
96
+ },
97
+ :catalog_name => "...",
98
+ :catalog_url => "http://...",
99
+ }
100
+
101
+ Note that most of these parameters match up with the properties defined for an [Organization in the National Data Catalog API](http://github.com/sunlightlabs/datacatalog-api/blob/master/resources/organizations.rb). These parameters are just passed along to the API, which will validate the values.
102
+
103
+ The remaining parameter, `organization`, is handled by the importer framework. The framework just looks up the parent organization using the name or url. It then sends `parent_id` with the associated parent organization id to the API.
104
+
105
+ ## 4. You're Done / Best Practices
106
+
107
+ That's it. But before you go hacking away, let me say a few words about best practices:
108
+
109
+ * If you are scraping a web site, we highly recommend caching the raw HTML files in your importer. Our production importers are queued up using the NDC Importer System, which integrates nicely with git. It keeps a record of the raw HTML files that correspond to each run. This makes it easier to debug when things go wrong. (See the caching sketch after this list.)
110
+
111
+ * Take advantage of the utility functions in [/lib/utility.rb](http://github.com/sunlightlabs/datacatalog-importer/blob/master/lib/utility.rb). If you want to make a suggestion regarding the utility functions, please let us know.
112
+
113
+ * It goes without saying, but please follow best Ruby practices and put some
114
+ thought into writing clean code. Follow the conventions of the community and
115
+ strive to make your code readable by other people.
116
+
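+ Here is a bare-bones sketch of the caching idea from the first bullet above; the file layout and naming are up to you:
+
+     require 'open-uri'
+     require 'fileutils'
+
+     # Fetch a page, keeping a copy of the raw HTML on disk so that each
+     # run leaves an auditable record (and re-runs can skip the network).
+     def fetch_with_cache(url, cache_dir = "cache")
+       FileUtils.mkdir_p(cache_dir)
+       path = File.join(cache_dir, File.basename(url))
+       unless File.exist?(path)
+         File.open(path, "w") { |f| f.write(open(url).read) }
+       end
+       File.read(path)
+     end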
117
+ ## 5. Talk to Us
118
+
119
+ Please reach out to us on our [National Data Catalog Google Group](http://groups.google.com/group/datacatalog). We can help you with your importer. Once it works reliably, we will want to add it to our importer system. The more up-to-date, relevant government data we bring in, the more useful our data catalog becomes.
120
+
121
+ # The Team
122
+
123
+ The National Data Catalog team includes:
124
+
125
+ * David James of Sunlight Labs
126
+ * Luigi Montanez of Sunlight Labs
127
+ * Ryan Wold, a Sunlight Labs intern
128
+ * Mike Dvorscak, a Google Summer of Code Student
data/VERSION CHANGED
@@ -1 +1 @@
1
- 0.2.0
1
+ 0.2.1
@@ -5,11 +5,11 @@
5
5
 
6
6
  Gem::Specification.new do |s|
7
7
  s.name = %q{datacatalog-importer}
8
- s.version = "0.2.0"
8
+ s.version = "0.2.1"
9
9
 
10
10
  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
11
11
  s.authors = ["David James"]
12
- s.date = %q{2010-07-08}
12
+ s.date = %q{2010-08-25}
13
13
  s.description = %q{This framework makes it easier to write importers for the National Data Catalog.}
14
14
  s.email = %q{djames@sunlightfoundation.com}
15
15
  s.extra_rdoc_files = [
@@ -37,6 +37,7 @@ Gem::Specification.new do |s|
37
37
  "lib/sort_yaml_hash.rb",
38
38
  "lib/tasks.rb",
39
39
  "lib/utility.rb",
40
+ "natdat_is_hungry.md",
40
41
  "spec/spec.opts",
41
42
  "spec/spec_helper.rb",
42
43
  "spec/utility_spec.rb"
data/lib/utility.rb CHANGED
@@ -56,7 +56,7 @@ module DataCatalog
56
56
 
57
57
  def self.headers
58
58
  {
59
- "UserAgent" => "National Data Catalog Importer/0.2.0",
59
+ "UserAgent" => "National Data Catalog Importer/0.2.1",
60
60
  }
61
61
  end
62
62
 
@@ -150,8 +150,8 @@ module DataCatalog
150
150
  end
151
151
 
152
152
  def self.parse_xml_from_uri(uri)
153
- puts "Fetching #{uri}..."
154
- Nokogiri::XML(open(uri))
153
+ data = fetch(uri)
154
+ Nokogiri::XML::Document.parse(data)
155
155
  end
156
156
 
157
157
  def self.parse_xml_from_file_or_uri(uri, file, options={})
@@ -0,0 +1,126 @@
1
+ # The National Data Catalog Is Hungry For Data
2
+
3
+ So you've found some government data on the web. Naturally, you are eager to share your findings with the world. Perfect! Sunlight Labs can help. Our [National Data Catalog (NatDatCat)](http://nationaldatacatalog.com) is hungry for government data, and we have to feed it regularly. Otherwise, it gets grumpy.
4
+
5
+ The first step is to assess what you've found. If it is just a few scattered files, [fill out a quick form and tell us about it](http://nationaldatacatalog.com/suggest). On the other hand, if it is a collection of data sets, you might consider writing an importer...
6
+
7
+ ### Writing an Importer for NatDatCat
8
+
9
+ Have Ruby and Git skills and a hankering for some Web spelunking? Then writing an importer for NatDatCat might be the perfect [civic hacking project](http://thechangelog.com/post/382418778/episode-0-1-3-civic-hacking-with-luigi-montanez-and-jere) for you!
10
+
11
+ Since the NatDatCat system is centered around a RESTful API, it is easy to write small standalone programs to work with the data. (Even the Web app is, more-or-less, a presentation layer that communicates through the API.) So, to write your importer, you could integrate with the NDC API directly. We have [API documentation](http://api.nationaldatacatalog.com/docs/) at your service to get you started.
12
+
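+ If you go that route, a request might look roughly like the sketch below. The endpoint path and the `api_key` parameter name are assumptions -- confirm both against the API documentation.
+
+     require 'rubygems'
+     require 'rest_client'   # gem install rest-client
+     require 'json'
+
+     # Illustrative only: list some sources from the catalog API.
+     response = RestClient.get("http://api.nationaldatacatalog.com/sources",
+                               :params => { :api_key => "YOUR-API-KEY" })
+     sources  = JSON.parse(response)
+     puts "Fetched #{sources.length} sources"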
13
+ But not so fast. There is a better way. We recommend using the [NDC importer framework](http://github.com/sunlightlabs/datacatalog-importer). The framework serves two major purposes:
14
+
15
+ 1. It simplifies the task of writing an importer. In particular, the importer framework handles the API communication, so all an importer has to do is handle the external translation step (such as scraping of a Web site or integration with an API). It also provides utility functions that come in handy.
16
+
17
+ 2. It standardizes importers. This encourages the sharing of best practices
18
+ and it also makes coordination easier. The various importers are automated
19
+ through the use of the [National Data Catalog Importer System](http://github.com/sunlightlabs/datacatalog-imp-system).
20
+
21
+ The importer framework is good at doing a few things and delegating the rest. This document will help you get started. Before long, you'll have an importer ready to liberate government data.
22
+
23
+ As a prerequisite, you'll need to install the [NatDatCat API](http://github.com/sunlightlabs/datacatalog-api) on your system. Doing so lets you test your importer locally in a controlled environment. (Once you get your importer working, let us know and we'll add it to our collection of importers that run against our production API.)
24
+
25
+ ### Importer Walkthrough
26
+
27
+ Let's take a look at some example code in the [example](http://github.com/sunlightlabs/datacatalog-importer/tree/master/example/) folder.
28
+
29
+ #### 1. Set Up the Rakefile
30
+
31
+ Begin by looking at [example/rakefile.rb](http://github.com/sunlightlabs/datacatalog-importer/blob/master/example/rakefile.rb). In this file, you set some configuration information and call out to the importer framework. It will define some rake tasks for you.
32
+
33
+ The importer framework handles quite a few things for you provided that you follow its design correctly. Your importer is responsible for providing a Puller class (as defined with `:puller => Puller` in rakefile.rb).
34
+
35
+ #### 2. Make Some Keys and Hide Them
36
+
37
+ Use the API to generate a key for your importer. Remember, API keys are private, so please don't store them in code. Actually, don't even store them in source control at all. Separate them out and store them in `config.yml`. Make sure that your `.gitignore` file is set up to ignore `config.yml`. It is a good idea to include a `config.example.yml` that demonstrates the format of the file.
38
+
39
+ #### 3. Make the Puller
40
+
41
+ Next, let's look at the [Puller class](http://github.com/sunlightlabs/datacatalog-importer/blob/master/example/lib/puller.rb). It is responsible for defining two methods: `initialize` and `run`. (The rake tasks constructed above rely on these methods.)
42
+
43
+ Please note that the example provided here is oversimplified. It is intended to demonstrate how to use the importer framework, not as a practical example to copy verbatim. If you want to steal some importer code, please visit the [Sunlight Labs projects page](http://github.com/sunlightlabs) and filter the projects by 'datacatalog-imp-'.
44
+
45
+ As you would probably expect, `initialize` is called once. Its main purpose is to set up the callback handler (`@handler`) to refer back to the importer framework.
46
+
47
+ Put the main logic / algorithm / secret recipe / voodoo of your importer in the `run` method. The key responsibility of your importer is to call `@handler.source` or `@handler.organization` each time your importer finds a data source or organization, respectively. (Historical note: the 0.1.x version of the importer framework worked a little bit differently. This is a more flexible style.)
48
+
49
+ **source parameter**
50
+
51
+ `@handler.source()` expects a hash parameter of this shape:
52
+
53
+ {
54
+ :title => "Budget for...",
55
+ :description => "Congressional budget for...",
56
+ :source_type => "dataset",
57
+ :url => "http://...",
58
+ :documentation_url => "http://...",
59
+ :license => "...",
60
+ :license_url => "http://...",
61
+ :released => Kronos.parse("...").to_hash,
62
+ :frequency => "daily",
63
+ :period_start => Kronos.parse("...").to_hash,
64
+ :period_end => Kronos.parse("...").to_hash,
65
+ :organization => {
66
+ :name => "", # organization that provides data
67
+ }
68
+ :downloads => [{
69
+ :url => "http://..."
70
+ :format => "xml",
71
+ }] # include as many download formats as appropiate
72
+ :custom => {},
73
+ :raw => {},
74
+ :catalog_name => "...",
75
+ :catalog_url => "http://...",
76
+ }
77
+
78
+ Note that most of these parameters match up with the properties defined for a [Source in the National Data Catalog API](http://github.com/sunlightlabs/datacatalog-api/blob/master/resources/sources.rb). These parameters are just passed along to the API, which will validate the values.
79
+
80
+ The remaining parameters (`organization` and `downloads`) are handled by the importer framework:
81
+
82
+ * The organization sub-hash is used to lookup or create the associated organization for the source. Then a `organization_id` key/value pair is sent to the API.
83
+
84
+ * The downloads array is used to lookup or create the associate download formats for a data source.
85
+
86
+ You may have noticed the use of `Kronos.parse` above. We highly recommend the use of the [kronos library](http://github.com/djsun/kronos) for the parsing of dates.
87
+
88
+ **organization parameter**
89
+
90
+ `@handler.organization()` expects a hash parameter of this shape:
91
+
92
+ {
93
+ :name => "",
94
+ :acronym => "",
95
+ :url => "http://...",
96
+ :description => "",
97
+ :org_type => "governmental",
98
+ :organization => {
99
+ :name => "", # parent organization, if any
100
+ :url => "",
101
+ }
102
+ :catalog_name => "...",
103
+ :catalog_url => "http://...",
104
+ }
105
+
106
+ Note that most of these parameters match up with the properties defined for an [Organization in the National Data Catalog API](http://github.com/sunlightlabs/datacatalog-api/blob/master/resources/organizations.rb). These parameters are just passed along to the API, which will validate the values.
107
+
108
+ The remaining parameter, `organization`, is handled by the importer framework. The framework just looks up the parent organization using the name or url. It then sends `parent_id` with the associated parent organization id to the API.
109
+
110
+ ### You're Done / Best Practices
111
+
112
+ That's it. But before you go hacking away, let me say a few words about best practices:
113
+
114
+ * If you are scraping a web site, we highly recommend caching the raw HTML files in your importer. Our production importers are queued up using the NDC Importer System, which integrates nicely with git. It keeps a record of the raw HTML files that correspond to each run. This makes it easier to debug if and when things go wrong.
115
+
116
+ * Take advantage of the utility functions in [/lib/utility.rb](http://github.com/sunlightlabs/datacatalog-importer/blob/master/lib/utility.rb). If you have suggestions about useful utility functions, please let us know.
117
+
118
+ * It goes without saying, but please follow best Ruby practices and make a good faith effort at writing clean code. Follow the conventions of the community and strive to make your code readable by other people.
119
+
120
+ And thanks for helping us feed the National Data Catalog!
121
+
122
+ ### Talk to Us / Stay Up To Date
123
+
124
+ Please reach out to us on our [National Data Catalog Google Group](http://groups.google.com/group/datacatalog). We can help you with your importer. Once it works reliably, we will want to add it to our importer system. The more up-to-date, relevant government data we bring in, the more useful our data catalog becomes.
125
+
126
+ This document is adapted from the README in the [datacatalog-importer source code repository](http://github.com/sunlightlabs/datacatalog-importer). You can find the latest version there.
metadata CHANGED
@@ -1,13 +1,13 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: datacatalog-importer
3
3
  version: !ruby/object:Gem::Version
4
- hash: 23
4
+ hash: 21
5
5
  prerelease: false
6
6
  segments:
7
7
  - 0
8
8
  - 2
9
- - 0
10
- version: 0.2.0
9
+ - 1
10
+ version: 0.2.1
11
11
  platform: ruby
12
12
  authors:
13
13
  - David James
@@ -15,7 +15,7 @@ autorequire:
15
15
  bindir: bin
16
16
  cert_chain: []
17
17
 
18
- date: 2010-07-08 00:00:00 -04:00
18
+ date: 2010-08-25 00:00:00 -04:00
19
19
  default_executable:
20
20
  dependencies:
21
21
  - !ruby/object:Gem::Dependency
@@ -96,6 +96,7 @@ files:
96
96
  - lib/sort_yaml_hash.rb
97
97
  - lib/tasks.rb
98
98
  - lib/utility.rb
99
+ - natdat_is_hungry.md
99
100
  - spec/spec.opts
100
101
  - spec/spec_helper.rb
101
102
  - spec/utility_spec.rb