RubyGems - stead - Versions diffs - 0.0.2 - Mend

stead 0.0.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (19)

data/LICENSE +21 -0
data/README.rdoc +75 -0
data/Rakefile +72 -0
data/VERSION +1 -0
data/bin/csv2ead +70 -0
data/examples/ncsu.rb +74 -0
data/lib/stead/ead.rb +270 -0
data/lib/stead/error.rb +6 -0
data/lib/stead/stead.rb +80 -0
data/lib/stead/templates/ead.xml +44 -0
data/lib/stead/templates/ead.xsd +2728 -0
data/lib/stead/templates/ncsu_ead.xml +69 -0
data/lib/stead.rb +56 -0
data/test/helper.rb +25 -0
data/test/test_ead_bad_container_type.rb +42 -0
data/test/test_ead_no_series.rb +89 -0
data/test/test_ead_series.rb +42 -0
data/test/test_stead.rb +43 -0
metadata +167 -0

data/LICENSE ADDED

@@ -0,0 +1,21 @@
+Copyright (c) 2009 North Carolina State University
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.rdoc ADDED

@@ -0,0 +1,75 @@
+= stead
+Spreadsheets To Encoded Archival Description. Turns CSV files of container lists
+into a stub EAD XML record.
+== Story
+Sometimes donors have spreadsheets which list the contents of their collections.
+Rather than retype all of these into Archivists' Toolkit or an XML editor,
+wouldn't it be nice to automatically generate a stub EAD XML document from the
+spreadsheet?
+With Stead you can. Just edit the headers (first row of the spreadsheet) to
+conform to the Stead schema. This may involve splitting some columns to conform
+to the schema, adding columns, and other editing. All of this is likely easier,
+faster and more accurate to do in a spreadsheet than retyping the whole thing
+elsewhere.
+Once the spreadsheet is ready just save it as a CSV and use the commandline tool
+csv2ead to output an EAD XML document. Import into Archivists' Toolkit.
+== Requirements
+Ruby
+== Examples that follow the schema
+Look in test/container_lists/ at the following good examples of the CSV schema:
+mc00000_container_list.csv
+mc00000_container_list_no_series.csv
+The order of the columns does not matter, but the headings must be exactly the
+same case and spaces as those found in these files.
+== Instructions
+Once you have your spreadsheet in the correct schema, do the following:
+- Save the spreadsheet as a CSV file.
+- csv2ead --help for current commandline options.
+== Stead::Extra
+From the commandline you can specify a Stead::Extra class which will be required.
+This class must define a Stead::Extra.run method which accepts an ead and eadid,
+creates a new Stead::Extra object and then does any further processing you'd
+like. See examples/ncsu.rb.
+== Support
+Please let me know what else you need in such a tool and I'll try to work it in.
+== Limitations
+- Some of this is still NCSU and Archivists' Toolkit specific.
+- This tool has only been used a handful of times so far.
+- Only works with this specific schema.
+- Only known to work with series at the c01 level and files at the c02 level.
+Other deeper levels of nesting will not currently work. (May work with subseries.)
+- Column values like series must be duplicated for each row.
+== TODO
+- More tests (though there are already lots of tests).
+- Better documentation on the CSV file schema.
+- Rdoc.
+- Automate tests of csv2ead tool.
+- Expand the schema to other parts of the EAD?
+== Author
+Jason Ronallo
+== Copyright
+Copyright (c) 2010 North Carolina State University. See LICENSE for details.
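
The README's rule that headings must match in exact case and spacing can be checked before running csv2ead. The following is a minimal sketch and not part of the gem: `KNOWN_HEADERS` is collected from the column names that lib/stead/ead.rb reads in this version and may be incomplete, `unknown_headers` is a hypothetical helper name, and Ruby's standard csv library stands in for FasterCSV.

```ruby
require 'csv'

# Column names read by lib/stead/ead.rb in stead 0.0.2. Assumed complete
# for this sketch; the example CSV files remain the authoritative list.
KNOWN_HEADERS = [
  'series number', 'series title', 'series dates',
  'file id', 'file title', 'file dates',
  'extent', 'note1', 'note2',
  'container 1 type', 'container 1 number',
  'container 2 type', 'container 2 number',
  'container 3 type', 'container 3 number',
  'instance type', 'scopecontent',
  'conditions governing access', 'internal only'
].freeze

# Hypothetical helper: returns the headers that do not match exactly.
def unknown_headers(csv_text)
  headers = CSV.parse(csv_text, headers: true).headers
  headers.reject { |h| KNOWN_HEADERS.include?(h) }
end

sample = "file title,File Dates,container 1 type\nLetters,1901,Box\n"
puts unknown_headers(sample).inspect # prints ["File Dates"]
```

An empty result suggests every column will be recognised; anything listed is probably a case or spacing mismatch.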

data/Rakefile ADDED

@@ -0,0 +1,72 @@
+require 'rubygems'
+require 'rake'
+begin
+  require 'jeweler'
+  Jeweler::Tasks.new do |gem|
+    gem.name = "stead"
+    gem.summary = %Q{Spreadsheets To Encoded Archival Description}
+    gem.description = %Q{Converts CSV files of a specific schema into EAD XML.}
+    gem.email = "jronallo@gmail.com"
+    gem.homepage = "http://github.com/jronallo/stead"
+    gem.authors = ["Jason Ronallo"]
+    gem.add_dependency "nokogiri", ">= 1.4.1"
+    gem.add_dependency "fastercsv", ">= 1.5.0"
+    gem.add_dependency "activesupport", ">= 2.3.5"
+    gem.add_dependency "trollop", ">= 1.16.2"
+    gem.add_development_dependency "shoulda", ">= 0"
+    gem.files = FileList["[A-Z]*", "{bin,examples,lib}/**/*"]
+    # gem is a Gem::Specification... see http://www.rubygems.org/read/chapter/20 for additional settings
+  end
+  Jeweler::GemcutterTasks.new
+rescue LoadError
+  puts "Jeweler (or a dependency) not available. Install it with: gem install jeweler"
+end
+require 'rake/testtask'
+Rake::TestTask.new(:test) do |test|
+  test.libs << 'lib' << 'test'
+  test.pattern = 'test/**/test_*.rb'
+  test.verbose = true
+end
+begin
+  require 'rcov/rcovtask'
+  Rcov::RcovTask.new do |test|
+    test.libs << 'test'
+    test.pattern = 'test/**/test_*.rb'
+    test.verbose = true
+  end
+rescue LoadError
+  task :rcov do
+    abort "RCov is not available. In order to run rcov, you must: sudo gem install spicycode-rcov"
+  end
+end
+task :test => :check_dependencies
+begin
+  require 'reek/adapters/rake_task'
+  Reek::RakeTask.new do |t|
+    t.fail_on_error = true
+    t.verbose = false
+    t.source_files = 'lib/**/*.rb'
+  end
+rescue LoadError
+  task :reek do
+    abort "Reek is not available. In order to run reek, you must: sudo gem install reek"
+  end
+end
+task :default => :test
+require 'rake/rdoctask'
+Rake::RDocTask.new do |rdoc|
+  version = File.exist?('VERSION') ? File.read('VERSION') : ""
+  rdoc.rdoc_dir = 'rdoc'
+  rdoc.title = "stead #{version}"
+  rdoc.rdoc_files.include('README*')
+  rdoc.rdoc_files.include('lib/**/*.rb')
+end

data/VERSION ADDED

@@ -0,0 +1 @@
+0.0.2

data/bin/csv2ead ADDED

@@ -0,0 +1,70 @@
+#!/usr/bin/env ruby
+$LOAD_PATH.unshift File.join(File.dirname(__FILE__), '..', 'lib')
+require 'pp'
+require 'stead'
+require 'trollop'
+opts = Trollop::options do
+  banner <<-EOS
+This script takes a csv file with a name in the format <eadid>_container_list.csv
+and creates a stub EAD XML document.
+Usage:
+  csv2ead --csv /path/to/<eadid>_container_list.csv [options]
+where options are:
+EOS
+  opt :csv, "A CSV file", :required => true, :type => String
+  opt :baseurl, 'Base URL for adding on the eadid', :type => String
+  opt :url, 'Full URL for this collection guide', :type => String
+  opt :template, 'Specify using a different EAD XML template', :type => String
+  opt :ncsu, 'Use NCSU specific template'
+  opt :extra, 'Full path to a Stead::Extra file to add in other data', :type => String
+  opt :output, 'Save the file by specifying the filename', :type => String
+  opt :pretty, 'If --output is specified this will pretty indent the container list.'
+  opt :stdout, 'Output full EAD to terminal'
+end
+unless opts[:output] or opts[:stdout]
+  puts "You must specify either --output <file> and/or --stdout to direct output to the terminal."
+  exit
+end
+if opts[:ncsu]
+  opts[:template] = File.join(File.dirname(__FILE__), '..', 'lib', 'stead', 'templates', 'ncsu_ead.xml')
+  opts[:baseurl] = 'http://www.lib.ncsu.edu/findingaids'
+  opts[:extra] = File.join(File.dirname(__FILE__), '..', 'examples', 'ncsu.rb')
+end
+ead_options = {}
+# add eadid from filename
+# basename will include _container_list so we need to remove that
+basename = File.basename(opts[:csv], '.csv')
+ead_options[:eadid] = basename.sub(/_container_list.*$/, '')
+ead_options[:base_url] = opts[:baseurl] if opts[:baseurl]
+[:template, :url].each do |key|
+  ead_options[key] = opts[key] if opts[key]
+end
+ead_generator = Stead::EadGenerator.from_csv(File.read(opts[:csv]), ead_options)
+ead = ead_generator.to_ead
+# add any extra content or elements to the EAD before outputting
+if opts[:extra]
+  require opts[:extra]
+  Stead::Extra.run(ead, ead_options[:eadid])
+end
+if opts[:output]
+  File.open(opts[:output], 'w') do |fh|
+    if opts[:pretty]
+      fh.puts Stead.pretty_write(ead)
+    else
+      fh.puts ead
+    end
+  end
+end
+puts Stead.pretty_write(ead) if opts[:stdout]
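
The script derives the eadid from the CSV filename. A small sketch of that step on its own, using only the two calls shown in the script; `eadid_from_csv_path` is a hypothetical helper name and the paths are made up for illustration.

```ruby
# Hypothetical helper mirroring how bin/csv2ead derives the eadid.
def eadid_from_csv_path(path)
  basename = File.basename(path, '.csv')
  # Drop "_container_list" and anything after it.
  basename.sub(/_container_list.*$/, '')
end

puts eadid_from_csv_path('/tmp/mc00000_container_list.csv')           # mc00000
puts eadid_from_csv_path('/tmp/mc00000_container_list_no_series.csv') # mc00000
puts eadid_from_csv_path('/tmp/inventory.csv')                        # inventory
```

A file that does not follow the `<eadid>_container_list.csv` convention is not rejected; its whole basename becomes the eadid.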

data/examples/ncsu.rb ADDED

@@ -0,0 +1,74 @@
+module Stead
+  class Extra
+    attr_accessor :ead, :eadid
+    def initialize(ead,eadid)
+      @ead = ead
+      @eadid = eadid
+    end
+    def self.run(ead, eadid)
+      extra = self.new(ead,eadid)
+      extra.add_collection_specific
+      ead
+    end
+    def add_collection_specific
+      if eadid.include?('ua')
+        # add additional conditions governing use note
+        add_ua_userestrict(ead)
+        append_to_titleproper(ead, eadid, 'Records')
+        archdesc_level(ead, 'subgrp')
+      elsif eadid.include?('mc')
+        append_to_titleproper(ead, eadid, 'Papers')
+        archdesc_level(ead, 'collection')
+      end
+    end
+    def archdesc_level(ead, content)
+      archdesc = ead.xpath('//xmlns:archdesc').first
+      archdesc['level'] = content
+    end
+    def add_ua_userestrict(ead)
+      first_userestrict = ead.xpath('//xmlns:userestrict').first
+      userestrict = Nokogiri::XML::Node.new('userestrict', ead)
+      first_userestrict.add_next_sibling(userestrict)
+      head = Nokogiri::XML::Node.new('head', ead)
+      head.content = 'Confidentiality Notice'
+      p = Nokogiri::XML::Node.new('p', ead)
+      p.content = <<EOF
+          This collection may contain materials with sensitive or confidential
+information that is protected under federal or state right to privacy laws and
+regulations. Researchers are advised that the disclosure of certain information
+pertaining to identifiable living individuals represented in this collection
+without the consent of those individuals may have legal ramifications (e.g.,
+a cause of action under common law for invasion of privacy may arise if facts
+concerning an individual's private life are published that would be deemed
+highly offensive to a reasonable person) for which North Carolina State
+University assumes no responsibility.
+EOF
+      userestrict.add_child(head)
+      userestrict.add_child(p)
+    end
+    def append_to_titleproper(ead, eadid, text)
+      titleproper = ead.xpath('//xmlns:titleproper').first
+      better_titleproper = titleproper.content.strip.chomp + ' ' + text
+      titleproper.content = better_titleproper
+      num = Nokogiri::XML::Node.new('num', ead)
+      better_num = eadid.upcase.gsub('_', '.')
+      num.content = better_num
+      titleproper.add_child(num)
+      # now also add to archdesc did
+      archdesc_did = ead.xpath('//xmlns:archdesc/xmlns:did').first
+      unittitle = archdesc_did.xpath('xmlns:unittitle').first
+      unittitle.content = better_titleproper
+      unitid = archdesc_did.xpath('xmlns:unitid').first
+      unitid.content = better_num
+    end
+  end
+end
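
The decisions this example makes from the eadid (title suffix, archdesc level, and the formatted number) can be tried without an EAD document. A minimal sketch under that assumption; `collection_profile` is a hypothetical helper and does not exist in the gem.

```ruby
# Hypothetical helper: the eadid-based decisions from examples/ncsu.rb,
# pulled out so they can be inspected without a Nokogiri document.
def collection_profile(eadid)
  if eadid.include?('ua')
    suffix, level = 'Records', 'subgrp'
  elsif eadid.include?('mc')
    suffix, level = 'Papers', 'collection'
  else
    return nil # ncsu.rb leaves the EAD untouched in this case
  end
  # Same formatting as better_num: upper case, underscores to dots.
  { :title_suffix => suffix, :level => level, :num => eadid.upcase.gsub('_', '.') }
end

p collection_profile('mc00000')
p collection_profile('ua_023_001')
```

Because the check is a substring match and 'ua' is tested first, an id such as `mc_quail` would take the 'ua' branch.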

data/lib/stead/ead.rb ADDED

@@ -0,0 +1,270 @@
+module Stead
+  class EadGenerator
+    attr_accessor :csv, :ead, :template, :series, :component_parts
+    def initialize(opts = {})
+      @csv = opts[:csv] || nil
+      @template = pick_template(opts)
+      @eadid = opts[:eadid] if opts[:eadid]
+      @base_url, @url = opts[:base_url], opts[:url] # @url is read by add_eadid_url
+      # component_parts are the rows in the csv file
+      @component_parts = csv_to_a
+    end
+    def pick_template(opts)
+      if opts[:template]
+        Nokogiri::XML(File.read(opts[:template]))
+      else
+        Stead.ead_template_xml
+      end
+    end
+    def self.from_csv(csv, opts={})
+      lines = csv.split(/\r\n|\n/)
+      100.times do
+        lines[0] = lines.first.gsub(',,', ',nothing,')
+      end
+      csv = lines.join("\n")
+      self.new(opts.merge(:csv => csv))
+    end
+    def eadid_node
+      @ead.xpath('//xmlns:eadid').first
+    end
+    def add_eadid
+      eadid_node.content = @eadid
+    end
+    def add_eadid_url
+      if @base_url
+        eadid_node['url'] = File.join(@base_url, @eadid)
+      elsif @url
+        eadid_node['url'] = @url
+      end
+    end
+    def to_ead
+      @ead = template.dup
+      add_eadid
+      add_eadid_url
+      @dsc = @ead.xpath('//xmlns:archdesc/xmlns:dsc')[0]
+      if series?
+        add_series
+      end
+      @component_parts.each do |cp|
+        c = node(file_component_part_name)
+        c['level'] = 'file'
+        c['audience'] = 'internal' if !cp['internal only'].blank?
+        did = node('did')
+        c.add_child(did)
+        add_did_nodes(cp, did)
+        add_containers(cp, did)
+        add_scopecontent(cp, did)
+        add_accessrestrict(cp, did)
+        add_file_component_part(cp, c)
+      end
+      begin
+        valid?
+      rescue Stead::InvalidEad
+        warn "Invalid EAD"
+        ead
+      end
+      ead
+    end
+    def add_series
+      add_arrangement
+      series = @component_parts.map do |cp|
+        [cp['series number'], cp['series title'], cp['series dates']]
+      end.uniq
+      series.each do |ser|
+        add_arrangement_item(ser)
+        # create series node and add to dsc
+        series_node = node('c01')
+        @dsc.add_child(series_node)
+        series_node['level'] = 'series'
+        # create series did and add to series node
+        series_did = node('did')
+        series_node.add_child(series_did)
+        unitid = node('unitid')
+        unitid.content = ser[0]
+        unittitle = node('unittitle')
+        unittitle.content = ser[1]
+        unitdate = node('unitdate')
+        unitdate.content = ser[2]
+        series_did.add_child(unitid)
+        series_did.add_child(unittitle)
+        series_did.add_child(unitdate)
+      end
+    end
+    def add_arrangement
+      arrangement = node('arrangement')
+      head = node('head')
+      head.content = 'Organization of the Collection'
+      arrangement.add_child(head)
+      p = node('p')
+      p.content = 'This collection is organized into series:'
+      arrangement.add_child(p)
+      list = node('list')
+      p.add_child(list)
+      @dsc.add_previous_sibling(arrangement)
+    end
+    def add_arrangement_item(ser)
+      list = @ead.xpath('//xmlns:arrangement/xmlns:p/xmlns:list').first
+      item = node('item')
+      contents = []
+      ser.each do |ser_part|
+        contents << ser_part unless ser_part.blank?
+      end
+      item.content = contents.join(', ')
+      list.add_child(item)
+    end
+    # metadata is a hash from the @component_part and c is the actual node
+    def add_file_component_part(metadata, c)
+      if series?
+        current_series = find_current_series(metadata)
+        current_series.add_child(c)
+      else
+        @dsc.add_child(c)
+      end
+    end
+    def find_current_series(cp)
+      series_title = cp['series title']
+      @ead.xpath("//xmlns:c01/xmlns:did/xmlns:unittitle").each do |node|
+        return node.parent.parent if node.content == series_title
+      end
+    end
+    def file_component_part_name
+      if series?
+        'c02'
+      else
+        'c01'
+      end
+    end
+    def add_did_nodes(cp, did)
+      field_map.each do |header, element|
+        if !cp[header].blank?
+          if element.is_a? String
+            node = node(element)
+            node.content = cp[header]
+            did.add_child(node)
+          elsif element.is_a? Array
+            node1 = node(element[0])
+            did.add_child(node1)
+            node2 = node(element[1])
+            node1.add_child(node2)
+            node2.content = cp[header]
+          end
+        end
+      end
+    end
+    def add_containers(cp, did)
+      ['1', '2', '3'].each do |container_number|
+        container_type = cp['container ' + container_number + ' type']
+        container_number = cp['container ' + container_number + ' number']
+        if !container_type.blank? and !container_number.blank?
+          unless valid_container_type?(container_type)
+            raise Stead::InvalidContainerType, container_type
+          end
+          container = node('container')
+          container['type'] = container_type
+          container['label'] = cp['instance type'] if cp['instance type']
+          container.content = container_number
+          did.add_child(container)
+        end
+      end
+    end
+    def valid_container_type?(container_type)
+      if Stead::CONTAINER_TYPES.include?(container_type)
+        return true
+      else
+        return false
+      end
+    end
+    def add_scopecontent(cp, did)
+      unless cp['scopecontent'].blank?
+        scopecontent = node('scopecontent')
+        p = node('p')
+        p.content = cp['scopecontent']
+        scopecontent.add_child(p)
+        did.add_next_sibling(scopecontent)
+      end
+    end
+    def add_accessrestrict(cp, did)
+      unless cp['conditions governing access'].blank?
+        accessrestrict = node('accessrestrict')
+        p = node('p')
+        p.content = cp['conditions governing access']
+        accessrestrict.add_child(p)
+        did.add_next_sibling(accessrestrict)
+      end
+    end
+    def node(element)
+      Nokogiri::XML::Node.new(element, @ead)
+    end
+    def field_map
+      {'file id' => 'unitid',
+        'file title' => 'unittitle',
+        'file dates' => 'unitdate',
+        'extent' => ['physdesc', 'extent'],
+        'note1' => ['note', 'p'],
+        'note2' => ['note', 'p']
+      }
+    end
+    def csv_to_a
+      a = []
+      FasterCSV.parse(csv, :headers => :first_row) do |row|
+        a << row.to_hash
+      end
+      if a.first.keys.include?(nil)
+        raise Stead::InvalidCsv
+      end
+      # TODO invalid if the last row is blank
+      #      a.sort_by do |row|
+      #        [
+      #          row['series number'] || 'z',
+      #          row['subseries number'] || 'z',
+      #          row['container 1 number'] || 'z',
+      #          row['container 2 number'] || 'z',
+      #          row['file title'] || 'z'
+      #        ]
+      #      end
+      a
+    end
+    def valid?
+      unless Stead.xsd.valid?(ead)
+        raise Stead::InvalidEad
+      end
+    end
+    def series?
+      if series_found?
+        series = true
+      end
+    end
+    def series_found?
+      @component_parts.each do |row|
+        return false if row['series number'].blank?
+      end
+    end
+  end
+end
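
`EadGenerator.from_csv` fills blank header cells with the placeholder `nothing` and repeats the substitution 100 times. One pass is not enough because `gsub` matches do not overlap: `,,,` becomes `,nothing,,` and still contains an empty cell. The sketch below applies the same substitution but stops once the header line no longer changes; `pad_blank_headers` is a hypothetical name, and the fixed count of 100 in the original is presumably just a generous upper bound.

```ruby
# Hypothetical helper: same substitution as EadGenerator.from_csv, but
# repeated only until the header line stops changing.
def pad_blank_headers(csv_text)
  lines = csv_text.split(/\r\n|\n/)
  loop do
    padded = lines[0].gsub(',,', ',nothing,')
    break if padded == lines[0]
    lines[0] = padded
  end
  lines.join("\n")
end

puts pad_blank_headers("file title,,,extent\nLetters,,,1 folder\n")
# file title,nothing,nothing,extent
# Letters,,,1 folder
```

Only the header line is rewritten; blank cells in data rows are left as they are.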

data/lib/stead/error.rb ADDED

@@ -0,0 +1,6 @@
+module Stead
+  class InvalidContainerType < RuntimeError; end
+  class InvalidEad < RuntimeError; end
+  class InvalidCsv < RuntimeError; end
+end

data/lib/stead/stead.rb ADDED

@@ -0,0 +1,80 @@
+module Stead
+  def self.ead_schema
+    File.expand_path(File.join(File.dirname(__FILE__), 'templates','ead.xsd'))
+  end
+  def self.xsd
+    Nokogiri::XML::Schema(File.read(Stead.ead_schema))
+  end
+  def self.ead_template
+    File.expand_path(File.join(File.dirname(__FILE__), 'templates','ead.xml'))
+  end
+  def self.ead_template_xml
+    Nokogiri::XML(File.read(self.ead_template))
+  end
+  def self.pretty_write(xml)
+    if xml.is_a? String
+      self.write(xml)
+    elsif xml.is_a? Nokogiri::XML::Document or xml.is_a? Nokogiri::XML::Node
+      self.write(xml.to_xml)
+    end
+  end
+   def self.write(buffer)
+      xsl =<<XSL
+<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
+<xsl:output method="xml" encoding="UTF-8"/>
+<xsl:param name="indent-increment" select="' '"/>
+<xsl:template name="newline">
+<xsl:text disable-output-escaping="yes">
+</xsl:text>
+</xsl:template>
+<xsl:template match="comment() | processing-instruction()">
+<xsl:param name="indent" select="''"/>
+<xsl:call-template name="newline"/>
+<xsl:value-of select="$indent"/>
+<xsl:copy />
+</xsl:template>
+<xsl:template match="text()">
+<xsl:param name="indent" select="''"/>
+<xsl:call-template name="newline"/>
+<xsl:value-of select="$indent"/>
+<xsl:value-of select="normalize-space(.)"/>
+</xsl:template>
+<xsl:template match="text()[normalize-space(.)='']"/>
+<xsl:template match="*">
+<xsl:param name="indent" select="''"/>
+<xsl:call-template name="newline"/>
+<xsl:value-of select="$indent"/>
+<xsl:choose>
+<xsl:when test="count(child::*) > 0">
+<xsl:copy>
+<xsl:copy-of select="@*"/>
+<xsl:apply-templates select="*|text()">
+<xsl:with-param name="indent" select="concat ($indent, $indent-increment)"/>
+</xsl:apply-templates>
+<xsl:call-template name="newline"/>
+<xsl:value-of select="$indent"/>
+</xsl:copy>
+</xsl:when>
+<xsl:otherwise>
+<xsl:copy-of select="."/>
+</xsl:otherwise>
+</xsl:choose>
+</xsl:template>
+</xsl:stylesheet>
+XSL
+      doc = Nokogiri::XML(buffer)
+      xslt = Nokogiri::XSLT(xsl)
+      out = xslt.transform(doc)
+      out.to_xml
+    end
+end
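
`Stead.pretty_write` indents the EAD by running it through an XSLT stylesheet with Nokogiri. For comparison only, here is a sketch of the same goal using a different technique: REXML's built-in pretty formatter. It is not what the gem does, `pretty_xml` is a hypothetical helper, and its whitespace handling will not match the stylesheet's output exactly.

```ruby
require 'rexml/document'

# Hypothetical helper: indent XML with REXML instead of Nokogiri + XSLT.
def pretty_xml(xml_string)
  doc = REXML::Document.new(xml_string)
  formatter = REXML::Formatters::Pretty.new(2) # two-space indent
  formatter.compact = true                     # keep short text inline
  out = String.new
  formatter.write(doc, out)
  out
end

puts pretty_xml('<dsc><c01 level="file"><did><unittitle>Letters</unittitle></did></c01></dsc>')
```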