sm-transcript 0.0.4 → 0.0.6

Files changed (48)

data/README.txt +138 -118
data/Rakefile +21 -10
data/bin/sm-transcript +0 -0
data/lib/sm_transcript/metadata.rb +25 -0
data/lib/sm_transcript/options.rb +9 -3
data/lib/sm_transcript/runner.rb +6 -0
data/lib/sm_transcript/seg_reader.rb +1 -1
data/lib/sm_transcript/transcript.rb +86 -39
data/lib/sm_transcript/ttml_reader.rb +116 -0
data/lib/sm_transcript/word.rb +6 -4
data/lib/sm_transcript/wrd_reader.rb +5 -4
data/test/results/18.03-2004-L01.align2.wrd +6441 -0
data/test/results/8.01-1999-L01.wrd +5182 -0
data/test/results/801-1stLecture.ttml.xml +757 -0
data/test/results/801-lect01-4730.xml +757 -0
data/test/results/801-lect02-4731.xml +886 -0
data/test/results/801-lect03-4732.xml +818 -0
data/test/results/801-lect04-4733.xml +831 -0
data/test/results/801-lect05-4734.xml +879 -0
data/test/results/801-lect06-4735.xml +822 -0
data/test/results/801-lect07-4736.xml +893 -0
data/test/results/801-lect08-4737.xml +809 -0
data/test/results/801-lect09-4738.xml +807 -0
data/test/results/Audio-Open-The_New_Deal_for_Education.xml +4301 -0
data/test/test_metadatareader.rb +8 -3
data/test/test_options.rb +8 -1
data/test/test_runner.rb +34 -1
data/test/test_transcript.rb +109 -12
data/test/test_ttmlreader.rb +104 -0
data/test/test_wrdreader.rb +24 -9
metadata +47 -148
data/lib/sm_transcript/optparseExample.rb +0 -113
data/lib/sm_transcript/process_csv_files_to_html.rb +0 -58
data/lib/sm_transcript/process_seg_files.rb +0 -21
data/lib/sm_transcript/process_seg_files_to_csv.rb +0 -24
data/lib/sm_transcript/process_seg_files_to_html.rb +0 -31
data/lib/sm_transcript/require_relative.rb +0 -14
data/test/transcripts/GardnerRileyInterview.t1.html +0 -247
data/test/transcripts/IIHS_Diane_Davis_Nov2009-t1.html +0 -148
data/test/transcripts/NERCOMP-SpokenMedia4.t1.html +0 -2178
data/test/transcripts/data.js +0 -24
data/test/transcripts/vijay_kumar-1.-t1.html +0 -557
data/test/transcripts/vijay_kumar-1.t1.html +0 -558
data/test/transcripts/vijay_kumar-t1.html +0 -558
data/test/transcripts/vijay_kumar-t1.ttml +0 -570
data/test/transcripts/vijay_kumar.data.js +0 -2
data/test/transcripts/vijay_kumar.t1.html +0 -557
data/test/transcripts/wirehair-beetle.data.js +0 -24

data/README.txt CHANGED

@@ -1,140 +1,160 @@
-$Id: README.txt 194 2010-03-28 00:09:23Z pwilkins $
+$Id: README.txt 196 2010-06-11 18:51:18Z pwilkins $
 sm-transcript reads results of SLS processing and produces transcripts for
 the SpokenMedia browser.  For each file in the source folder whose extension
 matches the source type, a file of destination type is created in the
-destination folder.  All of these parameters have default values.
+destination folder.  All of these parameters have default values.
+Note: Examples of the commands you enter in the terminal are for *nix.  The
+command prompt in the examples is:
+felix$ <command line>
+If you are a Windows user, make the usual adjustments.
 Requirements:
-	sm-transcript is written in Ruby and packaged as a RubyGem.  Since Ruby is
-	not a compiled language, you will need to have Ruby installed on your
-	machine to run sm-transcript.  You can determine if Ruby is installed by
-	typing "ruby -v" at a terminal prompt.  It should return the version of
-	Ruby that is installed.  If Ruby is not installed on your machine, contact
-	me (or your local Ruby wizard) for assistance.
+   sm-transcript is written in Ruby and packaged as a RubyGem.  Since Ruby is
+   not a compiled language, you will need to have Ruby installed on your
+   machine to run sm-transcript.  You can determine if Ruby is installed by
+   typing "ruby -v" at a terminal prompt.  It should return the version of
+   Ruby that is installed.  If Ruby is not installed on your machine, contact
+   me (or your local Ruby wizard) for assistance.
 Installation:
-	You can get sm-transcript as either a RubyGem or as source from svn.
-	The preferred way to install this package is as a Rubygem.  You can
-	download and install the gem with this command:
-	sudo gem install [--verbose] sm-transcript
-	This command downloads the most recent version of the gem from rubygems.org
-	and makes it active.  Previous versions of the gem remain installed, but
-	are deactivated.
-	You must use "sudo" to properly install the gem.  If you execute "gem
-	install" (omitting the "sudo") the gem is installed in your home gem
-	repository and it isn't in your path without additional configuration.
-	Note: You need sudo privileges to run the command as written.  If you
-	can't sudo, then you can install it locally and will need some additional
-	configuration.  Contact me (or your local Ruby wizard) for assistance.
-	The executable is now in your path.
-	You can cleanly uninstall the gem with this command:
-	sudo gem uninstall sm-transcript
-	If you have access to our svn repository, you are welcome to check out the
-	code.  Be warned that the trunk tip is not necessarily stable.  It changes
-	frequently as enhancements (and bug fixes) are added. (note that the
-	'smb_transcript' in the command line below is not a typo. )
-	svn co svn+ssh://svn.mit.edu/oeit-tsa/SMB/smb_transcript/trunk sm_transcript
-	build the gem by running this command from the directory you installed the
-	source.
-	rake gem
-	The gem will be built and put in ./pkg   You can now use the gem
-	installation instructions above.
+   You can get sm-transcript as either a RubyGem or as source from svn.
+   The preferred way to install this package is as a Rubygem.  You can
+   download and install the gem with this command:
+   felix$ sudo gem install [--verbose] sm-transcript
+   This command downloads the most recent version of the gem from rubygems.org
+   and makes it active.  Previous versions of the gem remain installed, but
+   are deactivated.
+   You must use "sudo" to properly install the gem.  If you execute "gem
+   install" (omitting the "sudo") the gem is installed in your home gem
+   repository and it isn't in your path without additional configuration.
+   Note: You need sudo privileges to run the command as written.  If you
+   can't sudo, then you can install it locally and will need some additional
+   configuration.  Contact me (or your local Ruby wizard) for assistance.
+   The executable is now in your path.
+   You can cleanly uninstall the gem with this command:
+   felix$ sudo gem uninstall sm-transcript
+   If you have access to our svn repository, you are welcome to check out the
+   code.  Be warned that the trunk tip is not necessarily stable.  It changes
+   frequently as enhancements (and bug fixes) are added. (note that the
+   'smb_transcript' in the command line below is not a typo.)
+   svn co svn+ssh://svn.mit.edu/oeit-tsa/SMB/smb_transcript/trunk sm_transcript
+   Build the gem by running this command from the directory where you
+   installed the source.  This is what it looks like on my machine:
+   felix$ rake gem
+   The gem will be built and put in ./pkg   You can now use the gem
+   installation instructions above.
 Using the App:
-	Run with no command line parameters, the app reads *.wrd files out of
-	./results and writes *t1.html files to ./transcripts.  These directories
-	are relative to where sm_transcript is called.
-	Note: destination files are overwritten without a warning prompt.  If you
-	want to preserve an existing output file, rename it before running the app
-	again.
-	For example, run the app by navigating to the bin folder and running
-		projects/sm_transcript/bin felix$ sm_transcript
-	This command run from this folder will read *.wrd files from bin/results
-	and write *-t1.html to bin/transcripts.
-	Usage: sm_transcript [options]
-     --srcdir PATH         Read files from this folder (Default: ./results)
-     --destdir PATH        Write files to this folder (Default: ./transcripts)
-     --srctype wrd | seg   Kind of file to process (Default: wrd)
-     --desttype html | ttml | datajs  Kind of file to output (Default: html)
-     -h, --help            Show this message
+   Run with no command line parameters, the app reads *.wrd files out of
+   ./results and writes *t1.html files to ./transcripts.  These directories
+   are relative to where sm_transcript is called.
+   Note: destination files are overwritten without a warning prompt.  If you
+   want to preserve an existing output file, rename it before running the app
+   again.
+   For example, run the app by navigating to the bin folder and enter
+      projects/sm_transcript/bin felix$ sm_transcript
+   This command run from this folder will read *.wrd files from bin/results
+   and write *-t1.html to bin/transcripts.
+   Usage: sm_transcript [options]
+    --srcdir PATH        Read files from this folder (Default: ./results)
+    --destdir PATH       Write files to this folder (Default: ./transcripts)
+    --srctype wrd | seg | txt | ttml   Kind of file to process (Default: wrd)
+    --desttype html | ttml | datajs | json  Kind of file to output (Default: html)
+    -h, --help            Show this message
 Troubleshooting:
-	sm-transcript requires additional gems to operate.  The RubyGem
-	installation should install dependencies automatically, but when it
-	doesn't, you get an error that includes
-	... no such file to load -- builder (LoadError)
-	in the first few lines when you run sm-transcript, the problem is a
-	missing dependent gem.  (the error above indicates that the Builder
-	gem is missing.)  Try installing the missing gem.  For the error above,
-	command looks like this:
-	sudo gem install builder
-	See "Required Gems" below for more information.
+   sm-transcript requires additional gems to operate.  The RubyGem
+   installation should install dependencies automatically, but when it
+   doesn't, you get an error that includes
+   ... no such file to load -- builder (LoadError)
+   in the first few lines when you run sm-transcript, the problem is a
+   missing dependent gem.  (the error above indicates that the Builder
+   gem is missing.)  Try installing the missing gem.  For the error above,
+   the command looks like this on my computer:
+   felix$ sudo gem install builder
+   See "Required Gems" below for more information.
+   A warning message such as:
+      "WARNING: Nokogiri was built against LibXML version 2.7.6,
+      but has dynamically loaded 2.7.7"
+   may be safely ignored.
 Upgrading:
-	You can easily upgrade by simply executing the same command you used to
-	install the gem.  Running install again will add the newer version and make
-	it active.  By default the most recent version is used, but older versions
-	are still available, simply inactive.
-	If are using svn, you should already know what to do.
+   You can easily upgrade by simply executing the same command you used to
+   install the gem.  Running install again will add the newer version and make
+   it active.  By default the most recent version is used, but older versions
+   are still available, simply inactive.
+   If you are using svn, you should already know what to do.
 Required Gems:
-	builder    - create structured data, such as XML
-	extensions - added for the 'require_relative' command.  (To get this
-	             command in Ruby 1.8 you need to install this gem, for Ruby 1.9
-	             the command is already part of the core.)
-	htmlentities - html parsing
-	json       - create JSON structured data
-	optparse   - option parsing of command line
-	ostruct    - open data structures
-	ppcommand  - pp is a pretty printer.  It is used only for debugging
-	rake       - make for Ruby
-	rubygems   - support for gems (shouldn't be needed for Ruby 1.9)
-	shoulda    - enhancement for Test::Unit
-	This command installs gems on OSX and Linux:
-	felix$ sudo gem install <gem name>
+   builder    - create structured data, such as XML
+   extensions - added for the 'require_relative' command.  (To get this
+                command in Ruby 1.8 you need to install this gem, for Ruby 1.9
+                the command is already part of the core.)
+   htmlentities - html parsing
+   json       - create JSON structured data
+   optparse   - option parsing of command line
+   ostruct    - open data structures
+   ppcommand  - pp is a pretty printer.  It is used only for debugging
+   rake       - make for Ruby
+   rubygems   - support for gems (shouldn't be needed for Ruby 1.9)
+   shoulda    - enhancement for Test::Unit
+   This command installs gems on OSX and Linux:
+   felix$ sudo gem install <gem name>
 Unit Tests:
-	You may run all unit tests by navigating to the test folder and running
-	rake with no parameters (the default rake task runs all tests):
+   You may run all unit tests by navigating to the test folder and running
+   rake with no parameters (the default rake task runs all tests).  On my
+   computer, it looks like this:
-	projects/sm_transcript/test felix$ rake
+   projects/sm_transcript/test felix$ rake
 Release Notes:
-	Initial Version - runs under Ruby 1.8.
+   Initial Version - runs under Ruby 1.8.x.
+   version 0.0.4 - fixes bug when processing .WRD files with CRLF line
+   endings.
+   version 0.0.5 - added srctype of ttml and desttype of json, fixed bug
+   where beginning time of word was actually for previous word.
 To Do:
-	update code to run under Ruby 1.9
+   specify individual files for processing rather than folders
+   update code to run under Ruby 1.9
-	Make this a rubygem, making it available from an OEIT server, rather than
-	from a public gem repository like RubyForge.
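Note on the README: the source-to-destination mapping it describes (one output file per matching input file) can be sketched as follows. This is a minimal illustration, not code from the gem. The method name `destination_paths` and the `suffix` parameter are hypothetical, and the `-t1.html` suffix is taken from the README's "write *-t1.html" wording.

```ruby
# Hypothetical helper: pair each source file with the destination path
# the README describes.  Not part of sm-transcript itself.
def destination_paths(srcdir, destdir, srctype: 'wrd', suffix: '-t1.html')
  Dir.glob(File.join(srcdir, "*.#{srctype}")).sort.map do |src|
    # Strip only the source extension, so "a.align2.wrd" keeps ".align2".
    stem = File.basename(src, ".#{srctype}")
    [src, File.join(destdir, stem + suffix)]
  end
end
```

Because destination files are overwritten without a prompt, printing these pairs before a run is one way to see which files would be replaced.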

data/Rakefile CHANGED

@@ -1,31 +1,42 @@
-# $Id: Rakefile 195 2010-04-15 17:29:55Z pwilkins $
+# $Id: Rakefile 196 2010-06-11 18:51:18Z pwilkins $
 require 'rake/gempackagetask'
 require 'rake'
-spec = Gem::Specification.new do |s|
+spec = Gem::Specification.new do |s|
   s.name       = "sm-transcript"
   s.summary    = "Convert word lists to transcripts"
   s.description= File.read(File.join(File.dirname(__FILE__), 'README.txt'))
   s.requirements = [ 'TBD' ]
-  s.version     = "0.0.4"
+  s.version     = "0.0.6"
   s.author      = "Peter Wilkins"
   s.email       = "pwilkins@mit.edu"
   s.homepage    = "http://spokenmedia.mit.edu"
   s.platform    = Gem::Platform::RUBY
   s.required_ruby_version = '>=1.8'
   s.files       = Dir['lib/**/**'] +
-                  Dir['bin/sm-transcript'] +
-                  Dir['bin/results/PLACEHOLDER.txt'] +
-                  Dir['bin/transcripts/PLACEHOLDER.txt'] +
-                  Dir['test/**/**'] +
+                  Dir['bin/sm-transcript'] +
+                  Dir['bin/results/PLACEHOLDER.txt'] +
+                  Dir['bin/transcripts/PLACEHOLDER.txt'] +
+                  Dir['test/*'] +
+                  Dir['test/results/*'] +
+                  Dir['test/transcripts/PLACEHOLDER.txt'] +
                   Dir['README.txt'] +
                   Dir['LICENSE.txt'] +
-                  Dir['Rakefile']
-  s.files.reject! { |fn| fn.include? "process_" }
+                  Dir['Rakefile']
+  s.files.reject! { |fn| fn.include? "process_" }
+  s.files.reject! { |fn| fn.include? 'lect1' }
+  s.files.reject! { |fn| fn.include? 'lect2' }
+  s.files.reject! { |fn| fn.include? 'lect3' }
+  s.files.reject! { |fn| fn.include? 'file-chksum.rb' }
+  s.files.reject! { |fn| fn.include? 'html_tokenizer-example.rb' }
+  s.files.reject! { |fn| fn.include? 'optparseExample.rb' }
+  s.files.reject! { |fn| fn.include? 'xml_to_sqlite.rb' }
+  s.files.reject! { |fn| fn.include? 'require_relative.rb' }
+  s.files.reject! { |fn| fn.include? '801-lect1.*' }
   s.executables = [ 'sm-transcript' ]
   s.test_files  = Dir["test/test*.rb"]
   s.has_rdoc    = false
 end
 Rake::GemPackageTask.new(spec).define
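Note on the Rakefile: the run of `s.files.reject!` calls could be collapsed into one pass over a list of fragments. The sketch below is an assumption about intent, not the author's code; `EXCLUDED_FRAGMENTS` and `packaged_files` are hypothetical names. `String#include?` does a literal substring test, so the final pattern `'801-lect1.*'` would only match a name containing those exact characters. It is left out here on the assumption that `'lect1'` already covers the intended files.

```ruby
# Hypothetical consolidation of the reject! calls in the Rakefile.
EXCLUDED_FRAGMENTS = %w[
  process_ lect1 lect2 lect3
  file-chksum.rb html_tokenizer-example.rb optparseExample.rb
  xml_to_sqlite.rb require_relative.rb
].freeze

# Return the candidates whose path contains none of the excluded fragments.
def packaged_files(candidates, excluded = EXCLUDED_FRAGMENTS)
  candidates.reject do |fn|
    excluded.any? { |fragment| fn.include?(fragment) }
  end
end
```

In the gemspec this would be used as `s.files = packaged_files(s.files)`. Since the match is by substring, `'lect1'` also excludes names such as `801-lect10-...xml` but keeps `801-lect01-4730.xml`.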

data/bin/sm-transcript CHANGED

File without changes

data/lib/sm_transcript/metadata.rb CHANGED

@@ -9,6 +9,31 @@ require_relative 'word'
 module SmTranscript
   class Metadata
+    #   "dc-abstract"
+    #   "dc-contributor"
+    #   "dc-creator"
+    #   "dc-description"
+    #   "dc-isPartOf"
+    #   "dc-language"
+    #   "dc-license"
+    #   "dc-subject"
+    #   "dc-title"
+    #   "dc-audience"
+    #   "dc-available"
+    #   "dc-created"
+    #   "dc-extent"
+    #   "dc-identifier"
+    #   "dc-isReplacedBy"
+    #   "dc-issued"
+    #   "dc-modified"
+    #   "dc-publisher"
+    #   "dc-replaces"
+    #   "dc-rightsHolder"
+    #   "dc-spatial"
+    #   "dc-temporal"
+    #   "dc-type"
+    #   "dc-valid"
     def initialize(metadata)
       @metadata = metadata

data/lib/sm_transcript/options.rb CHANGED

@@ -11,6 +11,7 @@ module SmTranscript
     SEG_SRC_TYPE = 'seg'
     WRD_SRC_TYPE = 'wrd'
     TXT_SRC_TYPE = 'txt'
+    TTML_SRC_TYPE = 'xml'
     TTML_DEST_TYPE = 'ttml'
     HTML_DEST_TYPE = 'html'
     DATAJS_DEST_TYPE = 'datajs'
@@ -58,12 +59,12 @@ module SmTranscript
           @options.destdir = @destdir = ddir
         end
-        opts.on("--srctype seg | wrd | txt",
-        "Kind of file to process (Default: seg)") do |stype|
+        opts.on("--srctype seg | wrd | txt | xml",
+        "Kind of file to process (Default: wrd)") do |stype|
           @options.srctype = @srctype = stype
         end
-        opts.on("--desttype html | ttml | datajs",
+        opts.on("--desttype html | ttml | datajs | json",
         "Kind of format to output (Default: html)") do |dtype|
           @options.desttype = @desttype = dtype
         end
@@ -73,6 +74,11 @@ module SmTranscript
           return
         end
+        opts.on("-v", "--version", "Show version") do
+          puts "\nsm-transcript gem version: 0.0.5rc"
+          return
+        end
         begin
           argv = ["-h"] if argv.empty?
           opts.parse!(argv)
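Note on options.rb: the option strings list the accepted values, but nothing in the hunks shown rejects other values. One way to enforce them is OptionParser's enumerated-argument form, sketched below. `parse_args`, `DEFAULTS`, `SRC_TYPES` and `DEST_TYPES` are hypothetical names, and the sketch returns a plain Hash rather than the gem's options object.

```ruby
require 'optparse'

SRC_TYPES  = %w[seg wrd txt xml].freeze
DEST_TYPES = %w[html ttml datajs json].freeze
DEFAULTS   = { srcdir: './results', destdir: './transcripts',
               srctype: 'wrd', desttype: 'html' }.freeze

# Parse argv into a Hash.  An Array passed to #on restricts the argument
# to the listed values; anything else raises OptionParser::InvalidArgument.
def parse_args(argv)
  options = DEFAULTS.dup
  parser = OptionParser.new do |opts|
    opts.on('--srcdir PATH', 'Read files from this folder') { |v| options[:srcdir] = v }
    opts.on('--destdir PATH', 'Write files to this folder') { |v| options[:destdir] = v }
    opts.on('--srctype TYPE', SRC_TYPES, 'Kind of file to process') { |v| options[:srctype] = v }
    opts.on('--desttype TYPE', DEST_TYPES, 'Kind of file to output') { |v| options[:desttype] = v }
  end
  parser.parse(argv)
  options
end
```

The README in this release documents the new source type as `ttml`, while `TTML_SRC_TYPE` is `'xml'`. The sketch follows the code.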

data/lib/sm_transcript/runner.rb CHANGED

@@ -7,6 +7,7 @@ require 'extensions/kernel'
 require_relative 'options'
 require_relative 'seg_reader'
 require_relative 'wrd_reader'
+require_relative 'ttml_reader'
 require_relative 'transcript'
 require_relative 'metadata'
 require_relative 'metadata_reader'
@@ -23,6 +24,9 @@ module SmTranscript
     def run
       # collect files to process
       begin
+        # p "working directory is #{File.new(__FILE__).path}"
+        # p "reading from #{@options.srcdir}"
+        # p "writing to   #{@options.destdir}"
         raise "source directory doesn't exist" unless FileTest.exists?(@options.srcdir)
         raise "destination directory doesn't exist" unless FileTest.exists?(@options.destdir)
@@ -32,6 +36,8 @@ module SmTranscript
           case @options.srctype
           when SmTranscript::Options::SEG_SRC_TYPE
             words = SmTranscript::SegReader.from_file(x).words
+          when SmTranscript::Options::TTML_SRC_TYPE
+            words = SmTranscript::TtmlReader.from_file(x).words
           when SmTranscript::Options::TXT_SRC_TYPE
             md = SmTranscript::MetadataReader.from_file(x).metadata
           else SmTranscript::Options::WRD_SRC_TYPE
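Note on runner.rb: in a Ruby `case`, `else` takes no condition, so the `else SmTranscript::Options::WRD_SRC_TYPE` branch appears to act as a catch-all for any unrecognized source type. If that is not intended, a lookup table that fails loudly is one alternative. In this sketch the lambdas are stand-ins for the gem's reader classes, and `READERS` and `read_source` are hypothetical names.

```ruby
# Stand-in readers.  The real gem would call SegReader, TtmlReader,
# MetadataReader and WrdReader here; these lambdas only tag the path.
READERS = {
  'seg' => ->(path) { "seg:#{path}" },
  'xml' => ->(path) { "ttml:#{path}" },
  'txt' => ->(path) { "metadata:#{path}" },
  'wrd' => ->(path) { "wrd:#{path}" }
}.freeze

# Look up the reader for a source type and raise on unknown types
# instead of silently falling back to the wrd reader.
def read_source(path, srctype)
  reader = READERS.fetch(srctype) do
    raise ArgumentError, "unknown source type: #{srctype}"
  end
  reader.call(path)
end
```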

data/lib/sm_transcript/seg_reader.rb CHANGED

@@ -34,7 +34,7 @@ module SmTranscript
       @root.elements.each("/document/lecture/segment") do |s|
         s.text.scan(/^\d* \d* [\w']*$/) do |t|
           arr = t.split
-          @words << SmTranscript::Word.new(arr[0], arr[1], arr[2])
+          @words << SmTranscript::Word.new(arr[0], arr[1], arr[1].to_i - arr[0].to_i, arr[2])
         end
       end
     end
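Note on seg_reader.rb: `Word.new` now receives a duration (end minus start) as its third argument. The sketch below mirrors that call on plain text. The `Word` struct is a stand-in for `SmTranscript::Word`, the field name `duration` is an assumption, and the regex uses `+` rather than `*` so that empty fields do not match.

```ruby
# Stand-in for SmTranscript::Word.  Argument order follows the new call:
# start, end, duration, word.
Word = Struct.new(:start_time, :end_time, :duration, :word)

# Turn lines of the form "<start> <end> <word>" into Word values,
# skipping any line that does not fit that shape.
def words_from_segment(text)
  text.scan(/^\d+ \d+ [\w']+$/).map do |line|
    start_ms, end_ms, word = line.split
    Word.new(start_ms, end_ms, end_ms.to_i - start_ms.to_i, word)
  end
end
```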

data/lib/sm_transcript/transcript.rb CHANGED

@@ -5,12 +5,14 @@
 require "rexml/document"
 require 'extensions/kernel'
 require 'builder'
+require 'sqlite3'
 require_relative 'word'
 module SmTranscript
   class Transcript
     @words = Array.new()
+    attr_reader :words
     def initialize(word_arr)
       @metadata = {}
@@ -27,7 +29,7 @@ module SmTranscript
         prev_start_time = 0
         start_time = 0
         @words.each do |w|
-          # get the start time and reduce its granularity so that multiple
+          # get the start time and reduce its granularity so that multiple
           # words fall within a <span> element.
           start_time = w.start_time.to_i/1000
           if start_time.to_i == prev_start_time.to_i # append word
@@ -35,16 +37,16 @@ module SmTranscript
           else # create a new span_element
             # since prev_start_time is zero on first line, this avoids
             # writing a closing </span> with no opening <span>
+            span_element = cleanup_phrase(span_element)
             f.puts span_element << "</span> " unless prev_start_time == 0
-            span_element = "<span id='T#{start_time}'>#{w.word}"
-            prev_start_time = start_time
+            span_element = "<span id='T#{start_time}'>#{w.word}"
+            prev_start_time = start_time
           end
         end
-        # In the block above, the last word isn't written if
-        # the start_time and prev_start_time are the same.
-        f.puts span_element << "</span> " unless start_time != prev_start_time
+        # In the block above, the last word isn't written if
+        # the start_time and prev_start_time are the same.
+        f.puts span_element << "</span> " unless start_time != prev_start_time
+        f.close
       end
     end  # write_html()
@@ -57,13 +59,13 @@ module SmTranscript
       buf = ""
       bldr = Builder::XmlMarkup.new( :target => buf, :indent => 2 )
       bldr.instruct!
-      bldr.tt("xmlns" => "http://www.w3.org/2006/04/ttaf1",
+      bldr.tt("xmlns" => "http://www.w3.org/2006/04/ttaf1",
       "xmlns:tts" => "http://www.w3.org/ns/ttml#styling",
       "xmlns:ttm" => "http://www.w3.org/ns/ttml#metadata",
-      "xml:lang" => "en" ) {
+      "xml:lang" => "en" ) {
         bldr.head { |b|
-          b.ttm :title, 'Document Metadata Example'
-          b.ttm :desc,  'This document employs document metadata.'
+          b.ttm :title, 'The title of this transcript'
+          b.ttm :desc,  'The description of this transcript'
         }
         bldr.body {
           bldr.div {
@@ -72,31 +74,37 @@ module SmTranscript
             start_ms = end_ms = 0
             start_secs = 0
             @words.each do |w|
-              # get the start time and reduce its granularity so that multiple
-              # words fall within a span element.
+              # get the start time and reduce its granularity so that
+              # multiple words form a phrase.
               start_secs = w.start_time.to_i/1000
               if start_secs == prev_start_secs # append word
-                end_ms   = w.end_time.to_i
+                end_ms = w.end_time.to_i
                 span_element << " #{w.word}"
               else # create a new span_element
-                bldr.p( span_element,
-                "xml:id" => "T#{start_secs.to_s}", "begin" => "#{start_ms.to_s}ms", "end" => "#{end_ms.to_s}ms" )
+                start_secs = w.start_time.to_i/1000
+                bldr.p( span_element,
+                  "xml:id" => "T#{start_secs.to_s}",
+                  "begin" => "#{start_ms.to_s}ms",
+                  "dur" => "#{(end_ms - start_ms).to_s}ms",
+                  "end" => "#{end_ms.to_s}ms" )
                 start_ms = w.start_time.to_i
                 end_ms   = w.end_time.to_i
-                span_element = " #{w.word}"
-                prev_start_secs = start_secs
+                span_element = " #{w.word}"
+                prev_start_secs = start_secs
               end
-            end
-            # In the block above, the last word isn't written if
-            # the start_time and prev_start_time are the same.
-            bldr.p( span_element,
-              "xml:id" => "T#{start_secs.to_s}",
-              "begin" => "#{start_ms.to_s}ms",
-              "end" => "#{end_ms.to_s}ms" ) unless start_secs != prev_start_secs
+            end # @words.each
+            # In the block above, the last word isn't written if
+            # the start_time and prev_start_time are the same.
+            bldr.p( span_element,
+              "xml:id" => "T#{start_secs.to_s}",
+              "begin" => "#{start_ms.to_s}ms",
+              "dur" => "#{(end_ms - start_ms).to_s}ms",
+              "end" => "#{end_ms.to_s}ms" ) unless start_secs != prev_start_secs
           }
         }
-      }
+      }
       # p buf
       File.open(dest_file, "w") do |f|
         f.puts buf
@@ -104,27 +112,66 @@ module SmTranscript
       end
     end
-    # Times are expressed in milliseconds, far more granularity than is
-    # useful for most user-facing apps, especially since the player reports
+    # The JSON format is defined at http://url/of/document.  It is the format
+    # of the static timed-text document that is passed to the player.
+    def write_json(dest_file)
+    end  # write_json()
+    # Store transcript in a Sqlite database (though the essence of this
+    # method should work for all relational dbs). Unlike some of the other
+    # write_xxx() methods, this one requires a @metadata array.
+    # param db_id - for SQLite, this is a filename.
+    # video_id - is a unique identifier for the video
+    def write_sqlite(db_id)
+      db_id = "sm-transcript"
+      db = SQLite3::Database.open(db_id + '.sqlite3')
+      fields = XPath.match(doc.root, inner_node_name + '[1]/*').map{|node| node.name}
+      field_def = fields.map {|x| "%s TEXT" % x}.join(', ')
+    end  # write_sqlite()
+    private
+    # Times are expressed in milliseconds, far more granularity than is
+    # useful for most user-facing apps, especially since the player reports
     # elapsed time only ten times a second.
-    # Reducing the time by orders of magnitude provides these benefits:
+    # Reducing the time by orders of magnitude provides these benefits:
     # 1) Multiple words fall within a <span> element.
     # 2) Better mapping between start times and player time tracking
     def words_to_phrase(start_time)
       start_time.to_i/1000
     end  # words_to_phrase
-    def get_time_expression(milliseconds)
-      milliseconds
-    end
-    # There are some word combinations that occur with such regularity that
+    # def get_time_expression(milliseconds)
+    #   milliseconds
+    # end
+    # There are some word combinations that occur with such regularity that
     # they call out to be fixed.  For example, "m I t" is unambiguously MIT.
-    # These edits can only be done when the phrase has been assembled.
+    # These edits can only be done when the phrase has been assembled since
+    # each letter is treated as an individual word.
     def cleanup_phrase(phrase)
-      phrase
+      # gsub returns a new string, so chain the calls to keep both fixes.
+      phrase.gsub(/m I t/, 'MIT').
+        gsub(/o e I t/, 'OEIT')
+    end
+    # remove HTML tags from text.  requires classes from ActionPack
+    def strip_tags(html)
+      return html if html.empty? || !html.include?('<')
+      output = ""
+      tokenizer = HTML::Tokenizer.new(html)
+      while token = tokenizer.next
+        node = HTML::Node.parse(nil, 0, 0, token, false)
+        output += token unless (node.kind_of? HTML::Tag) or (token =~ /^<!/)
+      end
+      return output
     end
   end  # class
 end
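Note on transcript.rb: the phrase logic in `write_html` and `write_ttml` (group words whose start times fall in the same second, then emit `begin`, `dur` and `end`) can be sketched independently of the output format. This is an illustration under assumptions, not the gem's implementation. `Word` and `Phrase` are stand-in structs, `words_to_phrases` here returns phrases rather than a single number, and the `\b` word boundaries are an addition so that text such as "from I think" is left alone.

```ruby
Word   = Struct.new(:start_time, :end_time, :word)
Phrase = Struct.new(:id, :begin_ms, :dur_ms, :end_ms, :text)

# Spelled-out acronyms and their replacements.
REPLACEMENTS = { /\bm I t\b/ => 'MIT', /\bo e I t\b/ => 'OEIT' }.freeze

# Apply every replacement in turn, feeding each result into the next.
def cleanup_phrase(phrase)
  REPLACEMENTS.reduce(phrase) { |text, (pattern, fix)| text.gsub(pattern, fix) }
end

# Group consecutive words that start within the same second into phrases.
def words_to_phrases(words)
  same_second = ->(a, b) { a.start_time.to_i / 1000 == b.start_time.to_i / 1000 }
  words.chunk_while(&same_second).map do |group|
    begin_ms = group.first.start_time.to_i
    end_ms   = group.last.end_time.to_i
    text     = cleanup_phrase(group.map(&:word).join(' '))
    Phrase.new("T#{begin_ms / 1000}", begin_ms, end_ms - begin_ms, end_ms, text)
  end
end
```

Each phrase carries the id of its own starting second, and the last group needs no special handling after the loop because `chunk_while` always yields it.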