sm-transcript 0.0.4 → 0.0.6
- data/README.txt +138 -118
- data/Rakefile +21 -10
- data/bin/sm-transcript +0 -0
- data/lib/sm_transcript/metadata.rb +25 -0
- data/lib/sm_transcript/options.rb +9 -3
- data/lib/sm_transcript/runner.rb +6 -0
- data/lib/sm_transcript/seg_reader.rb +1 -1
- data/lib/sm_transcript/transcript.rb +86 -39
- data/lib/sm_transcript/ttml_reader.rb +116 -0
- data/lib/sm_transcript/word.rb +6 -4
- data/lib/sm_transcript/wrd_reader.rb +5 -4
- data/test/results/18.03-2004-L01.align2.wrd +6441 -0
- data/test/results/8.01-1999-L01.wrd +5182 -0
- data/test/results/801-1stLecture.ttml.xml +757 -0
- data/test/results/801-lect01-4730.xml +757 -0
- data/test/results/801-lect02-4731.xml +886 -0
- data/test/results/801-lect03-4732.xml +818 -0
- data/test/results/801-lect04-4733.xml +831 -0
- data/test/results/801-lect05-4734.xml +879 -0
- data/test/results/801-lect06-4735.xml +822 -0
- data/test/results/801-lect07-4736.xml +893 -0
- data/test/results/801-lect08-4737.xml +809 -0
- data/test/results/801-lect09-4738.xml +807 -0
- data/test/results/Audio-Open-The_New_Deal_for_Education.xml +4301 -0
- data/test/test_metadatareader.rb +8 -3
- data/test/test_options.rb +8 -1
- data/test/test_runner.rb +34 -1
- data/test/test_transcript.rb +109 -12
- data/test/test_ttmlreader.rb +104 -0
- data/test/test_wrdreader.rb +24 -9
- metadata +47 -148
- data/lib/sm_transcript/optparseExample.rb +0 -113
- data/lib/sm_transcript/process_csv_files_to_html.rb +0 -58
- data/lib/sm_transcript/process_seg_files.rb +0 -21
- data/lib/sm_transcript/process_seg_files_to_csv.rb +0 -24
- data/lib/sm_transcript/process_seg_files_to_html.rb +0 -31
- data/lib/sm_transcript/require_relative.rb +0 -14
- data/test/transcripts/GardnerRileyInterview.t1.html +0 -247
- data/test/transcripts/IIHS_Diane_Davis_Nov2009-t1.html +0 -148
- data/test/transcripts/NERCOMP-SpokenMedia4.t1.html +0 -2178
- data/test/transcripts/data.js +0 -24
- data/test/transcripts/vijay_kumar-1.-t1.html +0 -557
- data/test/transcripts/vijay_kumar-1.t1.html +0 -558
- data/test/transcripts/vijay_kumar-t1.html +0 -558
- data/test/transcripts/vijay_kumar-t1.ttml +0 -570
- data/test/transcripts/vijay_kumar.data.js +0 -2
- data/test/transcripts/vijay_kumar.t1.html +0 -557
- data/test/transcripts/wirehair-beetle.data.js +0 -24
data/README.txt
CHANGED
@@ -1,140 +1,160 @@
|
|
1
|
-
$Id: README.txt
|
1
|
+
$Id: README.txt 196 2010-06-11 18:51:18Z pwilkins $
|
2
2
|
|
3
3
|
sm-transcript reads results of SLS processing and produces transcripts for
|
4
4
|
the SpokenMedia browser. For each file in the source folder whose extension
|
5
5
|
matches the source type, a file of destination type is created in the
|
6
|
-
destination folder. All of these parameters have default values.
|
6
|
+
destination folder. All of these parameters have default values.
|
7
|
+
|
8
|
+
Note: Examples of the commands you enter in the terminal are for *nix. The
|
9
|
+
command prompt in the examples is:
|
10
|
+
|
11
|
+
felix$ <command line>
|
12
|
+
|
13
|
+
If you are a Windows user, make the usual adjustments.
|
7
14
|
|
8
15
|
Requirements:
|
9-15
|
-
|
16
|
+
sm-transcript is written in Ruby and packaged as a RubyGem. Since Ruby is
|
17
|
+
not a compiled language, you will need to have Ruby installed on your
|
18
|
+
machine to run sm-transcript. You can determine if Ruby is installed by
|
19
|
+
typing "ruby -v" at a terminal prompt. It should return the version of
|
20
|
+
Ruby that is installed. If Ruby is not installed on your machine, contact
|
21
|
+
me (or your local Ruby wizard) for assistance.
|
22
|
+
|
16
23
|
Installation:
|
17-56
|
-
|
24
|
+
You can get sm-transcript as either a RubyGem or as source from svn.
|
25
|
+
|
26
|
+
The preferred way to install this package is as a RubyGem. You can
|
27
|
+
download and install the gem with this command:
|
28
|
+
|
29
|
+
felix$ sudo gem install [--verbose] sm-transcript
|
30
|
+
|
31
|
+
This command downloads the most recent version of the gem from rubygems.org
|
32
|
+
and makes it active. Previous versions of the gem remain installed, but
|
33
|
+
are deactivated.
|
34
|
+
|
35
|
+
You must use "sudo" to properly install the gem. If you execute "gem
|
36
|
+
install" (omitting the "sudo") the gem is installed in your home gem
|
37
|
+
repository and it isn't in your path without additional configuration.
|
38
|
+
|
39
|
+
Note: You need sudo privileges to run the command as written. If you
|
40
|
+
can't sudo, then you can install it locally and will need some additional
|
41
|
+
configuration. Contact me (or your local Ruby wizard) for assistance.
|
42
|
+
|
43
|
+
The executable is now in your path.
|
44
|
+
|
45
|
+
You can cleanly uninstall the gem with this command:
|
46
|
+
|
47
|
+
felix$ sudo gem uninstall sm-transcript
|
48
|
+
|
49
|
+
If you have access to our svn repository, you are welcome to check out the
|
50
|
+
code. Be warned that the trunk tip is not necessarily stable. It changes
|
51
|
+
frequently as enhancements (and bug fixes) are added. (Note that the
|
52
|
+
'smb_transcript' in the command line below is not a typo.)
|
53
|
+
|
54
|
+
svn co svn+ssh://svn.mit.edu/oeit-tsa/SMB/smb_transcript/trunk sm_transcript
|
55
|
+
|
56
|
+
Build the gem by running this command from the directory where you installed the
|
57
|
+
source. This is what it looks like on my machine:
|
58
|
+
|
59
|
+
felix$ rake gem
|
60
|
+
|
61
|
+
The gem will be built and put in ./pkg. You can now use the gem
|
62
|
+
installation instructions above.
|
63
|
+
|
57
64
|
|
58
65
|
Using the App:
|
59-79
|
-
|
66
|
+
Run with no command line parameters, the app reads *.wrd files out of
|
67
|
+
./results and writes *t1.html files to ./transcripts. These directories
|
68
|
+
are relative to where sm_transcript is called.
|
69
|
+
|
70
|
+
Note: destination files are overwritten without a warning prompt. If you
|
71
|
+
want to preserve an existing output file, rename it before running the app
|
72
|
+
again.
|
73
|
+
|
74
|
+
For example, run the app by navigating to the bin folder and entering:
|
75
|
+
|
76
|
+
projects/sm_transcript/bin felix$ sm_transcript
|
77
|
+
|
78
|
+
This command run from this folder will read *.wrd files from bin/results
|
79
|
+
and write *-t1.html to bin/transcripts.
|
80
|
+
|
81
|
+
Usage: sm_transcript [options]
|
82
|
+
--srcdir PATH Read files from this folder (Default: ./results)
|
83
|
+
--destdir PATH Write files to this folder (Default: ./transcripts)
|
84
|
+
--srctype wrd | seg | txt | ttml Kind of file to process (Default: wrd)
|
85
|
+
--desttype html | ttml | datajs | json Kind of file to output (Default: html)
|
86
|
+
-h, --help Show this message
|
80
87
|
|
81
88
|
|
82
89
|
Troubleshooting:
|
83-98
|
-
|
90
|
+
sm-transcript requires additional gems to operate. The RubyGem
|
91
|
+
installation should install dependencies automatically, but when it
|
92
|
+
doesn't, you get an error. If the error includes
|
93
|
+
|
94
|
+
... no such file to load -- builder (LoadError)
|
95
|
+
|
96
|
+
in the first few lines when you run sm-transcript, the problem is a
|
97
|
+
missing dependent gem. (The error above indicates that the Builder
|
98
|
+
gem is missing.) Try installing the missing gem. For the error above,
|
99
|
+
the command looks like this on my computer:
|
100
|
+
|
101
|
+
felix$ sudo gem install builder
|
102
|
+
|
103
|
+
See "Required Gems" below for more information.
|
104
|
+
|
105
|
+
|
106
|
+
A warning message such as:
|
107
|
+
|
108
|
+
"WARNING: Nokogiri was built against LibXML version 2.7.6,
|
109
|
+
but has dynamically loaded 2.7.7"
|
110
|
+
|
111
|
+
may be safely ignored.
|
112
|
+
|
113
|
+
|
99
114
|
Upgrading:
|
100-107
|
-
|
115
|
+
You can easily upgrade by simply executing the same command you used to
|
116
|
+
install the gem. Running install again will add the newer version and make
|
117
|
+
it active. By default the most recent version is used, but older versions
|
118
|
+
are still available, simply inactive.
|
119
|
+
|
120
|
+
If you are using svn, you should already know what to do.
|
121
|
+
|
122
|
+
|
108
123
|
Required Gems:
|
109-124
|
-
|
124
|
+
builder - create structured data, such as XML
|
125
|
+
extensions - added for the 'require_relative' command. (To get this
|
126
|
+
command in Ruby 1.8 you need to install this gem, for Ruby 1.9
|
127
|
+
the command is already part of the core.)
|
128
|
+
htmlentities - encoding and decoding of HTML entities
|
129
|
+
json - create JSON structured data
|
130
|
+
optparse - option parsing of command line
|
131
|
+
ostruct - open data structures
|
132
|
+
ppcommand - pp is a pretty printer. It is used only for debugging
|
133
|
+
rake - make for Ruby
|
134
|
+
rubygems - support for gems (shouldn't be needed for Ruby 1.9)
|
135
|
+
shoulda - enhancement for Test::Unit
|
136
|
+
|
137
|
+
This command installs gems on OSX and Linux:
|
138
|
+
felix$ sudo gem install <gem name>
|
139
|
+
|
125
140
|
Unit Tests:
|
126
|
-
|
127
|
-
|
141
|
+
You may run all unit tests by navigating to the test folder and running
|
142
|
+
rake with no parameters (the default rake task runs all tests). On my
|
143
|
+
computer, it looks like this:
|
128
144
|
|
129
|
-
|
145
|
+
projects/sm_transcript/test felix$ rake
|
130
146
|
|
131
147
|
|
132
148
|
Release Notes:
|
133
|
-
|
149
|
+
Initial Version - runs under Ruby 1.8.x.
|
150
|
+
version 0.0.4 - fixes bug when processing .WRD files with CRLF line
|
151
|
+
endings.
|
152
|
+
version 0.0.5 - added srctype of ttml and desttype of json, fixed bug
|
153
|
+
where beginning time of word was actually for previous word.
|
134
154
|
|
135
155
|
To Do:
|
136
|
-
|
156
|
+
specify individual files for processing rather than folders
|
157
|
+
update code to run under Ruby 1.9
|
158
|
+
|
137
159
|
|
138
|
-
|
139
|
-
from a public gem repository like RubyForge.
|
140
|
-
|
160
|
+
|
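The default behaviour the README describes (read every *.wrd file in ./results, write a matching HTML file to ./transcripts) can be sketched as a small path mapping. This is only an illustration, not the gem's code: the helper names `destination_for` and `planned_outputs` are hypothetical, and the `-t1.html` suffix is an assumption taken from the README's own example.

```ruby
# Hypothetical helper: derive the destination path for one source file.
# The '-t1.html' suffix is an assumption based on the README example.
def destination_for(src_path, destdir: './transcripts', suffix: '-t1.html')
  base = File.basename(src_path, File.extname(src_path))
  File.join(destdir, base + suffix)
end

# Hypothetical helper: list [source, destination] pairs for a folder.
# Dir.glob returns an empty array when the folder does not exist.
def planned_outputs(srcdir: './results', srctype: 'wrd', destdir: './transcripts')
  Dir.glob(File.join(srcdir, "*.#{srctype}")).sort.map do |src|
    [src, destination_for(src, destdir: destdir)]
  end
end
```

Because the README warns that destination files are overwritten without a prompt, a caller could check `File.exist?` on each planned destination before running a conversion.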
data/Rakefile
CHANGED
@@ -1,31 +1,42 @@
|
|
1
|
-
# $Id: Rakefile
|
1
|
+
# $Id: Rakefile 196 2010-06-11 18:51:18Z pwilkins $
|
2
2
|
|
3
3
|
require 'rake/gempackagetask'
|
4
4
|
require 'rake'
|
5
5
|
|
6
|
-
spec = Gem::Specification.new do |s|
|
6
|
+
spec = Gem::Specification.new do |s|
|
7
7
|
s.name = "sm-transcript"
|
8
8
|
s.summary = "Convert word lists to transcripts"
|
9
9
|
s.description= File.read(File.join(File.dirname(__FILE__), 'README.txt'))
|
10
10
|
s.requirements = [ 'TBD' ]
|
11
|
-
s.version = "0.0.
|
11
|
+
s.version = "0.0.6"
|
12
12
|
s.author = "Peter Wilkins"
|
13
13
|
s.email = "pwilkins@mit.edu"
|
14
14
|
s.homepage = "http://spokenmedia.mit.edu"
|
15
15
|
s.platform = Gem::Platform::RUBY
|
16
16
|
s.required_ruby_version = '>=1.8'
|
17
17
|
s.files = Dir['lib/**/**'] +
|
18
|
-
Dir['bin/sm-transcript'] +
|
19
|
-
Dir['bin/results/PLACEHOLDER.txt'] +
|
20
|
-
Dir['bin/transcripts/PLACEHOLDER.txt'] +
|
21
|
-
Dir['test
|
18
|
+
Dir['bin/sm-transcript'] +
|
19
|
+
Dir['bin/results/PLACEHOLDER.txt'] +
|
20
|
+
Dir['bin/transcripts/PLACEHOLDER.txt'] +
|
21
|
+
Dir['test/*'] +
|
22
|
+
Dir['test/results/*'] +
|
23
|
+
Dir['test/transcripts/PLACEHOLDER.txt'] +
|
22
24
|
Dir['README.txt'] +
|
23
25
|
Dir['LICENSE.txt'] +
|
24
|
-
Dir['Rakefile']
|
25
|
-
s.files.reject! { |fn| fn.include? "process_" }
|
26
|
+
Dir['Rakefile']
|
27
|
+
s.files.reject! { |fn| fn.include? "process_" }
|
28
|
+
s.files.reject! { |fn| fn.include? 'lect1' }
|
29
|
+
s.files.reject! { |fn| fn.include? 'lect2' }
|
30
|
+
s.files.reject! { |fn| fn.include? 'lect3' }
|
31
|
+
s.files.reject! { |fn| fn.include? 'file-chksum.rb' }
|
32
|
+
s.files.reject! { |fn| fn.include? 'html_tokenizer-example.rb' }
|
33
|
+
s.files.reject! { |fn| fn.include? 'optparseExample.rb' }
|
34
|
+
s.files.reject! { |fn| fn.include? 'xml_to_sqlite.rb' }
|
35
|
+
s.files.reject! { |fn| fn.include? 'require_relative.rb' }
|
36
|
+
s.files.reject! { |fn| fn.include? '801-lect1.*' }
|
26
37
|
s.executables = [ 'sm-transcript' ]
|
27
38
|
s.test_files = Dir["test/test*.rb"]
|
28
39
|
s.has_rdoc = false
|
29
40
|
end
|
30
|
-
|
41
|
+
|
31
42
|
Rake::GemPackageTask.new(spec).define
|
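The chain of `reject!` calls in the gem specification filters the file list by substring. A minimal sketch of the same idea with a single exclusion list follows; the sample file names and the helper names `filter_packaged`, `literal_pattern_match?` and `glob_pattern_match?` are hypothetical. It also illustrates that `String#include?` compares literal text, so a pattern such as `'801-lect1.*'` only matches names that contain those exact characters, whereas `File.fnmatch?` interprets the wildcard.

```ruby
# Substrings taken from the reject! calls in the Rakefile.
EXCLUDED_SUBSTRINGS = %w[
  process_ lect1 lect2 lect3 file-chksum.rb html_tokenizer-example.rb
  optparseExample.rb xml_to_sqlite.rb require_relative.rb
].freeze

# Hypothetical helper: drop every file whose path contains an excluded substring.
def filter_packaged(files)
  files.reject { |fn| EXCLUDED_SUBSTRINGS.any? { |s| fn.include?(s) } }
end

# include? treats '.*' as literal characters, not as a wildcard.
def literal_pattern_match?(path)
  path.include?('801-lect1.*')
end

# File.fnmatch? applies shell-style wildcard matching instead.
def glob_pattern_match?(path)
  File.fnmatch?('*801-lect1.*', path)
end

# Hypothetical file names standing in for what the Dir[...] globs return.
SAMPLE_FILES = [
  'lib/sm_transcript/transcript.rb',
  'lib/sm_transcript/process_seg_files.rb',
  'lib/sm_transcript/optparseExample.rb',
  'test/results/801-lect1.xml',
  'test/results/801-lect01-4730.xml'
].freeze
```

Calling `filter_packaged(SAMPLE_FILES)` keeps `transcript.rb` and `801-lect01-4730.xml`, since `lect01` does not contain the substring `lect1`.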
data/bin/sm-transcript
CHANGED
File without changes
|
data/lib/sm_transcript/metadata.rb
CHANGED
@@ -9,6 +9,31 @@ require_relative 'word'
|
|
9
9
|
|
10
10
|
module SmTranscript
|
11
11
|
class Metadata
|
12
|
+
|
13
|
+
# "dc-abstract"
|
14
|
+
# "dc-contributor"
|
15
|
+
# "dc-creator"
|
16
|
+
# "dc-description"
|
17
|
+
# "dc-isPartOf"
|
18
|
+
# "dc-language"
|
19
|
+
# "dc-license"
|
20
|
+
# "dc-subject"
|
21
|
+
# "dc-title"
|
22
|
+
# "dc-audience"
|
23
|
+
# "dc-available"
|
24
|
+
# "dc-created"
|
25
|
+
# "dc-extent"
|
26
|
+
# "dc-identifier"
|
27
|
+
# "dc-isReplacedBy"
|
28
|
+
# "dc-issued"
|
29
|
+
# "dc-modified"
|
30
|
+
# "dc-publisher"
|
31
|
+
# "dc-replaces"
|
32
|
+
# "dc-rightsHolder"
|
33
|
+
# "dc-spatial"
|
34
|
+
# "dc-temporal"
|
35
|
+
# "dc-type"
|
36
|
+
# "dc-valid"
|
12
37
|
|
13
38
|
def initialize(metadata)
|
14
39
|
@metadata = metadata
|
data/lib/sm_transcript/options.rb
CHANGED
@@ -11,6 +11,7 @@ module SmTranscript
|
|
11
11
|
SEG_SRC_TYPE = 'seg'
|
12
12
|
WRD_SRC_TYPE = 'wrd'
|
13
13
|
TXT_SRC_TYPE = 'txt'
|
14
|
+
TTML_SRC_TYPE = 'xml'
|
14
15
|
TTML_DEST_TYPE = 'ttml'
|
15
16
|
HTML_DEST_TYPE = 'html'
|
16
17
|
DATAJS_DEST_TYPE = 'datajs'
|
@@ -58,12 +59,12 @@ module SmTranscript
|
|
58
59
|
@options.destdir = @destdir = ddir
|
59
60
|
end
|
60
61
|
|
61
|
-
opts.on("--srctype seg | wrd | txt",
|
62
|
-
"Kind of file to process (Default:
|
62
|
+
opts.on("--srctype seg | wrd | txt | xml",
|
63
|
+
"Kind of file to process (Default: wrd)") do |stype|
|
63
64
|
@options.srctype = @srctype = stype
|
64
65
|
end
|
65
66
|
|
66
|
-
opts.on("--desttype html | ttml | datajs",
|
67
|
+
opts.on("--desttype html | ttml | datajs | json",
|
67
68
|
"Kind of format to output (Default: html)") do |dtype|
|
68
69
|
@options.desttype = @desttype = dtype
|
69
70
|
end
|
@@ -73,6 +74,11 @@ module SmTranscript
|
|
73
74
|
return
|
74
75
|
end
|
75
76
|
|
77
|
+
opts.on("-v", "--version", "Show version") do
|
78
|
+
puts "\nsm-transcript gem version: 0.0.5rc"
|
79
|
+
return
|
80
|
+
end
|
81
|
+
|
76
82
|
begin
|
77
83
|
argv = ["-h"] if argv.empty?
|
78
84
|
opts.parse!(argv)
|
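For readers unfamiliar with Ruby's standard `optparse` library, the option handling in this file can be approximated by the sketch below. It is not the gem's `Options` class: the function name `parse_transcript_options`, the use of a plain Hash, and the choice to restrict values with an enumerated list are all assumptions made for illustration. The defaults and type names are the ones that appear in the diff.

```ruby
require 'optparse'

SRC_TYPES  = %w[seg wrd txt xml].freeze
DEST_TYPES = %w[html ttml datajs json].freeze

# Hypothetical stand-in for the gem's option parsing.
def parse_transcript_options(argv)
  # Defaults as documented in the README and the option help text.
  options = { srcdir: './results', destdir: './transcripts',
              srctype: 'wrd', desttype: 'html' }

  parser = OptionParser.new do |opts|
    opts.banner = 'Usage: sm-transcript [options]'
    opts.on('--srcdir PATH', 'Read files from this folder') { |v| options[:srcdir] = v }
    opts.on('--destdir PATH', 'Write files to this folder') { |v| options[:destdir] = v }
    # Passing an Array makes OptionParser reject values outside the list.
    opts.on('--srctype TYPE', SRC_TYPES, 'Kind of file to process') { |v| options[:srctype] = v }
    opts.on('--desttype TYPE', DEST_TYPES, 'Kind of file to output') { |v| options[:desttype] = v }
  end

  parser.parse!(argv.dup) # work on a copy so the caller's array is untouched
  options
end
```

With the enumerated lists, an unsupported value such as `--srctype pdf` raises `OptionParser::InvalidArgument` instead of being silently accepted.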
data/lib/sm_transcript/runner.rb
CHANGED
@@ -7,6 +7,7 @@ require 'extensions/kernel'
|
|
7
7
|
require_relative 'options'
|
8
8
|
require_relative 'seg_reader'
|
9
9
|
require_relative 'wrd_reader'
|
10
|
+
require_relative 'ttml_reader'
|
10
11
|
require_relative 'transcript'
|
11
12
|
require_relative 'metadata'
|
12
13
|
require_relative 'metadata_reader'
|
@@ -23,6 +24,9 @@ module SmTranscript
|
|
23
24
|
def run
|
24
25
|
# collect files to process
|
25
26
|
begin
|
27
|
+
# p "working directory is #{File.new(__FILE__).path}"
|
28
|
+
# p "reading from #{@options.srcdir}"
|
29
|
+
# p "writing to #{@options.destdir}"
|
26
30
|
raise "source directory doesn't exist" unless FileTest.exists?(@options.srcdir)
|
27
31
|
raise "destination directory doesn't exist" unless FileTest.exists?(@options.destdir)
|
28
32
|
|
@@ -32,6 +36,8 @@ module SmTranscript
|
|
32
36
|
case @options.srctype
|
33
37
|
when SmTranscript::Options::SEG_SRC_TYPE
|
34
38
|
words = SmTranscript::SegReader.from_file(x).words
|
39
|
+
when SmTranscript::Options::TTML_SRC_TYPE
|
40
|
+
words = SmTranscript::TtmlReader.from_file(x).words
|
35
41
|
when SmTranscript::Options::TXT_SRC_TYPE
|
36
42
|
md = SmTranscript::MetadataReader.from_file(x).metadata
|
37
43
|
else SmTranscript::Options::WRD_SRC_TYPE
|
data/lib/sm_transcript/seg_reader.rb
CHANGED
@@ -34,7 +34,7 @@ module SmTranscript
|
|
34
34
|
@root.elements.each("/document/lecture/segment") do |s|
|
35
35
|
s.text.scan(/^\d* \d* [\w']*$/) do |t|
|
36
36
|
arr = t.split
|
37
|
-
@words << SmTranscript::Word.new(arr[0], arr[1], arr[2])
|
37
|
+
@words << SmTranscript::Word.new(arr[0], arr[1], arr[1].to_i - arr[0].to_i, arr[2])
|
38
38
|
end
|
39
39
|
end
|
40
40
|
end
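The change in this hunk passes a duration (end time minus start time) as a third argument when building each word. A minimal sketch of that parsing step, using the same regular expression on plain text rather than on REXML elements, might look like this. `SegWord` is a hypothetical stand-in for `SmTranscript::Word`, whose real definition is not shown in this part of the diff, and `words_from_segment` is a name invented for the example.

```ruby
# Hypothetical stand-in for SmTranscript::Word (start, end, duration, word).
SegWord = Struct.new(:start_time, :end_time, :duration, :word)

# Hypothetical helper: extract "start end word" lines from segment text.
def words_from_segment(text)
  # Same pattern as the reader: two numeric fields and a word, one per line.
  text.scan(/^\d* \d* [\w']*$/).map do |line|
    start_ms, end_ms, word = line.split
    # Fields arrive as strings, so convert before subtracting.
    SegWord.new(start_ms, end_ms, end_ms.to_i - start_ms.to_i, word)
  end
end
```

Lines that do not follow the three-field layout are simply skipped by `scan`.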
|
data/lib/sm_transcript/transcript.rb
CHANGED
@@ -5,12 +5,14 @@
|
|
5
5
|
require "rexml/document"
|
6
6
|
require 'extensions/kernel'
|
7
7
|
require 'builder'
|
8
|
+
require 'sqlite3'
|
8
9
|
require_relative 'word'
|
9
10
|
|
10
11
|
module SmTranscript
|
11
12
|
class Transcript
|
12
13
|
|
13
14
|
@words = Array.new()
|
15
|
+
attr_reader :words
|
14
16
|
|
15
17
|
def initialize(word_arr)
|
16
18
|
@metadata = {}
|
@@ -27,7 +29,7 @@ module SmTranscript
|
|
27
29
|
prev_start_time = 0
|
28
30
|
start_time = 0
|
29
31
|
@words.each do |w|
|
30
|
-
# get the start time and reduce its granularity so that multiple
|
32
|
+
# get the start time and reduce its granularity so that multiple
|
31
33
|
# words fall within a <span> element.
|
32
34
|
start_time = w.start_time.to_i/1000
|
33
35
|
if start_time.to_i == prev_start_time.to_i # append word
|
@@ -35,16 +37,16 @@ module SmTranscript
|
|
35
37
|
else # create a new span_element
|
36
38
|
# since prev_start_time is zero on first line, this avoids
|
37
39
|
# writing a closing </span> with no opening <span>
|
40
|
+
span_element = cleanup_phrase(span_element)
|
38
41
|
f.puts span_element << "</span> " unless prev_start_time == 0
|
39
|
-
|
40
|
-
|
41
|
-
prev_start_time = start_time
|
42
|
+
span_element = "<span id='T#{start_time}'>#{w.word}"
|
43
|
+
prev_start_time = start_time
|
42
44
|
end
|
43
45
|
end
|
44
|
-
# In the block above, the last word isn't written if
|
45
|
-
# the start_time and prev_start_time are the same.
|
46
|
-
f.puts span_element << "</span> " unless start_time != prev_start_time
|
47
|
-
|
46
|
+
# In the block above, the last word isn't written if
|
47
|
+
# the start_time and prev_start_time are the same.
|
48
|
+
f.puts span_element << "</span> " unless start_time != prev_start_time
|
49
|
+
f.close
|
48
50
|
end
|
49
51
|
end # write_html()
|
50
52
|
|
@@ -57,13 +59,13 @@ module SmTranscript
|
|
57
59
|
buf = ""
|
58
60
|
bldr = Builder::XmlMarkup.new( :target => buf, :indent => 2 )
|
59
61
|
bldr.instruct!
|
60
|
-
bldr.tt("xmlns" => "http://www.w3.org/2006/04/ttaf1",
|
62
|
+
bldr.tt("xmlns" => "http://www.w3.org/2006/04/ttaf1",
|
61
63
|
"xmlns:tts" => "http://www.w3.org/ns/ttml#styling",
|
62
64
|
"xmlns:ttm" => "http://www.w3.org/ns/ttml#metadata",
|
63
|
-
"xml:lang" => "en" ) {
|
65
|
+
"xml:lang" => "en" ) {
|
64
66
|
bldr.head { |b|
|
65
|
-
b.ttm :title, '
|
66
|
-
b.ttm :desc, '
|
67
|
+
b.ttm :title, 'The title of this transcript'
|
68
|
+
b.ttm :desc, 'The description of this transcript'
|
67
69
|
}
|
68
70
|
bldr.body {
|
69
71
|
bldr.div {
|
@@ -72,31 +74,37 @@ module SmTranscript
|
|
72
74
|
start_ms = end_ms = 0
|
73
75
|
start_secs = 0
|
74
76
|
@words.each do |w|
|
75
|
-
# get the start time and reduce its granularity so that
|
76
|
-
# words
|
77
|
+
# get the start time and reduce its granularity so that
|
78
|
+
# multiple words form a phrase.
|
77
79
|
start_secs = w.start_time.to_i/1000
|
78
80
|
if start_secs == prev_start_secs # append word
|
79
|
-
end_ms
|
81
|
+
end_ms = w.end_time.to_i
|
80
82
|
span_element << " #{w.word}"
|
81
83
|
else # create a new span_element
|
82
|
-
|
83
|
-
|
84
|
+
start_secs = w.start_time.to_i/1000
|
85
|
+
bldr.p( span_element,
|
86
|
+
"xml:id" => "T#{start_secs.to_s}",
|
87
|
+
"begin" => "#{start_ms.to_s}ms",
|
88
|
+
"dur" => "#{(end_ms - start_ms).to_s}ms",
|
89
|
+
"end" => "#{end_ms.to_s}ms" )
|
84
90
|
|
85
91
|
start_ms = w.start_time.to_i
|
86
92
|
end_ms = w.end_time.to_i
|
87
|
-
span_element = " #{w.word}"
|
88
|
-
prev_start_secs = start_secs
|
93
|
+
span_element = " #{w.word}"
|
94
|
+
prev_start_secs = start_secs
|
89
95
|
end
|
90
|
-
end
|
91
|
-
|
92
|
-
# the
|
93
|
-
|
94
|
-
|
95
|
-
"
|
96
|
-
"
|
96
|
+
end # @words.each
|
97
|
+
|
98
|
+
# In the block above, the last word isn't written if
|
99
|
+
# the start_time and prev_start_time are the same.
|
100
|
+
bldr.p( span_element,
|
101
|
+
"xml:id" => "T#{start_secs.to_s}",
|
102
|
+
"begin" => "#{start_ms.to_s}ms",
|
103
|
+
"dur" => "#{(end_ms - start_ms).to_s}ms",
|
104
|
+
"end" => "#{end_ms.to_s}ms" ) unless start_secs != prev_start_secs
|
97
105
|
}
|
98
106
|
}
|
99
|
-
}
|
107
|
+
}
|
100
108
|
# p buf
|
101
109
|
File.open(dest_file, "w") do |f|
|
102
110
|
f.puts buf
|
@@ -104,27 +112,66 @@ module SmTranscript
|
|
104
112
|
end
|
105
113
|
end
|
106
114
|
|
107
|
-
|
108
|
-
#
|
115
|
+
|
116
|
+
# The JSON format is defined at http://url/of/document. It is the format
|
117
|
+
# of the static timed-text document that is passed to the player.
|
118
|
+
def write_json(dest_file)
|
119
|
+
|
120
|
+
end # write_json()
|
121
|
+
|
122
|
+
|
123
|
+
# Store transcript in a Sqlite database (though the essence of this
|
124
|
+
# method should work for all relational dbs). Unlike some of the other
|
125
|
+
# write_xxx() methods, this one requires a @metadata array.
|
126
|
+
# param db_id - for SQLite, this is a filename.
|
127
|
+
# video_id - is a unique identifier for the video
|
128
|
+
|
129
|
+
def write_sqlite(db_id)
|
130
|
+
db_id = "sm-transcript"
|
131
|
+
db = SQLite3::Database.open(db_id + '.sqlite3')
|
132
|
+
|
133
|
+
fields = XPath.match(doc.root, inner_node_name + '[1]/*').map{|node| node.name}
|
134
|
+
field_def = fields.map {|x| "%s TEXT" % x}.join(', ')
|
135
|
+
|
136
|
+
end # write_sqlite()
|
137
|
+
|
138
|
+
|
139
|
+
private
|
140
|
+
|
141
|
+
# Times are expressed in milliseconds, far more granularity than is
|
142
|
+
# useful for most user-facing apps, especially since the player reports
|
109
143
|
# elapsed time only ten times a second.
|
110
|
-
# Reducing the time by orders of magnitude provides these benefits:
|
144
|
+
# Reducing the time by orders of magnitude provides these benefits:
|
111
145
|
# 1) Multiple words fall within a <span> element.
|
112
146
|
# 2) Better mapping between start times and player time tracking
|
113
147
|
def words_to_phrase(start_time)
|
114
148
|
start_time.to_i/1000
|
115
149
|
end # words_to_phrase
|
116
|
-
|
117
|
-
def get_time_expression(milliseconds)
|
118
|
-
|
119
|
-
end
|
120
|
-
|
121
|
-
# There are some word combinations that occur with such regularity that
|
150
|
+
|
151
|
+
# def get_time_expression(milliseconds)
|
152
|
+
# milliseconds
|
153
|
+
# end
|
154
|
+
|
155
|
+
# There are some word combinations that occur with such regularity that
|
122
156
|
# they call out to be fixed. For example, "m I t" is unambiguously MIT.
|
123
|
-
# These edits can only be done when the phrase has been assembled
|
157
|
+
# These edits can only be done when the phrase has been assembled since
|
158
|
+
# each letter is treated as an individual word.
|
124
159
|
def cleanup_phrase(phrase)
|
125
|
-
phrase
|
160
|
+
phrase = phrase.gsub(/m I t/, 'MIT')
|
161
|
+
phrase.gsub(/o e I t/, 'OEIT')
|
162
|
+
end
|
163
|
+
|
164
|
+
# remove HTML tags from text. requires classes from ActionPack
|
165
|
+
def strip_tags(html)
|
166
|
+
return html if html.empty? || !html.include?('<')
|
167
|
+
output = ""
|
168
|
+
tokenizer = HTML::Tokenizer.new(html)
|
169
|
+
while token = tokenizer.next
|
170
|
+
node = HTML::Node.parse(nil, 0, 0, token, false)
|
171
|
+
output += token unless (node.kind_of? HTML::Tag) or (token =~ /^<!/)
|
172
|
+
end
|
173
|
+
return output
|
126
174
|
end
|
127
|
-
|
128
175
|
|
129
176
|
end # class
|
130
177
|
end
|
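The writer methods in this file share one idea: word start times in milliseconds are reduced to whole seconds, consecutive words that fall in the same second are joined into a phrase, and the assembled phrase is then cleaned up. The sketch below restates that idea in a compact form. It is an illustration under assumptions, not the gem's implementation: `TimedWord`, `tidy_phrase` and `words_to_phrases` are hypothetical names, and `chunk_while` requires a much newer Ruby than the 1.8 series the gem targets. Since `String#gsub` returns a new string rather than changing its receiver, the substitutions are chained so that the final value carries all of them.

```ruby
# Hypothetical stand-in for the gem's word objects (times in milliseconds).
TimedWord = Struct.new(:start_time, :end_time, :word)

# Join letter-by-letter recognitions; gsub calls are chained because each
# one returns a new string.
def tidy_phrase(phrase)
  phrase.gsub(/m I t/, 'MIT').gsub(/o e I t/, 'OEIT')
end

# Group consecutive words whose start times fall within the same second.
def words_to_phrases(words)
  groups = words.chunk_while do |a, b|
    a.start_time.to_i / 1000 == b.start_time.to_i / 1000
  end
  groups.map do |group|
    {
      id: "T#{group.first.start_time.to_i / 1000}", # id uses whole seconds
      begin_ms: group.first.start_time.to_i,
      end_ms: group.last.end_time.to_i,
      text: tidy_phrase(group.map(&:word).join(' '))
    }
  end
end
```

Each resulting hash carries the values a writer would need for one `<span>` or `<p>` element: an id, begin and end times, and the phrase text.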