sm-transcript 0.0.4 → 0.0.6
- data/README.txt +138 -118
- data/Rakefile +21 -10
- data/bin/sm-transcript +0 -0
- data/lib/sm_transcript/metadata.rb +25 -0
- data/lib/sm_transcript/options.rb +9 -3
- data/lib/sm_transcript/runner.rb +6 -0
- data/lib/sm_transcript/seg_reader.rb +1 -1
- data/lib/sm_transcript/transcript.rb +86 -39
- data/lib/sm_transcript/ttml_reader.rb +116 -0
- data/lib/sm_transcript/word.rb +6 -4
- data/lib/sm_transcript/wrd_reader.rb +5 -4
- data/test/results/18.03-2004-L01.align2.wrd +6441 -0
- data/test/results/8.01-1999-L01.wrd +5182 -0
- data/test/results/801-1stLecture.ttml.xml +757 -0
- data/test/results/801-lect01-4730.xml +757 -0
- data/test/results/801-lect02-4731.xml +886 -0
- data/test/results/801-lect03-4732.xml +818 -0
- data/test/results/801-lect04-4733.xml +831 -0
- data/test/results/801-lect05-4734.xml +879 -0
- data/test/results/801-lect06-4735.xml +822 -0
- data/test/results/801-lect07-4736.xml +893 -0
- data/test/results/801-lect08-4737.xml +809 -0
- data/test/results/801-lect09-4738.xml +807 -0
- data/test/results/Audio-Open-The_New_Deal_for_Education.xml +4301 -0
- data/test/test_metadatareader.rb +8 -3
- data/test/test_options.rb +8 -1
- data/test/test_runner.rb +34 -1
- data/test/test_transcript.rb +109 -12
- data/test/test_ttmlreader.rb +104 -0
- data/test/test_wrdreader.rb +24 -9
- metadata +47 -148
- data/lib/sm_transcript/optparseExample.rb +0 -113
- data/lib/sm_transcript/process_csv_files_to_html.rb +0 -58
- data/lib/sm_transcript/process_seg_files.rb +0 -21
- data/lib/sm_transcript/process_seg_files_to_csv.rb +0 -24
- data/lib/sm_transcript/process_seg_files_to_html.rb +0 -31
- data/lib/sm_transcript/require_relative.rb +0 -14
- data/test/transcripts/GardnerRileyInterview.t1.html +0 -247
- data/test/transcripts/IIHS_Diane_Davis_Nov2009-t1.html +0 -148
- data/test/transcripts/NERCOMP-SpokenMedia4.t1.html +0 -2178
- data/test/transcripts/data.js +0 -24
- data/test/transcripts/vijay_kumar-1.-t1.html +0 -557
- data/test/transcripts/vijay_kumar-1.t1.html +0 -558
- data/test/transcripts/vijay_kumar-t1.html +0 -558
- data/test/transcripts/vijay_kumar-t1.ttml +0 -570
- data/test/transcripts/vijay_kumar.data.js +0 -2
- data/test/transcripts/vijay_kumar.t1.html +0 -557
- data/test/transcripts/wirehair-beetle.data.js +0 -24
data/README.txt
CHANGED
@@ -1,140 +1,160 @@
|
|
1
|
-
$Id: README.txt
|
1
|
+
$Id: README.txt 196 2010-06-11 18:51:18Z pwilkins $
|
2
2
|
|
3
3
|
sm-transcript reads results of SLS processing and produces transcripts for
|
4
4
|
the SpokenMedia browser. For each file in the source folder whose extension
|
5
5
|
matches the source type, a file of destination type is created in the
|
6
|
-
destination folder. All of these parameters have default values.
|
6
|
+
destination folder. All of these parameters have default values.
|
7
|
+
|
8
|
+
Note: Examples of the commands you enter in the terminal are for *nix. The
|
9
|
+
command prompt in the examples is:
|
10
|
+
|
11
|
+
felix$ <command line>
|
12
|
+
|
13
|
+
If you are a Windows user, make the usual adjustments.
|
7
14
|
|
8
15
|
Requirements:
|
9
|
-
|
10
|
-
|
11
|
-
|
12
|
-
|
13
|
-
|
14
|
-
|
15
|
-
|
16
|
+
sm-transcript is written in Ruby and packaged as a RubyGem. Since Ruby is
|
17
|
+
not a compiled language, you will need to have Ruby installed on your
|
18
|
+
machine to run sm-transcript. You can determine if Ruby is installed by
|
19
|
+
typing "ruby -v" at a terminal prompt. It should return the version of
|
20
|
+
Ruby that is installed. If Ruby is not installed on your machine, contact
|
21
|
+
me (or your local Ruby wizard) for assistance.
|
22
|
+
|
16
23
|
Installation:
|
17
|
-
|
18
|
-
|
19
|
-
|
20
|
-
|
21
|
-
|
22
|
-
|
23
|
-
|
24
|
-
|
25
|
-
|
26
|
-
|
27
|
-
|
28
|
-
|
29
|
-
|
30
|
-
|
31
|
-
|
32
|
-
|
33
|
-
|
34
|
-
|
35
|
-
|
36
|
-
|
37
|
-
|
38
|
-
|
39
|
-
|
40
|
-
|
41
|
-
|
42
|
-
|
43
|
-
|
44
|
-
|
45
|
-
|
46
|
-
|
47
|
-
|
48
|
-
|
49
|
-
|
50
|
-
|
51
|
-
|
52
|
-
|
53
|
-
|
54
|
-
|
55
|
-
|
56
|
-
|
24
|
+
You can get sm-transcript as either a RubyGem or as source from svn.
|
25
|
+
|
26
|
+
The preferred way to install this package is as a RubyGem. You can
|
27
|
+
download and install the gem with this command:
|
28
|
+
|
29
|
+
felix$ sudo gem install [--verbose] sm-transcript
|
30
|
+
|
31
|
+
This command downloads the most recent version of the gem from rubygems.org
|
32
|
+
and makes it active. Previous versions of the gem remain installed, but
|
33
|
+
are deactivated.
|
34
|
+
|
35
|
+
You must use "sudo" to properly install the gem. If you execute "gem
|
36
|
+
install" (omitting the "sudo") the gem is installed in your home gem
|
37
|
+
repository and it isn't in your path without additional configuration.
|
38
|
+
|
39
|
+
Note: You need sudo privileges to run the command as written. If you
|
40
|
+
can't sudo, then you can install it locally and will need some additional
|
41
|
+
configuration. Contact me (or your local Ruby wizard) for assistance.
|
42
|
+
|
43
|
+
The executable is now in your path.
|
44
|
+
|
45
|
+
You can cleanly uninstall the gem with this command:
|
46
|
+
|
47
|
+
felix$ sudo gem uninstall sm-transcript
|
48
|
+
|
49
|
+
If you have access to our svn repository, you are welcome to check out the
|
50
|
+
code. Be warned that the trunk tip is not necessarily stable. It changes
|
51
|
+
frequently as enhancements (and bug fixes) are added. (Note that the
|
52
|
+
'smb_transcript' in the command line below is not a typo.)
|
53
|
+
|
54
|
+
svn co svn+ssh://svn.mit.edu/oeit-tsa/SMB/smb_transcript/trunk sm_transcript
|
55
|
+
|
56
|
+
Build the gem by running this command from the directory where you installed the
|
57
|
+
source. This is what it looks like on my machine:
|
58
|
+
|
59
|
+
felix$ rake gem
|
60
|
+
|
61
|
+
The gem will be built and put in ./pkg. You can now use the gem
|
62
|
+
installation instructions above.
|
63
|
+
|
57
64
|
|
58
65
|
Using the App:
|
59
|
-
|
60
|
-
|
61
|
-
|
62
|
-
|
63
|
-
|
64
|
-
|
65
|
-
|
66
|
-
|
67
|
-
|
68
|
-
|
69
|
-
|
70
|
-
|
71
|
-
|
72
|
-
|
73
|
-
|
74
|
-
|
75
|
-
|
76
|
-
|
77
|
-
|
78
|
-
|
79
|
-
|
66
|
+
Run with no command line parameters, the app reads *.wrd files out of
|
67
|
+
./results and writes *t1.html files to ./transcripts. These directories
|
68
|
+
are relative to where sm_transcript is called.
|
69
|
+
|
70
|
+
Note: destination files are overwritten without a warning prompt. If you
|
71
|
+
want to preserve an existing output file, rename it before running the app
|
72
|
+
again.
|
73
|
+
|
74
|
+
For example, run the app by navigating to the bin folder and entering
|
75
|
+
|
76
|
+
projects/sm_transcript/bin felix$ sm_transcript
|
77
|
+
|
78
|
+
This command run from this folder will read *.wrd files from bin/results
|
79
|
+
and write *-t1.html to bin/transcripts.
|
80
|
+
|
81
|
+
Usage: sm_transcript [options]
|
82
|
+
--srcdir PATH Read files from this folder (Default: ./results)
|
83
|
+
--destdir PATH Write files to this folder (Default: ./transcripts)
|
84
|
+
--srctype wrd | seg | txt | ttml Kind of file to process (Default: wrd)
|
85
|
+
--desttype html | ttml | datajs | json Kind of file to output (Default: html)
|
86
|
+
-h, --help Show this message
|
80
87
|
|
81
88
|
|
82
89
|
Troubleshooting:
|
83
|
-
|
84
|
-
|
85
|
-
|
86
|
-
|
87
|
-
|
88
|
-
|
89
|
-
|
90
|
-
|
91
|
-
|
92
|
-
|
93
|
-
|
94
|
-
|
95
|
-
|
96
|
-
|
97
|
-
|
98
|
-
|
90
|
+
sm-transcript requires additional gems to operate. The RubyGem
|
91
|
+
installation should install dependencies automatically. If it doesn't and
|
92
|
+
you get an error that includes
|
93
|
+
|
94
|
+
... no such file to load -- builder (LoadError)
|
95
|
+
|
96
|
+
in the first few lines when you run sm-transcript, the problem is a
|
97
|
+
missing dependent gem. (The error above indicates that the Builder
|
98
|
+
gem is missing.) Try installing the missing gem. For the error above,
|
99
|
+
the command looks like this on my computer:
|
100
|
+
|
101
|
+
felix$ sudo gem install builder
|
102
|
+
|
103
|
+
See "Required Gems" below for more information.
|
104
|
+
|
105
|
+
|
106
|
+
A warning message such as:
|
107
|
+
|
108
|
+
"WARNING: Nokogiri was built against LibXML version 2.7.6,
|
109
|
+
but has dynamically loaded 2.7.7"
|
110
|
+
|
111
|
+
may be safely ignored.
|
112
|
+
|
113
|
+
|
99
114
|
Upgrading:
|
100
|
-
|
101
|
-
|
102
|
-
|
103
|
-
|
104
|
-
|
105
|
-
|
106
|
-
|
107
|
-
|
115
|
+
You can easily upgrade by simply executing the same command you used to
|
116
|
+
install the gem. Running install again will add the newer version and make
|
117
|
+
it active. By default the most recent version is used, but older versions
|
118
|
+
are still available, simply inactive.
|
119
|
+
|
120
|
+
If you are using svn, you should already know what to do.
|
121
|
+
|
122
|
+
|
108
123
|
Required Gems:
|
109
|
-
|
110
|
-
|
111
|
-
|
112
|
-
|
113
|
-
|
114
|
-
|
115
|
-
|
116
|
-
|
117
|
-
|
118
|
-
|
119
|
-
|
120
|
-
|
121
|
-
|
122
|
-
|
123
|
-
|
124
|
-
|
124
|
+
builder - create structured data, such as XML
|
125
|
+
extensions - added for the 'require_relative' command. (To get this
|
126
|
+
command in Ruby 1.8 you need to install this gem, for Ruby 1.9
|
127
|
+
the command is already part of the core.)
|
128
|
+
htmlentities - encoding and decoding of HTML entities
|
129
|
+
json - create JSON structured data
|
130
|
+
optparse - option parsing of command line
|
131
|
+
ostruct - open data structures
|
132
|
+
ppcommand - pp is a pretty printer. It is used only for debugging
|
133
|
+
rake - make for Ruby
|
134
|
+
rubygems - support for gems (shouldn't be needed for Ruby 1.9)
|
135
|
+
shoulda - enhancement for Test::Unit
|
136
|
+
|
137
|
+
This command installs gems on OSX and Linux:
|
138
|
+
felix$ sudo gem install <gem name>
|
139
|
+
|
125
140
|
Unit Tests:
|
126
|
-
|
127
|
-
|
141
|
+
You may run all unit tests by navigating to the test folder and running
|
142
|
+
rake with no parameters (the default rake task runs all tests). On my
|
143
|
+
computer, it looks like this:
|
128
144
|
|
129
|
-
|
145
|
+
projects/sm_transcript/test felix$ rake
|
130
146
|
|
131
147
|
|
132
148
|
Release Notes:
|
133
|
-
|
149
|
+
Initial Version - runs under Ruby 1.8.x.
|
150
|
+
version 0.0.4 - fixes bug when processing .WRD files with CRLF line
|
151
|
+
endings.
|
152
|
+
version 0.0.5 - added srctype of ttml and desttype of json, fixed bug
|
153
|
+
where beginning time of word was actually for previous word.
|
134
154
|
|
135
155
|
To Do:
|
136
|
-
|
156
|
+
specify individual files for processing rather than folders
|
157
|
+
update code to run under Ruby 1.9
|
158
|
+
|
137
159
|
|
138
|
-
|
139
|
-
from a public gem repository like RubyForge.
|
140
|
-
|
160
|
+
|
data/Rakefile
CHANGED
@@ -1,31 +1,42 @@
|
|
1
|
-
# $Id: Rakefile
|
1
|
+
# $Id: Rakefile 196 2010-06-11 18:51:18Z pwilkins $
|
2
2
|
|
3
3
|
require 'rake/gempackagetask'
|
4
4
|
require 'rake'
|
5
5
|
|
6
|
-
spec = Gem::Specification.new do |s|
|
6
|
+
spec = Gem::Specification.new do |s|
|
7
7
|
s.name = "sm-transcript"
|
8
8
|
s.summary = "Convert word lists to transcripts"
|
9
9
|
s.description= File.read(File.join(File.dirname(__FILE__), 'README.txt'))
|
10
10
|
s.requirements = [ 'TBD' ]
|
11
|
-
s.version = "0.0.
|
11
|
+
s.version = "0.0.6"
|
12
12
|
s.author = "Peter Wilkins"
|
13
13
|
s.email = "pwilkins@mit.edu"
|
14
14
|
s.homepage = "http://spokenmedia.mit.edu"
|
15
15
|
s.platform = Gem::Platform::RUBY
|
16
16
|
s.required_ruby_version = '>=1.8'
|
17
17
|
s.files = Dir['lib/**/**'] +
|
18
|
-
Dir['bin/sm-transcript'] +
|
19
|
-
Dir['bin/results/PLACEHOLDER.txt'] +
|
20
|
-
Dir['bin/transcripts/PLACEHOLDER.txt'] +
|
21
|
-
Dir['test
|
18
|
+
Dir['bin/sm-transcript'] +
|
19
|
+
Dir['bin/results/PLACEHOLDER.txt'] +
|
20
|
+
Dir['bin/transcripts/PLACEHOLDER.txt'] +
|
21
|
+
Dir['test/*'] +
|
22
|
+
Dir['test/results/*'] +
|
23
|
+
Dir['test/transcripts/PLACEHOLDER.txt'] +
|
22
24
|
Dir['README.txt'] +
|
23
25
|
Dir['LICENSE.txt'] +
|
24
|
-
Dir['Rakefile']
|
25
|
-
s.files.reject! { |fn| fn.include? "process_" }
|
26
|
+
Dir['Rakefile']
|
27
|
+
s.files.reject! { |fn| fn.include? "process_" }
|
28
|
+
s.files.reject! { |fn| fn.include? 'lect1' }
|
29
|
+
s.files.reject! { |fn| fn.include? 'lect2' }
|
30
|
+
s.files.reject! { |fn| fn.include? 'lect3' }
|
31
|
+
s.files.reject! { |fn| fn.include? 'file-chksum.rb' }
|
32
|
+
s.files.reject! { |fn| fn.include? 'html_tokenizer-example.rb' }
|
33
|
+
s.files.reject! { |fn| fn.include? 'optparseExample.rb' }
|
34
|
+
s.files.reject! { |fn| fn.include? 'xml_to_sqlite.rb' }
|
35
|
+
s.files.reject! { |fn| fn.include? 'require_relative.rb' }
|
36
|
+
s.files.reject! { |fn| fn.include? '801-lect1.*' }
|
26
37
|
s.executables = [ 'sm-transcript' ]
|
27
38
|
s.test_files = Dir["test/test*.rb"]
|
28
39
|
s.has_rdoc = false
|
29
40
|
end
|
30
|
-
|
41
|
+
|
31
42
|
Rake::GemPackageTask.new(spec).define
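The chain of reject! calls above filters the file list by substring. String#include? matches literal text, not globs, so a fragment such as '801-lect1.*' only excludes names that contain those exact characters. A minimal sketch of the same filtering as a single pass, assuming the hypothetical names EXCLUDED_FRAGMENTS and packaged_files (neither is part of the gem):

```ruby
# Illustrative only: EXCLUDED_FRAGMENTS and packaged_files are hypothetical
# names, not part of sm-transcript. The fragments are copied from the
# reject! calls in the Rakefile diff above.
EXCLUDED_FRAGMENTS = %w[
  process_ lect1 lect2 lect3
  file-chksum.rb html_tokenizer-example.rb optparseExample.rb
  xml_to_sqlite.rb require_relative.rb
].freeze

# Return the files whose path contains none of the excluded fragments.
# String#include? is a literal substring test, so no glob syntax applies.
def packaged_files(files)
  files.reject do |fn|
    EXCLUDED_FRAGMENTS.any? { |fragment| fn.include?(fragment) }
  end
end

candidates = [
  'lib/sm_transcript/word.rb',
  'lib/sm_transcript/process_seg_files.rb',
  'test/results/801-lect1.xml'
]
puts packaged_files(candidates)
```

Keeping the fragments in one list makes it easier to see that 'lect1' already covers any name the '801-lect1.*' entry was presumably meant to catch.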
|
data/bin/sm-transcript
CHANGED
File without changes
|
data/lib/sm_transcript/metadata.rb
CHANGED
@@ -9,6 +9,31 @@ require_relative 'word'
|
|
9
9
|
|
10
10
|
module SmTranscript
|
11
11
|
class Metadata
|
12
|
+
|
13
|
+
# "dc-abstract"
|
14
|
+
# "dc-contributor"
|
15
|
+
# "dc-creator"
|
16
|
+
# "dc-description"
|
17
|
+
# "dc-isPartOf"
|
18
|
+
# "dc-language"
|
19
|
+
# "dc-license"
|
20
|
+
# "dc-subject"
|
21
|
+
# "dc-title"
|
22
|
+
# "dc-audience"
|
23
|
+
# "dc-available"
|
24
|
+
# "dc-created"
|
25
|
+
# "dc-extent"
|
26
|
+
# "dc-identifier"
|
27
|
+
# "dc-isReplacedBy"
|
28
|
+
# "dc-issued"
|
29
|
+
# "dc-modified"
|
30
|
+
# "dc-publisher"
|
31
|
+
# "dc-replaces"
|
32
|
+
# "dc-rightsHolder"
|
33
|
+
# "dc-spatial"
|
34
|
+
# "dc-temporal"
|
35
|
+
# "dc-type"
|
36
|
+
# "dc-valid"
|
12
37
|
|
13
38
|
def initialize(metadata)
|
14
39
|
@metadata = metadata
|
data/lib/sm_transcript/options.rb
CHANGED
@@ -11,6 +11,7 @@ module SmTranscript
|
|
11
11
|
SEG_SRC_TYPE = 'seg'
|
12
12
|
WRD_SRC_TYPE = 'wrd'
|
13
13
|
TXT_SRC_TYPE = 'txt'
|
14
|
+
TTML_SRC_TYPE = 'xml'
|
14
15
|
TTML_DEST_TYPE = 'ttml'
|
15
16
|
HTML_DEST_TYPE = 'html'
|
16
17
|
DATAJS_DEST_TYPE = 'datajs'
|
@@ -58,12 +59,12 @@ module SmTranscript
|
|
58
59
|
@options.destdir = @destdir = ddir
|
59
60
|
end
|
60
61
|
|
61
|
-
opts.on("--srctype seg | wrd | txt",
|
62
|
-
"Kind of file to process (Default:
|
62
|
+
opts.on("--srctype seg | wrd | txt | xml",
|
63
|
+
"Kind of file to process (Default: wrd)") do |stype|
|
63
64
|
@options.srctype = @srctype = stype
|
64
65
|
end
|
65
66
|
|
66
|
-
opts.on("--desttype html | ttml | datajs",
|
67
|
+
opts.on("--desttype html | ttml | datajs | json",
|
67
68
|
"Kind of format to output (Default: html)") do |dtype|
|
68
69
|
@options.desttype = @desttype = dtype
|
69
70
|
end
|
@@ -73,6 +74,11 @@ module SmTranscript
|
|
73
74
|
return
|
74
75
|
end
|
75
76
|
|
77
|
+
opts.on("-v", "--version", "Show version") do
|
78
|
+
puts "\nsm-transcript gem version: 0.0.5rc"
|
79
|
+
return
|
80
|
+
end
|
81
|
+
|
76
82
|
begin
|
77
83
|
argv = ["-h"] if argv.empty?
|
78
84
|
opts.parse!(argv)
|
data/lib/sm_transcript/runner.rb
CHANGED
@@ -7,6 +7,7 @@ require 'extensions/kernel'
|
|
7
7
|
require_relative 'options'
|
8
8
|
require_relative 'seg_reader'
|
9
9
|
require_relative 'wrd_reader'
|
10
|
+
require_relative 'ttml_reader'
|
10
11
|
require_relative 'transcript'
|
11
12
|
require_relative 'metadata'
|
12
13
|
require_relative 'metadata_reader'
|
@@ -23,6 +24,9 @@ module SmTranscript
|
|
23
24
|
def run
|
24
25
|
# collect files to process
|
25
26
|
begin
|
27
|
+
# p "working directory is #{File.new(__FILE__).path}"
|
28
|
+
# p "reading from #{@options.srcdir}"
|
29
|
+
# p "writing to #{@options.destdir}"
|
26
30
|
raise "source directory doesn't exist" unless FileTest.exists?(@options.srcdir)
|
27
31
|
raise "destination directory doesn't exist" unless FileTest.exists?(@options.destdir)
|
28
32
|
|
@@ -32,6 +36,8 @@ module SmTranscript
|
|
32
36
|
case @options.srctype
|
33
37
|
when SmTranscript::Options::SEG_SRC_TYPE
|
34
38
|
words = SmTranscript::SegReader.from_file(x).words
|
39
|
+
when SmTranscript::Options::TTML_SRC_TYPE
|
40
|
+
words = SmTranscript::TtmlReader.from_file(x).words
|
35
41
|
when SmTranscript::Options::TXT_SRC_TYPE
|
36
42
|
md = SmTranscript::MetadataReader.from_file(x).metadata
|
37
43
|
else SmTranscript::Options::WRD_SRC_TYPE
|
data/lib/sm_transcript/seg_reader.rb
CHANGED
@@ -34,7 +34,7 @@ module SmTranscript
|
|
34
34
|
@root.elements.each("/document/lecture/segment") do |s|
|
35
35
|
s.text.scan(/^\d* \d* [\w']*$/) do |t|
|
36
36
|
arr = t.split
|
37
|
-
@words << SmTranscript::Word.new(arr[0], arr[1], arr[2])
|
37
|
+
@words << SmTranscript::Word.new(arr[0], arr[1], arr[1].to_i - arr[0].to_i, arr[2])
|
38
38
|
end
|
39
39
|
end
|
40
40
|
end
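The change above passes a duration (end time minus start time) as a third argument to Word.new. A minimal sketch of that parsing step in isolation, assuming a hypothetical SegWord struct in place of the gem's Word class and a hypothetical helper words_from_segment; the regular expression is the one shown in the diff, and the field names are guesses:

```ruby
# Hypothetical stand-in for SmTranscript::Word; the argument order
# (start, end, duration, word) follows the changed line in the diff.
SegWord = Struct.new(:start_time, :end_time, :duration, :word)

# Extract "start end word" lines from a segment's text and compute each
# word's duration in the same units as the timestamps.
def words_from_segment(text)
  text.scan(/^\d* \d* [\w']*$/).map do |line|
    start_time, end_time, word = line.split
    SegWord.new(start_time, end_time, end_time.to_i - start_time.to_i, word)
  end
end

segment_text = "100 250 hello\n300 420 don't"
words_from_segment(segment_text).each do |w|
  puts "#{w.word}: #{w.duration}"
end
```

The timestamps stay as strings, as in the diff, and only the duration is converted to an integer.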
|
data/lib/sm_transcript/transcript.rb
CHANGED
@@ -5,12 +5,14 @@
|
|
5
5
|
require "rexml/document"
|
6
6
|
require 'extensions/kernel'
|
7
7
|
require 'builder'
|
8
|
+
require 'sqlite3'
|
8
9
|
require_relative 'word'
|
9
10
|
|
10
11
|
module SmTranscript
|
11
12
|
class Transcript
|
12
13
|
|
13
14
|
@words = Array.new()
|
15
|
+
attr_reader :words
|
14
16
|
|
15
17
|
def initialize(word_arr)
|
16
18
|
@metadata = {}
|
@@ -27,7 +29,7 @@ module SmTranscript
|
|
27
29
|
prev_start_time = 0
|
28
30
|
start_time = 0
|
29
31
|
@words.each do |w|
|
30
|
-
# get the start time and reduce its granularity so that multiple
|
32
|
+
# get the start time and reduce its granularity so that multiple
|
31
33
|
# words fall within a <span> element.
|
32
34
|
start_time = w.start_time.to_i/1000
|
33
35
|
if start_time.to_i == prev_start_time.to_i # append word
|
@@ -35,16 +37,16 @@ module SmTranscript
|
|
35
37
|
else # create a new span_element
|
36
38
|
# since prev_start_time is zero on first line, this avoids
|
37
39
|
# writing a closing </span> with no opening <span>
|
40
|
+
span_element = cleanup_phrase(span_element)
|
38
41
|
f.puts span_element << "</span> " unless prev_start_time == 0
|
39
|
-
|
40
|
-
|
41
|
-
prev_start_time = start_time
|
42
|
+
span_element = "<span id='T#{start_time}'>#{w.word}"
|
43
|
+
prev_start_time = start_time
|
42
44
|
end
|
43
45
|
end
|
44
|
-
# In the block above, the last word isn't written if
|
45
|
-
# the start_time and prev_start_time are the same.
|
46
|
-
f.puts span_element << "</span> " unless start_time != prev_start_time
|
47
|
-
|
46
|
+
# In the block above, the last word isn't written if
|
47
|
+
# the start_time and prev_start_time are the same.
|
48
|
+
f.puts span_element << "</span> " unless start_time != prev_start_time
|
49
|
+
f.close
|
48
50
|
end
|
49
51
|
end # write_html()
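The grouping used by write_html (words whose start time falls in the same whole second share one <span>) can be sketched on its own. This is not the gem's code: TimedWord and group_by_second are hypothetical names, and Enumerable#chunk_while needs Ruby 2.3 or later, whereas the gem targets Ruby 1.8.

```ruby
# Hypothetical stand-in for the gem's Word objects; times are in milliseconds.
TimedWord = Struct.new(:start_time, :word)

# Group consecutive words that start within the same whole second and
# build one id/text pair per group, mirroring the 'T<seconds>' span ids.
def group_by_second(words)
  words
    .chunk_while { |a, b| a.start_time.to_i / 1000 == b.start_time.to_i / 1000 }
    .map do |group|
      { id: "T#{group.first.start_time.to_i / 1000}",
        text: group.map(&:word).join(' ') }
    end
end

words = [
  TimedWord.new(1200, 'hello'),
  TimedWord.new(1800, 'there'),
  TimedWord.new(2100, 'world')
]
group_by_second(words).each { |g| puts "<span id='#{g[:id]}'>#{g[:text]}</span>" }
```

Building the groups first and writing them afterwards avoids the special cases for the first and last span that the loop above has to handle.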
|
50
52
|
|
@@ -57,13 +59,13 @@ module SmTranscript
|
|
57
59
|
buf = ""
|
58
60
|
bldr = Builder::XmlMarkup.new( :target => buf, :indent => 2 )
|
59
61
|
bldr.instruct!
|
60
|
-
bldr.tt("xmlns" => "http://www.w3.org/2006/04/ttaf1",
|
62
|
+
bldr.tt("xmlns" => "http://www.w3.org/2006/04/ttaf1",
|
61
63
|
"xmlns:tts" => "http://www.w3.org/ns/ttml#styling",
|
62
64
|
"xmlns:ttm" => "http://www.w3.org/ns/ttml#metadata",
|
63
|
-
"xml:lang" => "en" ) {
|
65
|
+
"xml:lang" => "en" ) {
|
64
66
|
bldr.head { |b|
|
65
|
-
b.ttm :title, '
|
66
|
-
b.ttm :desc, '
|
67
|
+
b.ttm :title, 'The title of this transcript'
|
68
|
+
b.ttm :desc, 'The description of this transcript'
|
67
69
|
}
|
68
70
|
bldr.body {
|
69
71
|
bldr.div {
|
@@ -72,31 +74,37 @@ module SmTranscript
|
|
72
74
|
start_ms = end_ms = 0
|
73
75
|
start_secs = 0
|
74
76
|
@words.each do |w|
|
75
|
-
# get the start time and reduce its granularity so that
|
76
|
-
# words
|
77
|
+
# get the start time and reduce its granularity so that
|
78
|
+
# multiple words form a phrase.
|
77
79
|
start_secs = w.start_time.to_i/1000
|
78
80
|
if start_secs == prev_start_secs # append word
|
79
|
-
end_ms
|
81
|
+
end_ms = w.end_time.to_i
|
80
82
|
span_element << " #{w.word}"
|
81
83
|
else # create a new span_element
|
82
|
-
|
83
|
-
|
84
|
+
start_secs = w.start_time.to_i/1000
|
85
|
+
bldr.p( span_element,
|
86
|
+
"xml:id" => "T#{start_secs.to_s}",
|
87
|
+
"begin" => "#{start_ms.to_s}ms",
|
88
|
+
"dur" => "#{(end_ms - start_ms).to_s}ms",
|
89
|
+
"end" => "#{end_ms.to_s}ms" )
|
84
90
|
|
85
91
|
start_ms = w.start_time.to_i
|
86
92
|
end_ms = w.end_time.to_i
|
87
|
-
span_element = " #{w.word}"
|
88
|
-
prev_start_secs = start_secs
|
93
|
+
span_element = " #{w.word}"
|
94
|
+
prev_start_secs = start_secs
|
89
95
|
end
|
90
|
-
end
|
91
|
-
|
92
|
-
# the
|
93
|
-
|
94
|
-
|
95
|
-
"
|
96
|
-
"
|
96
|
+
end # @words.each
|
97
|
+
|
98
|
+
# In the block above, the last word isn't written if
|
99
|
+
# the start_time and prev_start_time are the same.
|
100
|
+
bldr.p( span_element,
|
101
|
+
"xml:id" => "T#{start_secs.to_s}",
|
102
|
+
"begin" => "#{start_ms.to_s}ms",
|
103
|
+
"dur" => "#{(end_ms - start_ms).to_s}ms",
|
104
|
+
"end" => "#{end_ms.to_s}ms" ) unless start_secs != prev_start_secs
|
97
105
|
}
|
98
106
|
}
|
99
|
-
}
|
107
|
+
}
|
100
108
|
# p buf
|
101
109
|
File.open(dest_file, "w") do |f|
|
102
110
|
f.puts buf
|
@@ -104,27 +112,66 @@ module SmTranscript
|
|
104
112
|
end
|
105
113
|
end
|
106
114
|
|
107
|
-
|
108
|
-
#
|
115
|
+
|
116
|
+
# The JSON format is defined at http://url/of/document. It is the format
|
117
|
+
# of the static timed-text document that is passed to the player.
|
118
|
+
def write_json(dest_file)
|
119
|
+
|
120
|
+
end # write_json()
|
121
|
+
|
122
|
+
|
123
|
+
# Store transcript in a Sqlite database (though the essence of this
|
124
|
+
# method should work for all relational dbs). Unlike some of the other
|
125
|
+
# write_xxx() methods, this one requires a @metadata array.
|
126
|
+
# param db_id - for SQLite, this is a filename.
|
127
|
+
# video_id - is a unique identifier for the video
|
128
|
+
|
129
|
+
def write_sqlite(db_id)
|
130
|
+
db_id = "sm-transcript"
|
131
|
+
db = SQLite3::Database.open(db_id + '.sqlite3')
|
132
|
+
|
133
|
+
fields = XPath.match(doc.root, inner_node_name + '[1]/*').map{|node| node.name}
|
134
|
+
field_def = fields.map {|x| "%s TEXT" % x}.join(', ')
|
135
|
+
|
136
|
+
end # write_sqlite()
|
137
|
+
|
138
|
+
|
139
|
+
private
|
140
|
+
|
141
|
+
# Times are expressed in milliseconds, far more granularity than is
|
142
|
+
# useful for most user-facing apps, especially since the player reports
|
109
143
|
# elapsed time only ten times a second.
|
110
|
-
# By reducing the time by orders of magnitude provides these benefits:
|
144
|
+
# By reducing the time by orders of magnitude provides these benefits:
|
111
145
|
# 1) Multiple words fall within a <span> element.
|
112
146
|
# 2) Better mapping between start times and player time tracking
|
113
147
|
def words_to_phrase(start_time)
|
114
148
|
start_time.to_i/1000
|
115
149
|
end # words_to_phrase
|
116
|
-
|
117
|
-
def get_time_expression(milliseconds)
|
118
|
-
|
119
|
-
end
|
120
|
-
|
121
|
-
# There are some word combinations that occur with such regularity that
|
150
|
+
|
151
|
+
# def get_time_expression(milliseconds)
|
152
|
+
# milliseconds
|
153
|
+
# end
|
154
|
+
|
155
|
+
# There are some word combinations that occur with such regularity that
|
122
156
|
# they call out to be fixed. For example, "m I t" is unambiguously MIT.
|
123
|
-
# These edits can only be done when the phrase has been assembled
|
157
|
+
# These edits can only be done when the phrase has been assembled since
|
158
|
+
# each letter is treated as an individual word.
|
124
159
|
def cleanup_phrase(phrase)
|
125
|
-
phrase
|
160
|
+
phrase.gsub(/m I t/, 'MIT')
|
161
|
+
phrase.gsub(/o e I t/, 'OEIT')
|
162
|
+
end
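One point worth checking in cleanup_phrase as added above: String#gsub returns a new string and leaves the receiver unchanged, so the result of the first substitution is discarded and only the second call's result is returned. A minimal sketch of a chained version, written as a standalone method for illustration (the patterns are the ones in the diff; this is not a confirmed fix from the author):

```ruby
# Illustrative standalone version. Chaining the calls passes the result of
# the first substitution into the second, so both replacements apply.
def cleanup_phrase(phrase)
  phrase.gsub(/m I t/, 'MIT').gsub(/o e I t/, 'OEIT')
end

puts cleanup_phrase('welcome to m I t and o e I t')
```

The patterns have no word boundaries, so they could also match inside longer text such as "from I think"; anchoring them with \b would be one possible refinement.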
|
163
|
+
|
164
|
+
# remove HTML tags from text. requires classes from ActionPack
|
165
|
+
def strip_tags(html)
|
166
|
+
return html if html.empty? || !html.include?('<')
|
167
|
+
output = ""
|
168
|
+
tokenizer = HTML::Tokenizer.new(html)
|
169
|
+
while token = tokenizer.next
|
170
|
+
node = HTML::Node.parse(nil, 0, 0, token, false)
|
171
|
+
output += token unless (node.kind_of? HTML::Tag) or (token =~ /^<!/)
|
172
|
+
end
|
173
|
+
return output
|
126
174
|
end
|
127
|
-
|
128
175
|
|
129
176
|
end # class
|
130
177
|
end
|