interscript 0.1.4 → 0.1.5

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 6228d56d873603190ed290757f3e8056258cb88761b93405e15f06d25a08cba5
4
- data.tar.gz: 0ceb25cb5862530d2c4798f921c4eaa5e9d999844266e8eab1d6bcd3d4b96842
3
+ metadata.gz: 3eb5c858ad7332995b5f8ed1c7766305f6be5fb5524e91de7df1d0d3d770a3ee
4
+ data.tar.gz: 80be952635dd65cf1e23107f70c44557ec9711590759e610a042f2c746f67d4e
5
5
  SHA512:
6
- metadata.gz: 301342d3cc50849fcb7f14c2b8b42404929cef1ed47192bc9cda236333b7b147c55d7b98648becc9c84d1b8790e6dbeffa30ecb27f472adf803463416e9babdc
7
- data.tar.gz: cf95797a136ff95dfd3edc66378c0b6320f6a87325206d56c5e9acda381733b4168d8f651cb15a30c8a5b40add43bd3b2eeafde24bc19a959de0eb6983ad8359
6
+ metadata.gz: bdb63b2ea8e6337a31879e7c1331f6e9a95dd205580bc8656cdb806e624b7cf946c17d300125470494c1643332b3f5f1a3d713a040bd6ee907dcba5d189db95e
7
+ data.tar.gz: 1aa1b254b253f148b4a5cc316027d3e67ca1ba68e866ebedbc025dc9c5f5c78680b12c24bc52177d65b2c9a6dd262ee97e34f52b4d32f83dfc9c232dc2c2d0de
@@ -1,6 +1,7 @@
1
1
  = Interscript: Interoperable Script Conversion Systems, with a Ruby implementation
2
2
 
3
- image:https://github.com/interscript/interscript/workflows/test/badge.svg["Build Status", link="https://github.com/interscript/interscript/actions?workflow=test"]
3
+ image:https://github.com/interscript/interscript/workflows/test/badge.svg["Ruby build status", link="https://github.com/interscript/interscript/actions?workflow=test"]
4
+ image:https://github.com/interscript/interscript/workflows/js/badge.svg["JavaScript build status", link="https://github.com/interscript/interscript/actions?workflow=js"]
4
5
 
5
6
  == Introduction
6
7
 
@@ -22,9 +23,9 @@ The goal is to achieve interoperable transliteration schemes allowing quality co
22
23
  These transliteration systems are used in the demo:
23
24
 
24
25
  `bgnpcgn-rus-Cyrl-Latn-1947`:: BGN/PCGN Romanization of Russian
25
- `iso-rus-Cyrl-Latn-iso9`:: ISO 9 Romanization of Russian
26
+ `iso-rus-Cyrl-Latn-9-1995`:: ISO 9 Romanization of Russian
26
27
  `icao-rus-Cyrl-Latn-9303`:: ICAO MRZ Romanization of Russian
27
- `bas-rus-Cyrl-Latn-bss`:: Bulgaria Academy of Science Streamlined System for Russian
28
+ `bas-rus-Cyrl-Latn-2017-bss`:: Bulgaria Academy of Science Streamlined System for Russian
28
29
 
29
30
  image:demo/20191118-interscript-demo-cast.gif["interscript screencast"]
30
31
 
@@ -95,7 +96,7 @@ interscript rus-Cyrl.txt \
95
96
  --output=bgnpcgn-rus-Latn.txt
96
97
 
97
98
  interscript rus-Cyrl.txt \
98
- --system=iso-rus-Cyrl-Latn-iso9 \
99
+ --system=iso-rus-Cyrl-Latn-9-1995 \
99
100
  --output=iso-rus-Latn.txt
100
101
 
101
102
  interscript rus-Cyrl.txt \
@@ -103,7 +104,7 @@ interscript rus-Cyrl.txt \
103
104
  --output=icao-rus-Latn.txt
104
105
 
105
106
  interscript rus-Cyrl.txt \
106
- --system=bas-rus-Cyrl-Latn-bss \
107
+ --system=bas-rus-Cyrl-Latn-2017-bss \
107
108
  --output=bas-rus-Latn.txt
108
109
  ----
109
110
 
@@ -149,7 +150,7 @@ tests:
149
150
 
150
151
  map:
151
152
  rules:
152
- - pattern: (?<=[АаЕеЁёИиОоУуЫыЭэЮюЯяЙйЪъЬь])\u0415 # Е after a, e, ё, и, о, у, ы, э, ю, я, й, ъ, ь
153
+ - pattern: (?<=[АаЕеЁёИиОоУуЫыЭэЮюЯяЙйЪъЬь])\u0415 # Е after a, e, ё, и, о, у, ы, э, ю, я, й, ъ, ь
153
154
  result: Ye
154
155
  - pattern: \b\u0415 # Е initially
155
156
  result: Ye
@@ -167,7 +168,7 @@ The subsection `rules` is placed under the `map` key. All rules are applied in o
167
168
 
168
169
  Each rule has `pattern` and `result` elements.
169
170
 
170
- Pattern is a regex expression. It should be representing as a string without `//` or `%r{}` parentheses. For example `\b\u0415`. In case a rule is depend on previous or next content, lookahead or lookbehind could be used. For example a rule with the pattern `(?<=[АаЕеЁёИиОоУуЫыЭэЮюЯяЙйЪъЬь])\u0415` find every Е after upper or lower case symbols a, e, ё, и, о, у, ы, э, ю, я, й, ъ, ь.
171
+ Pattern is a regex expression. It should be representing as a string without `//` or `%r{}` parentheses. For example `\b\u0415`. In case a rule is depend on previous or next content, lookahead or lookbehind could be used. For example a rule with the pattern `(?<=[АаЕеЁёИиОоУуЫыЭэЮюЯяЙйЪъЬь])\u0415` find every Е after upper or lower case symbols a, e, ё, и, о, у, ы, э, ю, я, й, ъ, ь.
171
172
 
172
173
  Result is a replacement a for pattern's match. It can contain a string, an Unicode characters specified by a hexadecimal number, a captured group reference. String with hexadecimal number or captured group reference should be double quoted. For example `"Y\u00eb"` or `"\\1\u00b7\\2"`. Captured group are referred by double backslash and group's number.
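
To make the `pattern` / `result` description above concrete, here is a minimal Ruby sketch of applying such rules in order. It is not the gem's own code: the `RULES` list and the `apply_rules` helper are hypothetical, the lookbehind class is shortened to two letters written as `\u` escapes, and the second rule is invented only to show a captured group reference.

```ruby
# Hypothetical rules in the shape the README describes: `pattern` is regex
# source without // or %r{} delimiters, `result` is the replacement string.
RULES = [
  # Cyrillic Е (U+0415) after А (U+0410) or О (U+041E); shortened class.
  { "pattern" => '(?<=[\u0410\u041E])\u0415', "result" => "Ye" },
  # Invented rule: a captured group reference is written as \\1, \\2 in a
  # double-quoted string, so the replacement seen by gsub is \1·\2.
  { "pattern" => '([A-Za-z])-([A-Za-z])', "result" => "\\1\u00b7\\2" },
].freeze

# Apply every rule in list order, feeding each result into the next rule.
def apply_rules(text, rules)
  rules.reduce(text) do |acc, rule|
    acc.gsub(Regexp.new(rule["pattern"]), rule["result"])
  end
end

puts apply_rules("\u041F\u041E\u0415\u0417\u0414", RULES) # ПОЕЗД with Е after О
puts apply_rules("a-b", RULES)
```

Because each rule sees the output of the previous one, the order of the list matters, which matches the README's statement that all rules are applied in order.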
173
174
 
@@ -256,7 +257,7 @@ the system code identifying a script conversion system has the following compone
256
257
  e.g. `bgnpcgn-rus-Cyrl-Latn-1947`:
257
258
 
258
259
  `bgnpcgn`:: the authority identifier
259
- `rus`:: an ISO 639-2 3-letter language code that this system applies to
260
+ `rus`:: an ISO 639-{1,2,3,5} language code that this system applies to (For 639-2, use (T) code)
260
261
  `Cyrl`:: an ISO 15924 script code, identifying the source script
261
262
  `Latn`:: an ISO 15924 script code, identifying the target script
262
263
  `1947`:: an identifier unit within the authority to identify this system
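
The component list above can be sketched as a small parser. This is only an illustration: `SystemCode` and `parse_system_code` are hypothetical names, not part of the gem, and treating everything after the fourth hyphen as the identifier is an assumption based on the renamed codes in this diff (`9-1995`, `2017-bss`).

```ruby
# Hypothetical value object for the five components of a system code.
SystemCode = Struct.new(:authority, :language, :source_script,
                        :target_script, :id)

# Split on the first four hyphens only, so an identifier that itself
# contains a hyphen (e.g. "2017-bss") stays in one piece.
def parse_system_code(code)
  parts = code.split("-", 5)
  unless parts.size == 5
    raise ArgumentError, "expected 5 components in #{code.inspect}"
  end
  SystemCode.new(*parts)
end

code = parse_system_code("bas-rus-Cyrl-Latn-2017-bss")
puts code.authority     # authority identifier
puts code.language      # language code
puts code.source_script # ISO 15924 source script
puts code.target_script # ISO 15924 target script
puts code.id            # identifier unit within the authority
```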
@@ -0,0 +1,2 @@
1
+ require "opal"
2
+ require "interscript"
@@ -1,56 +1,32 @@
1
1
  # frozen_string_literal: true
2
2
 
3
- require "yaml"
3
+ require "interscript/opal/maps" if RUBY_ENGINE == "opal"
4
4
  require "interscript/mapping"
5
5
 
6
6
  # Transliteration
7
7
  module Interscript
8
8
 
9
- class << self
10
- def root_path
11
- @root_path ||= Pathname.new(File.dirname(__dir__))
12
- end
13
-
14
- def transliterate_file(system_code, input_file, output_file, maps)
15
- input = File.read(input_file)
16
- output = transliterate(system_code, input, maps)
17
-
18
- File.open(output_file, 'w') do |f|
19
- f.puts(output)
20
- end
21
- puts "Output written to: #{output_file}"
22
- end
9
+ class InvalidSystemError < StandardError; end
10
+ class ExternalProcessNotRecognizedError < StandardError; end
11
+ class ExternalProcessUnavailableError < StandardError; end
23
12
 
24
- def import_python_modules
25
- begin
26
- pyimport :g2pwrapper
27
- rescue
28
- pyimport :sys
29
- sys.path.append(root_path.to_s+"/lib/")
30
- pyimport :g2pwrapper
31
- end
32
- end
13
+ if RUBY_ENGINE == 'opal'
14
+ require "interscript/opal"
15
+ extend Opal
16
+ else
17
+ require "interscript/fs"
18
+ extend Fs
19
+ end
33
20
 
34
- def external_process(process_name, string)
35
- import_python_modules
36
- case process_name
37
- when 'sequitur.pythainlp_lexicon'
38
- return g2pwrapper.transliterate('pythainlp_lexicon', string)
39
- when 'sequitur.wiktionary_phonemic'
40
- return g2pwrapper.transliterate('wiktionary_phonemic', string)
41
- else
42
- puts "Invalid Process"
43
- end
44
- end
21
+ class << self
45
22
 
46
23
  def transliterate(system_code, string, maps={})
47
- if (!maps.has_key?system_code)
24
+ unless maps.has_key? system_code
48
25
  maps[system_code] = Interscript::Mapping.for(system_code)
49
26
  end
50
27
  # mapping = Interscript::Mapping.for(system_code)
51
28
  mapping = maps[system_code]
52
29
 
53
-
54
30
  # First, apply chained transliteration as specified in the list `chain`
55
31
  chain = mapping.chain.dup
56
32
  while chain.length > 0
@@ -69,11 +45,7 @@ module Interscript
69
45
  dictmap = mapping.dictionary_hash
70
46
  trie = mapping.dictionary_trie
71
47
 
72
- # Segmentation
73
- string = external_process(mapping.segmentation, string) if mapping.segmentation
74
-
75
- # Transliteration/Transcription
76
- string = external_process(mapping.transcription, string) if mapping.transcription
48
+ string = external_processing(mapping, string)
77
49
 
78
50
  pos = 0
79
51
  while pos < string.to_s.size
@@ -81,14 +53,15 @@ module Interscript
81
53
  wordmatch = ""
82
54
 
83
55
  # Using Trie, find the longest matching substring
84
- while (pos + m < string.to_s.size) && (trie.partial_word?string[pos..pos+m])
56
+ while (pos + m < string.to_s.size) && (trie.partial_word?string[pos..pos+m])
85
57
  wordmatch = string[pos..pos+m] if trie.word?string[pos..pos+m]
86
58
  m += 1
87
59
  end
60
+
88
61
  m = wordmatch.length
89
62
  if m > 0
90
63
  repl = dictmap[string[pos..pos+m-1]]
91
- string[pos..pos+m-1] = repl
64
+ string = sub_replace(string, pos, m, repl)
92
65
  pos += repl.length
93
66
  else
94
67
  pos += 1
@@ -109,32 +82,48 @@ module Interscript
109
82
  # offsets[pos] += result.size - match[0].size
110
83
  # end
111
84
  # end
85
+
112
86
  mapping.rules.each do |r|
113
- output.gsub!(/#{r['pattern']}/, r['result'])
87
+ next unless output
88
+ re = mkregexp(r["pattern"])
89
+ output = output.gsub(re, r["result"])
114
90
  end
115
91
 
116
92
  charmap.each do |k, v|
117
93
  while (match = output&.match(/#{k}/))
118
94
  pos = match.offset(0).first
119
95
  result = !downcase && up_case_around?(output, pos) ? v.upcase : v
120
- result = result[0] if result.is_a?(Array) # if more than one, choose the first one
121
- output[pos, match[0].size] = add_separator(separator, pos, result)
96
+
97
+ # if more than one, choose the first one
98
+ result = result[0] if result.is_a?(Array)
99
+
100
+ output = sub_replace(
101
+ output,
102
+ pos,
103
+ match[0].size,
104
+ add_separator(separator, pos, result)
105
+ )
122
106
  end
123
107
  end
124
108
 
125
109
  mapping.postrules.each do |r|
126
- output.gsub!(/#{r['pattern']}/, r['result'])
110
+ next unless output
111
+ re = mkregexp(r["pattern"])
112
+ output = output.gsub(re, r["result"])
127
113
  end
128
114
 
129
- if output
130
- output.sub!(/^(.)/, &:upcase) if title_case
131
- if word_separator != ''
132
- output.gsub!(/#{word_separator}#{separator}/,word_separator)
133
- output.gsub!(/#{word_separator}(.)/, &:upcase) if title_case
115
+ return unless output
116
+
117
+ output = output.sub(/^(.)/, &:upcase) if title_case
118
+ if word_separator != ''
119
+ output = output.gsub(/#{word_separator}#{separator}/u, word_separator)
120
+
121
+ if title_case
122
+ output = output.gsub(/#{word_separator}(.)/u, &:upcase)
134
123
  end
135
124
  end
136
125
 
137
- output ? output.unicode_normalize : output
126
+ output.unicode_normalize
138
127
  end
139
128
 
140
129
  private
@@ -147,11 +136,11 @@ module Interscript
147
136
  return false if string[pos] == string[pos].downcase
148
137
 
149
138
  i = pos - 1
150
- i -= 1 while i.positive? && string[i] !~ /[[:alpha:]]/
139
+ i -= 1 while i.positive? && string[i] !~ Regexp.new(ALPHA_REGEXP)
151
140
  before = i >= 0 && i < pos ? string[i].to_s.strip : ''
152
141
 
153
142
  i = pos + 1
154
- i += 1 while i < string.size - 1 && string[i] !~ /[[:alpha:]]/
143
+ i += 1 while i < string.size - 1 && string[i] !~ Regexp.new(ALPHA_REGEXP)
155
144
  after = i > pos ? string[i].to_s.strip : ''
156
145
 
157
146
  before_uc = !before.empty? && before == before.upcase
@@ -159,5 +148,6 @@ module Interscript
159
148
  # before_uc && (after.empty? || after_uc) || after_uc && (before.empty? || before_uc)
160
149
  before_uc || after_uc
161
150
  end
151
+
162
152
  end
163
153
  end
@@ -1,16 +1,17 @@
1
1
  require 'thor'
2
2
  require 'interscript'
3
-
3
+ require 'json'
4
4
  module Interscript
5
5
  # Command line interface
6
6
  class Command < Thor
7
7
  desc '<file>', 'Transliterate text'
8
8
  option :system, aliases: '-s', required: true, desc: 'Transliteration system'
9
9
  option :output, aliases: '-o', required: false, desc: 'Output file'
10
+ option :map, aliases: '-m', required: false, default: "{}", desc: 'Transliteration mapping json'
10
11
 
11
12
  def translit(input)
12
13
  if options[:output]
13
- Interscript.transliterate_file(options[:system], input, options[:output])
14
+ Interscript.transliterate_file(options[:system], input, options[:output], JSON.parse(options[:map]))
14
15
  else
15
16
  puts Interscript.transliterate(options[:system], IO.read(input))
16
17
  end
@@ -0,0 +1,69 @@
1
+ module Interscript
2
+ module Fs
3
+ ALPHA_REGEXP = '[[:alpha:]]'
4
+
5
+ def sub_replace(string, pos, size, repl)
6
+ string[pos..pos + size - 1] = repl
7
+ string
8
+ end
9
+
10
+ def root_path
11
+ @root_path ||= Pathname.new(File.join(File.dirname(__dir__), ".."))
12
+ end
13
+
14
+ def transliterate_file(system_code, input_file, output_file, maps={})
15
+ input = File.read(input_file)
16
+ output = transliterate(system_code, input, maps)
17
+
18
+ File.open(output_file, 'w') do |f|
19
+ f.puts(output)
20
+ end
21
+
22
+ puts "Output written to: #{output_file}"
23
+ output_file
24
+ end
25
+
26
+ def import_python_modules
27
+ begin
28
+ pyimport :g2pwrapper
29
+ rescue
30
+ pyimport :sys
31
+ sys.path.append(root_path.to_s + "/lib/")
32
+ pyimport :g2pwrapper
33
+ end
34
+ end
35
+
36
+ def external_process(process_name, string)
37
+ import_python_modules
38
+
39
+ case process_name
40
+ when 'sequitur.pythainlp_lexicon'
41
+ return g2pwrapper.transliterate('pythainlp_lexicon', string)
42
+ when 'sequitur.wiktionary_phonemic'
43
+ return g2pwrapper.transliterate('wiktionary_phonemic', string)
44
+ else
45
+ raise ExternalProcessNotRecognizedError.new
46
+ end
47
+
48
+ rescue
49
+ raise ExternalProcessUnavailableError.new
50
+ end
51
+
52
+ def external_processing(mapping, string)
53
+ # Segmentation
54
+ string = external_process(mapping.segmentation, string) if mapping.segmentation
55
+
56
+ # Transliteration/Transcription
57
+ string = external_process(mapping.transcription, string) if mapping.transcription
58
+
59
+ string
60
+ end
61
+
62
+ private
63
+
64
+ def mkregexp(regexpstring)
65
+ /#{regexpstring}/u
66
+ end
67
+
68
+ end
69
+ end
@@ -1,7 +1,8 @@
1
1
  require 'rambling-trie'
2
+ require 'yaml'
3
+ require 'json'
2
4
 
3
5
  module Interscript
4
- class InvalidSystemError < StandardError; end
5
6
 
6
7
  class Mapping
7
8
  attr_reader(
@@ -35,7 +36,10 @@ module Interscript
35
36
  def initialize(system_code, options = {})
36
37
  @system_code = system_code
37
38
  @depth = options.fetch(:depth, 0).to_i
38
- @system_path = options.fetch(:system_code, default_path)
39
+
40
+ unless RUBY_ENGINE == 'opal'
41
+ @system_path = options.fetch(:system_code, default_path)
42
+ end
39
43
 
40
44
  load_and_serialize_system_mappings
41
45
  end
@@ -45,10 +49,10 @@ module Interscript
45
49
  end
46
50
 
47
51
  def load_and_serialize_system_mappings
48
- if depth < 5
49
- mappings = load_system_mappings
50
- serialize_system_mappings(mappings)
51
- end
52
+ return if depth >= 5
53
+
54
+ mappings = load_system_mappings
55
+ serialize_system_mappings(mappings)
52
56
  end
53
57
 
54
58
  private
@@ -64,6 +68,18 @@ module Interscript
64
68
  end
65
69
 
66
70
  def load_system_mappings
71
+ if RUBY_ENGINE == 'opal'
72
+ load_opal_mappings
73
+ else
74
+ load_fs_mappings
75
+ end
76
+ end
77
+
78
+ def load_opal_mappings
79
+ JSON.parse(`InterscriptMaps[#{system_code}]`)
80
+ end
81
+
82
+ def load_fs_mappings
67
83
  YAML.load_file(system_path.join(system_code_file))
68
84
  rescue Errno::ENOENT
69
85
  raise Interscript::InvalidSystemError.new("No system mappings found")
@@ -100,26 +116,27 @@ module Interscript
100
116
 
101
117
  def include_inherited_mappings(mappings)
102
118
  inherit_systems = [].push(mappings["map"]["inherit"]).flatten
103
- for inherit_system in inherit_systems do
104
- if (inherit_system)
105
- inherited_mapping = Mapping.for(inherit_system, depth: depth + 1)
106
-
107
- @rules = [inherited_mapping.rules, rules].flatten
108
- @postrules = [inherited_mapping.postrules, postrules].flatten
109
- @characters = (inherited_mapping.characters|| {}).merge(characters)
110
- @dictionary = (inherited_mapping.dictionary|| {}).merge(dictionary)
111
- end
119
+
120
+ inherit_systems.each do |inherit_system|
121
+ next unless inherit_system
122
+
123
+ inherited_mapping = Mapping.for(inherit_system, depth: depth + 1)
124
+
125
+ @rules = [inherited_mapping.rules, rules].flatten
126
+ @postrules = [inherited_mapping.postrules, postrules].flatten
127
+ @characters = (inherited_mapping.characters|| {}).merge(characters)
128
+ @dictionary = (inherited_mapping.dictionary|| {}).merge(dictionary)
112
129
  end
113
130
  end
114
131
 
115
- def build_hashes()
132
+ def build_hashes
116
133
  @characters_hash = characters&.sort_by { |k, _v| k.size }&.reverse&.to_h
117
134
  @dictionary_hash = dictionary&.sort_by { |k, _v| k.size }&.reverse&.to_h
118
135
  end
119
136
 
120
- def build_trie()
137
+ def build_trie
121
138
  @dictionary_trie = Rambling::Trie.create
122
139
  dictionary_trie.concat dictionary.keys
123
140
  end
124
141
  end
125
- end
142
+ end
@@ -0,0 +1,23 @@
1
+ module Interscript
2
+ module Opal
3
+ ALPHA_REGEXP = '\p{L}'
4
+
5
+ def mkregexp(regexpstring)
6
+ flags = 'u'
7
+ if regexpstring.include? "(?i)"
8
+ regexpstring = regexpstring.gsub("(?i)", "").gsub("(?-i)", "")
9
+ flags = 'ui'
10
+ end
11
+ Regexp.new("/#{regexpstring}/#{flags}")
12
+ end
13
+
14
+ def sub_replace(string, pos, size, repl)
15
+ string[0, pos] + repl + string[pos + size..-1]
16
+ end
17
+
18
+ def external_processing(mapping, string)
19
+ string
20
+ end
21
+
22
+ end
23
+ end
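
The two `sub_replace` variants in this release differ only in whether they mutate the input. The sketch below places them side by side under hypothetical names (`sub_replace_in_place`, `sub_replace_copy`); the bodies follow the `Fs` and `Opal` modules shown in the diff. The copying form is presumably needed because strings are immutable under Opal, which is also why the caller now assigns the return value (`string = sub_replace(...)`).

```ruby
# Mutating form, as in the Fs module: replaces the slice inside the receiver.
def sub_replace_in_place(string, pos, size, repl)
  string[pos..pos + size - 1] = repl
  string
end

# Non-mutating form, as in the Opal module: builds a new string from the
# prefix, the replacement, and the suffix, leaving the input untouched.
def sub_replace_copy(string, pos, size, repl)
  string[0, pos] + repl + string[pos + size..-1]
end

mutable = String.new("hello world")
frozen  = "hello world".freeze

puts sub_replace_in_place(mutable, 6, 5, "there")
puts sub_replace_copy(frozen, 6, 5, "there") # works on a frozen string
```

Callers that always use the return value behave the same with either variant, so the rest of `transliterate` does not need to know which engine it runs on.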