interscript 0.1.4 → 0.1.5

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 6228d56d873603190ed290757f3e8056258cb88761b93405e15f06d25a08cba5
- data.tar.gz: 0ceb25cb5862530d2c4798f921c4eaa5e9d999844266e8eab1d6bcd3d4b96842
+ metadata.gz: 3eb5c858ad7332995b5f8ed1c7766305f6be5fb5524e91de7df1d0d3d770a3ee
+ data.tar.gz: 80be952635dd65cf1e23107f70c44557ec9711590759e610a042f2c746f67d4e
  SHA512:
- metadata.gz: 301342d3cc50849fcb7f14c2b8b42404929cef1ed47192bc9cda236333b7b147c55d7b98648becc9c84d1b8790e6dbeffa30ecb27f472adf803463416e9babdc
- data.tar.gz: cf95797a136ff95dfd3edc66378c0b6320f6a87325206d56c5e9acda381733b4168d8f651cb15a30c8a5b40add43bd3b2eeafde24bc19a959de0eb6983ad8359
+ metadata.gz: bdb63b2ea8e6337a31879e7c1331f6e9a95dd205580bc8656cdb806e624b7cf946c17d300125470494c1643332b3f5f1a3d713a040bd6ee907dcba5d189db95e
+ data.tar.gz: 1aa1b254b253f148b4a5cc316027d3e67ca1ba68e866ebedbc025dc9c5f5c78680b12c24bc52177d65b2c9a6dd262ee97e34f52b4d32f83dfc9c232dc2c2d0de
@@ -1,6 +1,7 @@
  = Interscript: Interoperable Script Conversion Systems, with a Ruby implementation
 
- image:https://github.com/interscript/interscript/workflows/test/badge.svg["Build Status", link="https://github.com/interscript/interscript/actions?workflow=test"]
+ image:https://github.com/interscript/interscript/workflows/test/badge.svg["Ruby build status", link="https://github.com/interscript/interscript/actions?workflow=test"]
+ image:https://github.com/interscript/interscript/workflows/js/badge.svg["JavaScript build status", link="https://github.com/interscript/interscript/actions?workflow=js"]
 
  == Introduction
 
@@ -22,9 +23,9 @@ The goal is to achieve interoperable transliteration schemes allowing quality co
  These transliteration systems are used in the demo:
 
  `bgnpcgn-rus-Cyrl-Latn-1947`:: BGN/PCGN Romanization of Russian
- `iso-rus-Cyrl-Latn-iso9`:: ISO 9 Romanization of Russian
+ `iso-rus-Cyrl-Latn-9-1995`:: ISO 9 Romanization of Russian
  `icao-rus-Cyrl-Latn-9303`:: ICAO MRZ Romanization of Russian
- `bas-rus-Cyrl-Latn-bss`:: Bulgaria Academy of Science Streamlined System for Russian
+ `bas-rus-Cyrl-Latn-2017-bss`:: Bulgaria Academy of Science Streamlined System for Russian
 
  image:demo/20191118-interscript-demo-cast.gif["interscript screencast"]
 
@@ -95,7 +96,7 @@ interscript rus-Cyrl.txt \
  --output=bgnpcgn-rus-Latn.txt
 
  interscript rus-Cyrl.txt \
- --system=iso-rus-Cyrl-Latn-iso9 \
+ --system=iso-rus-Cyrl-Latn-9-1995 \
  --output=iso-rus-Latn.txt
 
  interscript rus-Cyrl.txt \
@@ -103,7 +104,7 @@ interscript rus-Cyrl.txt \
  --output=icao-rus-Latn.txt
 
  interscript rus-Cyrl.txt \
- --system=bas-rus-Cyrl-Latn-bss \
+ --system=bas-rus-Cyrl-Latn-2017-bss \
  --output=bas-rus-Latn.txt
  ----
 
@@ -149,7 +150,7 @@ tests:
 
  map:
  rules:
- - pattern: (?<=[АаЕеЁёИиОоУуЫыЭэЮюЯяЙйЪъЬь])\u0415 # Е after a, e, ё, и, о, у, ы, э, ю, я, й, ъ, ь
+ - pattern: (?<=[АаЕеЁёИиОоУуЫыЭэЮюЯяЙйЪъЬь])\u0415 # Е after a, e, ё, и, о, у, ы, э, ю, я, й, ъ, ь
  result: Ye
  - pattern: \b\u0415 # Е initially
  result: Ye
@@ -167,7 +168,7 @@ The subsection `rules` is placed under the `map` key. All rules are applied in o
 
  Each rule has `pattern` and `result` elements.
 
- Pattern is a regex expression. It should be representing as a string without `//` or `%r{}` parentheses. For example `\b\u0415`. In case a rule is depend on previous or next content, lookahead or lookbehind could be used. For example a rule with the pattern `(?<=[АаЕеЁёИиОоУуЫыЭэЮюЯяЙйЪъЬь])\u0415` find every Е after upper or lower case symbols a, e, ё, и, о, у, ы, э, ю, я, й, ъ, ь.
+ Pattern is a regex expression. It should be representing as a string without `//` or `%r{}` parentheses. For example `\b\u0415`. In case a rule is depend on previous or next content, lookahead or lookbehind could be used. For example a rule with the pattern `(?<=[АаЕеЁёИиОоУуЫыЭэЮюЯяЙйЪъЬь])\u0415` find every Е after upper or lower case symbols a, e, ё, и, о, у, ы, э, ю, я, й, ъ, ь.
 
  Result is a replacement a for pattern's match. It can contain a string, an Unicode characters specified by a hexadecimal number, a captured group reference. String with hexadecimal number or captured group reference should be double quoted. For example `"Y\u00eb"` or `"\\1\u00b7\\2"`. Captured group are referred by double backslash and group's number.
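
As a rough illustration of the `pattern`/`result` convention in the README text above, the sketch below applies two rules with plain `String#gsub`. The rule hashes, the shortened lookbehind character class, and the `apply_rules` helper are made up for this example; it is not the gem's own rule-loading code.

```ruby
# Hypothetical rules in the shape the README describes: a regex source string
# without // delimiters, plus a replacement string.
rules = [
  # Е (U+0415) preceded by О or о; the character class is shortened for the example.
  { "pattern" => '(?<=[Оо])\u0415', "result" => "Ye" },
  # Two captured groups joined by a middle dot (U+00B7); \\1 and \\2 are group references.
  { "pattern" => '(a)(b)', "result" => "\\1\u00b7\\2" },
]

# Apply every rule in order, feeding each result into the next rule.
def apply_rules(text, rules)
  rules.reduce(text) do |out, rule|
    out.gsub(Regexp.new(rule["pattern"]), rule["result"])
  end
end

puts apply_rules("ОЕ ab", rules)
```

Building the regex from a string is why the README asks for patterns without `//` or `%r{}` delimiters.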
 
@@ -256,7 +257,7 @@ the system code identifying a script conversion system has the following compone
  e.g. `bgnpcgn-rus-Cyrl-Latn-1947`:
 
  `bgnpcgn`:: the authority identifier
- `rus`:: an ISO 639-2 3-letter language code that this system applies to
+ `rus`:: an ISO 639-{1,2,3,5} language code that this system applies to (For 639-2, use (T) code)
  `Cyrl`:: an ISO 15924 script code, identifying the source script
  `Latn`:: an ISO 15924 script code, identifying the target script
  `1947`:: an identifier unit within the authority to identify this system
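
The component list above can be sketched as a small parser. `SystemCode` and `parse_system_code` are hypothetical names introduced only for illustration, and the sketch assumes the authority identifier itself contains no hyphen; the gem may well treat system codes as opaque strings.

```ruby
# Hypothetical helper: split a system code into the components listed above.
SystemCode = Struct.new(:authority, :language, :source_script, :target_script, :id)

def parse_system_code(code)
  authority, language, source, target, *id_parts = code.split("-")
  raise ArgumentError, "incomplete system code: #{code}" if id_parts.empty?

  # The trailing identifier may itself contain hyphens, e.g. "2017-bss".
  SystemCode.new(authority, language, source, target, id_parts.join("-"))
end

p parse_system_code("bgnpcgn-rus-Cyrl-Latn-1947").to_a
p parse_system_code("bas-rus-Cyrl-Latn-2017-bss").id
```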
@@ -0,0 +1,2 @@
+ require "opal"
+ require "interscript"
@@ -1,56 +1,32 @@
  # frozen_string_literal: true
 
- require "yaml"
+ require "interscript/opal/maps" if RUBY_ENGINE == "opal"
  require "interscript/mapping"
 
  # Transliteration
  module Interscript
 
- class << self
- def root_path
- @root_path ||= Pathname.new(File.dirname(__dir__))
- end
-
- def transliterate_file(system_code, input_file, output_file, maps)
- input = File.read(input_file)
- output = transliterate(system_code, input, maps)
-
- File.open(output_file, 'w') do |f|
- f.puts(output)
- end
- puts "Output written to: #{output_file}"
- end
+ class InvalidSystemError < StandardError; end
+ class ExternalProcessNotRecognizedError < StandardError; end
+ class ExternalProcessUnavailableError < StandardError; end
 
- def import_python_modules
- begin
- pyimport :g2pwrapper
- rescue
- pyimport :sys
- sys.path.append(root_path.to_s+"/lib/")
- pyimport :g2pwrapper
- end
- end
+ if RUBY_ENGINE == 'opal'
+ require "interscript/opal"
+ extend Opal
+ else
+ require "interscript/fs"
+ extend Fs
+ end
 
- def external_process(process_name, string)
- import_python_modules
- case process_name
- when 'sequitur.pythainlp_lexicon'
- return g2pwrapper.transliterate('pythainlp_lexicon', string)
- when 'sequitur.wiktionary_phonemic'
- return g2pwrapper.transliterate('wiktionary_phonemic', string)
- else
- puts "Invalid Process"
- end
- end
+ class << self
 
  def transliterate(system_code, string, maps={})
- if (!maps.has_key?system_code)
+ unless maps.has_key? system_code
  maps[system_code] = Interscript::Mapping.for(system_code)
  end
  # mapping = Interscript::Mapping.for(system_code)
  mapping = maps[system_code]
 
-
  # First, apply chained transliteration as specified in the list `chain`
  chain = mapping.chain.dup
  while chain.length > 0
@@ -69,11 +45,7 @@ module Interscript
  dictmap = mapping.dictionary_hash
  trie = mapping.dictionary_trie
 
- # Segmentation
- string = external_process(mapping.segmentation, string) if mapping.segmentation
-
- # Transliteration/Transcription
- string = external_process(mapping.transcription, string) if mapping.transcription
+ string = external_processing(mapping, string)
 
  pos = 0
  while pos < string.to_s.size
@@ -81,14 +53,15 @@ module Interscript
  wordmatch = ""
 
  # Using Trie, find the longest matching substring
- while (pos + m < string.to_s.size) && (trie.partial_word?string[pos..pos+m])
+ while (pos + m < string.to_s.size) && (trie.partial_word?string[pos..pos+m])
  wordmatch = string[pos..pos+m] if trie.word?string[pos..pos+m]
  m += 1
  end
+
  m = wordmatch.length
  if m > 0
  repl = dictmap[string[pos..pos+m-1]]
- string[pos..pos+m-1] = repl
+ string = sub_replace(string, pos, m, repl)
  pos += repl.length
  else
  pos += 1
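
The longest-match loop in this hunk can be approximated without the `rambling-trie` dependency. The sketch below is a stand-in under stated assumptions: a `Set` of key prefixes plays the role of `partial_word?`, `Hash#key?` plays the role of `word?`, and `longest_match` is a hypothetical helper, not a method of the gem. Its loop bounds also differ slightly from the diff's.

```ruby
require "set"

# Return the longest dictionary key that starts at `pos` in `string`,
# or "" when no key matches there.
def longest_match(string, pos, dictionary)
  # Every prefix of every key; stands in for the trie's partial_word? check.
  prefixes = dictionary.keys.flat_map { |k| (1..k.size).map { |n| k[0, n] } }.to_set

  best = ""
  len = 1
  while pos + len <= string.size && prefixes.include?(string[pos, len])
    candidate = string[pos, len]
    best = candidate if dictionary.key?(candidate) # stands in for word?
    len += 1
  end
  best
end

dictionary = { "ab" => "X", "abcd" => "Y" }
p longest_match("abcde", 0, dictionary)
p longest_match("abce", 0, dictionary)
```

The scan keeps extending while the substring is still a prefix of some key, and remembers the last position at which it was a complete key.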
@@ -109,32 +82,48 @@ module Interscript
  # offsets[pos] += result.size - match[0].size
  # end
  # end
+
  mapping.rules.each do |r|
- output.gsub!(/#{r['pattern']}/, r['result'])
+ next unless output
+ re = mkregexp(r["pattern"])
+ output = output.gsub(re, r["result"])
  end
 
  charmap.each do |k, v|
  while (match = output&.match(/#{k}/))
  pos = match.offset(0).first
  result = !downcase && up_case_around?(output, pos) ? v.upcase : v
- result = result[0] if result.is_a?(Array) # if more than one, choose the first one
- output[pos, match[0].size] = add_separator(separator, pos, result)
+
+ # if more than one, choose the first one
+ result = result[0] if result.is_a?(Array)
+
+ output = sub_replace(
+ output,
+ pos,
+ match[0].size,
+ add_separator(separator, pos, result)
+ )
  end
  end
 
  mapping.postrules.each do |r|
- output.gsub!(/#{r['pattern']}/, r['result'])
+ next unless output
+ re = mkregexp(r["pattern"])
+ output = output.gsub(re, r["result"])
  end
 
- if output
- output.sub!(/^(.)/, &:upcase) if title_case
- if word_separator != ''
- output.gsub!(/#{word_separator}#{separator}/,word_separator)
- output.gsub!(/#{word_separator}(.)/, &:upcase) if title_case
+ return unless output
+
+ output = output.sub(/^(.)/, &:upcase) if title_case
+ if word_separator != ''
+ output = output.gsub(/#{word_separator}#{separator}/u, word_separator)
+
+ if title_case
+ output = output.gsub(/#{word_separator}(.)/u, &:upcase)
  end
  end
 
- output ? output.unicode_normalize : output
+ output.unicode_normalize
  end
 
  private
@@ -147,11 +136,11 @@ module Interscript
  return false if string[pos] == string[pos].downcase
 
  i = pos - 1
- i -= 1 while i.positive? && string[i] !~ /[[:alpha:]]/
+ i -= 1 while i.positive? && string[i] !~ Regexp.new(ALPHA_REGEXP)
  before = i >= 0 && i < pos ? string[i].to_s.strip : ''
 
  i = pos + 1
- i += 1 while i < string.size - 1 && string[i] !~ /[[:alpha:]]/
+ i += 1 while i < string.size - 1 && string[i] !~ Regexp.new(ALPHA_REGEXP)
  after = i > pos ? string[i].to_s.strip : ''
 
  before_uc = !before.empty? && before == before.upcase
@@ -159,5 +148,6 @@ module Interscript
  # before_uc && (after.empty? || after_uc) || after_uc && (before.empty? || before_uc)
  before_uc || after_uc
  end
+
  end
  end
@@ -1,16 +1,17 @@
  require 'thor'
  require 'interscript'
-
+ require 'json'
  module Interscript
  # Command line interface
  class Command < Thor
  desc '<file>', 'Transliterate text'
  option :system, aliases: '-s', required: true, desc: 'Transliteration system'
  option :output, aliases: '-o', required: false, desc: 'Output file'
+ option :map, aliases: '-m', required: false, default: "{}", desc: 'Transliteration mapping json'
 
  def translit(input)
  if options[:output]
- Interscript.transliterate_file(options[:system], input, options[:output])
+ Interscript.transliterate_file(options[:system], input, options[:output], JSON.parse(options[:map]))
  else
  puts Interscript.transliterate(options[:system], IO.read(input))
  end
@@ -0,0 +1,69 @@
+ module Interscript
+ module Fs
+ ALPHA_REGEXP = '[[:alpha:]]'
+
+ def sub_replace(string, pos, size, repl)
+ string[pos..pos + size - 1] = repl
+ string
+ end
+
+ def root_path
+ @root_path ||= Pathname.new(File.join(File.dirname(__dir__), ".."))
+ end
+
+ def transliterate_file(system_code, input_file, output_file, maps={})
+ input = File.read(input_file)
+ output = transliterate(system_code, input, maps)
+
+ File.open(output_file, 'w') do |f|
+ f.puts(output)
+ end
+
+ puts "Output written to: #{output_file}"
+ output_file
+ end
+
+ def import_python_modules
+ begin
+ pyimport :g2pwrapper
+ rescue
+ pyimport :sys
+ sys.path.append(root_path.to_s + "/lib/")
+ pyimport :g2pwrapper
+ end
+ end
+
+ def external_process(process_name, string)
+ import_python_modules
+
+ case process_name
+ when 'sequitur.pythainlp_lexicon'
+ return g2pwrapper.transliterate('pythainlp_lexicon', string)
+ when 'sequitur.wiktionary_phonemic'
+ return g2pwrapper.transliterate('wiktionary_phonemic', string)
+ else
+ raise ExternalProcessNotRecognizedError.new
+ end
+
+ rescue
+ raise ExternalProcessUnavailableError.new
+ end
+
+ def external_processing(mapping, string)
+ # Segmentation
+ string = external_process(mapping.segmentation, string) if mapping.segmentation
+
+ # Transliteration/Transcription
+ string = external_process(mapping.transcription, string) if mapping.transcription
+
+ string
+ end
+
+ private
+
+ def mkregexp(regexpstring)
+ /#{regexpstring}/u
+ end
+
+ end
+ end
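
One Ruby behaviour is worth keeping in mind when reading `external_process` above: a bare method-level `rescue` catches any `StandardError`, including one raised in the `else` branch of the same method. The standalone sketch below uses stand-in error classes and hypothetical method names to show that behaviour, plus one possible way to keep the two errors distinct. It is not a patch to the gem.

```ruby
# Stand-in error classes so the sketch runs on its own.
class NotRecognizedError < StandardError; end
class UnavailableError < StandardError; end

# Same shape as the method in the diff: a bare method-level rescue.
def lookup_with_bare_rescue(name)
  case name
  when "known" then "ok"
  else raise NotRecognizedError
  end
rescue
  raise UnavailableError
end

# One possible alternative: re-raise the specific error before the catch-all.
def lookup_with_specific_rescue(name)
  case name
  when "known" then "ok"
  else raise NotRecognizedError
  end
rescue NotRecognizedError
  raise
rescue StandardError
  raise UnavailableError
end

# Helper that reports which error class, if any, a block raises.
def raised_class
  yield
  nil
rescue StandardError => e
  e.class
end

p raised_class { lookup_with_bare_rescue("other") }
p raised_class { lookup_with_specific_rescue("other") }
```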
@@ -1,7 +1,8 @@
  require 'rambling-trie'
+ require 'yaml'
+ require 'json'
 
  module Interscript
- class InvalidSystemError < StandardError; end
 
  class Mapping
  attr_reader(
@@ -35,7 +36,10 @@ module Interscript
  def initialize(system_code, options = {})
  @system_code = system_code
  @depth = options.fetch(:depth, 0).to_i
- @system_path = options.fetch(:system_code, default_path)
+
+ unless RUBY_ENGINE == 'opal'
+ @system_path = options.fetch(:system_code, default_path)
+ end
 
  load_and_serialize_system_mappings
  end
@@ -45,10 +49,10 @@ module Interscript
  end
 
  def load_and_serialize_system_mappings
- if depth < 5
- mappings = load_system_mappings
- serialize_system_mappings(mappings)
- end
+ return if depth >= 5
+
+ mappings = load_system_mappings
+ serialize_system_mappings(mappings)
  end
 
  private
@@ -64,6 +68,18 @@ module Interscript
  end
 
  def load_system_mappings
+ if RUBY_ENGINE == 'opal'
+ load_opal_mappings
+ else
+ load_fs_mappings
+ end
+ end
+
+ def load_opal_mappings
+ JSON.parse(`InterscriptMaps[#{system_code}]`)
+ end
+
+ def load_fs_mappings
  YAML.load_file(system_path.join(system_code_file))
  rescue Errno::ENOENT
  raise Interscript::InvalidSystemError.new("No system mappings found")
@@ -100,26 +116,27 @@ module Interscript
 
  def include_inherited_mappings(mappings)
  inherit_systems = [].push(mappings["map"]["inherit"]).flatten
- for inherit_system in inherit_systems do
- if (inherit_system)
- inherited_mapping = Mapping.for(inherit_system, depth: depth + 1)
-
- @rules = [inherited_mapping.rules, rules].flatten
- @postrules = [inherited_mapping.postrules, postrules].flatten
- @characters = (inherited_mapping.characters|| {}).merge(characters)
- @dictionary = (inherited_mapping.dictionary|| {}).merge(dictionary)
- end
+
+ inherit_systems.each do |inherit_system|
+ next unless inherit_system
+
+ inherited_mapping = Mapping.for(inherit_system, depth: depth + 1)
+
+ @rules = [inherited_mapping.rules, rules].flatten
+ @postrules = [inherited_mapping.postrules, postrules].flatten
+ @characters = (inherited_mapping.characters|| {}).merge(characters)
+ @dictionary = (inherited_mapping.dictionary|| {}).merge(dictionary)
  end
  end
 
- def build_hashes()
+ def build_hashes
  @characters_hash = characters&.sort_by { |k, _v| k.size }&.reverse&.to_h
  @dictionary_hash = dictionary&.sort_by { |k, _v| k.size }&.reverse&.to_h
  end
 
- def build_trie()
+ def build_trie
  @dictionary_trie = Rambling::Trie.create
  dictionary_trie.concat dictionary.keys
  end
  end
- end
+ end
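
The merge order used by `include_inherited_mappings` can be seen in isolation with plain hashes and arrays. The `parent` and `child` values below are invented stand-ins for `Mapping` objects; only the two merge expressions mirror the diff.

```ruby
# Plain hashes and arrays stand in for an inherited and an inheriting mapping.
parent = {
  rules: [{ "pattern" => "a", "result" => "1" }],
  characters: { "x" => "parent", "y" => "parent" },
}
child = {
  rules: [{ "pattern" => "b", "result" => "2" }],
  characters: { "y" => "child" },
}

# Inherited rules come first, so they are applied before the child's own rules.
merged_rules = [parent[:rules], child[:rules]].flatten

# Hash#merge lets the child's entries override inherited ones with the same key.
merged_characters = (parent[:characters] || {}).merge(child[:characters])

p merged_rules.map { |r| r["pattern"] }
p merged_characters
```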
@@ -0,0 +1,23 @@
+ module Interscript
+ module Opal
+ ALPHA_REGEXP = '\p{L}'
+
+ def mkregexp(regexpstring)
+ flags = 'u'
+ if regexpstring.include? "(?i)"
+ regexpstring = regexpstring.gsub("(?i)", "").gsub("(?-i)", "")
+ flags = 'ui'
+ end
+ Regexp.new("/#{regexpstring}/#{flags}")
+ end
+
+ def sub_replace(string, pos, size, repl)
+ string[0, pos] + repl + string[pos + size..-1]
+ end
+
+ def external_processing(mapping, string)
+ string
+ end
+
+ end
+ end
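
The two `sub_replace` variants in this diff (`Fs` and `Opal`) produce the same text but differ in whether they mutate their argument. The sketch below restates both under hypothetical names so they can be compared side by side in ordinary Ruby; the likely motivation, that strings are immutable under Opal, is an assumption rather than something the diff states.

```ruby
# Mirrors the Fs variant: assigns into the string and returns the same object.
def sub_replace_in_place(string, pos, size, repl)
  string[pos..pos + size - 1] = repl
  string
end

# Mirrors the Opal variant: builds a new string and leaves the input untouched.
def sub_replace_copy(string, pos, size, repl)
  string[0, pos] + repl + string[pos + size..-1]
end

original = "hello".freeze

# The copying variant works on a frozen string.
p sub_replace_copy(original, 1, 3, "EY")

# The in-place variant needs a mutable string, so a copy is passed here.
mutable = original.dup
p sub_replace_in_place(mutable, 1, 3, "EY")
p mutable
```

Because both variants return the resulting string, the caller's `string = sub_replace(...)` assignment works with either one.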