google_translate_diff 1.0.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/.gitignore +13 -0
- data/.rspec +2 -0
- data/.rubocop.yml +12 -0
- data/.travis.yml +12 -0
- data/Gemfile +3 -0
- data/README.md +114 -0
- data/Rakefile +6 -0
- data/google_translate_diff.gemspec +50 -0
- data/lib/google_translate_diff.rb +28 -0
- data/lib/google_translate_diff/cache.rb +36 -0
- data/lib/google_translate_diff/chunker.rb +54 -0
- data/lib/google_translate_diff/linearizer.rb +27 -0
- data/lib/google_translate_diff/redis_cache_store.rb +24 -0
- data/lib/google_translate_diff/redis_rate_limiter.rb +21 -0
- data/lib/google_translate_diff/request.rb +148 -0
- data/lib/google_translate_diff/spacing.rb +27 -0
- data/lib/google_translate_diff/tokenizer.rb +74 -0
- data/lib/google_translate_diff/version.rb +3 -0
- metadata +204 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 00ebbfbeaac94fc00141bbb2bfddedcfb9e733a6
+  data.tar.gz: d73f0d12f6d2df80bd43614856113931c3209067
+SHA512:
+  metadata.gz: dd4fd53436d30887c3d7d3990ef84387fbfa5b8c4b355c7258f2a05346f490cdd9288158ab910f6ee18886e5f3096f2ac1e603aeea3e64a71d897a6917233968
+  data.tar.gz: 9e60481244ce4520439c0d0184ea6a8e0dcf587c0717be6000e94051fe1b8dd915bf30fedd5b5f4f0024227e5be03428ab29779065169c7eef9b02143d053366
data/.gitignore
ADDED
data/.rspec
ADDED
data/.rubocop.yml
ADDED
data/.travis.yml
ADDED
data/Gemfile
ADDED
data/README.md
ADDED
@@ -0,0 +1,114 @@
+# GoogleTranslateDiff
+
+Google Translate API wrapper that helps to translate only the changes between revisions of long texts.
+
+<a href="https://evilmartians.com/?utm_source=google_translate_diff-gem">
+<img src="https://evilmartians.com/badges/sponsored-by-evil-martians.svg" alt="Sponsored by Evil Martians" width="236" height="54">
+</a>
+
+[Build Status](https://travis-ci.org/gzigzigzeo/google_translate_diff) [Code Climate](https://codeclimate.com/github/gzigzigzeo/google_translate_diff) [Test Coverage](https://codeclimate.com/github/gzigzigzeo/google_translate_diff/coverage)
+
+## Use case
+
+Assume your project contains a significant number of product descriptions which:
+- Require retranslation each time a user edits them.
+- Have a lot of identical parts (like a return policy).
+- Change frequently.
+
+If a user changes a single word within a long description, you will be charged for the retranslation of the whole text.
+
+A much better approach is to translate every repeated structural element (sentence) in your texts just once to save money. This gem helps you do that.
+
+## Installation
+
+Add this line to your application's Gemfile:
+
+```ruby
+gem 'google_translate_diff'
+```
+
+And then execute:
+
+    $ bundle
+
+Or install it yourself as:
+
+    $ gem install google_translate_diff
+
+## Usage
+
+```ruby
+require "google_translate_diff"
+
+# These dependencies are not included, as you might need to roll your own cache based on a different store
+require "redis"
+require "connection_pool"
+require "redis-namespace"
+require "ratelimit" # Optional, only needed if you use the rate limiter
+
+# Setup https://github.com/GoogleCloudPlatform/google-cloud-ruby/tree/master/google-cloud-translate
+ENV["TRANSLATE_KEY"] = "foobarkey"
+
+# I always use pool for redis
+pool = ConnectionPool.new(size: 10, timeout: 5) { Redis.new }
+
+# Pass any options (like app id)
+GoogleTranslateDiff.api = Google::Cloud::Translate.new
+
+GoogleTranslateDiff.cache_store =
+  GoogleTranslateDiff::RedisCacheStore.new(pool, timeout: 7.days, namespace: "t")
+
+# Optional
+GoogleTranslateDiff.rate_limiter =
+  GoogleTranslateDiff::RedisRateLimiter.new(
+    pool, threshold: 8000, interval: 60, namespace: "t"
+  )
+
+GoogleTranslateDiff.translate("test translations", from: "en", to: "es")
+```
+
+## How it works
+
+- Text nodes are extracted from HTML.
+- Every text node is split into sentences (using the `punkt-segmenter` gem).
+- The cache is checked for each sentence (using the language pair and a hash of the string).
+- Missing sentences are translated via the API and cached.
+- The original HTML is recombined from translations and cached data.
+
+*NOTE:* `:from` is a required parameter. The cache cannot be checked without specifying the exact language pair; this is a limitation.
+
+## Input
+
+`::translate` can receive a string, an array or a deep hash and will return the same structure, translated.
+
+```ruby
+GoogleTranslateDiff.translate("test", from: "en", to: "es")
+GoogleTranslateDiff.translate(["test", "language"], from: "en", to: "es")
+GoogleTranslateDiff.translate(
+  { title: "test", values: { type: "frequent" } }, from: "en", to: "es"
+)
+```
+
+See `GoogleTranslateDiff::Linearizer` for details.
+
+## HTML
+
+You can pass HTML just like plain text:
+
+```ruby
+GoogleTranslateDiff.translate("<b>Black</b>", from: "en", to: "es")
+```
+
+## Very long texts
+
+The Google API has a limitation: a query cannot be longer than approximately 4 KB. If your text is longer than that, multiple queries will be used to translate it automatically.
+
+## Development
+
+After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+
+To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
+
+## Contributing
+
+Bug reports and pull requests are welcome on GitHub at https://github.com/gzigzigzeo/google_translate_diff.
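The "How it works" steps in the README can be illustrated with a minimal, self-contained sketch. It is not the gem's implementation: `FakeTranslator`, `split_sentences` and `translate_with_cache` are hypothetical names, a plain `Hash` stands in for the cache store, and a naive regular expression replaces `punkt-segmenter`. Only the key layout (`from:to:md5`) follows `cache.rb`.

```ruby
require "digest"

# Hypothetical stand-in for the paid API: records every sentence it is asked
# to translate so the saving is visible.
class FakeTranslator
  attr_reader :requests

  def initialize
    @requests = []
  end

  def translate(sentences)
    @requests.concat(sentences)
    sentences.map { |s| "[es] #{s}" }
  end
end

# Naive splitter (assumption: the gem itself relies on punkt-segmenter).
def split_sentences(text)
  text.split(/(?<=[.!?])\s+/)
end

# Same key layout as GoogleTranslateDiff::Cache#key.
def cache_key(sentence, from, to)
  "#{from}:#{to}:#{Digest::MD5.hexdigest(sentence.strip)}"
end

def translate_with_cache(text, from:, to:, cache:, translator:)
  sentences = split_sentences(text)
  missing = sentences.reject { |s| cache.key?(cache_key(s, from, to)) }.uniq

  # Only sentences that are not cached yet are sent to the translator.
  unless missing.empty?
    translator.translate(missing).each_with_index do |translation, i|
      cache[cache_key(missing[i], from, to)] = translation
    end
  end

  sentences.map { |s| cache[cache_key(s, from, to)] }.join(" ")
end

cache = {}
translator = FakeTranslator.new
first = translate_with_cache("Free returns. Red shirt.", from: "en", to: "es",
                             cache: cache, translator: translator)
second = translate_with_cache("Free returns. Blue shirt.", from: "en", to: "es",
                              cache: cache, translator: translator)
```

After the second call the translator has seen "Free returns." only once; the edited description cost one sentence rather than the whole text.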
data/Rakefile
ADDED
data/google_translate_diff.gemspec
ADDED
@@ -0,0 +1,50 @@
+# coding: utf-8
+lib = File.expand_path("../lib", __FILE__)
+$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+require "google_translate_diff/version"
+
+# rubocop:disable Metrics/BlockLength
+Gem::Specification.new do |spec|
+  spec.name = "google_translate_diff"
+  spec.version = GoogleTranslateDiff::VERSION
+  spec.authors = ["Victor Sokolov"]
+  spec.email = ["gzigzigzeo@evilmartians.com"]
+
+  spec.summary = %(
+Google Translate API wrapper for Ruby which helps to translate only changes
+between revisions of long texts.
+
+  )
+  spec.description = %(
+Google Translate API wrapper for Ruby which helps to translate only changes
+between revisions of long texts.
+  )
+  spec.homepage = "https://github.com/gzigzigzeo/google_translate_diff"
+
+  if spec.respond_to?(:metadata)
+    spec.metadata["allowed_push_host"] = "https://rubygems.org"
+  else
+    raise "RubyGems 2.0 or newer is required to protect against " \
+      "public gem pushes."
+  end
+
+  spec.files = `git ls-files -z`.split("\x0").reject do |f|
+    f.match(%r{^(test|spec|features)/})
+  end
+  spec.bindir = "exe"
+  spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
+  spec.require_paths = ["lib"]
+
+  spec.add_development_dependency "bundler", "~> 1.14"
+  spec.add_development_dependency "rake", "~> 10.0"
+  spec.add_development_dependency "rspec", "~> 3.0"
+  spec.add_development_dependency "rubocop"
+  spec.add_development_dependency "codeclimate-test-reporter", "~> 1.0.0"
+  spec.add_development_dependency "simplecov"
+
+  spec.add_dependency "google-cloud-translate"
+  spec.add_dependency "ox"
+  spec.add_dependency "dry-initializer"
+  spec.add_dependency "punkt-segmenter"
+end
+# rubocop:enable Metrics/BlockLength
data/lib/google_translate_diff.rb
ADDED
@@ -0,0 +1,28 @@
+require "ox"
+require "punkt-segmenter"
+require "dry/initializer"
+require "google/cloud/translate"
+
+require "google_translate_diff/version"
+require "google_translate_diff/tokenizer"
+require "google_translate_diff/linearizer"
+require "google_translate_diff/chunker"
+require "google_translate_diff/spacing"
+require "google_translate_diff/cache"
+require "google_translate_diff/redis_cache_store"
+require "google_translate_diff/redis_rate_limiter"
+require "google_translate_diff/request"
+
+module GoogleTranslateDiff
+  class << self
+    attr_accessor :api
+    attr_accessor :cache_store
+    attr_accessor :rate_limiter
+
+    def translate(*args)
+      Request.new(*args).call
+    end
+  end
+
+  CACHE_NAMESPACE = "google-translate-diff".freeze
+end
data/lib/google_translate_diff/cache.rb
ADDED
@@ -0,0 +1,36 @@
+class GoogleTranslateDiff::Cache
+  extend Dry::Initializer::Mixin
+
+  param :from
+  param :to
+
+  def cached_and_missing(values)
+    keys = values.map { |v| key(v) }
+    cached = cache_store.read_multi(keys)
+    missing = values.map.with_index { |v, i| v if cached[i].nil? }.compact
+
+    [cached, missing]
+  end
+
+  def store(values, cached, updates)
+    cached.map.with_index do |value, index|
+      value || store_value(values[index], updates.shift)
+    end
+  end
+
+  private
+
+  def store_value(value, translation)
+    cache_store.write(key(value), translation)
+    translation
+  end
+
+  def key(value)
+    hash = Digest::MD5.hexdigest(value.strip) # Surrounding spaces do not matter
+    "#{from}:#{to}:#{hash}"
+  end
+
+  def cache_store
+    GoogleTranslateDiff.cache_store
+  end
+end
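`Cache` only calls two methods on the configured store: `read_multi(keys)`, whose result is indexed positionally, and `write(key, value)`. A store other than Redis can therefore be quite small. The class below is a hypothetical in-memory sketch for tests or experiments; `MemoryCacheStore` is not part of the gem and, unlike `RedisCacheStore`, it never expires entries.

```ruby
# Hypothetical in-memory cache store exposing the two methods that
# GoogleTranslateDiff::Cache calls. Not part of the gem.
class MemoryCacheStore
  def initialize
    @data = {}
  end

  # Returns values in the same order as +keys+, with nil for every miss,
  # because the caller looks results up by index.
  def read_multi(keys)
    keys.map { |key| @data[key] }
  end

  def write(key, value)
    @data[key] = value
  end
end

store = MemoryCacheStore.new
store.write("en:es:abc", "hola")
result = store.read_multi(["en:es:abc", "en:es:def"])
```

Under that assumption it could be assigned with `GoogleTranslateDiff.cache_store = MemoryCacheStore.new`.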
data/lib/google_translate_diff/chunker.rb
ADDED
@@ -0,0 +1,54 @@
+class GoogleTranslateDiff::Chunker
+  extend ::Dry::Initializer::Mixin
+
+  class Error < StandardError; end
+
+  Chunk = Struct.new(:values, :size)
+
+  param :values
+  option :limit, default: proc { MAX_CHUNK_SIZE }
+  option :count_limit, default: proc { COUNT_LIMIT }
+
+  def call
+    chunks.map(&:values)
+  end
+
+  def chunks
+    values.each_with_object([]) do |value, chunks|
+      validate_value_size(value)
+
+      tail = chunks.last
+
+      if next_chunk?(tail, value)
+        chunks << Chunk.new([], 0)
+        tail = chunks.last
+      end
+
+      update_chunk(tail, value)
+    end
+  end
+
+  private
+
+  def next_chunk?(tail, value)
+    tail.nil? ||
+      (size(value) + tail.size > limit) ||
+      tail.values.size > count_limit
+  end
+
+  def size(text)
+    URI.encode(text).size
+  end
+
+  def update_chunk(chunk, value)
+    chunk.values << value
+    chunk.size += size(value) # Same URL-encoded measure that next_chunk? compares
+  end
+
+  def validate_value_size(value)
+    raise Error, "Too long part #{value.size} > #{limit}" if value.size > limit
+  end
+
+  MAX_CHUNK_SIZE = 1700
+  COUNT_LIMIT = 120
+end
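The chunking idea (greedily filling a request until a size or item-count limit would be exceeded) can be shown in isolation. This is a simplified sketch, not the class above: `pack_chunks` is a hypothetical helper, it measures plain string length instead of URL-encoded length, and it treats `count_limit` as a hard maximum.

```ruby
# Greedily packs strings into chunks so that each chunk's total length stays
# within +limit+ and its item count stays within +count_limit+.
def pack_chunks(values, limit:, count_limit:)
  values.each_with_object([]) do |value, chunks|
    raise ArgumentError, "value longer than #{limit}" if value.size > limit

    tail = chunks.last
    full = tail.nil? ||
           tail.sum(&:size) + value.size > limit ||
           tail.size >= count_limit

    # Open a new chunk when the current one cannot take this value.
    chunks << (tail = []) if full
    tail << value
  end
end

by_size  = pack_chunks(%w[aaaa bbbb cc d], limit: 8, count_limit: 10)
by_count = pack_chunks(%w[a b c], limit: 100, count_limit: 2)
```

Because values are never reordered, flattening the chunks gives back the original list, which is what lets translated chunks be zipped back onto their source tokens.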
data/lib/google_translate_diff/linearizer.rb
ADDED
@@ -0,0 +1,27 @@
+class GoogleTranslateDiff::Linearizer
+  class << self
+    def linearize(struct, array = [])
+      case struct
+      when Hash then
+        struct.each { |_k, v| linearize(v, array) }
+      when Array then
+        struct.each { |v| linearize(v, array) }
+      else
+        array << struct
+      end
+
+      array
+    end
+
+    def restore(struct, array)
+      case struct
+      when Hash then
+        struct.each_with_object({}) { |(k, v), h| h[k] = restore(v, array) }
+      when Array then
+        struct.map { |v| restore(v, array) }
+      else
+        array.shift
+      end
+    end
+  end
+end
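A round trip makes the linearize/restore pair easier to follow. The sketch below re-implements the same depth-first idea as two standalone functions so it runs without the gem; `flatten_leaves` and `rebuild` are hypothetical names, and `upcase` stands in for translation.

```ruby
# Collects leaf values depth-first, in iteration order.
def flatten_leaves(struct, out = [])
  case struct
  when Hash then struct.each_value { |v| flatten_leaves(v, out) }
  when Array then struct.each { |v| flatten_leaves(v, out) }
  else out << struct
  end
  out
end

# Walks the original structure in the same order and replaces every leaf
# with the next value taken from +leaves+ (which is consumed in the process).
def rebuild(struct, leaves)
  case struct
  when Hash then struct.each_with_object({}) { |(k, v), h| h[k] = rebuild(v, leaves) }
  when Array then struct.map { |v| rebuild(v, leaves) }
  else leaves.shift
  end
end

input = { title: "test", values: { type: "frequent" }, tags: %w[a b] }
leaves = flatten_leaves(input)
translated = leaves.map(&:upcase) # stand-in for the real translation step
output = rebuild(input, translated)
```

The scheme relies on both walks visiting leaves in the same order, so the flat list must keep one entry per leaf between the two calls.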
data/lib/google_translate_diff/redis_cache_store.rb
ADDED
@@ -0,0 +1,24 @@
+class GoogleTranslateDiff::RedisCacheStore
+  extend Dry::Initializer::Mixin
+
+  param :connection_pool
+
+  option :timeout, default: proc { 60 * 60 * 24 * 7 }
+  option :namespace, default: proc { GoogleTranslateDiff::CACHE_NAMESPACE }
+
+  def read_multi(keys)
+    redis { |redis| redis.mget(*keys) }
+  end
+
+  def write(key, value)
+    redis { |redis| redis.setex(key, timeout, value) }
+  end
+
+  private
+
+  def redis
+    connection_pool.with do |redis|
+      yield Redis::Namespace.new(namespace, redis: redis)
+    end
+  end
+end
data/lib/google_translate_diff/redis_rate_limiter.rb
ADDED
@@ -0,0 +1,21 @@
+class GoogleTranslateDiff::RedisRateLimiter
+  extend Dry::Initializer::Mixin
+
+  class RateLimitExceeded < StandardError; end
+
+  param :connection_pool
+  option :threshold, default: proc { 8000 } # Keyword option, as in the README usage
+  option :interval, default: proc { 60 }
+
+  option :namespace, default: proc { GoogleTranslateDiff::CACHE_NAMESPACE }
+
+  def check(size)
+    connection_pool.with do |redis|
+      rate_limit = Ratelimit.new(namespace, redis: redis)
+      if rate_limit.exceeded?("call", threshold: threshold, interval: interval)
+        raise RateLimitExceeded
+      end
+      rate_limit.add size
+    end
+  end
+end
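`Request` only ever calls `check(size)` on the configured limiter, so a limiter that does not need Redis can be sketched in a few lines. The class below is hypothetical and not part of the gem: `MemoryRateLimiter` uses a single fixed window rather than whatever bucketing the `ratelimit` gem applies, and the injectable `clock` exists only to make the example deterministic.

```ruby
# Hypothetical fixed-window limiter with the same #check(size) entry point.
class MemoryRateLimiter
  class RateLimitExceeded < StandardError; end

  def initialize(threshold:, interval:, clock: -> { Time.now.to_f })
    @threshold = threshold
    @interval = interval
    @clock = clock
    @window_start = nil
    @used = 0
  end

  # Raises when the current window is already exhausted, otherwise records
  # +size+ (for example, the number of characters about to be sent).
  def check(size)
    now = @clock.call
    if @window_start.nil? || now - @window_start >= @interval
      @window_start = now
      @used = 0
    end
    raise RateLimitExceeded if @used >= @threshold

    @used += size
  end
end

now = 0.0
limiter = MemoryRateLimiter.new(threshold: 10, interval: 60, clock: -> { now })
limiter.check(6)
limiter.check(6) # allowed: the window was not yet exhausted before this call
blocked = begin
  limiter.check(1)
  false
rescue MemoryRateLimiter::RateLimitExceeded
  true
end
now += 60 # a new window starts
after_reset = limiter.check(1)
```

Like the Redis-backed class, this checks before adding, so a single call can overshoot the threshold; only the following call is rejected.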
data/lib/google_translate_diff/request.rb
ADDED
@@ -0,0 +1,148 @@
+class GoogleTranslateDiff::Request
+  extend Dry::Initializer::Mixin
+  extend Forwardable
+
+  param :values
+  param :options
+
+  def_delegators :GoogleTranslateDiff, :api, :cache_store, :rate_limiter
+  def_delegators :"GoogleTranslateDiff::Linearizer", :linearize, :restore
+
+  def call
+    validate_globals
+
+    return values if from == to || values.empty?
+
+    translation
+  end
+
+  private
+
+  def from
+    @from ||= options.fetch(:from)
+  end
+
+  def to
+    @to ||= options.fetch(:to)
+  end
+
+  def validate_globals
+    raise "Set GoogleTranslateDiff.api before calling ::translate" unless api
+    return if cache_store
+    raise "Set GoogleTranslateDiff.cache_store before calling ::translate"
+  end
+
+  # Extracts flat text array
+  # => "Name", "<b>Good</b> boy"
+  #
+  # #values might be something like { name: "Name", bio: "<b>Good</b> boy" }
+  def texts
+    @texts ||= linearize(values)
+  end
+
+  # Converts each array item to token list
+  # => [..., [["<b>", :markup], ["Good", :text], ...]]
+  def tokens
+    @tokens ||= texts.map do |value|
+      GoogleTranslateDiff::Tokenizer.tokenize(value)
+    end
+  end
+
+  # Extracts text tokens from token list
+  # => { ..., "1_1" => "Good", "1_3" => "Boy", ... }
+  def text_tokens
+    @text_tokens ||= extract_text_tokens.to_h
+  end
+
+  def extract_text_tokens
+    tokens.each_with_object([]).with_index do |(group, result), group_index|
+      group.each_with_index do |(value, type), index|
+        result << ["#{group_index}_#{index}", value] if type == :text
+      end
+    end
+  end
+
+  # Extracts values from text tokens
+  # => [ ..., "Good", "Boy", ... ]
+  def text_tokens_texts
+    @text_tokens_texts ||= linearize(text_tokens).map(&:to_s).map(&:strip)
+  end
+
+  # Splits texts that require translation into per-request chunks
+  # (groups of less than 2k symbols)
+  # => [[ ..., "Good", "Boy", ... ]]
+  def chunks
+    @chunks ||= GoogleTranslateDiff::Chunker.new(text_tokens_texts).call
+  end
+
+  # Translates/loads from cache values from each chunk
+  # => [[ ..., "Horoshiy", "Malchik", ... ]]
+  def chunks_translated
+    @chunks_translated ||= chunks.map do |chunk|
+      cached, missing = cache.cached_and_missing(chunk)
+      if missing.empty?
+        cached
+      else
+        cache.store(chunk, cached, call_api(missing))
+      end
+    end
+  end
+
+  # Restores indexes for translated tokens
+  # => { ..., "1_1" => "Horoshiy", "1_3" => "Malchik", ... }
+  def text_tokens_translated
+    @text_tokens_texts_translated ||=
+      restore(text_tokens, chunks_translated.flatten)
+  end
+
+  # Restores tokens translated + adds same spacing as in source token
+  # => [[..., [ "Horoshiy", :text ], ...]]
+  # rubocop:disable Metrics/AbcSize
+  def tokens_translated
+    @tokens_translated ||= tokens.dup.tap do |tokens|
+      text_tokens_translated.each do |index, value|
+        group_index, index = index.split("_")
+        tokens[group_index.to_i][index.to_i][0] =
+          restore_spacing(tokens[group_index.to_i][index.to_i][0], value)
+      end
+    end
+  end
+  # rubocop:enable Metrics/AbcSize
+
+  def restore_spacing(source_value, value)
+    GoogleTranslateDiff::Spacing.restore(source_value, value)
+  end
+
+  # Restores texts from tokens
+  # [..., "<b>Horoshiy</b> Malchik", ...]
+  def texts_translated
+    @texts_translated ||= tokens_translated.map do |group|
+      group.map { |value, type| type == :text ? value : fix_ascii(value) }.join
+    end
+  end
+
+  # Final result
+  def translation
+    @translation ||= restore(values, texts_translated)
+  end
+
+  def call_api(values)
+    check_rate_limit(values)
+    [api.translate(*values, **options)].flatten.map(&:text)
+  end
+
+  def cache
+    @cache ||= GoogleTranslateDiff::Cache.new(from, to)
+  end
+
+  def check_rate_limit(values)
+    return if rate_limiter.nil?
+    size = values.map(&:size).inject(0) { |sum, x| sum + x }
+    rate_limiter.check(size)
+  end
+
+  # Markup should not contain control characters
+  def fix_ascii(value)
+    value.gsub(/[\u0000-\u001F]/, " ")
+  end
+end
data/lib/google_translate_diff/spacing.rb
ADDED
@@ -0,0 +1,27 @@
+# Copies the leading and trailing whitespace of the left string onto the right one
+class GoogleTranslateDiff::Spacing
+  class << self
+    # GoogleTranslateDiff::Spacing.restore(" a ", "Z") # => " Z "
+    def restore(left, right)
+      leading(left) + right.strip + trailing(left)
+    end
+
+    private
+
+    def spaces(count)
+      ([" "] * count).join
+    end
+
+    def leading(value)
+      pos = value =~ /[^[:space:]]+/ui
+      return "" if pos.nil? || pos.zero?
+      value[0..(pos - 1)]
+    end
+
+    def trailing(value)
+      pos = value =~ /[[:space:]]+$/ui
+      return "" if pos.nil?
+      value[pos..-1]
+    end
+  end
+end
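The spacing step exists because sentences are stripped before they are cached and translated, so the original surrounding whitespace has to be reattached afterwards. A standalone sketch of the same idea follows; `restore_spacing` here is a hypothetical top-level function written for illustration, not the class above.

```ruby
# Reattaches the leading and trailing whitespace of +source+ to +translated+.
def restore_spacing(source, translated)
  # A whitespace-only source has no distinct leading/trailing parts.
  return translated.strip unless source.match?(/[^[:space:]]/)

  leading  = source[/\A[[:space:]]*/]
  trailing = source[/[[:space:]]*\z/]
  leading + translated.strip + trailing
end

padded = restore_spacing("  Good boy \n", "Buen chico")
plain  = restore_spacing("a", " Z ")
```

Keeping whitespace out of the cached value means `"Good boy"` and `"  Good boy \n"` share one cache entry while still rendering with their own spacing.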
data/lib/google_translate_diff/tokenizer.rb
ADDED
@@ -0,0 +1,74 @@
+class GoogleTranslateDiff::Tokenizer < ::Ox::Sax
+  def initialize(source)
+    @pos = nil
+    @prev = 1
+    @skip = false
+    @source = source
+    @tokens = []
+  end
+
+  attr_reader :texts, :tokens, :prev, :pos
+
+  def start_element(name)
+    @skip = true if name == :script
+  end
+
+  def end_element(name)
+    @skip = false if name == :script
+  end
+
+  def text(value)
+    return if @skip
+    value = fix_utf(value)
+    return if value.strip.empty?
+
+    token.tap { |t| @tokens << [fix_utf(t), :markup] if t }
+    @tokens.concat(sentences(value))
+
+    @prev = @pos + value.bytesize
+  end
+
+  def token
+    return if @prev == @pos
+    fix_utf(@source.byteslice((@prev - 1)..(@pos - 2)))
+  end
+
+  # Splits text by sentences
+  def sentences(value)
+    boundaries =
+      Punkt::SentenceTokenizer
+      .new(value)
+      .sentences_from_text(value)
+
+    return [[value, :text]] if boundaries.size == 1
+
+    boundaries.map.with_index do |(left, right), index|
+      next_boundary = boundaries[index + 1]
+      right = next_boundary[0] - 1 if next_boundary
+
+      [value[left..right], :text]
+    end
+  end
+
+  def cut_last_token
+    last_token = fix_utf(@source.byteslice((@prev - 1)..-1))
+    @tokens << [last_token, :markup] if last_token != ""
+  end
+
+  def fix_utf(value)
+    value.encode(
+      "UTF-8", undef: :replace, invalid: :replace, replace: " "
+    )
+  end
+
+  class << self
+    def tokenize(value)
+      return [] if value.nil?
+      tokenizer = new(value).tap do |h|
+        Ox.sax_parse(h, StringIO.new(value))
+        h.cut_last_token
+      end
+      tokenizer.tokens
+    end
+  end
+end
data/lib/google_translate_diff/version.rb
ADDED
metadata
ADDED
@@ -0,0 +1,204 @@
+--- !ruby/object:Gem::Specification
+name: google_translate_diff
+version: !ruby/object:Gem::Version
+  version: 1.0.1
+platform: ruby
+authors:
+- Victor Sokolov
+autorequire:
+bindir: exe
+cert_chain: []
+date: 2017-03-27 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: bundler
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.14'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.14'
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '10.0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '10.0'
+- !ruby/object:Gem::Dependency
+  name: rspec
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.0'
+- !ruby/object:Gem::Dependency
+  name: rubocop
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: codeclimate-test-reporter
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 1.0.0
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 1.0.0
+- !ruby/object:Gem::Dependency
+  name: simplecov
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: google-cloud-translate
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: ox
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: dry-initializer
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: punkt-segmenter
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+description: "\nGoogle Translate API wrapper for Ruby which helps to translate only
+  changes\nbetween revisions of long texts.\n "
+email:
+- gzigzigzeo@evilmartians.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- ".gitignore"
+- ".rspec"
+- ".rubocop.yml"
+- ".travis.yml"
+- Gemfile
+- README.md
+- Rakefile
+- google_translate_diff.gemspec
+- lib/google_translate_diff.rb
+- lib/google_translate_diff/cache.rb
+- lib/google_translate_diff/chunker.rb
+- lib/google_translate_diff/linearizer.rb
+- lib/google_translate_diff/redis_cache_store.rb
+- lib/google_translate_diff/redis_rate_limiter.rb
+- lib/google_translate_diff/request.rb
+- lib/google_translate_diff/spacing.rb
+- lib/google_translate_diff/tokenizer.rb
+- lib/google_translate_diff/version.rb
+homepage: https://github.com/gzigzigzeo/google_translate_diff
+licenses: []
+metadata:
+  allowed_push_host: https://rubygems.org
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubyforge_project:
+rubygems_version: 2.6.10
+signing_key:
+specification_version: 4
+summary: Google Translate API wrapper for Ruby which helps to translate only changes
+  between revisions of long texts.
+test_files: []