whatlanguage 1.0.5 → 1.0.6
- checksums.yaml +7 -0
- data/History.txt +5 -0
- data/README.md +48 -33
- data/lang/danish.lang +0 -0
- data/lib/whatlanguage.rb +59 -24
- data/lib/whatlanguage/string.rb +11 -0
- data/lib/whatlanguage/version.rb +1 -1
- data/test/test_whatlanguage.rb +28 -12
- metadata +10 -10
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: ccd0489249d639e17bda9ae90c496d33e2e6048a
+  data.tar.gz: 83a969f3f186de199996e19bd6c6f391ba7cae24
+SHA512:
+  metadata.gz: a1fbd9f4745e74637c8eddf8a0c4ac74fed02493d00b011c8e3de80887000253b0d6e39260d54afed9cc1918270c3dad227c963c07591840ad4692ecd775682a
+  data.tar.gz: 59a45845d1f19f073d0f1af57526fcedea5813126f93a7795fe4218842bcf6212d7169f6a34d04d060d587f24c172558c7548ae7d063345f7e5a4800c105f078
data/History.txt
CHANGED
data/README.md
CHANGED
@@ -2,59 +2,72 @@
 
 by Peter Cooper
 
-Text language detection. Quick, fast, memory efficient, and all in pure Ruby. Uses Bloom filters for aforementioned speed and memory benefits.
+Text language detection. Quick, fast, memory efficient, and all in pure Ruby. Uses Bloom filters for aforementioned speed and memory benefits. It works well on texts of over 10 words in length (e.g. blog posts or comments) and *very poorly* on short or Twitter-esque text, so be aware.
 
 Works with Dutch, English, Farsi, French, German, Italian, Pinyin, Swedish, Portuguese, Russian, Arabic, Finnish, Greek, Hebrew, Hungarian, Korean, Norwegian, Polish and Spanish out of the box.
 
 ## Important note
-
-This library was first built in 2007 and has received a few minor updates over the years. There are now more efficient and effective algorithms for doing language detection which I am investigating for a WhatLanguage 2.0.
 
-This library has
+This library was first built in 2007 and has received only a few minor updates over the years. There are now more efficient and effective algorithms for doing language detection which I am investigating for a future WhatLanguage.
+
+This library has been updated to be distributed and to work on modern Ruby implementations but other than that, has had no significant improvements.
 
 ## Synopsis
 
 Full Example
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+```ruby
+require 'whatlanguage/string'
+
+texts = []
+texts << %q{Deux autres personnes ont été arrêtées durant la nuit}
+texts << %q{The links between the attempted car bombings in Glasgow and London are becoming clearer}
+texts << %q{En estado de máxima alertaen su nivel de crítico}
+texts << %q{Returns the object in enum with the maximum value.}
+texts << %q{Propose des données au sujet de la langue espagnole.}
+texts << %q{La palabra "mezquita" se usa en español para referirse a todo tipo de edificios dedicados.}
+texts << %q{اللغة التي هي هذه؟}
+texts << %q{Mitä kieltä tämä on?}
+texts << %q{Ποια γλώσσα είναι αυτή;}
+texts << %q{באיזו שפה זה?}
+texts << %q{Milyen nyelv ez?}
+texts << %q{이 어떤 언어인가?}
+texts << %q{Hvilket språk er dette?}
+texts << %q{W jakim języku to jest?}
+
+texts.each { |text| puts "#{text[0..18]}... is in #{text.language.to_s.capitalize}" }
+```
 
 Initialize WhatLanguage with all filters
 
-
+```ruby
+wl = WhatLanguage.new(:all)
+```
 
 Return language with best score
 
-
+```ruby
+wl.language(text)
+```
 
 Return hash with scores for all relevant languages
 
-
+```ruby
+wl.process_text(text)
+```
 
-Convenience
+Convenience methods on String
 
-
+```ruby
+"This is a test".language # => :english
+"This is a test".language_iso # => :en
+```
 
 Initialize WhatLanguage with certain languages
 
-
+```ruby
+wl = WhatLanguage.new(:english, :german, :french)
+```
 
 ## Requirements
 
@@ -66,18 +79,20 @@ None, minor libraries (BloominSimple and BitField) included with this release.
 
 To test, go into irb, then:
 
-
-
+```ruby
+require 'whatlanguage'
+"Je suis un homme".language
+```
 
 ## Credits
 
-Contributions from Konrad Reiche
+Contributions from Konrad Reiche, Salimane Adjao Moustapha, and others appreciated.
 
 ## License
 
 MIT License
 
-Copyright (c) 2007-
+Copyright (c) 2007-2016 Peter Cooper
 
 Permission is hereby granted, free of charge, to any person obtaining
 a copy of this software and associated documentation files (the
data/lang/danish.lang
ADDED
Binary file
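Note on the `.lang` files: `lib/whatlanguage.rb` loads each one with `BloominSimple.from_dump`, so they appear to be serialized Bloom filters built from a language's word list. The sketch below only illustrates the idea. `TinyBloom` is a hypothetical class, not the gem's `BloominSimple` or `BitField`, and it borrows nothing from the gem beyond the SHA1 plus `unpack("VV")` hashing visible in `HASHER`.

```ruby
require 'digest/sha1'

# Hypothetical illustration only; not the gem's BloominSimple/BitField classes.
class TinyBloom
  def initialize(width)
    @width = width
    @bits = Array.new(width, false)
  end

  # Two bit positions per word, taken from the first 8 bytes of its SHA1 digest.
  def positions(word)
    Digest::SHA1.digest(word.downcase.strip).unpack("VV").map { |n| n % @width }
  end

  def add(word)
    positions(word).each { |i| @bits[i] = true }
  end

  # Can report a false positive, but never a false negative.
  def includes?(word)
    positions(word).all? { |i| @bits[i] }
  end
end

filter = TinyBloom.new(10_000)
%w[hvilket sprog er dette].each { |w| filter.add(w) }
puts filter.includes?("Sprog ")  # input is downcased and stripped before hashing
```

Because membership is only a set of bit checks, a filter like this stays small and fast at the cost of occasional false positives, which is presumably why short inputs score unreliably.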
data/lib/whatlanguage.rb
CHANGED
@@ -2,65 +2,100 @@ require 'whatlanguage/bloominsimple'
 require 'whatlanguage/bitfield'
 require 'digest/sha1'
 
-class WhatLanguage
+class WhatLanguage
   HASHER = lambda { |item| Digest::SHA1.digest(item.downcase.strip).unpack("VV") }
-
+
   BITFIELD_WIDTH = 2_000_000
-
+
+  ISO_CODES = {
+    nil => nil,
+    :arabic => :ar,
+    :danish => :da,
+    :dutch => :nl,
+    :english => :en,
+    :farsi => :fa,
+    :finnish => :fi,
+    :french => :fr,
+    :german => :de,
+    :greek => :el,
+    :hebrew => :he,
+    :hungarian => :hu,
+    :italian => :it,
+    :korean => :ko,
+    :norwegian => :no,
+    :pinyin => :zh,
+    :polish => :pl,
+    :portuguese => :pt,
+    :russian => :ru,
+    :spanish => :es,
+    :swedish => :sv
+  }
+
   @@data = {}
-
+
   def initialize(*selection)
     @selection = (selection.empty?) ? [:all] : selection
-
-
-
+    if @@data.empty?
+      languages_folder = File.join(File.dirname(__FILE__), "..", "lang")
+      Dir.entries(languages_folder).grep(/\.lang/).each do |lang|
+        @@data[lang[/\w+/].to_sym] ||= BloominSimple.from_dump(File.new(File.join(languages_folder, lang), 'rb').read, &HASHER)
+      end
     end
   end
-
+
+  def languages
+    @languages ||=
+      begin
+        if @selection.include?(:all)
+          languages = @@data.keys
+        else
+          languages = @@data.keys & @selection # intersection
+        end
+      end
+  end
+
   # Very inefficient method for now.. but still beats the non-Bloom alternatives.
   # Change to better bit comparison technique later..
   def process_text(text)
     results = Hash.new(0)
     it = 0
-    text.
+    to_lowercase(text).split.each do |word|
       it += 1
 
-      if @selection.include?(:all)
-        languages = @@data.keys
-      else
-        languages = @@data.keys & @selection # intersection
-      end
-
       languages.each do |lang|
         results[lang] += 1 if @@data[lang].includes?(word)
       end
-
+
       # Every now and then check to see if we have a really convincing result.. if so, exit early.
       if it % 4 == 0 && results.size > 1
         top_results = results.sort_by{|a,b| -b}[0..1]
-
+
         # Next line may need some tweaking one day..
         break if top_results[0][1] > 4 && ((top_results[0][1] > top_results[1][1] * 2) || (top_results[0][1] - top_results[1][1] > 25))
       end
-
+
       #break if it > 100
     end
     results
   end
-
+
   def language(text)
     process_text(text).max { |a,b| a[1] <=> b[1] }.first rescue nil
   end
-
+
+  def language_iso(text)
+    ISO_CODES[language(text)]
+  end
+
   def self.filter_from_dictionary(filename)
     bf = BloominSimple.new(BITFIELD_WIDTH, &HASHER)
     File.open(filename).each { |word| bf.add(word) }
     bf
   end
-end
 
-
-
-
+  if !defined? UnicodeUtils
+    define_method(:to_lowercase) { |str| str.downcase }
+  else
+    define_method(:to_lowercase) { |str| UnicodeUtils.casefold(str) }
   end
 end
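Note on the early exit in `process_text`: every fourth word, the loop stops if the leading language has more than 4 hits and is either more than double the runner-up or more than 25 hits ahead. The sketch below isolates that rule so it can be tried on plain hashes. `convincing?` is a hypothetical helper name, not part of the gem, and the `it % 4 == 0` gate is left out.

```ruby
# Hypothetical helper mirroring the `break if ...` condition in process_text.
# `results` maps a language symbol to the number of words matched so far.
def convincing?(results)
  return false if results.size < 2

  # Highest and second-highest hit counts.
  first, second = results.values.sort.reverse.first(2)
  first > 4 && (first > second * 2 || first - second > 25)
end

puts convincing?({ english: 9, dutch: 4 })
puts convincing?({ english: 5, dutch: 4 })
```

Note that the rule never fires when only one language has matched, since `results.size > 1` is required; in that case the loop simply runs to the end of the text.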
data/lib/whatlanguage/version.rb
CHANGED
data/test/test_whatlanguage.rb
CHANGED
@@ -1,17 +1,27 @@
 # encoding: utf-8
 require "test/unit"
 
-
+# not a dependency
+begin
+  require 'unicode_utils'
+rescue LoadError
+end
+
+require 'whatlanguage/string'
 
 class TestWhatLanguage < Test::Unit::TestCase
   def setup
     @wl = WhatLanguage.new(:all)
   end
-
+
   def test_string_method
     assert_equal :english, "This is a test".language
   end
 
+  def test_string_iso_method
+    assert_equal :en, "this is a test".language_iso
+  end
+
   def test_arabic
     assert_equal :arabic, @wl.language("اللغة التي هي هذه؟")
   end
@@ -29,9 +39,9 @@ class TestWhatLanguage < Test::Unit::TestCase
   end
 
   def test_french
-    assert_equal :french, @wl.language("Bonjour, je m'appelle Sandrine. Voila
+    assert_equal :french, @wl.language("Bonjour, je m'appelle Sandrine. Voila mon chat.")
   end
-
+
   def test_german
     assert_equal :german, @wl.language("Welche Sprache ist das?")
   end
@@ -79,23 +89,23 @@ class TestWhatLanguage < Test::Unit::TestCase
   def test_swedish
     assert_equal :swedish, @wl.language("Vilket språk är detta?")
   end
-
+
+  def test_danish
+    assert_equal :danish, @wl.language("Dansk er et nord-germansk sprog af den østnordiske (kontinentale) gruppe, der tales af ca. seks millioner mennesker.")
+  end
+
   def test_nothing
     assert_nil @wl.language("")
   end
-
+
   def test_something
     assert_not_nil @wl.language("test")
   end
-
+
   def test_processor
     assert_kind_of Hash, @wl.process_text("this is a test")
   end
 
-  def test_italian
-    assert_equal :italian, @wl.language("Roma, capitale dell'impero romano, è stata per secoli il centro politico e culturale della civiltà occidentale.")
-  end
-
   def test_language_selection
     selective_wl = WhatLanguage.new(:german, :english)
     assert_equal :german, selective_wl.language("der die das")
@@ -110,4 +120,10 @@ class TestWhatLanguage < Test::Unit::TestCase
     selective_wl = WhatLanguage.new(:german, :all, :english)
     assert_equal :russian, selective_wl.language("Все новости в хронологическом порядке")
   end
-
+
+  if defined? UnicodeUtils
+    def test_casing_conversion
+      assert_equal "âncora cor âmbar".language, "ÂNCORA COR ÂMBAR".language
+    end
+  end
+end
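Note on the optional `unicode_utils` dependency: the test file tries to load it and ignores a `LoadError`, while `lib/whatlanguage.rb` picks its lowercasing strategy with `defined? UnicodeUtils`. The same pattern can be sketched on its own. `Lowercaser` is a hypothetical module used here for illustration; the gem defines `to_lowercase` on `WhatLanguage` itself.

```ruby
# unicode_utils is optional: load it if installed, otherwise carry on.
begin
  require 'unicode_utils'
rescue LoadError
  # Not installed; the String#downcase fallback below is used instead.
end

# Hypothetical stand-alone module showing the same branch as the gem.
module Lowercaser
  if defined?(UnicodeUtils)
    # Unicode-aware case folding when the library was loaded.
    def self.to_lowercase(str)
      UnicodeUtils.casefold(str)
    end
  else
    # Built-in fallback.
    def self.to_lowercase(str)
      str.downcase
    end
  end
end

puts Lowercaser.to_lowercase("THIS IS A TEST")
```

The branch is evaluated once, when the file is loaded, so `unicode_utils` has to be required before `whatlanguage` for the Unicode-aware path to take effect. That matches the order of the `require` calls in the test file.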
metadata
CHANGED
@@ -1,15 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: whatlanguage
 version: !ruby/object:Gem::Version
-  version: 1.0.
-  prerelease:
+  version: 1.0.6
 platform: ruby
 authors:
 - Peter Cooper
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2016-01-28 00:00:00.000000000 Z
 dependencies: []
 description: WhatLanguage rapidly detects the language of a sample of text
 email:
@@ -18,7 +17,7 @@ executables: []
 extensions: []
 extra_rdoc_files: []
 files:
-- .gitignore
+- ".gitignore"
 - Gemfile
 - History.txt
 - LICENSE.txt
@@ -30,6 +29,7 @@ files:
 - copyright-en
 - example.rb
 - lang/arabic.lang
+- lang/danish.lang
 - lang/dutch.lang
 - lang/english.lang
 - lang/farsi.lang
@@ -51,32 +51,32 @@ files:
 - lib/whatlanguage.rb
 - lib/whatlanguage/bitfield.rb
 - lib/whatlanguage/bloominsimple.rb
+- lib/whatlanguage/string.rb
 - lib/whatlanguage/version.rb
 - test/test_whatlanguage.rb
 - whatlanguage.gemspec
 homepage: https://github.com/peterc/whatlanguage
 licenses: []
+metadata: {}
 post_install_message:
 rdoc_options: []
 require_paths:
 - lib
 required_ruby_version: !ruby/object:Gem::Requirement
-  none: false
   requirements:
-  - -
+  - - ">="
     - !ruby/object:Gem::Version
       version: '0'
 required_rubygems_version: !ruby/object:Gem::Requirement
-  none: false
   requirements:
-  - -
+  - - ">="
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version:
+rubygems_version: 2.4.5
 signing_key:
-specification_version:
+specification_version: 4
 summary: Natural language detection for text samples
 test_files:
 - test/test_whatlanguage.rb