whatlanguage 1.0.5 → 1.0.6

@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+   metadata.gz: ccd0489249d639e17bda9ae90c496d33e2e6048a
+   data.tar.gz: 83a969f3f186de199996e19bd6c6f391ba7cae24
+ SHA512:
+   metadata.gz: a1fbd9f4745e74637c8eddf8a0c4ac74fed02493d00b011c8e3de80887000253b0d6e39260d54afed9cc1918270c3dad227c963c07591840ad4692ecd775682a
+   data.tar.gz: 59a45845d1f19f073d0f1af57526fcedea5813126f93a7795fe4218842bcf6212d7169f6a34d04d060d587f24c172558c7548ae7d063345f7e5a4800c105f078
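The checksums hunk above records SHA1 and SHA512 digests for the two archives packed inside the gem (`metadata.gz` and `data.tar.gz`). As a rough sketch of how such a digest could be checked by hand with Ruby's standard `digest` library: the helper name `digest_matches?` and the temporary sample file are hypothetical, and this is not RubyGems' own verification code.

```ruby
require 'digest/sha1'
require 'digest/sha2'
require 'tempfile'

# Hypothetical helper: compares a file's digest with an expected hex string.
# `algorithm` is a Digest class such as Digest::SHA1 or Digest::SHA512.
def digest_matches?(path, expected_hex, algorithm = Digest::SHA512)
  algorithm.file(path).hexdigest == expected_hex.strip.downcase
end

# Stand-in for an unpacked archive. A real check would point at the
# metadata.gz or data.tar.gz extracted from the .gem file.
sample = Tempfile.new('sample')
sample.write('example payload')
sample.close

expected = Digest::SHA512.hexdigest('example payload')
puts digest_matches?(sample.path, expected)      # true
puts digest_matches?(sample.path, '0' * 128)     # false
```

The same helper accepts `Digest::SHA1` as the third argument for the 40-character values listed under `SHA1:`.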
@@ -1,3 +1,8 @@
+ == 1.0.6 / 2016-01-28
+
+ * Minor test fixes and tweaks
+ * New release taking into account a handful of pull requests
+
  == 1.0.5 / 2013-10-05
 
  * Many more languages supported
data/README.md CHANGED
@@ -2,59 +2,72 @@
 
  by Peter Cooper
 
- Text language detection. Quick, fast, memory efficient, and all in pure Ruby. Uses Bloom filters for aforementioned speed and memory benefits.
+ Text language detection. Quick, fast, memory efficient, and all in pure Ruby. Uses Bloom filters for aforementioned speed and memory benefits. It works well on texts of over 10 words in length (e.g. blog posts or comments) and *very poorly* on short or Twitter-esque text, so be aware.
 
  Works with Dutch, English, Farsi, French, German, Italian, Pinyin, Swedish, Portuguese, Russian, Arabic, Finnish, Greek, Hebrew, Hungarian, Korean, Norwegian, Polish and Spanish out of the box.
 
  ## Important note
-
- This library was first built in 2007 and has received a few minor updates over the years. There are now more efficient and effective algorithms for doing language detection which I am investigating for a WhatLanguage 2.0.
 
- This library has been updated to be distributed and to work on modern Ruby implementations but other than that, has had no improvements.
+ This library was first built in 2007 and has received only a few minor updates over the years. There are now more efficient and effective algorithms for doing language detection which I am investigating for a future WhatLanguage.
+
+ This library has been updated to be distributed and to work on modern Ruby implementations but other than that, has had no significant improvements.
 
  ## Synopsis
 
  Full Example
 
- require 'whatlanguage'
-
- texts = []
- texts << %q{Deux autres personnes ont été arrêtées durant la nuit}
- texts << %q{The links between the attempted car bombings in Glasgow and London are becoming clearer}
- texts << %q{En estado de máxima alertaen su nivel de crítico}
- texts << %q{Returns the object in enum with the maximum value.}
- texts << %q{Propose des données au sujet de la langue espagnole.}
- texts << %q{La palabra "mezquita" se usa en español para referirse a todo tipo de edificios dedicados.}
- texts << %q{اللغة التي هي هذه؟}
- texts << %q{Mitä kieltä tämä on?}
- texts << %q{Ποια γλώσσα είναι αυτή;}
- texts << %q{באיזו שפה זה?}
- texts << %q{Milyen nyelv ez?}
- texts << %q{ 어떤 언어인가?}
- texts << %q{Hvilket språk er dette?}
- texts << %q{W jakim języku to jest?}
-
- texts.each { |text| puts "#{text[0..18]}... is in #{text.language.to_s.capitalize}" }
+ ```ruby
+ require 'whatlanguage/string'
+
+ texts = []
+ texts << %q{Deux autres personnes ont été arrêtées durant la nuit}
+ texts << %q{The links between the attempted car bombings in Glasgow and London are becoming clearer}
+ texts << %q{En estado de máxima alertaen su nivel de crítico}
+ texts << %q{Returns the object in enum with the maximum value.}
+ texts << %q{Propose des données au sujet de la langue espagnole.}
+ texts << %q{La palabra "mezquita" se usa en español para referirse a todo tipo de edificios dedicados.}
+ texts << %q{اللغة التي هي هذه؟}
+ texts << %q{Mitä kieltä tämä on?}
+ texts << %q{Ποια γλώσσα είναι αυτή;}
+ texts << %q{באיזו שפה זה?}
+ texts << %q{Milyen nyelv ez?}
+ texts << %q{ 어떤 언어인가?}
+ texts << %q{Hvilket språk er dette?}
+ texts << %q{W jakim języku to jest?}
+
+ texts.each { |text| puts "#{text[0..18]}... is in #{text.language.to_s.capitalize}" }
+ ```
 
  Initialize WhatLanguage with all filters
 
- wl = WhatLanguage.new(:all)
+ ```ruby
+ wl = WhatLanguage.new(:all)
+ ```
 
  Return language with best score
 
- wl.language(text)
+ ```ruby
+ wl.language(text)
+ ```
 
  Return hash with scores for all relevant languages
 
- wl.process_text(text)
+ ```ruby
+ wl.process_text(text)
+ ```
 
- Convenience method on String
+ Convenience methods on String
 
- "This is a test".language # => "English"
+ ```ruby
+ "This is a test".language # => :english
+ "This is a test".language_iso # => :en
+ ```
 
  Initialize WhatLanguage with certain languages
 
- wl = WhatLanguage.new(:english, :german, :french)
+ ```ruby
+ wl = WhatLanguage.new(:english, :german, :french)
+ ```
 
  ## Requirements
 
@@ -66,18 +79,20 @@ None, minor libraries (BloominSimple and BitField) included with this release.
 
  To test, go into irb, then:
 
- require 'whatlanguage'
- "Je suis un homme".language
+ ```ruby
+ require 'whatlanguage'
+ "Je suis un homme".language
+ ```
 
  ## Credits
 
- Contributions from Konrad Reiche and Salimane Adjao Moustapha appreciated.
+ Contributions from Konrad Reiche, Salimane Adjao Moustapha, and others appreciated.
 
  ## License
 
  MIT License
 
- Copyright (c) 2007-2013 Peter Cooper
+ Copyright (c) 2007-2016 Peter Cooper
 
  Permission is hereby granted, free of charge, to any person obtaining
  a copy of this software and associated documentation files (the
Binary file
@@ -2,65 +2,100 @@ require 'whatlanguage/bloominsimple'
  require 'whatlanguage/bitfield'
  require 'digest/sha1'
 
- class WhatLanguage
+ class WhatLanguage
    HASHER = lambda { |item| Digest::SHA1.digest(item.downcase.strip).unpack("VV") }
-
+
    BITFIELD_WIDTH = 2_000_000
-
+
+   ISO_CODES = {
+     nil => nil,
+     :arabic => :ar,
+     :danish => :da,
+     :dutch => :nl,
+     :english => :en,
+     :farsi => :fa,
+     :finnish => :fi,
+     :french => :fr,
+     :german => :de,
+     :greek => :el,
+     :hebrew => :he,
+     :hungarian => :hu,
+     :italian => :it,
+     :korean => :ko,
+     :norwegian => :no,
+     :pinyin => :zh,
+     :polish => :pl,
+     :portuguese => :pt,
+     :russian => :ru,
+     :spanish => :es,
+     :swedish => :sv
+   }
+
    @@data = {}
-
+
    def initialize(*selection)
      @selection = (selection.empty?) ? [:all] : selection
-     languages_folder = File.join(File.dirname(__FILE__), "..", "lang")
-     Dir.entries(languages_folder).grep(/\.lang/).each do |lang|
-       @@data[lang[/\w+/].to_sym] ||= BloominSimple.from_dump(File.new(File.join(languages_folder, lang), 'rb').read, &HASHER)
+     if @@data.empty?
+       languages_folder = File.join(File.dirname(__FILE__), "..", "lang")
+       Dir.entries(languages_folder).grep(/\.lang/).each do |lang|
+         @@data[lang[/\w+/].to_sym] ||= BloominSimple.from_dump(File.new(File.join(languages_folder, lang), 'rb').read, &HASHER)
+       end
      end
    end
-
+
+   def languages
+     @languages ||=
+       begin
+         if @selection.include?(:all)
+           languages = @@data.keys
+         else
+           languages = @@data.keys & @selection # intersection
+         end
+       end
+   end
+
    # Very inefficient method for now.. but still beats the non-Bloom alternatives.
    # Change to better bit comparison technique later..
    def process_text(text)
      results = Hash.new(0)
      it = 0
-     text.downcase.split.each do |word|
+     to_lowercase(text).split.each do |word|
        it += 1
 
-       if @selection.include?(:all)
-         languages = @@data.keys
-       else
-         languages = @@data.keys & @selection # intersection
-       end
-
        languages.each do |lang|
          results[lang] += 1 if @@data[lang].includes?(word)
        end
-
+
        # Every now and then check to see if we have a really convincing result.. if so, exit early.
        if it % 4 == 0 && results.size > 1
          top_results = results.sort_by{|a,b| -b}[0..1]
-
+
          # Next line may need some tweaking one day..
          break if top_results[0][1] > 4 && ((top_results[0][1] > top_results[1][1] * 2) || (top_results[0][1] - top_results[1][1] > 25))
        end
-
+
        #break if it > 100
      end
      results
    end
-
+
    def language(text)
      process_text(text).max { |a,b| a[1] <=> b[1] }.first rescue nil
    end
-
+
+   def language_iso(text)
+     ISO_CODES[language(text)]
+   end
+
    def self.filter_from_dictionary(filename)
      bf = BloominSimple.new(BITFIELD_WIDTH, &HASHER)
      File.open(filename).each { |word| bf.add(word) }
      bf
    end
- end
 
- class String
-   def language
-     WhatLanguage.new(:all).language(self)
+   if !defined? UnicodeUtils
+     define_method(:to_lowercase) { |str| str.downcase }
+   else
+     define_method(:to_lowercase) { |str| UnicodeUtils.casefold(str) }
    end
  end
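The early-exit condition in `process_text` is untouched by this release but is dense to read in diff form. Every fourth word, once at least two languages have scored, the loop stops if the leader has more than 4 hits and either more than doubles the runner-up or leads it by more than 25. A standalone sketch of that rule follows; the helper name `convincing?` is hypothetical and does not exist in the gem.

```ruby
# Hypothetical helper mirroring the break condition in process_text.
# `results` maps language => number of words matched so far.
def convincing?(results)
  return false if results.size < 2
  best, second = results.values.max(2) # two highest counts, descending
  best > 4 && (best > second * 2 || best - second > 25)
end

puts convincing?({ english: 9, french: 4 })    # true:  9 > 4 and 9 > 4 * 2
puts convincing?({ english: 9, french: 5 })    # false: 9 is not > 10 and the gap is only 4
puts convincing?({ english: 4, french: 1 })    # false: the leader needs more than 4 hits
puts convincing?({ english: 60, french: 34 })  # true:  the gap of 26 exceeds 25
```

The gap-of-25 branch only matters on long inputs where two related languages both score heavily and the doubling test never fires.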
@@ -0,0 +1,11 @@
+ require 'whatlanguage'
+
+ class String
+   def language
+     WhatLanguage.new(:all).language(self)
+   end
+
+   def language_iso
+     WhatLanguage.new(:all).language_iso(self)
+   end
+ end
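The new `String#language_iso` delegates to `WhatLanguage#language_iso`, which is a plain hash lookup on whatever `language` returns. The `nil => nil` entry in `ISO_CODES` spells out that an undetected language maps to `nil`. A minimal sketch of that lookup, using a subset of the table and a stand-in detector (`StubDetector` and its keyword match are hypothetical; the real gem scores words against Bloom filters):

```ruby
# Subset of the ISO_CODES table from the diff, for illustration only.
ISO_CODES = { nil => nil, :english => :en, :french => :fr, :pinyin => :zh }

# Hypothetical stand-in for WhatLanguage: a crude keyword check instead of
# the gem's Bloom-filter scoring.
class StubDetector
  def language(text)
    return nil if text.strip.empty?
    text =~ /\b(je|suis|un)\b/i ? :french : :english
  end

  # Same shape as the method added in this release.
  def language_iso(text)
    ISO_CODES[language(text)]
  end
end

detector = StubDetector.new
p detector.language_iso("Je suis un homme")  # :fr
p detector.language_iso("This is a test")    # :en
p detector.language_iso("")                  # nil
```

Both String methods build a fresh `WhatLanguage.new(:all)` per call; with the `@@data.empty?` guard added to `initialize` in this release, the language filters are read from disk only on the first construction.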
@@ -1,3 +1,3 @@
  class WhatLanguage
-   VERSION = '1.0.5'
+   VERSION = '1.0.6'
  end
@@ -1,17 +1,27 @@
  # encoding: utf-8
  require "test/unit"
 
- require 'whatlanguage'
+ # not a dependency
+ begin
+   require 'unicode_utils'
+ rescue LoadError
+ end
+
+ require 'whatlanguage/string'
 
  class TestWhatLanguage < Test::Unit::TestCase
    def setup
      @wl = WhatLanguage.new(:all)
    end
-
+
    def test_string_method
      assert_equal :english, "This is a test".language
    end
 
+   def test_string_iso_method
+     assert_equal :en, "this is a test".language_iso
+   end
+
    def test_arabic
      assert_equal :arabic, @wl.language("اللغة التي هي هذه؟")
    end
@@ -29,9 +39,9 @@ class TestWhatLanguage < Test::Unit::TestCase
    end
 
    def test_french
-     assert_equal :french, @wl.language("Bonjour, je m'appelle Sandrine. Voila ma chatte.")
+     assert_equal :french, @wl.language("Bonjour, je m'appelle Sandrine. Voila mon chat.")
    end
-
+
    def test_german
      assert_equal :german, @wl.language("Welche Sprache ist das?")
    end
@@ -79,23 +89,23 @@ class TestWhatLanguage < Test::Unit::TestCase
    def test_swedish
      assert_equal :swedish, @wl.language("Vilket språk är detta?")
    end
-
+
+   def test_danish
+     assert_equal :danish, @wl.language("Dansk er et nord-germansk sprog af den østnordiske (kontinentale) gruppe, der tales af ca. seks millioner mennesker.")
+   end
+
    def test_nothing
      assert_nil @wl.language("")
    end
-
+
    def test_something
      assert_not_nil @wl.language("test")
    end
-
+
    def test_processor
      assert_kind_of Hash, @wl.process_text("this is a test")
    end
 
-   def test_italian
-     assert_equal :italian, @wl.language("Roma, capitale dell'impero romano, è stata per secoli il centro politico e culturale della civiltà occidentale.")
-   end
-
    def test_language_selection
      selective_wl = WhatLanguage.new(:german, :english)
      assert_equal :german, selective_wl.language("der die das")
@@ -110,4 +120,10 @@ class TestWhatLanguage < Test::Unit::TestCase
      selective_wl = WhatLanguage.new(:german, :all, :english)
      assert_equal :russian, selective_wl.language("Все новости в хронологическом порядке")
    end
- end
+
+   if defined? UnicodeUtils
+     def test_casing_conversion
+       assert_equal "âncora cor âmbar".language, "ÂNCORA COR ÂMBAR".language
+     end
+   end
+ end
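The test file treats `unicode_utils` as optional: it tries to load the gem, ignores a `LoadError`, and only defines `test_casing_conversion` when the `UnicodeUtils` constant exists. The library picks its `to_lowercase` implementation the same way, at class-definition time. A sketch of that pattern in isolation follows; the class name `Normalizer` is hypothetical.

```ruby
# Optional dependency: use unicode_utils when it is installed, carry on otherwise.
begin
  require 'unicode_utils'
rescue LoadError
  # Not installed; the String#downcase branch below is used instead.
end

class Normalizer # hypothetical class name, for illustration
  # The branch is chosen once, when the class body is evaluated.
  if defined?(UnicodeUtils)
    define_method(:to_lowercase) { |str| UnicodeUtils.casefold(str) }
  else
    define_method(:to_lowercase) { |str| str.downcase }
  end
end

puts Normalizer.new.to_lowercase("This Is A Test") # this is a test
```

Because the check runs when the class is loaded, `unicode_utils` has to be required before the library, which is the order the test file uses. On Ruby versions before 2.4, `String#downcase` only folded ASCII letters, which is presumably why the accented-capitals test is limited to the `UnicodeUtils` case.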
metadata CHANGED
@@ -1,15 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: whatlanguage
  version: !ruby/object:Gem::Version
-   version: 1.0.5
- prerelease:
+   version: 1.0.6
  platform: ruby
  authors:
  - Peter Cooper
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2013-10-05 00:00:00.000000000 Z
+ date: 2016-01-28 00:00:00.000000000 Z
  dependencies: []
  description: WhatLanguage rapidly detects the language of a sample of text
  email:
@@ -18,7 +17,7 @@ executables: []
  extensions: []
  extra_rdoc_files: []
  files:
- - .gitignore
+ - ".gitignore"
  - Gemfile
  - History.txt
  - LICENSE.txt
@@ -30,6 +29,7 @@ files:
  - copyright-en
  - example.rb
  - lang/arabic.lang
+ - lang/danish.lang
  - lang/dutch.lang
  - lang/english.lang
  - lang/farsi.lang
@@ -51,32 +51,32 @@ files:
  - lib/whatlanguage.rb
  - lib/whatlanguage/bitfield.rb
  - lib/whatlanguage/bloominsimple.rb
+ - lib/whatlanguage/string.rb
  - lib/whatlanguage/version.rb
  - test/test_whatlanguage.rb
  - whatlanguage.gemspec
  homepage: https://github.com/peterc/whatlanguage
  licenses: []
+ metadata: {}
  post_install_message:
  rdoc_options: []
  require_paths:
  - lib
  required_ruby_version: !ruby/object:Gem::Requirement
-   none: false
    requirements:
-   - - ! '>='
+   - - ">="
      - !ruby/object:Gem::Version
        version: '0'
  required_rubygems_version: !ruby/object:Gem::Requirement
-   none: false
    requirements:
-   - - ! '>='
+   - - ">="
      - !ruby/object:Gem::Version
        version: '0'
  requirements: []
  rubyforge_project:
- rubygems_version: 1.8.24
+ rubygems_version: 2.4.5
  signing_key:
- specification_version: 3
+ specification_version: 4
  summary: Natural language detection for text samples
  test_files:
  - test/test_whatlanguage.rb