whatlanguage 1.0.0

data/History.txt ADDED
@@ -0,0 +1,4 @@
1
+ == 1.0.0 / 2007-07-02
2
+
3
+ * First version with pre-built English, French, and Spanish filters
4
+
data/Manifest.txt ADDED
@@ -0,0 +1,19 @@
1
+ History.txt
2
+ Manifest.txt
3
+ README.txt
4
+ Rakefile
5
+ build_filter.rb
6
+ example.rb
7
+ lang/dutch.lang
8
+ lang/farsi.lang
9
+ lang/german.lang
10
+ lang/pinyin.lang
11
+ lang/russian.lang
12
+ lang/english.lang
13
+ lang/portuguese.lang
14
+ lang/french.lang
15
+ lang/spanish.lang
16
+ lib/bitfield.rb
17
+ lib/bloominsimple.rb
18
+ lib/whatlanguage.rb
19
+ test/test_whatlanguage.rb
data/README.txt ADDED
@@ -0,0 +1,71 @@
1
+ whatlanguage
2
+ by Peter Cooper
3
+ http://www.petercooper.co.uk/
4
+ http://www.rubyinside.com/
5
+
6
+ == DESCRIPTION:
7
+
8
+ Text language detection. Fast, memory efficient, and all in pure Ruby. Uses Bloom filters for the aforementioned speed and memory benefits.
9
+
10
+ == FEATURES/PROBLEMS:
11
+
12
+ * Only does French, English and Spanish out of the box. Very easy to train new languages though.
13
+ * It can be made far more efficient at the comparison stage, but all in good time..! It still beats literal dictionary approaches.
14
+ * No filter selection yet, you get 'em all loaded.
15
+ * Tests are reasonably light.
16
+
17
+ == SYNOPSIS:
18
+
19
+ Full Example
20
+ require 'whatlanguage'
21
+
22
+ texts = []
23
+ texts << %q{Deux autres personnes ont été arrêtées durant la nuit}
24
+ texts << %q{The links between the attempted car bombings in Glasgow and London are becoming clearer}
25
+ texts << %q{En estado de máxima alertaen su nivel de crítico}
26
+ texts << %q{Returns the object in enum with the maximum value.}
27
+ texts << %q{Propose des données au sujet de la langue espagnole.}
28
+ texts << %q{La palabra "mezquita" se usa en español para referirse a todo tipo de edificios dedicados.}
29
+
30
+ texts.each { |text| puts "#{text[0..18]}... is in #{text.language.to_s.capitalize}" }
31
+
32
+ Initialize WhatLanguage with all filters
33
+ wl = WhatLanguage.new(:all)
34
+
35
+ Return language with best score
36
+ wl.language(text)
37
+
38
+ Return hash with scores for all relevant languages
39
+ wl.process_text(text)
40
+
41
+ Convenience method on String
42
+ "This is a test".language # => :english
43
+
44
+ == REQUIREMENTS:
45
+
46
+ * None, minor libraries (BloominSimple and BitField) included with this release.
47
+
48
+ == LICENSE:
49
+
50
+ (The MIT License)
51
+
52
+ Copyright (c) 2007-2008 Peter Cooper
53
+
54
+ Permission is hereby granted, free of charge, to any person obtaining
55
+ a copy of this software and associated documentation files (the
56
+ 'Software'), to deal in the Software without restriction, including
57
+ without limitation the rights to use, copy, modify, merge, publish,
58
+ distribute, sublicense, and/or sell copies of the Software, and to
59
+ permit persons to whom the Software is furnished to do so, subject to
60
+ the following conditions:
61
+
62
+ The above copyright notice and this permission notice shall be
63
+ included in all copies or substantial portions of the Software.
64
+
65
+ THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
66
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
67
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
68
+ IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
69
+ CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
70
+ TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
71
+ SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/Rakefile ADDED
@@ -0,0 +1,17 @@
1
+ # -*- ruby -*-
2
+
3
+ require 'rubygems'
4
+ require 'hoe'
5
+ require './lib/whatlanguage.rb'
6
+
7
+ Hoe.new('whatlanguage', WhatLanguage::VERSION) do |p|
8
+ p.rubyforge_name = 'whatlanguage'
9
+ p.author = 'Peter Cooper'
10
+ p.email = 'whatlanguage@peterc.org'
11
+ p.summary = 'Fast, quick, textual language detection'
12
+ p.description = p.paragraphs_of('README.txt', 2..5).join("\n\n")
13
+ p.url = "http://rubyforge.org/projects/whatlanguage/"
14
+ p.changes = p.paragraphs_of('History.txt', 0..1).join("\n\n")
15
+ end
16
+
17
+ # vim: syntax=Ruby
data/build_filter.rb ADDED
@@ -0,0 +1,9 @@
1
+ # Use this to build new filters (ideally for other languages) from /usr/share/dict/words-style dictionaries (one word per line).
2
+ #
3
+ # Call like so..
4
+ # ruby build_filter.rb /usr/share/dict/words lang/english.lang
5
+ # (replace params as necessary)
6
+
7
+ require File.join(File.dirname(__FILE__), 'lib', 'whatlanguage')
8
+ filter = WhatLanguage.filter_from_dictionary(ARGV[0])
9
+ File.open(ARGV[1], 'w') { |f| f.write filter.dump }
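The script above writes the result of `filter.dump` to disk. As a minimal sketch of that kind of binary round trip, assuming the dump is just the filter size followed by the bit-field elements packed as unsigned integers (which is what `BloominSimple#dump` later in this diff does), the following uses made-up values rather than a real `.lang` file:

```ruby
# Hypothetical values for illustration; not read from a real .lang file.
size  = 96                          # filter width in bits
field = [0b1011, 0, 4_000_000_000]  # three unsigned integers of bit data

# Size first, then the raw elements, as native-endian unsigned ints.
dump = [size, *field].pack("I*")

# Reverse the process: the first integer is the size, the rest is the field.
restored       = dump.unpack("I*")
restored_size  = restored[0]
restored_field = restored[1..-1]

puts restored_size == size    # prints "true"
puts restored_field == field  # prints "true"
```

Note that the `I` directive uses the platform's native integer width and byte order, so a file written this way is only guaranteed to load on a similar machine. A fixed-width directive such as `V` (32-bit little-endian) would be one way to make the format portable.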
data/example.rb ADDED
@@ -0,0 +1,51 @@
1
+ require File.join(File.dirname(__FILE__), 'lib', 'whatlanguage')
2
+ require 'benchmark'
3
+
4
+ texts = []
5
+
6
+ texts << %q{Deux autres personnes ont été arrêtées durant la nuit. "Induite par le quinquennat, la modernisation de nos institutions (...) est un facteur de modernité et d'efficacité", a déclaré François Fillon. "Devant cet exécutif plus resserré et plus efficace, les pouvoirs du Parlement doivent être renforcés", a-t-il ajouté, en évoquant par exemple le contrôle parlementaire des nominations à "certains postes publics."
7
+
8
+ Le premier ministre, indiquant que le président Nicolas Sarkozy entendait "réunir une commission réunissant des personnalités incontestables pour leurs compétences et représentatives de notre diversité politique qui sera chargée d'éclairer ses choix" en matière de modernisation des institutions.
9
+
10
+ "Faut-il faire élire quelques députés au scrutin proportionnel? (...) Aucun sujet ne doit être tabou", a-t-il lâché. "Enfin nous devrons engager, comme le demande le Conseil constitutionnel, une révision de la carte des circonscriptions législatives. Ce travail sera engagé dans la transparence et en y associant l'opposition".}
11
+
12
+ texts << %q{The links between the attempted car bombings in Glasgow and London are becoming clearer. Seven are believed to be doctors or medical students, while one formerly worked as a laboratory technician.
13
+
14
+ Australian media have identified a man arrested at Brisbane airport as Dr Mohammed Haneef, 27.
15
+
16
+ Two men have been arrested in Blackburn under terror laws but police have not confirmed a link with the car bombs.
17
+
18
+ The pair were detained on an industrial estate and are being held at a police station in Lancashire on suspicion of offences under the Terrorism Act 2000.
19
+
20
+ Thousands of passengers travelling from Heathrow Airport's Terminal 4 face major delays after a suspect bag sparked a security alert.
21
+
22
+ BAA said the departure lounge was partially evacuated and departing passengers were being re-screened, causing some cancellations and delays.}
23
+
24
+ texts << %q{En estado de máxima alertaen su nivel de crítico. Cinco detenciones hasta el momento parecen indicar que los coches bomba de Londres y Glasgow fueron obra de una misma célula de terroristas islámicos con residencia en el Reino Unido, posiblemente con alguna conexión con otros grupos previamente desarticulados. La relación con éstos, singularmente con el núcleo de Dhiren Barot (condenado a 30 años por sus planes de llenar limusinas con bombonas de gas para provocar una masiva explosión), podría haber llevado a su detención tiempo atrás y su puesta en libertad al no existir suficientes pruebas contra ellos.}
25
+
26
+ texts << %q{Fern- und Regionalzüge, aber auch die S-Bahnen in den Großstädten stehen still. Gerade hat die Kanzlerin die Ergebnisse des Energiegipfels mit der Wirtschaft, den Energieerzeugern und Verbraucherschützern referiert, die "sachliche Atmosphäre" gelobt. Da wird der Umweltminister von einem Journalisten an seinen Ausspruch vom "Wirtschaftsstalinisten" erinnert, mit dem er jüngst den BASF-Vorstandschef Jürgen Hambrecht belegt hat im Streit um die ehrgeizigen Ziele der deutschen Klimapolitik. Wie haben denn seine Beiträge in der Runde für eine sachliche Atmosphäre ausgesehen? Gabriel überlegt, aber die Kanzlerin ist schneller.}
27
+
28
+ #texts << %q{Deux autres personnes ont été arrêtées durant la nuit}
29
+ #texts << %q{The links between the attempted car bombings in Glasgow and London are becoming clearer}
30
+ #texts << %q{En estado de máxima alertaen su nivel de crítico}
31
+ #texts << %q{Returns the object in enum with the maximum value.}
32
+ #texts << %q{Propose des données au sujet de la langue espagnole.}
33
+ #texts << %q{La palabra "mezquita" se usa en español para referirse a todo tipo de edificios dedicados.}
34
+ #texts << %q{Fern- und Regionalzüge, aber auch die S-Bahnen in den Großstädten stehen still.}
35
+
36
+ #texts.collect! { |t| (t + " ") * 5 }
37
+
38
+ @wl = WhatLanguage.new(:all)
39
+
40
+ puts Benchmark.measure {
41
+
42
+ 100.times do
43
+ texts.each { |text|
44
+ lang = text.language.to_s.capitalize
45
+ # puts "#{text[0..18]}... is in #{lang}"
46
+ # puts @wl.process_text(text).sort_by{|a,b| b }.reverse.inspect
47
+ # puts "---"
48
+ }
49
+ end
50
+
51
+ }
data/lang/dutch.lang ADDED
Binary file
data/lang/english.lang ADDED
Binary file
data/lang/farsi.lang ADDED
Binary file
data/lang/french.lang ADDED
Binary file
data/lang/german.lang ADDED
Binary file
data/lang/pinyin.lang ADDED
Binary file
data/lang/portuguese.lang ADDED
Binary file
data/lang/russian.lang ADDED
Binary file
data/lang/spanish.lang ADDED
Binary file
data/lib/bitfield.rb ADDED
@@ -0,0 +1,64 @@
1
+ # NAME: BitField
2
+ # AUTHOR: Peter Cooper
3
+ # LICENSE: MIT ( http://www.opensource.org/licenses/mit-license.php )
4
+ # COPYRIGHT: (c) 2007 Peter Cooper (http://www.petercooper.co.uk/)
5
+ # VERSION: v4
6
+ # HISTORY: v4 (better support for loading and dumping fields)
7
+ # v3 (supports dynamic bitwidths for array elements.. now doing 32 bit widths default)
8
+ # v2 (now uses 1 << y, rather than 2 ** y .. it's 21.8 times faster!)
9
+ # v1 (first release)
10
+ #
11
+ # DESCRIPTION: Basic, pure Ruby bit field. Pretty fast (for what it is) and memory efficient.
12
+ # I've written a pretty intensive test suite for it and it passes great.
13
+ # Works well for Bloom filters (the reason I wrote it).
14
+ #
15
+ # Create a bit field 1000 bits wide
16
+ # bf = BitField.new(1000)
17
+ #
18
+ # Setting and reading bits
19
+ # bf[100] = 1
20
+ # bf[100] .. => 1
21
+ # bf[100] = 0
22
+ #
23
+ # More
24
+ # bf.to_s .. => "10101000101010101" (example)
25
+ # bf.total_set .. => 10 (example - 10 bits are set to "1")
26
+
27
+ class BitField
28
+ attr_reader :size
29
+ attr_accessor :field
30
+ include Enumerable
31
+
32
+ ELEMENT_WIDTH = 32
33
+
34
+ def initialize(size)
35
+ @size = size
36
+ @field = Array.new(((size - 1) / ELEMENT_WIDTH) + 1, 0)
37
+ end
38
+
39
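+ # Set a bit (1/0). Clearing uses AND-NOT so that clearing an already-clear bit is a no-op (XOR would toggle it back on)
40
+ def []=(position, value)
41
+ value == 1 ? @field[position / ELEMENT_WIDTH] |= 1 << (position % ELEMENT_WIDTH) : @field[position / ELEMENT_WIDTH] &= ~(1 << (position % ELEMENT_WIDTH))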
42
+ end
43
+
44
+ # Read a bit (1/0)
45
+ def [](position)
46
+ @field[position / ELEMENT_WIDTH] & 1 << (position % ELEMENT_WIDTH) > 0 ? 1 : 0
47
+ end
48
+
49
+ # Iterate over each bit
50
+ def each(&block)
51
+ @size.times { |position| yield self[position] }
52
+ end
53
+
54
+ # Returns the field as a string like "0101010100111100," etc.
55
+ def to_s
56
+ inject("") { |a, b| a + b.to_s }
57
+ end
58
+
59
+ # Returns the total number of bits that are set
60
+ # (The technique used here is about 6 times faster than using each or inject direct on the bitfield)
61
+ def total_set
62
+ @field.inject(0) { |a, byte| a += byte & 1 and byte >>= 1 until byte == 0; a }
63
+ end
64
+ end
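To make the index arithmetic in the bit field easier to follow, here is a small self-contained sketch of the same idea: bits are packed 32 to an array element, and a bit position is split into an element index and an offset within that element. `TinyBitField` is a hypothetical name used only for this illustration; it is not the gem's `BitField` class, and it counts set bits in a simpler (not necessarily faster) way.

```ruby
# Hypothetical, self-contained sketch; not the gem's BitField class.
class TinyBitField
  ELEMENT_WIDTH = 32

  attr_reader :size

  def initialize(size)
    @size = size
    # One Integer per 32 bits, rounded up.
    @field = Array.new(((size - 1) / ELEMENT_WIDTH) + 1, 0)
  end

  # Set (value == 1) or clear (anything else) the bit at position.
  def []=(position, value)
    index, offset = position.divmod(ELEMENT_WIDTH)
    if value == 1
      @field[index] |= (1 << offset)   # OR switches the bit on
    else
      @field[index] &= ~(1 << offset)  # AND-NOT switches the bit off
    end
  end

  # Read the bit at position as 1 or 0.
  def [](position)
    index, offset = position.divmod(ELEMENT_WIDTH)
    (@field[index] >> offset) & 1
  end

  # Count the bits that are set across all elements.
  def total_set
    @field.sum { |element| element.to_s(2).count("1") }
  end
end

bf = TinyBitField.new(1000)
bf[100] = 1
bf[999] = 1
bf[100] = 0
bf[100] = 0        # clearing twice leaves the bit clear
puts bf[100]       # prints 0
puts bf[999]       # prints 1
puts bf.total_set  # prints 1
```

The sketch clears bits with AND-NOT rather than XOR so that writing 0 to a bit that is already 0 has no effect.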
data/lib/bloominsimple.rb ADDED
@@ -0,0 +1,88 @@
1
+ # NAME: BloominSimple
2
+ # AUTHOR: Peter Cooper
3
+ # LICENSE: MIT ( http://www.opensource.org/licenses/mit-license.php )
4
+ # COPYRIGHT: (c) 2007 Peter Cooper
5
+ # DESCRIPTION: Very basic, pure Ruby Bloom filter. Uses my BitField, pure Ruby
6
+ # bit field library (http://snippets.dzone.com/posts/show/4234).
7
+ # Supports custom hashing (the default hasher yields 3 hash values per item).
8
+ #
9
+ # Create a Bloom filter that uses default hashing with 1Mbit wide bitfield
10
+ # bf = BloominSimple.new(1_000_000)
11
+ #
12
+ # Add items to it
13
+ # File.open('/usr/share/dict/words').each { |a| bf.add(a) }
14
+ #
15
+ # Check for existence of items in the filter
16
+ # bf.includes?("people") # => true
17
+ # bf.includes?("kwyjibo") # => false
18
+ #
19
+ # Add better hashing (c'est easy!)
20
+ # require 'digest/sha1'
21
+ # b = BloominSimple.new(1_000_000) do |item|
22
+ # Digest::SHA1.digest(item.downcase.strip).unpack("VVVV")
23
+ # end
24
+ #
25
+ # More
26
+ # %w{wonderful ball stereo jester flag shshshshsh nooooooo newyorkcity}.each do |a|
27
+ # puts "#{sprintf("%15s", a)}: #{b.includes?(a)}"
28
+ # end
29
+ #
30
+ # # => wonderful: true
31
+ # # => ball: true
32
+ # # => stereo: true
33
+ # # => jester: true
34
+ # # => flag: true
35
+ # # => shshshshsh: false
36
+ # # => nooooooo: false
37
+ # # => newyorkcity: false
38
+
39
+ require File.join(File.dirname(__FILE__), 'bitfield')
40
+
41
+ class BloominSimple
42
+ attr_accessor :bitfield, :hasher
43
+
44
+ def initialize(bitsize, &block)
45
+ @bitfield = BitField.new(bitsize)
46
+ @size = bitsize
47
+ @hasher = block || lambda do |word|
48
+ word = word.downcase.strip
49
+ [h1 = word.sum, h2 = word.hash, h2 + h1 ** 3]
50
+ end
51
+ end
52
+
53
+ # Add item to the filter
54
+ def add(item)
55
+ @hasher[item].each { |hi| @bitfield[hi % @size] = 1 }
56
+ end
57
+
58
+ # Find out if the filter possibly contains the supplied item
59
+ def includes?(item)
60
+ @hasher[item].each { |hi| return false unless @bitfield[hi % @size] == 1 } and true
61
+ end
62
+
63
+ # Intended to compare two filters. Currently returns each filter's set-bit count joined as a string ("a--b"), not the number of shared bits.
64
+ def &(other)
65
+ raise "Wrong sizes" if self.bitfield.size != other.bitfield.size
66
+ same = 0
67
+ #self.bitfield.size.times do |pos|
68
+ # same += 1 if self.bitfield[pos] & other.bitfield[pos] == 1
69
+ #end
70
+ self.bitfield.total_set.to_s + "--" + other.bitfield.total_set.to_s
71
+ end
72
+
73
+ # Dumps the bitfield for a bloom filter for storage
74
+ def dump
75
+ [@size, *@bitfield.field].pack("I*")
76
+ #Marshal.dump([@size, @bitfield])
77
+ end
78
+
79
+ # Creates a new bloom filter object from a stored dump (the hasher block must be passed again, since it is not stored in the dump)
80
+ def self.from_dump(data, &block)
81
+ data = data.unpack("I*")
82
+ #data = Marshal.load(data)
83
+ temp = new(data[0], &block)
84
+ temp.bitfield.field = data[1..-1]
85
+ temp
86
+ end
87
+ end
88
+
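For readers new to Bloom filters, the following is a minimal, self-contained sketch of the technique the file above implements: each item is hashed to several bit positions, adding an item switches those bits on, and a lookup reports a possible match only if every one of those bits is on. `SketchBloom` and its `positions` helper are hypothetical names for this illustration, and a plain Ruby array stands in for the bit field. The hashing (SHA1 digest unpacked with `"VVVV"`) follows the "better hashing" example in the header comment above.

```ruby
require 'digest/sha1'

# Hypothetical, self-contained sketch; not the gem's BloominSimple class.
class SketchBloom
  def initialize(bits)
    @bits  = bits
    @field = Array.new(bits, 0)  # one array slot per bit, for clarity
  end

  # Map an item to four bit positions derived from its SHA1 digest.
  def positions(item)
    Digest::SHA1.digest(item.downcase.strip).unpack("VVVV").map { |h| h % @bits }
  end

  # Adding an item switches on all of its bit positions.
  def add(item)
    positions(item).each { |pos| @field[pos] = 1 }
  end

  # True only if every position is set. This can give false positives
  # (other items may have set the same bits) but never false negatives.
  def includes?(item)
    positions(item).all? { |pos| @field[pos] == 1 }
  end
end

filter = SketchBloom.new(10_000)
%w[people language filter].each { |word| filter.add(word) }

puts filter.includes?("people")      # prints "true"
puts filter.includes?("  People\n")  # prints "true" (input is downcased and stripped)
```

A word that was never added will usually be reported as absent, but the sketch makes no promise about any particular word: the false-positive rate depends on the filter width, the number of hash values, and how many items have been added.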
data/lib/whatlanguage.rb ADDED
@@ -0,0 +1,59 @@
1
+ require File.join(File.dirname(__FILE__), 'bloominsimple')
2
+ require 'digest/sha1'
3
+
4
+ class WhatLanguage
5
+ VERSION = '1.0.0'
6
+
7
+ HASHER = lambda { |item| Digest::SHA1.digest(item.downcase.strip).unpack("VV") }
8
+
9
+ BITFIELD_WIDTH = 2_000_000
10
+
11
+ @@data = {}
12
+
13
+ def initialize(options)
14
+ languages_folder = File.join(File.dirname(__FILE__), "..", "lang")
15
+ Dir.entries(languages_folder).grep(/\.lang/).each do |lang|
16
+ @@data[lang[/\w+/].to_sym] ||= BloominSimple.from_dump(File.read(File.join(languages_folder, lang)), &HASHER)
17
+ end
18
+ end
19
+
20
+ # Very inefficient method for now.. but still beats the non-Bloom alternatives.
21
+ # Change to better bit comparison technique later..
22
+ def process_text(text)
23
+ results = Hash.new(0)
24
+ it = 0
25
+ text.split.collect {|a| a.downcase }.each do |word|
26
+ it += 1
27
+ @@data.keys.each do |lang|
28
+ results[lang] += 1 if @@data[lang].includes?(word)
29
+ end
30
+
31
+ # Every now and then check to see if we have a really convincing result.. if so, exit early.
32
+ if it % 4 == 0 && results.size > 1
33
+ top_results = results.sort_by{|a,b| b}.reverse[0..1]
34
+
35
+ # Next line may need some tweaking one day..
36
+ break if top_results[0][1] > 4 && ((top_results[0][1] > top_results[1][1] * 2) || (top_results[0][1] - top_results[1][1] > 25))
37
+ end
38
+
39
+ #break if it > 100
40
+ end
41
+ results
42
+ end
43
+
44
+ def language(text)
45
+ process_text(text).max { |a,b| a[1] <=> b[1] }.first rescue nil
46
+ end
47
+
48
+ def self.filter_from_dictionary(filename)
49
+ bf = BloominSimple.new(BITFIELD_WIDTH, &HASHER)
50
+ File.open(filename).each { |word| bf.add(word) }
51
+ bf
52
+ end
53
+ end
54
+
55
+ class String
56
+ def language
57
+ WhatLanguage.new(:all).language(self)
58
+ end
59
+ end
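The scoring approach in `process_text` and `language` above can be sketched without any Bloom filters: count, per language, how many words of the text appear in that language's word list, then pick the language with the highest count. In this sketch plain `Set` objects with a handful of hand-picked words stand in for the pre-built filters, the early-exit heuristic is left out, and `VOCAB`, `score_text`, and `guess_language` are hypothetical names used only for illustration.

```ruby
require 'set'

# Tiny hypothetical vocabularies standing in for the .lang Bloom filters.
VOCAB = {
  english: Set.new(%w[the is a this test and of]),
  french:  Set.new(%w[le la est un une et de]),
  spanish: Set.new(%w[el la es un una y de])
}

# Return a hash of language => number of words found in that vocabulary.
def score_text(text, vocab = VOCAB)
  scores = Hash.new(0)
  text.split.map(&:downcase).each do |word|
    vocab.each { |lang, words| scores[lang] += 1 if words.include?(word) }
  end
  scores
end

# Return the best-scoring language, or nil when nothing matched.
def guess_language(text, vocab = VOCAB)
  best = score_text(text, vocab).max_by { |_lang, score| score }
  best && best.first
end

p guess_language("This is a test")  # prints :english
p guess_language("")                # prints nil
```

With exact sets there are no false positives, so this trades the memory savings of the Bloom filters for simplicity. The scoring and selection logic is otherwise the same shape as in the class above.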
data/test/test_whatlanguage.rb ADDED
@@ -0,0 +1,33 @@
1
+ require "test/unit"
2
+
3
+ require File.join(File.dirname(__FILE__), "..", "lib", "whatlanguage")
4
+
5
+ class TestWhatLanguage < Test::Unit::TestCase
6
+ def setup
7
+ @wl = WhatLanguage.new(:all)
8
+ end
9
+
10
+ def test_string_method
11
+ assert_equal :english, "This is a test".language
12
+ end
13
+
14
+ def test_french
15
+ assert_equal :french, @wl.language("Bonjour, je m'appelle Sandrine. Voila ma chatte.")
16
+ end
17
+
18
+ def test_spanish
19
+ assert_equal :spanish, @wl.language("La palabra mezquita se usa en español para referirse a todo tipo de edificios dedicados.")
20
+ end
21
+
22
+ def test_nothing
23
+ assert_nil @wl.language("")
24
+ end
25
+
26
+ def test_something
27
+ assert_not_nil @wl.language("test")
28
+ end
29
+
30
+ def test_processor
31
+ assert_kind_of Hash, @wl.process_text("this is a test")
32
+ end
33
+ end
metadata ADDED
@@ -0,0 +1,83 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: whatlanguage
3
+ version: !ruby/object:Gem::Version
4
+ version: 1.0.0
5
+ platform: ruby
6
+ authors:
7
+ - Peter Cooper
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+
12
+ date: 2008-08-22 00:00:00 +01:00
13
+ default_executable:
14
+ dependencies:
15
+ - !ruby/object:Gem::Dependency
16
+ name: hoe
17
+ type: :development
18
+ version_requirement:
19
+ version_requirements: !ruby/object:Gem::Requirement
20
+ requirements:
21
+ - - ">="
22
+ - !ruby/object:Gem::Version
23
+ version: 1.7.0
24
+ version:
25
+ description: "== FEATURES/PROBLEMS: * Only does French, English and Spanish out of the box. Very easy to train new languages though. * It can be made far more efficient at the comparison stage, but all in good time..! It still beats literal dictionary approaches. * No filter selection yet, you get 'em all loaded. * Tests are reasonably light. == SYNOPSIS: Full Example require 'whatlanguage' texts = [] texts << %q{Deux autres personnes ont \xC3\xA9t\xC3\xA9 arr\xC3\xAAt\xC3\xA9es durant la nuit} texts << %q{The links between the attempted car bombings in Glasgow and London are becoming clearer} texts << %q{En estado de m\xC3\xA1xima alertaen su nivel de cr\xC3\xADtico} texts << %q{Returns the object in enum with the maximum value.} texts << %q{Propose des donn\xC3\xA9es au sujet de la langue espagnole.} texts << %q{La palabra \"mezquita\" se usa en espa\xC3\xB1ol para referirse a todo tipo de edificios dedicados.} texts.each { |text| puts \"#{text[0..18]}... is in #{text.language.to_s.capitalize}\" } Initialize WhatLanguage with all filters wl = WhatLanguage.new(:all)"
26
+ email: whatlanguage@peterc.org
27
+ executables: []
28
+
29
+ extensions: []
30
+
31
+ extra_rdoc_files:
32
+ - History.txt
33
+ - Manifest.txt
34
+ - README.txt
35
+ files:
36
+ - History.txt
37
+ - Manifest.txt
38
+ - README.txt
39
+ - Rakefile
40
+ - build_filter.rb
41
+ - example.rb
42
+ - lang/dutch.lang
43
+ - lang/farsi.lang
44
+ - lang/german.lang
45
+ - lang/pinyin.lang
46
+ - lang/russian.lang
47
+ - lang/english.lang
48
+ - lang/portuguese.lang
49
+ - lang/french.lang
50
+ - lang/spanish.lang
51
+ - lib/bitfield.rb
52
+ - lib/bloominsimple.rb
53
+ - lib/whatlanguage.rb
54
+ - test/test_whatlanguage.rb
55
+ has_rdoc: true
56
+ homepage: http://rubyforge.org/projects/whatlanguage/
57
+ post_install_message:
58
+ rdoc_options:
59
+ - --main
60
+ - README.txt
61
+ require_paths:
62
+ - lib
63
+ required_ruby_version: !ruby/object:Gem::Requirement
64
+ requirements:
65
+ - - ">="
66
+ - !ruby/object:Gem::Version
67
+ version: "0"
68
+ version:
69
+ required_rubygems_version: !ruby/object:Gem::Requirement
70
+ requirements:
71
+ - - ">="
72
+ - !ruby/object:Gem::Version
73
+ version: "0"
74
+ version:
75
+ requirements: []
76
+
77
+ rubyforge_project: whatlanguage
78
+ rubygems_version: 1.2.0
79
+ signing_key:
80
+ specification_version: 2
81
+ summary: Fast, quick, textual language detection
82
+ test_files:
83
+ - test/test_whatlanguage.rb