scylla 0.2.0 → 0.3.0
- data/README.rdoc +27 -11
- data/VERSION +1 -1
- data/lib/scylla/generator.rb +1 -0
- data/lib/scylla/tasks.rb +3 -2
- data/scylla.gemspec +1 -1
- data/test/loader_test.rb +1 -1
- metadata +3 -3
data/README.rdoc
CHANGED
@@ -2,27 +2,43 @@
 
 Scylla is a language categorizing gem that allows you to guess the language of a given text. Scylla is a Ruby port of TextCat (http://www.let.rug.nl/~vannoord/TextCat) and is based on the text categorization algorithm presented in Cavnar, W. B. and J. M. Trenkle, ``N-Gram-Based Text Categorization'' In Proceedings of Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994.
 
-Installation
+== Installation
 
-gem install scylla
+gem install scylla
 
-Usage
+== Usage
 
-require 'scylla'
+require 'scylla'
 
-"this is english text".language
+"this is english text".language
 => "english"
 
-"Este es un texto español".language
-=> "spanish"
+"Este es un texto español".language
+=> "spanish"
 
 Multiple results for other possible languages:
 
-"isso poderia ser confundido com espanhol, bem".language
-=> "portuguese"
+"isso poderia ser confundido com espanhol, bem".language
+=> "portuguese"
 
-"isso poderia ser confundido com espanhol, bem".guess
-=> ["portuguese", "spanish"]
+"isso poderia ser confundido com espanhol, bem".guess
+=> ["portuguese", "spanish"]
+
+== Training
+
+You can train scylla in new languages by providing sample texts in different languages. The default set is located in the 'source_texts' folder in the gem directory. Add new .txt files to this directory named according to the language i.e. a text file full of Hebrew text should be called 'hebrew.txt'. At least 500 lines of text recommended. Then, in the gem folder, run this:
+
+rake scylla:train
+
+If you want to store texts in your own folder, you can specify that to the rake task.
+WARNING: specifying a different folder deletes all language support for files located in the default directory if they are not copied over.
+
+rake scylla:train[/Users/hash/mytextdir]
+"Creating language map for /Users/hash/mytextdir/english.txt"
+"Creating language map for /Users/hash/mytextdir/kannada.txt"
+.
+.
+etc
 
 == Contributing to scylla
 
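The README above cites Cavnar and Trenkle's n-gram method without showing it. The following is a minimal sketch of that ranking idea, assuming character n-grams of length 1 to 3 and a fixed penalty for unseen n-grams. It is not Scylla's code: the names ngram_profile, out_of_place and classify, the parameters, and the tiny sample texts are all hypothetical.

```ruby
# Illustrative sketch of n-gram rank profiles; NOT Scylla's implementation.
# Every method name and parameter below is an assumption made for the example.

# Return the most frequent character n-grams (length 1..max_n), best first.
def ngram_profile(text, max_n: 3, top: 300)
  counts = Hash.new(0)
  text.downcase.scan(/[[:alpha:]]+/).each do |word|
    padded = "_#{word}_" # underscores mark word boundaries
    (1..max_n).each do |n|
      padded.chars.each_cons(n) { |gram| counts[gram.join] += 1 }
    end
  end
  # Sort by descending count; break ties alphabetically so output is stable.
  counts.sort_by { |gram, count| [-count, gram] }.first(top).map(&:first)
end

# "Out-of-place" distance: sum of rank differences between two profiles.
# An n-gram missing from the language profile costs a fixed maximum penalty.
def out_of_place(doc_profile, lang_profile)
  ranks = lang_profile.each_with_index.to_h
  penalty = lang_profile.size
  doc_profile.each_with_index.map do |gram, i|
    ranks.key?(gram) ? (ranks[gram] - i).abs : penalty
  end.sum
end

# Pick the language whose profile is closest to the text's profile.
def classify(text, profiles)
  doc = ngram_profile(text)
  profiles.min_by { |_lang, profile| out_of_place(doc, profile) }.first
end

# Toy training data; real profiles need far more text than this.
samples = {
  "english" => "this is english text and this is some more english text",
  "spanish" => "este es un texto español y este es otro texto español"
}
profiles = samples.transform_values { |text| ngram_profile(text) }

puts classify("this text is english", profiles)
```

With samples this small the ranking is fragile, which is consistent with the README's advice to train on several hundred lines per language.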
data/VERSION
CHANGED
@@ -1 +1 @@
-0.
+0.3.0
data/lib/scylla/generator.rb
CHANGED
@@ -6,6 +6,7 @@ module Scylla
     # minsize: The minimum size of the ngrams that you would like to store
     def initialize(dirtext = DEFAULT_SOURCE_DIR, dirlm = DEFAULT_TARGET_DIR, minsize = 0, silent = false)
       @dirtext = dirtext
+      p @dirtext
       @dirlm = dirlm
       @minsize = minsize
     end
data/lib/scylla/tasks.rb
CHANGED
@@ -10,8 +10,9 @@ module Scylla
     def define_training_task
       namespace :scylla do
         desc "Trains Scylla in new languages"
-        task :train do
-
+        task :train, :dir do |t, args|
+          args.with_defaults(:dir => DEFAULT_SOURCE_DIR)
+          sg = Scylla::Generator.new(args[:dir])
           sg.train
         end
       end
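The tasks.rb change above gives the train task an optional :dir argument with a default. A standalone sketch of that Rake pattern follows, assuming only the rake gem. TrainingTaskSketch, its DEFAULT_SOURCE_DIR value and the handler block are hypothetical stand-ins for Scylla's own module, constant and Scylla::Generator.

```ruby
require "rake"

# Hypothetical stand-in for the module that defines the task; not Scylla's code.
module TrainingTaskSketch
  extend Rake::DSL

  DEFAULT_SOURCE_DIR = "source_texts" # assumed value, for illustration only

  # Define scylla:train with an optional :dir argument.
  # The handler block stands in for Scylla::Generator.new(dir).train.
  def self.define(&handler)
    namespace :scylla do
      desc "Trains Scylla in new languages"
      task :train, [:dir] do |_t, args|
        # Fill in :dir only when the caller did not pass one.
        args.with_defaults(dir: DEFAULT_SOURCE_DIR)
        handler.call(args[:dir])
      end
    end
  end
end

trained = []
TrainingTaskSketch.define { |dir| trained << dir }

train_task = Rake::Task["scylla:train"]
train_task.invoke("/Users/hash/mytextdir") # like rake scylla:train[/Users/hash/mytextdir]
train_task.reenable                        # allow a second run in the same process
train_task.invoke                          # like rake scylla:train
p trained
```

The first invocation receives the explicit directory and the second falls back to the default, which matches the two command forms shown in the README.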
data/scylla.gemspec
CHANGED
data/test/loader_test.rb
CHANGED
@@ -9,7 +9,7 @@ class LoaderTest < Test::Unit::TestCase
   end
 
   context "when being read" do
-
+    should "only load from disk once" do
       Scylla::Loader.expects(:load_language_maps).once.returns([])
       Scylla::Loader.languages
       Scylla::Loader.languages
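The test above expects two calls to Scylla::Loader.languages to trigger a single load_language_maps call. One common way to get that behaviour is class-level memoization, sketched below. LoaderSketch, its disk_reads counter and the placeholder return value are hypothetical; the diff does not show how Scylla::Loader is actually written.

```ruby
# Hypothetical loader illustrating memoization; not Scylla::Loader itself.
class LoaderSketch
  class << self
    attr_reader :disk_reads

    # Return the cached maps, loading them only on the first call.
    def languages
      @languages ||= load_language_maps
    end

    # Stand-in for reading language maps from disk; counts its invocations.
    def load_language_maps
      @disk_reads = (@disk_reads || 0) + 1
      %w[english spanish] # placeholder data
    end
  end
end

LoaderSketch.languages
LoaderSketch.languages
puts LoaderSketch.disk_reads
```

Note that `||=` reloads whenever the cached value is nil or false. An empty array, as returned by the stub in the test, is truthy in Ruby and is therefore cached.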
metadata
CHANGED