pragmatic_tokenizer 2.1.0 → 2.2.0

Files changed (40)
  1. checksums.yaml +4 -4
  2. data/.rubocop_todo.yml +77 -13
  3. data/README.md +3 -3
  4. data/lib/pragmatic_tokenizer/full_stop_separator.rb +2 -2
  5. data/lib/pragmatic_tokenizer/languages.rb +27 -26
  6. data/lib/pragmatic_tokenizer/languages/arabic.rb +2 -2
  7. data/lib/pragmatic_tokenizer/languages/bulgarian.rb +2 -2
  8. data/lib/pragmatic_tokenizer/languages/catalan.rb +2 -2
  9. data/lib/pragmatic_tokenizer/languages/common.rb +11 -11
  10. data/lib/pragmatic_tokenizer/languages/czech.rb +2 -2
  11. data/lib/pragmatic_tokenizer/languages/danish.rb +2 -2
  12. data/lib/pragmatic_tokenizer/languages/deutsch.rb +4 -4
  13. data/lib/pragmatic_tokenizer/languages/dutch.rb +2 -2
  14. data/lib/pragmatic_tokenizer/languages/english.rb +2 -2
  15. data/lib/pragmatic_tokenizer/languages/finnish.rb +2 -2
  16. data/lib/pragmatic_tokenizer/languages/french.rb +2 -2
  17. data/lib/pragmatic_tokenizer/languages/greek.rb +2 -2
  18. data/lib/pragmatic_tokenizer/languages/indonesian.rb +2 -2
  19. data/lib/pragmatic_tokenizer/languages/italian.rb +2 -2
  20. data/lib/pragmatic_tokenizer/languages/latvian.rb +2 -2
  21. data/lib/pragmatic_tokenizer/languages/norwegian.rb +2 -2
  22. data/lib/pragmatic_tokenizer/languages/persian.rb +2 -2
  23. data/lib/pragmatic_tokenizer/languages/polish.rb +2 -2
  24. data/lib/pragmatic_tokenizer/languages/portuguese.rb +2 -2
  25. data/lib/pragmatic_tokenizer/languages/romanian.rb +2 -2
  26. data/lib/pragmatic_tokenizer/languages/russian.rb +2 -2
  27. data/lib/pragmatic_tokenizer/languages/slovak.rb +2 -2
  28. data/lib/pragmatic_tokenizer/languages/spanish.rb +2 -2
  29. data/lib/pragmatic_tokenizer/languages/swedish.rb +2 -2
  30. data/lib/pragmatic_tokenizer/languages/turkish.rb +2 -2
  31. data/lib/pragmatic_tokenizer/post_processor.rb +11 -13
  32. data/lib/pragmatic_tokenizer/tokenizer.rb +195 -187
  33. data/lib/pragmatic_tokenizer/version.rb +1 -1
  34. data/pragmatic_tokenizer.gemspec +1 -1
  35. data/spec/languages/bulgarian_spec.rb +4 -8
  36. data/spec/languages/deutsch_spec.rb +25 -49
  37. data/spec/languages/english_spec.rb +238 -364
  38. data/spec/languages/french_spec.rb +1 -2
  39. data/spec/performance_spec.rb +15 -16
  40. metadata +4 -4
data/spec/languages/french_spec.rb CHANGED
@@ -5,10 +5,9 @@ describe PragmaticTokenizer do
  it 'tokenizes a string #001' do
  text = "L'art de l'univers, c'est un art"
  pt = PragmaticTokenizer::Tokenizer.new(
- text,
  language: 'fr'
  )
- expect(pt.tokenize).to eq(["l'", "art", "de", "l'", "univers", ",", "c'est", "un", "art"])
+ expect(pt.tokenize(text)).to eq(["l'", "art", "de", "l'", "univers", ",", "c'est", "un", "art"])
  end
  end
  end
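The hunk above moves the input text out of the constructor and into `tokenize`. A minimal sketch of what this call shape looks like for callers follows. It uses a hypothetical stand-in class (`SketchTokenizer`, not part of the gem) with naive whitespace splitting in place of the gem's language rules, so only the shape of the calls is meaningful. Options are fixed at construction, and one instance is then applied to more than one string.

```ruby
# Hypothetical stand-in, NOT the gem's implementation. It mirrors only the
# 2.2.0 call shape shown in the diff: options go to the constructor,
# the text goes to #tokenize.
class SketchTokenizer
  attr_reader :language

  def initialize(language: 'en')
    @language = language
  end

  # Assumption for illustration: lowercase and split on whitespace.
  # The real gem applies language-specific tokenization rules instead.
  def tokenize(text)
    text.downcase.split
  end
end

# Configure once, then tokenize several strings with the same instance.
tokenizer = SketchTokenizer.new(language: 'fr')
first  = tokenizer.tokenize('Un art')
second = tokenizer.tokenize('De Nouveau')

puts first.inspect
puts second.inspect
```

Under the 2.1.0 shape, the text was a constructor argument, so each new string required a new tokenizer object.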
data/spec/performance_spec.rb CHANGED
@@ -8,21 +8,18 @@ describe PragmaticTokenizer do
  
  # it 'is fast?' do
  # string = "Hello World. My name is Jonas. What is your name? My name is Jonas. There it is! I found it. My name is Jonas E. Smith. Please turn to p. 55. Were Jane and co. at the party? They closed the deal with Pitt, Briggs & Co. at noon. Let's ask Jane and co. They should know. They closed the deal with Pitt, Briggs & Co. It closed yesterday. I can't see Mt. Fuji from here. St. Michael's Church is on 5th st. near the light. That is JFK Jr.'s book. I visited the U.S.A. last year. I live in the E.U. How about you? I live in the U.S. How about you? I work for the U.S. Government in Virginia. I have lived in the U.S. for 20 years. She has $100.00 in her bag. She has $100.00. It is in her bag. He teaches science (He previously worked for 5 years as an engineer.) at the local University. Her email is Jane.Doe@example.com. I sent her an email. The site is: https://www.example.50.com/new-site/awesome_content.html. Please check it out. She turned to him, 'This is great.' she said. She turned to him, \"This is great.\" she said. She turned to him, \"This is great.\" She held the book out to show him. Hello!! Long time no see. Hello?? Who is there? Hello!? Is that you? Hello?! Is that you? 1.) The first item 2.) The second item 1.) The first item. 2.) The second item. 1) The first item 2) The second item 1) The first item. 2) The second item. 1. The first item 2. The second item 1. The first item. 2. The second item. • 9. The first item • 10. The second item ⁃9. The first item ⁃10. The second item a. The first item b. The second item c. The third list item This is a sentence\ncut off in the middle because pdf. It was a cold \nnight in the city. features\ncontact manager\nevents, activities\n You can find it at N°. 1026.253.553. That is where the treasure is. She works at Yahoo! in the accounting department. We make a good team, you and I. Did you see Albert I. Jones yesterday? 
Thoreau argues that by simplifying one’s life, “the laws of the universe will appear less complex. . . .” \"Bohr [...] used the analogy of parallel stairways [...]\" (Smith 55). If words are left off at the end of a sentence, and that is all that is omitted, indicate the omission with ellipsis marks (preceded and followed by a space) and then indicate the end of the sentence with a period . . . . Next sentence. I never meant that.... She left the store. I wasn’t really ... well, what I mean...see . . . what I'm saying, the thing is . . . I didn’t mean it. One further habit which was somewhat weakened . . . was that of combining words into self-interpreting compounds. . . . The practice was not abandoned. . . ."
- # benchmark do
- # 10.times do
- # data = StackProf.run(mode: :cpu, interval: 1000) do
- # PragmaticTokenizer::Tokenizer.new(string * 100,
- # language: 'en',
- # clean: true,
- # remove_numbers: true,
- # minimum_length: 3,
- # expand_contractions: true,
- # remove_stop_words: true
- # ).tokenize
- # end
- # puts StackProf::Report.new(data).print_text
- # end
+ # data = StackProf.run(mode: :cpu, interval: 1000) do
+ # PragmaticTokenizer::Tokenizer.new(
+ # language: 'en',
+ # clean: true,
+ # minimum_length: 3,
+ # expand_contractions: true,
+ # remove_stop_words: true,
+ # numbers: :none,
+ # punctuation: :none
+ # ).tokenize(string * 100)
  # end
+ # puts StackProf::Report.new(data).print_text
  # end
  
  # 26.8
@@ -30,11 +27,13 @@ describe PragmaticTokenizer do
  # 9.6
  # 23.25
  # 24.2
+ # 23.2
+ # 11.6
  # it 'is fast? (long strings)' do
  # string = "Hello World. My name is Jonas. What is your name? My name is Jonas IV Smith. There it is! I found it. My name is Jonas E. Smith. Please turn to p. 55. Were Jane and co. at the party? They closed the deal with Pitt, Briggs & Co. at noon. Let's ask Jane and co. They should know. They closed the deal with Pitt, Briggs & Co. It closed yesterday. I can't see Mt. Fuji from here. St. Michael's Church is on 5th st. near the light. That is JFK Jr.'s book. I visited the U.S.A. last year. I live in the E.U. How about you? I live in the U.S. How about you? I work for the U.S. Government in Virginia. I have lived in the U.S. for 20 years. She has $100.00 in her bag. She has $100.00. It is in her bag. He teaches science (He previously worked for 5 years as an engineer.) at the local University. Her email is Jane.Doe@example.com. I sent her an email. The site is: https://www.example.50.com/new-site/awesome_content.html. Please check it out. She turned to him, 'This is great.' she said. She turned to him, \"This is great.\" she said. She turned to him, \"This is great.\" She held the book out to show him. Hello!! Long time no see. Hello?? Who is there? Hello!? Is that you? Hello?! Is that you? 1.) The first item 2.) The second item 1.) The first item. 2.) The second item. 1) The first item 2) The second item 1) The first item. 2) The second item. 1. The first item 2. The second item 1. The first item. 2. The second item. • 9. The first item • 10. The second item ⁃9. The first item ⁃10. The second item a. The first item b. The second item c. The third list item This is a sentence\ncut off in the middle because pdf. It was a cold \nnight in the city. features\ncontact manager\nevents, activities\n You can find it at N°. 1026.253.553. That is where the treasure is. She works at Yahoo! in the accounting department. We make a good team, you and I. Did you see Albert I. Jones yesterday? 
Thoreau argues that by simplifying one’s life, “the laws of the universe will appear less complex. . . .” \"Bohr [...] used the analogy of parallel stairways [...]\" (Smith 55). If words are left off at the end of a sentence, and that is all that is omitted, indicate the omission with ellipsis marks (preceded and followed by a space) and then indicate the end of the sentence with a period . . . . Next sentence. I never meant that.... She left the store. I wasn’t really ... well, what I mean...see . . . what I'm saying, the thing is . . . I didn’t mean it. One further habit which was somewhat weakened . . . was that of combining words into self-interpreting compounds. . . . The practice was not abandoned. . . ." * 1000
  # puts "LENGTH: #{string.length}"
  # benchmark do
- # PragmaticTokenizer::Tokenizer.new(string,
+ # PragmaticTokenizer::Tokenizer.new(
  # language: 'en',
  # clean: true,
  # minimum_length: 3,
@@ -42,7 +41,7 @@ describe PragmaticTokenizer do
  # remove_stop_words: true,
  # numbers: :none,
  # punctuation: :none
- # ).tokenize
+ # ).tokenize(string)
  # end
  # end
  
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: pragmatic_tokenizer
  version: !ruby/object:Gem::Version
- version: 2.1.0
+ version: 2.2.0
  platform: ruby
  authors:
  - Kevin S. Dias
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2016-01-27 00:00:00.000000000 Z
+ date: 2016-02-15 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: unicode_case_converter
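The next hunk raises the `unicode_case_converter` runtime constraint from `~> 0.4` to `~> 1.0`. As a rough illustration of what the pessimistic operator admits in each case, the sketch below evaluates both constraints with RubyGems' own `Gem::Requirement`. The version numbers tested are illustrative only and are not claimed to be actual releases of that gem.

```ruby
require 'rubygems'

# Constraints copied from the old and new gem metadata.
old_req = Gem::Requirement.new('~> 0.4')  # equivalent to >= 0.4, < 1.0
new_req = Gem::Requirement.new('~> 1.0')  # equivalent to >= 1.0, < 2.0

# Illustrative version numbers, not actual unicode_case_converter releases.
%w[0.4.1 0.9 1.0 1.3.2 2.0].each do |v|
  version = Gem::Version.new(v)
  puts format('%-6s old: %-5s new: %s',
              v,
              old_req.satisfied_by?(version),
              new_req.satisfied_by?(version))
end
```

The two ranges do not overlap, so no single `unicode_case_converter` version satisfies both the 2.1.0 and the 2.2.0 constraint.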
@@ -16,14 +16,14 @@ dependencies:
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '0.4'
+ version: '1.0'
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '0.4'
+ version: '1.0'
  - !ruby/object:Gem::Dependency
  name: bundler
  requirement: !ruby/object:Gem::Requirement