greeb 0.2.0.pre2 → 0.2.0.pre3

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 80ad5b1112ea9576ba14eb52ce56a844e7ee1e94
4
- data.tar.gz: 00cd04458ffaf7962d5ed828611cd1decc323852
3
+ metadata.gz: 618591e00b61f1df11f98bdd045bd650d34ba863
4
+ data.tar.gz: 88d1b8448e98c18e6d9759e4d992d2fbea7c1d63
5
5
  SHA512:
6
- metadata.gz: 965fc9a4d9ebbe6b2ed7601fa9aa7f34ccb82f722a48c8bf595786370ff1428c4cfcb4985fc5c9b9f83d5cee1a284e6ca6a440c772e071f0b3245f02717d01e8
7
- data.tar.gz: 9a912f214f5c9b12eebbe7a834e3ce9c1e3d3735b1b5fcd9453029301061683019b6f02a2f8da6be70eac26b044a675d6fc87b23b44d3668efc454e7725cd2b0
6
+ metadata.gz: e8113e47988e80aabfc07314268a5f8220cce88edbf06bd69b35602623c0a310c3c460e300143943596decae621ee69b4909371b9f43a7d9225bceb336bf21f6
7
+ data.tar.gz: 7ebe3c3e0a603bf1fc0072376c3b2b544b43ae38e31e8bc5ff9e34fcaf362b8c474ba565db67363c09f10a2f4960fdb0bf7a165ee6c0b90d657b3914231cc07a
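The digests listed in checksums.yaml can be compared against the unpacked archive members. A minimal sketch, assuming `metadata.gz` and `data.tar.gz` have already been extracted from the `.gem` file; the helper name `digest_matches?` is hypothetical and not part of Greeb or RubyGems:

```ruby
require 'digest'

# Hypothetical helper: compare a file's digest with an expected hex string.
# `algorithm` is a Digest class such as Digest::SHA1 or Digest::SHA512.
def digest_matches?(path, expected_hex, algorithm = Digest::SHA512)
  algorithm.file(path).hexdigest == expected_hex.downcase
end

# Example call (the path is an assumption about where the member was unpacked):
#   digest_matches?('data.tar.gz',
#                   '88d1b8448e98c18e6d9759e4d992d2fbea7c1d63',
#                   Digest::SHA1)
```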
data/.rubocop.yml ADDED
@@ -0,0 +1,3 @@
1
+ # Don't use sprintf instead of %
2
+ FavorSprintf:
3
+ Enabled: false
data/.travis.yml CHANGED
@@ -4,4 +4,3 @@ branches:
4
4
  rvm:
5
5
  - 2.0.0
6
6
  - jruby-19mode
7
- - rbx-19mode
data/Gemfile CHANGED
@@ -1,5 +1,9 @@
1
1
  # encoding: utf-8
2
2
 
3
- source 'http://rubygems.org'
3
+ source 'https://rubygems.org'
4
4
 
5
5
  gemspec
6
+
7
+ group :test do
8
+ gem 'simplecov'
9
+ end
data/README.md CHANGED
@@ -1,11 +1,8 @@
1
- Greeb
2
- =====
3
-
4
- Greeb is a simple yet awesome and Unicode-aware text segmentator
1
+ # Greeb
2
+ Greeb [grʲip] is a simple yet awesome and Unicode-aware text segmentator
5
3
  that is based on regular expressions.
6
4
 
7
5
  ## Installation
8
-
9
6
  Add this line to your application's Gemfile:
10
7
 
11
8
  ```ruby
@@ -21,8 +18,26 @@ Or install it yourself as:
21
18
  $ gem install greeb
22
19
 
23
20
  ## Usage
21
+ Greeb can help you solve simple text processing problems such as
22
+ tokenization and segmentation.
23
+
24
+ It is available as a command line application that reads the input
25
+ text from STDIN and prints one token per line into STDOUT.
26
+
27
+ ```
28
+ % echo 'Hello http://nlpub.ru guys, how are you?' | greeb
29
+ Hello
30
+ http://nlpub.ru
31
+ guys
32
+ ,
33
+ how
34
+ are
35
+ you
36
+ ?
37
+ ```
24
38
 
25
- Greeb can help you to solve simple text processing problems:
39
+ ### Tokenization API
40
+ Greeb has a very convinient API that makes you happy.
26
41
 
27
42
  ```ruby
28
43
  pp Greeb::Tokenizer.tokenize('Hello!')
@@ -32,7 +47,7 @@ pp Greeb::Tokenizer.tokenize('Hello!')
32
47
  =end
33
48
  ```
34
49
 
35
- It should be noted that it is possible to process much complex texts:
50
+ It should be noted that it is possible to process much complex texts.
36
51
 
37
52
  ```ruby
38
53
  text =<<-EOF
@@ -74,8 +89,9 @@ pp Greeb::Tokenizer.tokenize(text)
74
89
  =end
75
90
  ```
76
91
 
92
+ ### Segmentation API
77
93
  Also it can be used to solve the text segmentation problems
78
- such as sentence detection tasks:
94
+ such as sentence detection tasks.
79
95
 
80
96
  ```ruby
81
97
  text = 'Hello! How are you?'
@@ -88,7 +104,7 @@ pp Greeb::Segmentator.new(tokens).sentences
88
104
  ```
89
105
 
90
106
  It is possible to extract tokens that were processed by the text
91
- segmentator:
107
+ segmentator.
92
108
 
93
109
  ```ruby
94
110
  text = 'Hello! How are you?'
@@ -109,18 +125,36 @@ pp segmentator.extract(segmentator.sentences)
109
125
  =end
110
126
  ```
111
127
 
112
- ## Tokens
128
+ ### Parsing API
129
+ Texts are often include some special entities such as URLs and e-mail
130
+ addresses. Greeb can help you in these strings retrieval.
131
+
132
+ ```ruby
133
+ text = 'My website is http://nlpub.ru and e-mail is example@example.com.'
134
+
135
+ pp Greeb::Parser.urls(text).map { |e| [e, text[e.from...e.to]] }
136
+ =begin
137
+ [[#<struct Greeb::Entity from=14, to=29, type=:url>, "http://nlpub.ru"]]
138
+ =end
139
+
140
+ pp Greeb::Parser.emails(text).map { |e| [e, text[e.from...e.to]] }
141
+ =begin
142
+ [[#<struct Greeb::Entity from=44, to=63, type=:email>, "example@example.com"]]
143
+ =end
144
+ ```
145
+
146
+ Please don't use Greeb in spam lists development purposes.
113
147
 
148
+ ## Tokens
114
149
  Greeb operates with entities, tuples of *(from, to, kind)*, where
115
150
  *from* is a beginning of the entity, *to* is an ending of the entity,
116
151
  and *kind* is a type of the entity.
117
152
 
118
- There are several entity types: `:letter`, `:float`, `:integer`,
119
- `:separ`, `:punct` (for punctuation), `:spunct` (for in-sentence
120
- punctuation), and `:break`.
153
+ There are several entity types at the tokenization stage: `:letter`,
154
+ `:float`, `:integer`, `:separ`, `:punct` (for punctuation), `:spunct`
155
+ (for in-sentence punctuation), and `:break`.
121
156
 
122
157
  ## Contributing
123
-
124
158
  1. Fork it;
125
159
  2. Create your feature branch (`git checkout -b my-new-feature`);
126
160
  3. Commit your changes (`git commit -am 'Added some feature'`);
@@ -131,6 +165,8 @@ punctuation), and `:break`.
131
165
 
132
166
  ## Dependency Status [<img src="https://gemnasium.com/ustalov/greeb.png"/>](https://gemnasium.com/ustalov/greeb)
133
167
 
168
+ ## Code Climate [<img src="https://codeclimate.com/github/ustalov/greeb.png"/>](https://codeclimate.com/github/ustalov/greeb)
169
+
134
170
  ## Copyright
135
171
 
136
172
  Copyright (c) 2010-2013 [Dmitry Ustalov]. See LICENSE for details.
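The Parsing API example in the README above slices the text with `text[e.from...e.to]`, which suggests the offsets are half-open. A minimal sketch of how such offsets can be produced, assuming a stand-in `Entity` struct and a hypothetical `find_entities` helper (neither is Greeb's actual implementation); the pattern is the `EMAIL` constant shown in the lib/greeb/parser.rb hunk of this diff:

```ruby
# Hypothetical stand-in for Greeb::Entity; the real struct ships with the gem.
Entity = Struct.new(:from, :to, :type)

# The e-mail pattern as it appears in lib/greeb/parser.rb in this release.
EMAIL = /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i

# Collect every match of `pattern` as an Entity with half-open offsets,
# so that text[from...to] returns exactly the matched string.
def find_entities(text, pattern, type)
  entities = []
  text.scan(pattern) do
    match = Regexp.last_match
    entities << Entity.new(match.begin(0), match.end(0), type)
  end
  entities
end

text = 'My website is http://nlpub.ru and e-mail is example@example.com.'
find_entities(text, EMAIL, :email).each do |e|
  puts "#{e.from}...#{e.to} #{text[e.from...e.to]}"
end
```

For the README's sample text this yields the same offsets the README reports for the e-mail entity (from=44, to=63).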
data/bin/greeb CHANGED
@@ -1,7 +1,9 @@
1
1
  #!/usr/bin/env ruby
2
2
 
3
- $:.unshift File.expand_path('../../lib', __FILE__)
4
- require 'rubygems'
3
+ if File.exists? File.expand_path('../../.git', __FILE__)
4
+ $:.unshift File.expand_path('../../lib', __FILE__)
5
+ end
6
+
5
7
  require 'greeb'
6
8
 
7
9
  text = STDIN.read
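The new guard in bin/greeb only prepends `lib` to the load path when a `.git` directory sits next to it. Note that `File.exists?` was deprecated and then removed in Ruby 3.2; `File.exist?` is the equivalent call. A minimal sketch of the same guard on a current Ruby, with hypothetical helper names that are not part of the gem:

```ruby
# Hypothetical helper: true when `root` looks like a git checkout.
def development_checkout?(root)
  File.exist?(File.join(root, '.git'))
end

# Prepend root/lib to the given load path only for a development checkout,
# so an installed gem keeps resolving `require 'greeb'` through RubyGems.
def add_lib_to_load_path(root, load_path = $LOAD_PATH)
  load_path.unshift(File.join(root, 'lib')) if development_checkout?(root)
  load_path
end
```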
data/greeb.gemspec CHANGED
@@ -17,8 +17,6 @@ Gem::Specification.new do |s|
17
17
 
18
18
  s.add_development_dependency 'rake'
19
19
  s.add_development_dependency 'minitest', '>= 2.11'
20
- s.add_development_dependency 'simplecov'
21
- s.add_development_dependency 'yard'
22
20
 
23
21
  s.files = `git ls-files`.split("\n")
24
22
  s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
data/lib/greeb/parser.rb CHANGED
@@ -7,11 +7,11 @@
7
7
  module Greeb::Parser
8
8
  extend self
9
9
 
10
- # URL pattern. Not so precise, but IDN-compatible.
11
- URL = /\b(([\w-]+:\/\/?|www[.])[^\s()<>]+(?:\([\p{L}\w\d]+\)|([^.\s]|\/)))/ui
10
+ # An URL pattern. Not so precise, but IDN-compatible.
11
+ URL = %r{\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\p{L}\w\d]+\)|([^.\s]|/)))}i
12
12
 
13
- # Horrible e-mail pattern.
14
- EMAIL = /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/ui
13
+ # A horrible e-mail pattern.
14
+ EMAIL = /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i
15
15
 
16
16
  # Recognize URLs in the input text. Actually, URL is obsolete standard
17
17
  # and this code should be rewritten to use the URI concept.
data/lib/greeb/segmentator.rb CHANGED
@@ -15,7 +15,7 @@ class Greeb::Segmentator
15
15
  #
16
16
  # @param tokens [Array<Greeb::Entity>] tokens from [Greeb::Tokenizer].
17
17
  #
18
- def initialize tokens
18
+ def initialize(tokens)
19
19
  @tokens = tokens
20
20
  end
21
21
 
@@ -44,7 +44,7 @@ class Greeb::Segmentator
44
44
  # @return [Hash<Greeb::Entity, Array<Greeb::Entity>>] a hash with
45
45
  # sentences as keys and tokens arrays as values.
46
46
  #
47
- def extract sentences
47
+ def extract(sentences)
48
48
  Hash[
49
49
  sentences.map do |s|
50
50
  [s, tokens.select { |t| t.from >= s.from and t.to <= s.to }]
@@ -59,7 +59,7 @@ class Greeb::Segmentator
59
59
  # @return [Hash<Greeb::Entity, Array<Greeb::Entity>>] a hash with
60
60
  # sentences as keys and subsentences arrays as values.
61
61
  #
62
- def subextract sentences
62
+ def subextract(sentences)
63
63
  Hash[
64
64
  sentences.map do |s|
65
65
  [s, subsentences.select { |ss| ss.from >= s.from and ss.to <= s.to }]
@@ -88,8 +88,7 @@ class Greeb::Segmentator
88
88
  if :punct == token.type
89
89
  sentence.to = tokens.
90
90
  select { |t| t.from >= token.from }.
91
- inject(token) { |r, t| break r if t.type != token.type; t }.
92
- to
91
+ inject(token) { |r, t| break r if t.type != token.type; t }.to
93
92
 
94
93
  @sentences << sentence
95
94
  sentence = new_sentence
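The `select`/`inject` chain in the hunk above extends a sentence boundary across a run of consecutive punctuation tokens. A minimal sketch of that idiom in isolation, assuming a stand-in `Token` struct and a hypothetical `last_of_run` helper rather than the gem's own classes:

```ruby
# Hypothetical stand-in for Greeb::Entity.
Token = Struct.new(:from, :to, :type)

# Starting at `start`, return the last token of the unbroken run that shares
# its type. `break r` leaves inject early and makes it return `r`.
def last_of_run(tokens, start)
  tokens.
    select { |t| t.from >= start.from }.
    inject(start) { |r, t| break r if t.type != start.type; t }
end

# 'Hello!? How' split by hand into tokens for illustration.
tokens = [
  Token.new(0, 5, :letter),
  Token.new(5, 6, :punct),
  Token.new(6, 7, :punct),
  Token.new(7, 8, :separ),
  Token.new(8, 11, :letter)
]

puts last_of_run(tokens, tokens[1]).to
```

With these hand-made tokens the run of `:punct` tokens starting at offset 5 ends at offset 7, so the sentence boundary would cover both punctuation marks.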
@@ -100,7 +99,7 @@ class Greeb::Segmentator
100
99
  sentence
101
100
  end
102
101
 
103
- nil.tap { @sentences << rest if rest.from and rest.to }
102
+ nil.tap { @sentences << rest if rest.from && rest.to }
104
103
  end
105
104
 
106
105
  # Implementation of the subsentence detection method. This method
@@ -112,19 +111,18 @@ class Greeb::Segmentator
112
111
  @subsentences = SortedSet.new
113
112
 
114
113
  rest = tokens.inject(new_subsentence) do |subsentence, token|
115
- if !subsentence.from and SENTENCE_DOESNT_START.include?(token.type)
114
+ if !subsentence.from && SENTENCE_DOESNT_START.include?(token.type)
116
115
  next subsentence
117
116
  end
118
117
 
119
118
  subsentence.from = token.from unless subsentence.from
120
119
 
121
- next subsentence if subsentence.to and subsentence.to > token.to
120
+ next subsentence if subsentence.to && subsentence.to > token.to
122
121
 
123
122
  if [:punct, :spunct].include? token.type
124
123
  subsentence.to = tokens.
125
124
  select { |t| t.from >= token.from }.
126
- inject(token) { |r, t| break r if t.type != token.type; t }.
127
- to
125
+ inject(token) { |r, t| break r if t.type != token.type; t }.to
128
126
 
129
127
  @subsentences << subsentence
130
128
  subsentence = new_subsentence
@@ -135,7 +133,7 @@ class Greeb::Segmentator
135
133
  subsentence
136
134
  end
137
135
 
138
- nil.tap { @subsentences << rest if rest.from and rest.to }
136
+ nil.tap { @subsentences << rest if rest.from && rest.to }
139
137
  end
140
138
 
141
139
  private
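Several hunks in this file swap `and` for `&&`. In the conditions shown the result should be the same either way, so the change appears to be stylistic, but the two operators differ in precedence, which matters as soon as an assignment is involved. A small sketch with hypothetical method names:

```ruby
# `and` binds more loosely than `=`, so this parses as `(value = true) and false`.
def assign_with_and
  value = true and false
  value
end

# `&&` binds more tightly than `=`, so this parses as `value = (true && false)`.
def assign_with_double_ampersand
  value = true && false
  value
end

p assign_with_and
p assign_with_double_ampersand
```

The first method returns `true` and the second returns `false`.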
data/lib/greeb/version.rb CHANGED
@@ -5,5 +5,5 @@
5
5
  module Greeb
6
6
  # Version of Greeb.
7
7
  #
8
- VERSION = '0.2.0.pre2'
8
+ VERSION = '0.2.0.pre3'
9
9
  end
data/spec/bin_spec.rb ADDED
@@ -0,0 +1,24 @@
1
+ # encoding: utf-8
2
+
3
+ require_relative 'spec_helper'
4
+
5
+ describe 'CLI' do
6
+ it 'should do nothing when ran without input' do
7
+ invoke('').must_be_empty
8
+ end
9
+
10
+ it 'should tokenize text when input is given' do
11
+ invoke(stdin: 'Hello guys!').must_equal(
12
+ %w(Hello guys !))
13
+ end
14
+
15
+ it 'should extract URLs' do
16
+ invoke(stdin: 'Hello http://nlpub.ru guys!').must_equal(
17
+ %w(Hello http://nlpub.ru guys !))
18
+ end
19
+
20
+ it 'should extract e-mails' do
21
+ invoke(stdin: 'Hello example@example.com guys!').must_equal(
22
+ %w(Hello example@example.com guys !))
23
+ end
24
+ end
data/spec/segmentator_spec.rb CHANGED
@@ -1,6 +1,6 @@
1
1
  # encoding: utf-8
2
2
 
3
- require File.expand_path('../spec_helper', __FILE__)
3
+ require_relative 'spec_helper'
4
4
 
5
5
  module Greeb
6
6
  describe Segmentator do
data/spec/spec_helper.rb CHANGED
@@ -2,13 +2,9 @@
2
2
 
3
3
  require 'rubygems'
4
4
 
5
- $:.unshift File.expand_path('../../lib', __FILE__)
6
-
7
- if RUBY_VERSION == '1.8'
8
- gem 'minitest'
9
- end
10
-
5
+ gem 'minitest'
11
6
  require 'minitest/autorun'
7
+ require 'minitest/hell'
12
8
 
13
9
  unless 'true' == ENV['TRAVIS']
14
10
  require 'simplecov'
@@ -17,4 +13,7 @@ unless 'true' == ENV['TRAVIS']
17
13
  end
18
14
  end
19
15
 
16
+ $LOAD_PATH.unshift File.expand_path('../../lib', __FILE__)
20
17
  require 'greeb'
18
+
19
+ Dir[File.expand_path('../support/**/*.rb', __FILE__)].each { |f| require f }
data/spec/support/invoker.rb ADDED
@@ -0,0 +1,29 @@
1
+ # encoding: utf-8
2
+
3
+ require 'open3'
4
+
5
+ # http://dota2.ru/guides/880-invokirkhakha-sanstrajk-ni-azhydal-da/
6
+ #
7
+ class MiniTest::Unit::TestCase
8
+ # Quas Wex Exort.
9
+ #
10
+ def invoke_cache
11
+ @invoke_cache ||= {}
12
+ end
13
+
14
+ # So begins a new age of knowledge.
15
+ #
16
+ def invoke(*argv)
17
+ return invoke_cache[argv] if invoke_cache.has_key? argv
18
+
19
+ arguments = argv.dup
20
+ options = (arguments.last.is_a? Hash) ? arguments.pop : {}
21
+ executable = File.expand_path('../../../bin/greeb', __FILE__)
22
+
23
+ Open3.popen3(executable, *arguments) do |i, o, *_|
24
+ i.puts options[:stdin] if options[:stdin]
25
+ i.close
26
+ invoke_cache[argv] = o.readlines.map(&:chomp!)
27
+ end
28
+ end
29
+ end
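The helper above collects output with `o.readlines.map(&:chomp!)`. `String#chomp!` returns `nil` when it removes nothing, so a final line without a trailing newline would turn into `nil`. That is presumably harmless here if the executable ends every token with a newline, but `chomp` avoids the edge case. A small sketch of the difference, using hypothetical helper names:

```ruby
# chomp! mutates the receiver and returns nil when nothing was removed.
def chomp_all_with_bang(lines)
  lines.map { |line| line.dup.chomp! }
end

# chomp always returns a string, modified or not.
def chomp_all(lines)
  lines.map(&:chomp)
end

lines = ["Hello\n", "guys\n", '!'] # the last line has no trailing newline

p chomp_all_with_bang(lines)
p chomp_all(lines)
```

The first call prints `["Hello", "guys", nil]`, the second `["Hello", "guys", "!"]`.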
data/spec/tokenizer_spec.rb CHANGED
@@ -1,6 +1,6 @@
1
1
  # encoding: utf-8
2
2
 
3
- require File.expand_path('../spec_helper', __FILE__)
3
+ require_relative 'spec_helper'
4
4
 
5
5
  module Greeb
6
6
  describe Tokenizer do
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: greeb
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.2.0.pre2
4
+ version: 0.2.0.pre3
5
5
  platform: ruby
6
6
  authors:
7
7
  - Dmitry Ustalov
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2013-04-21 00:00:00.000000000 Z
11
+ date: 2013-04-30 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: rake
@@ -38,34 +38,6 @@ dependencies:
38
38
  - - '>='
39
39
  - !ruby/object:Gem::Version
40
40
  version: '2.11'
41
- - !ruby/object:Gem::Dependency
42
- name: simplecov
43
- requirement: !ruby/object:Gem::Requirement
44
- requirements:
45
- - - '>='
46
- - !ruby/object:Gem::Version
47
- version: '0'
48
- type: :development
49
- prerelease: false
50
- version_requirements: !ruby/object:Gem::Requirement
51
- requirements:
52
- - - '>='
53
- - !ruby/object:Gem::Version
54
- version: '0'
55
- - !ruby/object:Gem::Dependency
56
- name: yard
57
- requirement: !ruby/object:Gem::Requirement
58
- requirements:
59
- - - '>='
60
- - !ruby/object:Gem::Version
61
- version: '0'
62
- type: :development
63
- prerelease: false
64
- version_requirements: !ruby/object:Gem::Requirement
65
- requirements:
66
- - - '>='
67
- - !ruby/object:Gem::Version
68
- version: '0'
69
41
  description: Greeb is a simple yet awesome and Unicode-aware regexp-based tokenizer,
70
42
  written in Ruby.
71
43
  email:
@@ -76,6 +48,7 @@ extensions: []
76
48
  extra_rdoc_files: []
77
49
  files:
78
50
  - .gitignore
51
+ - .rubocop.yml
79
52
  - .travis.yml
80
53
  - .yardopts
81
54
  - Gemfile
@@ -90,9 +63,11 @@ files:
90
63
  - lib/greeb/strscan.rb
91
64
  - lib/greeb/tokenizer.rb
92
65
  - lib/greeb/version.rb
66
+ - spec/bin_spec.rb
93
67
  - spec/parser_spec.rb
94
68
  - spec/segmentator_spec.rb
95
69
  - spec/spec_helper.rb
70
+ - spec/support/invoker.rb
96
71
  - spec/tokenizer_spec.rb
97
72
  homepage: https://github.com/ustalov/greeb
98
73
  licenses: []
@@ -113,13 +88,15 @@ required_rubygems_version: !ruby/object:Gem::Requirement
113
88
  version: 1.3.1
114
89
  requirements: []
115
90
  rubyforge_project: greeb
116
- rubygems_version: 2.0.3
91
+ rubygems_version: 2.0.0
117
92
  signing_key:
118
93
  specification_version: 4
119
94
  summary: Greeb is a simple Unicode-aware regexp-based tokenizer.
120
95
  test_files:
96
+ - spec/bin_spec.rb
121
97
  - spec/parser_spec.rb
122
98
  - spec/segmentator_spec.rb
123
99
  - spec/spec_helper.rb
100
+ - spec/support/invoker.rb
124
101
  - spec/tokenizer_spec.rb
125
102
  has_rdoc: