greeb 0.2.3 → 0.2.4
- checksums.yaml +4 -4
- data/.travis.yml +4 -3
- data/LICENSE +1 -1
- data/README.md +40 -21
- data/greeb.gemspec +2 -2
- data/lib/greeb/parser.rb +1 -3
- data/lib/greeb/tokenizer.rb +10 -24
- data/lib/greeb/version.rb +1 -1
- metadata +5 -5
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 0b353d307d409d5f67e7a7932e4d9aba26fd9dc8
+data.tar.gz: 134d541cd2ccf1f6d71d6faea69edca8b650f08a
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: cdac0fb93910ef3a3c2e78a9d8dbeb2aeb9b906982c00ca51cfe385e40a75595cd26e3241a6b5ae6bf3b6e3e3090688ba2fd5d2890c38b9483fd09b405d3a2f4
+data.tar.gz: ad5b4faf513f95359b864699af9f5f9e34d6b78d8681ba57d8eada5f2b4ec833d2532571660a9e36f1537792eebf79d3144236cec48e7b3be349281433db8b70
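The checksums above are hex digests of the two archives packed inside the gem file. A minimal sketch of how such digests could be recomputed and compared, using only Ruby's standard `digest` library; the helper names `demo_checksums` and `demo_mismatches` and the sample payload are assumptions made for illustration and are not part of RubyGems or Greeb:

```ruby
require 'digest'

# Hypothetical helper: hex digests in the two algorithms listed in checksums.yaml.
def demo_checksums(bytes)
  {
    'SHA1'   => Digest::SHA1.hexdigest(bytes),
    'SHA512' => Digest::SHA512.hexdigest(bytes)
  }
end

# Hypothetical helper: names of the algorithms whose expected digest differs.
def demo_mismatches(bytes, expected)
  demo_checksums(bytes).reject { |algo, hex| expected[algo] == hex }.keys
end

# Stand-in payload; in practice this could be File.binread('data.tar.gz').
payload  = 'example archive bytes'
expected = demo_checksums(payload)

demo_mismatches(payload, expected)        # identical bytes: nothing to report
demo_mismatches(payload + 'x', expected)  # altered bytes: both algorithms reported
```

A SHA1 hex digest is 40 characters long and a SHA512 hex digest is 128, which matches the lengths of the values in the hunk.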
data/.travis.yml
CHANGED
data/LICENSE
CHANGED
data/README.md
CHANGED
@@ -1,9 +1,14 @@
 # Greeb
+
 Greeb [grʲip] is a simple yet awesome and Unicode-aware text segmentator
-
-
+based on regular expressions. The API documentation is available on
+[RubyDoc.info]. The software demonstration is available on
+<https://greeb.herokuapp.com>.
+
+[RubyDoc.info]: http://www.rubydoc.info/github/dustalov/greeb/master
 
 ## Installation
+
 Add this line to your application's Gemfile:
 
 ```ruby
@@ -19,11 +24,15 @@ Or install it yourself as:
 $ gem install greeb
 
 ## Usage
-Greeb can help you solve simple text processing problems such as
-tokenization and segmentation.
 
-
-
+Greeb can approach such essential text processing problems as
+tokenization and segmentation. There are two ways to use it:
+1) as a command-line application, 2) as a Ruby library.
+
+### Command-Line Interface
+
+The `greeb` application reads the input text from `STDIN` and
+writes one token per line to `STDOUT`.
 
 ```
 % echo 'Hello http://nlpub.ru guys, how are you?' | greeb
@@ -38,6 +47,7 @@ you
 ```
 
 ### Tokenization API
+
 Greeb has a very convinient API that makes you happy.
 
 ```ruby
@@ -48,7 +58,8 @@ pp Greeb::Tokenizer.tokenize('Hello!')
 =end
 ```
 
-It should be noted that it is possible to process much
+It should be noted that it is also possible to process much
+complex texts than the present one.
 
 ```ruby
 text =<<-EOF
@@ -91,8 +102,8 @@ pp Greeb::Tokenizer.tokenize(text)
 ```
 
 ### Segmentation API
-
-
+
+The analyzer can also perform sentence detection.
 
 ```ruby
 text = 'Hello! How are you?'
@@ -104,8 +115,8 @@ pp Greeb::Segmentator.new(tokens).sentences
 =end
 ```
 
-
-
+Having obtained the sentence boundaries, it is possible to
+extract tokens covered by these sentences.
 
 ```ruby
 text = 'Hello! How are you?'
@@ -127,10 +138,12 @@ pp segmentator.extract(segmentator.sentences)
 ```
 
 ### Parsing API
-Texts are often include some special spans such as URLs and e-mail
-addresses. Greeb can help you in these strings retrieval.
 
-
+It is often that a text includes such special entries as URLs
+and e-mail addresses. Greeb can assist you in extracting them.
+
+#### Extraction of URLs and e-mails
+
 ```ruby
 text = 'My website is http://nlpub.ru and e-mail is example@example.com.'
 
@@ -145,9 +158,10 @@ pp Greeb::Parser.emails(text).map { |e| [e, e.slice(text)] }
 =end
 ```
 
-Please
+Please do not use Greeb for the development of spam lists. Spam sucks.
+
+#### Extraction of abbreviations
 
-#### Abbreviation retrieval
 ```ruby
 text = 'Hello, G.L.H.F. everyone!'
 
@@ -160,7 +174,8 @@ pp Greeb::Parser.abbrevs(text).map { |e| [e, e.slice(text)] }
 The algorithm is not so accurate, but still useful in many practical
 situations.
 
-####
+#### Extraction of time stamps
+
 ```ruby
 text = 'Our time is running out: 13:37 or 14:89.'
 
@@ -171,7 +186,8 @@ pp Greeb::Parser.time(text).map { |e| [e, e.slice(text)] }
 ```
 
 ## Spans
-
+
+Greeb operates with spans, which are tuples of *(from, to, kind)*, where
 *from* is a beginning of the span, *to* is an ending of the span,
 and *kind* is a type of the span.
 
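The *(from, to, kind)* tuple described in this hunk can be sketched with a plain `Struct`. This is a hypothetical stand-in, not Greeb's own span class: the name `DemoSpan`, the offsets, and the choice of an exclusive end position are all assumptions made for the example.

```ruby
# Hypothetical stand-in for a span; Greeb's real class may differ.
DemoSpan = Struct.new(:from, :to, :kind) do
  # Assumption: offsets are character positions and `to` is exclusive.
  def slice(text)
    text[from...to]
  end
end

text  = 'Hello!'
spans = [
  DemoSpan.new(0, 5, :letter),  # covers "Hello"
  DemoSpan.new(5, 6, :punct)    # covers "!"
]

# Pair every span kind with the substring it covers.
pairs = spans.map { |span| [span.kind, span.slice(text)] }
```

Keeping only offsets and a kind leaves the input string untouched, so the covered substring can be recovered whenever it is needed, much like the `e.slice(text)` calls in the README examples.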
@@ -180,20 +196,23 @@ There are several span types at the tokenization stage: `:letter`,
 (for in-sentence punctuation), `:space`, and `:break`.
 
 ## Contributing
+
 1. Fork it;
 2. Create your feature branch (`git checkout -b my-new-feature`);
 3. Commit your changes (`git commit -am 'Added some feature'`);
 4. Push to the branch (`git push origin my-new-feature`);
 5. Create new Pull Request.
 
-## Build Status [<img src="https://secure.travis-ci.org/
+## Build Status [<img src="https://secure.travis-ci.org/dustalov/greeb.png"/>](http://travis-ci.org/dustalov/greeb)
 
 ## Dependency Status [<img src="https://gemnasium.com/dmchk/greeb.png"/>](https://gemnasium.com/dmchk/greeb)
 
 ## Code Climate [<img src="https://codeclimate.com/github/dmchk/greeb.png"/>](https://codeclimate.com/github/dmchk/greeb)
 
+## DOI [<img src="https://zenodo.org/badge/doi/10.5281/zenodo.10119.png"/>](http://dx.doi.org/10.5281/zenodo.10119)
+
 ## Copyright
 
-Copyright (c) 2010-
+Copyright (c) 2010-2015 [Dmitry Ustalov]. See LICENSE for details.
 
-[Dmitry Ustalov]:
+[Dmitry Ustalov]: https://ustalov.name/
data/greeb.gemspec
CHANGED
@@ -8,13 +8,13 @@ Gem::Specification.new do |spec|
 spec.platform = Gem::Platform::RUBY
 spec.authors = ['Dmitry Ustalov']
 spec.email = ['dmitry@eveel.ru']
-spec.homepage = 'https://github.com/
+spec.homepage = 'https://github.com/dustalov/greeb'
 spec.summary = 'Greeb is a simple Unicode-aware regexp-based tokenizer.'
 spec.description = 'Greeb is a simple yet awesome and Unicode-aware ' \
 'regexp-based tokenizer, written in Ruby.'
 spec.license = 'MIT'
 
-spec.
+spec.required_ruby_version = '>= 1.9.1'
 
 spec.files = `git ls-files`.split("\n")
 spec.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
data/lib/greeb/parser.rb
CHANGED
@@ -4,9 +4,7 @@
 # text. These entities are URLs, e-mail addresses, names, etc. This module
 # includes several helpers that could help to solve these problems.
 #
-module Greeb::Parser
-extend self
-
+module Greeb::Parser extend self
 # An URL pattern. Not so precise, but IDN-compatible.
 #
 URL = %r{\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\p{L}\w\d]+\)|([^.\s]|/)))}i
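The hunk above folds `extend self` onto the `module` line, which appears to be the same idiom written on one line. A minimal sketch of what `extend self` does, written in the two-line form: the module extends itself, so its instance methods become callable directly on the module. `DemoParser`, `DEMO_TIME`, and `times` are hypothetical names for this example and are not Greeb's API.

```ruby
# Hypothetical module illustrating the `extend self` idiom.
module DemoParser
  extend self

  # Deliberately naive pattern: one or two digits, a colon, two digits.
  # It does not check that the hour and minute values are in range.
  DEMO_TIME = /\b\d{1,2}:\d{2}\b/

  # Returns every substring of `text` that matches DEMO_TIME.
  def times(text)
    text.scan(DEMO_TIME)
  end
end

# Thanks to `extend self`, the method is called on the module itself.
found = DemoParser.times('Meet at 13:37.')
```

Without `extend self`, `times` would only be reachable from objects that include the module.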
data/lib/greeb/tokenizer.rb
CHANGED
@@ -5,10 +5,7 @@
 # Unicode character categories been obtained from
 # <http://www.fileformat.info/info/unicode/category/index.htm>.
 #
-module Greeb::Tokenizer
-# http://www.youtube.com/watch?v=eF1lU-CrQfc
-extend self
-
+module Greeb::Tokenizer extend self
 # English and Russian letters.
 #
 LETTERS = /[\p{L}]+/u
@@ -55,7 +52,15 @@ module Greeb::Tokenizer
 scanner = Greeb::StringScanner.new(text)
 tokens = []
 while !scanner.eos?
-
+parse! scanner, tokens, LETTERS, :letter or
+parse! scanner, tokens, FLOATS, :float or
+parse! scanner, tokens, INTEGERS, :integer or
+split_parse! scanner, tokens, SENTENCE_PUNCTUATIONS, :spunct or
+split_parse! scanner, tokens, PUNCTUATIONS, :punct or
+split_parse! scanner, tokens, SEPARATORS, :separ or
+split_parse! scanner, tokens, SPACES, :space or
+split_parse! scanner, tokens, BREAKS, :break or
+parse! scanner, tokens, RESIDUALS, :residual or
 raise Greeb::UnknownSpan.new(text, scanner.char_pos)
 end
 tokens
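The loop in this hunk tries each pattern in turn and raises if none of them matches at the current position. A minimal sketch of that `or`-chained cascade, built on the standard-library `StringScanner` rather than `Greeb::StringScanner`; the `demo_*` methods, the `DEMO_*` patterns, and the array-based tokens are simplified assumptions, not Greeb's implementation.

```ruby
require 'strscan'

# Hypothetical patterns, far simpler than Greeb's own constants.
DEMO_LETTERS  = /\p{L}+/u
DEMO_INTEGERS = /\d+/
DEMO_PUNCT    = /[.,!?]/
DEMO_SPACES   = /[ \t]+/

# Tries one pattern at the current position. On success it appends a
# [from, to, kind] triple and returns the (truthy) token array;
# on failure it returns nil so that the next alternative is tried.
def demo_parse!(scanner, tokens, pattern, kind)
  from = scanner.charpos
  return nil unless scanner.scan(pattern)
  tokens << [from, scanner.charpos, kind]
end

def demo_tokenize(text)
  scanner = StringScanner.new(text)
  tokens = []
  until scanner.eos?
    # `or` has very low precedence, so every line is a complete call
    # and the chain stops at the first alternative that succeeds.
    demo_parse! scanner, tokens, DEMO_LETTERS, :letter or
    demo_parse! scanner, tokens, DEMO_INTEGERS, :integer or
    demo_parse! scanner, tokens, DEMO_PUNCT, :punct or
    demo_parse! scanner, tokens, DEMO_SPACES, :space or
    raise ArgumentError, "unknown span at #{scanner.charpos}"
  end
  tokens
end

tokens = demo_tokenize('Hi 42!')
```

Because the final alternative raises, the loop cannot spin forever on input that no pattern accepts; every iteration either consumes some text or fails.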
@@ -78,25 +83,6 @@ module Greeb::Tokenizer
 end
 
 protected
-# One iteration of the tokenization process.
-#
-# @param scanner [Greeb::StringScanner] string scanner.
-# @param tokens [Array<Greeb::Span>] result array.
-#
-# @return [Array<Greeb::Span>] the modified set of extracted tokens.
-#
-def step scanner, tokens
-parse! scanner, tokens, LETTERS, :letter or
-parse! scanner, tokens, FLOATS, :float or
-parse! scanner, tokens, INTEGERS, :integer or
-split_parse! scanner, tokens, SENTENCE_PUNCTUATIONS, :spunct or
-split_parse! scanner, tokens, PUNCTUATIONS, :punct or
-split_parse! scanner, tokens, SEPARATORS, :separ or
-split_parse! scanner, tokens, SPACES, :space or
-split_parse! scanner, tokens, BREAKS, :break or
-parse! scanner, tokens, RESIDUALS, :residual
-end
-
 # Try to parse one small piece of text that is covered by pattern
 # of necessary type.
 #
data/lib/greeb/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: greeb
 version: !ruby/object:Gem::Version
-version: 0.2.
+version: 0.2.4
 platform: ruby
 authors:
 - Dmitry Ustalov
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2015-01-14 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
 name: minitest
@@ -58,7 +58,7 @@ files:
 - spec/spec_helper.rb
 - spec/support/invoker.rb
 - spec/tokenizer_spec.rb
-homepage: https://github.com/
+homepage: https://github.com/dustalov/greeb
 licenses:
 - MIT
 metadata: {}
@@ -70,14 +70,14 @@ required_ruby_version: !ruby/object:Gem::Requirement
 requirements:
 - - ">="
 - !ruby/object:Gem::Version
-version:
+version: 1.9.1
 required_rubygems_version: !ruby/object:Gem::Requirement
 requirements:
 - - ">="
 - !ruby/object:Gem::Version
 version: '0'
 requirements: []
-rubyforge_project:
+rubyforge_project:
 rubygems_version: 2.2.2
 signing_key:
 specification_version: 4