greeb 0.2.3 → 0.2.4
- checksums.yaml +4 -4
- data/.travis.yml +4 -3
- data/LICENSE +1 -1
- data/README.md +40 -21
- data/greeb.gemspec +2 -2
- data/lib/greeb/parser.rb +1 -3
- data/lib/greeb/tokenizer.rb +10 -24
- data/lib/greeb/version.rb +1 -1
- metadata +5 -5
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 0b353d307d409d5f67e7a7932e4d9aba26fd9dc8
+  data.tar.gz: 134d541cd2ccf1f6d71d6faea69edca8b650f08a
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: cdac0fb93910ef3a3c2e78a9d8dbeb2aeb9b906982c00ca51cfe385e40a75595cd26e3241a6b5ae6bf3b6e3e3090688ba2fd5d2890c38b9483fd09b405d3a2f4
+  data.tar.gz: ad5b4faf513f95359b864699af9f5f9e34d6b78d8681ba57d8eada5f2b4ec833d2532571660a9e36f1537792eebf79d3144236cec48e7b3be349281433db8b70
data/.travis.yml
CHANGED
data/LICENSE
CHANGED
data/README.md
CHANGED
@@ -1,9 +1,14 @@
 # Greeb
+
 Greeb [grʲip] is a simple yet awesome and Unicode-aware text segmentator
-
-
+based on regular expressions. The API documentation is available on
+[RubyDoc.info]. The software demonstration is available on
+<https://greeb.herokuapp.com>.
+
+[RubyDoc.info]: http://www.rubydoc.info/github/dustalov/greeb/master
 
 ## Installation
+
 Add this line to your application's Gemfile:
 
 ```ruby
@@ -19,11 +24,15 @@ Or install it yourself as:
 $ gem install greeb
 
 ## Usage
-Greeb can help you solve simple text processing problems such as
-tokenization and segmentation.
 
-
-
+Greeb can approach such essential text processing problems as
+tokenization and segmentation. There are two ways to use it:
+1) as a command-line application, 2) as a Ruby library.
+
+### Command-Line Interface
+
+The `greeb` application reads the input text from `STDIN` and
+writes one token per line to `STDOUT`.
 
 ```
 % echo 'Hello http://nlpub.ru guys, how are you?' | greeb
@@ -38,6 +47,7 @@ you
 ```
 
 ### Tokenization API
+
 Greeb has a very convinient API that makes you happy.
 
 ```ruby
@@ -48,7 +58,8 @@ pp Greeb::Tokenizer.tokenize('Hello!')
 =end
 ```
 
-It should be noted that it is possible to process much
+It should be noted that it is also possible to process much
+complex texts than the present one.
 
 ```ruby
 text =<<-EOF
@@ -91,8 +102,8 @@ pp Greeb::Tokenizer.tokenize(text)
 ```
 
 ### Segmentation API
-
-
+
+The analyzer can also perform sentence detection.
 
 ```ruby
 text = 'Hello! How are you?'
@@ -104,8 +115,8 @@ pp Greeb::Segmentator.new(tokens).sentences
 =end
 ```
 
-
-
+Having obtained the sentence boundaries, it is possible to
+extract tokens covered by these sentences.
 
 ```ruby
 text = 'Hello! How are you?'
@@ -127,10 +138,12 @@ pp segmentator.extract(segmentator.sentences)
 ```
 
 ### Parsing API
-Texts are often include some special spans such as URLs and e-mail
-addresses. Greeb can help you in these strings retrieval.
 
-
+It is often that a text includes such special entries as URLs
+and e-mail addresses. Greeb can assist you in extracting them.
+
+#### Extraction of URLs and e-mails
+
 ```ruby
 text = 'My website is http://nlpub.ru and e-mail is example@example.com.'
 
@@ -145,9 +158,10 @@ pp Greeb::Parser.emails(text).map { |e| [e, e.slice(text)] }
 =end
 ```
 
-Please
+Please do not use Greeb for the development of spam lists. Spam sucks.
+
+#### Extraction of abbreviations
 
-#### Abbreviation retrieval
 ```ruby
 text = 'Hello, G.L.H.F. everyone!'
 
@@ -160,7 +174,8 @@ pp Greeb::Parser.abbrevs(text).map { |e| [e, e.slice(text)] }
 The algorithm is not so accurate, but still useful in many practical
 situations.
 
-####
+#### Extraction of time stamps
+
 ```ruby
 text = 'Our time is running out: 13:37 or 14:89.'
 
@@ -171,7 +186,8 @@ pp Greeb::Parser.time(text).map { |e| [e, e.slice(text)] }
 ```
 
 ## Spans
-
+
+Greeb operates with spans, which are tuples of *(from, to, kind)*, where
 *from* is a beginning of the span, *to* is an ending of the span,
 and *kind* is a type of the span.
 
@@ -180,20 +196,23 @@ There are several span types at the tokenization stage: `:letter`,
 (for in-sentence punctuation), `:space`, and `:break`.
 
 ## Contributing
+
 1. Fork it;
 2. Create your feature branch (`git checkout -b my-new-feature`);
 3. Commit your changes (`git commit -am 'Added some feature'`);
 4. Push to the branch (`git push origin my-new-feature`);
 5. Create new Pull Request.
 
-## Build Status [<img src="https://secure.travis-ci.org/
+## Build Status [<img src="https://secure.travis-ci.org/dustalov/greeb.png"/>](http://travis-ci.org/dustalov/greeb)
 
 ## Dependency Status [<img src="https://gemnasium.com/dmchk/greeb.png"/>](https://gemnasium.com/dmchk/greeb)
 
 ## Code Climate [<img src="https://codeclimate.com/github/dmchk/greeb.png"/>](https://codeclimate.com/github/dmchk/greeb)
 
+## DOI [<img src="https://zenodo.org/badge/doi/10.5281/zenodo.10119.png"/>](http://dx.doi.org/10.5281/zenodo.10119)
+
 ## Copyright
 
-Copyright (c) 2010-
+Copyright (c) 2010-2015 [Dmitry Ustalov]. See LICENSE for details.
 
-[Dmitry Ustalov]:
+[Dmitry Ustalov]: https://ustalov.name/
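Note on the README's "Spans" section: it describes a span as a tuple of *(from, to, kind)*, and the hunk headers above show `e.slice(text)` being called on parser results. The following is a minimal sketch of that idea, not Greeb's own class. The `Span` struct here is a hypothetical stand-in, and treating *to* as an exclusive character offset is an assumption.

```ruby
# Hypothetical stand-in for a span; Greeb's own class may differ.
Span = Struct.new(:from, :to, :kind) do
  # Return the part of `text` covered by this span.
  # Assumption: `from` is inclusive and `to` is exclusive, in characters.
  def slice(text)
    text[from...to]
  end
end

text  = 'Hello you'
spans = [
  Span.new(0, 5, :letter),
  Span.new(5, 6, :space),
  Span.new(6, 9, :letter)
]

# Pair every span kind with the substring it covers.
pairs = spans.map { |span| [span.kind, span.slice(text)] }
p pairs
```

Keeping only offsets and a kind in each span means the original text is never copied until a caller asks for a slice.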
data/greeb.gemspec
CHANGED
@@ -8,13 +8,13 @@ Gem::Specification.new do |spec|
   spec.platform = Gem::Platform::RUBY
   spec.authors = ['Dmitry Ustalov']
   spec.email = ['dmitry@eveel.ru']
-  spec.homepage = 'https://github.com/
+  spec.homepage = 'https://github.com/dustalov/greeb'
   spec.summary = 'Greeb is a simple Unicode-aware regexp-based tokenizer.'
   spec.description = 'Greeb is a simple yet awesome and Unicode-aware ' \
                      'regexp-based tokenizer, written in Ruby.'
   spec.license = 'MIT'
 
-  spec.
+  spec.required_ruby_version = '>= 1.9.1'
 
   spec.files = `git ls-files`.split("\n")
   spec.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
data/lib/greeb/parser.rb
CHANGED
@@ -4,9 +4,7 @@
 # text. These entities are URLs, e-mail addresses, names, etc. This module
 # includes several helpers that could help to solve these problems.
 #
-module Greeb::Parser
-  extend self
-
+module Greeb::Parser extend self
   # An URL pattern. Not so precise, but IDN-compatible.
   #
   URL = %r{\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\p{L}\w\d]+\)|([^.\s]|/)))}i
data/lib/greeb/tokenizer.rb
CHANGED
@@ -5,10 +5,7 @@
 # Unicode character categories been obtained from
 # <http://www.fileformat.info/info/unicode/category/index.htm>.
 #
-module Greeb::Tokenizer
-  # http://www.youtube.com/watch?v=eF1lU-CrQfc
-  extend self
-
+module Greeb::Tokenizer extend self
   # English and Russian letters.
   #
   LETTERS = /[\p{L}]+/u
@@ -55,7 +52,15 @@ module Greeb::Tokenizer
     scanner = Greeb::StringScanner.new(text)
     tokens = []
     while !scanner.eos?
-
+      parse! scanner, tokens, LETTERS, :letter or
+      parse! scanner, tokens, FLOATS, :float or
+      parse! scanner, tokens, INTEGERS, :integer or
+      split_parse! scanner, tokens, SENTENCE_PUNCTUATIONS, :spunct or
+      split_parse! scanner, tokens, PUNCTUATIONS, :punct or
+      split_parse! scanner, tokens, SEPARATORS, :separ or
+      split_parse! scanner, tokens, SPACES, :space or
+      split_parse! scanner, tokens, BREAKS, :break or
+      parse! scanner, tokens, RESIDUALS, :residual or
       raise Greeb::UnknownSpan.new(text, scanner.char_pos)
     end
     tokens
@@ -78,25 +83,6 @@ module Greeb::Tokenizer
   end
 
   protected
-  # One iteration of the tokenization process.
-  #
-  # @param scanner [Greeb::StringScanner] string scanner.
-  # @param tokens [Array<Greeb::Span>] result array.
-  #
-  # @return [Array<Greeb::Span>] the modified set of extracted tokens.
-  #
-  def step scanner, tokens
-    parse! scanner, tokens, LETTERS, :letter or
-    parse! scanner, tokens, FLOATS, :float or
-    parse! scanner, tokens, INTEGERS, :integer or
-    split_parse! scanner, tokens, SENTENCE_PUNCTUATIONS, :spunct or
-    split_parse! scanner, tokens, PUNCTUATIONS, :punct or
-    split_parse! scanner, tokens, SEPARATORS, :separ or
-    split_parse! scanner, tokens, SPACES, :space or
-    split_parse! scanner, tokens, BREAKS, :break or
-    parse! scanner, tokens, RESIDUALS, :residual
-  end
-
   # Try to parse one small piece of text that is covered by pattern
   # of necessary type.
   #
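Note on the tokenizer change: the hunks above remove the protected `step` method and inline its `or` chain directly into the `while` loop, ending in `raise Greeb::UnknownSpan`. The sketch below illustrates that fall-through pattern using only Ruby's standard `StringScanner`. It is not Greeb's implementation: the three patterns, the array-based tokens, and this simplified `parse!` helper are assumptions made for illustration.

```ruby
require 'strscan'

# Hypothetical patterns for illustration; not Greeb's own constants.
LETTERS  = /\p{L}+/u
INTEGERS = /\d+/
SPACES   = /[ \t]+/

# Try to consume `pattern` at the current position. Returns the tokens
# array (truthy) on success and nil on failure, so calls chain with `or`.
def parse!(scanner, tokens, pattern, kind)
  from = scanner.charpos
  return nil unless scanner.scan(pattern)
  tokens << [from, scanner.charpos, kind]
end

def tokenize(text)
  scanner = StringScanner.new(text)
  tokens = []
  until scanner.eos?
    # The first alternative that matches wins; if none does, raise.
    parse!(scanner, tokens, LETTERS, :letter) or
      parse!(scanner, tokens, INTEGERS, :integer) or
      parse!(scanner, tokens, SPACES, :space) or
      raise ArgumentError, "unknown span at #{scanner.charpos}"
  end
  tokens
end

p tokenize('ab 12')
```

Because `or` has lower precedence than a method call with arguments, each alternative is evaluated only when every earlier one returned `nil`, and the trailing `raise` fires only if no pattern matched at the current position.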
data/lib/greeb/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: greeb
 version: !ruby/object:Gem::Version
-  version: 0.2.
+  version: 0.2.4
 platform: ruby
 authors:
 - Dmitry Ustalov
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2015-01-14 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: minitest
@@ -58,7 +58,7 @@ files:
 - spec/spec_helper.rb
 - spec/support/invoker.rb
 - spec/tokenizer_spec.rb
-homepage: https://github.com/
+homepage: https://github.com/dustalov/greeb
 licenses:
 - MIT
 metadata: {}
@@ -70,14 +70,14 @@ required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
-      version:
+      version: 1.9.1
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubyforge_project:
+rubyforge_project:
 rubygems_version: 2.2.2
 signing_key:
 specification_version: 4