greeb 0.2.3 → 0.2.4

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 07673b32254cd2b0ab0edf0664fa59e46231dbe3
4
- data.tar.gz: f8eaac92c0fd4d7dda99c4441117b5e2b34c5caa
3
+ metadata.gz: 0b353d307d409d5f67e7a7932e4d9aba26fd9dc8
4
+ data.tar.gz: 134d541cd2ccf1f6d71d6faea69edca8b650f08a
5
5
  SHA512:
6
- metadata.gz: ebfda44f713c3dcda9df0439f073a1c075360c14cc41e99c400db8eec25f06033ba2d788b5c4b7b8715eeb5cc085e6c704f6e9bad08bfd6af7b4bc4d051a8c32
7
- data.tar.gz: cb60f13ddad1e17a7cdbbd19add866677f92590cb7072dd5dc3cf70b80c93a31e0e857366908ff214bb0cc5b2c585415abd78338bd8479ea3150ebf58f5d2117
6
+ metadata.gz: cdac0fb93910ef3a3c2e78a9d8dbeb2aeb9b906982c00ca51cfe385e40a75595cd26e3241a6b5ae6bf3b6e3e3090688ba2fd5d2890c38b9483fd09b405d3a2f4
7
+ data.tar.gz: ad5b4faf513f95359b864699af9f5f9e34d6b78d8681ba57d8eada5f2b4ec833d2532571660a9e36f1537792eebf79d3144236cec48e7b3be349281433db8b70
@@ -1,8 +1,9 @@
1
+ sudo: false
1
2
  language: ruby
2
3
  rvm:
3
- - 2.0.0
4
+ - ruby
5
+ - rbx
4
6
  - jruby-19mode
5
- - rbx-19mode
6
7
  matrix:
7
8
  allow_failures:
8
- - rvm: rbx-19mode
9
+ - rvm: rbx
data/LICENSE CHANGED
@@ -1,4 +1,4 @@
1
- Copyright (c) 2010-2014 Dmitry Ustalov
1
+ Copyright (c) 2010-2015 Dmitry Ustalov
2
2
 
3
3
  Permission is hereby granted, free of charge, to any person obtaining
4
4
  a copy of this software and associated documentation files (the
data/README.md CHANGED
@@ -1,9 +1,14 @@
1
1
  # Greeb
2
+
2
3
  Greeb [grʲip] is a simple yet awesome and Unicode-aware text segmentator
3
- that is based on regular expressions. The API documentation is available
4
- at <http://rubydoc.info/github/dmchk/greeb/master/frames>.
4
+ based on regular expressions. The API documentation is available on
5
+ [RubyDoc.info]. The software demonstration is available on
6
+ <https://greeb.herokuapp.com>.
7
+
8
+ [RubyDoc.info]: http://www.rubydoc.info/github/dustalov/greeb/master
5
9
 
6
10
  ## Installation
11
+
7
12
  Add this line to your application's Gemfile:
8
13
 
9
14
  ```ruby
@@ -19,11 +24,15 @@ Or install it yourself as:
19
24
  $ gem install greeb
20
25
 
21
26
  ## Usage
22
- Greeb can help you solve simple text processing problems such as
23
- tokenization and segmentation.
24
27
 
25
- It is available as a command line application that reads the input
26
- text from STDIN and prints one token per line into STDOUT.
28
+ Greeb can approach such essential text processing problems as
29
+ tokenization and segmentation. There are two ways to use it:
30
+ 1) as a command-line application, 2) as a Ruby library.
31
+
32
+ ### Command-Line Interface
33
+
34
+ The `greeb` application reads the input text from `STDIN` and
35
+ writes one token per line to `STDOUT`.
27
36
 
28
37
  ```
29
38
  % echo 'Hello http://nlpub.ru guys, how are you?' | greeb
@@ -38,6 +47,7 @@ you
38
47
  ```
39
48
 
40
49
  ### Tokenization API
50
+
41
51
  Greeb has a very convinient API that makes you happy.
42
52
 
43
53
  ```ruby
@@ -48,7 +58,8 @@ pp Greeb::Tokenizer.tokenize('Hello!')
48
58
  =end
49
59
  ```
50
60
 
51
- It should be noted that it is possible to process much complex texts.
61
+ It should be noted that it is also possible to process much
62
+ complex texts than the present one.
52
63
 
53
64
  ```ruby
54
65
  text =<<-EOF
@@ -91,8 +102,8 @@ pp Greeb::Tokenizer.tokenize(text)
91
102
  ```
92
103
 
93
104
  ### Segmentation API
94
- Also it can be used to solve the text segmentation problems
95
- such as sentence detection tasks.
105
+
106
+ The analyzer can also perform sentence detection.
96
107
 
97
108
  ```ruby
98
109
  text = 'Hello! How are you?'
@@ -104,8 +115,8 @@ pp Greeb::Segmentator.new(tokens).sentences
104
115
  =end
105
116
  ```
106
117
 
107
- It is possible to extract tokens that were processed by the text
108
- segmentator.
118
+ Having obtained the sentence boundaries, it is possible to
119
+ extract tokens covered by these sentences.
109
120
 
110
121
  ```ruby
111
122
  text = 'Hello! How are you?'
@@ -127,10 +138,12 @@ pp segmentator.extract(segmentator.sentences)
127
138
  ```
128
139
 
129
140
  ### Parsing API
130
- Texts are often include some special spans such as URLs and e-mail
131
- addresses. Greeb can help you in these strings retrieval.
132
141
 
133
- #### URL and E-mail retrieval
142
+ It is often that a text includes such special entries as URLs
143
+ and e-mail addresses. Greeb can assist you in extracting them.
144
+
145
+ #### Extraction of URLs and e-mails
146
+
134
147
  ```ruby
135
148
  text = 'My website is http://nlpub.ru and e-mail is example@example.com.'
136
149
 
@@ -145,9 +158,10 @@ pp Greeb::Parser.emails(text).map { |e| [e, e.slice(text)] }
145
158
  =end
146
159
  ```
147
160
 
148
- Please don't use Greeb in spam lists development purposes.
161
+ Please do not use Greeb for the development of spam lists. Spam sucks.
162
+
163
+ #### Extraction of abbreviations
149
164
 
150
- #### Abbreviation retrieval
151
165
  ```ruby
152
166
  text = 'Hello, G.L.H.F. everyone!'
153
167
 
@@ -160,7 +174,8 @@ pp Greeb::Parser.abbrevs(text).map { |e| [e, e.slice(text)] }
160
174
  The algorithm is not so accurate, but still useful in many practical
161
175
  situations.
162
176
 
163
- #### Timestamps retrieval
177
+ #### Extraction of time stamps
178
+
164
179
  ```ruby
165
180
  text = 'Our time is running out: 13:37 or 14:89.'
166
181
 
@@ -171,7 +186,8 @@ pp Greeb::Parser.time(text).map { |e| [e, e.slice(text)] }
171
186
  ```
172
187
 
173
188
  ## Spans
174
- Greeb operates with spans, tuples of *(from, to, kind)*, where
189
+
190
+ Greeb operates with spans, which are tuples of *(from, to, kind)*, where
175
191
  *from* is a beginning of the span, *to* is an ending of the span,
176
192
  and *kind* is a type of the span.
177
193
 
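Note on the span description above: the *(from, to, kind)* tuple and the `e.slice(text)` calls in the README examples can be pictured with a small stand-in. The sketch below is only an illustration. `DemoSpan` is a hypothetical name, not Greeb's class, and the end-exclusive reading of *to* is an assumption that this diff does not confirm.

```ruby
# Hypothetical stand-in for a span; this is NOT Greeb's own class.
DemoSpan = Struct.new(:from, :to, :kind) do
  # Return the part of the text covered by the span.
  # Assumption: `to` is end-exclusive.
  def slice(text)
    text[from...to]
  end
end

text = 'Hello!'
spans = [DemoSpan.new(0, 5, :letter), DemoSpan.new(5, 6, :punct)]

# Pair each span kind with the substring it covers.
pieces = spans.map { |span| [span.kind, span.slice(text)] }
```

Keeping only offsets and a kind in the span means the original text is never copied until a caller asks for a slice.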
@@ -180,20 +196,23 @@ There are several span types at the tokenization stage: `:letter`,
180
196
  (for in-sentence punctuation), `:space`, and `:break`.
181
197
 
182
198
  ## Contributing
199
+
183
200
  1. Fork it;
184
201
  2. Create your feature branch (`git checkout -b my-new-feature`);
185
202
  3. Commit your changes (`git commit -am 'Added some feature'`);
186
203
  4. Push to the branch (`git push origin my-new-feature`);
187
204
  5. Create new Pull Request.
188
205
 
189
- ## Build Status [<img src="https://secure.travis-ci.org/dmchk/greeb.png"/>](http://travis-ci.org/dmchk/greeb)
206
+ ## Build Status [<img src="https://secure.travis-ci.org/dustalov/greeb.png"/>](http://travis-ci.org/dustalov/greeb)
190
207
 
191
208
  ## Dependency Status [<img src="https://gemnasium.com/dmchk/greeb.png"/>](https://gemnasium.com/dmchk/greeb)
192
209
 
193
210
  ## Code Climate [<img src="https://codeclimate.com/github/dmchk/greeb.png"/>](https://codeclimate.com/github/dmchk/greeb)
194
211
 
212
+ ## DOI [<img src="https://zenodo.org/badge/doi/10.5281/zenodo.10119.png"/>](http://dx.doi.org/10.5281/zenodo.10119)
213
+
195
214
  ## Copyright
196
215
 
197
- Copyright (c) 2010-2014 [Dmitry Ustalov]. See LICENSE for details.
216
+ Copyright (c) 2010-2015 [Dmitry Ustalov]. See LICENSE for details.
198
217
 
199
- [Dmitry Ustalov]: http://ustalov.name/
218
+ [Dmitry Ustalov]: https://ustalov.name/
@@ -8,13 +8,13 @@ Gem::Specification.new do |spec|
8
8
  spec.platform = Gem::Platform::RUBY
9
9
  spec.authors = ['Dmitry Ustalov']
10
10
  spec.email = ['dmitry@eveel.ru']
11
- spec.homepage = 'https://github.com/dmchk/greeb'
11
+ spec.homepage = 'https://github.com/dustalov/greeb'
12
12
  spec.summary = 'Greeb is a simple Unicode-aware regexp-based tokenizer.'
13
13
  spec.description = 'Greeb is a simple yet awesome and Unicode-aware ' \
14
14
  'regexp-based tokenizer, written in Ruby.'
15
15
  spec.license = 'MIT'
16
16
 
17
- spec.rubyforge_project = 'greeb'
17
+ spec.required_ruby_version = '>= 1.9.1'
18
18
 
19
19
  spec.files = `git ls-files`.split("\n")
20
20
  spec.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
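Note on the `required_ruby_version` change above: the constraint string `'>= 1.9.1'` is interpreted by RubyGems as a version requirement. A minimal sketch of how such a requirement is evaluated, using the standard `Gem::Requirement` and `Gem::Version` classes; the variable names and the sample versions are illustrative only.

```ruby
require 'rubygems'

# The same constraint string that the gemspec now declares.
requirement = Gem::Requirement.new('>= 1.9.1')

# Two sample interpreter versions, chosen for illustration.
old_ruby = Gem::Version.new('1.8.7')
new_ruby = Gem::Version.new('2.0.0')

old_ok = requirement.satisfied_by?(old_ruby) # false: 1.8.7 is below 1.9.1
new_ok = requirement.satisfied_by?(new_ruby) # true

# The running interpreter can be checked the same way.
current_ok = requirement.satisfied_by?(Gem::Version.new(RUBY_VERSION))
```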
@@ -4,9 +4,7 @@
4
4
  # text. These entities are URLs, e-mail addresses, names, etc. This module
5
5
  # includes several helpers that could help to solve these problems.
6
6
  #
7
- module Greeb::Parser
8
- extend self
9
-
7
+ module Greeb::Parser extend self
10
8
  # An URL pattern. Not so precise, but IDN-compatible.
11
9
  #
12
10
  URL = %r{\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\p{L}\w\d]+\)|([^.\s]|/)))}i
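Note on the `extend self` change above: the hunk only moves `extend self` onto the `module` line, and the behaviour it relies on stays the same. The sketch below shows what `extend self` does, written in the more common two-line layout. `DemoHelpers` and `shout` are hypothetical names, not part of Greeb.

```ruby
# Hypothetical module used only to demonstrate `extend self`.
module DemoHelpers
  # Make every instance method of this module callable on the module itself.
  extend self

  def shout(text)
    text.upcase
  end
end

# No class or instance is needed to call the helper.
result = DemoHelpers.shout('greeb')
```

This is why the README can call helpers such as `Greeb::Parser.emails(text)` directly on the module.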
@@ -5,10 +5,7 @@
5
5
  # Unicode character categories been obtained from
6
6
  # <http://www.fileformat.info/info/unicode/category/index.htm>.
7
7
  #
8
- module Greeb::Tokenizer
9
- # http://www.youtube.com/watch?v=eF1lU-CrQfc
10
- extend self
11
-
8
+ module Greeb::Tokenizer extend self
12
9
  # English and Russian letters.
13
10
  #
14
11
  LETTERS = /[\p{L}]+/u
@@ -55,7 +52,15 @@ module Greeb::Tokenizer
55
52
  scanner = Greeb::StringScanner.new(text)
56
53
  tokens = []
57
54
  while !scanner.eos?
58
- step scanner, tokens or
55
+ parse! scanner, tokens, LETTERS, :letter or
56
+ parse! scanner, tokens, FLOATS, :float or
57
+ parse! scanner, tokens, INTEGERS, :integer or
58
+ split_parse! scanner, tokens, SENTENCE_PUNCTUATIONS, :spunct or
59
+ split_parse! scanner, tokens, PUNCTUATIONS, :punct or
60
+ split_parse! scanner, tokens, SEPARATORS, :separ or
61
+ split_parse! scanner, tokens, SPACES, :space or
62
+ split_parse! scanner, tokens, BREAKS, :break or
63
+ parse! scanner, tokens, RESIDUALS, :residual or
59
64
  raise Greeb::UnknownSpan.new(text, scanner.char_pos)
60
65
  end
61
66
  tokens
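Note on the inlined loop above: each `parse!` or `split_parse!` call is tried in order, the first one that succeeds ends the chain, and `raise` runs only when none matches. A minimal, self-contained sketch of that control flow, using the standard `StringScanner`; `try_pattern`, `demo_tokenize`, and the three patterns are hypothetical simplifications, not Greeb's implementation.

```ruby
require 'strscan'

# Hypothetical helper: try one pattern at the current position. On success,
# record [from, to, kind] and return true; otherwise return nil.
def try_pattern(scanner, tokens, pattern, kind)
  from = scanner.pos # byte offset, which is enough for the ASCII input below
  return nil unless scanner.scan(pattern)
  tokens << [from, scanner.pos, kind]
  true
end

# Simplified loop: the low-precedence `or` chains the attempts, so the
# final `raise` is reached only when every pattern has failed.
def demo_tokenize(text)
  scanner = StringScanner.new(text)
  tokens = []
  until scanner.eos?
    try_pattern(scanner, tokens, /\p{L}+/, :letter) or
      try_pattern(scanner, tokens, /\d+/, :integer) or
      try_pattern(scanner, tokens, /\s+/, :space) or
      raise ArgumentError, "cannot tokenize at position #{scanner.pos}"
  end
  tokens
end

tokens = demo_tokenize('abc 42')
```

Because a successful attempt always consumes at least one character, the loop makes progress on every iteration or stops with an error.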
@@ -78,25 +83,6 @@ module Greeb::Tokenizer
78
83
  end
79
84
 
80
85
  protected
81
- # One iteration of the tokenization process.
82
- #
83
- # @param scanner [Greeb::StringScanner] string scanner.
84
- # @param tokens [Array<Greeb::Span>] result array.
85
- #
86
- # @return [Array<Greeb::Span>] the modified set of extracted tokens.
87
- #
88
- def step scanner, tokens
89
- parse! scanner, tokens, LETTERS, :letter or
90
- parse! scanner, tokens, FLOATS, :float or
91
- parse! scanner, tokens, INTEGERS, :integer or
92
- split_parse! scanner, tokens, SENTENCE_PUNCTUATIONS, :spunct or
93
- split_parse! scanner, tokens, PUNCTUATIONS, :punct or
94
- split_parse! scanner, tokens, SEPARATORS, :separ or
95
- split_parse! scanner, tokens, SPACES, :space or
96
- split_parse! scanner, tokens, BREAKS, :break or
97
- parse! scanner, tokens, RESIDUALS, :residual
98
- end
99
-
100
86
  # Try to parse one small piece of text that is covered by pattern
101
87
  # of necessary type.
102
88
  #
@@ -5,5 +5,5 @@
5
5
  module Greeb
6
6
  # Version of Greeb.
7
7
  #
8
- VERSION = '0.2.3'
8
+ VERSION = '0.2.4'
9
9
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: greeb
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.2.3
4
+ version: 0.2.4
5
5
  platform: ruby
6
6
  authors:
7
7
  - Dmitry Ustalov
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2014-05-25 00:00:00.000000000 Z
11
+ date: 2015-01-14 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: minitest
@@ -58,7 +58,7 @@ files:
58
58
  - spec/spec_helper.rb
59
59
  - spec/support/invoker.rb
60
60
  - spec/tokenizer_spec.rb
61
- homepage: https://github.com/dmchk/greeb
61
+ homepage: https://github.com/dustalov/greeb
62
62
  licenses:
63
63
  - MIT
64
64
  metadata: {}
@@ -70,14 +70,14 @@ required_ruby_version: !ruby/object:Gem::Requirement
70
70
  requirements:
71
71
  - - ">="
72
72
  - !ruby/object:Gem::Version
73
- version: '0'
73
+ version: 1.9.1
74
74
  required_rubygems_version: !ruby/object:Gem::Requirement
75
75
  requirements:
76
76
  - - ">="
77
77
  - !ruby/object:Gem::Version
78
78
  version: '0'
79
79
  requirements: []
80
- rubyforge_project: greeb
80
+ rubyforge_project:
81
81
  rubygems_version: 2.2.2
82
82
  signing_key:
83
83
  specification_version: 4