greeb 0.2.3 → 0.2.4

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 07673b32254cd2b0ab0edf0664fa59e46231dbe3
- data.tar.gz: f8eaac92c0fd4d7dda99c4441117b5e2b34c5caa
+ metadata.gz: 0b353d307d409d5f67e7a7932e4d9aba26fd9dc8
+ data.tar.gz: 134d541cd2ccf1f6d71d6faea69edca8b650f08a
  SHA512:
- metadata.gz: ebfda44f713c3dcda9df0439f073a1c075360c14cc41e99c400db8eec25f06033ba2d788b5c4b7b8715eeb5cc085e6c704f6e9bad08bfd6af7b4bc4d051a8c32
- data.tar.gz: cb60f13ddad1e17a7cdbbd19add866677f92590cb7072dd5dc3cf70b80c93a31e0e857366908ff214bb0cc5b2c585415abd78338bd8479ea3150ebf58f5d2117
+ metadata.gz: cdac0fb93910ef3a3c2e78a9d8dbeb2aeb9b906982c00ca51cfe385e40a75595cd26e3241a6b5ae6bf3b6e3e3090688ba2fd5d2890c38b9483fd09b405d3a2f4
+ data.tar.gz: ad5b4faf513f95359b864699af9f5f9e34d6b78d8681ba57d8eada5f2b4ec833d2532571660a9e36f1537792eebf79d3144236cec48e7b3be349281433db8b70
@@ -1,8 +1,9 @@
+ sudo: false
  language: ruby
  rvm:
- - 2.0.0
+ - ruby
+ - rbx
  - jruby-19mode
- - rbx-19mode
  matrix:
  allow_failures:
- - rvm: rbx-19mode
+ - rvm: rbx
data/LICENSE CHANGED
@@ -1,4 +1,4 @@
- Copyright (c) 2010-2014 Dmitry Ustalov
+ Copyright (c) 2010-2015 Dmitry Ustalov
 
  Permission is hereby granted, free of charge, to any person obtaining
  a copy of this software and associated documentation files (the
data/README.md CHANGED
@@ -1,9 +1,14 @@
  # Greeb
+
  Greeb [grʲip] is a simple yet awesome and Unicode-aware text segmentator
- that is based on regular expressions. The API documentation is available
- at <http://rubydoc.info/github/dmchk/greeb/master/frames>.
+ based on regular expressions. The API documentation is available on
+ [RubyDoc.info]. The software demonstration is available on
+ <https://greeb.herokuapp.com>.
+
+ [RubyDoc.info]: http://www.rubydoc.info/github/dustalov/greeb/master
 
  ## Installation
+
  Add this line to your application's Gemfile:
 
  ```ruby
@@ -19,11 +24,15 @@ Or install it yourself as:
  $ gem install greeb
 
  ## Usage
- Greeb can help you solve simple text processing problems such as
- tokenization and segmentation.
 
- It is available as a command line application that reads the input
- text from STDIN and prints one token per line into STDOUT.
+ Greeb can approach such essential text processing problems as
+ tokenization and segmentation. There are two ways to use it:
+ 1) as a command-line application, 2) as a Ruby library.
+
+ ### Command-Line Interface
+
+ The `greeb` application reads the input text from `STDIN` and
+ writes one token per line to `STDOUT`.
 
  ```
  % echo 'Hello http://nlpub.ru guys, how are you?' | greeb
@@ -38,6 +47,7 @@ you
  ```
 
  ### Tokenization API
+
  Greeb has a very convinient API that makes you happy.
 
  ```ruby
@@ -48,7 +58,8 @@ pp Greeb::Tokenizer.tokenize('Hello!')
  =end
  ```
 
- It should be noted that it is possible to process much complex texts.
+ It should be noted that it is also possible to process much
+ complex texts than the present one.
 
  ```ruby
  text =<<-EOF
@@ -91,8 +102,8 @@ pp Greeb::Tokenizer.tokenize(text)
  ```
 
  ### Segmentation API
- Also it can be used to solve the text segmentation problems
- such as sentence detection tasks.
+
+ The analyzer can also perform sentence detection.
 
  ```ruby
  text = 'Hello! How are you?'
@@ -104,8 +115,8 @@ pp Greeb::Segmentator.new(tokens).sentences
  =end
  ```
 
- It is possible to extract tokens that were processed by the text
- segmentator.
+ Having obtained the sentence boundaries, it is possible to
+ extract tokens covered by these sentences.
 
  ```ruby
  text = 'Hello! How are you?'
@@ -127,10 +138,12 @@ pp segmentator.extract(segmentator.sentences)
  ```
 
  ### Parsing API
- Texts are often include some special spans such as URLs and e-mail
- addresses. Greeb can help you in these strings retrieval.
 
- #### URL and E-mail retrieval
+ It is often that a text includes such special entries as URLs
+ and e-mail addresses. Greeb can assist you in extracting them.
+
+ #### Extraction of URLs and e-mails
+
  ```ruby
  text = 'My website is http://nlpub.ru and e-mail is example@example.com.'
 
@@ -145,9 +158,10 @@ pp Greeb::Parser.emails(text).map { |e| [e, e.slice(text)] }
  =end
  ```
 
- Please don't use Greeb in spam lists development purposes.
+ Please do not use Greeb for the development of spam lists. Spam sucks.
+
+ #### Extraction of abbreviations
 
- #### Abbreviation retrieval
  ```ruby
  text = 'Hello, G.L.H.F. everyone!'
 
@@ -160,7 +174,8 @@ pp Greeb::Parser.abbrevs(text).map { |e| [e, e.slice(text)] }
  The algorithm is not so accurate, but still useful in many practical
  situations.
 
- #### Timestamps retrieval
+ #### Extraction of time stamps
+
  ```ruby
  text = 'Our time is running out: 13:37 or 14:89.'
 
@@ -171,7 +186,8 @@ pp Greeb::Parser.time(text).map { |e| [e, e.slice(text)] }
  ```
 
  ## Spans
- Greeb operates with spans, tuples of *(from, to, kind)*, where
+
+ Greeb operates with spans, which are tuples of *(from, to, kind)*, where
  *from* is a beginning of the span, *to* is an ending of the span,
  and *kind* is a type of the span.
 
@@ -180,20 +196,23 @@ There are several span types at the tokenization stage: `:letter`,
  (for in-sentence punctuation), `:space`, and `:break`.
 
  ## Contributing
+
  1. Fork it;
  2. Create your feature branch (`git checkout -b my-new-feature`);
  3. Commit your changes (`git commit -am 'Added some feature'`);
  4. Push to the branch (`git push origin my-new-feature`);
  5. Create new Pull Request.
 
- ## Build Status [<img src="https://secure.travis-ci.org/dmchk/greeb.png"/>](http://travis-ci.org/dmchk/greeb)
+ ## Build Status [<img src="https://secure.travis-ci.org/dustalov/greeb.png"/>](http://travis-ci.org/dustalov/greeb)
 
  ## Dependency Status [<img src="https://gemnasium.com/dmchk/greeb.png"/>](https://gemnasium.com/dmchk/greeb)
 
  ## Code Climate [<img src="https://codeclimate.com/github/dmchk/greeb.png"/>](https://codeclimate.com/github/dmchk/greeb)
 
+ ## DOI [<img src="https://zenodo.org/badge/doi/10.5281/zenodo.10119.png"/>](http://dx.doi.org/10.5281/zenodo.10119)
+
  ## Copyright
 
- Copyright (c) 2010-2014 [Dmitry Ustalov]. See LICENSE for details.
+ Copyright (c) 2010-2015 [Dmitry Ustalov]. See LICENSE for details.
 
- [Dmitry Ustalov]: http://ustalov.name/
+ [Dmitry Ustalov]: https://ustalov.name/
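
Note on the README's Spans section: the *(from, to, kind)* tuple and the `e.slice(text)` calls visible in the hunk headers can be pictured with a small stand-in. This is only a sketch: `DemoSpan` is a hypothetical name, not Greeb's own class, and it assumes that *from* is inclusive and *to* is exclusive, which the diff does not confirm.

```ruby
# Hypothetical stand-in for a span; not Greeb's actual class.
DemoSpan = Struct.new(:from, :to, :kind) do
  # Assumption: `from` is inclusive and `to` is exclusive.
  def slice(text)
    text[from...to]
  end
end

text = 'My website is http://nlpub.ru and e-mail is example@example.com.'
span = DemoSpan.new(14, 29, :url)
span.slice(text) # => "http://nlpub.ru"
```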
@@ -8,13 +8,13 @@ Gem::Specification.new do |spec|
  spec.platform = Gem::Platform::RUBY
  spec.authors = ['Dmitry Ustalov']
  spec.email = ['dmitry@eveel.ru']
- spec.homepage = 'https://github.com/dmchk/greeb'
+ spec.homepage = 'https://github.com/dustalov/greeb'
  spec.summary = 'Greeb is a simple Unicode-aware regexp-based tokenizer.'
  spec.description = 'Greeb is a simple yet awesome and Unicode-aware ' \
  'regexp-based tokenizer, written in Ruby.'
  spec.license = 'MIT'
 
- spec.rubyforge_project = 'greeb'
+ spec.required_ruby_version = '>= 1.9.1'
 
  spec.files = `git ls-files`.split("\n")
  spec.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
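
Note on the new `required_ruby_version` constraint: RubyGems evaluates such strings with `Gem::Requirement`. The snippet below is a minimal sketch of how `'>= 1.9.1'` compares against a few interpreter versions; the version numbers tested are illustrative only.

```ruby
require 'rubygems'

# The constraint string introduced in 0.2.4.
requirement = Gem::Requirement.new('>= 1.9.1')

# Illustrative versions only; `satisfied_by?` expects a Gem::Version.
too_old   = requirement.satisfied_by?(Gem::Version.new('1.8.7')) # => false
boundary  = requirement.satisfied_by?(Gem::Version.new('1.9.1')) # => true
newer     = requirement.satisfied_by?(Gem::Version.new('2.0.0')) # => true
```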
@@ -4,9 +4,7 @@
  # text. These entities are URLs, e-mail addresses, names, etc. This module
  # includes several helpers that could help to solve these problems.
  #
- module Greeb::Parser
- extend self
-
+ module Greeb::Parser extend self
  # An URL pattern. Not so precise, but IDN-compatible.
  #
  URL = %r{\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\p{L}\w\d]+\)|([^.\s]|/)))}i
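
Note on `extend self`: the hunk above only moves it onto the `module` line, so behaviour should be unchanged. As a reminder of what the idiom does, here is a minimal sketch with a hypothetical `DemoParser` module (not part of Greeb), written in the conventional two-line form.

```ruby
# Hypothetical module for illustration; not part of Greeb.
module DemoParser
  # Makes the module's instance methods callable on the module itself.
  extend self

  def shout(text)
    text.upcase
  end
end

# Callable directly on the module...
direct = DemoParser.shout('hello')

# ...and still usable as a mixin.
class DemoClient
  include DemoParser
end
mixed = DemoClient.new.shout('hi')
```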
@@ -5,10 +5,7 @@
  # Unicode character categories been obtained from
  # <http://www.fileformat.info/info/unicode/category/index.htm>.
  #
- module Greeb::Tokenizer
- # http://www.youtube.com/watch?v=eF1lU-CrQfc
- extend self
-
+ module Greeb::Tokenizer extend self
  # English and Russian letters.
  #
  LETTERS = /[\p{L}]+/u
@@ -55,7 +52,15 @@ module Greeb::Tokenizer
  scanner = Greeb::StringScanner.new(text)
  tokens = []
  while !scanner.eos?
- step scanner, tokens or
+ parse! scanner, tokens, LETTERS, :letter or
+ parse! scanner, tokens, FLOATS, :float or
+ parse! scanner, tokens, INTEGERS, :integer or
+ split_parse! scanner, tokens, SENTENCE_PUNCTUATIONS, :spunct or
+ split_parse! scanner, tokens, PUNCTUATIONS, :punct or
+ split_parse! scanner, tokens, SEPARATORS, :separ or
+ split_parse! scanner, tokens, SPACES, :space or
+ split_parse! scanner, tokens, BREAKS, :break or
+ parse! scanner, tokens, RESIDUALS, :residual or
  raise Greeb::UnknownSpan.new(text, scanner.char_pos)
  end
  tokens
@@ -78,25 +83,6 @@ module Greeb::Tokenizer
  end
 
  protected
- # One iteration of the tokenization process.
- #
- # @param scanner [Greeb::StringScanner] string scanner.
- # @param tokens [Array<Greeb::Span>] result array.
- #
- # @return [Array<Greeb::Span>] the modified set of extracted tokens.
- #
- def step scanner, tokens
- parse! scanner, tokens, LETTERS, :letter or
- parse! scanner, tokens, FLOATS, :float or
- parse! scanner, tokens, INTEGERS, :integer or
- split_parse! scanner, tokens, SENTENCE_PUNCTUATIONS, :spunct or
- split_parse! scanner, tokens, PUNCTUATIONS, :punct or
- split_parse! scanner, tokens, SEPARATORS, :separ or
- split_parse! scanner, tokens, SPACES, :space or
- split_parse! scanner, tokens, BREAKS, :break or
- parse! scanner, tokens, RESIDUALS, :residual
- end
-
  # Try to parse one small piece of text that is covered by pattern
  # of necessary type.
  #
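
Note on the tokenizer hunks: the former `step` helper is inlined into the main loop, so each iteration tries the patterns in order through an `or` chain and raises if none matches. The sketch below imitates that control flow with Ruby's standard `StringScanner`. Everything prefixed with `demo_` or `DEMO_` is hypothetical: the patterns are simplified stand-ins for Greeb's constants, and tokens are plain `[from, to, kind]` arrays rather than Greeb's span objects.

```ruby
require 'strscan'

# Hypothetical, simplified patterns; Greeb's real constants differ.
DEMO_LETTERS   = /\p{L}+/u
DEMO_INTEGERS  = /\d+/u
DEMO_SPACES    = /[\p{Zs}\t]+/u
DEMO_RESIDUALS = /[^\p{L}\d\p{Zs}\t]+/u

# Try one pattern at the current position. Returns the (truthy) token
# array on a match and nil otherwise, so an `or` chain falls through.
def demo_parse!(scanner, tokens, pattern, kind)
  from = scanner.charpos
  return nil unless scanner.scan(pattern)
  tokens << [from, scanner.charpos, kind]
end

def demo_tokenize(text)
  scanner = StringScanner.new(text)
  tokens = []
  until scanner.eos?
    # The first pattern that matches wins; `raise` is the last resort.
    demo_parse!(scanner, tokens, DEMO_LETTERS, :letter) or
      demo_parse!(scanner, tokens, DEMO_INTEGERS, :integer) or
      demo_parse!(scanner, tokens, DEMO_SPACES, :space) or
      demo_parse!(scanner, tokens, DEMO_RESIDUALS, :residual) or
      raise ArgumentError, "unknown span at #{scanner.charpos}"
  end
  tokens
end

demo_tokenize('abc 42!')
# => [[0, 3, :letter], [3, 4, :space], [4, 6, :integer], [6, 7, :residual]]
```

Because `or` binds more loosely than a method call, each alternative is evaluated only when every earlier one returned `nil`, which is why the order of the patterns matters.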
@@ -5,5 +5,5 @@
  module Greeb
  # Version of Greeb.
  #
- VERSION = '0.2.3'
+ VERSION = '0.2.4'
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: greeb
  version: !ruby/object:Gem::Version
- version: 0.2.3
+ version: 0.2.4
  platform: ruby
  authors:
  - Dmitry Ustalov
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2014-05-25 00:00:00.000000000 Z
+ date: 2015-01-14 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: minitest
@@ -58,7 +58,7 @@ files:
  - spec/spec_helper.rb
  - spec/support/invoker.rb
  - spec/tokenizer_spec.rb
- homepage: https://github.com/dmchk/greeb
+ homepage: https://github.com/dustalov/greeb
  licenses:
  - MIT
  metadata: {}
@@ -70,14 +70,14 @@ required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
  - !ruby/object:Gem::Version
- version: '0'
+ version: 1.9.1
  required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
  - !ruby/object:Gem::Version
  version: '0'
  requirements: []
- rubyforge_project: greeb
+ rubyforge_project:
  rubygems_version: 2.2.2
  signing_key:
  specification_version: 4