corpus-processor 0.2.0 → 0.3.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 0b40f1ccc5e1f007f584f6c0bf037b0221d65cec
- data.tar.gz: 5b486e05f2372b163a1399244ed2861c239bea02
+ metadata.gz: 5e5579e5e275e1eeeada264b8a42cb0a87f1389a
+ data.tar.gz: 8adc37111c8b7e4a550bef2ff3a919db4ff82ca4
  SHA512:
- metadata.gz: ec94f33cf3ff79a6874130ddbfbb10df20186ff9a5ffb176de48d92aca56b43bfd0d679e6e88f4cc4e215dfd253d7d697b8f68cd99fabbd05fce7b0ab8e761e4
- data.tar.gz: 9e82dff64190b3dd04c33a31d5df024b5b3522de70c13075dc29b2d6c6b431020aded272818315d7cbcef5997bc6fa71e0e00aa1c001c692ec3be8b181f4c2ab
+ metadata.gz: a35d00e8ad4b2ed8f0a98e176dfdd4967d3e55a8a6368304a746b6d8df6a21ad5efd9cf75ccc3dae39cb4032e6d30a749dae777531868958a39a7e7cb244012c
+ data.tar.gz: dd32457b4f1bc1e45fcefa92d0c8bae6a0461f33e60784238e0ce95c6ea5dc078daf07449450ead6f26a46d14a9e0999d6f0adcf844d60419885f08026e97e11
data/.travis.yml ADDED
@@ -0,0 +1,5 @@
+ ---
+ language: ruby
+ rvm:
+ - '2.0.0'
+ script: bundle exec rspec spec
data/.yardopts ADDED
@@ -0,0 +1 @@
+ --markup markdown
data/README.md CHANGED
@@ -1,26 +1,36 @@
  Corpus Processor
  ================

- [![Gem Version](https://fury-badge.herokuapp.com/rb/corpus-processor.png)](http://badge.fury.io/rb/corpus-processor)
+ [![Gem Version][1]](http://badge.fury.io/rb/corpus-processor)
+ [![Build Status][2]](https://travis-ci.org/dasdad/corpus-processor)
+ [![Code Climate][3]](https://codeclimate.com/github/dasdad/corpus-processor)
+ [![Dependency Status][4]](https://gemnasium.com/dasdad/corpus-processor)
+ [![Coverage Status][5]](https://coveralls.io/r/dasdad/corpus-processor)

- * [Versão em português](#versao-em-portugues)
- * [English version](#english-version)
+ * [Versão em português][6]
+ * [English version][7]

- Versão em portuguễs
+ Versão em português
  ===================

- Corpus Processor é uma ferramenta para trabalhar com [Linguística de Corpus](http://pt.wikipedia.org/wiki/Lingu%C3%ADstica_de_corpus). Ele converte _corpora_ entre diferentes formatos para serem usado em ferramentas de Processamento de Linguagem Natural (NLP).
+ Corpus Processor é uma ferramenta para trabalhar com [Linguística de
+ Corpus][8]. Ele converte _corpora_ entre diferentes formatos para serem usados
+ em ferramentas de Processamento de Linguagem Natural (NLP).

- O primeiro propósito do Corpus Processor e seu único recurso implementado até agora é transformar _corpora_ encontrados na [Linguateca](http://www.linguateca.pt) para o formato usado pelo treinamento do [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml).
+ O primeiro propósito do Corpus Processor e seu único recurso implementado
+ até agora é transformar _corpora_ encontrados na [Linguateca][9] para o
+ formato usado pelo treinamento do [Stanford NER][10].

- [Linguateca](http://www.linguateca.pt) é uma fonte de _corpora_ em português.
+ [Linguateca][11] é uma fonte de _corpora_ em português.

- [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml) é uma implementação de [Reconhecimento de Entidade Mencionada (NER)](http://pt.wikipedia.org/wiki/Reconhecimento_de_entidade_mencionada).
+ [Stanford NER][12] é uma implementação de [Reconhecimento de Entidade
+ Mencionada (NER)][13].

  Instalação
  ----------

- Corpus Processor é uma [Ruby](http://www.ruby-lang.org/) [Gem](http://rubygems.org/). Para instalar, dada uma instalação de Ruby, rode:
+ Corpus Processor é uma [Ruby][14] [Gem][15]. Para instalar, dada uma
+ instalação de Ruby, rode:

  ```bash
  $ gem install corpus-processor
@@ -35,43 +45,106 @@ Converter _corpus_ do formato do LâMPADA 2.0 para o formato do Stanford-NER:
  $ corpus-processor process [INPUT_FILE [OUTPUT_FILE]]
  ```

+ As classes reconhecidas por padrão pelo Corpus Processor são `PESSOA`, `LOCAL`
+ e `ORGANIZACAO`. Para configurar outras classes, veja o arquivo de configuração
+ em `lib/corpus-processor/categories/default.yml`.
+
+ Para usar outras configurações, veja as opções com:
+
+ ```bash
+ $ corpus-processor help process
+ ```
+
  Resultados
  ----------

- Para um exemplo de conversão usando o Corpus Processor, veja este [gist](https://gist.github.com/leafac/5259008).
+ Os resultados do uso do [Corpus Processor][16] com um _corpus_ do
+ [LâMPADA 2.0 / Classic HAREM 2.0 Golden Collection][17] - disponível na
+ [Linguateca][20] - estão [neste diretório][19]:
+
+ * `ner-pt_br.training.txt`: O _corpus_ da [Linguateca][20] convertido com o
+ [Corpus Processor][21] para o formato de treinamento do [Stanford NER][22].
+ * `ner-pt_br.training-partial.txt`: Os primeiros 95% do _corpus_ em
+ `ner-pt_br.training.txt`, usados para o teste de precisão do
+ [Stanford NER][23].
+ * `ner-pt_br.test.txt`: Os últimos 5% do _corpus_ em `ner-pt_br.training.txt`,
+ usados para testar o modelo linguístico.
+ * `ner-pt_br.prop`: O arquivo de propriedades no formato do [Stanford NER][24]
+ que é usado para treinar com o `ner-pt_br.training.txt`.
+ * `ner-pt_br.partial.prop`: O arquivo de propriedades no formato do
+ [Stanford NER][25] que é usado para treinar com o
+ `ner-pt_br.training-partial.txt`.
+ * `ner-pt_br.ser.gz`: O modelo linguístico no formato do [Stanford NER][26]
+ resultante do treinamento com o `ner-pt_br.training.txt`.
+ * `ner-pt_br.ser-partial.gz`: O modelo linguístico no formato do
+ [Stanford NER][27] resultante do treinamento com o
+ `ner-pt_br.training-partial.txt`.
+
+ A performance do modelo linguístico testado é:
+
+ ```
+ CRFClassifier tagged 4450 words in 1 documents at 3632.65 words per second.
+ Entity        P       R       F1      TP  FP  FN
+ LOCATION      0.5667  0.3953  0.4658  17  13  26
+ ORGANIZATION  0.4531  0.2500  0.3222  29  35  87
+ PERSON        0.5333  0.7442  0.6214  32  28  11
+ Totals        0.5065  0.3861  0.4382  78  76  124
+ ```
+
+ Essa performance é ruim se comparada com [outros trabalhos][28] sobre o assunto,
+ mas tem servido aos nossos propósitos. Nós continuaremos tentando melhorar
+ essa situação.
+
+ Sugestões são bem-vindas sobre como fazer isso.

- O _corpus_ é do [LâMPADA 2.0 / Classic HAREM 2.0 Golden Collection](http://www.linguateca.pt/HAREM/) e o treinamento usou o [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml).
+ ---

- **Note** que a transformação do Corpus Processor descarta muita informação do _corpus_ anotado. Os _corpora_ usados são bastante ricos em anotações e para tirar completo proveito deles considere usar as ferramentas encontradas na [Linguateca](http://www.linguateca.pt).
+ **Note** que a transformação do Corpus Processor descarta muita informação do
+ _corpus_ anotado. Os _corpora_ usados são bastante ricos em anotações e, para
+ tirar completo proveito deles, considere usar as ferramentas encontradas na
+ [Linguateca][29].

  Para entender melhor, siga as seguintes referências:

- Diana Santos. "O modelo semântico usado no Primeiro HAREM". In Diana Santos & Nuno Cardoso (eds.), Reconhecimento de entidades mencionadas em português: Documentação e actas do HAREM, a primeira avaliação conjunta na área. Linguateca, 2007, pp. 43-57.
- http://www.linguateca.pt/aval_conjunta/LivroHAREM/Cap04-SantosCardoso2007-Santos.pdf
+ ```
+ Diana Santos. "O modelo semântico usado no Primeiro HAREM". In Diana Santos & Nuno Cardoso (eds.), Reconhecimento de entidades mencionadas em português: Documentação e actas do HAREM, a primeira avaliação conjunta na área. Linguateca, 2007, pp. 43-57.
+ http://www.linguateca.pt/aval_conjunta/LivroHAREM/Cap04-SantosCardoso2007-Santos.pdf
+
+ Diana Santos. "Evaluation in natural language processing". European Summer School on Language, Logic and Information (ESSLLI 2007) (Trinity College, Dublin, Irlanda, 6-17 de Agosto de 2007).
+ ```
+
+ ---
+
+ [Leia mais sobre o processo de treinamento][30].

- Diana Santos. "Evaluation in natural language processing". European Summer School on Language, Logic and Information (ESSLLI 2007) (Trinity College, Dublin, Irlanda, 6-17 de Agosto de 2007).

  Agradecimentos
  --------------

- * [Time do HAREM / Linguateca](http://www.linguateca.pt/HAREM) pelo _corpus_ com anotações semânticas em português.
- * *[Time de NLP de Stanford](http://www-nlp.stanford.edu/)* pela ferramenta [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml).
+ * [Time do HAREM / Linguateca][31] pelo _corpus_ com anotações semânticas em
+ português.
+ * *[Time de NLP de Stanford][32]* pela ferramenta [Stanford NER][33].

  English version
  ===============

- Corpus Processor is a tool to work with [Corpus Linguistics](http://en.wikipedia.org/wiki/Corpus_linguistics). It converts _corpora_ between different formats for use in Natural Language Processing (NLP) tools.
+ Corpus Processor is a tool to work with [Corpus Linguistics][34]. It converts
+ _corpora_ between different formats for use in Natural Language Processing
+ (NLP) tools.

- The first purpose of Corpus Processor and its current only feature is to transform _corpora_ found in [Linguateca](http://www.linguateca.pt) into the format used for training in [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml).
+ The first purpose of Corpus Processor and its current only feature is to
+ transform _corpora_ found in [Linguateca][35] into the format used for training
+ in [Stanford NER][36].

- [Linguateca](http://www.linguateca.pt) is an source of _corpora_ in Portuguese.
+ [Linguateca][37] is a source of _corpora_ in Portuguese.

- [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml) is an implementation of [Named Entity Recognition](http://en.wikipedia.org/wiki/Named-entity_recognition).
+ [Stanford NER][38] is an implementation of [Named Entity Recognition][39].

  Installation
  ------------

- Corpus Processor is a [Ruby](http://www.ruby-lang.org/) [Gem](http://rubygems.org/). To install it, given a working installation of Ruby, run:
+ Corpus Processor is a [Ruby][40] [Gem][41]. To install it, given a working
+ installation of Ruby, run:

  ```bash
  $ gem install corpus-processor
@@ -86,27 +159,82 @@ Convert _corpus_ from LâMPADA 2.0 format to Stanford-NER format:
  $ corpus-processor process [INPUT_FILE [OUTPUT_FILE]]
  ```

+ Classes recognized by default in Corpus Processor are `PESSOA` (person),
+ `LOCAL` (location) and `ORGANIZACAO` (organization). In order to configure
+ other classes, refer to `lib/corpus-processor/categories/default.yml`.
+
+ To run with different configurations, consult the options with:
+
+ ```bash
+ $ corpus-processor help process
+ ```
+
  Results
  -------

- For an example of converting one _corpus_ with Corpus Processor, refer to this [gist](https://gist.github.com/leafac/5259008).
+ The results of using [Corpus Processor][42] with a _corpus_ from
+ [LâMPADA 2.0 / Classic HAREM 2.0 Golden Collection][43] - available in
+ [Linguateca][20] - are in [this directory][19]:
+
+ * `ner-pt_br.training.txt`: The _corpus_ from [Linguateca][46] converted with
+ [Corpus Processor][47] to [Stanford NER][48] training format.
+ * `ner-pt_br.training-partial.txt`: The first 95% of the _corpus_ in
+ `ner-pt_br.training.txt`, used for training [Stanford NER][49] for accuracy
+ testing.
+ * `ner-pt_br.test.txt`: The last 5% of the _corpus_ in
+ `ner-pt_br.training.txt`, used to test the language model.
+ * `ner-pt_br.prop`: The property file in [Stanford NER][50]'s format for
+ setting up the training with the whole `ner-pt_br.training.txt`.
+ * `ner-pt_br.partial.prop`: The property file in [Stanford NER][51]'s format
+ for setting up the training with the partial
+ `ner-pt_br.training-partial.txt`.
+ * `ner-pt_br.ser.gz`: The resulting language model for [Stanford NER][52]
+ trained with `ner-pt_br.training.txt`.
+ * `ner-pt_br.ser-partial.gz`: The resulting language model for
+ [Stanford NER][53] trained with `ner-pt_br.training-partial.txt`.
+
+ The performance of the language model under test is:
+
+ ```
+ CRFClassifier tagged 4450 words in 1 documents at 3632.65 words per second.
+ Entity        P       R       F1      TP  FP  FN
+ LOCATION      0.5667  0.3953  0.4658  17  13  26
+ ORGANIZATION  0.4531  0.2500  0.3222  29  35  87
+ PERSON        0.5333  0.7442  0.6214  32  28  11
+ Totals        0.5065  0.3861  0.4382  78  76  124
+ ```
+
+ This performance is poor compared with [other works][54] on the topic,
+ but it has served our purposes well. We'll keep trying to improve on this.
+
+ Suggestions are welcome in this regard.

- The _corpus_ is from [LâMPADA 2.0 / Classic HAREM 2.0 Golden Collection](http://www.linguateca.pt/HAREM/) and the training used [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml).
+ ---

- **Note** that the transformation performed by Corpus Processor discards lots of information from the annotated _corpus_. The _corpora_ used in this process are very rich in annotations, in order to extract all of it consider using one of the tools found in [Linguateca](http://www.linguateca.pt).
+ **Note** that the transformation performed by Corpus Processor discards lots
+ of information from the annotated _corpus_. The _corpora_ used in this process
+ are very rich in annotations; to take full advantage of them, consider using
+ one of the tools found in [Linguateca][55].

  Further information about the subject can be found in the following sources:

- Diana Santos. "O modelo semântico usado no Primeiro HAREM". In Diana Santos & Nuno Cardoso (eds.), Reconhecimento de entidades mencionadas em português: Documentação e actas do HAREM, a primeira avaliação conjunta na área. Linguateca, 2007, pp. 43-57.
- http://www.linguateca.pt/aval_conjunta/LivroHAREM/Cap04-SantosCardoso2007-Santos.pdf
+ ```
+ Diana Santos. "O modelo semântico usado no Primeiro HAREM". In Diana Santos & Nuno Cardoso (eds.), Reconhecimento de entidades mencionadas em português: Documentação e actas do HAREM, a primeira avaliação conjunta na área. Linguateca, 2007, pp. 43-57.
+ http://www.linguateca.pt/aval_conjunta/LivroHAREM/Cap04-SantosCardoso2007-Santos.pdf

- Diana Santos. "Evaluation in natural language processing". European Summer School on Language, Logic and Information (ESSLLI 2007) (Trinity College, Dublin, Irlanda, 6-17 de Agosto de 2007).
+ Diana Santos. "Evaluation in natural language processing". European Summer School on Language, Logic and Information (ESSLLI 2007) (Trinity College, Dublin, Irlanda, 6-17 de Agosto de 2007).
+ ```
+
+ ---
+
+ [Read more about the process of training][56].

  Thanks
  ------

- * [HAREM / Linguateca team](http://www.linguateca.pt/HAREM) for the semantic annotated _corpus_ in Portuguese.
- * *[Stanford NLP team](http://www-nlp.stanford.edu/)* for the [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml) tool.
+ * *[HAREM / Linguateca team][57]* for the semantically annotated _corpus_ in
+ Portuguese.
+ * *[Stanford NLP team][58]* for the [Stanford NER][59] tool.

  Contributing
  ------------
@@ -120,15 +248,24 @@ Contributing
  Changelog
  ---------

- ### 0.0.1
+ ### 0.3.0

- * [LâMPADA 2.0 / Classic HAREM 2.0 Golden Collection](http://www.linguateca.pt/HAREM/) Parser.
- * [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml) Generator.
+ * Stopped using regular expressions in the parser and started using [Nokogiri][62].
+ * Fixed missing punctuation.
+ * Fixed inconsistencies in tagging. The issue was caused by `<ALT>` tags.
+ * Accepted category definitions from users.
+ * Added several code quality measures.
+ * Added documentation.

- ### 0.0.2
+ ### 0.2.0

  * Renamed Harem to LâMPADA, as asked by Linguateca's team.

+ ### 0.0.1
+
+ * [LâMPADA 2.0 / Classic HAREM 2.0 Golden Collection][60] Parser.
+ * [Stanford NER][61] Generator.
+
  License
  -------

@@ -154,3 +291,67 @@ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
  LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
  OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
  WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
+
+ [1]: https://fury-badge.herokuapp.com/rb/corpus-processor.png
+ [2]: https://travis-ci.org/dasdad/corpus-processor.png
+ [3]: https://codeclimate.com/github/dasdad/corpus-processor.png
+ [4]: https://gemnasium.com/dasdad/corpus-processor.png
+ [5]: https://coveralls.io/repos/dasdad/corpus-processor/badge.png
+ [6]: #verso-em-portugus
+ [7]: #english-version
+ [8]: http://pt.wikipedia.org/wiki/Lingu%C3%ADstica_de_corpus
+ [9]: http://www.linguateca.pt
+ [10]: http://nlp.stanford.edu/software/CRF-NER.shtml
+ [11]: http://www.linguateca.pt
+ [12]: http://nlp.stanford.edu/software/CRF-NER.shtml
+ [13]: http://pt.wikipedia.org/wiki/Reconhecimento_de_entidade_mencionada
+ [14]: http://www.ruby-lang.org/
+ [15]: http://rubygems.org/
+ [16]: https://github.com/dasdad/corpus-processor
+ [17]: http://www.linguateca.pt/HAREM/
+ [18]: http://www.linguateca.pt/
+ [19]: https://www.dropbox.com/sh/8p6cbbcaoyv23u7/GxY0qKObYV/corpus
+ [20]: http://www.linguateca.pt/
+ [21]: https://github.com/dasdad/corpus-processor
+ [22]: http://nlp.stanford.edu/software/CRF-NER.shtml
+ [23]: http://nlp.stanford.edu/software/CRF-NER.shtml
+ [24]: http://nlp.stanford.edu/software/CRF-NER.shtml
+ [25]: http://nlp.stanford.edu/software/CRF-NER.shtml
+ [26]: http://nlp.stanford.edu/software/CRF-NER.shtml
+ [27]: http://nlp.stanford.edu/software/CRF-NER.shtml
+ [28]: ftp://ftp.inf.puc-rio.br/pub/docs/techreports/07_09_duarte.pdf
+ [29]: http://www.linguateca.pt
+ [30]: http://nlp.stanford.edu/software/crf-faq.shtml
+ [31]: http://www.linguateca.pt/HAREM
+ [32]: http://www-nlp.stanford.edu/
+ [33]: http://nlp.stanford.edu/software/CRF-NER.shtml
+ [34]: http://en.wikipedia.org/wiki/Corpus_linguistics
+ [35]: http://www.linguateca.pt
+ [36]: http://nlp.stanford.edu/software/CRF-NER.shtml
+ [37]: http://www.linguateca.pt
+ [38]: http://nlp.stanford.edu/software/CRF-NER.shtml
+ [39]: http://en.wikipedia.org/wiki/Named-entity_recognition
+ [40]: http://www.ruby-lang.org/
+ [41]: http://rubygems.org/
+ [42]: https://github.com/dasdad/corpus-processor
+ [43]: http://www.linguateca.pt/HAREM/
+ [44]: http://www.linguateca.pt/
+ [45]: https://www.dropbox.com/sh/8p6cbbcaoyv23u7/GxY0qKObYV/corpus
+ [46]: http://www.linguateca.pt/
+ [47]: https://github.com/dasdad/corpus-processor
+ [48]: http://nlp.stanford.edu/software/CRF-NER.shtml
+ [49]: http://nlp.stanford.edu/software/CRF-NER.shtml
+ [50]: http://nlp.stanford.edu/software/CRF-NER.shtml
+ [51]: http://nlp.stanford.edu/software/CRF-NER.shtml
+ [52]: http://nlp.stanford.edu/software/CRF-NER.shtml
+ [53]: http://nlp.stanford.edu/software/CRF-NER.shtml
+ [54]: ftp://ftp.inf.puc-rio.br/pub/docs/techreports/07_09_duarte.pdf
+ [55]: http://www.linguateca.pt
+ [56]: http://nlp.stanford.edu/software/crf-faq.shtml
+ [57]: http://www.linguateca.pt/HAREM
+ [58]: http://www-nlp.stanford.edu/
+ [59]: http://nlp.stanford.edu/software/CRF-NER.shtml
+ [60]: http://www.linguateca.pt/HAREM/
+ [61]: http://nlp.stanford.edu/software/CRF-NER.shtml
+ [62]: http://nokogiri.org/
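
The README above says the default classes are `PESSOA`, `LOCAL` and `ORGANIZACAO` and that other classes can be configured through a categories file. Below is a minimal sketch of such a mapping; the actual keys and layout of `lib/corpus-processor/categories/default.yml` are an assumption here, and only the class names and the Stanford NER labels (`PERSON`, `LOCATION`, `ORGANIZATION`) come from the README itself.

```ruby
require 'yaml'

# Hypothetical categories mapping in the spirit of
# lib/corpus-processor/categories/default.yml -- the real file may differ.
categories_yaml = <<-YAML
categories:
  PESSOA: PERSON
  LOCAL: LOCATION
  ORGANIZACAO: ORGANIZATION
YAML

# Load the mapping and translate a HAREM-style class into the label that
# appears in the Stanford NER performance report above.
categories = YAML.load(categories_yaml)['categories']
puts categories.fetch('PESSOA', 'O')  # => PERSON
```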
data/bin/corpus-processor CHANGED
@@ -1,9 +1,9 @@
  #!/usr/bin/env ruby

- $LOAD_PATH.unshift(File.expand_path("../../lib", __FILE__))
+ $LOAD_PATH.unshift(File.expand_path('../../lib', __FILE__))

- require "bundler/setup"
+ require 'bundler/setup'

- require "corpus-processor/cli"
+ require 'corpus-processor/cli'

  CorpusProcessor::Cli.start(ARGV)
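
The executable above only boots `CorpusProcessor::Cli`, which the gemspec's `thor` dependency suggests is a Thor command class. The sketch below shows how a `process` command with a user-supplied categories file could be wired up in Thor; it illustrates the pattern only, not the gem's actual `lib/corpus-processor/cli.rb`, and the `--categories` option name is an assumption.

```ruby
require 'thor'

# Illustrative Thor CLI in the shape suggested by `corpus-processor help process`;
# the gem's real CLI class and option names may differ.
module CorpusProcessor
  class Cli < Thor
    desc 'process [INPUT_FILE [OUTPUT_FILE]]',
         'Convert a corpus from LâMPADA 2.0 format to Stanford NER format'
    method_option :categories, type: :string,
                               desc: 'Path to a categories YAML file (assumed option)'
    def process(input_file = nil, output_file = nil)
      input  = input_file ? File.read(input_file) : $stdin.read
      output = input # placeholder: the real conversion is done by the gem's processor
      output_file ? File.write(output_file, output) : $stdout.write(output)
    end
  end
end

CorpusProcessor::Cli.start(ARGV)
```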
data/corpus-processor.gemspec CHANGED
@@ -3,26 +3,28 @@ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
  require 'corpus-processor/version'

  Gem::Specification.new do |spec|
- spec.name = "corpus-processor"
+ spec.name = 'corpus-processor'
  spec.version = CorpusProcessor::VERSION
- spec.authors = ["Das Dad"]
- spec.email = ["feedback@dasdad.com.br"]
- spec.description = %q{Process linguistic corpus}
- spec.summary = %q{Handle linguistic corpus and convert it to use NLP tools}
- spec.homepage = "https://github.com/dasdad/corpus-processor"
- spec.license = "MIT"
+ spec.authors = ['Das Dad']
+ spec.email = ['dev@dasdad.com.br']
+ spec.description = 'Process linguistic corpus'
+ spec.summary = 'Handle linguistic corpus and convert it to use NLP ' \
+ 'tools'
+ spec.homepage = 'https://github.com/dasdad/corpus-processor'
+ spec.license = 'MIT'

  spec.files = `git ls-files`.split($/)
  spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
  spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
- spec.require_paths = ["lib"]
+ spec.require_paths = ['lib']

- spec.required_ruby_version = Gem::Requirement.new(">= 2.0.0")
+ spec.required_ruby_version = Gem::Requirement.new('>= 2.0.0')

- spec.add_dependency "thor"
+ spec.add_dependency 'thor'
+ spec.add_dependency 'nokogiri'

- spec.add_development_dependency "bundler", "~> 1.3"
- spec.add_development_dependency "rspec"
- spec.add_development_dependency "fakefs"
- spec.add_development_dependency "pry-nav"
+ spec.add_development_dependency 'bundler', '~> 1.3'
+ spec.add_development_dependency 'rspec', '2.14.0.rc1'
+ spec.add_development_dependency 'pry-nav'
+ spec.add_development_dependency 'coveralls'
  end
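
The new `coveralls` development dependency (together with the Coverage Status badge added to the README) implies coverage is reported from the test suite. A minimal sketch of the usual hookup is below; whether this gem's `spec/spec_helper.rb` does exactly this is an assumption.

```ruby
# spec/spec_helper.rb (sketch): Coveralls is normally started before the code
# under test is required, so coverage is tracked from the first require.
require 'coveralls'
Coveralls.wear!

require 'corpus-processor'
```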
data/lib/corpus-processor.rb CHANGED
@@ -1,8 +1,12 @@
- require "corpus-processor/version"
- require "corpus-processor/token"
- require "corpus-processor/default_categories"
- require "corpus-processor/traverser"
- require "corpus-processor/tokenizer"
- require "corpus-processor/parsers"
- require "corpus-processor/generators"
- require "corpus-processor/processor"
+ require 'nokogiri'
+ require 'yaml'
+
+ module CorpusProcessor
+ end
+
+ require 'corpus-processor/version'
+ require 'corpus-processor/token'
+ require 'corpus-processor/categories'
+ require 'corpus-processor/parsers'
+ require 'corpus-processor/generators'
+ require 'corpus-processor/processor'
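
The reworked entry point above declares the `CorpusProcessor` namespace up front and then requires its components. A small self-contained sketch of how component files typically reopen that namespace follows; the `Token` shown is hypothetical, not the gem's real class.

```ruby
# Pattern sketch only: the top-level file declares the namespace first...
module CorpusProcessor
end

# ...and a component file such as lib/corpus-processor/token.rb (hypothetical
# contents) then simply reopens it.
module CorpusProcessor
  Token = Struct.new(:word, :category)
end

puts CorpusProcessor::Token.new('Lisboa', :LOCAL).inspect
```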