corpus-processor 0.0.1 → 0.2.0
- checksums.yaml +4 -4
- data/README.md +82 -12
- data/lib/corpus-processor/cli.rb +1 -1
- data/lib/corpus-processor/parsers.rb +1 -1
- data/lib/corpus-processor/parsers/{harem.rb → lampada.rb} +1 -1
- data/lib/corpus-processor/processor.rb +1 -1
- data/lib/corpus-processor/version.rb +1 -1
- data/spec/unit/parsers/{harem_spec.rb → lampada_spec.rb} +6 -6
- metadata +5 -5
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 0b40f1ccc5e1f007f584f6c0bf037b0221d65cec
+  data.tar.gz: 5b486e05f2372b163a1399244ed2861c239bea02
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: ec94f33cf3ff79a6874130ddbfbb10df20186ff9a5ffb176de48d92aca56b43bfd0d679e6e88f4cc4e215dfd253d7d697b8f68cd99fabbd05fce7b0ab8e761e4
+  data.tar.gz: 9e82dff64190b3dd04c33a31d5df024b5b3522de70c13075dc29b2d6c6b431020aded272818315d7cbcef5997bc6fa71e0e00aa1c001c692ec3be8b181f4c2ab
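The digests in `checksums.yaml` cover the two archives packed inside the `.gem` file. A minimal sketch of how one might check them in Ruby, assuming `metadata.gz` and `data.tar.gz` have already been extracted from the gem. The helper name `digests_match?` is hypothetical and belongs to neither RubyGems nor corpus-processor:

```ruby
require "digest/sha1"
require "digest/sha2"

# Maps the algorithm names used as top-level keys in checksums.yaml
# to Ruby's standard digest classes.
DIGEST_CLASSES = {
  "SHA1"   => Digest::SHA1,
  "SHA512" => Digest::SHA512,
}.freeze

# Hypothetical helper: true when every expected digest matches the file.
# `expected` looks like { "SHA1" => "<hex>", "SHA512" => "<hex>" }.
def digests_match?(path, expected)
  data = File.binread(path)
  expected.all? do |algorithm, hex|
    DIGEST_CLASSES.fetch(algorithm).hexdigest(data) == hex
  end
end

# Example call, assuming metadata.gz sits in the current directory:
#   digests_match?("metadata.gz",
#                  "SHA1" => "0b40f1ccc5e1f007f584f6c0bf037b0221d65cec")
```

An unknown algorithm name raises `KeyError` through `fetch`, which seems preferable to silently skipping a digest.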
data/README.md
CHANGED
@@ -1,15 +1,72 @@
 Corpus Processor
 ================
 
-](http://badge.fury.io/rb/corpus-processor)
 
-
+* [Versão em português](#versao-em-portugues)
+* [English version](#english-version)
+
+Versão em portuguễs
+===================
+
+Corpus Processor é uma ferramenta para trabalhar com [Linguística de Corpus](http://pt.wikipedia.org/wiki/Lingu%C3%ADstica_de_corpus). Ele converte _corpora_ entre diferentes formatos para serem usado em ferramentas de Processamento de Linguagem Natural (NLP).
+
+O primeiro propósito do Corpus Processor e seu único recurso implementado até agora é transformar _corpora_ encontrados na [Linguateca](http://www.linguateca.pt) para o formato usado pelo treinamento do [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml).
+
+[Linguateca](http://www.linguateca.pt) é uma fonte de _corpora_ em português.
+
+[Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml) é uma implementação de [Reconhecimento de Entidade Mencionada (NER)](http://pt.wikipedia.org/wiki/Reconhecimento_de_entidade_mencionada).
+
+Instalação
+----------
+
+Corpus Processor é uma [Ruby](http://www.ruby-lang.org/) [Gem](http://rubygems.org/). Para instalar, dada uma instalação de Ruby, rode:
+
+```bash
+$ gem install corpus_processor
+```
+
+Uso
+---
+
+Converter _corpus_ do formato do LâMPADA 2.0 para o formato do Stanford-NER:
+
+```bash
+$ corpus-processor process [INPUT_FILE [OUTPUT_FILE]]
+```
+
+Resultados
+----------
+
+Para um exemplo de conversão usando o Corpus Processor, veja este [gist](https://gist.github.com/leafac/5259008).
+
+O _corpus_ é do [LâMPADA 2.0 / Classic HAREM 2.0 Golden Collection](http://www.linguateca.pt/HAREM/) e o treinamento usou o [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml).
+
+**Note** que a transformação do Corpus Processor descarta muita informação do _corpus_ anotado. Os _corpora_ usados são bastante ricos em anotações e para tirar completo proveito deles considere usar as ferramentas encontradas na [Linguateca](http://www.linguateca.pt).
+
+Para entender melhor, siga as seguintes referências:
+
+Diana Santos. "O modelo semântico usado no Primeiro HAREM". In Diana Santos & Nuno Cardoso (eds.), Reconhecimento de entidades mencionadas em português: Documentação e actas do HAREM, a primeira avaliação conjunta na área. Linguateca, 2007, pp. 43-57.
+http://www.linguateca.pt/aval_conjunta/LivroHAREM/Cap04-SantosCardoso2007-Santos.pdf
+
+Diana Santos. "Evaluation in natural language processing". European Summer School on Language, Logic and Information (ESSLLI 2007) (Trinity College, Dublin, Irlanda, 6-17 de Agosto de 2007).
+
+Agradecimentos
+--------------
+
+* [Time do HAREM / Linguateca](http://www.linguateca.pt/HAREM) pelo _corpus_ com anotações semânticas em português.
+* *[Time de NLP de Stanford](http://www-nlp.stanford.edu/)* pela ferramenta [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml).
+
+English version
+===============
+
+Corpus Processor is a tool to work with [Corpus Linguistics](http://en.wikipedia.org/wiki/Corpus_linguistics). It converts _corpora_ between different formats for use in Natural Language Processing (NLP) tools.
 
 The first purpose of Corpus Processor and its current only feature is to transform _corpora_ found in [Linguateca](http://www.linguateca.pt) into the format used for training in [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml).
 
-[Linguateca](http://www.linguateca.pt) is an
+[Linguateca](http://www.linguateca.pt) is an source of _corpora_ in Portuguese.
 
-[Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml) is an
+[Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml) is an implementation of [Named Entity Recognition](http://en.wikipedia.org/wiki/Named-entity_recognition).
 
 Installation
 ------------
@@ -23,7 +80,7 @@ $ gem install corpus_processor
 Usage
 -----
 
-Convert
+Convert _corpus_ from LâMPADA 2.0 format to Stanford-NER format:
 
 ```bash
 $ corpus-processor process [INPUT_FILE [OUTPUT_FILE]]
@@ -32,9 +89,24 @@ $ corpus-processor process [INPUT_FILE [OUTPUT_FILE]]
 Results
 -------
 
-For an example of converting one
+For an example of converting one _corpus_ with Corpus Processor, refer to this [gist](https://gist.github.com/leafac/5259008).
+
+The _corpus_ is from [LâMPADA 2.0 / Classic HAREM 2.0 Golden Collection](http://www.linguateca.pt/HAREM/) and the training used [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml).
+
+**Note** that the transformation performed by Corpus Processor discards lots of information from the annotated _corpus_. The _corpora_ used in this process are very rich in annotations, in order to extract all of it consider using one of the tools found in [Linguateca](http://www.linguateca.pt).
 
-
+Further information about the subject can be found in the following sources:
+
+Diana Santos. "O modelo semântico usado no Primeiro HAREM". In Diana Santos & Nuno Cardoso (eds.), Reconhecimento de entidades mencionadas em português: Documentação e actas do HAREM, a primeira avaliação conjunta na área. Linguateca, 2007, pp. 43-57.
+http://www.linguateca.pt/aval_conjunta/LivroHAREM/Cap04-SantosCardoso2007-Santos.pdf
+
+Diana Santos. "Evaluation in natural language processing". European Summer School on Language, Logic and Information (ESSLLI 2007) (Trinity College, Dublin, Irlanda, 6-17 de Agosto de 2007).
+
+Thanks
+------
+
+* [HAREM / Linguateca team](http://www.linguateca.pt/HAREM) for the semantic annotated _corpus_ in Portuguese.
+* *[Stanford NLP team](http://www-nlp.stanford.edu/)* for the [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml) tool.
 
 Contributing
 ------------
@@ -50,14 +122,12 @@ Changelog
 
 ### 0.0.1
 
-* [
+* [LâMPADA 2.0 / Classic HAREM 2.0 Golden Collection](http://www.linguateca.pt/HAREM/) Parser.
 * [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml) Generator.
 
-
-------
+### 0.0.2
 
-*
-* *[Stanford NLP team](http://www-nlp.stanford.edu/)* for the [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml) tool.
+* Renamed Harem to LâMPADA, as asked by Linguateca's team.
 
 License
 -------
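The README describes the conversion only at a high level. As a rough illustration of the target side, Stanford NER's CRF classifier is commonly trained on plain text with one token per line, followed by a tab and its label, where `O` marks tokens outside any entity. The sketch below emits that layout from already-tokenized input. The `Token` struct, the `to_stanford_ner` helper, the label names, and the sample sentence are assumptions for illustration; they are not taken from corpus-processor's own `Token` or generator classes:

```ruby
# Hypothetical token type: a word plus an optional entity category.
Token = Struct.new(:word, :category)

# Label used here for tokens that are not part of any entity.
OUTSIDE_LABEL = "O"

# Renders tokens as "word<TAB>LABEL" lines, one token per line.
def to_stanford_ner(tokens)
  tokens.map { |token|
    label = token.category ? token.category.to_s.upcase : OUTSIDE_LABEL
    "#{token.word}\t#{label}"
  }.join("\n")
end

# Made-up sample sentence, already split into tokens.
tokens = [
  Token.new("Maria", :person),
  Token.new("mora", nil),
  Token.new("em", nil),
  Token.new("Lisboa", :location),
]

puts to_stanford_ner(tokens)
```

Only the token and its category survive in this layout, which is consistent with the README's note that the conversion discards much of the annotation in the source _corpus_.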
data/lib/corpus-processor/cli.rb
CHANGED
@@ -4,7 +4,7 @@ require "thor"
 module CorpusProcessor
   class Cli < ::Thor
 
-    desc "process [INPUT_FILE [OUTPUT_FILE]] ", "convert corpus from
+    desc "process [INPUT_FILE [OUTPUT_FILE]] ", "convert corpus from LâMPADA format to Stanford-NER format"
     def process(input_file = $stdin, output_file = $stdout)
       input_file = File.new( input_file, "r") if input_file.is_a? String
       output_file = File.new(output_file, "w") if output_file.is_a? String
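The `process` signature in this hunk defaults to `$stdin` and `$stdout` and opens a file only when an argument arrives as a `String`. A standalone sketch of that pattern follows, without Thor. `open_stream` and `copy_upcased` are hypothetical names, and the upcasing merely stands in for the real conversion:

```ruby
# Hypothetical helper: turn a path into an open File, pass IO objects through.
def open_stream(target, mode)
  target.is_a?(String) ? File.new(target, mode) : target
end

# Reads everything from the input and writes a transformed copy to the output.
# The defaults mirror the hunk: standard input and standard output.
def copy_upcased(input = $stdin, output = $stdout)
  input  = open_stream(input, "r")
  output = open_stream(output, "w")
  output.write(input.read.upcase)
  output.flush
end
```

Accepting either a path or an IO object keeps the command usable in shell pipelines and easy to exercise with `StringIO` in tests. A fuller version would also close any file it opened itself.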
data/lib/corpus-processor/parsers.rb
CHANGED
@@ -1 +1 @@
-require "corpus-processor/parsers/
+require "corpus-processor/parsers/lampada"
data/spec/unit/parsers/{harem_spec.rb → lampada_spec.rb}
@@ -1,10 +1,10 @@
 require "spec_helper"
 
-describe CorpusProcessor::Parsers::
-  subject(:
+describe CorpusProcessor::Parsers::Lampada do
+  subject(:lampada) { CorpusProcessor::Parsers::Lampada.new }
 
   describe "#parse" do
-    subject {
+    subject { lampada.parse(corpus) }
 
     context "default categories" do
       context "empty corpus" do
@@ -193,8 +193,8 @@ CORPUS
     end
 
     context "user-defined categories" do
-      let(:
-        CorpusProcessor::Parsers::
+      let(:lampada) {
+        CorpusProcessor::Parsers::Lampada.new({
           "FRUTA" => :fruit,
           "LIVRO" => :book,
         })
@@ -240,7 +240,7 @@ CORPUS
   end
 
   describe "#extract_category" do
-    subject {
+    subject { lampada.extract_category(categories) }
 
     context "empty categories" do
       let(:categories) { "" }
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: corpus-processor
 version: !ruby/object:Gem::Version
-  version: 0.0
+  version: 0.2.0
 platform: ruby
 authors:
 - Das Dad
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2013-
+date: 2013-04-01 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: thor
@@ -100,7 +100,7 @@ files:
 - lib/corpus-processor/generators.rb
 - lib/corpus-processor/generators/stanford_ner.rb
 - lib/corpus-processor/parsers.rb
-- lib/corpus-processor/parsers/
+- lib/corpus-processor/parsers/lampada.rb
 - lib/corpus-processor/processor.rb
 - lib/corpus-processor/token.rb
 - lib/corpus-processor/tokenizer.rb
@@ -109,7 +109,7 @@ files:
 - spec/integration/cli_spec.rb
 - spec/spec_helper.rb
 - spec/unit/generators/stanford_ner_spec.rb
-- spec/unit/parsers/
+- spec/unit/parsers/lampada_spec.rb
 - spec/unit/processor.rb
 - spec/unit/token_spec.rb
 - spec/unit/tokenizer_spec.rb
@@ -142,7 +142,7 @@ test_files:
 - spec/integration/cli_spec.rb
 - spec/spec_helper.rb
 - spec/unit/generators/stanford_ner_spec.rb
-- spec/unit/parsers/
+- spec/unit/parsers/lampada_spec.rb
 - spec/unit/processor.rb
 - spec/unit/token_spec.rb
 - spec/unit/tokenizer_spec.rb