corpus-processor 0.0.1 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 8a0ff96102528239769c105832893034e21434bf
- data.tar.gz: 625ffe80fa8399f20610e048c6ce346a69eef9c0
+ metadata.gz: 0b40f1ccc5e1f007f584f6c0bf037b0221d65cec
+ data.tar.gz: 5b486e05f2372b163a1399244ed2861c239bea02
  SHA512:
- metadata.gz: 1716f52826fa5b895977760e33f5e918a9b7fcebd0d3448b6419c4cb9e8d1b7902f8d99cb6646f4b33693f5743aac3802bc4476e1eac9db555cd188d52acb9e0
- data.tar.gz: 770efa624c0c2fcb0b3170d10dcce05069f90650f04b775f6d5662c9ac4b61b71f7884831e4903c015398d21cb21498206f6f2bc4a41cf59b6905d887222d9b8
+ metadata.gz: ec94f33cf3ff79a6874130ddbfbb10df20186ff9a5ffb176de48d92aca56b43bfd0d679e6e88f4cc4e215dfd253d7d697b8f68cd99fabbd05fce7b0ab8e761e4
+ data.tar.gz: 9e82dff64190b3dd04c33a31d5df024b5b3522de70c13075dc29b2d6c6b431020aded272818315d7cbcef5997bc6fa71e0e00aa1c001c692ec3be8b181f4c2ab
data/README.md CHANGED
@@ -1,15 +1,72 @@
  Corpus Processor
  ================
 
- ![Corpus Processor](http://badge.fury.io/rb/corpus-processor)
+ [![Gem Version](https://fury-badge.herokuapp.com/rb/corpus-processor.png)](http://badge.fury.io/rb/corpus-processor)
 
- Tool to work with [Corpus Linguistics](http://en.wikipedia.org/wiki/Corpus_linguistics). Corpus Processor converts _corpora_ between different formats for use in Natural Language Processing (NLP) tools.
+ * [Versão em português](#versao-em-portugues)
+ * [English version](#english-version)
+
+ Versão em português
+ ===================
+
+ Corpus Processor é uma ferramenta para trabalhar com [Linguística de Corpus](http://pt.wikipedia.org/wiki/Lingu%C3%ADstica_de_corpus). Ele converte _corpora_ entre diferentes formatos para serem usado em ferramentas de Processamento de Linguagem Natural (NLP).
+
+ O primeiro propósito do Corpus Processor e seu único recurso implementado até agora é transformar _corpora_ encontrados na [Linguateca](http://www.linguateca.pt) para o formato usado pelo treinamento do [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml).
+
+ [Linguateca](http://www.linguateca.pt) é uma fonte de _corpora_ em português.
+
+ [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml) é uma implementação de [Reconhecimento de Entidade Mencionada (NER)](http://pt.wikipedia.org/wiki/Reconhecimento_de_entidade_mencionada).
+
+ Instalação
+ ----------
+
+ Corpus Processor é uma [Ruby](http://www.ruby-lang.org/) [Gem](http://rubygems.org/). Para instalar, dada uma instalação de Ruby, rode:
+
+ ```bash
+ $ gem install corpus_processor
+ ```
+
+ Uso
+ ---
+
+ Converter _corpus_ do formato do LâMPADA 2.0 para o formato do Stanford-NER:
+
+ ```bash
+ $ corpus-processor process [INPUT_FILE [OUTPUT_FILE]]
+ ```
+
+ Resultados
+ ----------
+
+ Para um exemplo de conversão usando o Corpus Processor, veja este [gist](https://gist.github.com/leafac/5259008).
+
+ O _corpus_ é do [LâMPADA 2.0 / Classic HAREM 2.0 Golden Collection](http://www.linguateca.pt/HAREM/) e o treinamento usou o [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml).
+
+ **Note** que a transformação do Corpus Processor descarta muita informação do _corpus_ anotado. Os _corpora_ usados são bastante ricos em anotações e para tirar completo proveito deles considere usar as ferramentas encontradas na [Linguateca](http://www.linguateca.pt).
+
+ Para entender melhor, siga as seguintes referências:
+
+ Diana Santos. "O modelo semântico usado no Primeiro HAREM". In Diana Santos & Nuno Cardoso (eds.), Reconhecimento de entidades mencionadas em português: Documentação e actas do HAREM, a primeira avaliação conjunta na área. Linguateca, 2007, pp. 43-57.
+ http://www.linguateca.pt/aval_conjunta/LivroHAREM/Cap04-SantosCardoso2007-Santos.pdf
+
+ Diana Santos. "Evaluation in natural language processing". European Summer School on Language, Logic and Information (ESSLLI 2007) (Trinity College, Dublin, Irlanda, 6-17 de Agosto de 2007).
+
+ Agradecimentos
+ --------------
+
+ * [Time do HAREM / Linguateca](http://www.linguateca.pt/HAREM) pelo _corpus_ com anotações semânticas em português.
+ * *[Time de NLP de Stanford](http://www-nlp.stanford.edu/)* pela ferramenta [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml).
+
+ English version
+ ===============
+
+ Corpus Processor is a tool to work with [Corpus Linguistics](http://en.wikipedia.org/wiki/Corpus_linguistics). It converts _corpora_ between different formats for use in Natural Language Processing (NLP) tools.
 
  The first purpose of Corpus Processor and its current only feature is to transform _corpora_ found in [Linguateca](http://www.linguateca.pt) into the format used for training in [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml).
 
- [Linguateca](http://www.linguateca.pt) is an excellent source of _corpora_ in Portuguese.
+ [Linguateca](http://www.linguateca.pt) is a source of _corpora_ in Portuguese.
 
- [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml) is an excellent implementation of [Named Entity Recognition](http://en.wikipedia.org/wiki/Named-entity_recognition).
+ [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml) is an implementation of [Named Entity Recognition](http://en.wikipedia.org/wiki/Named-entity_recognition).
 
  Installation
  ------------
@@ -23,7 +80,7 @@ $ gem install corpus_processor
  Usage
  -----
 
- Convert corpus from HAREM format to Stanford-NER format:
+ Convert _corpus_ from LâMPADA 2.0 format to Stanford-NER format:
 
  ```bash
  $ corpus-processor process [INPUT_FILE [OUTPUT_FILE]]
@@ -32,9 +89,24 @@ $ corpus-processor process [INPUT_FILE [OUTPUT_FILE]]
  Results
  -------
 
- For an example of converting one corpus with Corpus Processor, refer to this [gist](https://gist.github.com/leafac/5259008).
+ For an example of converting one _corpus_ with Corpus Processor, refer to this [gist](https://gist.github.com/leafac/5259008).
+
+ The _corpus_ is from [LâMPADA 2.0 / Classic HAREM 2.0 Golden Collection](http://www.linguateca.pt/HAREM/) and the training used [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml).
+
+ **Note** that the transformation performed by Corpus Processor discards lots of information from the annotated _corpus_. The _corpora_ used in this process are very rich in annotations, in order to extract all of it consider using one of the tools found in [Linguateca](http://www.linguateca.pt).
 
- The corpus is from [Linguateca](http://www.linguateca.pt/HAREM/) and the training used [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml).
+ Further information about the subject can be found in the following sources:
+
+ Diana Santos. "O modelo semântico usado no Primeiro HAREM". In Diana Santos & Nuno Cardoso (eds.), Reconhecimento de entidades mencionadas em português: Documentação e actas do HAREM, a primeira avaliação conjunta na área. Linguateca, 2007, pp. 43-57.
+ http://www.linguateca.pt/aval_conjunta/LivroHAREM/Cap04-SantosCardoso2007-Santos.pdf
+
+ Diana Santos. "Evaluation in natural language processing". European Summer School on Language, Logic and Information (ESSLLI 2007) (Trinity College, Dublin, Irlanda, 6-17 de Agosto de 2007).
+
+ Thanks
+ ------
+
+ * [HAREM / Linguateca team](http://www.linguateca.pt/HAREM) for the semantic annotated _corpus_ in Portuguese.
+ * *[Stanford NLP team](http://www-nlp.stanford.edu/)* for the [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml) tool.
 
  Contributing
  ------------
@@ -50,14 +122,12 @@ Changelog
 
  ### 0.0.1
 
- * [Harem](http://www.linguateca.pt/HAREM/) Parser.
+ * [LâMPADA 2.0 / Classic HAREM 2.0 Golden Collection](http://www.linguateca.pt/HAREM/) Parser.
  * [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml) Generator.
 
- Thanks
- ------
+ ### 0.0.2
 
- * *Diana Santos* and her team in [Linguateca](http://www.linguateca.pt) for the semantic annotated corpus in Portuguese.
- * *[Stanford NLP team](http://www-nlp.stanford.edu/)* for the [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml) tool.
+ * Renamed Harem to LâMPADA, as asked by Linguateca's team.
 
  License
  -------
@@ -4,7 +4,7 @@ require "thor"
  module CorpusProcessor
  class Cli < ::Thor
 
- desc "process [INPUT_FILE [OUTPUT_FILE]] ", "convert corpus from HAREM format to Stanford-NER format"
+ desc "process [INPUT_FILE [OUTPUT_FILE]] ", "convert corpus from LâMPADA format to Stanford-NER format"
  def process(input_file = $stdin, output_file = $stdout)
  input_file = File.new( input_file, "r") if input_file.is_a? String
  output_file = File.new(output_file, "w") if output_file.is_a? String
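
The `process` signature in the hunk above defaults to `$stdin` and `$stdout` and opens a file only when it receives a `String` path. The following is a minimal standalone sketch of that pattern, not the gem's code: `open_io` and `convert` are hypothetical names, and the upcasing step is only a placeholder for a real conversion.

```ruby
require "stringio"

# Hypothetical helper, not part of corpus-processor: open a file only when
# given a String path; pass IO-like objects through unchanged.
def open_io(target, mode)
  target.is_a?(String) ? File.new(target, mode) : target
end

# Hypothetical stand-in for a real conversion step: upcases the input.
def convert(input = $stdin, output = $stdout)
  input  = open_io(input, "r")
  output = open_io(output, "w")
  output.write(input.read.upcase)
  output
end

# StringIO objects are accepted as-is, which keeps the method easy to test.
result = convert(StringIO.new("lisboa"), StringIO.new)
puts result.string
```

Accepting either a path or an IO object lets the same method serve shell pipelines and in-memory tests without separate code paths.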
@@ -1 +1 @@
- require "corpus-processor/parsers/harem"
+ require "corpus-processor/parsers/lampada"
@@ -1,5 +1,5 @@
  module CorpusProcessor::Parsers
- class Harem
+ class Lampada
 
  CATEGORY_REGEX = /
  (?<any_text> .*? ){0}
@@ -1,5 +1,5 @@
  class CorpusProcessor::Processor
- def initialize(parser = CorpusProcessor::Parsers::Harem.new,
+ def initialize(parser = CorpusProcessor::Parsers::Lampada.new,
  generator = CorpusProcessor::Generators::StanfordNer.new)
  @parser = parser
  @generator = generator
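
The constructor in the hunk above takes its parser and generator as arguments with defaults, so the rename only changes which default is built. As a rough illustration of that default-argument injection, here is a self-contained sketch; every class name in it is a hypothetical stand-in, and the `process` and `generate` method names are assumptions (only `parse` appears in the spec hunks of this diff).

```ruby
# All classes here are hypothetical stand-ins, not the gem's real classes.
class UpcaseParser
  # Splits text on whitespace and upcases each token.
  def parse(text)
    text.split.map(&:upcase)
  end
end

class LineGenerator
  # Emits one token per line.
  def generate(tokens)
    tokens.join("\n")
  end
end

class SketchProcessor
  # Defaults cover the common case; callers may inject replacements.
  def initialize(parser = UpcaseParser.new, generator = LineGenerator.new)
    @parser = parser
    @generator = generator
  end

  # Assumed pipeline: parse the text, then hand the tokens to the generator.
  def process(text)
    @generator.generate(@parser.parse(text))
  end
end

puts SketchProcessor.new.process("maria lisboa")
```

Because collaborators are passed in, a test can substitute any object that responds to `parse` without touching the processor itself.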
@@ -1,3 +1,3 @@
  module CorpusProcessor
- VERSION = "0.0.1"
+ VERSION = "0.2.0"
  end
@@ -1,10 +1,10 @@
  require "spec_helper"
 
- describe CorpusProcessor::Parsers::Harem do
- subject(:harem) { CorpusProcessor::Parsers::Harem.new }
+ describe CorpusProcessor::Parsers::Lampada do
+ subject(:lampada) { CorpusProcessor::Parsers::Lampada.new }
 
  describe "#parse" do
- subject { harem.parse(corpus) }
+ subject { lampada.parse(corpus) }
 
  context "default categories" do
  context "empty corpus" do
@@ -193,8 +193,8 @@ CORPUS
  end
 
  context "user-defined categories" do
- let(:harem) {
- CorpusProcessor::Parsers::Harem.new({
+ let(:lampada) {
+ CorpusProcessor::Parsers::Lampada.new({
  "FRUTA" => :fruit,
  "LIVRO" => :book,
  })
@@ -240,7 +240,7 @@ CORPUS
  end
 
  describe "#extract_category" do
- subject { harem.extract_category(categories) }
+ subject { lampada.extract_category(categories) }
 
  context "empty categories" do
  let(:categories) { "" }
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: corpus-processor
  version: !ruby/object:Gem::Version
- version: 0.0.1
+ version: 0.2.0
  platform: ruby
  authors:
  - Das Dad
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2013-03-27 00:00:00.000000000 Z
+ date: 2013-04-01 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: thor
@@ -100,7 +100,7 @@ files:
  - lib/corpus-processor/generators.rb
  - lib/corpus-processor/generators/stanford_ner.rb
  - lib/corpus-processor/parsers.rb
- - lib/corpus-processor/parsers/harem.rb
+ - lib/corpus-processor/parsers/lampada.rb
  - lib/corpus-processor/processor.rb
  - lib/corpus-processor/token.rb
  - lib/corpus-processor/tokenizer.rb
@@ -109,7 +109,7 @@ files:
  - spec/integration/cli_spec.rb
  - spec/spec_helper.rb
  - spec/unit/generators/stanford_ner_spec.rb
- - spec/unit/parsers/harem_spec.rb
+ - spec/unit/parsers/lampada_spec.rb
  - spec/unit/processor.rb
  - spec/unit/token_spec.rb
  - spec/unit/tokenizer_spec.rb
@@ -142,7 +142,7 @@ test_files:
  - spec/integration/cli_spec.rb
  - spec/spec_helper.rb
  - spec/unit/generators/stanford_ner_spec.rb
- - spec/unit/parsers/harem_spec.rb
+ - spec/unit/parsers/lampada_spec.rb
  - spec/unit/processor.rb
  - spec/unit/token_spec.rb
  - spec/unit/tokenizer_spec.rb