corpus-processor 0.0.1 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
-   metadata.gz: 8a0ff96102528239769c105832893034e21434bf
-   data.tar.gz: 625ffe80fa8399f20610e048c6ce346a69eef9c0
+   metadata.gz: 0b40f1ccc5e1f007f584f6c0bf037b0221d65cec
+   data.tar.gz: 5b486e05f2372b163a1399244ed2861c239bea02
  SHA512:
-   metadata.gz: 1716f52826fa5b895977760e33f5e918a9b7fcebd0d3448b6419c4cb9e8d1b7902f8d99cb6646f4b33693f5743aac3802bc4476e1eac9db555cd188d52acb9e0
-   data.tar.gz: 770efa624c0c2fcb0b3170d10dcce05069f90650f04b775f6d5662c9ac4b61b71f7884831e4903c015398d21cb21498206f6f2bc4a41cf59b6905d887222d9b8
+   metadata.gz: ec94f33cf3ff79a6874130ddbfbb10df20186ff9a5ffb176de48d92aca56b43bfd0d679e6e88f4cc4e215dfd253d7d697b8f68cd99fabbd05fce7b0ab8e761e4
+   data.tar.gz: 9e82dff64190b3dd04c33a31d5df024b5b3522de70c13075dc29b2d6c6b431020aded272818315d7cbcef5997bc6fa71e0e00aa1c001c692ec3be8b181f4c2ab
data/README.md CHANGED
@@ -1,15 +1,72 @@
  Corpus Processor
  ================
 
- ![Corpus Processor](http://badge.fury.io/rb/corpus-processor)
+ [![Gem Version](https://fury-badge.herokuapp.com/rb/corpus-processor.png)](http://badge.fury.io/rb/corpus-processor)
 
- Tool to work with [Corpus Linguistics](http://en.wikipedia.org/wiki/Corpus_linguistics). Corpus Processor converts _corpora_ between different formats for use in Natural Language Processing (NLP) tools.
+ * [Versão em português](#versao-em-portugues)
+ * [English version](#english-version)
+
+ Versão em português
+ ===================
+
+ Corpus Processor é uma ferramenta para trabalhar com [Linguística de Corpus](http://pt.wikipedia.org/wiki/Lingu%C3%ADstica_de_corpus). Ele converte _corpora_ entre diferentes formatos para serem usados em ferramentas de Processamento de Linguagem Natural (NLP).
+
+ O primeiro propósito do Corpus Processor e seu único recurso implementado até agora é transformar _corpora_ encontrados na [Linguateca](http://www.linguateca.pt) para o formato usado pelo treinamento do [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml).
+
+ [Linguateca](http://www.linguateca.pt) é uma fonte de _corpora_ em português.
+
+ [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml) é uma implementação de [Reconhecimento de Entidade Mencionada (NER)](http://pt.wikipedia.org/wiki/Reconhecimento_de_entidade_mencionada).
+
+ Instalação
+ ----------
+
+ Corpus Processor é uma [Ruby](http://www.ruby-lang.org/) [Gem](http://rubygems.org/). Para instalar, dada uma instalação de Ruby, rode:
+
+ ```bash
+ $ gem install corpus-processor
+ ```
+
+ Uso
+ ---
+
+ Converter _corpus_ do formato do LâMPADA 2.0 para o formato do Stanford-NER:
+
+ ```bash
+ $ corpus-processor process [INPUT_FILE [OUTPUT_FILE]]
+ ```
+
+ Resultados
+ ----------
+
+ Para um exemplo de conversão usando o Corpus Processor, veja este [gist](https://gist.github.com/leafac/5259008).
+
+ O _corpus_ é do [LâMPADA 2.0 / Classic HAREM 2.0 Golden Collection](http://www.linguateca.pt/HAREM/) e o treinamento usou o [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml).
+
+ **Note** que a transformação do Corpus Processor descarta muita informação do _corpus_ anotado. Os _corpora_ usados são bastante ricos em anotações e para tirar completo proveito deles considere usar as ferramentas encontradas na [Linguateca](http://www.linguateca.pt).
+
+ Para entender melhor, siga as seguintes referências:
+
+ Diana Santos. "O modelo semântico usado no Primeiro HAREM". In Diana Santos & Nuno Cardoso (eds.), Reconhecimento de entidades mencionadas em português: Documentação e actas do HAREM, a primeira avaliação conjunta na área. Linguateca, 2007, pp. 43-57.
+ http://www.linguateca.pt/aval_conjunta/LivroHAREM/Cap04-SantosCardoso2007-Santos.pdf
+
+ Diana Santos. "Evaluation in natural language processing". European Summer School on Language, Logic and Information (ESSLLI 2007) (Trinity College, Dublin, Irlanda, 6-17 de Agosto de 2007).
+
+ Agradecimentos
+ --------------
+
+ * [Time do HAREM / Linguateca](http://www.linguateca.pt/HAREM) pelo _corpus_ com anotações semânticas em português.
+ * *[Time de NLP de Stanford](http://www-nlp.stanford.edu/)* pela ferramenta [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml).
+
+ English version
+ ===============
+
+ Corpus Processor is a tool to work with [Corpus Linguistics](http://en.wikipedia.org/wiki/Corpus_linguistics). It converts _corpora_ between different formats for use in Natural Language Processing (NLP) tools.
 
  The first purpose of Corpus Processor and its current only feature is to transform _corpora_ found in [Linguateca](http://www.linguateca.pt) into the format used for training in [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml).
 
- [Linguateca](http://www.linguateca.pt) is an excellent source of _corpora_ in Portuguese.
+ [Linguateca](http://www.linguateca.pt) is a source of _corpora_ in Portuguese.
 
- [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml) is an excellent implementation of [Named Entity Recognition](http://en.wikipedia.org/wiki/Named-entity_recognition).
+ [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml) is an implementation of [Named Entity Recognition](http://en.wikipedia.org/wiki/Named-entity_recognition).
 
  Installation
  ------------
@@ -23,7 +80,7 @@ $ gem install corpus-processor
  Usage
  -----
 
- Convert corpus from HAREM format to Stanford-NER format:
+ Convert _corpus_ from LâMPADA 2.0 format to Stanford-NER format:
 
  ```bash
  $ corpus-processor process [INPUT_FILE [OUTPUT_FILE]]
@@ -32,9 +89,24 @@ $ corpus-processor process [INPUT_FILE [OUTPUT_FILE]]
  Results
  -------
 
- For an example of converting one corpus with Corpus Processor, refer to this [gist](https://gist.github.com/leafac/5259008).
+ For an example of converting one _corpus_ with Corpus Processor, refer to this [gist](https://gist.github.com/leafac/5259008).
+
+ The _corpus_ is from [LâMPADA 2.0 / Classic HAREM 2.0 Golden Collection](http://www.linguateca.pt/HAREM/) and the training used [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml).
+
+ **Note** that the transformation performed by Corpus Processor discards much of the information in the annotated _corpus_. The _corpora_ used in this process are very rich in annotations; to take full advantage of them, consider using one of the tools found in [Linguateca](http://www.linguateca.pt).
 
- The corpus is from [Linguateca](http://www.linguateca.pt/HAREM/) and the training used [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml).
+ Further information about the subject can be found in the following sources:
+
+ Diana Santos. "O modelo semântico usado no Primeiro HAREM". In Diana Santos & Nuno Cardoso (eds.), Reconhecimento de entidades mencionadas em português: Documentação e actas do HAREM, a primeira avaliação conjunta na área. Linguateca, 2007, pp. 43-57.
+ http://www.linguateca.pt/aval_conjunta/LivroHAREM/Cap04-SantosCardoso2007-Santos.pdf
+
+ Diana Santos. "Evaluation in natural language processing". European Summer School on Language, Logic and Information (ESSLLI 2007) (Trinity College, Dublin, Ireland, 6-17 August 2007).
+
+ Thanks
+ ------
+
+ * [HAREM / Linguateca team](http://www.linguateca.pt/HAREM) for the semantically annotated _corpus_ in Portuguese.
+ * *[Stanford NLP team](http://www-nlp.stanford.edu/)* for the [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml) tool.
 
  Contributing
  ------------
@@ -50,14 +122,12 @@ Changelog
 
  ### 0.0.1
 
- * [Harem](http://www.linguateca.pt/HAREM/) Parser.
+ * [LâMPADA 2.0 / Classic HAREM 2.0 Golden Collection](http://www.linguateca.pt/HAREM/) Parser.
  * [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml) Generator.
 
- Thanks
- ------
+ ### 0.0.2
 
- * *Diana Santos* and her team in [Linguateca](http://www.linguateca.pt) for the semantic annotated corpus in Portuguese.
- * *[Stanford NLP team](http://www-nlp.stanford.edu/)* for the [Stanford NER](http://nlp.stanford.edu/software/CRF-NER.shtml) tool.
+ * Renamed Harem to LâMPADA, as asked by Linguateca's team.
 
  License
  -------
@@ -4,7 +4,7 @@ require "thor"
  module CorpusProcessor
    class Cli < ::Thor
 
-     desc "process [INPUT_FILE [OUTPUT_FILE]] ", "convert corpus from HAREM format to Stanford-NER format"
+     desc "process [INPUT_FILE [OUTPUT_FILE]] ", "convert corpus from LâMPADA format to Stanford-NER format"
      def process(input_file = $stdin, output_file = $stdout)
        input_file  = File.new( input_file, "r") if input_file.is_a? String
        output_file = File.new(output_file, "w") if output_file.is_a? String
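
Reviewer note: the two coercion lines at the end of this hunk let `process` accept either a path or an already-open stream, with the standard streams as defaults. Below is a minimal standalone sketch of that pattern. The helper name `open_streams` is hypothetical (it is not part of the gem), and `StringIO` is used only as a stand-in stream for illustration.

```ruby
require "stringio"

# Hypothetical helper mirroring the coercion shown in Cli#process:
# String arguments are treated as paths and opened as files; any other
# object is assumed to already behave like an IO and is passed through.
def open_streams(input_file = $stdin, output_file = $stdout)
  input_file  = File.new(input_file, "r")  if input_file.is_a? String
  output_file = File.new(output_file, "w") if output_file.is_a? String
  [input_file, output_file]
end

# With no arguments, the defaults are the process's standard streams.
default_in, default_out = open_streams

# IO-like objects are returned untouched, so in-memory streams work too.
memory_in  = StringIO.new("texto de exemplo")
memory_out = StringIO.new
stream_in, stream_out = open_streams(memory_in, memory_out)
```

Passing IO objects straight through would let a caller drive the command with in-memory streams instead of files; whether the gem's own specs rely on this is not shown in this diff.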
@@ -1 +1 @@
- require "corpus-processor/parsers/harem"
+ require "corpus-processor/parsers/lampada"
@@ -1,5 +1,5 @@
  module CorpusProcessor::Parsers
-   class Harem
+   class Lampada
 
      CATEGORY_REGEX = /
        (?<any_text> .*? ){0}
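
Reviewer note: the class is renamed here and no alias appears in the hunks shown, so code that still refers to `CorpusProcessor::Parsers::Harem` would presumably raise `NameError` after upgrading. The sketch below shows one caller-side shim. The empty `Lampada` class is only a stand-in so the snippet runs on its own, and the alias is an assumption about what a downstream project might add, not something this release ships.

```ruby
module CorpusProcessor
  module Parsers
    # Stand-in for the real parser class; with the gem loaded, this line
    # would simply reopen the existing class without changing it.
    class Lampada; end

    # Keep the old constant name working for callers not yet migrated.
    Harem = Lampada unless const_defined?(:Harem, false)
  end
end

# Old call sites keep working and receive an instance of the renamed class.
parser = CorpusProcessor::Parsers::Harem.new
puts parser.class
```

The cleaner long-term fix is to update call sites to `CorpusProcessor::Parsers::Lampada` and drop the alias.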
@@ -1,5 +1,5 @@
  class CorpusProcessor::Processor
-   def initialize(parser = CorpusProcessor::Parsers::Harem.new,
+   def initialize(parser = CorpusProcessor::Parsers::Lampada.new,
                   generator = CorpusProcessor::Generators::StanfordNer.new)
      @parser = parser
      @generator = generator
@@ -1,3 +1,3 @@
  module CorpusProcessor
-   VERSION = "0.0.1"
+   VERSION = "0.2.0"
  end
@@ -1,10 +1,10 @@
  require "spec_helper"
 
- describe CorpusProcessor::Parsers::Harem do
-   subject(:harem) { CorpusProcessor::Parsers::Harem.new }
+ describe CorpusProcessor::Parsers::Lampada do
+   subject(:lampada) { CorpusProcessor::Parsers::Lampada.new }
 
    describe "#parse" do
-     subject { harem.parse(corpus) }
+     subject { lampada.parse(corpus) }
 
      context "default categories" do
        context "empty corpus" do
@@ -193,8 +193,8 @@ CORPUS
      end
 
      context "user-defined categories" do
-       let(:harem) {
-         CorpusProcessor::Parsers::Harem.new({
+       let(:lampada) {
+         CorpusProcessor::Parsers::Lampada.new({
            "FRUTA" => :fruit,
            "LIVRO" => :book,
          })
@@ -240,7 +240,7 @@ CORPUS
    end
 
    describe "#extract_category" do
-     subject { harem.extract_category(categories) }
+     subject { lampada.extract_category(categories) }
 
      context "empty categories" do
        let(:categories) { "" }
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: corpus-processor
  version: !ruby/object:Gem::Version
-   version: 0.0.1
+   version: 0.2.0
  platform: ruby
  authors:
  - Das Dad
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2013-03-27 00:00:00.000000000 Z
+ date: 2013-04-01 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: thor
@@ -100,7 +100,7 @@ files:
  - lib/corpus-processor/generators.rb
  - lib/corpus-processor/generators/stanford_ner.rb
  - lib/corpus-processor/parsers.rb
- - lib/corpus-processor/parsers/harem.rb
+ - lib/corpus-processor/parsers/lampada.rb
  - lib/corpus-processor/processor.rb
  - lib/corpus-processor/token.rb
  - lib/corpus-processor/tokenizer.rb
@@ -109,7 +109,7 @@ files:
  - spec/integration/cli_spec.rb
  - spec/spec_helper.rb
  - spec/unit/generators/stanford_ner_spec.rb
- - spec/unit/parsers/harem_spec.rb
+ - spec/unit/parsers/lampada_spec.rb
  - spec/unit/processor.rb
  - spec/unit/token_spec.rb
  - spec/unit/tokenizer_spec.rb
@@ -142,7 +142,7 @@ test_files:
  - spec/integration/cli_spec.rb
  - spec/spec_helper.rb
  - spec/unit/generators/stanford_ner_spec.rb
- - spec/unit/parsers/harem_spec.rb
+ - spec/unit/parsers/lampada_spec.rb
  - spec/unit/processor.rb
  - spec/unit/token_spec.rb
  - spec/unit/tokenizer_spec.rb