parts 0.1.0 → 0.1.1
- data/README.md +75 -0
- data/lib/parts.rb +3 -1
- data/lib/parts/treebank.rb +36 -0
- metadata +13 -12
- data/README.rdoc +0 -19
data/README.md
ADDED
@@ -0,0 +1,75 @@
+# Parts: a probabilistic part-of-speech tagger
+
+Parts is a simple-to-use probabilistic part-of-speech tagger. At its core, parts is an adapted [Viterbi](http://en.wikipedia.org/wiki/Viterbi_algorithm) [bi-gram](http://en.wikipedia.org/wiki/Bigram) classifier. As such, it classifies a sentence's parts of speech by identifying the most probable sequence of tags given the sentence's words. This is done by training the tagger with a pre-tagged corpus. By default the tagger is trained using the [Treebank 3](http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42) corpus.
+
+If you have any questions, please get in contact via [email](mailto:joe@onlysix.co.uk).
+
+## Basic use
+
+Parts is packaged as a [gem](https://rubygems.org/pages/download) and is installed accordingly.
+
+    gem install parts
+
+With the gem installed, we must first `require` it within any code making use of it.
+
+    require 'parts'
+
+To create a tagger with parts, we first initialise a new `Parts::Tagger`.
+
+    tagger = Parts::Tagger.new
+
+This will create a new tagger and, assuming no arguments are passed in, train it with the default Treebank 3 corpus. With our tagger now created and trained, we can classify a sentence using the tagger's `classify` method. For example, to classify the string `Hello world, this is a sentence`, we would write the following.
+
+    tagger.classify ["Hello", "world", ",", "this", "is", "a", "sentence"]
+
+The tagger requires you to split a sentence into its tokens, so when calling the `classify` method an array of tokens must be passed in rather than the sentence string itself.
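Since the tagger expects tokens rather than a raw string, a small helper can do the splitting first. The sketch below is only an illustration: `tokenize` is a hypothetical name, it is not part of the parts gem, and its regular expression is a naive assumption that does not follow Penn Treebank tokenization conventions (for example, it keeps contractions such as `don't` as a single token).

```ruby
# Hypothetical helper, not part of the parts gem: a naive tokenizer that
# separates words from punctuation so the result can be passed to `classify`.
def tokenize(sentence)
  # Runs of word characters, apostrophes and hyphens form one token;
  # any other punctuation mark becomes a token of its own.
  sentence.scan(/[\w'-]+|[[:punct:]]/)
end

tokens = tokenize("Hello world, this is a sentence")
# tokens => ["Hello", "world", ",", "this", "is", "a", "sentence"]
```

The resulting array has the same shape as the one passed to `tagger.classify` in the README example.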
+
+As the tagger is trained by default using the Penn Treebank 3 corpus, sentences are tagged with the [Penn Treebank tags](http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).
+
+## Training parts
+
+Training and evaluating parts with your own corpora is simple. To train a tagger with your own corpus, pass in an array of tagged sentences.
+
+    tagger = Parts::Tagger.new sentences
+
+Sentences are stored as arrays of word-tag pairs, where each sentence will be `[{:word => w1, :tag => t1},...,{:word => wn, :tag => tn}]`. For example, were we to train it with one sentence, we might create a `sentences` array like so.
+
+    sentences = [
+      [
+        {:word => "Rolls-Royce", :tag => "NNP"},
+        {:word => "said", :tag => "VBD"},
+        {:word => "it", :tag => "PRP"},
+        {:word => "expects", :tag => "VBZ"},
+        {:word => "to", :tag => "TO"},
+        {:word => "remain", :tag => "VB"},
+        {:word => "steady", :tag => "JJ"}
+      ]
+    ]
+
+Parts aims to stay out of your way as much as possible, so you are free to use whatever tags you want within your corpora. Note that `$start` and `$end` tags are automatically prepended and appended to all sentence arrays when training, so full stops need *not* be included in each sentence in the `sentences` array.
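If your tagged data starts out as plain `[word, tag]` pairs, it can be mapped into the array-of-hashes shape the README describes. This is a minimal sketch; `to_tagged_sentence` is a hypothetical helper name and is not part of the parts gem.

```ruby
# Hypothetical helper, not part of the parts gem: converts [word, tag]
# pairs into the array-of-hashes shape described in the README.
def to_tagged_sentence(pairs)
  pairs.map { |word, tag| { :word => word, :tag => tag } }
end

sentences = [
  to_tagged_sentence([
    ["Rolls-Royce", "NNP"], ["said", "VBD"], ["it", "PRP"],
    ["expects", "VBZ"], ["to", "TO"], ["remain", "VB"], ["steady", "JJ"]
  ])
]
# `sentences` now has the shape expected by `Parts::Tagger.new sentences`.
```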
+
+## TODO
+
+There is still significant work to be done on parts, in particular:
+
+* integrating the k-fold tester so that it can be used with user-built corpora
+* grouping noun phrases, e.g. *The British Broadcasting Company*
+* exploring mechanisms for automatically splitting sentences into their tokens
+* introducing tri-gram tagging as an option
+* writing a full-featured test suite
+
+## Contributing to parts
+
+* Check out the latest master to make sure the feature hasn't been implemented or the bug hasn't been fixed yet
+* Check out the issue tracker to make sure someone hasn't already requested it and/or contributed it
+* Fork the project
+* Start a feature/bugfix branch
+* Commit and push until you are happy with your contribution
+* Make sure to add tests for it. This is important so I don't break it in a future version unintentionally.
+* Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or it is otherwise necessary, that is fine, but please isolate it to its own commit so I can cherry-pick around it.
+
+## Copyright
+
+Copyright (c) 2012 Joe Root. See LICENSE.txt for
+further details.
+
data/lib/parts.rb
CHANGED
@@ -9,7 +9,8 @@ module Parts
 
     attr_accessor :bigrams, :words, :tags, :bigram_smoothing, :suffixes
 
-    def initialize sentences
+    def initialize sentences=nil
+      sentences = Treebank.new.sentences if sentences.nil?
       # Tag-bigrams are stored such that P(T2|T1) = @bigrams[T1][T2].
       # Word-tag pairs are stored such that P(W|T) = @words[W][T].
       # Tags are stored such that @tags[T] = no. of occurrences of T.
@@ -149,3 +150,4 @@ module Parts
 end
 
 require 'parts/tester'
+require 'parts/treebank'
data/lib/parts/treebank.rb
ADDED
@@ -0,0 +1,36 @@
+class Parts::Treebank
+
+  def initialize path="#{File.dirname(__FILE__)}/treebank3.2.txt"
+    # Sentences are stored as arrays of word-tag pairs, where each sentence
+    # will be [{:word => w1, :tag => t1},...,{:word => wn, :tag => tn}].
+    @sentences = []
+    self.load path
+  end
+
+  def sentences
+    @sentences
+  end
+
+  def load path
+    # For each sentence we split on empty space, and then use regex to split
+    # each word/tag pair into its word and tag constituents. Whenever a full
+    # stop is encountered we create a new sentence.
+    File.open(path, "r") do |file|
+      sentence = []
+      while (line = file.gets)
+        line.split(' ').each do |part|
+          md = /(.+)+(\/){1}(.+)+/.match part
+          if md
+            if md[3] == "."
+              @sentences << sentence if not sentence.empty?
+              sentence = []
+            else
+              sentence << {:word => md[1].downcase, :tag => md[3]}
+            end
+          end
+        end
+      end
+    end
+  end
+
+end
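The loading logic in `Parts::Treebank#load` can be tried out without the corpus file. The sketch below is a rough approximation, not the gem's code: `parse_tagged` is a hypothetical name, it reads from a string rather than a file, and it uses `String#rpartition` in place of the original regular expression, on the assumption that both split a well-formed `word/tag` token at its last slash.

```ruby
# Hypothetical helper, not part of the parts gem: parses "word/tag" text
# in roughly the way Parts::Treebank#load does, but from an in-memory string.
def parse_tagged(text)
  sentences = []
  sentence = []
  text.split(' ').each do |part|
    # Split on the last slash so a word that itself contains "/" keeps its tag.
    word, slash, tag = part.rpartition('/')
    # Skip tokens that are not of the form word/tag.
    next if slash.empty? || word.empty? || tag.empty?
    if tag == '.'
      # A full-stop tag closes the current sentence; the stop itself is dropped.
      sentences << sentence unless sentence.empty?
      sentence = []
    else
      sentence << { :word => word.downcase, :tag => tag }
    end
  end
  sentences
end

parsed = parse_tagged("Rolls-Royce/NNP said/VBD ./. It/PRP rose/VBD ./.")
# parsed holds two sentences of two word-tag hashes each, with words lowercased.
```

As in the loader shown in the diff, words are lowercased and any tokens after the last full stop are never appended to the result.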
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: parts
 version: !ruby/object:Gem::Version
-  version: 0.1.
+  version: 0.1.1
 prerelease:
 platform: ruby
 authors:
@@ -13,7 +13,7 @@ date: 2012-01-16 00:00:00.000000000Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: shoulda
-  requirement: &
+  requirement: &70208468895780 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -21,10 +21,10 @@ dependencies:
         version: '0'
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *70208468895780
 - !ruby/object:Gem::Dependency
   name: bundler
-  requirement: &
+  requirement: &70208468895300 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -32,10 +32,10 @@ dependencies:
         version: 1.0.0
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *70208468895300
 - !ruby/object:Gem::Dependency
   name: jeweler
-  requirement: &
+  requirement: &70208468894820 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -43,10 +43,10 @@ dependencies:
         version: 1.6.4
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *70208468894820
 - !ruby/object:Gem::Dependency
   name: rcov
-  requirement: &
+  requirement: &70208468894300 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -54,20 +54,21 @@ dependencies:
         version: '0'
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *70208468894300
 description: ''
 email: joe@onlysix.co.uk
 executables: []
 extensions: []
 extra_rdoc_files:
 - LICENSE.txt
-- README.
+- README.md
 files:
 - lib/parts.rb
 - lib/parts/tester.rb
+- lib/parts/treebank.rb
 - lib/parts/treebank3.2.txt
 - LICENSE.txt
-- README.
+- README.md
 homepage: http://github.com/joeroot/parts
 licenses:
 - MIT
@@ -83,7 +84,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
       version: '0'
       segments:
       - 0
-      hash: -
+      hash: -447277508490023829
 required_rubygems_version: !ruby/object:Gem::Requirement
   none: false
   requirements:
data/README.rdoc
DELETED
@@ -1,19 +0,0 @@
-= parts
-
-Description goes here.
-
-== Contributing to parts
-
-* Check out the latest master to make sure the feature hasn't been implemented or the bug hasn't been fixed yet
-* Check out the issue tracker to make sure someone already hasn't requested it and/or contributed it
-* Fork the project
-* Start a feature/bugfix branch
-* Commit and push until you are happy with your contribution
-* Make sure to add tests for it. This is important so I don't break it in a future version unintentionally.
-* Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or is otherwise necessary, that is fine, but please isolate to its own commit so I can cherry-pick around it.
-
-== Copyright
-
-Copyright (c) 2012 Joe Root. See LICENSE.txt for
-further details.
-