parts 0.1.0 → 0.1.1

Files changed (5)
  1. data/README.md +75 -0
  2. data/lib/parts.rb +3 -1
  3. data/lib/parts/treebank.rb +36 -0
  4. metadata +13 -12
  5. data/README.rdoc +0 -19
data/README.md ADDED
@@ -0,0 +1,75 @@
+ # Parts: a probabilistic part-of-speech tagger
+
+ Parts is a simple-to-use probabilistic part-of-speech tagger. At its core, parts is an adapted [Viterbi](http://en.wikipedia.org/wiki/Viterbi_algorithm) [bi-gram](http://en.wikipedia.org/wiki/Bigram) classifier. As such, it classifies a sentence's parts of speech by identifying the most probable sequence of tags, given the sentence's words. This is done by training the tagger with a pre-tagged corpus. By default the tagger is trained using the [Treebank 3](http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42) corpus.
+
+ If you have any questions, please get in contact via [email](mailto:joe@onlysix.co.uk).
+
+ ## Basic use
+
+ Parts is packaged as a [gem](https://rubygems.org/pages/download) and is installed accordingly.
+
+     gem install parts
+
+ With the gem installed, we must first `require` it in any code that makes use of it.
+
+     require 'parts'
+
+ To create a tagger with parts, we first initialise a new `Parts::Tagger`.
+
+     tagger = Parts::Tagger.new
+
+ This creates a new tagger and, assuming no arguments are passed in, trains it with the default Treebank 3 corpus. With our tagger now created and trained, we can classify a sentence using the tagger's `classify` method. For example, to classify the string `Hello world, this is a sentence`, we would write the following.
+
+     tagger.classify ["Hello", "world", ",", "this", "is", "a", "sentence"]
+
+ The tagger requires you to split a sentence into its tokens, so an array of tokens must be passed to `classify` rather than the sentence string itself.
+
+ As the tagger is trained by default using the Penn Treebank 3 corpus, sentences are tagged with the [Penn Treebank tags](http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).
+
+ ## Training parts
+
+ Training and evaluating parts with your own corpus is simple. To train a tagger with your own corpus, pass in an array of tagged sentences.
+
+     tagger = Parts::Tagger.new sentences
+
+ Sentences are stored as arrays of word-tag pairs, where each sentence is `[{:word => w1, :tag => t1},...,{:word => wn, :tag => tn}]`. For example, to train the tagger with one sentence, we might create a `sentences` array as follows.
+
+     sentences = [
+       [
+         {:word => "Rolls-Royce", :tag => "NNP"},
+         {:word => "said", :tag => "VBD"},
+         {:word => "it", :tag => "PRP"},
+         {:word => "expects", :tag => "VBZ"},
+         {:word => "to", :tag => "TO"},
+         {:word => "remain", :tag => "VB"},
+         {:word => "steady", :tag => "JJ"}
+       ]
+     ]
+
+ Parts aims to stay out of your way as much as possible, so you are free to use whatever tags you want within your corpus. Note that `$start` and `$end` tags are automatically prepended and appended to all sentence arrays when training, so full stops need *not* be included in each sentence in the `sentences` array.
+
+ ## TODO
+
+ There is still significant work to be done on parts, in particular:
+
+ * integrating the k-fold tester so that it can be used with user-built corpora
+ * noun phrase grouping, e.g. *The British Broadcasting Company*
+ * exploring mechanisms for automatically splitting sentences into their tokens
+ * introducing tri-gram tagging as an option
+ * writing a full-featured test suite
+
+ ## Contributing to parts
+
+ * Check out the latest master to make sure the feature hasn't been implemented or the bug hasn't been fixed yet
+ * Check out the issue tracker to make sure someone hasn't already requested it and/or contributed it
+ * Fork the project
+ * Start a feature/bugfix branch
+ * Commit and push until you are happy with your contribution
+ * Make sure to add tests for it. This is important so I don't break it in a future version unintentionally.
+ * Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or it is otherwise necessary, that is fine, but please isolate it to its own commit so I can cherry-pick around it.
+
+ ## Copyright
+
+ Copyright (c) 2012 Joe Root. See LICENSE.txt for
+ further details.
+
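The README describes the tagger as a Viterbi bi-gram classifier trained on arrays of `{:word, :tag}` pairs. As a rough illustration of that idea only, here is a toy sketch. `ToyBigramTagger`, its add-one smoothing, and the shape of its return value are all assumptions made for this note; none of it is taken from the gem's own `Parts::Tagger`.

```ruby
# Hypothetical toy tagger; NOT the implementation shipped in the parts gem.
class ToyBigramTagger
  START_TAG = "$start"
  END_TAG = "$end"

  def initialize(sentences)
    @transitions = Hash.new { |h, k| h[k] = Hash.new(0) } # counts of T1 -> T2
    @emissions = Hash.new { |h, k| h[k] = Hash.new(0) }   # counts of T -> W
    @tag_counts = Hash.new(0)
    @vocab = {}
    sentences.each do |sentence|
      previous = START_TAG
      @tag_counts[START_TAG] += 1
      sentence.each do |pair|
        word = pair[:word].downcase
        tag = pair[:tag]
        @transitions[previous][tag] += 1
        @emissions[tag][word] += 1
        @tag_counts[tag] += 1
        @vocab[word] = true
        previous = tag
      end
      @transitions[previous][END_TAG] += 1
    end
    @tags = @tag_counts.keys - [START_TAG]
  end

  # Returns the most probable tag sequence for an array of tokens.
  def classify(tokens)
    return [] if tokens.empty?
    # best[i][tag] = [log-probability of the best path ending in tag, previous tag]
    best = [{}]
    first = tokens.first.downcase
    @tags.each do |tag|
      best[0][tag] = [log_transition(START_TAG, tag) + log_emission(tag, first), nil]
    end
    (1...tokens.length).each do |i|
      word = tokens[i].downcase
      best << {}
      @tags.each do |tag|
        candidates = @tags.map { |p| [p, best[i - 1][p][0] + log_transition(p, tag)] }
        prev, score = candidates.max_by { |_, s| s }
        best[i][tag] = [score + log_emission(tag, word), prev]
      end
    end
    # Pick the best final tag, then follow the back-pointers to the start.
    last = @tags.max_by { |tag| best[-1][tag][0] + log_transition(tag, END_TAG) }
    path = [last]
    (tokens.length - 1).downto(1) { |i| path.unshift(best[i][path.first][1]) }
    tokens.zip(path).map { |word, tag| {:word => word, :tag => tag} }
  end

  private

  # Add-one smoothed log P(to | from); an assumption, not the gem's smoothing.
  def log_transition(from, to)
    Math.log((@transitions[from][to] + 1.0) / (@tag_counts[from] + @tags.length + 1))
  end

  # Add-one smoothed log P(word | tag).
  def log_emission(tag, word)
    Math.log((@emissions[tag][word] + 1.0) / (@tag_counts[tag] + @vocab.length + 1))
  end
end

sentences = [
  [
    {:word => "Rolls-Royce", :tag => "NNP"},
    {:word => "said", :tag => "VBD"},
    {:word => "it", :tag => "PRP"},
    {:word => "expects", :tag => "VBZ"},
    {:word => "to", :tag => "TO"},
    {:word => "remain", :tag => "VB"},
    {:word => "steady", :tag => "JJ"}
  ]
]
toy_tagger = ToyBigramTagger.new(sentences)
toy_tags = toy_tagger.classify(%w[Rolls-Royce said it expects to remain steady]).map { |pair| pair[:tag] }
```

Working in log space avoids underflow on long sentences, and the back-pointers let the best path be recovered in a single backwards pass.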
data/lib/parts.rb CHANGED
@@ -9,7 +9,8 @@ module Parts
 
      attr_accessor :bigrams, :words, :tags, :bigram_smoothing, :suffixes
 
-     def initialize sentences
+     def initialize sentences=nil
+       sentences = Treebank.new.sentences if sentences.nil?
        # Tag-bigrams are stored such that P(T2|T1) = @bigrams[T1][T2].
        # Word-tag pairs are stored such that P(W|T) = @words[W][T].
        # Tags are stored such that @tags[T] = no. of occurrences of T.
@@ -149,3 +150,4 @@ module Parts
  end
 
  require 'parts/tester'
+ require 'parts/treebank'
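The change above makes the `sentences` argument optional and falls back to the bundled corpus only when it is `nil`. The sketch below isolates that argument handling. `FakeTreebank` and `SketchTagger` are hypothetical stand-ins invented for this note, not classes from the gem.

```ruby
# Hypothetical stand-in for Parts::Treebank; the real class reads treebank3.2.txt.
class FakeTreebank
  def sentences
    [[{:word => "hello", :tag => "UH"}]]
  end
end

# Hypothetical stand-in for Parts::Tagger showing only the argument handling.
class SketchTagger
  attr_reader :sentences

  def initialize sentences=nil
    # Only nil triggers the default corpus; an explicit array is kept as given.
    sentences = FakeTreebank.new.sentences if sentences.nil?
    @sentences = sentences
  end
end

default_tagger = SketchTagger.new
custom_tagger = SketchTagger.new [[{:word => "steady", :tag => "JJ"}]]
empty_tagger = SketchTagger.new []
```

Because the check is `nil?` rather than `empty?`, passing an empty array does not load the default corpus under this pattern.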
data/lib/parts/treebank.rb ADDED
@@ -0,0 +1,36 @@
+ class Parts::Treebank
+
+   def initialize path="#{File.dirname(__FILE__)}/treebank3.2.txt"
+     # Sentences are stored as arrays of word-tag pairs, where each sentence
+     # will be [{:word => w1, :tag => t1},...,{:word => wn, :tag => tn}].
+     @sentences = []
+     self.load path
+   end
+
+   def sentences
+     @sentences
+   end
+
+   def load path
+     # For each line we split on whitespace, and then use a regex to split
+     # each word/tag pair into its word and tag constituents. Whenever a full
+     # stop is encountered we create a new sentence.
+     File.open(path, "r") do |file|
+       sentence = []
+       while (line = file.gets)
+         line.split(' ').each do |part|
+           md = /(.+)+(\/){1}(.+)+/.match part
+           if md
+             if md[3] == "."
+               @sentences << sentence if not sentence.empty?
+               sentence = []
+             else
+               sentence << {:word => md[1].downcase, :tag => md[3]}
+             end
+           end
+         end
+       end
+     end
+   end
+
+ end
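To make the loader's behaviour easier to follow, here is a rough sketch of the same parsing steps applied to an in-memory string. `parse_tagged` is a hypothetical helper written for this note, and it uses `String#rpartition` in place of the loader's regex, so unusual tokens (for example ones ending in a slash) may be handled differently.

```ruby
# Hypothetical helper that roughly mirrors the loader's logic on a string.
def parse_tagged text
  sentences = []
  sentence = []
  text.split(' ').each do |part|
    # Split on the last slash, so a word such as "1/2" keeps its own slash.
    word, slash, tag = part.rpartition('/')
    next if slash.empty? || word.empty? || tag.empty?
    if tag == "."
      # A full-stop tag closes the current sentence and is not stored itself.
      sentences << sentence unless sentence.empty?
      sentence = []
    else
      sentence << {:word => word.downcase, :tag => tag}
    end
  end
  sentences
end

sample = "Rolls-Royce/NNP said/VBD it/PRP expects/VBZ to/TO remain/VB steady/JJ ./.\n" \
         "trailing/JJ words/NNS"
parsed = parse_tagged sample
```

As in the loader, a sentence is only stored when a `.` tag is seen, so the tokens after the final full stop in `sample` are dropped.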
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: parts
  version: !ruby/object:Gem::Version
-   version: 0.1.0
+   version: 0.1.1
  prerelease:
  platform: ruby
  authors:
@@ -13,7 +13,7 @@ date: 2012-01-16 00:00:00.000000000Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: shoulda
-   requirement: &70189868506800 !ruby/object:Gem::Requirement
+   requirement: &70208468895780 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ! '>='
@@ -21,10 +21,10 @@ dependencies:
          version: '0'
    type: :development
    prerelease: false
-   version_requirements: *70189868506800
+   version_requirements: *70208468895780
  - !ruby/object:Gem::Dependency
    name: bundler
-   requirement: &70189868506320 !ruby/object:Gem::Requirement
+   requirement: &70208468895300 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
@@ -32,10 +32,10 @@ dependencies:
          version: 1.0.0
    type: :development
    prerelease: false
-   version_requirements: *70189868506320
+   version_requirements: *70208468895300
  - !ruby/object:Gem::Dependency
    name: jeweler
-   requirement: &70189868505820 !ruby/object:Gem::Requirement
+   requirement: &70208468894820 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ~>
@@ -43,10 +43,10 @@ dependencies:
          version: 1.6.4
    type: :development
    prerelease: false
-   version_requirements: *70189868505820
+   version_requirements: *70208468894820
  - !ruby/object:Gem::Dependency
    name: rcov
-   requirement: &70189868492780 !ruby/object:Gem::Requirement
+   requirement: &70208468894300 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ! '>='
@@ -54,20 +54,21 @@ dependencies:
          version: '0'
    type: :development
    prerelease: false
-   version_requirements: *70189868492780
+   version_requirements: *70208468894300
  description: ''
  email: joe@onlysix.co.uk
  executables: []
  extensions: []
  extra_rdoc_files:
  - LICENSE.txt
- - README.rdoc
+ - README.md
  files:
  - lib/parts.rb
  - lib/parts/tester.rb
+ - lib/parts/treebank.rb
  - lib/parts/treebank3.2.txt
  - LICENSE.txt
- - README.rdoc
+ - README.md
  homepage: http://github.com/joeroot/parts
  licenses:
  - MIT
@@ -83,7 +84,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
        version: '0'
        segments:
        - 0
-       hash: -1093201021619199354
+       hash: -447277508490023829
  required_rubygems_version: !ruby/object:Gem::Requirement
    none: false
    requirements:
data/README.rdoc DELETED
@@ -1,19 +0,0 @@
- = parts
-
- Description goes here.
-
- == Contributing to parts
-
- * Check out the latest master to make sure the feature hasn't been implemented or the bug hasn't been fixed yet
- * Check out the issue tracker to make sure someone already hasn't requested it and/or contributed it
- * Fork the project
- * Start a feature/bugfix branch
- * Commit and push until you are happy with your contribution
- * Make sure to add tests for it. This is important so I don't break it in a future version unintentionally.
- * Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or is otherwise necessary, that is fine, but please isolate to its own commit so I can cherry-pick around it.
-
- == Copyright
-
- Copyright (c) 2012 Joe Root. See LICENSE.txt for
- further details.
-