mitie 0.1.5 → 0.2.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: e0ceaa2d4609a2a1b3b4056d67d88a1b0f55616ac8fd1a0509f070c352a96ea1
-   data.tar.gz: 0b90eba1027ca5a46a405a97411d2b5fe193d20666ee8c8bad347bb2c79c225b
+   metadata.gz: dc2b7bdba2fba6b335ab9750efab8766190e618c4b8e2542ff6409f6727ec8b2
+   data.tar.gz: 03d85b928082a04b46209694c8e1b294e5c4205057a26b5cc65e695bbe3564a9
  SHA512:
-   metadata.gz: 8281e51659e08157d305535f3cd242082d4173368c36fa167716771b008a82233d02e43d1015c39e27c4c704ae0ec53935834eb0fefb71dac3b515962987d7eb
-   data.tar.gz: 96d3564684f8197651f93238876f2bbeb69f860a8a373f36dcafb578db698555153b320f2bd0c799cd8d5d939ad05c3746c59e907d4ad62cfa4917fbb7ec2313
+   metadata.gz: 4092a2dc005bb76527429454c301179c63f0eeee4913a3a8f56190e13d7f7551ee3bbfcd1098a6c335cf3b886ec0dee4e52f5c062975a4a058d117229f45b340
+   data.tar.gz: 066e5800a520b16002388088fc367939a26738bd638631d3b86ff9074d508cf5cd051154fbbf7be129f07f82c97a851caf7a6486e6fff2fb7f6a4b252d426ec0
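The digests above can be compared against a locally downloaded archive. A minimal sketch, assuming you have a file on disk to check; `sha256_matches?` is a hypothetical helper and the temp file only stands in for the real `data.tar.gz`:

```ruby
require "digest"
require "tempfile"

# Hypothetical helper: true when the file's SHA256 hex digest equals `expected`.
# Comparison is case-insensitive on the expected value.
def sha256_matches?(path, expected)
  Digest::SHA256.file(path).hexdigest == expected.downcase
end

# Stand-in for data.tar.gz so the sketch runs anywhere.
Tempfile.create("data") do |f|
  f.write("example archive bytes")
  f.flush
  expected = Digest::SHA256.hexdigest("example archive bytes")
  puts sha256_matches?(f.path, expected)
end
```

For a real check, pass the archive path and the `SHA256` value listed for `data.tar.gz`.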
data/CHANGELOG.md CHANGED
@@ -1,3 +1,19 @@
+ ## 0.2.1 (2022-06-12)
+
+ - Added `tokenize` and `tokenize_file` methods
+ - Added support for untokenized text to text categorization
+
+ ## 0.2.0 (2022-06-01)
+
+ - Added support for text categorization
+ - Added support for training binary relation detectors
+ - Dropped support for Ruby < 2.7
+
+ ## 0.1.6 (2022-03-20)
+
+ - Added support for training NER models
+ - Improved ARM detection
+
  ## 0.1.5 (2021-01-29)

  - Fixed issue with multibyte characters
data/README.md CHANGED
@@ -1,21 +1,21 @@
1
- # MITIE
1
+ # MITIE Ruby
2
2
 
3
- [MITIE](https://github.com/mit-nlp/MITIE) - named-entity recognition and binary relation detection - for Ruby
3
+ [MITIE](https://github.com/mit-nlp/MITIE) - named-entity recognition, binary relation detection, and text categorization - for Ruby
4
4
 
5
5
  - Finds people, organizations, and locations in text
6
6
  - Detects relationships between entities, like `PERSON` was born in `LOCATION`
7
7
 
8
- [![Build Status](https://github.com/ankane/mitie/workflows/build/badge.svg?branch=master)](https://github.com/ankane/mitie/actions)
8
+ [![Build Status](https://github.com/ankane/mitie-ruby/workflows/build/badge.svg?branch=master)](https://github.com/ankane/mitie-ruby/actions)
9
9
 
10
10
  ## Installation
11
11
 
12
12
  Add this line to your application’s Gemfile:
13
13
 
14
14
  ```ruby
15
- gem 'mitie'
15
+ gem "mitie"
16
16
  ```
17
17
 
18
- And download the pre-trained model for your language:
18
+ And download the pre-trained models for your language:
19
19
 
20
20
  - [English](https://github.com/mit-nlp/MITIE/releases/download/v0.4/MITIE-models-v0.2.tar.bz2)
21
21
  - [Spanish](https://github.com/mit-nlp/MITIE/releases/download/v0.4/MITIE-models-v0.2-Spanish.zip)
@@ -23,6 +23,12 @@ And download the pre-trained model for your language:
23
23
 
24
24
  ## Getting Started
25
25
 
26
+ - [Named Entity Recognition](#named-entity-recognition)
27
+ - [Binary Relation Detection](#binary-relation-detection)
28
+ - [Text Categorization](#text-categorization)
29
+
30
+ ## Named Entity Recognition
31
+
26
32
  Load an NER model
27
33
 
28
34
  ```ruby
@@ -69,6 +75,41 @@ Get all tags for a model
69
75
  model.tags
70
76
  ```
71
77
 
78
+ ### Training
79
+
80
+ Load an NER model into a trainer
81
+
82
+ ```ruby
83
+ trainer = Mitie::NERTrainer.new("total_word_feature_extractor.dat")
84
+ ```
85
+
86
+ Create training instances
87
+
88
+ ```ruby
89
+ tokens = ["You", "can", "do", "machine", "learning", "in", "Ruby", "!"]
90
+ instance = Mitie::NERTrainingInstance.new(tokens)
91
+ instance.add_entity(3..4, "topic") # machine learning
92
+ instance.add_entity(6..6, "language") # Ruby
93
+ ```
94
+
95
+ Add the training instances to the trainer
96
+
97
+ ```ruby
98
+ trainer.add(instance)
99
+ ```
100
+
101
+ Train the model
102
+
103
+ ```ruby
104
+ model = trainer.train
105
+ ```
106
+
107
+ Save the model
108
+
109
+ ```ruby
110
+ model.save_to_disk("ner_model.dat")
111
+ ```
112
+
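The ranges passed to `add_entity` are inclusive token indices. A small sketch of one way to derive them from a phrase; `token_range` is a hypothetical helper for illustration and is not part of the gem:

```ruby
# Hypothetical helper (not part of the gem): returns the inclusive token
# range of the first occurrence of `phrase` in `tokens`, or nil if absent.
def token_range(tokens, phrase)
  return nil if phrase.empty?

  last_start = tokens.size - phrase.size
  (0..last_start).each do |start|
    return start..(start + phrase.size - 1) if tokens[start, phrase.size] == phrase
  end
  nil
end

tokens = ["You", "can", "do", "machine", "learning", "in", "Ruby", "!"]
p token_range(tokens, ["machine", "learning"]) # 3..4
p token_range(tokens, ["Ruby"])                # 6..6
```

The results match the ranges used in the README example above.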
72
113
  ## Binary Relation Detection
73
114
 
74
115
  Detect relationships between two entities, like:
@@ -103,24 +144,98 @@ This returns
103
144
  [{first: "Shopify", second: "Ottawa", score: 0.17649169745814464}]
104
145
  ```
105
146
 
147
+ ### Training
148
+
149
+ Load an NER model into a trainer
150
+
151
+ ```ruby
152
+ trainer = Mitie::BinaryRelationTrainer.new(model)
153
+ ```
154
+
155
+ Add positive and negative examples to the trainer
156
+
157
+ ```ruby
158
+ tokens = ["Shopify", "was", "founded", "in", "Ottawa"]
159
+ trainer.add_positive_binary_relation(tokens, 0..0, 4..4)
160
+ trainer.add_negative_binary_relation(tokens, 4..4, 0..0)
161
+ ```
162
+
163
+ Train the detector
164
+
165
+ ```ruby
166
+ detector = trainer.train
167
+ ```
168
+
169
+ Save the detector
170
+
171
+ ```ruby
172
+ detector.save_to_disk("binary_relation_detector.svm")
173
+ ```
174
+
175
+ ## Text Categorization
176
+
177
+ Load a model into a trainer
178
+
179
+ ```ruby
180
+ trainer = Mitie::TextCategorizerTrainer.new("total_word_feature_extractor.dat")
181
+ ```
182
+
183
+ Add labeled text to the trainer
184
+
185
+ ```ruby
186
+ trainer.add("This is super cool", "positive")
187
+ ```
188
+
189
+ Train the model
190
+
191
+ ```ruby
192
+ model = trainer.train
193
+ ```
194
+
195
+ Save the model
196
+
197
+ ```ruby
198
+ model.save_to_disk("text_categorization_model.dat")
199
+ ```
200
+
201
+ Load a saved model
202
+
203
+ ```ruby
204
+ model = Mitie::TextCategorizer.new("text_categorization_model.dat")
205
+ ```
206
+
207
+ Categorize text
208
+
209
+ ```ruby
210
+ model.categorize("What a super nice day")
211
+ ```
212
+
213
+ ## Deployment
214
+
215
+ Check out [Trove](https://github.com/ankane/trove) for deploying models.
216
+
217
+ ```sh
218
+ trove push ner_model.dat
219
+ ```
220
+
106
221
  ## History
107
222
 
108
- View the [changelog](https://github.com/ankane/mitie/blob/master/CHANGELOG.md)
223
+ View the [changelog](https://github.com/ankane/mitie-ruby/blob/master/CHANGELOG.md)
109
224
 
110
225
  ## Contributing
111
226
 
112
227
  Everyone is encouraged to help improve this project. Here are a few ways you can help:
113
228
 
114
- - [Report bugs](https://github.com/ankane/mitie/issues)
115
- - Fix bugs and [submit pull requests](https://github.com/ankane/mitie/pulls)
229
+ - [Report bugs](https://github.com/ankane/mitie-ruby/issues)
230
+ - Fix bugs and [submit pull requests](https://github.com/ankane/mitie-ruby/pulls)
116
231
  - Write, clarify, or fix documentation
117
232
  - Suggest or add new features
118
233
 
119
234
  To get started with development:
120
235
 
121
236
  ```sh
122
- git clone https://github.com/ankane/mitie.git
123
- cd mitie
237
+ git clone https://github.com/ankane/mitie-ruby.git
238
+ cd mitie-ruby
124
239
  bundle install
125
240
  bundle exec rake vendor:all
126
241
 
data/lib/mitie/binary_relation_detector.rb CHANGED
@@ -1,9 +1,16 @@
1
1
  module Mitie
2
2
  class BinaryRelationDetector
3
- def initialize(path)
4
- # better error message
5
- raise ArgumentError, "File does not exist" unless File.exist?(path)
6
- @pointer = FFI.mitie_load_binary_relation_detector(path)
3
+ def initialize(path = nil, pointer: nil)
4
+ if path
5
+ # better error message
6
+ raise ArgumentError, "File does not exist" unless File.exist?(path)
7
+ @pointer = FFI.mitie_load_binary_relation_detector(path)
8
+ elsif pointer
9
+ @pointer = pointer
10
+ else
11
+ raise ArgumentError, "Must pass either a path or a pointer"
12
+ end
13
+
7
14
  ObjectSpace.define_finalizer(self, self.class.finalize(pointer))
8
15
  end
9
16
 
@@ -23,37 +30,52 @@ module Mitie
23
30
 
24
31
  relations = []
25
32
  combinations.each do |entity1, entity2|
26
- relation =
27
- FFI.mitie_extract_binary_relation(
28
- doc.model.pointer,
29
- doc.send(:tokens_ptr),
30
- entity1[:token_index],
31
- entity1[:token_length],
32
- entity2[:token_index],
33
- entity2[:token_length]
34
- )
35
-
36
- score_ptr = Fiddle::Pointer.malloc(Fiddle::SIZEOF_DOUBLE)
37
- status = FFI.mitie_classify_binary_relation(pointer, relation, score_ptr)
38
- raise "Bad status: #{status}" if status != 0
39
- score = score_ptr.to_s(Fiddle::SIZEOF_DOUBLE).unpack1("d")
40
- if score > 0
41
- relations << {
42
- first: entity1[:text],
43
- second: entity2[:text],
44
- score: score
45
- }
46
- end
33
+ relation = extract_relation(doc, entity1, entity2)
34
+ relations << relation if relation
47
35
  end
48
36
  relations
49
37
  end
50
38
 
39
+ def save_to_disk(filename)
40
+ if FFI.mitie_save_binary_relation_detector(filename, pointer) != 0
41
+ raise Error, "Unable to save detector"
42
+ end
43
+ nil
44
+ end
45
+
51
46
  private
52
47
 
53
48
  def pointer
54
49
  @pointer
55
50
  end
56
51
 
52
+ def extract_relation(doc, entity1, entity2)
53
+ relation =
54
+ FFI.mitie_extract_binary_relation(
55
+ doc.model.pointer,
56
+ doc.send(:tokens_ptr),
57
+ entity1[:token_index],
58
+ entity1[:token_length],
59
+ entity2[:token_index],
60
+ entity2[:token_length]
61
+ )
62
+
63
+ score_ptr = Fiddle::Pointer.malloc(Fiddle::SIZEOF_DOUBLE)
64
+ status = FFI.mitie_classify_binary_relation(pointer, relation, score_ptr)
65
+ raise Error, "Bad status: #{status}" if status != 0
66
+
67
+ score = Utils.read_double(score_ptr)
68
+ if score > 0
69
+ {
70
+ first: entity1[:text],
71
+ second: entity2[:text],
72
+ score: score
73
+ }
74
+ end
75
+ ensure
76
+ FFI.mitie_free(relation) if relation
77
+ end
78
+
57
79
  def self.finalize(pointer)
58
80
  # must use proc instead of stabby lambda
59
81
  proc { FFI.mitie_free(pointer) }
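The new `extract_relation` frees the native relation in an `ensure` clause, so cleanup runs whether the method returns a hash, returns `nil`, or raises. A minimal sketch of that pattern; `FakeHandle` and `score_with_cleanup` are hypothetical stand-ins, not gem code:

```ruby
# Hypothetical stand-in for a native handle; records whether it was freed.
class FakeHandle
  attr_reader :freed

  def initialize
    @freed = false
  end

  def free
    @freed = true
  end
end

# Mirrors the shape of extract_relation: compute a result, free the handle
# in `ensure`. The value of the ensure clause is discarded, so the method
# still returns the hash, or nil when the score is not positive.
def score_with_cleanup(handle, score)
  raise ArgumentError, "score is required" if score.nil?

  if score > 0
    {score: score}
  end
ensure
  handle.free if handle
end

handle = FakeHandle.new
p score_with_cleanup(handle, 0.5)
p handle.freed
```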
data/lib/mitie/binary_relation_trainer.rb ADDED
@@ -0,0 +1,87 @@
+ module Mitie
+   class BinaryRelationTrainer
+     def initialize(ner, name: "")
+       @pointer = FFI.mitie_create_binary_relation_trainer(name, ner.pointer)
+
+       ObjectSpace.define_finalizer(self, self.class.finalize(@pointer))
+     end
+
+     def add_positive_binary_relation(tokens, range1, range2)
+       check_add(tokens, range1, range2)
+
+       tokens_pointer = Utils.array_to_pointer(tokens)
+       status = FFI.mitie_add_positive_binary_relation(@pointer, tokens_pointer, range1.begin, range1.size, range2.begin, range2.size)
+       if status != 0
+         raise Error, "Unable to add binary relation"
+       end
+     end
+
+     def add_negative_binary_relation(tokens, range1, range2)
+       check_add(tokens, range1, range2)
+
+       tokens_pointer = Utils.array_to_pointer(tokens)
+       status = FFI.mitie_add_negative_binary_relation(@pointer, tokens_pointer, range1.begin, range1.size, range2.begin, range2.size)
+       if status != 0
+         raise Error, "Unable to add binary relation"
+       end
+     end
+
+     def beta
+       FFI.mitie_binary_relation_trainer_get_beta(@pointer)
+     end
+
+     def beta=(value)
+       raise ArgumentError, "beta must be greater than or equal to zero" unless value >= 0
+
+       FFI.mitie_binary_relation_trainer_set_beta(@pointer, value)
+     end
+
+     def num_threads
+       FFI.mitie_binary_relation_trainer_get_num_threads(@pointer)
+     end
+
+     def num_threads=(value)
+       FFI.mitie_binary_relation_trainer_set_num_threads(@pointer, value)
+     end
+
+     def num_positive_examples
+       FFI.mitie_binary_relation_trainer_num_positive_examples(@pointer)
+     end
+
+     def num_negative_examples
+       FFI.mitie_binary_relation_trainer_num_negative_examples(@pointer)
+     end
+
+     def train
+       if num_positive_examples + num_negative_examples == 0
+         raise Error, "You can't call train() on an empty trainer"
+       end
+
+       detector = FFI.mitie_train_binary_relation_detector(@pointer)
+
+       raise Error, "Unable to create binary relation detector. Probably ran out of RAM." if detector.null?
+
+       Mitie::BinaryRelationDetector.new(pointer: detector)
+     end
+
+     private
+
+     def check_add(tokens, range1, range2)
+       Utils.check_range(range1, tokens.size)
+       Utils.check_range(range2, tokens.size)
+
+       if entities_overlap?(range1, range2)
+         raise ArgumentError, "Entities overlap"
+       end
+     end
+
+     def entities_overlap?(range1, range2)
+       FFI.mitie_entities_overlap(range1.begin, range1.size, range2.begin, range2.size) == 1
+     end
+
+     def self.finalize(pointer)
+       # must use proc instead of stabby lambda
+       proc { FFI.mitie_free(pointer) }
+     end
+   end
+ end
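`check_add` rejects ranges that fall outside the token list and pairs of entities that share tokens. The sketch below reproduces those checks in plain Ruby so they can be tried without the native library. The overlap test is an assumption about what `mitie_entities_overlap` computes (intersection of two start/length intervals), not the native implementation:

```ruby
# Pure-Ruby approximation (an assumption, not the native code): two entities
# overlap when their half-open token intervals [start, start + length) intersect.
def entities_overlap?(range1, range2)
  stop1 = range1.begin + range1.size
  stop2 = range2.begin + range2.size
  range1.begin < stop2 && range2.begin < stop1
end

# Same checks as check_add, with the test from Utils.check_range inlined.
def check_add(tokens, range1, range2)
  [range1, range2].each do |range|
    if range.none? || !(0..(tokens.size - 1)).cover?(range)
      raise ArgumentError, "Invalid range"
    end
  end
  raise ArgumentError, "Entities overlap" if entities_overlap?(range1, range2)
end

tokens = ["Shopify", "was", "founded", "in", "Ottawa"]
check_add(tokens, 0..0, 4..4) # passes silently

begin
  check_add(tokens, 0..1, 1..2)
rescue ArgumentError => e
  puts e.message
end
```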
data/lib/mitie/document.rb CHANGED
@@ -33,37 +33,35 @@ module Mitie
33
33
 
34
34
  def entities
35
35
  @entities ||= begin
36
- begin
37
- entities = []
38
- tokens = tokens_with_offset
39
- detections = FFI.mitie_extract_entities(pointer, tokens_ptr)
40
- num_detections = FFI.mitie_ner_get_num_detections(detections)
41
- num_detections.times do |i|
42
- pos = FFI.mitie_ner_get_detection_position(detections, i)
43
- len = FFI.mitie_ner_get_detection_length(detections, i)
44
- tag = FFI.mitie_ner_get_detection_tagstr(detections, i).to_s
45
- score = FFI.mitie_ner_get_detection_score(detections, i)
46
- tok = tokens[pos, len]
47
- offset = tok[0][1]
36
+ entities = []
37
+ tokens = tokens_with_offset
38
+ detections = FFI.mitie_extract_entities(pointer, tokens_ptr)
39
+ num_detections = FFI.mitie_ner_get_num_detections(detections)
40
+ num_detections.times do |i|
41
+ pos = FFI.mitie_ner_get_detection_position(detections, i)
42
+ len = FFI.mitie_ner_get_detection_length(detections, i)
43
+ tag = FFI.mitie_ner_get_detection_tagstr(detections, i).to_s
44
+ score = FFI.mitie_ner_get_detection_score(detections, i)
45
+ tok = tokens[pos, len]
46
+ offset = tok[0][1]
48
47
 
49
- entity = {}
50
- if offset
51
- finish = tok[-1][1] + tok[-1][0].bytesize
52
- entity[:text] = text.byteslice(offset...finish)
53
- else
54
- entity[:text] = tok.map(&:first)
55
- end
56
- entity[:tag] = tag
57
- entity[:score] = score
58
- entity[:offset] = offset if offset
59
- entity[:token_index] = pos
60
- entity[:token_length] = len
61
- entities << entity
48
+ entity = {}
49
+ if offset
50
+ finish = tok[-1][1] + tok[-1][0].bytesize
51
+ entity[:text] = text.byteslice(offset...finish)
52
+ else
53
+ entity[:text] = tok.map(&:first)
62
54
  end
63
- entities
64
- ensure
65
- FFI.mitie_free(detections) if detections
55
+ entity[:tag] = tag
56
+ entity[:score] = score
57
+ entity[:offset] = offset if offset
58
+ entity[:token_index] = pos
59
+ entity[:token_length] = len
60
+ entities << entity
66
61
  end
62
+ entities
63
+ ensure
64
+ FFI.mitie_free(detections) if detections
67
65
  end
68
66
  end
69
67
 
@@ -84,11 +82,7 @@ module Mitie
84
82
  def tokenize
85
83
  @tokenize ||= begin
86
84
  if text.is_a?(Array)
87
- # malloc uses memset to set all bytes to 0
88
- tokens_ptr = Fiddle::Pointer.malloc(Fiddle::SIZEOF_VOIDP * (text.size + 1))
89
- text.size.times do |i|
90
- tokens_ptr[i * Fiddle::SIZEOF_VOIDP, Fiddle::SIZEOF_VOIDP] = Fiddle::Pointer.to_ptr(text[i]).ref
91
- end
85
+ tokens_ptr = Utils.array_to_pointer(text)
92
86
  [tokens_ptr, nil]
93
87
  else
94
88
  offsets_ptr = Fiddle::Pointer.malloc(Fiddle::SIZEOF_VOIDP)
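The `entities` method in this file rebuilds entity text from byte offsets with `bytesize` and `byteslice`, which is what keeps multibyte text intact. A minimal sketch of that arithmetic; the `[token, byte_offset]` pairs are written by hand here as a stand-in for what `tokens_with_offset` would supply, and `entity_text` is a hypothetical helper:

```ruby
text = "Café Müller is in Zürich"

# Hand-written [token, byte_offset] pairs (assumed values for illustration).
# "é" and "ü" take two bytes each in UTF-8, so byte offsets run ahead of
# character offsets.
tokens = [["Café", 0], ["Müller", 6], ["is", 14], ["in", 17], ["Zürich", 20]]

# Same arithmetic as `entities`: start at the first token's byte offset and
# end after the last token's bytes.
def entity_text(text, tok)
  offset = tok[0][1]
  finish = tok[-1][1] + tok[-1][0].bytesize
  text.byteslice(offset...finish)
end

puts entity_text(text, tokens[0, 2]) # Café Müller
puts entity_text(text, tokens[4, 1]) # Zürich
```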
data/lib/mitie/ffi.rb CHANGED
@@ -10,14 +10,17 @@ module Mitie
10
10
  raise e
11
11
  end
12
12
 
13
+ # https://github.com/mit-nlp/MITIE/blob/master/mitielib/include/mitie.h
14
+
13
15
  extern "void mitie_free(void* object)"
14
16
  extern "char** mitie_tokenize(const char* text)"
17
+ extern "char** mitie_tokenize_file(const char* filename)"
15
18
  extern "char** mitie_tokenize_with_offsets(const char* text, unsigned long** token_offsets)"
16
19
 
20
+ # ner
17
21
  extern "mitie_named_entity_extractor* mitie_load_named_entity_extractor(const char* filename)"
18
22
  extern "unsigned long mitie_get_num_possible_ner_tags(const mitie_named_entity_extractor* ner)"
19
23
  extern "const char* mitie_get_named_entity_tagstr(const mitie_named_entity_extractor* ner, unsigned long idx)"
20
-
21
24
  extern "mitie_named_entity_detections* mitie_extract_entities(const mitie_named_entity_extractor* ner, char** tokens)"
22
25
  extern "unsigned long mitie_ner_get_num_detections(const mitie_named_entity_detections* dets)"
23
26
  extern "unsigned long mitie_ner_get_detection_position(const mitie_named_entity_detections* dets, unsigned long idx)"
@@ -26,10 +29,57 @@ module Mitie
26
29
  extern "const char* mitie_ner_get_detection_tagstr(const mitie_named_entity_detections* dets, unsigned long idx)"
27
30
  extern "double mitie_ner_get_detection_score(const mitie_named_entity_detections* dets, unsigned long idx)"
28
31
 
32
+ # binary relation detector
29
33
  extern "mitie_binary_relation_detector* mitie_load_binary_relation_detector(const char* filename)"
30
34
  extern "const char* mitie_binary_relation_detector_name_string(const mitie_binary_relation_detector* detector)"
31
35
  extern "int mitie_entities_overlap(unsigned long arg1_start, unsigned long arg1_length, unsigned long arg2_start, unsigned long arg2_length)"
32
36
  extern "mitie_binary_relation* mitie_extract_binary_relation(const mitie_named_entity_extractor* ner, char** tokens, unsigned long arg1_start, unsigned long arg1_length, unsigned long arg2_start, unsigned long arg2_length)"
33
37
  extern "int mitie_classify_binary_relation(const mitie_binary_relation_detector* detector, const mitie_binary_relation* relation, double* score)"
38
+
39
+ # text categorizer
40
+ extern "mitie_text_categorizer* mitie_load_text_categorizer(const char* filename)"
41
+ extern "int mitie_categorize_text(const mitie_text_categorizer* tcat, const char** tokens, char** text_tag, double* text_score)"
42
+
43
+ # save
44
+ extern "int mitie_save_named_entity_extractor(const char* filename, const mitie_named_entity_extractor* ner)"
45
+ extern "int mitie_save_binary_relation_detector(const char* filename, const mitie_binary_relation_detector* detector)"
46
+ extern "int mitie_save_text_categorizer(const char* filename, const mitie_text_categorizer* tcat)"
47
+
48
+ # ner trainer
49
+ extern "mitie_ner_training_instance* mitie_create_ner_training_instance(char** tokens)"
50
+ extern "unsigned long mitie_ner_training_instance_num_entities(const mitie_ner_training_instance* instance)"
51
+ extern "unsigned long mitie_ner_training_instance_num_tokens(const mitie_ner_training_instance* instance)"
52
+ extern "int mitie_overlaps_any_entity(mitie_ner_training_instance* instance, unsigned long start, unsigned long length)"
53
+ extern "int mitie_add_ner_training_entity(mitie_ner_training_instance* instance, unsigned long start, unsigned long length, const char* label)"
54
+ extern "mitie_ner_trainer* mitie_create_ner_trainer(const char* filename)"
55
+ extern "unsigned long mitie_ner_trainer_size(const mitie_ner_trainer* trainer)"
56
+ extern "int mitie_add_ner_training_instance(mitie_ner_trainer* trainer, const mitie_ner_training_instance* instance)"
57
+ extern "void mitie_ner_trainer_set_beta(mitie_ner_trainer* trainer, double beta)"
58
+ extern "double mitie_ner_trainer_get_beta(const mitie_ner_trainer* trainer)"
59
+ extern "void mitie_ner_trainer_set_num_threads(mitie_ner_trainer* trainer, unsigned long num_threads)"
60
+ extern "unsigned long mitie_ner_trainer_get_num_threads(const mitie_ner_trainer* trainer)"
61
+ extern "mitie_named_entity_extractor* mitie_train_named_entity_extractor(const mitie_ner_trainer* trainer)"
62
+
63
+ # binary relation trainer
64
+ extern "mitie_binary_relation_trainer* mitie_create_binary_relation_trainer(const char* relation_name, const mitie_named_entity_extractor* ner)"
65
+ extern "unsigned long mitie_binary_relation_trainer_num_positive_examples(const mitie_binary_relation_trainer* trainer)"
66
+ extern "unsigned long mitie_binary_relation_trainer_num_negative_examples(const mitie_binary_relation_trainer* trainer)"
67
+ extern "int mitie_add_positive_binary_relation(mitie_binary_relation_trainer* trainer, char** tokens, unsigned long arg1_start, unsigned long arg1_length, unsigned long arg2_start, unsigned long arg2_length)"
68
+ extern "int mitie_add_negative_binary_relation(mitie_binary_relation_trainer* trainer, char** tokens, unsigned long arg1_start, unsigned long arg1_length, unsigned long arg2_start, unsigned long arg2_length)"
69
+ extern "void mitie_binary_relation_trainer_set_beta(mitie_binary_relation_trainer* trainer, double beta)"
70
+ extern "double mitie_binary_relation_trainer_get_beta(const mitie_binary_relation_trainer* trainer)"
71
+ extern "void mitie_binary_relation_trainer_set_num_threads(mitie_binary_relation_trainer* trainer, unsigned long num_threads)"
72
+ extern "unsigned long mitie_binary_relation_trainer_get_num_threads(const mitie_binary_relation_trainer* trainer)"
73
+ extern "mitie_binary_relation_detector* mitie_train_binary_relation_detector(const mitie_binary_relation_trainer* trainer)"
74
+
75
+ # text categorizer trainer
76
+ extern "mitie_text_categorizer_trainer* mitie_create_text_categorizer_trainer(const char* filename)"
77
+ extern "unsigned long mitie_text_categorizer_trainer_size(const mitie_text_categorizer_trainer* trainer)"
78
+ extern "void mitie_text_categorizer_trainer_set_beta(mitie_text_categorizer_trainer* trainer, double beta)"
79
+ extern "double mitie_text_categorizer_trainer_get_beta(const mitie_text_categorizer_trainer* trainer)"
80
+ extern "void mitie_text_categorizer_trainer_set_num_threads(mitie_text_categorizer_trainer* trainer, unsigned long num_threads)"
81
+ extern "unsigned long mitie_text_categorizer_trainer_get_num_threads(const mitie_text_categorizer_trainer* trainer)"
82
+ extern "int mitie_add_text_categorizer_labeled_text(mitie_text_categorizer_trainer* trainer, const char** tokens, const char* label)"
83
+ extern "mitie_text_categorizer* mitie_train_text_categorizer(const mitie_text_categorizer_trainer* trainer)"
34
84
  end
35
85
  end
data/lib/mitie/ner.rb CHANGED
@@ -2,11 +2,18 @@ module Mitie
2
2
  class NER
3
3
  attr_reader :pointer
4
4
 
5
- def initialize(path)
6
- # better error message
7
- raise ArgumentError, "File does not exist" unless File.exist?(path)
8
- @pointer = FFI.mitie_load_named_entity_extractor(path)
9
- ObjectSpace.define_finalizer(self, self.class.finalize(pointer))
5
+ def initialize(path = nil, pointer: nil)
6
+ if path
7
+ # better error message
8
+ raise ArgumentError, "File does not exist" unless File.exist?(path)
9
+ @pointer = FFI.mitie_load_named_entity_extractor(path)
10
+ elsif pointer
11
+ @pointer = pointer
12
+ else
13
+ raise ArgumentError, "Must pass either a path or a pointer"
14
+ end
15
+
16
+ ObjectSpace.define_finalizer(self, self.class.finalize(@pointer))
10
17
  end
11
18
 
12
19
  def tags
@@ -23,6 +30,13 @@ module Mitie
23
30
  doc(text).entities
24
31
  end
25
32
 
33
+ def save_to_disk(filename)
34
+ if FFI.mitie_save_named_entity_extractor(filename, pointer) != 0
35
+ raise Error, "Unable to save model"
36
+ end
37
+ nil
38
+ end
39
+
26
40
  def tokens(text)
27
41
  doc(text).tokens
28
42
  end
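`NER`, `BinaryRelationDetector`, and `TextCategorizer` now share one constructor shape: a positional path for loading from disk, or a `pointer:` keyword for wrapping a freshly trained model. A sketch of that shape in isolation; the `Model` class and its `source` reader are hypothetical:

```ruby
require "tempfile"

# Hypothetical class showing the either-or constructor. It records which
# branch was taken instead of loading anything.
class Model
  attr_reader :source

  def initialize(path = nil, pointer: nil)
    if path
      raise ArgumentError, "File does not exist" unless File.exist?(path)
      @source = [:path, path]
    elsif pointer
      @source = [:pointer, pointer]
    else
      raise ArgumentError, "Must pass either a path or a pointer"
    end
  end
end

p Model.new(pointer: :native_handle).source

Tempfile.create("model") do |f|
  p Model.new(f.path).source.first
end
```

When both arguments are given, the path takes precedence because it is checked first.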
data/lib/mitie/ner_trainer.rb ADDED
@@ -0,0 +1,51 @@
+ module Mitie
+   class NERTrainer
+     def initialize(filename)
+       raise ArgumentError, "File does not exist" unless File.exist?(filename)
+       @pointer = FFI.mitie_create_ner_trainer(filename)
+
+       ObjectSpace.define_finalizer(self, self.class.finalize(@pointer))
+     end
+
+     def add(instance)
+       FFI.mitie_add_ner_training_instance(@pointer, instance.pointer)
+     end
+
+     def beta
+       FFI.mitie_ner_trainer_get_beta(@pointer)
+     end
+
+     def beta=(value)
+       raise ArgumentError, "beta must be greater than or equal to zero" unless value >= 0
+
+       FFI.mitie_ner_trainer_set_beta(@pointer, value)
+     end
+
+     def num_threads
+       FFI.mitie_ner_trainer_get_num_threads(@pointer)
+     end
+
+     def num_threads=(value)
+       FFI.mitie_ner_trainer_set_num_threads(@pointer, value)
+     end
+
+     def size
+       FFI.mitie_ner_trainer_size(@pointer)
+     end
+
+     def train
+       raise Error, "You can't call train() on an empty trainer" if size.zero?
+
+       extractor = FFI.mitie_train_named_entity_extractor(@pointer)
+
+       raise Error, "Unable to create named entity extractor. Probably ran out of RAM." if extractor.null?
+
+       Mitie::NER.new(pointer: extractor)
+     end
+
+     def self.finalize(pointer)
+       # must use proc instead of stabby lambda
+       proc { FFI.mitie_free(pointer) }
+     end
+   end
+ end
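The "must use proc instead of stabby lambda" comment comes down to arity. Ruby calls a finalizer with the object's ID as an argument; a parameterless proc ignores the extra argument, while a parameterless lambda raises `ArgumentError`. The sketch below calls each form directly with an argument rather than waiting for garbage collection; the helper names and the `log` array are illustrative only:

```ruby
# Same shape as self.finalize: the block closes over the pointer and
# declares no parameters.
def proc_finalizer(log, pointer)
  proc { log << pointer }
end

# The form the comment warns against.
def lambda_finalizer(log, pointer)
  -> { log << pointer }
end

log = []
fake_object_id = 1234 # stand-in for the ID Ruby passes to finalizers

proc_finalizer(log, :ptr_a).call(fake_object_id) # extra argument is ignored

begin
  lambda_finalizer(log, :ptr_b).call(fake_object_id)
rescue ArgumentError => e
  puts "lambda rejected the argument: #{e.message}"
end

p log
```

Building the proc in a class method also keeps it from closing over the instance, which would otherwise keep the object alive.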
data/lib/mitie/ner_training_instance.rb ADDED
@@ -0,0 +1,45 @@
+ module Mitie
+   class NERTrainingInstance
+     attr_reader :pointer
+
+     def initialize(tokens)
+       tokens_pointer = Utils.array_to_pointer(tokens)
+
+       @pointer = FFI.mitie_create_ner_training_instance(tokens_pointer)
+       raise Error, "Unable to create training instance. Probably ran out of RAM." if @pointer.null?
+
+       ObjectSpace.define_finalizer(self, self.class.finalize(@pointer))
+     end
+
+     def add_entity(range, label)
+       Utils.check_range(range, num_tokens)
+
+       raise ArgumentError, "Range overlaps existing entity" if overlaps_any_entity?(range)
+
+       unless FFI.mitie_add_ner_training_entity(@pointer, range.begin, range.size, label).zero?
+         raise Error, "Unable to add entity to training instance. Probably ran out of RAM."
+       end
+
+       nil
+     end
+
+     def num_entities
+       FFI.mitie_ner_training_instance_num_entities(@pointer)
+     end
+
+     def num_tokens
+       FFI.mitie_ner_training_instance_num_tokens(@pointer)
+     end
+
+     def overlaps_any_entity?(range)
+       Utils.check_range(range, num_tokens)
+
+       FFI.mitie_overlaps_any_entity(@pointer, range.begin, range.size) == 1
+     end
+
+     def self.finalize(pointer)
+       # must use proc instead of stabby lambda
+       proc { FFI.mitie_free(pointer) }
+     end
+   end
+ end
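`add_entity` takes a Ruby range but the native call expects a start index and a length, which is why the code passes `range.begin` and `range.size`. A short sketch of that conversion; `start_and_length` is a hypothetical helper:

```ruby
# Converts a token range into the (start, length) pair handed to the native
# call, the same way add_entity uses range.begin and range.size.
def start_and_length(range)
  [range.begin, range.size]
end

p start_and_length(3..4)  # [3, 2]
p start_and_length(6..6)  # [6, 1]
p start_and_length(3...5) # [3, 2]
```

Inclusive and exclusive ranges that cover the same tokens produce the same pair.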
data/lib/mitie/text_categorizer.rb ADDED
@@ -0,0 +1,48 @@
+ module Mitie
+   class TextCategorizer
+     def initialize(path = nil, pointer: nil)
+       if path
+         # better error message
+         raise ArgumentError, "File does not exist" unless File.exist?(path)
+         @pointer = FFI.mitie_load_text_categorizer(path)
+       elsif pointer
+         @pointer = pointer
+       else
+         raise ArgumentError, "Must pass either a path or a pointer"
+       end
+
+       ObjectSpace.define_finalizer(self, self.class.finalize(@pointer))
+     end
+
+     def categorize(text)
+       tokens = text.is_a?(Array) ? text : Mitie.tokenize(text)
+       tokens_pointer = Utils.array_to_pointer(tokens)
+       text_tag = Fiddle::Pointer.malloc(Fiddle::SIZEOF_VOIDP)
+       text_score = Fiddle::Pointer.malloc(Fiddle::SIZEOF_DOUBLE)
+
+       if FFI.mitie_categorize_text(@pointer, tokens_pointer, text_tag, text_score) != 0
+         raise Error, "Unable to categorize"
+       end
+
+       {
+         tag: text_tag.ptr.to_s,
+         score: Utils.read_double(text_score)
+       }
+     ensure
+       # text_tag must be freed
+       FFI.mitie_free(text_tag.ptr) if text_tag
+     end
+
+     def save_to_disk(filename)
+       if FFI.mitie_save_text_categorizer(filename, @pointer) != 0
+         raise Error, "Unable to save model"
+       end
+       nil
+     end
+
+     def self.finalize(pointer)
+       # must use proc instead of stabby lambda
+       proc { FFI.mitie_free(pointer) }
+     end
+   end
+ end
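`categorize` returns a hash with `:tag` and `:score` keys. One possible way to consume it is to keep the tag only when the score clears a caller-chosen threshold. This is a sketch: `confident_tag`, the `0.5` default, and the sample values are all assumptions for illustration, not library behavior:

```ruby
# Hypothetical post-processing: return the tag only when the score reaches
# the threshold. The 0.5 default is arbitrary, not a library default.
def confident_tag(result, threshold: 0.5)
  result[:score] >= threshold ? result[:tag] : nil
end

# Made-up values in the shape that categorize returns.
sample = {tag: "positive", score: 0.73}

p confident_tag(sample)
p confident_tag(sample, threshold: 0.9)
```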
data/lib/mitie/text_categorizer_trainer.rb ADDED
@@ -0,0 +1,53 @@
+ module Mitie
+   class TextCategorizerTrainer
+     def initialize(filename)
+       raise ArgumentError, "File does not exist" unless File.exist?(filename)
+       @pointer = FFI.mitie_create_text_categorizer_trainer(filename)
+
+       ObjectSpace.define_finalizer(self, self.class.finalize(@pointer))
+     end
+
+     def add(text, label)
+       tokens = text.is_a?(Array) ? text : Mitie.tokenize(text)
+       tokens_pointer = Utils.array_to_pointer(tokens)
+       FFI.mitie_add_text_categorizer_labeled_text(@pointer, tokens_pointer, label)
+     end
+
+     def beta
+       FFI.mitie_text_categorizer_trainer_get_beta(@pointer)
+     end
+
+     def beta=(value)
+       raise ArgumentError, "beta must be greater than or equal to zero" unless value >= 0
+
+       FFI.mitie_text_categorizer_trainer_set_beta(@pointer, value)
+     end
+
+     def num_threads
+       FFI.mitie_text_categorizer_trainer_get_num_threads(@pointer)
+     end
+
+     def num_threads=(value)
+       FFI.mitie_text_categorizer_trainer_set_num_threads(@pointer, value)
+     end
+
+     def size
+       FFI.mitie_text_categorizer_trainer_size(@pointer)
+     end
+
+     def train
+       raise Error, "You can't call train() on an empty trainer" if size.zero?
+
+       categorizer = FFI.mitie_train_text_categorizer(@pointer)
+
+       raise Error, "Unable to create text categorizer. Probably ran out of RAM." if categorizer.null?
+
+       Mitie::TextCategorizer.new(pointer: categorizer)
+     end
+
+     def self.finalize(pointer)
+       # must use proc instead of stabby lambda
+       proc { FFI.mitie_free(pointer) }
+     end
+   end
+ end
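Both `add` and `categorize` accept either a token array or a raw string, tokenizing only when needed. A sketch of that normalization step; `simple_tokenize` is a whitespace split used as a stand-in and is an assumption, since `Mitie.tokenize` delegates to MITIE's own tokenizer:

```ruby
# Stand-in tokenizer (an assumption): a plain whitespace split, which is
# cruder than MITIE's tokenizer.
def simple_tokenize(text)
  text.split
end

# Same branch as in add and categorize: arrays pass through untouched,
# strings are tokenized first.
def normalize_tokens(text)
  text.is_a?(Array) ? text : simple_tokenize(text)
end

p normalize_tokens("This is super cool")
p normalize_tokens(["This", "is", "super", "cool"])
```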
data/lib/mitie/utils.rb ADDED
@@ -0,0 +1,22 @@
+ module Mitie
+   module Utils
+     def self.array_to_pointer(text)
+       # malloc uses memset to set all bytes to 0
+       tokens_ptr = Fiddle::Pointer.malloc(Fiddle::SIZEOF_VOIDP * (text.size + 1))
+       text.size.times do |i|
+         tokens_ptr[i * Fiddle::SIZEOF_VOIDP, Fiddle::SIZEOF_VOIDP] = Fiddle::Pointer.to_ptr(text[i]).ref
+       end
+       tokens_ptr
+     end
+
+     def self.check_range(range, num_tokens)
+       if range.none? || !(0..(num_tokens - 1)).cover?(range)
+         raise ArgumentError, "Invalid range"
+       end
+     end
+
+     def self.read_double(ptr)
+       ptr.to_s(Fiddle::SIZEOF_DOUBLE).unpack1("d")
+     end
+   end
+ end
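`Utils.read_double` decodes an out-parameter: the native call writes a C `double` into a buffer and Ruby unpacks the raw bytes. A minimal round-trip sketch using only Fiddle from the standard library; the write step simulates what a native function would do and is not a MITIE call:

```ruby
require "fiddle"

# Same decoding as Utils.read_double: take SIZEOF_DOUBLE raw bytes from the
# pointer and unpack them as a native-endian double.
def read_double(ptr)
  ptr.to_s(Fiddle::SIZEOF_DOUBLE).unpack1("d")
end

# Allocate the out-parameter buffer. Passing RUBY_FREE lets Fiddle release
# the memory when the pointer object is collected.
score_ptr = Fiddle::Pointer.malloc(Fiddle::SIZEOF_DOUBLE, Fiddle::RUBY_FREE)

# Simulate a native function writing its result into the buffer.
score_ptr[0, Fiddle::SIZEOF_DOUBLE] = [0.25].pack("d")

p read_double(score_ptr)
```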
data/lib/mitie/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Mitie
-   VERSION = "0.1.5"
+   VERSION = "0.2.1"
  end
data/lib/mitie.rb CHANGED
@@ -3,8 +3,14 @@ require "fiddle/import"
3
3
 
4
4
  # modules
5
5
  require "mitie/binary_relation_detector"
6
+ require "mitie/binary_relation_trainer"
6
7
  require "mitie/document"
7
8
  require "mitie/ner"
9
+ require "mitie/ner_training_instance"
10
+ require "mitie/ner_trainer"
11
+ require "mitie/text_categorizer"
12
+ require "mitie/text_categorizer_trainer"
13
+ require "mitie/utils"
8
14
  require "mitie/version"
9
15
 
10
16
  module Mitie
@@ -16,10 +22,12 @@ module Mitie
16
22
  lib_name =
17
23
  if Gem.win_platform?
18
24
  "mitie.dll"
19
- elsif RbConfig::CONFIG["arch"] =~ /arm64-darwin/i
20
- "libmitie.arm64.dylib"
21
25
  elsif RbConfig::CONFIG["host_os"] =~ /darwin/i
22
- "libmitie.dylib"
26
+ if RbConfig::CONFIG["host_cpu"] =~ /arm|aarch64/i
27
+ "libmitie.arm64.dylib"
28
+ else
29
+ "libmitie.dylib"
30
+ end
23
31
  else
24
32
  "libmitie.so"
25
33
  end
@@ -28,4 +36,36 @@ module Mitie
28
36
 
29
37
  # friendlier error message
30
38
  autoload :FFI, "mitie/ffi"
39
+
40
+ class << self
41
+ def tokenize(text)
42
+ tokens_ptr = FFI.mitie_tokenize(text)
43
+ tokens = read_tokens(tokens_ptr)
44
+ tokens.each { |t| t.force_encoding(text.encoding) }
45
+ tokens
46
+ ensure
47
+ FFI.mitie_free(tokens_ptr) if tokens_ptr
48
+ end
49
+
50
+ def tokenize_file(filename)
51
+ tokens_ptr = FFI.mitie_tokenize_file(filename)
52
+ read_tokens(tokens_ptr)
53
+ ensure
54
+ FFI.mitie_free(tokens_ptr) if tokens_ptr
55
+ end
56
+
57
+ private
58
+
59
+ def read_tokens(tokens_ptr)
60
+ i = 0
61
+ tokens = []
62
+ loop do
63
+ token = (tokens_ptr + i * Fiddle::SIZEOF_VOIDP).ptr
64
+ break if token.null?
65
+ tokens << token.to_s
66
+ i += 1
67
+ end
68
+ tokens
69
+ end
70
+ end
31
71
  end
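The reworked library lookup checks the OS first and the CPU second, so ARM detection now applies only on macOS. Expressed as a pure function, each branch can be exercised without touching `RbConfig`. This is a sketch: `lib_name_for` and its parameter names are illustrative, and the sample `host_os` and `host_cpu` strings are assumed values:

```ruby
# Pure-function version of the lib_name selection in lib/mitie.rb.
def lib_name_for(windows:, host_os:, host_cpu:)
  if windows
    "mitie.dll"
  elsif host_os =~ /darwin/i
    if host_cpu =~ /arm|aarch64/i
      "libmitie.arm64.dylib"
    else
      "libmitie.dylib"
    end
  else
    "libmitie.so"
  end
end

p lib_name_for(windows: false, host_os: "darwin21", host_cpu: "arm64")
p lib_name_for(windows: false, host_os: "darwin21", host_cpu: "x86_64")
p lib_name_for(windows: false, host_os: "linux-gnu", host_cpu: "aarch64")
```

Note that the last call still selects `libmitie.so`, since the CPU check sits inside the macOS branch.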
metadata CHANGED
@@ -1,17 +1,17 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: mitie
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.5
4
+ version: 0.2.1
5
5
  platform: ruby
6
6
  authors:
7
7
  - Andrew Kane
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2021-01-30 00:00:00.000000000 Z
11
+ date: 2022-06-12 00:00:00.000000000 Z
12
12
  dependencies: []
13
13
  description:
14
- email: andrew@chartkick.com
14
+ email: andrew@ankane.org
15
15
  executables: []
16
16
  extensions: []
17
17
  extra_rdoc_files: []
@@ -21,16 +21,22 @@ files:
21
21
  - README.md
22
22
  - lib/mitie.rb
23
23
  - lib/mitie/binary_relation_detector.rb
24
+ - lib/mitie/binary_relation_trainer.rb
24
25
  - lib/mitie/document.rb
25
26
  - lib/mitie/ffi.rb
26
27
  - lib/mitie/ner.rb
28
+ - lib/mitie/ner_trainer.rb
29
+ - lib/mitie/ner_training_instance.rb
30
+ - lib/mitie/text_categorizer.rb
31
+ - lib/mitie/text_categorizer_trainer.rb
32
+ - lib/mitie/utils.rb
27
33
  - lib/mitie/version.rb
28
34
  - vendor/LICENSE.txt
29
35
  - vendor/libmitie.arm64.dylib
30
36
  - vendor/libmitie.dylib
31
37
  - vendor/libmitie.so
32
38
  - vendor/mitie.dll
33
- homepage: https://github.com/ankane/mitie
39
+ homepage: https://github.com/ankane/mitie-ruby
34
40
  licenses:
35
41
  - BSL-1.0
36
42
  metadata: {}
@@ -42,14 +48,14 @@ required_ruby_version: !ruby/object:Gem::Requirement
42
48
  requirements:
43
49
  - - ">="
44
50
  - !ruby/object:Gem::Version
45
- version: '2.5'
51
+ version: '2.7'
46
52
  required_rubygems_version: !ruby/object:Gem::Requirement
47
53
  requirements:
48
54
  - - ">="
49
55
  - !ruby/object:Gem::Version
50
56
  version: '0'
51
57
  requirements: []
52
- rubygems_version: 3.2.3
58
+ rubygems_version: 3.3.7
53
59
  signing_key:
54
60
  specification_version: 4
55
61
  summary: Named-entity recognition for Ruby
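The `required_ruby_version` bump from `2.5` to `2.7` can be checked against a given interpreter with RubyGems' own requirement classes. A small sketch; `ruby_supported?` is a hypothetical helper:

```ruby
require "rubygems" # normally loaded already; explicit here for clarity

# Hypothetical helper: true when `version` satisfies the gemspec's
# ">= 2.7" requirement.
def ruby_supported?(version)
  Gem::Requirement.new(">= 2.7").satisfied_by?(Gem::Version.new(version))
end

p ruby_supported?("2.6.10") # false
p ruby_supported?("2.7.0")  # true
p ruby_supported?(RUBY_VERSION)
```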