mitie 0.1.4 → 0.2.0

Sign up to get free protection for your applications and to get access to all the features.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: ddea728d3e6814fe8e9841e96f9aaab39a13d8b9573331136ff5236d9abfd364
4
- data.tar.gz: 391430fe23874d3a6aed47b8d0fd54ea517a85cf7331834a2732a0216bdc63e3
3
+ metadata.gz: 50fb12adcd0042b3c09968108ec382694f8f0df20750b88566bc64c8d85d9e8d
4
+ data.tar.gz: 9d3d34f9839f71fc17e6651a17069f1f0fdb4b5956f9e57c4b66c9567d908967
5
5
  SHA512:
6
- metadata.gz: fe566863d8c89b08be5538bd736bc6e6131ca728aea221585d81bf6f21228f55e660e0c93d58b8517255383bc5e961d3dfd47f3f28c351ba6c46f71893b6d5e0
7
- data.tar.gz: 355018b0c8bedf1216ef66061758073f57a3574075084e73d54bb277c73aad48c41079c90d5621165479fafca998ae1a913bb2fd79802bf001e76f87eeedc564
6
+ metadata.gz: 189d5a17f94fff9abbc1d8d961c8f8407ae85bd1284f727e363b9aa7384d748eb1d80b6ffa1b1a44c8c6fadbbb6e91e6d064ea58fa88af972e6fd6e90e3ef71d
7
+ data.tar.gz: 202ba418ee98636736185a8f4783601acabecc59dde7e3a5c31a12b56827497b6b2b7e7e26d264b61848281b94cb635249b17a9e8fa009c496dacbe24e6466ca
data/CHANGELOG.md CHANGED
@@ -1,3 +1,18 @@
1
+ ## 0.2.0 (2022-06-01)
2
+
3
+ - Added support for text categorization
4
+ - Added support for training binary relation detectors
5
+ - Dropped support for Ruby < 2.7
6
+
7
+ ## 0.1.6 (2022-03-20)
8
+
9
+ - Added support for training NER models
10
+ - Improved ARM detection
11
+
12
+ ## 0.1.5 (2021-01-29)
13
+
14
+ - Fixed issue with multibyte characters
15
+
1
16
  ## 0.1.4 (2020-12-28)
2
17
 
3
18
  - Added ARM shared library for Mac
data/README.md CHANGED
@@ -1,21 +1,21 @@
1
- # MITIE
1
+ # MITIE Ruby
2
2
 
3
- [MITIE](https://github.com/mit-nlp/MITIE) - named-entity recognition and binary relation detection - for Ruby
3
+ [MITIE](https://github.com/mit-nlp/MITIE) - named-entity recognition, binary relation detection, and text categorization - for Ruby
4
4
 
5
5
  - Finds people, organizations, and locations in text
6
6
  - Detects relationships between entities, like `PERSON` was born in `LOCATION`
7
7
 
8
- [![Build Status](https://github.com/ankane/mitie/workflows/build/badge.svg?branch=master)](https://github.com/ankane/mitie/actions)
8
+ [![Build Status](https://github.com/ankane/mitie-ruby/workflows/build/badge.svg?branch=master)](https://github.com/ankane/mitie-ruby/actions)
9
9
 
10
10
  ## Installation
11
11
 
12
12
  Add this line to your application’s Gemfile:
13
13
 
14
14
  ```ruby
15
- gem 'mitie'
15
+ gem "mitie"
16
16
  ```
17
17
 
18
- And download the pre-trained model for your language:
18
+ And download the pre-trained models for your language:
19
19
 
20
20
  - [English](https://github.com/mit-nlp/MITIE/releases/download/v0.4/MITIE-models-v0.2.tar.bz2)
21
21
  - [Spanish](https://github.com/mit-nlp/MITIE/releases/download/v0.4/MITIE-models-v0.2-Spanish.zip)
@@ -23,6 +23,12 @@ And download the pre-trained model for your language:
23
23
 
24
24
  ## Getting Started
25
25
 
26
+ - [Named Entity Recognition](#named-entity-recognition)
27
+ - [Binary Relation Detection](#binary-relation-detection)
28
+ - [Text Categorization](#text-categorization)
29
+
30
+ ## Named Entity Recognition
31
+
26
32
  Load an NER model
27
33
 
28
34
  ```ruby
@@ -69,6 +75,41 @@ Get all tags for a model
69
75
  model.tags
70
76
  ```
71
77
 
78
+ ### Training
79
+
80
+ Load an NER model into a trainer
81
+
82
+ ```ruby
83
+ trainer = Mitie::NERTrainer.new("total_word_feature_extractor.dat")
84
+ ```
85
+
86
+ Create training instances
87
+
88
+ ```ruby
89
+ tokens = ["You", "can", "do", "machine", "learning", "in", "Ruby", "!"]
90
+ instance = Mitie::NERTrainingInstance.new(tokens)
91
+ instance.add_entity(3..4, "topic") # machine learning
92
+ instance.add_entity(6..6, "language") # Ruby
93
+ ```
94
+
95
+ Add the training instances to the trainer
96
+
97
+ ```ruby
98
+ trainer.add(instance)
99
+ ```
100
+
101
+ Train the model
102
+
103
+ ```ruby
104
+ model = trainer.train
105
+ ```
106
+
107
+ Save the model
108
+
109
+ ```ruby
110
+ model.save_to_disk("ner_model.dat")
111
+ ```
112
+
72
113
  ## Binary Relation Detection
73
114
 
74
115
  Detect relationships betweens two entities, like:
@@ -103,24 +144,98 @@ This returns
103
144
  [{first: "Shopify", second: "Ottawa", score: 0.17649169745814464}]
104
145
  ```
105
146
 
147
+ ### Training
148
+
149
+ Load an NER model into a trainer
150
+
151
+ ```ruby
152
+ trainer = Mitie::BinaryRelationTrainer.new(model)
153
+ ```
154
+
155
+ Add positive and negative examples to the trainer
156
+
157
+ ```ruby
158
+ tokens = ["Shopify", "was", "founded", "in", "Ottawa"]
159
+ trainer.add_positive_binary_relation(tokens, 0..0, 4..4)
160
+ trainer.add_negative_binary_relation(tokens, 4..4, 0..0)
161
+ ```
162
+
163
+ Train the detector
164
+
165
+ ```ruby
166
+ detector = trainer.train
167
+ ```
168
+
169
+ Save the detector
170
+
171
+ ```ruby
172
+ detector.save_to_disk("binary_relation_detector.svm")
173
+ ```
174
+
175
+ ## Text Categorization
176
+
177
+ Load a model into a trainer
178
+
179
+ ```ruby
180
+ trainer = Mitie::TextCategorizerTrainer.new("total_word_feature_extractor.dat")
181
+ ```
182
+
183
+ Add labeled text to the trainer
184
+
185
+ ```ruby
186
+ trainer.add(["This", "is", "super", "cool"], "positive")
187
+ ```
188
+
189
+ Train the model
190
+
191
+ ```ruby
192
+ model = trainer.train
193
+ ```
194
+
195
+ Save the model
196
+
197
+ ```ruby
198
+ model.save_to_disk("text_categorization_model.dat")
199
+ ```
200
+
201
+ Load a saved model
202
+
203
+ ```ruby
204
+ model = Mitie::TextCategorizer.new("text_categorization_model.dat")
205
+ ```
206
+
207
+ Categorize text
208
+
209
+ ```ruby
210
+ model.categorize(["What", "a", "super", "nice", "day"])
211
+ ```
212
+
213
+ ## Deployment
214
+
215
+ Check out [Trove](https://github.com/ankane/trove) for deploying models.
216
+
217
+ ```sh
218
+ trove push ner_model.dat
219
+ ```
220
+
106
221
  ## History
107
222
 
108
- View the [changelog](https://github.com/ankane/mitie/blob/master/CHANGELOG.md)
223
+ View the [changelog](https://github.com/ankane/mitie-ruby/blob/master/CHANGELOG.md)
109
224
 
110
225
  ## Contributing
111
226
 
112
227
  Everyone is encouraged to help improve this project. Here are a few ways you can help:
113
228
 
114
- - [Report bugs](https://github.com/ankane/mitie/issues)
115
- - Fix bugs and [submit pull requests](https://github.com/ankane/mitie/pulls)
229
+ - [Report bugs](https://github.com/ankane/mitie-ruby/issues)
230
+ - Fix bugs and [submit pull requests](https://github.com/ankane/mitie-ruby/pulls)
116
231
  - Write, clarify, or fix documentation
117
232
  - Suggest or add new features
118
233
 
119
234
  To get started with development:
120
235
 
121
236
  ```sh
122
- git clone https://github.com/ankane/mitie.git
123
- cd mitie
237
+ git clone https://github.com/ankane/mitie-ruby.git
238
+ cd mitie-ruby
124
239
  bundle install
125
240
  bundle exec rake vendor:all
126
241
 
@@ -1,9 +1,16 @@
1
1
  module Mitie
2
2
  class BinaryRelationDetector
3
- def initialize(path)
4
- # better error message
5
- raise ArgumentError, "File does not exist" unless File.exist?(path)
6
- @pointer = FFI.mitie_load_binary_relation_detector(path)
3
+ def initialize(path = nil, pointer: nil)
4
+ if path
5
+ # better error message
6
+ raise ArgumentError, "File does not exist" unless File.exist?(path)
7
+ @pointer = FFI.mitie_load_binary_relation_detector(path)
8
+ elsif pointer
9
+ @pointer = pointer
10
+ else
11
+ raise ArgumentError, "Must pass either a path or a pointer"
12
+ end
13
+
7
14
  ObjectSpace.define_finalizer(self, self.class.finalize(pointer))
8
15
  end
9
16
 
@@ -23,37 +30,52 @@ module Mitie
23
30
 
24
31
  relations = []
25
32
  combinations.each do |entity1, entity2|
26
- relation =
27
- FFI.mitie_extract_binary_relation(
28
- doc.model.pointer,
29
- doc.send(:tokens_ptr),
30
- entity1[:token_index],
31
- entity1[:token_length],
32
- entity2[:token_index],
33
- entity2[:token_length]
34
- )
35
-
36
- score_ptr = Fiddle::Pointer.malloc(Fiddle::SIZEOF_DOUBLE)
37
- status = FFI.mitie_classify_binary_relation(pointer, relation, score_ptr)
38
- raise "Bad status: #{status}" if status != 0
39
- score = score_ptr.to_s(Fiddle::SIZEOF_DOUBLE).unpack1("d")
40
- if score > 0
41
- relations << {
42
- first: entity1[:text],
43
- second: entity2[:text],
44
- score: score
45
- }
46
- end
33
+ relation = extract_relation(doc, entity1, entity2)
34
+ relations << relation if relation
47
35
  end
48
36
  relations
49
37
  end
50
38
 
39
+ def save_to_disk(filename)
40
+ if FFI.mitie_save_binary_relation_detector(filename, pointer) != 0
41
+ raise Error, "Unable to save detector"
42
+ end
43
+ nil
44
+ end
45
+
51
46
  private
52
47
 
53
48
  def pointer
54
49
  @pointer
55
50
  end
56
51
 
52
+ def extract_relation(doc, entity1, entity2)
53
+ relation =
54
+ FFI.mitie_extract_binary_relation(
55
+ doc.model.pointer,
56
+ doc.send(:tokens_ptr),
57
+ entity1[:token_index],
58
+ entity1[:token_length],
59
+ entity2[:token_index],
60
+ entity2[:token_length]
61
+ )
62
+
63
+ score_ptr = Fiddle::Pointer.malloc(Fiddle::SIZEOF_DOUBLE)
64
+ status = FFI.mitie_classify_binary_relation(pointer, relation, score_ptr)
65
+ raise Error, "Bad status: #{status}" if status != 0
66
+
67
+ score = Utils.read_double(score_ptr)
68
+ if score > 0
69
+ {
70
+ first: entity1[:text],
71
+ second: entity2[:text],
72
+ score: score
73
+ }
74
+ end
75
+ ensure
76
+ FFI.mitie_free(relation) if relation
77
+ end
78
+
57
79
  def self.finalize(pointer)
58
80
  # must use proc instead of stabby lambda
59
81
  proc { FFI.mitie_free(pointer) }
@@ -0,0 +1,87 @@
1
+ module Mitie
2
+ class BinaryRelationTrainer
3
+ def initialize(ner, name: "")
4
+ @pointer = FFI.mitie_create_binary_relation_trainer(name, ner.pointer)
5
+
6
+ ObjectSpace.define_finalizer(self, self.class.finalize(@pointer))
7
+ end
8
+
9
+ def add_positive_binary_relation(tokens, range1, range2)
10
+ check_add(tokens, range1, range2)
11
+
12
+ tokens_pointer = Utils.array_to_pointer(tokens)
13
+ status = FFI.mitie_add_positive_binary_relation(@pointer, tokens_pointer, range1.begin, range1.size, range2.begin, range2.size)
14
+ if status != 0
15
+ raise Error, "Unable to add binary relation"
16
+ end
17
+ end
18
+
19
+ def add_negative_binary_relation(tokens, range1, range2)
20
+ check_add(tokens, range1, range2)
21
+
22
+ tokens_pointer = Utils.array_to_pointer(tokens)
23
+ status = FFI.mitie_add_negative_binary_relation(@pointer, tokens_pointer, range1.begin, range1.size, range2.begin, range2.size)
24
+ if status != 0
25
+ raise Error, "Unable to add binary relation"
26
+ end
27
+ end
28
+
29
+ def beta
30
+ FFI.mitie_binary_relation_trainer_get_beta(@pointer)
31
+ end
32
+
33
+ def beta=(value)
34
+ raise ArgumentError, "beta must be greater than or equal to zero" unless value >= 0
35
+
36
+ FFI.mitie_binary_relation_trainer_set_beta(@pointer, value)
37
+ end
38
+
39
+ def num_threads
40
+ FFI.mitie_binary_relation_trainer_get_num_threads(@pointer)
41
+ end
42
+
43
+ def num_threads=(value)
44
+ FFI.mitie_binary_relation_trainer_set_num_threads(@pointer, value)
45
+ end
46
+
47
+ def num_positive_examples
48
+ FFI.mitie_binary_relation_trainer_num_positive_examples(@pointer)
49
+ end
50
+
51
+ def num_negative_examples
52
+ FFI.mitie_binary_relation_trainer_num_negative_examples(@pointer)
53
+ end
54
+
55
+ def train
56
+ if num_positive_examples + num_negative_examples == 0
57
+ raise Error, "You can't call train() on an empty trainer"
58
+ end
59
+
60
+ detector = FFI.mitie_train_binary_relation_detector(@pointer)
61
+
62
+ raise Error, "Unable to create binary relation detector. Probably ran out of RAM." if detector.null?
63
+
64
+ Mitie::BinaryRelationDetector.new(pointer: detector)
65
+ end
66
+
67
+ private
68
+
69
+ def check_add(tokens, range1, range2)
70
+ Utils.check_range(range1, tokens.size)
71
+ Utils.check_range(range2, tokens.size)
72
+
73
+ if entities_overlap?(range1, range2)
74
+ raise ArgumentError, "Entities overlap"
75
+ end
76
+ end
77
+
78
+ def entities_overlap?(range1, range2)
79
+ FFI.mitie_entities_overlap(range1.begin, range1.size, range2.begin, range2.size) == 1
80
+ end
81
+
82
+ def self.finalize(pointer)
83
+ # must use proc instead of stabby lambda
84
+ proc { FFI.mitie_free(pointer) }
85
+ end
86
+ end
87
+ end
@@ -33,37 +33,35 @@ module Mitie
33
33
 
34
34
  def entities
35
35
  @entities ||= begin
36
- begin
37
- entities = []
38
- tokens = tokens_with_offset
39
- detections = FFI.mitie_extract_entities(pointer, tokens_ptr)
40
- num_detections = FFI.mitie_ner_get_num_detections(detections)
41
- num_detections.times do |i|
42
- pos = FFI.mitie_ner_get_detection_position(detections, i)
43
- len = FFI.mitie_ner_get_detection_length(detections, i)
44
- tag = FFI.mitie_ner_get_detection_tagstr(detections, i).to_s
45
- score = FFI.mitie_ner_get_detection_score(detections, i)
46
- tok = tokens[pos, len]
47
- offset = tok[0][1]
36
+ entities = []
37
+ tokens = tokens_with_offset
38
+ detections = FFI.mitie_extract_entities(pointer, tokens_ptr)
39
+ num_detections = FFI.mitie_ner_get_num_detections(detections)
40
+ num_detections.times do |i|
41
+ pos = FFI.mitie_ner_get_detection_position(detections, i)
42
+ len = FFI.mitie_ner_get_detection_length(detections, i)
43
+ tag = FFI.mitie_ner_get_detection_tagstr(detections, i).to_s
44
+ score = FFI.mitie_ner_get_detection_score(detections, i)
45
+ tok = tokens[pos, len]
46
+ offset = tok[0][1]
48
47
 
49
- entity = {}
50
- if offset
51
- finish = tok[-1][1] + tok[-1][0].size
52
- entity[:text] = text[offset...finish]
53
- else
54
- entity[:text] = tok.map(&:first)
55
- end
56
- entity[:tag] = tag
57
- entity[:score] = score
58
- entity[:offset] = offset if offset
59
- entity[:token_index] = pos
60
- entity[:token_length] = len
61
- entities << entity
48
+ entity = {}
49
+ if offset
50
+ finish = tok[-1][1] + tok[-1][0].bytesize
51
+ entity[:text] = text.byteslice(offset...finish)
52
+ else
53
+ entity[:text] = tok.map(&:first)
62
54
  end
63
- entities
64
- ensure
65
- FFI.mitie_free(detections) if detections
55
+ entity[:tag] = tag
56
+ entity[:score] = score
57
+ entity[:offset] = offset if offset
58
+ entity[:token_index] = pos
59
+ entity[:token_length] = len
60
+ entities << entity
66
61
  end
62
+ entities
63
+ ensure
64
+ FFI.mitie_free(detections) if detections
67
65
  end
68
66
  end
69
67
 
@@ -84,11 +82,7 @@ module Mitie
84
82
  def tokenize
85
83
  @tokenize ||= begin
86
84
  if text.is_a?(Array)
87
- # malloc uses memset to set all bytes to 0
88
- tokens_ptr = Fiddle::Pointer.malloc(Fiddle::SIZEOF_VOIDP * (text.size + 1))
89
- text.size.times do |i|
90
- tokens_ptr[i * Fiddle::SIZEOF_VOIDP, Fiddle::SIZEOF_VOIDP] = Fiddle::Pointer.to_ptr(text[i]).ref
91
- end
85
+ tokens_ptr = Utils.array_to_pointer(text)
92
86
  [tokens_ptr, nil]
93
87
  else
94
88
  offsets_ptr = Fiddle::Pointer.malloc(Fiddle::SIZEOF_VOIDP)
data/lib/mitie/ffi.rb CHANGED
@@ -10,14 +10,16 @@ module Mitie
10
10
  raise e
11
11
  end
12
12
 
13
+ # https://github.com/mit-nlp/MITIE/blob/master/mitielib/include/mitie.h
14
+
13
15
  extern "void mitie_free(void* object)"
14
16
  extern "char** mitie_tokenize(const char* text)"
15
17
  extern "char** mitie_tokenize_with_offsets(const char* text, unsigned long** token_offsets)"
16
18
 
19
+ # ner
17
20
  extern "mitie_named_entity_extractor* mitie_load_named_entity_extractor(const char* filename)"
18
21
  extern "unsigned long mitie_get_num_possible_ner_tags(const mitie_named_entity_extractor* ner)"
19
22
  extern "const char* mitie_get_named_entity_tagstr(const mitie_named_entity_extractor* ner, unsigned long idx)"
20
-
21
23
  extern "mitie_named_entity_detections* mitie_extract_entities(const mitie_named_entity_extractor* ner, char** tokens)"
22
24
  extern "unsigned long mitie_ner_get_num_detections(const mitie_named_entity_detections* dets)"
23
25
  extern "unsigned long mitie_ner_get_detection_position(const mitie_named_entity_detections* dets, unsigned long idx)"
@@ -26,10 +28,57 @@ module Mitie
26
28
  extern "const char* mitie_ner_get_detection_tagstr(const mitie_named_entity_detections* dets, unsigned long idx)"
27
29
  extern "double mitie_ner_get_detection_score(const mitie_named_entity_detections* dets, unsigned long idx)"
28
30
 
31
+ # binary relation detector
29
32
  extern "mitie_binary_relation_detector* mitie_load_binary_relation_detector(const char* filename)"
30
33
  extern "const char* mitie_binary_relation_detector_name_string(const mitie_binary_relation_detector* detector)"
31
34
  extern "int mitie_entities_overlap(unsigned long arg1_start, unsigned long arg1_length, unsigned long arg2_start, unsigned long arg2_length)"
32
35
  extern "mitie_binary_relation* mitie_extract_binary_relation(const mitie_named_entity_extractor* ner, char** tokens, unsigned long arg1_start, unsigned long arg1_length, unsigned long arg2_start, unsigned long arg2_length)"
33
36
  extern "int mitie_classify_binary_relation(const mitie_binary_relation_detector* detector, const mitie_binary_relation* relation, double* score)"
37
+
38
+ # text categorizer
39
+ extern "mitie_text_categorizer* mitie_load_text_categorizer(const char* filename)"
40
+ extern "int mitie_categorize_text(const mitie_text_categorizer* tcat, const char** tokens, char** text_tag, double* text_score)"
41
+
42
+ # save
43
+ extern "int mitie_save_named_entity_extractor(const char* filename, const mitie_named_entity_extractor* ner)"
44
+ extern "int mitie_save_binary_relation_detector(const char* filename, const mitie_binary_relation_detector* detector)"
45
+ extern "int mitie_save_text_categorizer(const char* filename, const mitie_text_categorizer* tcat)"
46
+
47
+ # ner trainer
48
+ extern "mitie_ner_training_instance* mitie_create_ner_training_instance(char** tokens)"
49
+ extern "unsigned long mitie_ner_training_instance_num_entities(const mitie_ner_training_instance* instance)"
50
+ extern "unsigned long mitie_ner_training_instance_num_tokens(const mitie_ner_training_instance* instance)"
51
+ extern "int mitie_overlaps_any_entity(mitie_ner_training_instance* instance, unsigned long start, unsigned long length)"
52
+ extern "int mitie_add_ner_training_entity(mitie_ner_training_instance* instance, unsigned long start, unsigned long length, const char* label)"
53
+ extern "mitie_ner_trainer* mitie_create_ner_trainer(const char* filename)"
54
+ extern "unsigned long mitie_ner_trainer_size(const mitie_ner_trainer* trainer)"
55
+ extern "int mitie_add_ner_training_instance(mitie_ner_trainer* trainer, const mitie_ner_training_instance* instance)"
56
+ extern "void mitie_ner_trainer_set_beta(mitie_ner_trainer* trainer, double beta)"
57
+ extern "double mitie_ner_trainer_get_beta(const mitie_ner_trainer* trainer)"
58
+ extern "void mitie_ner_trainer_set_num_threads(mitie_ner_trainer* trainer, unsigned long num_threads)"
59
+ extern "unsigned long mitie_ner_trainer_get_num_threads(const mitie_ner_trainer* trainer)"
60
+ extern "mitie_named_entity_extractor* mitie_train_named_entity_extractor(const mitie_ner_trainer* trainer)"
61
+
62
+ # binary relation trainer
63
+ extern "mitie_binary_relation_trainer* mitie_create_binary_relation_trainer(const char* relation_name, const mitie_named_entity_extractor* ner)"
64
+ extern "unsigned long mitie_binary_relation_trainer_num_positive_examples(const mitie_binary_relation_trainer* trainer)"
65
+ extern "unsigned long mitie_binary_relation_trainer_num_negative_examples(const mitie_binary_relation_trainer* trainer)"
66
+ extern "int mitie_add_positive_binary_relation(mitie_binary_relation_trainer* trainer, char** tokens, unsigned long arg1_start, unsigned long arg1_length, unsigned long arg2_start, unsigned long arg2_length)"
67
+ extern "int mitie_add_negative_binary_relation(mitie_binary_relation_trainer* trainer, char** tokens, unsigned long arg1_start, unsigned long arg1_length, unsigned long arg2_start, unsigned long arg2_length)"
68
+ extern "void mitie_binary_relation_trainer_set_beta(mitie_binary_relation_trainer* trainer, double beta)"
69
+ extern "double mitie_binary_relation_trainer_get_beta(const mitie_binary_relation_trainer* trainer)"
70
+ extern "void mitie_binary_relation_trainer_set_num_threads(mitie_binary_relation_trainer* trainer, unsigned long num_threads)"
71
+ extern "unsigned long mitie_binary_relation_trainer_get_num_threads(const mitie_binary_relation_trainer* trainer)"
72
+ extern "mitie_binary_relation_detector* mitie_train_binary_relation_detector(const mitie_binary_relation_trainer* trainer)"
73
+
74
+ # text categorizer trainer
75
+ extern "mitie_text_categorizer_trainer* mitie_create_text_categorizer_trainer(const char* filename)"
76
+ extern "unsigned long mitie_text_categorizer_trainer_size(const mitie_text_categorizer_trainer* trainer)"
77
+ extern "void mitie_text_categorizer_trainer_set_beta(mitie_text_categorizer_trainer* trainer, double beta)"
78
+ extern "double mitie_text_categorizer_trainer_get_beta(const mitie_text_categorizer_trainer* trainer)"
79
+ extern "void mitie_text_categorizer_trainer_set_num_threads(mitie_text_categorizer_trainer* trainer, unsigned long num_threads)"
80
+ extern "unsigned long mitie_text_categorizer_trainer_get_num_threads(const mitie_text_categorizer_trainer* trainer)"
81
+ extern "int mitie_add_text_categorizer_labeled_text(mitie_text_categorizer_trainer* trainer, const char** tokens, const char* label)"
82
+ extern "mitie_text_categorizer* mitie_train_text_categorizer(const mitie_text_categorizer_trainer* trainer)"
34
83
  end
35
84
  end
data/lib/mitie/ner.rb CHANGED
@@ -2,11 +2,18 @@ module Mitie
2
2
  class NER
3
3
  attr_reader :pointer
4
4
 
5
- def initialize(path)
6
- # better error message
7
- raise ArgumentError, "File does not exist" unless File.exist?(path)
8
- @pointer = FFI.mitie_load_named_entity_extractor(path)
9
- ObjectSpace.define_finalizer(self, self.class.finalize(pointer))
5
+ def initialize(path = nil, pointer: nil)
6
+ if path
7
+ # better error message
8
+ raise ArgumentError, "File does not exist" unless File.exist?(path)
9
+ @pointer = FFI.mitie_load_named_entity_extractor(path)
10
+ elsif pointer
11
+ @pointer = pointer
12
+ else
13
+ raise ArgumentError, "Must pass either a path or a pointer"
14
+ end
15
+
16
+ ObjectSpace.define_finalizer(self, self.class.finalize(@pointer))
10
17
  end
11
18
 
12
19
  def tags
@@ -23,6 +30,13 @@ module Mitie
23
30
  doc(text).entities
24
31
  end
25
32
 
33
+ def save_to_disk(filename)
34
+ if FFI.mitie_save_named_entity_extractor(filename, pointer) != 0
35
+ raise Error, "Unable to save model"
36
+ end
37
+ nil
38
+ end
39
+
26
40
  def tokens(text)
27
41
  doc(text).tokens
28
42
  end
@@ -0,0 +1,51 @@
1
+ module Mitie
2
+ class NERTrainer
3
+ def initialize(filename)
4
+ raise ArgumentError, "File does not exist" unless File.exist?(filename)
5
+ @pointer = FFI.mitie_create_ner_trainer(filename)
6
+
7
+ ObjectSpace.define_finalizer(self, self.class.finalize(@pointer))
8
+ end
9
+
10
+ def add(instance)
11
+ FFI.mitie_add_ner_training_instance(@pointer, instance.pointer)
12
+ end
13
+
14
+ def beta
15
+ FFI.mitie_ner_trainer_get_beta(@pointer)
16
+ end
17
+
18
+ def beta=(value)
19
+ raise ArgumentError, "beta must be greater than or equal to zero" unless value >= 0
20
+
21
+ FFI.mitie_ner_trainer_set_beta(@pointer, value)
22
+ end
23
+
24
+ def num_threads
25
+ FFI.mitie_ner_trainer_get_num_threads(@pointer)
26
+ end
27
+
28
+ def num_threads=(value)
29
+ FFI.mitie_ner_trainer_set_num_threads(@pointer, value)
30
+ end
31
+
32
+ def size
33
+ FFI.mitie_ner_trainer_size(@pointer)
34
+ end
35
+
36
+ def train
37
+ raise Error, "You can't call train() on an empty trainer" if size.zero?
38
+
39
+ extractor = FFI.mitie_train_named_entity_extractor(@pointer)
40
+
41
+ raise Error, "Unable to create named entity extractor. Probably ran out of RAM." if extractor.null?
42
+
43
+ Mitie::NER.new(pointer: extractor)
44
+ end
45
+
46
+ def self.finalize(pointer)
47
+ # must use proc instead of stabby lambda
48
+ proc { FFI.mitie_free(pointer) }
49
+ end
50
+ end
51
+ end
@@ -0,0 +1,45 @@
1
+ module Mitie
2
+ class NERTrainingInstance
3
+ attr_reader :pointer
4
+
5
+ def initialize(tokens)
6
+ tokens_pointer = Utils.array_to_pointer(tokens)
7
+
8
+ @pointer = FFI.mitie_create_ner_training_instance(tokens_pointer)
9
+ raise Error, "Unable to create training instance. Probably ran out of RAM." if @pointer.null?
10
+
11
+ ObjectSpace.define_finalizer(self, self.class.finalize(@pointer))
12
+ end
13
+
14
+ def add_entity(range, label)
15
+ Utils.check_range(range, num_tokens)
16
+
17
+ raise ArgumentError, "Range overlaps existing entity" if overlaps_any_entity?(range)
18
+
19
+ unless FFI.mitie_add_ner_training_entity(@pointer, range.begin, range.size, label).zero?
20
+ raise Error, "Unable to add entity to training instance. Probably ran out of RAM."
21
+ end
22
+
23
+ nil
24
+ end
25
+
26
+ def num_entities
27
+ FFI.mitie_ner_training_instance_num_entities(@pointer)
28
+ end
29
+
30
+ def num_tokens
31
+ FFI.mitie_ner_training_instance_num_tokens(@pointer)
32
+ end
33
+
34
+ def overlaps_any_entity?(range)
35
+ Utils.check_range(range, num_tokens)
36
+
37
+ FFI.mitie_overlaps_any_entity(@pointer, range.begin, range.size) == 1
38
+ end
39
+
40
+ def self.finalize(pointer)
41
+ # must use proc instead of stabby lambda
42
+ proc { FFI.mitie_free(pointer) }
43
+ end
44
+ end
45
+ end
@@ -0,0 +1,47 @@
1
+ module Mitie
2
+ class TextCategorizer
3
+ def initialize(path = nil, pointer: nil)
4
+ if path
5
+ # better error message
6
+ raise ArgumentError, "File does not exist" unless File.exist?(path)
7
+ @pointer = FFI.mitie_load_text_categorizer(path)
8
+ elsif pointer
9
+ @pointer = pointer
10
+ else
11
+ raise ArgumentError, "Must pass either a path or a pointer"
12
+ end
13
+
14
+ ObjectSpace.define_finalizer(self, self.class.finalize(@pointer))
15
+ end
16
+
17
+ def categorize(tokens)
18
+ tokens_pointer = Utils.array_to_pointer(tokens)
19
+ text_tag = Fiddle::Pointer.malloc(Fiddle::SIZEOF_VOIDP)
20
+ text_score = Fiddle::Pointer.malloc(Fiddle::SIZEOF_DOUBLE)
21
+
22
+ if FFI.mitie_categorize_text(@pointer, tokens_pointer, text_tag, text_score) != 0
23
+ raise Error, "Unable to categorize"
24
+ end
25
+
26
+ {
27
+ tag: text_tag.ptr.to_s,
28
+ score: Utils.read_double(text_score)
29
+ }
30
+ ensure
31
+ # text_tag must be freed
32
+ FFI.mitie_free(text_tag.ptr) if text_tag
33
+ end
34
+
35
+ def save_to_disk(filename)
36
+ if FFI.mitie_save_text_categorizer(filename, @pointer) != 0
37
+ raise Error, "Unable to save model"
38
+ end
39
+ nil
40
+ end
41
+
42
+ def self.finalize(pointer)
43
+ # must use proc instead of stabby lambda
44
+ proc { FFI.mitie_free(pointer) }
45
+ end
46
+ end
47
+ end
@@ -0,0 +1,52 @@
1
+ module Mitie
2
+ class TextCategorizerTrainer
3
+ def initialize(filename)
4
+ raise ArgumentError, "File does not exist" unless File.exist?(filename)
5
+ @pointer = FFI.mitie_create_text_categorizer_trainer(filename)
6
+
7
+ ObjectSpace.define_finalizer(self, self.class.finalize(@pointer))
8
+ end
9
+
10
+ def add(tokens, label)
11
+ tokens_pointer = Utils.array_to_pointer(tokens)
12
+ FFI.mitie_add_text_categorizer_labeled_text(@pointer, tokens_pointer, label)
13
+ end
14
+
15
+ def beta
16
+ FFI.mitie_text_categorizer_trainer_get_beta(@pointer)
17
+ end
18
+
19
+ def beta=(value)
20
+ raise ArgumentError, "beta must be greater than or equal to zero" unless value >= 0
21
+
22
+ FFI.mitie_text_categorizer_trainer_set_beta(@pointer, value)
23
+ end
24
+
25
+ def num_threads
26
+ FFI.mitie_text_categorizer_trainer_get_num_threads(@pointer)
27
+ end
28
+
29
+ def num_threads=(value)
30
+ FFI.mitie_text_categorizer_trainer_set_num_threads(@pointer, value)
31
+ end
32
+
33
+ def size
34
+ FFI.mitie_text_categorizer_trainer_size(@pointer)
35
+ end
36
+
37
+ def train
38
+ raise Error, "You can't call train() on an empty trainer" if size.zero?
39
+
40
+ categorizer = FFI.mitie_train_text_categorizer(@pointer)
41
+
42
+ raise Error, "Unable to create text categorizer. Probably ran out of RAM." if categorizer.null?
43
+
44
+ Mitie::TextCategorizer.new(pointer: categorizer)
45
+ end
46
+
47
+ def self.finalize(pointer)
48
+ # must use proc instead of stabby lambda
49
+ proc { FFI.mitie_free(pointer) }
50
+ end
51
+ end
52
+ end
@@ -0,0 +1,22 @@
1
+ module Mitie
2
+ module Utils
3
+ def self.array_to_pointer(text)
4
+ # malloc uses memset to set all bytes to 0
5
+ tokens_ptr = Fiddle::Pointer.malloc(Fiddle::SIZEOF_VOIDP * (text.size + 1))
6
+ text.size.times do |i|
7
+ tokens_ptr[i * Fiddle::SIZEOF_VOIDP, Fiddle::SIZEOF_VOIDP] = Fiddle::Pointer.to_ptr(text[i]).ref
8
+ end
9
+ tokens_ptr
10
+ end
11
+
12
+ def self.check_range(range, num_tokens)
13
+ if range.none? || !(0..(num_tokens - 1)).cover?(range)
14
+ raise ArgumentError, "Invalid range"
15
+ end
16
+ end
17
+
18
+ def self.read_double(ptr)
19
+ ptr.to_s(Fiddle::SIZEOF_DOUBLE).unpack1("d")
20
+ end
21
+ end
22
+ end
data/lib/mitie/version.rb CHANGED
@@ -1,3 +1,3 @@
1
1
  module Mitie
2
- VERSION = "0.1.4"
2
+ VERSION = "0.2.0"
3
3
  end
data/lib/mitie.rb CHANGED
@@ -3,8 +3,14 @@ require "fiddle/import"
3
3
 
4
4
  # modules
5
5
  require "mitie/binary_relation_detector"
6
+ require "mitie/binary_relation_trainer"
6
7
  require "mitie/document"
7
8
  require "mitie/ner"
9
+ require "mitie/ner_training_instance"
10
+ require "mitie/ner_trainer"
11
+ require "mitie/text_categorizer"
12
+ require "mitie/text_categorizer_trainer"
13
+ require "mitie/utils"
8
14
  require "mitie/version"
9
15
 
10
16
  module Mitie
@@ -16,10 +22,12 @@ module Mitie
16
22
  lib_name =
17
23
  if Gem.win_platform?
18
24
  "mitie.dll"
19
- elsif RbConfig::CONFIG["arch"] =~ /arm64-darwin/i
20
- "libmitie.arm64.dylib"
21
25
  elsif RbConfig::CONFIG["host_os"] =~ /darwin/i
22
- "libmitie.dylib"
26
+ if RbConfig::CONFIG["host_cpu"] =~ /arm|aarch64/i
27
+ "libmitie.arm64.dylib"
28
+ else
29
+ "libmitie.dylib"
30
+ end
23
31
  else
24
32
  "libmitie.so"
25
33
  end
metadata CHANGED
@@ -1,17 +1,17 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: mitie
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.4
4
+ version: 0.2.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Andrew Kane
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2020-12-29 00:00:00.000000000 Z
11
+ date: 2022-06-01 00:00:00.000000000 Z
12
12
  dependencies: []
13
13
  description:
14
- email: andrew@chartkick.com
14
+ email: andrew@ankane.org
15
15
  executables: []
16
16
  extensions: []
17
17
  extra_rdoc_files: []
@@ -21,16 +21,22 @@ files:
21
21
  - README.md
22
22
  - lib/mitie.rb
23
23
  - lib/mitie/binary_relation_detector.rb
24
+ - lib/mitie/binary_relation_trainer.rb
24
25
  - lib/mitie/document.rb
25
26
  - lib/mitie/ffi.rb
26
27
  - lib/mitie/ner.rb
28
+ - lib/mitie/ner_trainer.rb
29
+ - lib/mitie/ner_training_instance.rb
30
+ - lib/mitie/text_categorizer.rb
31
+ - lib/mitie/text_categorizer_trainer.rb
32
+ - lib/mitie/utils.rb
27
33
  - lib/mitie/version.rb
28
34
  - vendor/LICENSE.txt
29
35
  - vendor/libmitie.arm64.dylib
30
36
  - vendor/libmitie.dylib
31
37
  - vendor/libmitie.so
32
38
  - vendor/mitie.dll
33
- homepage: https://github.com/ankane/mitie
39
+ homepage: https://github.com/ankane/mitie-ruby
34
40
  licenses:
35
41
  - BSL-1.0
36
42
  metadata: {}
@@ -42,14 +48,14 @@ required_ruby_version: !ruby/object:Gem::Requirement
42
48
  requirements:
43
49
  - - ">="
44
50
  - !ruby/object:Gem::Version
45
- version: '2.5'
51
+ version: '2.7'
46
52
  required_rubygems_version: !ruby/object:Gem::Requirement
47
53
  requirements:
48
54
  - - ">="
49
55
  - !ruby/object:Gem::Version
50
56
  version: '0'
51
57
  requirements: []
52
- rubygems_version: 3.2.3
58
+ rubygems_version: 3.3.7
53
59
  signing_key:
54
60
  specification_version: 4
55
61
  summary: Named-entity recognition for Ruby