mitie 0.1.0 → 0.1.5

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: ea3ef115016c59ecb496ffbbe13c4ac3a2ffda6acf9392ac423103b9c3cfe634
-   data.tar.gz: 6eb77dd514ba3c08c30e1216921cd83619f206a658018c1d4522c598e175e8b2
+   metadata.gz: e0ceaa2d4609a2a1b3b4056d67d88a1b0f55616ac8fd1a0509f070c352a96ea1
+   data.tar.gz: 0b90eba1027ca5a46a405a97411d2b5fe193d20666ee8c8bad347bb2c79c225b
  SHA512:
-   metadata.gz: 682fb3ea1c0be1889f2e1e177204309ea7b6d2989834a4d2bae49ddf567e309b9fef9ea872746626590ec06b9aabb455067d88e696ea3a4f59a9b785b43819c9
-   data.tar.gz: 4260a6dff4eb613278468d9fff2ef19f99f49bed81571781b4d2ac886f39bed9356a4a20a9220fa2b31e8c12965bffac5e86b40318b28aa450ebd60a3c53c3cd
+   metadata.gz: 8281e51659e08157d305535f3cd242082d4173368c36fa167716771b008a82233d02e43d1015c39e27c4c704ae0ec53935834eb0fefb71dac3b515962987d7eb
+   data.tar.gz: 96d3564684f8197651f93238876f2bbeb69f860a8a373f36dcafb578db698555153b320f2bd0c799cd8d5d939ad05c3746c59e907d4ad62cfa4917fbb7ec2313
@@ -1,3 +1,25 @@
+ ## 0.1.5 (2021-01-29)
+
+ - Fixed issue with multibyte characters
+
+ ## 0.1.4 (2020-12-28)
+
+ - Added ARM shared library for Mac
+
+ ## 0.1.3 (2020-12-04)
+
+ - Added support for custom tokenization
+
+ ## 0.1.2 (2020-09-14)
+
+ - Added binary relation detection
+ - Added `Document` class
+
+ ## 0.1.1 (2020-09-14)
+
+ - Added shared libraries
+ - Improved error message when model file does not exist
+
  ## 0.1.0 (2020-09-14)

  - First release
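A note on the 0.1.5 entry: the code hunks later in this diff replace `text[offset...finish]` and `.size` with `text.byteslice(offset...finish)` and `.bytesize`, which suggests the tokenizer reports byte offsets rather than character offsets. The sketch below illustrates that mismatch in plain Ruby under that assumption. The sample string and the helper `byte_span` are hypothetical and not part of the gem.

```ruby
# Hypothetical helper: byte offsets [start, finish) of `token` inside `text`,
# standing in for what a byte-oriented C tokenizer would report (assumption).
def byte_span(text, token)
  start = text.b.index(token.b)
  [start, start + token.bytesize]
end

text = "Café in Paris" # "é" takes 2 bytes in UTF-8
offset, finish = byte_span(text, "Paris")

# Character-based slicing treats the byte offsets as character indexes.
char_slice = text[offset...finish]

# Byte-based slicing uses the offsets as they were reported.
byte_slice = text.byteslice(offset...finish)

puts "offsets: #{offset}...#{finish}"
puts "character slice: #{char_slice.inspect}"
puts "byte slice: #{byte_slice.inspect}"
```

With this input the character slice is shifted by one position for every multibyte character that precedes the token, while the byte slice recovers the token intact.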
data/README.md CHANGED
@@ -1,14 +1,13 @@
  # MITIE

- [MITIE](https://github.com/mit-nlp/MITIE) - named-entity recognition - for Ruby
+ [MITIE](https://github.com/mit-nlp/MITIE) - named-entity recognition and binary relation detection - for Ruby

- ## Installation
+ - Finds people, organizations, and locations in text
+ - Detects relationships between entities, like `PERSON` was born in `LOCATION`

- First, install MITIE. For Homebrew, use:
+ [![Build Status](https://github.com/ankane/mitie/workflows/build/badge.svg?branch=master)](https://github.com/ankane/mitie/actions)

- ```sh
- brew install mitie
- ```
+ ## Installation

  Add this line to your application’s Gemfile:

@@ -24,44 +23,44 @@ And download the pre-trained model for your language:

  ## Getting Started

- Get your text
+ Load an NER model

  ```ruby
- text = "Nat Friedman is the CEO of GitHub, which is headquartered in San Francisco"
+ model = Mitie::NER.new("ner_model.dat")
  ```

- Load an NER model
+ Create a document

  ```ruby
- model = Mitie::NER.new("ner_model.dat")
+ doc = model.doc("Nat works at GitHub in San Francisco")
  ```

  Get entities

  ```ruby
- model.entities(text)
+ doc.entities
  ```

  This returns

  ```ruby
  [
-   {text: "Nat Friedman", tag: "PERSON", score: 1.099661347535191, offset: 0},
-   {text: "GitHub", tag: "ORGANIZATION", score: 0.344641651251650, offset: 27},
-   {text: "San Francisco", tag: "LOCATION", score: 1.428241888939011, offset: 61}
+   {text: "Nat", tag: "PERSON", score: 0.3112371212688382, offset: 0},
+   {text: "GitHub", tag: "ORGANIZATION", score: 0.5660115198329334, offset: 13},
+   {text: "San Francisco", tag: "LOCATION", score: 1.3890524313885309, offset: 23}
  ]
  ```

  Get tokens

  ```ruby
- model.tokens(text)
+ doc.tokens
  ```

  Get tokens and their offset

  ```ruby
- model.tokens_with_offset(text)
+ doc.tokens_with_offset
  ```

  Get all tags for a model
@@ -70,6 +69,40 @@ Get all tags for a model
  model.tags
  ```

+ ## Binary Relation Detection
+
+ Detect relationships between two entities, like:
+
+ - `PERSON` was born in `LOCATION`
+ - `ORGANIZATION` was founded in `LOCATION`
+ - `FILM` was directed by `PERSON`
+
+ There are 21 detectors for English. You can find them in the `binary_relations` directory in the model download.
+
+ Load a detector
+
+ ```ruby
+ detector = Mitie::BinaryRelationDetector.new("rel_classifier_organization.organization.place_founded.svm")
+ ```
+
+ And create a document
+
+ ```ruby
+ doc = model.doc("Shopify was founded in Ottawa")
+ ```
+
+ Get relations
+
+ ```ruby
+ detector.relations(doc)
+ ```
+
+ This returns
+
+ ```ruby
+ [{first: "Shopify", second: "Ottawa", score: 0.17649169745814464}]
+ ```
+
  ## History

  View the [changelog](https://github.com/ankane/mitie/blob/master/CHANGELOG.md)
@@ -89,5 +122,8 @@ To get started with development:
  git clone https://github.com/ankane/mitie.git
  cd mitie
  bundle install
- MITIE_NER_PATH=path/to/ner_model.dat bundle exec rake test
+ bundle exec rake vendor:all
+
+ export MITIE_MODELS_PATH=path/to/MITIE-models/english
+ bundle exec rake test
  ```
@@ -2,6 +2,8 @@
  require "fiddle/import"

  # modules
+ require "mitie/binary_relation_detector"
+ require "mitie/document"
  require "mitie/ner"
  require "mitie/version"

@@ -11,14 +13,18 @@ module Mitie
    class << self
      attr_accessor :ffi_lib
    end
-   self.ffi_lib =
+   lib_name =
      if Gem.win_platform?
-       ["mitie.dll"]
+       "mitie.dll"
+     elsif RbConfig::CONFIG["arch"] =~ /arm64-darwin/i
+       "libmitie.arm64.dylib"
      elsif RbConfig::CONFIG["host_os"] =~ /darwin/i
-       ["libmitie.dylib"]
+       "libmitie.dylib"
      else
-       ["libmitie.so"]
+       "libmitie.so"
      end
+   vendor_lib = File.expand_path("../vendor/#{lib_name}", __dir__)
+   self.ffi_lib = [vendor_lib]

    # friendlier error message
    autoload :FFI, "mitie/ffi"
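The branch order in the hunk above matters: the `arm64-darwin` check has to run before the generic `darwin` check, otherwise Apple Silicon machines would fall through to the Intel library. A minimal sketch of the same selection as a pure function, so it can be exercised without changing platforms. The method name `mitie_lib_name` and its keyword arguments are hypothetical; only the file names and regular expressions come from the diff.

```ruby
require "rbconfig"

# Hypothetical helper mirroring the branch order shown in the diff.
def mitie_lib_name(windows:, arch:, host_os:)
  if windows
    "mitie.dll"
  elsif arch =~ /arm64-darwin/i
    # must come before the generic darwin check
    "libmitie.arm64.dylib"
  elsif host_os =~ /darwin/i
    "libmitie.dylib"
  else
    "libmitie.so"
  end
end

# Illustrative platform strings (assumed values, not taken from the source).
puts mitie_lib_name(windows: false, arch: "arm64-darwin20", host_os: "darwin20")
puts mitie_lib_name(windows: false, arch: "x86_64-darwin19", host_os: "darwin19")
puts mitie_lib_name(windows: false, arch: "x86_64-linux", host_os: "linux-gnu")

# The same helper fed with the values of the running interpreter.
puts mitie_lib_name(
  windows: Gem.win_platform?,
  arch: RbConfig::CONFIG["arch"],
  host_os: RbConfig::CONFIG["host_os"]
)
```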
@@ -0,0 +1,62 @@
+ module Mitie
+   class BinaryRelationDetector
+     def initialize(path)
+       # better error message
+       raise ArgumentError, "File does not exist" unless File.exist?(path)
+       @pointer = FFI.mitie_load_binary_relation_detector(path)
+       ObjectSpace.define_finalizer(self, self.class.finalize(pointer))
+     end
+
+     def name
+       FFI.mitie_binary_relation_detector_name_string(pointer).to_s
+     end
+
+     def relations(doc)
+       raise ArgumentError, "Expected Mitie::Document, not #{doc.class.name}" unless doc.is_a?(Document)
+
+       entities = doc.entities
+       combinations = []
+       (entities.size - 1).times do |i|
+         combinations << [entities[i], entities[i + 1]]
+         combinations << [entities[i + 1], entities[i]]
+       end
+
+       relations = []
+       combinations.each do |entity1, entity2|
+         relation =
+           FFI.mitie_extract_binary_relation(
+             doc.model.pointer,
+             doc.send(:tokens_ptr),
+             entity1[:token_index],
+             entity1[:token_length],
+             entity2[:token_index],
+             entity2[:token_length]
+           )
+
+         score_ptr = Fiddle::Pointer.malloc(Fiddle::SIZEOF_DOUBLE)
+         status = FFI.mitie_classify_binary_relation(pointer, relation, score_ptr)
+         raise "Bad status: #{status}" if status != 0
+         score = score_ptr.to_s(Fiddle::SIZEOF_DOUBLE).unpack1("d")
+         if score > 0
+           relations << {
+             first: entity1[:text],
+             second: entity2[:text],
+             score: score
+           }
+         end
+       end
+       relations
+     end
+
+     private
+
+     def pointer
+       @pointer
+     end
+
+     def self.finalize(pointer)
+       # must use proc instead of stabby lambda
+       proc { FFI.mitie_free(pointer) }
+     end
+   end
+ end
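The `relations` method above only scores neighbouring entities, each pair in both orders, so a document with n entities yields 2(n - 1) candidates rather than every possible pair. A minimal sketch of that candidate generation in isolation, written with `each_cons` instead of the index loop. The helper name `adjacent_pairs` and the three-entity sample are hypothetical.

```ruby
# Hypothetical helper: the same candidates as the index loop in the diff,
# i.e. each adjacent pair of entities in both orders.
def adjacent_pairs(entities)
  entities.each_cons(2).flat_map { |a, b| [[a, b], [b, a]] }
end

# Stand-in for doc.entities; only :text is needed for the illustration.
entities = [{text: "Shopify"}, {text: "Ottawa"}, {text: "Canada"}]

pairs = adjacent_pairs(entities).map { |a, b| [a[:text], b[:text]] }
pairs.each { |first, second| puts "#{first} -> #{second}" }
```

Note that "Shopify" and "Canada" are never paired here because they are not adjacent in the entity list.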
@@ -0,0 +1,115 @@
+ module Mitie
+   class Document
+     attr_reader :model, :text
+
+     def initialize(model, text)
+       @model = model
+       @text = text
+     end
+
+     def tokens
+       @tokens ||= tokens_with_offset.map(&:first)
+     end
+
+     def tokens_with_offset
+       @tokens_with_offset ||= begin
+         if text.is_a?(Array)
+           # offsets are unknown when given tokens
+           text.map { |v| [v, nil] }
+         else
+           i = 0
+           tokens = []
+           loop do
+             token = (tokens_ptr + i * Fiddle::SIZEOF_VOIDP).ptr
+             break if token.null?
+             offset = (offsets_ptr.ptr + i * Fiddle::SIZEOF_LONG).to_s(Fiddle::SIZEOF_LONG).unpack1("L!")
+             tokens << [token.to_s.force_encoding(text.encoding), offset]
+             i += 1
+           end
+           tokens
+         end
+       end
+     end
+
+     def entities
+       @entities ||= begin
+         begin
+           entities = []
+           tokens = tokens_with_offset
+           detections = FFI.mitie_extract_entities(pointer, tokens_ptr)
+           num_detections = FFI.mitie_ner_get_num_detections(detections)
+           num_detections.times do |i|
+             pos = FFI.mitie_ner_get_detection_position(detections, i)
+             len = FFI.mitie_ner_get_detection_length(detections, i)
+             tag = FFI.mitie_ner_get_detection_tagstr(detections, i).to_s
+             score = FFI.mitie_ner_get_detection_score(detections, i)
+             tok = tokens[pos, len]
+             offset = tok[0][1]
+
+             entity = {}
+             if offset
+               finish = tok[-1][1] + tok[-1][0].bytesize
+               entity[:text] = text.byteslice(offset...finish)
+             else
+               entity[:text] = tok.map(&:first)
+             end
+             entity[:tag] = tag
+             entity[:score] = score
+             entity[:offset] = offset if offset
+             entity[:token_index] = pos
+             entity[:token_length] = len
+             entities << entity
+           end
+           entities
+         ensure
+           FFI.mitie_free(detections) if detections
+         end
+       end
+     end
+
+     private
+
+     def pointer
+       model.pointer
+     end
+
+     def tokens_ptr
+       tokenize[0]
+     end
+
+     def offsets_ptr
+       tokenize[1]
+     end
+
+     def tokenize
+       @tokenize ||= begin
+         if text.is_a?(Array)
+           # malloc uses memset to set all bytes to 0
+           tokens_ptr = Fiddle::Pointer.malloc(Fiddle::SIZEOF_VOIDP * (text.size + 1))
+           text.size.times do |i|
+             tokens_ptr[i * Fiddle::SIZEOF_VOIDP, Fiddle::SIZEOF_VOIDP] = Fiddle::Pointer.to_ptr(text[i]).ref
+           end
+           [tokens_ptr, nil]
+         else
+           offsets_ptr = Fiddle::Pointer.malloc(Fiddle::SIZEOF_VOIDP)
+           tokens_ptr = FFI.mitie_tokenize_with_offsets(text, offsets_ptr)
+
+           ObjectSpace.define_finalizer(tokens_ptr, self.class.finalize(tokens_ptr))
+           ObjectSpace.define_finalizer(offsets_ptr, self.class.finalize_ptr(offsets_ptr))
+
+           [tokens_ptr, offsets_ptr]
+         end
+       end
+     end
+
+     def self.finalize(pointer)
+       # must use proc instead of stabby lambda
+       proc { FFI.mitie_free(pointer) }
+     end
+
+     def self.finalize_ptr(pointer)
+       # must use proc instead of stabby lambda
+       proc { FFI.mitie_free(pointer.ptr) }
+     end
+   end
+ end
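The `entities` method above builds an entity's text in one of two ways: from byte offsets when the document was created from a string, or as a list of tokens when it was created from a pre-tokenized array (the custom tokenization case, where offsets are `nil`). The sketch below reproduces just that step in plain Ruby, with no native library involved. The helper name `entity_text` is hypothetical, and the token offsets are written out by hand for the README sentence.

```ruby
# Hypothetical helper: rebuild an entity's text from [token, byte_offset] pairs,
# following the two branches in Document#entities.
def entity_text(text, tokens, pos, len)
  tok = tokens[pos, len]
  offset = tok[0][1]
  if offset
    # string input: slice the original text by byte range
    finish = tok[-1][1] + tok[-1][0].bytesize
    text.byteslice(offset...finish)
  else
    # pre-tokenized input: offsets are unknown, so return the tokens themselves
    tok.map(&:first)
  end
end

text = "Nat works at GitHub in San Francisco"
with_offsets = [
  ["Nat", 0], ["works", 4], ["at", 10], ["GitHub", 13],
  ["in", 20], ["San", 23], ["Francisco", 27]
]
without_offsets = with_offsets.map { |token, _| [token, nil] }

# A two-token detection starting at token index 5
p entity_text(text, with_offsets, 5, 2)
p entity_text(text, without_offsets, 5, 2)
```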
@@ -25,5 +25,11 @@ module Mitie
      extern "unsigned long mitie_ner_get_detection_tag(const mitie_named_entity_detections* dets, unsigned long idx)"
      extern "const char* mitie_ner_get_detection_tagstr(const mitie_named_entity_detections* dets, unsigned long idx)"
      extern "double mitie_ner_get_detection_score(const mitie_named_entity_detections* dets, unsigned long idx)"
+
+     extern "mitie_binary_relation_detector* mitie_load_binary_relation_detector(const char* filename)"
+     extern "const char* mitie_binary_relation_detector_name_string(const mitie_binary_relation_detector* detector)"
+     extern "int mitie_entities_overlap(unsigned long arg1_start, unsigned long arg1_length, unsigned long arg2_start, unsigned long arg2_length)"
+     extern "mitie_binary_relation* mitie_extract_binary_relation(const mitie_named_entity_extractor* ner, char** tokens, unsigned long arg1_start, unsigned long arg1_length, unsigned long arg2_start, unsigned long arg2_length)"
+     extern "int mitie_classify_binary_relation(const mitie_binary_relation_detector* detector, const mitie_binary_relation* relation, double* score)"
    end
  end
@@ -1,6 +1,10 @@
  module Mitie
    class NER
+     attr_reader :pointer
+
      def initialize(path)
+       # better error message
+       raise ArgumentError, "File does not exist" unless File.exist?(path)
        @pointer = FFI.mitie_load_named_entity_extractor(path)
        ObjectSpace.define_finalizer(self, self.class.finalize(pointer))
      end
@@ -11,76 +15,20 @@ module Mitie
        end
      end

-     def tokens(text)
-       tokens = []
-       ptr = FFI.mitie_tokenize(text)
-       i = 0
-       loop do
-         token = (ptr + i * Fiddle::SIZEOF_VOIDP).ptr
-         break if token.null?
-         tokens << token.to_s.force_encoding(text.encoding)
-         i += 1
-       end
-       tokens
-     ensure
-       FFI.mitie_free(ptr) if ptr
-     end
-
-     def tokens_with_offset(text)
-       tokens, ptr = tokens_with_offset_with_ptr(text)
-       tokens
-     ensure
-       FFI.mitie_free(ptr) if ptr
+     def doc(text)
+       Document.new(self, text)
      end

      def entities(text)
-       entities = []
-       tokens, tokens_ptr = tokens_with_offset_with_ptr(text)
-       detections = FFI.mitie_extract_entities(pointer, tokens_ptr)
-       num_detections = FFI.mitie_ner_get_num_detections(detections)
-       num_detections.times do |i|
-         pos = FFI.mitie_ner_get_detection_position(detections, i)
-         len = FFI.mitie_ner_get_detection_length(detections, i)
-         tag = FFI.mitie_ner_get_detection_tagstr(detections, i).to_s
-         score = FFI.mitie_ner_get_detection_score(detections, i)
-         tok = tokens[pos, len]
-         offset = tok[0][1]
-         finish = tok[-1][1] + tok[-1][0].size
-         entities << {
-           text: text[offset...finish],
-           tag: tag,
-           score: score,
-           offset: offset
-         }
-       end
-       entities
-     ensure
-       FFI.mitie_free(tokens_ptr) if tokens_ptr
-       FFI.mitie_free(detections) if detections
+       doc(text).entities
      end

-     private
-
-     def pointer
-       @pointer
+     def tokens(text)
+       doc(text).tokens
      end

-     def tokens_with_offset_with_ptr(text)
-       token_offsets = Fiddle::Pointer.malloc(Fiddle::SIZEOF_VOIDP)
-       ptr = FFI.mitie_tokenize_with_offsets(text, token_offsets)
-       i = 0
-       tokens = []
-       loop do
-         token = (ptr + i * Fiddle::SIZEOF_VOIDP).ptr
-         break if token.null?
-         offset = (token_offsets.ptr + i * Fiddle::SIZEOF_LONG).to_s(Fiddle::SIZEOF_LONG).unpack1("L!")
-         tokens << [token.to_s.force_encoding(text.encoding), offset]
-         i += 1
-       end
-       [tokens, ptr]
-     ensure
-       # use ptr, not token_offsets.ptr
-       FFI.mitie_free(token_offsets.ptr) if ptr
+     def tokens_with_offset(text)
+       doc(text).tokens_with_offset
      end

      def self.finalize(pointer)
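After this hunk, `NER#tokens`, `NER#tokens_with_offset`, and `NER#entities` are thin wrappers that build a fresh `Document` on every call, while `Document` memoizes its tokenization with `||=`. One practical consequence is that code needing both tokens and entities for the same text would tokenize once by holding on to the document. The sketch below illustrates the memoization pattern with a stand-in class; `SketchDocument`, its whitespace tokenizer, and the call counter are hypothetical and only mimic the shape of the real class.

```ruby
# Hypothetical stand-in for Mitie::Document: counts how often tokenization runs.
class SketchDocument
  attr_reader :tokenize_calls

  def initialize(text)
    @text = text
    @tokenize_calls = 0
  end

  def tokens
    tokenize
  end

  def token_count
    tokenize.size
  end

  private

  # Memoized like Document#tokenize in the diff: the block runs at most once.
  def tokenize
    @tokenize ||= begin
      @tokenize_calls += 1
      @text.split # whitespace split stands in for the native tokenizer
    end
  end
end

doc = SketchDocument.new("Nat works at GitHub in San Francisco")
doc.tokens
doc.token_count
puts "tokenization runs with a reused document: #{doc.tokenize_calls}"

# Building a new document per call, as the NER wrappers do, repeats the work.
runs = [SketchDocument.new("Nat works"), SketchDocument.new("Nat works")].sum do |d|
  d.tokens
  d.tokenize_calls
end
puts "tokenization runs with one document per call: #{runs}"
```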
@@ -1,3 +1,3 @@
  module Mitie
-   VERSION = "0.1.0"
+   VERSION = "0.1.5"
  end
@@ -0,0 +1,23 @@
+ Boost Software License - Version 1.0 - August 17th, 2003
+
+ Permission is hereby granted, free of charge, to any person or organization
+ obtaining a copy of the software and accompanying documentation covered by
+ this license (the "Software") to use, reproduce, display, distribute,
+ execute, and transmit the Software, and to prepare derivative works of the
+ Software, and to permit third-parties to whom the Software is furnished to
+ do so, all subject to the following:
+
+ The copyright notices in the Software and this entire statement, including
+ the above license grant, this restriction and the following disclaimer,
+ must be included in all copies of the Software, in whole or in part, and
+ all derivative works of the Software, unless such copies or derivative
+ works are solely in the form of machine-executable object code generated by
+ a source language processor.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT
+ SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE
+ FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE,
+ ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ DEALINGS IN THE SOFTWARE.
Binary file
Binary file
Binary file
metadata CHANGED
@@ -1,58 +1,16 @@
  --- !ruby/object:Gem::Specification
  name: mitie
  version: !ruby/object:Gem::Version
-   version: 0.1.0
+   version: 0.1.5
  platform: ruby
  authors:
  - Andrew Kane
- autorequire:
+ autorequire:
  bindir: bin
  cert_chain: []
- date: 2020-09-14 00:00:00.000000000 Z
- dependencies:
- - !ruby/object:Gem::Dependency
-   name: bundler
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
- - !ruby/object:Gem::Dependency
-   name: rake
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
- - !ruby/object:Gem::Dependency
-   name: minitest
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '5'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '5'
- description:
+ date: 2021-01-30 00:00:00.000000000 Z
+ dependencies: []
+ description:
  email: andrew@chartkick.com
  executables: []
  extensions: []
@@ -62,14 +20,21 @@ files:
  - LICENSE.txt
  - README.md
  - lib/mitie.rb
+ - lib/mitie/binary_relation_detector.rb
+ - lib/mitie/document.rb
  - lib/mitie/ffi.rb
  - lib/mitie/ner.rb
  - lib/mitie/version.rb
+ - vendor/LICENSE.txt
+ - vendor/libmitie.arm64.dylib
+ - vendor/libmitie.dylib
+ - vendor/libmitie.so
+ - vendor/mitie.dll
  homepage: https://github.com/ankane/mitie
  licenses:
  - BSL-1.0
  metadata: {}
- post_install_message:
+ post_install_message:
  rdoc_options: []
  require_paths:
  - lib
@@ -84,8 +49,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
      - !ruby/object:Gem::Version
        version: '0'
  requirements: []
- rubygems_version: 3.1.2
- signing_key:
+ rubygems_version: 3.2.3
+ signing_key:
  specification_version: 4
  summary: Named-entity recognition for Ruby
  test_files: []