mitie 0.1.0 → 0.1.5

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: ea3ef115016c59ecb496ffbbe13c4ac3a2ffda6acf9392ac423103b9c3cfe634
-   data.tar.gz: 6eb77dd514ba3c08c30e1216921cd83619f206a658018c1d4522c598e175e8b2
+   metadata.gz: e0ceaa2d4609a2a1b3b4056d67d88a1b0f55616ac8fd1a0509f070c352a96ea1
+   data.tar.gz: 0b90eba1027ca5a46a405a97411d2b5fe193d20666ee8c8bad347bb2c79c225b
  SHA512:
-   metadata.gz: 682fb3ea1c0be1889f2e1e177204309ea7b6d2989834a4d2bae49ddf567e309b9fef9ea872746626590ec06b9aabb455067d88e696ea3a4f59a9b785b43819c9
-   data.tar.gz: 4260a6dff4eb613278468d9fff2ef19f99f49bed81571781b4d2ac886f39bed9356a4a20a9220fa2b31e8c12965bffac5e86b40318b28aa450ebd60a3c53c3cd
+   metadata.gz: 8281e51659e08157d305535f3cd242082d4173368c36fa167716771b008a82233d02e43d1015c39e27c4c704ae0ec53935834eb0fefb71dac3b515962987d7eb
+   data.tar.gz: 96d3564684f8197651f93238876f2bbeb69f860a8a373f36dcafb578db698555153b320f2bd0c799cd8d5d939ad05c3746c59e907d4ad62cfa4917fbb7ec2313
@@ -1,3 +1,25 @@
+ ## 0.1.5 (2021-01-29)
+
+ - Fixed issue with multibyte characters
+
+ ## 0.1.4 (2020-12-28)
+
+ - Added ARM shared library for Mac
+
+ ## 0.1.3 (2020-12-04)
+
+ - Added support for custom tokenization
+
+ ## 0.1.2 (2020-09-14)
+
+ - Added binary relation detection
+ - Added `Document` class
+
+ ## 0.1.1 (2020-09-14)
+
+ - Added shared libraries
+ - Improved error message when model file does not exist
+
  ## 0.1.0 (2020-09-14)
 
  - First release
data/README.md CHANGED
@@ -1,14 +1,13 @@
  # MITIE
 
- [MITIE](https://github.com/mit-nlp/MITIE) - named-entity recognition - for Ruby
+ [MITIE](https://github.com/mit-nlp/MITIE) - named-entity recognition and binary relation detection - for Ruby
 
- ## Installation
+ - Finds people, organizations, and locations in text
+ - Detects relationships between entities, like `PERSON` was born in `LOCATION`
 
- First, install MITIE. For Homebrew, use:
+ [![Build Status](https://github.com/ankane/mitie/workflows/build/badge.svg?branch=master)](https://github.com/ankane/mitie/actions)
 
- ```sh
- brew install mitie
- ```
+ ## Installation
 
  Add this line to your application’s Gemfile:
 
@@ -24,44 +23,44 @@ And download the pre-trained model for your language:
 
  ## Getting Started
 
- Get your text
+ Load an NER model
 
  ```ruby
- text = "Nat Friedman is the CEO of GitHub, which is headquartered in San Francisco"
+ model = Mitie::NER.new("ner_model.dat")
  ```
 
- Load an NER model
+ Create a document
 
  ```ruby
- model = Mitie::NER.new("ner_model.dat")
+ doc = model.doc("Nat works at GitHub in San Francisco")
  ```
 
  Get entities
 
  ```ruby
- model.entities(text)
+ doc.entities
  ```
 
  This returns
 
  ```ruby
  [
-   {text: "Nat Friedman", tag: "PERSON", score: 1.099661347535191, offset: 0},
-   {text: "GitHub", tag: "ORGANIZATION", score: 0.344641651251650, offset: 27},
-   {text: "San Francisco", tag: "LOCATION", score: 1.428241888939011, offset: 61}
+   {text: "Nat", tag: "PERSON", score: 0.3112371212688382, offset: 0},
+   {text: "GitHub", tag: "ORGANIZATION", score: 0.5660115198329334, offset: 13},
+   {text: "San Francisco", tag: "LOCATION", score: 1.3890524313885309, offset: 23}
  ]
  ```
 
  Get tokens
 
  ```ruby
- model.tokens(text)
+ doc.tokens
  ```
 
  Get tokens and their offset
 
  ```ruby
- model.tokens_with_offset(text)
+ doc.tokens_with_offset
  ```
 
  Get all tags for a model
@@ -70,6 +69,40 @@ Get all tags for a model
  model.tags
  ```
 
+ ## Binary Relation Detection
+
+ Detect relationships betweens two entities, like:
+
+ - `PERSON` was born in `LOCATION`
+ - `ORGANIZATION` was founded in `LOCATION`
+ - `FILM` was directed by `PERSON`
+
+ There are 21 detectors for English. You can find them in the `binary_relations` directory in the model download.
+
+ Load a detector
+
+ ```ruby
+ detector = Mitie::BinaryRelationDetector.new("rel_classifier_organization.organization.place_founded.svm")
+ ```
+
+ And create a document
+
+ ```ruby
+ doc = model.doc("Shopify was founded in Ottawa")
+ ```
+
+ Get relations
+
+ ```ruby
+ detector.relations(doc)
+ ```
+
+ This returns
+
+ ```ruby
+ [{first: "Shopify", second: "Ottawa", score: 0.17649169745814464}]
+ ```
+
  ## History
 
  View the [changelog](https://github.com/ankane/mitie/blob/master/CHANGELOG.md)
@@ -89,5 +122,8 @@ To get started with development:
  git clone https://github.com/ankane/mitie.git
  cd mitie
  bundle install
- MITIE_NER_PATH=path/to/ner_model.dat bundle exec rake test
+ bundle exec rake vendor:all
+
+ export MITIE_MODELS_PATH=path/to/MITIE-models/english
+ bundle exec rake test
  ```
@@ -2,6 +2,8 @@
  require "fiddle/import"
 
  # modules
+ require "mitie/binary_relation_detector"
+ require "mitie/document"
  require "mitie/ner"
  require "mitie/version"
 
@@ -11,14 +13,18 @@ module Mitie
    class << self
      attr_accessor :ffi_lib
    end
-   self.ffi_lib =
+   lib_name =
      if Gem.win_platform?
-       ["mitie.dll"]
+       "mitie.dll"
+     elsif RbConfig::CONFIG["arch"] =~ /arm64-darwin/i
+       "libmitie.arm64.dylib"
      elsif RbConfig::CONFIG["host_os"] =~ /darwin/i
-       ["libmitie.dylib"]
+       "libmitie.dylib"
      else
-       ["libmitie.so"]
+       "libmitie.so"
      end
+   vendor_lib = File.expand_path("../vendor/#{lib_name}", __dir__)
+   self.ffi_lib = [vendor_lib]
 
    # friendlier error message
    autoload :FFI, "mitie/ffi"
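
The hunk above swaps the bare library names for a path into the gem's `vendor` directory and adds an ARM branch ahead of the generic macOS one. A minimal sketch of the same branch order follows. The helper `vendored_lib_name` and its keyword parameters are hypothetical stand-ins for `Gem.win_platform?`, `RbConfig::CONFIG["arch"]`, and `RbConfig::CONFIG["host_os"]`, introduced only so each branch can be exercised on any machine.

```ruby
# Hypothetical helper (not part of the gem) that mirrors the branch order
# shown in the diff: Windows first, then ARM macOS, then any macOS, then
# everything else.
def vendored_lib_name(windows:, arch:, host_os:)
  if windows
    "mitie.dll"
  elsif arch =~ /arm64-darwin/i
    "libmitie.arm64.dylib"
  elsif host_os =~ /darwin/i
    "libmitie.dylib"
  else
    "libmitie.so"
  end
end

# Example inputs are illustrative values, not read from the running system.
puts vendored_lib_name(windows: false, arch: "arm64-darwin20", host_os: "darwin20")
puts vendored_lib_name(windows: false, arch: "x86_64-darwin19", host_os: "darwin19")
puts vendored_lib_name(windows: false, arch: "x86_64-linux", host_os: "linux-gnu")
```

The ARM check has to come before the generic `darwin` check, because an ARM Mac matches both patterns.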
@@ -0,0 +1,62 @@
+ module Mitie
+   class BinaryRelationDetector
+     def initialize(path)
+       # better error message
+       raise ArgumentError, "File does not exist" unless File.exist?(path)
+       @pointer = FFI.mitie_load_binary_relation_detector(path)
+       ObjectSpace.define_finalizer(self, self.class.finalize(pointer))
+     end
+
+     def name
+       FFI.mitie_binary_relation_detector_name_string(pointer).to_s
+     end
+
+     def relations(doc)
+       raise ArgumentError, "Expected Mitie::Document, not #{doc.class.name}" unless doc.is_a?(Document)
+
+       entities = doc.entities
+       combinations = []
+       (entities.size - 1).times do |i|
+         combinations << [entities[i], entities[i + 1]]
+         combinations << [entities[i + 1], entities[i]]
+       end
+
+       relations = []
+       combinations.each do |entity1, entity2|
+         relation =
+           FFI.mitie_extract_binary_relation(
+             doc.model.pointer,
+             doc.send(:tokens_ptr),
+             entity1[:token_index],
+             entity1[:token_length],
+             entity2[:token_index],
+             entity2[:token_length]
+           )
+
+         score_ptr = Fiddle::Pointer.malloc(Fiddle::SIZEOF_DOUBLE)
+         status = FFI.mitie_classify_binary_relation(pointer, relation, score_ptr)
+         raise "Bad status: #{status}" if status != 0
+         score = score_ptr.to_s(Fiddle::SIZEOF_DOUBLE).unpack1("d")
+         if score > 0
+           relations << {
+             first: entity1[:text],
+             second: entity2[:text],
+             score: score
+           }
+         end
+       end
+       relations
+     end
+
+     private
+
+     def pointer
+       @pointer
+     end
+
+     def self.finalize(pointer)
+       # must use proc instead of stabby lambda
+       proc { FFI.mitie_free(pointer) }
+     end
+   end
+ end
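
In `relations` above, candidate pairs are built only from neighbouring entities, each in both orders, so entities that are not adjacent in the detection list are never compared. A minimal sketch of that pairing step in plain Ruby follows. The `adjacent_pairs` helper and the sample entity hashes are hypothetical, and `each_cons` is swapped in for the index loop used in the diff.

```ruby
# Hypothetical stand-in for doc.entities; only the :text key matters here.
entities = [{text: "Shopify"}, {text: "Ottawa"}, {text: "Canada"}]

# Pair each entity with its immediate neighbour, in both orders.
# each_cons(2) yields nothing for fewer than two entities, matching the
# behaviour of (entities.size - 1).times in the diff.
def adjacent_pairs(entities)
  entities.each_cons(2).flat_map { |a, b| [[a, b], [b, a]] }
end

pairs = adjacent_pairs(entities).map { |a, b| [a[:text], b[:text]] }
pairs.each { |first, second| puts "#{first} -> #{second}" }
```

With three entities this produces four ordered pairs, and "Shopify" is never paired with "Canada".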
@@ -0,0 +1,115 @@
+ module Mitie
+   class Document
+     attr_reader :model, :text
+
+     def initialize(model, text)
+       @model = model
+       @text = text
+     end
+
+     def tokens
+       @tokens ||= tokens_with_offset.map(&:first)
+     end
+
+     def tokens_with_offset
+       @tokens_with_offset ||= begin
+         if text.is_a?(Array)
+           # offsets are unknown when given tokens
+           text.map { |v| [v, nil] }
+         else
+           i = 0
+           tokens = []
+           loop do
+             token = (tokens_ptr + i * Fiddle::SIZEOF_VOIDP).ptr
+             break if token.null?
+             offset = (offsets_ptr.ptr + i * Fiddle::SIZEOF_LONG).to_s(Fiddle::SIZEOF_LONG).unpack1("L!")
+             tokens << [token.to_s.force_encoding(text.encoding), offset]
+             i += 1
+           end
+           tokens
+         end
+       end
+     end
+
+     def entities
+       @entities ||= begin
+         begin
+           entities = []
+           tokens = tokens_with_offset
+           detections = FFI.mitie_extract_entities(pointer, tokens_ptr)
+           num_detections = FFI.mitie_ner_get_num_detections(detections)
+           num_detections.times do |i|
+             pos = FFI.mitie_ner_get_detection_position(detections, i)
+             len = FFI.mitie_ner_get_detection_length(detections, i)
+             tag = FFI.mitie_ner_get_detection_tagstr(detections, i).to_s
+             score = FFI.mitie_ner_get_detection_score(detections, i)
+             tok = tokens[pos, len]
+             offset = tok[0][1]
+
+             entity = {}
+             if offset
+               finish = tok[-1][1] + tok[-1][0].bytesize
+               entity[:text] = text.byteslice(offset...finish)
+             else
+               entity[:text] = tok.map(&:first)
+             end
+             entity[:tag] = tag
+             entity[:score] = score
+             entity[:offset] = offset if offset
+             entity[:token_index] = pos
+             entity[:token_length] = len
+             entities << entity
+           end
+           entities
+         ensure
+           FFI.mitie_free(detections) if detections
+         end
+       end
+     end
+
+     private
+
+     def pointer
+       model.pointer
+     end
+
+     def tokens_ptr
+       tokenize[0]
+     end
+
+     def offsets_ptr
+       tokenize[1]
+     end
+
+     def tokenize
+       @tokenize ||= begin
+         if text.is_a?(Array)
+           # malloc uses memset to set all bytes to 0
+           tokens_ptr = Fiddle::Pointer.malloc(Fiddle::SIZEOF_VOIDP * (text.size + 1))
+           text.size.times do |i|
+             tokens_ptr[i * Fiddle::SIZEOF_VOIDP, Fiddle::SIZEOF_VOIDP] = Fiddle::Pointer.to_ptr(text[i]).ref
+           end
+           [tokens_ptr, nil]
+         else
+           offsets_ptr = Fiddle::Pointer.malloc(Fiddle::SIZEOF_VOIDP)
+           tokens_ptr = FFI.mitie_tokenize_with_offsets(text, offsets_ptr)
+
+           ObjectSpace.define_finalizer(tokens_ptr, self.class.finalize(tokens_ptr))
+           ObjectSpace.define_finalizer(offsets_ptr, self.class.finalize_ptr(offsets_ptr))
+
+           [tokens_ptr, offsets_ptr]
+         end
+       end
+     end
+
+     def self.finalize(pointer)
+       # must use proc instead of stabby lambda
+       proc { FFI.mitie_free(pointer) }
+     end
+
+     def self.finalize_ptr(pointer)
+       # must use proc instead of stabby lambda
+       proc { FFI.mitie_free(pointer.ptr) }
+     end
+   end
+ end
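
`Document#entities` above slices with `bytesize` and `byteslice`, where the removed `NER#entities` used `size` and `text[offset...finish]`. This is presumably the multibyte fix listed for 0.1.5. The sketch below shows why the two differ once the text contains a multibyte character. It assumes the tokenizer reports offsets in bytes, which is inferred from the change rather than stated in the diff, and the sample sentence is made up.

```ruby
# "ë" occupies two bytes in UTF-8, so byte and character offsets diverge
# for everything after it.
text = "Zoë works at GitHub"
token = "GitHub"

# Assumption: the native tokenizer reports byte offsets. String#b gives a
# binary copy, so index counts bytes instead of characters.
byte_offset = text.b.index(token.b)
char_offset = text.index(token)

# Slicing by bytes recovers the token; slicing characters with a byte
# offset lands one character too far to the right.
by_bytes = text.byteslice(byte_offset, token.bytesize)
by_chars = text[byte_offset, token.size]

puts "byte offset: #{byte_offset}, char offset: #{char_offset}"
puts "byteslice: #{by_bytes.inspect}"
puts "character slice: #{by_chars.inspect}"
```

For pure ASCII input the two approaches agree, which would explain why the 0.1.0 README example worked.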
@@ -25,5 +25,11 @@ module Mitie
      extern "unsigned long mitie_ner_get_detection_tag(const mitie_named_entity_detections* dets, unsigned long idx)"
      extern "const char* mitie_ner_get_detection_tagstr(const mitie_named_entity_detections* dets, unsigned long idx)"
      extern "double mitie_ner_get_detection_score(const mitie_named_entity_detections* dets, unsigned long idx)"
+
+     extern "mitie_binary_relation_detector* mitie_load_binary_relation_detector(const char* filename)"
+     extern "const char* mitie_binary_relation_detector_name_string(const mitie_binary_relation_detector* detector)"
+     extern "int mitie_entities_overlap(unsigned long arg1_start, unsigned long arg1_length, unsigned long arg2_start, unsigned long arg2_length)"
+     extern "mitie_binary_relation* mitie_extract_binary_relation(const mitie_named_entity_extractor* ner, char** tokens, unsigned long arg1_start, unsigned long arg1_length, unsigned long arg2_start, unsigned long arg2_length)"
+     extern "int mitie_classify_binary_relation(const mitie_binary_relation_detector* detector, const mitie_binary_relation* relation, double* score)"
    end
  end
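
`mitie_classify_binary_relation` returns its result through a `double* score` out-parameter, and the detector class reads it back with `unpack1("d")` on the raw bytes. A minimal sketch of that byte-level round trip follows, using `Array#pack` to simulate what the C side would write. No native library is involved, and the score is just the sample value from the README.

```ruby
# Simulate the bytes a C function would write through a double* argument.
# "d" is a native-endian double-precision float.
raw = [0.17649169745814464].pack("d")

# Decode the buffer the same way the detector does.
score = raw.unpack1("d")

puts "buffer size: #{raw.bytesize} bytes"
puts "score: #{score}"
```

The same idea applies to the token offsets in `Document`, which are decoded with `"L!"` (native unsigned long) instead of `"d"`.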
@@ -1,6 +1,10 @@
  module Mitie
    class NER
+     attr_reader :pointer
+
      def initialize(path)
+       # better error message
+       raise ArgumentError, "File does not exist" unless File.exist?(path)
        @pointer = FFI.mitie_load_named_entity_extractor(path)
        ObjectSpace.define_finalizer(self, self.class.finalize(pointer))
      end
@@ -11,76 +15,20 @@ module Mitie
        end
      end
 
-     def tokens(text)
-       tokens = []
-       ptr = FFI.mitie_tokenize(text)
-       i = 0
-       loop do
-         token = (ptr + i * Fiddle::SIZEOF_VOIDP).ptr
-         break if token.null?
-         tokens << token.to_s.force_encoding(text.encoding)
-         i += 1
-       end
-       tokens
-     ensure
-       FFI.mitie_free(ptr) if ptr
-     end
-
-     def tokens_with_offset(text)
-       tokens, ptr = tokens_with_offset_with_ptr(text)
-       tokens
-     ensure
-       FFI.mitie_free(ptr) if ptr
+     def doc(text)
+       Document.new(self, text)
      end
 
      def entities(text)
-       entities = []
-       tokens, tokens_ptr = tokens_with_offset_with_ptr(text)
-       detections = FFI.mitie_extract_entities(pointer, tokens_ptr)
-       num_detections = FFI.mitie_ner_get_num_detections(detections)
-       num_detections.times do |i|
-         pos = FFI.mitie_ner_get_detection_position(detections, i)
-         len = FFI.mitie_ner_get_detection_length(detections, i)
-         tag = FFI.mitie_ner_get_detection_tagstr(detections, i).to_s
-         score = FFI.mitie_ner_get_detection_score(detections, i)
-         tok = tokens[pos, len]
-         offset = tok[0][1]
-         finish = tok[-1][1] + tok[-1][0].size
-         entities << {
-           text: text[offset...finish],
-           tag: tag,
-           score: score,
-           offset: offset
-         }
-       end
-       entities
-     ensure
-       FFI.mitie_free(tokens_ptr) if tokens_ptr
-       FFI.mitie_free(detections) if detections
+       doc(text).entities
      end
 
-     private
-
-     def pointer
-       @pointer
+     def tokens(text)
+       doc(text).tokens
      end
 
-     def tokens_with_offset_with_ptr(text)
-       token_offsets = Fiddle::Pointer.malloc(Fiddle::SIZEOF_VOIDP)
-       ptr = FFI.mitie_tokenize_with_offsets(text, token_offsets)
-       i = 0
-       tokens = []
-       loop do
-         token = (ptr + i * Fiddle::SIZEOF_VOIDP).ptr
-         break if token.null?
-         offset = (token_offsets.ptr + i * Fiddle::SIZEOF_LONG).to_s(Fiddle::SIZEOF_LONG).unpack1("L!")
-         tokens << [token.to_s.force_encoding(text.encoding), offset]
-         i += 1
-       end
-       [tokens, ptr]
-     ensure
-       # use ptr, not token_offsets.ptr
-       FFI.mitie_free(token_offsets.ptr) if ptr
+     def tokens_with_offset(text)
+       doc(text).tokens_with_offset
      end
 
      def self.finalize(pointer)
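
After this change the `NER` methods are thin wrappers: each builds a `Document` and delegates, while `Document` memoizes its results with `||=`. The sketch below illustrates that shape with hypothetical `SketchModel` and `SketchDocument` classes. Whitespace splitting stands in for the native tokenizer, and the call counter exists only to make the caching visible.

```ruby
# Hypothetical stand-in for Mitie::Document: caches tokens per document.
class SketchDocument
  attr_reader :model, :text

  def initialize(model, text)
    @model = model
    @text = text
  end

  def tokens
    # Computed on first access, reused on later calls.
    @tokens ||= model.tokenize(text)
  end
end

# Hypothetical stand-in for Mitie::NER: convenience methods delegate to a
# freshly built document, so nothing is cached across calls on the model.
class SketchModel
  attr_reader :tokenize_calls

  def initialize
    @tokenize_calls = 0
  end

  # Placeholder for the native tokenizer; splits on whitespace.
  def tokenize(text)
    @tokenize_calls += 1
    text.split
  end

  def doc(text)
    SketchDocument.new(self, text)
  end

  def tokens(text)
    doc(text).tokens
  end
end

model = SketchModel.new
doc = model.doc("Nat works at GitHub")
doc.tokens
doc.tokens
puts "tokenize calls after two doc.tokens: #{model.tokenize_calls}"
model.tokens("Nat works at GitHub")
puts "tokenize calls after model.tokens: #{model.tokenize_calls}"
```

Holding on to a document and reusing it avoids repeating the tokenization, whereas calling the model-level helpers repeatedly with the same text does the work again each time.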
@@ -1,3 +1,3 @@
  module Mitie
-   VERSION = "0.1.0"
+   VERSION = "0.1.5"
  end
@@ -0,0 +1,23 @@
+ Boost Software License - Version 1.0 - August 17th, 2003
+
+ Permission is hereby granted, free of charge, to any person or organization
+ obtaining a copy of the software and accompanying documentation covered by
+ this license (the "Software") to use, reproduce, display, distribute,
+ execute, and transmit the Software, and to prepare derivative works of the
+ Software, and to permit third-parties to whom the Software is furnished to
+ do so, all subject to the following:
+
+ The copyright notices in the Software and this entire statement, including
+ the above license grant, this restriction and the following disclaimer,
+ must be included in all copies of the Software, in whole or in part, and
+ all derivative works of the Software, unless such copies or derivative
+ works are solely in the form of machine-executable object code generated by
+ a source language processor.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT
+ SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE
+ FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE,
+ ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ DEALINGS IN THE SOFTWARE.
Binary file
Binary file
Binary file
metadata CHANGED
@@ -1,58 +1,16 @@
  --- !ruby/object:Gem::Specification
  name: mitie
  version: !ruby/object:Gem::Version
-   version: 0.1.0
+   version: 0.1.5
  platform: ruby
  authors:
  - Andrew Kane
- autorequire:
+ autorequire:
  bindir: bin
  cert_chain: []
- date: 2020-09-14 00:00:00.000000000 Z
- dependencies:
- - !ruby/object:Gem::Dependency
-   name: bundler
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
- - !ruby/object:Gem::Dependency
-   name: rake
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '0'
- - !ruby/object:Gem::Dependency
-   name: minitest
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '5'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - ">="
-       - !ruby/object:Gem::Version
-         version: '5'
- description:
+ date: 2021-01-30 00:00:00.000000000 Z
+ dependencies: []
+ description:
  email: andrew@chartkick.com
  executables: []
  extensions: []
@@ -62,14 +20,21 @@ files:
  - LICENSE.txt
  - README.md
  - lib/mitie.rb
+ - lib/mitie/binary_relation_detector.rb
+ - lib/mitie/document.rb
  - lib/mitie/ffi.rb
  - lib/mitie/ner.rb
  - lib/mitie/version.rb
+ - vendor/LICENSE.txt
+ - vendor/libmitie.arm64.dylib
+ - vendor/libmitie.dylib
+ - vendor/libmitie.so
+ - vendor/mitie.dll
  homepage: https://github.com/ankane/mitie
  licenses:
  - BSL-1.0
  metadata: {}
- post_install_message:
+ post_install_message:
  rdoc_options: []
  require_paths:
  - lib
@@ -84,8 +49,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
      - !ruby/object:Gem::Version
        version: '0'
  requirements: []
- rubygems_version: 3.1.2
- signing_key:
+ rubygems_version: 3.2.3
+ signing_key:
  specification_version: 4
  summary: Named-entity recognition for Ruby
  test_files: []