blingfire 0.1.0 → 0.1.5

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: a93f90584141a618ff5bb73860068e2ea662c568a07e2f877a5f2be7e265919e
-   data.tar.gz: 61b49a4bd7eb304f44c041730bce733102f1936c9340cd9c0c3f1390bd9e853a
+   metadata.gz: 2585ca0684a0d6af6beaae99ab7b5250eb928c5200cbfb6274dcd2cbc914dccc
+   data.tar.gz: fe4681abb0c63e7d8fd0e8a777abaed95e578bbb0cfd7238bacbe9beecd3d07c
  SHA512:
-   metadata.gz: fd0242fc13696c620ffd3a7383b9088aaddade1f7280c4f35c54d5165f66e26b783e88fa2d65ced002e8f233af0637e43e9adb68ad7556a16cfdcbb6aff9ba7e
-   data.tar.gz: a1206eaa056ed93639c00c03eb16be5fbbec6b077e1677a916b69c6244c26d3c2cce51299dc0394335c54d12de500a9648f30fa6f3785ef749be528904d5ae60
+   metadata.gz: 9a02c3f87eea7ea989f73032388f4abe69b30055f1ea7a96aa302385a2f6038eb833cffe00b15e5b69124d7a11bf8dd031067a4a16b415999955b8ce393f25d4
+   data.tar.gz: e696df2343cab6ee82af95bedcfc45457b69287113f819514851ef5f943a2f0f8c165b21abff0f944dcd96d4cab45b8faa1190940bf36c294c9c98d55bd9d980
data/CHANGELOG.md CHANGED
@@ -1,3 +1,28 @@
+ ## 0.1.5 (2021-03-14)
+
+ - Updated Bling Fire to 0.1.5
+ - Added ARM shared library for Linux
+
+ ## 0.1.4 (2020-12-28)
+
+ - Added ARM shared library for Mac
+
+ ## 0.1.3 (2020-10-01)
+
+ - Added `text_to_words_with_offsets` method
+ - Added `text_to_sentences_with_offsets` method
+ - Added `text_to_ids_with_offsets` method
+ - Added `normalize_spaces` method
+
+ ## 0.1.2 (2020-06-25)
+
+ - Updated Bling Fire to 0.1.3
+
+ ## 0.1.1 (2020-05-01)
+
+ - Updated Bling Fire to 0.1.1
+ - Improved error message when model not found
+
  ## 0.1.0 (2020-02-24)

  - First release
data/LICENSE.txt CHANGED
@@ -1,22 +1,22 @@
- Copyright (c) 2020 Andrew Kane
-
  MIT License

- Permission is hereby granted, free of charge, to any person obtaining
- a copy of this software and associated documentation files (the
- "Software"), to deal in the Software without restriction, including
- without limitation the rights to use, copy, modify, merge, publish,
- distribute, sublicense, and/or sell copies of the Software, and to
- permit persons to whom the Software is furnished to do so, subject to
- the following conditions:
+ Copyright (c) Microsoft Corporation. All rights reserved.
+ Copyright (c) 2020 Andrew Kane
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:

- The above copyright notice and this permission notice shall be
- included in all copies or substantial portions of the Software.
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.

- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
- NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
- LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
- OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
- WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE
data/README.md CHANGED
@@ -1,6 +1,8 @@
- # BlingFire
+ # Bling Fire

- [BlingFire](https://github.com/microsoft/BlingFire) - high speed text tokenization - for Ruby
+ [Bling Fire](https://github.com/microsoft/BlingFire) - high speed text tokenization - for Ruby
+
+ [![Build Status](https://github.com/ankane/blingfire/workflows/build/badge.svg?branch=master)](https://github.com/ankane/blingfire/actions)

  ## Installation

@@ -30,14 +32,30 @@ Tokenize sentences
  model.text_to_sentences(text)
  ```

+ Get offsets for words
+
+ ```ruby
+ words, start_offsets, end_offsets = model.text_to_words_with_offsets(text)
+ ```
+
+ Get offsets for sentences
+
+ ```ruby
+ sentences, start_offsets, end_offsets = model.text_to_sentences_with_offsets(text)
+ ```
+
  ## Pre-trained Models

- BlingFire comes with a default model that follows the tokenization logic of NLTK with a few changes. You can also download other models:
+ Bling Fire comes with a default model that follows the tokenization logic of NLTK with a few changes. You can also download other models:

- - [BERT Base](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/bert_base_tok.bin)
- - [BERT Base Cased](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/bert_base_cased_tok.bin)
- - [BERT Chinese](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/bert_chinese.bin)
- - [BERT Multilingual Cased](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/bert_multi_cased.bin)
+ - [BERT Base](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/bert_base_tok.bin), [BERT Base Cased](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/bert_base_cased_tok.bin), [BERT Chinese](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/bert_chinese.bin), [BERT Multilingual Cased](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/bert_multi_cased.bin)
+ - [GPT-2](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/gpt2.bin)
+ - [Laser 100k](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/laser100k.bin), [Laser 250k](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/laser250k.bin), [Laser 500k](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/laser500k.bin)
+ - [RoBERTa](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/roberta.bin)
+ - [Syllab](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/syllab.bin)
+ - [URI 100k](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/uri100k.bin), [URI 250k](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/uri250k.bin), [URI 500k](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/uri500k.bin)
+ - [XLM-RoBERTa](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/xlm_roberta_base.bin)
+ - [XLNet](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/xlnet.bin), [XLNet No Norm](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/xlnet_nonorm.bin)
  - [WBD](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/wbd_chuni.bin)

  Load a model
@@ -52,6 +70,12 @@ Convert text to ids
  model.text_to_ids(text)
  ```

+ Get offsets for ids
+
+ ```ruby
+ ids, start_offsets, end_offsets = model.text_to_ids_with_offsets(text)
+ ```
+
  ## History

  View the [changelog](https://github.com/ankane/blingfire/blob/master/CHANGELOG.md)
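
Note on the `*_with_offsets` methods added in the README hunks above: each call returns three parallel arrays. The following is a minimal sketch of how the offsets might be used to slice the original string, assuming they are character positions with an exclusive end (which is what `unpack_offsets` in `lib/blingfire.rb` appears to produce). The sample arrays are hand-written for illustration, not captured Bling Fire output, and `spans_for` is a hypothetical helper that is not part of the gem.

```ruby
# Hypothetical helper: recover each token's span from the original text,
# assuming character offsets where the end offset is exclusive.
def spans_for(text, start_offsets, end_offsets)
  start_offsets.zip(end_offsets).map { |s, e| text[s...e] }
end

# Hand-written sample data standing in for the return value of
# model.text_to_sentences_with_offsets(text).
text = "Hello world! How are you?"
sentences = ["Hello world!", "How are you?"]
start_offsets = [0, 13]
end_offsets = [12, 25]

spans_for(text, start_offsets, end_offsets).each_with_index do |span, i|
  puts "#{start_offsets[i]}...#{end_offsets[i]}: #{span.inspect}"
end
```

If the assumption about exclusive end offsets holds, each span should match the corresponding entry in `sentences`, which makes the offsets suitable for highlighting or annotating the untouched source text.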
@@ -71,6 +95,6 @@ To get started with development:
  git clone https://github.com/ankane/blingfire.git
  cd blingfire
  bundle install
- bundle exec rake vendor:all
+ bundle exec rake vendor:all download:models
  bundle exec rake test
  ```
data/lib/blingfire.rb CHANGED
@@ -15,9 +15,17 @@ module BlingFire
      if Gem.win_platform?
        "blingfiretokdll.dll"
      elsif RbConfig::CONFIG["host_os"] =~ /darwin/i
-       "libblingfiretokdll.dylib"
+       if RbConfig::CONFIG["host_cpu"] =~ /arm/i
+         "libblingfiretokdll.arm64.dylib"
+       else
+         "libblingfiretokdll.dylib"
+       end
      else
-       "libblingfiretokdll.so"
+       if RbConfig::CONFIG["host_cpu"] =~ /aarch64/i
+         "libblingfiretokdll.arm64.so"
+       else
+         "libblingfiretokdll.so"
+       end
      end
    vendor_lib = File.expand_path("../vendor/#{lib_name}", __dir__)
    self.ffi_lib = [vendor_lib]
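
Note on the hunk above: the library name now depends on both `host_os` and `host_cpu`. The sketch below restates that branching as a standalone function so it can be exercised without touching `RbConfig`. `vendor_lib_name` and its keyword arguments are hypothetical names introduced here for illustration; the gem itself computes the name inline.

```ruby
# Hypothetical helper (not part of the gem): mirrors the branching in the
# hunk above, but takes the platform strings as arguments.
def vendor_lib_name(host_os:, host_cpu:, windows: false)
  if windows
    "blingfiretokdll.dll"
  elsif host_os =~ /darwin/i
    # macOS: the diff matches any CPU string containing "arm"
    host_cpu =~ /arm/i ? "libblingfiretokdll.arm64.dylib" : "libblingfiretokdll.dylib"
  else
    # everything else: the diff matches only CPU strings containing "aarch64"
    host_cpu =~ /aarch64/i ? "libblingfiretokdll.arm64.so" : "libblingfiretokdll.so"
  end
end

puts vendor_lib_name(host_os: "darwin20", host_cpu: "arm64")
puts vendor_lib_name(host_os: "linux-gnu", host_cpu: "aarch64")
puts vendor_lib_name(host_os: "linux-gnu", host_cpu: "x86_64")
```

One consequence of the two different patterns: on a non-macOS host whose CPU string contains `arm` but not `aarch64` (for example `armv7l`), this logic falls through to the plain `.so` name.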
@@ -35,57 +43,144 @@ module BlingFire
      end

      def text_to_words(text)
-       text = encode_utf8(text.dup) unless text.encoding == Encoding::UTF_8
-       out = Fiddle::Pointer.malloc(text.bytesize * 3)
-       out_size = FFI.TextToWords(text, text.bytesize, out, out.size)
-       check_status out_size
-       encode_utf8(out[0, out_size - 1]).split(" ")
+       text_to(text, " ") do |t, out|
+         FFI.TextToWords(t, t.bytesize, out, out.size)
+       end
      end

      def text_to_words_with_model(model, text)
-       text = encode_utf8(text.dup) unless text.encoding == Encoding::UTF_8
-       out = Fiddle::Pointer.malloc(text.bytesize * 3)
-       out_size = FFI.TextToWordsWithModel(text, text.bytesize, out, out.size, model)
-       check_status out_size
-       encode_utf8(out[0, out_size - 1]).split(" ")
+       text_to(text, " ") do |t, out|
+         FFI.TextToWordsWithModel(t, t.bytesize, out, out.size, model)
+       end
+     end
+
+     def text_to_words_with_offsets(text)
+       text_to_with_offsets(text, " ") do |t, out, start_offsets, end_offsets|
+         FFI.TextToWordsWithOffsets(t, t.bytesize, out, start_offsets, end_offsets, out.size)
+       end
+     end
+
+     def text_to_words_with_offsets_with_model(model, text)
+       text_to_with_offsets(text, " ") do |t, out, start_offsets, end_offsets|
+         FFI.TextToWordsWithOffsetsWithModel(t, t.bytesize, out, start_offsets, end_offsets, out.size, model)
+       end
      end

      def text_to_sentences(text)
-       text = encode_utf8(text.dup) unless text.encoding == Encoding::UTF_8
-       out = Fiddle::Pointer.malloc(text.bytesize * 3)
-       out_size = FFI.TextToSentences(text, text.bytesize, out, out.size)
-       check_status out_size
-       encode_utf8(out[0, out_size - 1]).split("\n")
+       text_to(text, "\n") do |t, out|
+         FFI.TextToSentences(t, t.bytesize, out, out.size)
+       end
      end

      def text_to_sentences_with_model(model, text)
-       text = encode_utf8(text.dup) unless text.encoding == Encoding::UTF_8
-       out = Fiddle::Pointer.malloc(text.bytesize * 3)
-       out_size = FFI.TextToSentencesWithModel(text, text.bytesize, out, out.size, model)
-       check_status out_size
-       encode_utf8(out[0, out_size - 1]).split("\n")
+       text_to(text, "\n") do |t, out|
+         FFI.TextToSentencesWithModel(t, t.bytesize, out, out.size, model)
+       end
+     end
+
+     def text_to_sentences_with_offsets(text)
+       text_to_with_offsets(text, "\n") do |t, out, start_offsets, end_offsets|
+         FFI.TextToSentencesWithOffsets(t, t.bytesize, out, start_offsets, end_offsets, out.size)
+       end
+     end
+
+     def text_to_sentences_with_offsets_with_model(model, text)
+       text_to_with_offsets(text, "\n") do |t, out, start_offsets, end_offsets|
+         FFI.TextToSentencesWithOffsetsWithModel(t, t.bytesize, out, start_offsets, end_offsets, out.size, model)
+       end
      end

      def text_to_ids(model, text, max_len = nil, unk_id = 0)
        text = encode_utf8(text.dup) unless text.encoding == Encoding::UTF_8
        ids = Fiddle::Pointer.malloc((max_len || text.size) * Fiddle::SIZEOF_INT)
        out_size = FFI.TextToIds(model, text, text.bytesize, ids, ids.size, unk_id)
-       check_status out_size
+       check_status out_size, ids
        ids[0, (max_len || out_size) * Fiddle::SIZEOF_INT].unpack("i!*")
      end

+     def text_to_ids_with_offsets(model, text, max_len = nil, unk_id = 0)
+       text = encode_utf8(text.dup) unless text.encoding == Encoding::UTF_8
+       ids = Fiddle::Pointer.malloc((max_len || text.size) * Fiddle::SIZEOF_INT)
+
+       start_offsets = Fiddle::Pointer.malloc(Fiddle::SIZEOF_INT * ids.size)
+       end_offsets = Fiddle::Pointer.malloc(Fiddle::SIZEOF_INT * ids.size)
+
+       out_size = FFI.TextToIdsWithOffsets(model, text, text.bytesize, ids, start_offsets, end_offsets, ids.size, unk_id)
+
+       check_status out_size, ids
+
+       result = ids[0, (max_len || out_size) * Fiddle::SIZEOF_INT].unpack("i!*")
+       [result].concat(unpack_offsets(start_offsets, end_offsets, result, text))
+     end
+
      def free_model(model)
        FFI.FreeModel(model)
      end

+     def normalize_spaces(text)
+       u_space = 0x20
+       text = encode_utf8(text.dup) unless text.encoding == Encoding::UTF_8
+       out = Fiddle::Pointer.malloc([text.bytesize * 1.5, 20].max)
+       out_size = FFI.NormalizeSpaces(text, text.bytesize, out, out.size, u_space)
+       check_status out_size, out
+       encode_utf8(out.to_str(out_size))
+     end
+
      private

-     def check_status(ret)
-       raise Error, "Bad status" if ret == -1
+     def check_status(ret, ptr)
+       raise Error, "Not enough memory allocated" if ret == -1 || ret > ptr.size
+     end
+
+     def text_to(text, sep)
+       text = encode_utf8(text.dup) unless text.encoding == Encoding::UTF_8
+       # TODO allocate less, and try again if needed
+       out = Fiddle::Pointer.malloc([text.bytesize * 3, 20].max)
+       out_size = yield(text, out)
+       check_status out_size, out
+       encode_utf8(out.to_str(out_size - 1)).split(sep)
+     end
+
+     def text_to_with_offsets(text, sep)
+       text = encode_utf8(text.dup) unless text.encoding == Encoding::UTF_8
+       # TODO allocate less, and try again if needed
+       out = Fiddle::Pointer.malloc([text.bytesize * 3, 20].max)
+
+       start_offsets = Fiddle::Pointer.malloc(Fiddle::SIZEOF_INT * out.size)
+       end_offsets = Fiddle::Pointer.malloc(Fiddle::SIZEOF_INT * out.size)
+
+       out_size = yield(text, out, start_offsets, end_offsets)
+
+       check_status out_size, out
+
+       result = encode_utf8(out.to_str(out_size - 1)).split(sep)
+       [result].concat(unpack_offsets(start_offsets, end_offsets, result, text))
      end

      def encode_utf8(text)
        text.force_encoding(Encoding::UTF_8)
      end
+
+     def unpack_offsets(start_offsets, end_offsets, result, text)
+       start_bytes = start_offsets.to_s(Fiddle::SIZEOF_INT * result.size).unpack("i*")
+       end_bytes = end_offsets.to_s(Fiddle::SIZEOF_INT * result.size).unpack("i*")
+       starts = []
+       ends = []
+
+       # convert byte offsets to character offsets
+       # TODO see if more efficient to store next_pos in variable
+       pos = 0
+       text.each_char.with_index do |c, i|
+         while pos == start_bytes[starts.size]
+           starts << i
+         end
+         pos += c.bytesize
+         while pos - 1 == end_bytes[ends.size]
+           ends << i + 1
+         end
+       end
+
+       [starts, ends]
+     end
    end
  end
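
Note on `unpack_offsets` above: the native library reports offsets in bytes, while Ruby string indexing works in characters, so the method walks the text once and translates both arrays. The sketch below reproduces that loop as a standalone function with plain integer arrays in place of Fiddle pointers. It assumes, as the `pos - 1` comparison in the diff implies, that each end offset points at the last byte of a token (inclusive) and that the returned end is an exclusive character index. `byte_to_char_offsets` is a hypothetical name, and the byte offsets in the example were worked out by hand rather than produced by Bling Fire.

```ruby
# Hypothetical standalone version of the conversion loop in unpack_offsets.
# start_bytes: byte index of the first byte of each token
# end_bytes:   byte index of the last byte of each token (inclusive, assumed)
def byte_to_char_offsets(text, start_bytes, end_bytes)
  starts = []
  ends = []
  pos = 0 # byte position of the character currently being visited

  text.each_char.with_index do |c, i|
    # a token starts at this character if its byte offset matches
    starts << i while pos == start_bytes[starts.size]
    pos += c.bytesize
    # a token ends at this character if its last byte was just consumed
    ends << i + 1 while pos - 1 == end_bytes[ends.size]
  end

  [starts, ends]
end

# "h\u00e9llo w\u00f6rld" has two 2-byte characters, so byte and character
# offsets diverge after the first accented letter.
text = "h\u00e9llo w\u00f6rld"
starts, ends = byte_to_char_offsets(text, [0, 7], [5, 12])
starts.zip(ends).each { |s, e| puts text[s...e] }
```

Because both offset arrays are consumed in order during a single pass, the conversion is linear in the length of the text, provided the offsets arrive sorted.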
data/lib/blingfire/ffi.rb CHANGED
@@ -10,13 +10,35 @@ module BlingFire
        raise e
      end

+     # https://github.com/microsoft/BlingFire/blob/master/blingfiretools/blingfiretokdll/blingfiretokdll.cpp
+
+     # version
      extern "int GetBlingFireTokVersion()"
-     extern "void* LoadModel(char * pszLdbFileName)"
-     extern "int FreeModel(void* ModelPtr)"
+
+     # text to sentences
+     extern "int TextToSentencesWithOffsetsWithModel(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int * pStartOffsets, int * pEndOffsets, int MaxOutUtf8StrByteCount, void * hModel)"
+     extern "int TextToSentencesWithOffsets(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int * pStartOffsets, int * pEndOffsets, int MaxOutUtf8StrByteCount)"
+     extern "int TextToSentencesWithModel(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int MaxOutUtf8StrByteCount, void * hModel)"
+     extern "int TextToSentences(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int MaxOutUtf8StrByteCount)"
+
+     # text to words
+     extern "int TextToWordsWithOffsetsWithModel(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int * pStartOffsets, int * pEndOffsets, int MaxOutUtf8StrByteCount, void * hModel)"
+     extern "int TextToWordsWithOffsets(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int * pStartOffsets, int * pEndOffsets, int MaxOutUtf8StrByteCount)"
      extern "int TextToWordsWithModel(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int MaxOutUtf8StrByteCount, void * hModel)"
      extern "int TextToWords(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int MaxOutUtf8StrByteCount)"
+
+     # misc
+     extern "int NormalizeSpaces(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int MaxOutUtf8StrByteCount, int uSpace)"
+     extern "int TextToHashes(char * pInUtf8Str, int InUtf8StrByteCount, int32_t * pHashArr, int MaxHashArrLength, int wordNgrams, int bucketSize)"
+
+     # model
+     extern "void* LoadModel(char * pszLdbFileName)"
+
+     # text to ids
+     extern "int TextToIdsWithOffsets(void* ModelPtr, char * pInUtf8Str, int InUtf8StrByteCount, int32_t * pIdsArr, int * pStartOffsets, int * pEndOffsets, int MaxIdsArrLength, int UnkId)"
      extern "int TextToIds(void* ModelPtr, char * pInUtf8Str, int InUtf8StrByteCount, int32_t * pIdsArr, int MaxIdsArrLength, int UnkId)"
-     extern "int TextToSentences(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int MaxOutUtf8StrByteCount)"
-     extern "int TextToSentencesWithModel(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int MaxOutUtf8StrByteCount, void * hModel)"
+
+     # free model
+     extern "int FreeModel(void* ModelPtr)"
    end
  end
data/lib/blingfire/model.rb CHANGED
@@ -1,7 +1,9 @@
  module BlingFire
    class Model
      def initialize(path = nil)
+       @handle = nil
        if path
+         raise Error, "Model not found" unless File.exist?(path)
          @handle = FFI.LoadModel(path)
          ObjectSpace.define_finalizer(self, self.class.finalize(@handle))
        end
@@ -15,6 +17,14 @@ module BlingFire
        end
      end

+     def text_to_words_with_offsets(text)
+       if @handle
+         BlingFire.text_to_words_with_offsets_with_model(@handle, text)
+       else
+         BlingFire.text_to_words_with_offsets(text)
+       end
+     end
+
      def text_to_sentences(text)
        if @handle
          BlingFire.text_to_sentences_with_model(@handle, text)
@@ -23,6 +33,14 @@ module BlingFire
        end
      end

+     def text_to_sentences_with_offsets(text)
+       if @handle
+         BlingFire.text_to_sentences_with_offsets_with_model(@handle, text)
+       else
+         BlingFire.text_to_sentences_with_offsets(text)
+       end
+     end
+
      def text_to_ids(text, max_len = nil, unk_id = 0)
        if @handle
          BlingFire.text_to_ids(@handle, text, max_len, unk_id)
@@ -31,6 +49,14 @@ module BlingFire
        end
      end

+     def text_to_ids_with_offsets(text, max_len = nil, unk_id = 0)
+       if @handle
+         BlingFire.text_to_ids_with_offsets(@handle, text, max_len, unk_id)
+       else
+         raise "Not implemented"
+       end
+     end
+
      def to_ptr
        @handle
      end
data/lib/blingfire/version.rb CHANGED
@@ -1,3 +1,3 @@
  module BlingFire
-   VERSION = "0.1.0"
+   VERSION = "0.1.5"
  end
Binary file
Binary file
Binary file
Binary file
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: blingfire
  version: !ruby/object:Gem::Version
-   version: 0.1.0
+   version: 0.1.5
  platform: ruby
  authors:
  - Andrew Kane
- autorequire:
+ autorequire:
  bindir: bin
  cert_chain: []
- date: 2020-02-24 00:00:00.000000000 Z
+ date: 2021-03-15 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: bundler
@@ -52,7 +52,7 @@ dependencies:
      - - ">="
        - !ruby/object:Gem::Version
          version: '5'
- description:
+ description:
  email: andrew@chartkick.com
  executables: []
  extensions: []
@@ -67,13 +67,15 @@ files:
  - lib/blingfire/version.rb
  - vendor/LICENSE
  - vendor/blingfiretokdll.dll
+ - vendor/libblingfiretokdll.arm64.dylib
+ - vendor/libblingfiretokdll.arm64.so
  - vendor/libblingfiretokdll.dylib
  - vendor/libblingfiretokdll.so
  homepage: https://github.com/ankane/blingfire
  licenses:
  - MIT
  metadata: {}
- post_install_message:
+ post_install_message:
  rdoc_options: []
  require_paths:
  - lib
@@ -88,8 +90,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
    - !ruby/object:Gem::Version
      version: '0'
  requirements: []
- rubygems_version: 3.1.2
- signing_key:
+ rubygems_version: 3.2.3
+ signing_key:
  specification_version: 4
  summary: High speed text tokenization for Ruby
  test_files: []