blingfire 0.1.1 → 0.1.6

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 225c7163bbbd6bc06e0b7d54b2a0a64c3b622c174e5c6cb0b8fa54b6e68740a3
-   data.tar.gz: b9d8ce2d21a450c53b1628bdab97b07aa2c39dd1b758fa763c39f052a944084c
+   metadata.gz: fa9f7ebf09e5745b6865d4c5645b36f42b7c484ed667ce493febee80887e89ad
+   data.tar.gz: 7fb6e8716138c35396964081b8ad60d7c01db88e71e4fbe219d0ea27a6eff94b
  SHA512:
-   metadata.gz: 98eed6c71ee974381772e032b27d022cd9c9485a83e8dae25c21817a19489953c7c5dfd4bb4498d796b65a569bd3468f55df558603d904ed9293a4c21e7041d5
-   data.tar.gz: 15389737eec2ad109f528bf7cd064184205bc17d822d108031f29eb7cf9d421c0a84f40a3d170461e7fa3c853b1fbc912b28f07e403924f9744cb8f6c2294f2d
+   metadata.gz: 4f1e20f3520f4be17af2df5db4b3e05b169756f1667dccb08dbafb89ee07f633862d16649fa8e3be505073ca58a11989cc5c9b99d3f571aae2eaff3d4053dd8c
+   data.tar.gz: 8f70ecb06ddcad466508a0debe63fd99e9a5cbab35e3702296c8ff522ffbb5075ad52b2ce68d2f316de6685256b756ff098bc5a2b29e807652b89de84f9b4661
data/CHANGELOG.md CHANGED
@@ -1,3 +1,28 @@
+ ## 0.1.6 (2021-06-07)
+
+ - Updated Bling Fire to 0.1.7
+ - Added `prefix` option
+
+ ## 0.1.5 (2021-03-14)
+
+ - Updated Bling Fire to 0.1.5
+ - Added ARM shared library for Linux
+
+ ## 0.1.4 (2020-12-28)
+
+ - Added ARM shared library for Mac
+
+ ## 0.1.3 (2020-10-01)
+
+ - Added `text_to_words_with_offsets` method
+ - Added `text_to_sentences_with_offsets` method
+ - Added `text_to_ids_with_offsets` method
+ - Added `normalize_spaces` method
+
+ ## 0.1.2 (2020-06-25)
+
+ - Updated Bling Fire to 0.1.3
+
  ## 0.1.1 (2020-05-01)

  - Updated Bling Fire to 0.1.1
data/LICENSE.txt CHANGED
@@ -1,22 +1,22 @@
- Copyright (c) 2020 Andrew Kane
-
  MIT License

- Permission is hereby granted, free of charge, to any person obtaining
- a copy of this software and associated documentation files (the
- "Software"), to deal in the Software without restriction, including
- without limitation the rights to use, copy, modify, merge, publish,
- distribute, sublicense, and/or sell copies of the Software, and to
- permit persons to whom the Software is furnished to do so, subject to
- the following conditions:
+ Copyright (c) Microsoft Corporation. All rights reserved.
+ Copyright (c) 2020 Andrew Kane
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:

- The above copyright notice and this permission notice shall be
- included in all copies or substantial portions of the Software.
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.

- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
- NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
- LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
- OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
- WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE
data/README.md CHANGED
@@ -2,7 +2,7 @@

  [Bling Fire](https://github.com/microsoft/BlingFire) - high speed text tokenization - for Ruby

- [![Build Status](https://travis-ci.org/ankane/blingfire.svg?branch=master)](https://travis-ci.org/ankane/blingfire) [![Build status](https://ci.appveyor.com/api/projects/status/3gyca4gsjw2w9ns1/branch/master?svg=true)](https://ci.appveyor.com/project/ankane/blingfire/branch/master)
+ [![Build Status](https://github.com/ankane/blingfire/workflows/build/badge.svg?branch=master)](https://github.com/ankane/blingfire/actions)

  ## Installation

@@ -32,17 +32,31 @@ Tokenize sentences
  model.text_to_sentences(text)
  ```

+ Get offsets for words
+
+ ```ruby
+ words, start_offsets, end_offsets = model.text_to_words_with_offsets(text)
+ ```
+
+ Get offsets for sentences
+
+ ```ruby
+ sentences, start_offsets, end_offsets = model.text_to_sentences_with_offsets(text)
+ ```
+
  ## Pre-trained Models

- BlingFire comes with a default model that follows the tokenization logic of NLTK with a few changes. You can also download other models:
+ Bling Fire comes with a default model that follows the tokenization logic of NLTK with a few changes. You can also download other models:

- - [BERT Base](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/bert_base_tok.bin)
- - [BERT Base Cased](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/bert_base_cased_tok.bin)
- - [BERT Chinese](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/bert_chinese.bin)
- - [BERT Multilingual Cased](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/bert_multi_cased.bin)
+ - [BERT Base](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/bert_base_tok.bin), [BERT Base Cased](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/bert_base_cased_tok.bin), [BERT Chinese](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/bert_chinese.bin), [BERT Multilingual Cased](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/bert_multi_cased.bin)
+ - [GPT-2](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/gpt2.bin)
+ - [Laser 100k](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/laser100k.bin), [Laser 250k](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/laser250k.bin), [Laser 500k](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/laser500k.bin)
+ - [RoBERTa](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/roberta.bin)
+ - [Syllab](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/syllab.bin)
+ - [URI 100k](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/uri100k.bin), [URI 250k](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/uri250k.bin), [URI 500k](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/uri500k.bin)
+ - [XLM-RoBERTa](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/xlm_roberta_base.bin)
+ - [XLNet](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/xlnet.bin), [XLNet No Norm](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/xlnet_nonorm.bin)
  - [WBD](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/wbd_chuni.bin)
- - [XLNet](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/xlnet.bin)
- - [XLNet No Norm](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/xlnet_nonorm.bin)

  Load a model

@@ -56,6 +70,18 @@ Convert text to ids
  model.text_to_ids(text)
  ```

+ Get offsets for ids
+
+ ```ruby
+ ids, start_offsets, end_offsets = model.text_to_ids_with_offsets(text)
+ ```
+
+ Disable prefix space
+
+ ```ruby
+ model = BlingFire.load_model("roberta.bin", prefix: false)
+ ```
+
  ## History

  View the [changelog](https://github.com/ankane/blingfire/blob/master/CHANGELOG.md)
@@ -75,6 +101,6 @@ To get started with development:
  git clone https://github.com/ankane/blingfire.git
  cd blingfire
  bundle install
- bundle exec rake vendor:all
+ bundle exec rake vendor:all download:models
  bundle exec rake test
  ```
data/lib/blingfire.rb CHANGED
@@ -15,9 +15,17 @@ module BlingFire
      if Gem.win_platform?
        "blingfiretokdll.dll"
      elsif RbConfig::CONFIG["host_os"] =~ /darwin/i
-       "libblingfiretokdll.dylib"
+       if RbConfig::CONFIG["host_cpu"] =~ /arm/i
+         "libblingfiretokdll.arm64.dylib"
+       else
+         "libblingfiretokdll.dylib"
+       end
      else
-       "libblingfiretokdll.so"
+       if RbConfig::CONFIG["host_cpu"] =~ /aarch64/i
+         "libblingfiretokdll.arm64.so"
+       else
+         "libblingfiretokdll.so"
+       end
      end
    vendor_lib = File.expand_path("../vendor/#{lib_name}", __dir__)
    self.ffi_lib = [vendor_lib]
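The library selection in this hunk depends only on the OS and CPU strings, so it can be read as a pure function. The following is a minimal sketch of that idea, not code from the gem: `vendored_lib_name` and its keyword arguments are hypothetical names, and the `RbConfig` values are passed in as plain strings so each branch can be exercised on any machine.

```ruby
# Hypothetical helper (not part of the gem): mirrors the branching shown in
# the hunk above, with the platform details passed in as arguments.
def vendored_lib_name(host_os:, host_cpu:, windows: false)
  if windows
    "blingfiretokdll.dll"
  elsif host_os.match?(/darwin/i)
    # macOS: ARM hosts get the arm64 build, everything else the default one
    host_cpu.match?(/arm/i) ? "libblingfiretokdll.arm64.dylib" : "libblingfiretokdll.dylib"
  else
    # Other platforms (Linux assumed): only aarch64 selects the arm64 build
    host_cpu.match?(/aarch64/i) ? "libblingfiretokdll.arm64.so" : "libblingfiretokdll.so"
  end
end

puts vendored_lib_name(host_os: "darwin20", host_cpu: "arm64")
puts vendored_lib_name(host_os: "linux-gnu", host_cpu: "aarch64")
puts vendored_lib_name(host_os: "linux-gnu", host_cpu: "x86_64")
```

Note that, with the patterns as written in the diff, a non-macOS host whose CPU string does not contain `aarch64` falls through to the default `.so`.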
@@ -30,8 +38,8 @@ module BlingFire
        FFI.GetBlingFireTokVersion
      end

-     def load_model(path)
-       Model.new(path)
+     def load_model(path, **options)
+       Model.new(path, **options)
      end

      def text_to_words(text)
@@ -46,6 +54,18 @@ module BlingFire
        end
      end

+     def text_to_words_with_offsets(text)
+       text_to_with_offsets(text, " ") do |t, out, start_offsets, end_offsets|
+         FFI.TextToWordsWithOffsets(t, t.bytesize, out, start_offsets, end_offsets, out.size)
+       end
+     end
+
+     def text_to_words_with_offsets_with_model(model, text)
+       text_to_with_offsets(text, " ") do |t, out, start_offsets, end_offsets|
+         FFI.TextToWordsWithOffsetsWithModel(t, t.bytesize, out, start_offsets, end_offsets, out.size, model)
+       end
+     end
+
      def text_to_sentences(text)
        text_to(text, "\n") do |t, out|
          FFI.TextToSentences(t, t.bytesize, out, out.size)
@@ -58,6 +78,18 @@ module BlingFire
        end
      end

+     def text_to_sentences_with_offsets(text)
+       text_to_with_offsets(text, "\n") do |t, out, start_offsets, end_offsets|
+         FFI.TextToSentencesWithOffsets(t, t.bytesize, out, start_offsets, end_offsets, out.size)
+       end
+     end
+
+     def text_to_sentences_with_offsets_with_model(model, text)
+       text_to_with_offsets(text, "\n") do |t, out, start_offsets, end_offsets|
+         FFI.TextToSentencesWithOffsetsWithModel(t, t.bytesize, out, start_offsets, end_offsets, out.size, model)
+       end
+     end
+
      def text_to_ids(model, text, max_len = nil, unk_id = 0)
        text = encode_utf8(text.dup) unless text.encoding == Encoding::UTF_8
        ids = Fiddle::Pointer.malloc((max_len || text.size) * Fiddle::SIZEOF_INT)
@@ -66,10 +98,40 @@ module BlingFire
        ids[0, (max_len || out_size) * Fiddle::SIZEOF_INT].unpack("i!*")
      end

+     def text_to_ids_with_offsets(model, text, max_len = nil, unk_id = 0)
+       text = encode_utf8(text.dup) unless text.encoding == Encoding::UTF_8
+       ids = Fiddle::Pointer.malloc((max_len || text.size) * Fiddle::SIZEOF_INT)
+
+       start_offsets = Fiddle::Pointer.malloc(Fiddle::SIZEOF_INT * ids.size)
+       end_offsets = Fiddle::Pointer.malloc(Fiddle::SIZEOF_INT * ids.size)
+
+       out_size = FFI.TextToIdsWithOffsets(model, text, text.bytesize, ids, start_offsets, end_offsets, ids.size, unk_id)
+
+       check_status out_size, ids
+
+       result = ids[0, (max_len || out_size) * Fiddle::SIZEOF_INT].unpack("i!*")
+       [result].concat(unpack_offsets(start_offsets, end_offsets, result, text))
+     end
+
      def free_model(model)
        FFI.FreeModel(model)
      end

+     def normalize_spaces(text)
+       u_space = 0x20
+       text = encode_utf8(text.dup) unless text.encoding == Encoding::UTF_8
+       out = Fiddle::Pointer.malloc([text.bytesize * 1.5, 20].max)
+       out_size = FFI.NormalizeSpaces(text, text.bytesize, out, out.size, u_space)
+       check_status out_size, out
+       encode_utf8(out.to_str(out_size))
+     end
+
+     def change_settings_dummy_prefix(model, value)
+       # use opposite of value
+       ret = FFI.SetNoDummyPrefix(model, value ? 0 : 1)
+       raise Error, "Bad status: #{ret}" if ret != 1
+     end
+
      private

      def check_status(ret, ptr)
@@ -79,14 +141,52 @@ module BlingFire
      def text_to(text, sep)
        text = encode_utf8(text.dup) unless text.encoding == Encoding::UTF_8
        # TODO allocate less, and try again if needed
-       out = Fiddle::Pointer.malloc([text.bytesize * 1.5, 20].max)
+       out = Fiddle::Pointer.malloc([text.bytesize * 3, 20].max)
        out_size = yield(text, out)
        check_status out_size, out
        encode_utf8(out.to_str(out_size - 1)).split(sep)
      end

+     def text_to_with_offsets(text, sep)
+       text = encode_utf8(text.dup) unless text.encoding == Encoding::UTF_8
+       # TODO allocate less, and try again if needed
+       out = Fiddle::Pointer.malloc([text.bytesize * 3, 20].max)
+
+       start_offsets = Fiddle::Pointer.malloc(Fiddle::SIZEOF_INT * out.size)
+       end_offsets = Fiddle::Pointer.malloc(Fiddle::SIZEOF_INT * out.size)
+
+       out_size = yield(text, out, start_offsets, end_offsets)
+
+       check_status out_size, out
+
+       result = encode_utf8(out.to_str(out_size - 1)).split(sep)
+       [result].concat(unpack_offsets(start_offsets, end_offsets, result, text))
+     end
+
      def encode_utf8(text)
        text.force_encoding(Encoding::UTF_8)
      end
+
+     def unpack_offsets(start_offsets, end_offsets, result, text)
+       start_bytes = start_offsets.to_s(Fiddle::SIZEOF_INT * result.size).unpack("i*")
+       end_bytes = end_offsets.to_s(Fiddle::SIZEOF_INT * result.size).unpack("i*")
+       starts = []
+       ends = []
+
+       # convert byte offsets to character offsets
+       # TODO see if more efficient to store next_pos in variable
+       pos = 0
+       text.each_char.with_index do |c, i|
+         while pos == start_bytes[starts.size] || start_bytes[starts.size] == -1
+           starts << i
+         end
+         pos += c.bytesize
+         while pos - 1 == end_bytes[ends.size]
+           ends << i + 1
+         end
+       end
+
+       [starts, ends]
+     end
    end
  end
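The conversion in `unpack_offsets` matters because the native library reports offsets in UTF-8 bytes, while Ruby string indexing works on characters. The following is a minimal standalone sketch of that conversion, not the gem's implementation. It assumes inclusive end byte offsets, as the `pos - 1 == end_bytes[...]` check above suggests. `byte_to_char_offsets` is a hypothetical helper that uses a lookup table instead of the single-pass loop and does not handle the `-1` start offsets the original accepts. The byte offsets in the usage lines are hand-computed for illustration and are not output from Bling Fire.

```ruby
# Hypothetical helper (not part of the gem): converts byte offsets into
# character offsets for a UTF-8 string.
# Assumption: end offsets are inclusive (index of a token's last byte).
def byte_to_char_offsets(text, start_bytes, end_bytes)
  # Map every byte position at which a character begins to its character index.
  char_index_at_byte = {}
  pos = 0
  text.each_char.with_index do |c, i|
    char_index_at_byte[pos] = i
    pos += c.bytesize
  end
  # One-past-the-end position, needed for tokens that end the string.
  char_index_at_byte[pos] = text.length

  starts = start_bytes.map { |b| char_index_at_byte.fetch(b) }
  # An inclusive end byte b means the next character starts at byte b + 1,
  # which gives an exclusive character end offset.
  ends = end_bytes.map { |b| char_index_at_byte.fetch(b + 1) }
  [starts, ends]
end

text = "caf\u00e9 au lait" # the accented character occupies two bytes
starts, ends = byte_to_char_offsets(text, [0, 6, 9], [4, 7, 12])
words = starts.zip(ends).map { |s, e| text[s...e] }
p starts, ends, words
```

Returning exclusive character end offsets lets callers slice with `text[start...end]`, which is the same convention the `ends << i + 1` line in the diff produces.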
data/lib/blingfire/ffi.rb CHANGED
@@ -10,13 +10,40 @@ module BlingFire
        raise e
      end

+     typealias "bool", "char"
+
+     # https://github.com/microsoft/BlingFire/blob/master/blingfiretools/blingfiretokdll/blingfiretokdll.cpp
+
+     # version
      extern "int GetBlingFireTokVersion()"
-     extern "void* LoadModel(char * pszLdbFileName)"
-     extern "int FreeModel(void* ModelPtr)"
+
+     # text to sentences
+     extern "int TextToSentencesWithOffsetsWithModel(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int * pStartOffsets, int * pEndOffsets, int MaxOutUtf8StrByteCount, void * hModel)"
+     extern "int TextToSentencesWithOffsets(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int * pStartOffsets, int * pEndOffsets, int MaxOutUtf8StrByteCount)"
+     extern "int TextToSentencesWithModel(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int MaxOutUtf8StrByteCount, void * hModel)"
+     extern "int TextToSentences(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int MaxOutUtf8StrByteCount)"
+
+     # text to words
+     extern "int TextToWordsWithOffsetsWithModel(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int * pStartOffsets, int * pEndOffsets, int MaxOutUtf8StrByteCount, void * hModel)"
+     extern "int TextToWordsWithOffsets(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int * pStartOffsets, int * pEndOffsets, int MaxOutUtf8StrByteCount)"
      extern "int TextToWordsWithModel(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int MaxOutUtf8StrByteCount, void * hModel)"
      extern "int TextToWords(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int MaxOutUtf8StrByteCount)"
+
+     # misc
+     extern "int NormalizeSpaces(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int MaxOutUtf8StrByteCount, int uSpace)"
+     extern "int TextToHashes(char * pInUtf8Str, int InUtf8StrByteCount, int32_t * pHashArr, int MaxHashArrLength, int wordNgrams, int bucketSize)"
+
+     # model
+     extern "void* LoadModel(char * pszLdbFileName)"
+
+     # text to ids
+     extern "int TextToIdsWithOffsets(void* ModelPtr, char * pInUtf8Str, int InUtf8StrByteCount, int32_t * pIdsArr, int * pStartOffsets, int * pEndOffsets, int MaxIdsArrLength, int UnkId)"
      extern "int TextToIds(void* ModelPtr, char * pInUtf8Str, int InUtf8StrByteCount, int32_t * pIdsArr, int MaxIdsArrLength, int UnkId)"
-     extern "int TextToSentences(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int MaxOutUtf8StrByteCount)"
-     extern "int TextToSentencesWithModel(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int MaxOutUtf8StrByteCount, void * hModel)"
+
+     # free model
+     extern "int FreeModel(void* ModelPtr)"
+
+     # prefix
+     extern "int SetNoDummyPrefix(void* ModelPtr, bool fNoDummyPrefix)"
    end
  end
data/lib/blingfire/model.rb CHANGED
@@ -1,10 +1,15 @@
  module BlingFire
    class Model
-     def initialize(path = nil)
+     def initialize(path = nil, prefix: nil)
+       @handle = nil
        if path
          raise Error, "Model not found" unless File.exist?(path)
          @handle = FFI.LoadModel(path)
          ObjectSpace.define_finalizer(self, self.class.finalize(@handle))
+
+         BlingFire.change_settings_dummy_prefix(@handle, prefix) unless prefix.nil?
+       else
+         raise Error, "prefix option requires path" unless prefix.nil?
        end
      end

@@ -16,6 +21,14 @@ module BlingFire
        end
      end

+     def text_to_words_with_offsets(text)
+       if @handle
+         BlingFire.text_to_words_with_offsets_with_model(@handle, text)
+       else
+         BlingFire.text_to_words_with_offsets(text)
+       end
+     end
+
      def text_to_sentences(text)
        if @handle
          BlingFire.text_to_sentences_with_model(@handle, text)
@@ -24,6 +37,14 @@ module BlingFire
        end
      end

+     def text_to_sentences_with_offsets(text)
+       if @handle
+         BlingFire.text_to_sentences_with_offsets_with_model(@handle, text)
+       else
+         BlingFire.text_to_sentences_with_offsets(text)
+       end
+     end
+
      def text_to_ids(text, max_len = nil, unk_id = 0)
        if @handle
          BlingFire.text_to_ids(@handle, text, max_len, unk_id)
@@ -32,6 +53,14 @@ module BlingFire
        end
      end

+     def text_to_ids_with_offsets(text, max_len = nil, unk_id = 0)
+       if @handle
+         BlingFire.text_to_ids_with_offsets(@handle, text, max_len, unk_id)
+       else
+         raise "Not implemented"
+       end
+     end
+
      def to_ptr
        @handle
      end
data/lib/blingfire/version.rb CHANGED
@@ -1,3 +1,3 @@
  module BlingFire
-   VERSION = "0.1.1"
+   VERSION = "0.1.6"
  end
Binary file
Binary file
Binary file
Binary file
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: blingfire
  version: !ruby/object:Gem::Version
-   version: 0.1.1
+   version: 0.1.6
  platform: ruby
  authors:
  - Andrew Kane
- autorequire:
+ autorequire:
  bindir: bin
  cert_chain: []
- date: 2020-05-01 00:00:00.000000000 Z
+ date: 2021-06-07 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: bundler
@@ -52,7 +52,7 @@ dependencies:
      - - ">="
        - !ruby/object:Gem::Version
          version: '5'
- description:
+ description:
  email: andrew@chartkick.com
  executables: []
  extensions: []
@@ -67,13 +67,15 @@ files:
  - lib/blingfire/version.rb
  - vendor/LICENSE
  - vendor/blingfiretokdll.dll
+ - vendor/libblingfiretokdll.arm64.dylib
+ - vendor/libblingfiretokdll.arm64.so
  - vendor/libblingfiretokdll.dylib
  - vendor/libblingfiretokdll.so
  homepage: https://github.com/ankane/blingfire
  licenses:
  - MIT
  metadata: {}
- post_install_message:
+ post_install_message:
  rdoc_options: []
  require_paths:
  - lib
@@ -88,8 +90,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
      - !ruby/object:Gem::Version
        version: '0'
  requirements: []
- rubygems_version: 3.1.2
- signing_key:
+ rubygems_version: 3.2.3
+ signing_key:
  specification_version: 4
  summary: High speed text tokenization for Ruby
  test_files: []