blingfire 0.1.0 → 0.1.5

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: a93f90584141a618ff5bb73860068e2ea662c568a07e2f877a5f2be7e265919e
4
- data.tar.gz: 61b49a4bd7eb304f44c041730bce733102f1936c9340cd9c0c3f1390bd9e853a
3
+ metadata.gz: 2585ca0684a0d6af6beaae99ab7b5250eb928c5200cbfb6274dcd2cbc914dccc
4
+ data.tar.gz: fe4681abb0c63e7d8fd0e8a777abaed95e578bbb0cfd7238bacbe9beecd3d07c
5
5
  SHA512:
6
- metadata.gz: fd0242fc13696c620ffd3a7383b9088aaddade1f7280c4f35c54d5165f66e26b783e88fa2d65ced002e8f233af0637e43e9adb68ad7556a16cfdcbb6aff9ba7e
7
- data.tar.gz: a1206eaa056ed93639c00c03eb16be5fbbec6b077e1677a916b69c6244c26d3c2cce51299dc0394335c54d12de500a9648f30fa6f3785ef749be528904d5ae60
6
+ metadata.gz: 9a02c3f87eea7ea989f73032388f4abe69b30055f1ea7a96aa302385a2f6038eb833cffe00b15e5b69124d7a11bf8dd031067a4a16b415999955b8ce393f25d4
7
+ data.tar.gz: e696df2343cab6ee82af95bedcfc45457b69287113f819514851ef5f943a2f0f8c165b21abff0f944dcd96d4cab45b8faa1190940bf36c294c9c98d55bd9d980
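The new digests above can be checked locally. The following is a minimal sketch, assuming `metadata.gz` and `data.tar.gz` have already been extracted from the `.gem` archive into the working directory; the helper name `digest_matches?` is hypothetical and not part of RubyGems.

```ruby
require "digest"

# Hypothetical helper: true when the file's digest equals the expected
# hex string (for example, a value listed in checksums.yaml).
# algorithm defaults to SHA256; pass Digest::SHA512 for the SHA512 entries.
def digest_matches?(path, expected_hex, algorithm: Digest::SHA256)
  algorithm.file(path).hexdigest == expected_hex
end

# Example call (assumes data.tar.gz exists in the current directory):
#   digest_matches?("data.tar.gz",
#     "fe4681abb0c63e7d8fd0e8a777abaed95e578bbb0cfd7238bacbe9beecd3d07c")
```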
data/CHANGELOG.md CHANGED
@@ -1,3 +1,28 @@
1
+ ## 0.1.5 (2021-03-14)
2
+
3
+ - Updated Bling Fire to 0.1.5
4
+ - Added ARM shared library for Linux
5
+
6
+ ## 0.1.4 (2020-12-28)
7
+
8
+ - Added ARM shared library for Mac
9
+
10
+ ## 0.1.3 (2020-10-01)
11
+
12
+ - Added `text_to_words_with_offsets` method
13
+ - Added `text_to_sentences_with_offsets` method
14
+ - Added `text_to_ids_with_offsets` method
15
+ - Added `normalize_spaces` method
16
+
17
+ ## 0.1.2 (2020-06-25)
18
+
19
+ - Updated Bling Fire to 0.1.3
20
+
21
+ ## 0.1.1 (2020-05-01)
22
+
23
+ - Updated Bling Fire to 0.1.1
24
+ - Improved error message when model not found
25
+
1
26
  ## 0.1.0 (2020-02-24)
2
27
 
3
28
  - First release
data/LICENSE.txt CHANGED
@@ -1,22 +1,22 @@
1
- Copyright (c) 2020 Andrew Kane
2
-
3
1
  MIT License
4
2
 
5
- Permission is hereby granted, free of charge, to any person obtaining
6
- a copy of this software and associated documentation files (the
7
- "Software"), to deal in the Software without restriction, including
8
- without limitation the rights to use, copy, modify, merge, publish,
9
- distribute, sublicense, and/or sell copies of the Software, and to
10
- permit persons to whom the Software is furnished to do so, subject to
11
- the following conditions:
3
+ Copyright (c) Microsoft Corporation. All rights reserved.
4
+ Copyright (c) 2020 Andrew Kane
5
+
6
+ Permission is hereby granted, free of charge, to any person obtaining a copy
7
+ of this software and associated documentation files (the "Software"), to deal
8
+ in the Software without restriction, including without limitation the rights
9
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
10
+ copies of the Software, and to permit persons to whom the Software is
11
+ furnished to do so, subject to the following conditions:
12
12
 
13
- The above copyright notice and this permission notice shall be
14
- included in all copies or substantial portions of the Software.
13
+ The above copyright notice and this permission notice shall be included in all
14
+ copies or substantial portions of the Software.
15
15
 
16
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
- EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
- MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
- NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
- LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
- OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
- WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
17
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
18
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
19
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
20
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
21
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
22
+ SOFTWARE
data/README.md CHANGED
@@ -1,6 +1,8 @@
1
- # BlingFire
1
+ # Bling Fire
2
2
 
3
- [BlingFire](https://github.com/microsoft/BlingFire) - high speed text tokenization - for Ruby
3
+ [Bling Fire](https://github.com/microsoft/BlingFire) - high speed text tokenization - for Ruby
4
+
5
+ [![Build Status](https://github.com/ankane/blingfire/workflows/build/badge.svg?branch=master)](https://github.com/ankane/blingfire/actions)
4
6
 
5
7
  ## Installation
6
8
 
@@ -30,14 +32,30 @@ Tokenize sentences
30
32
  model.text_to_sentences(text)
31
33
  ```
32
34
 
35
+ Get offsets for words
36
+
37
+ ```ruby
38
+ words, start_offsets, end_offsets = model.text_to_words_with_offsets(text)
39
+ ```
40
+
41
+ Get offsets for sentences
42
+
43
+ ```ruby
44
+ sentences, start_offsets, end_offsets = model.text_to_sentences_with_offsets(text)
45
+ ```
46
+
33
47
  ## Pre-trained Models
34
48
 
35
- BlingFire comes with a default model that follows the tokenization logic of NLTK with a few changes. You can also download other models:
49
+ Bling Fire comes with a default model that follows the tokenization logic of NLTK with a few changes. You can also download other models:
36
50
 
37
- - [BERT Base](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/bert_base_tok.bin)
38
- - [BERT Base Cased](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/bert_base_cased_tok.bin)
39
- - [BERT Chinese](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/bert_chinese.bin)
40
- - [BERT Multilingual Cased](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/bert_multi_cased.bin)
51
+ - [BERT Base](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/bert_base_tok.bin), [BERT Base Cased](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/bert_base_cased_tok.bin), [BERT Chinese](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/bert_chinese.bin), [BERT Multilingual Cased](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/bert_multi_cased.bin)
52
+ - [GPT-2](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/gpt2.bin)
53
+ - [Laser 100k](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/laser100k.bin), [Laser 250k](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/laser250k.bin), [Laser 500k](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/laser500k.bin)
54
+ - [RoBERTa](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/roberta.bin)
55
+ - [Syllab](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/syllab.bin)
56
+ - [URI 100k](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/uri100k.bin), [URI 250k](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/uri250k.bin), [URI 500k](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/uri500k.bin)
57
+ - [XLM-RoBERTa](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/xlm_roberta_base.bin)
58
+ - [XLNet](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/xlnet.bin), [XLNet No Norm](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/xlnet_nonorm.bin)
41
59
  - [WBD](https://github.com/microsoft/BlingFire/blob/master/dist-pypi/blingfire/wbd_chuni.bin)
42
60
 
43
61
  Load a model
@@ -52,6 +70,12 @@ Convert text to ids
52
70
  model.text_to_ids(text)
53
71
  ```
54
72
 
73
+ Get offsets for ids
74
+
75
+ ```ruby
76
+ ids, start_offsets, end_offsets = model.text_to_ids_with_offsets(text)
77
+ ```
78
+
55
79
  ## History
56
80
 
57
81
  View the [changelog](https://github.com/ankane/blingfire/blob/master/CHANGELOG.md)
@@ -71,6 +95,6 @@ To get started with development:
71
95
  git clone https://github.com/ankane/blingfire.git
72
96
  cd blingfire
73
97
  bundle install
74
- bundle exec rake vendor:all
98
+ bundle exec rake vendor:all download:models
75
99
  bundle exec rake test
76
100
  ```
data/lib/blingfire.rb CHANGED
@@ -15,9 +15,17 @@ module BlingFire
15
15
  if Gem.win_platform?
16
16
  "blingfiretokdll.dll"
17
17
  elsif RbConfig::CONFIG["host_os"] =~ /darwin/i
18
- "libblingfiretokdll.dylib"
18
+ if RbConfig::CONFIG["host_cpu"] =~ /arm/i
19
+ "libblingfiretokdll.arm64.dylib"
20
+ else
21
+ "libblingfiretokdll.dylib"
22
+ end
19
23
  else
20
- "libblingfiretokdll.so"
24
+ if RbConfig::CONFIG["host_cpu"] =~ /aarch64/i
25
+ "libblingfiretokdll.arm64.so"
26
+ else
27
+ "libblingfiretokdll.so"
28
+ end
21
29
  end
22
30
  vendor_lib = File.expand_path("../vendor/#{lib_name}", __dir__)
23
31
  self.ffi_lib = [vendor_lib]
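The platform branch in the hunk above can be read as a pure function of the host OS and CPU strings. The sketch below restates it that way so each case can be exercised without the matching hardware; `vendored_lib_name` and its `windows:` keyword are hypothetical names introduced here, not part of the gem.

```ruby
# Hypothetical helper mirroring the branch logic in the hunk above.
# host_os and host_cpu are passed in explicitly (the gem itself reads
# them from RbConfig::CONFIG), and windows stands in for Gem.win_platform?.
def vendored_lib_name(host_os, host_cpu, windows: false)
  if windows
    "blingfiretokdll.dll"
  elsif host_os.match?(/darwin/i)
    # macOS: an ARM build ships alongside the default one
    host_cpu.match?(/arm/i) ? "libblingfiretokdll.arm64.dylib" : "libblingfiretokdll.dylib"
  else
    # everything else is treated as Linux; only aarch64 selects the ARM build
    host_cpu.match?(/aarch64/i) ? "libblingfiretokdll.arm64.so" : "libblingfiretokdll.so"
  end
end

# Example call against the running interpreter:
#   require "rbconfig"
#   vendored_lib_name(RbConfig::CONFIG["host_os"], RbConfig::CONFIG["host_cpu"],
#                     windows: Gem.win_platform?)
```

As written in the diff, the Linux branch matches only `aarch64`, so a 32-bit ARM `host_cpu` such as `armv7l` would fall through to the default `libblingfiretokdll.so`.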
@@ -35,57 +43,144 @@ module BlingFire
35
43
  end
36
44
 
37
45
  def text_to_words(text)
38
- text = encode_utf8(text.dup) unless text.encoding == Encoding::UTF_8
39
- out = Fiddle::Pointer.malloc(text.bytesize * 3)
40
- out_size = FFI.TextToWords(text, text.bytesize, out, out.size)
41
- check_status out_size
42
- encode_utf8(out[0, out_size - 1]).split(" ")
46
+ text_to(text, " ") do |t, out|
47
+ FFI.TextToWords(t, t.bytesize, out, out.size)
48
+ end
43
49
  end
44
50
 
45
51
  def text_to_words_with_model(model, text)
46
- text = encode_utf8(text.dup) unless text.encoding == Encoding::UTF_8
47
- out = Fiddle::Pointer.malloc(text.bytesize * 3)
48
- out_size = FFI.TextToWordsWithModel(text, text.bytesize, out, out.size, model)
49
- check_status out_size
50
- encode_utf8(out[0, out_size - 1]).split(" ")
52
+ text_to(text, " ") do |t, out|
53
+ FFI.TextToWordsWithModel(t, t.bytesize, out, out.size, model)
54
+ end
55
+ end
56
+
57
+ def text_to_words_with_offsets(text)
58
+ text_to_with_offsets(text, " ") do |t, out, start_offsets, end_offsets|
59
+ FFI.TextToWordsWithOffsets(t, t.bytesize, out, start_offsets, end_offsets, out.size)
60
+ end
61
+ end
62
+
63
+ def text_to_words_with_offsets_with_model(model, text)
64
+ text_to_with_offsets(text, " ") do |t, out, start_offsets, end_offsets|
65
+ FFI.TextToWordsWithOffsetsWithModel(t, t.bytesize, out, start_offsets, end_offsets, out.size, model)
66
+ end
51
67
  end
52
68
 
53
69
  def text_to_sentences(text)
54
- text = encode_utf8(text.dup) unless text.encoding == Encoding::UTF_8
55
- out = Fiddle::Pointer.malloc(text.bytesize * 3)
56
- out_size = FFI.TextToSentences(text, text.bytesize, out, out.size)
57
- check_status out_size
58
- encode_utf8(out[0, out_size - 1]).split("\n")
70
+ text_to(text, "\n") do |t, out|
71
+ FFI.TextToSentences(t, t.bytesize, out, out.size)
72
+ end
59
73
  end
60
74
 
61
75
  def text_to_sentences_with_model(model, text)
62
- text = encode_utf8(text.dup) unless text.encoding == Encoding::UTF_8
63
- out = Fiddle::Pointer.malloc(text.bytesize * 3)
64
- out_size = FFI.TextToSentencesWithModel(text, text.bytesize, out, out.size, model)
65
- check_status out_size
66
- encode_utf8(out[0, out_size - 1]).split("\n")
76
+ text_to(text, "\n") do |t, out|
77
+ FFI.TextToSentencesWithModel(t, t.bytesize, out, out.size, model)
78
+ end
79
+ end
80
+
81
+ def text_to_sentences_with_offsets(text)
82
+ text_to_with_offsets(text, "\n") do |t, out, start_offsets, end_offsets|
83
+ FFI.TextToSentencesWithOffsets(t, t.bytesize, out, start_offsets, end_offsets, out.size)
84
+ end
85
+ end
86
+
87
+ def text_to_sentences_with_offsets_with_model(model, text)
88
+ text_to_with_offsets(text, "\n") do |t, out, start_offsets, end_offsets|
89
+ FFI.TextToSentencesWithOffsetsWithModel(t, t.bytesize, out, start_offsets, end_offsets, out.size, model)
90
+ end
67
91
  end
68
92
 
69
93
  def text_to_ids(model, text, max_len = nil, unk_id = 0)
70
94
  text = encode_utf8(text.dup) unless text.encoding == Encoding::UTF_8
71
95
  ids = Fiddle::Pointer.malloc((max_len || text.size) * Fiddle::SIZEOF_INT)
72
96
  out_size = FFI.TextToIds(model, text, text.bytesize, ids, ids.size, unk_id)
73
- check_status out_size
97
+ check_status out_size, ids
74
98
  ids[0, (max_len || out_size) * Fiddle::SIZEOF_INT].unpack("i!*")
75
99
  end
76
100
 
101
+ def text_to_ids_with_offsets(model, text, max_len = nil, unk_id = 0)
102
+ text = encode_utf8(text.dup) unless text.encoding == Encoding::UTF_8
103
+ ids = Fiddle::Pointer.malloc((max_len || text.size) * Fiddle::SIZEOF_INT)
104
+
105
+ start_offsets = Fiddle::Pointer.malloc(Fiddle::SIZEOF_INT * ids.size)
106
+ end_offsets = Fiddle::Pointer.malloc(Fiddle::SIZEOF_INT * ids.size)
107
+
108
+ out_size = FFI.TextToIdsWithOffsets(model, text, text.bytesize, ids, start_offsets, end_offsets, ids.size, unk_id)
109
+
110
+ check_status out_size, ids
111
+
112
+ result = ids[0, (max_len || out_size) * Fiddle::SIZEOF_INT].unpack("i!*")
113
+ [result].concat(unpack_offsets(start_offsets, end_offsets, result, text))
114
+ end
115
+
77
116
  def free_model(model)
78
117
  FFI.FreeModel(model)
79
118
  end
80
119
 
120
+ def normalize_spaces(text)
121
+ u_space = 0x20
122
+ text = encode_utf8(text.dup) unless text.encoding == Encoding::UTF_8
123
+ out = Fiddle::Pointer.malloc([text.bytesize * 1.5, 20].max)
124
+ out_size = FFI.NormalizeSpaces(text, text.bytesize, out, out.size, u_space)
125
+ check_status out_size, out
126
+ encode_utf8(out.to_str(out_size))
127
+ end
128
+
81
129
  private
82
130
 
83
- def check_status(ret)
84
- raise Error, "Bad status" if ret == -1
131
+ def check_status(ret, ptr)
132
+ raise Error, "Not enough memory allocated" if ret == -1 || ret > ptr.size
133
+ end
134
+
135
+ def text_to(text, sep)
136
+ text = encode_utf8(text.dup) unless text.encoding == Encoding::UTF_8
137
+ # TODO allocate less, and try again if needed
138
+ out = Fiddle::Pointer.malloc([text.bytesize * 3, 20].max)
139
+ out_size = yield(text, out)
140
+ check_status out_size, out
141
+ encode_utf8(out.to_str(out_size - 1)).split(sep)
142
+ end
143
+
144
+ def text_to_with_offsets(text, sep)
145
+ text = encode_utf8(text.dup) unless text.encoding == Encoding::UTF_8
146
+ # TODO allocate less, and try again if needed
147
+ out = Fiddle::Pointer.malloc([text.bytesize * 3, 20].max)
148
+
149
+ start_offsets = Fiddle::Pointer.malloc(Fiddle::SIZEOF_INT * out.size)
150
+ end_offsets = Fiddle::Pointer.malloc(Fiddle::SIZEOF_INT * out.size)
151
+
152
+ out_size = yield(text, out, start_offsets, end_offsets)
153
+
154
+ check_status out_size, out
155
+
156
+ result = encode_utf8(out.to_str(out_size - 1)).split(sep)
157
+ [result].concat(unpack_offsets(start_offsets, end_offsets, result, text))
85
158
  end
86
159
 
87
160
  def encode_utf8(text)
88
161
  text.force_encoding(Encoding::UTF_8)
89
162
  end
163
+
164
+ def unpack_offsets(start_offsets, end_offsets, result, text)
165
+ start_bytes = start_offsets.to_s(Fiddle::SIZEOF_INT * result.size).unpack("i*")
166
+ end_bytes = end_offsets.to_s(Fiddle::SIZEOF_INT * result.size).unpack("i*")
167
+ starts = []
168
+ ends = []
169
+
170
+ # convert byte offsets to character offsets
171
+ # TODO see if more efficient to store next_pos in variable
172
+ pos = 0
173
+ text.each_char.with_index do |c, i|
174
+ while pos == start_bytes[starts.size]
175
+ starts << i
176
+ end
177
+ pos += c.bytesize
178
+ while pos - 1 == end_bytes[ends.size]
179
+ ends << i + 1
180
+ end
181
+ end
182
+
183
+ [starts, ends]
184
+ end
90
185
  end
91
186
  end
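The `unpack_offsets` helper added above converts byte offsets into character offsets, with start offsets inclusive and end offsets pointing at the last byte of each token. The sketch below illustrates the same conversion with a different technique: a byte-to-character lookup table rather than the single pass used in the diff. `byte_to_char_offsets` is a hypothetical name, and the offsets in the example are written by hand rather than produced by Bling Fire.

```ruby
# Sketch: map byte offsets into a UTF-8 string to character offsets.
# byte_starts are first-byte offsets; byte_ends are last-byte offsets
# (inclusive), an assumption read off the conversion loop in the diff.
def byte_to_char_offsets(text, byte_starts, byte_ends)
  # char_at_byte[b] is the index of the character that owns byte b
  char_at_byte = []
  text.each_char.with_index do |c, i|
    c.bytesize.times { char_at_byte << i }
  end

  starts = byte_starts.map { |b| char_at_byte[b] }
  ends = byte_ends.map { |b| char_at_byte[b] + 1 } # exclusive character end
  [starts, ends]
end

# "héllo wörld": é and ö each occupy two bytes in UTF-8
text = "h\u00e9llo w\u00f6rld"
starts, ends = byte_to_char_offsets(text, [0, 7], [5, 12])

# Character offsets slice the original string directly
words = starts.zip(ends).map { |s, e| text[s...e] }
```

The lookup table uses one array entry per input byte, so the single-pass loop in the diff is lighter on memory; the table form is shown only because it makes the inclusive-to-exclusive end conversion easy to see.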
data/lib/blingfire/ffi.rb CHANGED
@@ -10,13 +10,35 @@ module BlingFire
10
10
  raise e
11
11
  end
12
12
 
13
+ # https://github.com/microsoft/BlingFire/blob/master/blingfiretools/blingfiretokdll/blingfiretokdll.cpp
14
+
15
+ # version
13
16
  extern "int GetBlingFireTokVersion()"
14
- extern "void* LoadModel(char * pszLdbFileName)"
15
- extern "int FreeModel(void* ModelPtr)"
17
+
18
+ # text to sentences
19
+ extern "int TextToSentencesWithOffsetsWithModel(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int * pStartOffsets, int * pEndOffsets, int MaxOutUtf8StrByteCount, void * hModel)"
20
+ extern "int TextToSentencesWithOffsets(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int * pStartOffsets, int * pEndOffsets, int MaxOutUtf8StrByteCount)"
21
+ extern "int TextToSentencesWithModel(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int MaxOutUtf8StrByteCount, void * hModel)"
22
+ extern "int TextToSentences(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int MaxOutUtf8StrByteCount)"
23
+
24
+ # text to words
25
+ extern "int TextToWordsWithOffsetsWithModel(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int * pStartOffsets, int * pEndOffsets, int MaxOutUtf8StrByteCount, void * hModel)"
26
+ extern "int TextToWordsWithOffsets(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int * pStartOffsets, int * pEndOffsets, int MaxOutUtf8StrByteCount)"
16
27
  extern "int TextToWordsWithModel(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int MaxOutUtf8StrByteCount, void * hModel)"
17
28
  extern "int TextToWords(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int MaxOutUtf8StrByteCount)"
29
+
30
+ # misc
31
+ extern "int NormalizeSpaces(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int MaxOutUtf8StrByteCount, int uSpace)"
32
+ extern "int TextToHashes(char * pInUtf8Str, int InUtf8StrByteCount, int32_t * pHashArr, int MaxHashArrLength, int wordNgrams, int bucketSize)"
33
+
34
+ # model
35
+ extern "void* LoadModel(char * pszLdbFileName)"
36
+
37
+ # text to ids
38
+ extern "int TextToIdsWithOffsets(void* ModelPtr, char * pInUtf8Str, int InUtf8StrByteCount, int32_t * pIdsArr, int * pStartOffsets, int * pEndOffsets, int MaxIdsArrLength, int UnkId)"
18
39
  extern "int TextToIds(void* ModelPtr, char * pInUtf8Str, int InUtf8StrByteCount, int32_t * pIdsArr, int MaxIdsArrLength, int UnkId)"
19
- extern "int TextToSentences(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int MaxOutUtf8StrByteCount)"
20
- extern "int TextToSentencesWithModel(char * pInUtf8Str, int InUtf8StrByteCount, char * pOutUtf8Str, int MaxOutUtf8StrByteCount, void * hModel)"
40
+
41
+ # free model
42
+ extern "int FreeModel(void* ModelPtr)"
21
43
  end
22
44
  end
data/lib/blingfire/model.rb CHANGED
@@ -1,7 +1,9 @@
1
1
  module BlingFire
2
2
  class Model
3
3
  def initialize(path = nil)
4
+ @handle = nil
4
5
  if path
6
+ raise Error, "Model not found" unless File.exist?(path)
5
7
  @handle = FFI.LoadModel(path)
6
8
  ObjectSpace.define_finalizer(self, self.class.finalize(@handle))
7
9
  end
@@ -15,6 +17,14 @@ module BlingFire
15
17
  end
16
18
  end
17
19
 
20
+ def text_to_words_with_offsets(text)
21
+ if @handle
22
+ BlingFire.text_to_words_with_offsets_with_model(@handle, text)
23
+ else
24
+ BlingFire.text_to_words_with_offsets(text)
25
+ end
26
+ end
27
+
18
28
  def text_to_sentences(text)
19
29
  if @handle
20
30
  BlingFire.text_to_sentences_with_model(@handle, text)
@@ -23,6 +33,14 @@ module BlingFire
23
33
  end
24
34
  end
25
35
 
36
+ def text_to_sentences_with_offsets(text)
37
+ if @handle
38
+ BlingFire.text_to_sentences_with_offsets_with_model(@handle, text)
39
+ else
40
+ BlingFire.text_to_sentences_with_offsets(text)
41
+ end
42
+ end
43
+
26
44
  def text_to_ids(text, max_len = nil, unk_id = 0)
27
45
  if @handle
28
46
  BlingFire.text_to_ids(@handle, text, max_len, unk_id)
@@ -31,6 +49,14 @@ module BlingFire
31
49
  end
32
50
  end
33
51
 
52
+ def text_to_ids_with_offsets(text, max_len = nil, unk_id = 0)
53
+ if @handle
54
+ BlingFire.text_to_ids_with_offsets(@handle, text, max_len, unk_id)
55
+ else
56
+ raise "Not implemented"
57
+ end
58
+ end
59
+
34
60
  def to_ptr
35
61
  @handle
36
62
  end
data/lib/blingfire/version.rb CHANGED
@@ -1,3 +1,3 @@
1
1
  module BlingFire
2
- VERSION = "0.1.0"
2
+ VERSION = "0.1.5"
3
3
  end
Binary file
Binary file
Binary file
Binary file
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: blingfire
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.0
4
+ version: 0.1.5
5
5
  platform: ruby
6
6
  authors:
7
7
  - Andrew Kane
8
- autorequire:
8
+ autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2020-02-24 00:00:00.000000000 Z
11
+ date: 2021-03-15 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: bundler
@@ -52,7 +52,7 @@ dependencies:
52
52
  - - ">="
53
53
  - !ruby/object:Gem::Version
54
54
  version: '5'
55
- description:
55
+ description:
56
56
  email: andrew@chartkick.com
57
57
  executables: []
58
58
  extensions: []
@@ -67,13 +67,15 @@ files:
67
67
  - lib/blingfire/version.rb
68
68
  - vendor/LICENSE
69
69
  - vendor/blingfiretokdll.dll
70
+ - vendor/libblingfiretokdll.arm64.dylib
71
+ - vendor/libblingfiretokdll.arm64.so
70
72
  - vendor/libblingfiretokdll.dylib
71
73
  - vendor/libblingfiretokdll.so
72
74
  homepage: https://github.com/ankane/blingfire
73
75
  licenses:
74
76
  - MIT
75
77
  metadata: {}
76
- post_install_message:
78
+ post_install_message:
77
79
  rdoc_options: []
78
80
  require_paths:
79
81
  - lib
@@ -88,8 +90,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
88
90
  - !ruby/object:Gem::Version
89
91
  version: '0'
90
92
  requirements: []
91
- rubygems_version: 3.1.2
92
- signing_key:
93
+ rubygems_version: 3.2.3
94
+ signing_key:
93
95
  specification_version: 4
94
96
  summary: High speed text tokenization for Ruby
95
97
  test_files: []