youtokentome 0.1.0
- checksums.yaml +7 -0
- data/CHANGELOG.md +3 -0
- data/LICENSE.txt +22 -0
- data/README.md +104 -0
- data/ext/youtokentome/ext.cpp +135 -0
- data/ext/youtokentome/extconf.rb +12 -0
- data/lib/youtokentome.rb +10 -0
- data/lib/youtokentome/bpe.rb +54 -0
- data/lib/youtokentome/ext.bundle +0 -0
- data/lib/youtokentome/version.rb +3 -0
- data/vendor/YouTokenToMe/LICENSE +19 -0
- data/vendor/YouTokenToMe/README.md +304 -0
- data/vendor/YouTokenToMe/youtokentome/cpp/bpe.cpp +2185 -0
- data/vendor/YouTokenToMe/youtokentome/cpp/bpe.h +86 -0
- data/vendor/YouTokenToMe/youtokentome/cpp/third_party/LICENSE +23 -0
- data/vendor/YouTokenToMe/youtokentome/cpp/third_party/flat_hash_map.h +1502 -0
- data/vendor/YouTokenToMe/youtokentome/cpp/utf8.cpp +134 -0
- data/vendor/YouTokenToMe/youtokentome/cpp/utf8.h +23 -0
- data/vendor/YouTokenToMe/youtokentome/cpp/utils.cpp +119 -0
- data/vendor/YouTokenToMe/youtokentome/cpp/utils.h +105 -0
- data/vendor/YouTokenToMe/youtokentome/cpp/yttm.pyx +182 -0
- metadata +133 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: e55b4f8edc8306370a1e5f7138bbbb00912b29234e7169d4dd7233fee04934cb
+  data.tar.gz: dea649fc649c23a0955ed603867e7312a66d64f241fadad7fb057a9164bda285
+SHA512:
+  metadata.gz: '080b09ffa1cb1721d321e7af92b980087e2bd77e74b76127e4f2131c1cb4a72895ea7ed121d4a2a79ded30989e862d40ab6e3a01c0489da130ef131f66f37b96'
+  data.tar.gz: 741c1c809801be24105a52be8884dd74e273b068c9c9d004dfbede6a9e050ecc662acf0578bc3b2ec8f20002c117003be2f5d3e79ee399408c466e5fe57d2af4
data/CHANGELOG.md
ADDED
data/LICENSE.txt
ADDED
@@ -0,0 +1,22 @@
+Copyright (c) 2020 Andrew Kane
+
+MIT License
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,104 @@
+# YouTokenToMe
+
+:fire: [YouTokenToMe](https://github.com/VKCOM/YouTokenToMe) - the high performance unsupervised text tokenizer - for Ruby
+
+## Installation
+
+Add this line to your application’s Gemfile:
+
+```ruby
+gem 'youtokentome'
+```
+
+## Getting Started
+
+Train a model
+
+```ruby
+model = YouTokenToMe::BPE.train(data: "train.txt", model: "model.txt", vocab_size: 30000)
+```
+
+Load a model
+
+```ruby
+model = YouTokenToMe::BPE.new("model.txt")
+```
+
+Get vocab
+
+```ruby
+model.vocab
+```
+
+Encode
+
+```ruby
+model.encode(sentences)
+```
+
+Decode
+
+```ruby
+model.decode(ids)
+```
+
+Convert between ids and subwords
+
+```ruby
+model.subword_to_id(subword)
+model.id_to_subword(id)
+```
+
+## Options
+
+Train
+
+```ruby
+YouTokenToMe::BPE.train(
+  data: "train.txt",   # path to file with training data
+  model: "model.txt",  # path to where the trained model will be saved
+  vocab_size: 30000,   # number of tokens in the final vocabulary
+  coverage: 1.0,       # fraction of characters covered by the model
+  n_threads: -1,       # number of parallel threads used to run
+  pad_id: 0,           # reserved id for padding
+  unk_id: 1,           # reserved id for unknown symbols
+  bos_id: 2,           # reserved id for begin of sentence token
+  eos_id: 3            # reserved id for end of sentence token
+)
+```
+
+Encode
+
+```ruby
+model.encode(
+  sentences,
+  output_type: :id,  # or :subword
+  bos: false,        # add "beginning of sentence" token
+  eos: false,        # add "end of sentence" token
+  reverse: false,    # reverse output sequence of tokens
+  dropout_prob: 0.0  # BPE-dropout probability
+)
+```
+
+## History
+
+View the [changelog](https://github.com/ankane/youtokentome/blob/master/CHANGELOG.md)
+
+## Contributing
+
+Everyone is encouraged to help improve this project. Here are a few ways you can help:
+
+- [Report bugs](https://github.com/ankane/youtokentome/issues)
+- Fix bugs and [submit pull requests](https://github.com/ankane/youtokentome/pulls)
+- Write, clarify, or fix documentation
+- Suggest or add new features
+
+To get started with development:
+
+```sh
+git clone https://github.com/ankane/youtokentome.git
+cd youtokentome
+bundle install
+bundle exec rake compile
+bundle exec rake test
+```
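The gem wraps the BPE algorithm summarized in the README above. As a rough illustration of the core idea — repeatedly merging the most frequent adjacent symbol pair — here is a toy Ruby sketch. It is not the library's actual `O(N)` C++ implementation, and the method names are illustrative only:

```ruby
# Toy BPE training step: find the most frequent adjacent pair of
# symbols across all words, then merge every occurrence of it.
def most_frequent_pair(words)
  counts = Hash.new(0)
  words.each do |symbols|
    symbols.each_cons(2) { |a, b| counts[[a, b]] += 1 }
  end
  counts.max_by { |_, c| c }&.first
end

def merge_pair(words, pair)
  words.map do |symbols|
    merged = []
    i = 0
    while i < symbols.size
      if symbols[i] == pair[0] && symbols[i + 1] == pair[1]
        merged << (pair[0] + pair[1])
        i += 2
      else
        merged << symbols[i]
        i += 1
      end
    end
    merged
  end
end

words = [%w[l o w], %w[l o w e r], %w[l o w e s t]]
pair = most_frequent_pair(words)  # ["l", "o"] occurs 3 times
words = merge_pair(words, pair)
# => [["lo", "w"], ["lo", "w", "e", "r"], ["lo", "w", "e", "s", "t"]]
```

Running this loop until the vocabulary reaches `vocab_size` is, in essence, what training does; the real implementation also respects word boundaries and the reserved special tokens.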
data/ext/youtokentome/ext.cpp
ADDED
@@ -0,0 +1,135 @@
+// youtokentome
+#include <bpe.h>
+#include <utils.h>
+
+// rice
+#include <rice/Array.hpp>
+#include <rice/Data_Type.hpp>
+#include <rice/Object.hpp>
+
+using Rice::define_class_under;
+using Rice::define_module;
+using Rice::define_module_under;
+using Rice::Array;
+using Rice::Module;
+using Rice::Object;
+
+void check_status(vkcom::Status& status) {
+  if (!status.ok()) {
+    throw std::invalid_argument(status.error_message());
+  }
+}
+
+template<>
+Object to_ruby<std::vector<std::string>>(std::vector<std::string> const & x)
+{
+  Array ret;
+  for (auto& v : x) {
+    ret.push(v);
+  }
+  return ret;
+}
+
+template<>
+std::vector<int> from_ruby<std::vector<int>>(Object x)
+{
+  std::vector<int> ret;
+  Array a = Array(x);
+  for (size_t i = 0; i < a.size(); i++) {
+    ret.push_back(from_ruby<int>(a[i]));
+  }
+  return ret;
+}
+
+template<>
+std::vector<std::string> from_ruby<std::vector<std::string>>(Object x)
+{
+  std::vector<std::string> ret;
+  Array a = Array(x);
+  for (size_t i = 0; i < a.size(); i++) {
+    ret.push_back(from_ruby<std::string>(a[i]));
+  }
+  return ret;
+}
+
+extern "C" void Init_ext() {
+  Module rb_mYouTokenToMe = define_module("YouTokenToMe");
+  Module rb_mExt = define_module_under(rb_mYouTokenToMe, "Ext")
+    .define_singleton_method(
+      "train_bpe",
+      *[](std::string &input_path, std::string &model_path, int vocab_size, double coverage,
+          int n_threads, int pad_id, int unk_id, int bos_id, int eos_id) {
+
+        vkcom::SpecialTokens special_tokens(pad_id, unk_id, bos_id, eos_id);
+        vkcom::BpeConfig config(coverage, n_threads, special_tokens);
+        auto status = vkcom::train_bpe(input_path, model_path, vocab_size, config);
+        check_status(status);
+      });
+
+  define_class_under<vkcom::BaseEncoder>(rb_mExt, "BaseEncoder")
+    .define_method("vocab_size", &vkcom::BaseEncoder::vocab_size)
+    .define_method("subword_to_id", &vkcom::BaseEncoder::subword_to_id)
+    .define_method(
+      "id_to_subword",
+      *[](vkcom::BaseEncoder& self, int id) {
+        std::string subword;
+        auto status = self.id_to_subword(id, &subword);
+        check_status(status);
+        return subword;
+      })
+    .define_method(
+      "decode",
+      *[](vkcom::BaseEncoder& self, std::vector<int> ids) {
+        std::string sentence;
+        const std::unordered_set<int> ignore_ids;
+        auto status = self.decode(ids, &sentence, &ignore_ids);
+        check_status(status);
+
+        Array ret;
+        ret.push(sentence);
+        return ret;
+      })
+    .define_method(
+      "encode_as_ids",
+      *[](vkcom::BaseEncoder& self, std::vector<std::string> sentences, bool bos, bool eos, bool reverse, double dropout_prob) {
+        std::vector<std::vector<int>> ids;
+        auto status = self.encode_as_ids(sentences, &ids, bos, eos, reverse, dropout_prob);
+        check_status(status);
+
+        Array ret;
+        for (auto& v : ids) {
+          Array r;
+          for (auto& v2 : v) {
+            r.push(v2);
+          }
+          ret.push(r);
+        }
+        return ret;
+      })
+    .define_method(
+      "encode_as_subwords",
+      *[](vkcom::BaseEncoder& self, std::vector<std::string> sentences, bool bos, bool eos, bool reverse, double dropout_prob) {
+        std::vector<std::vector<std::string>> subwords;
+        auto status = self.encode_as_subwords(sentences, &subwords, bos, eos, reverse, dropout_prob);
+        check_status(status);
+
+        Array ret;
+        for (auto& v : subwords) {
+          Array r;
+          for (auto& v2 : v) {
+            r.push(v2);
+          }
+          ret.push(r);
+        }
+        return ret;
+      })
+    .define_method("vocab", &vkcom::BaseEncoder::vocabulary)
+    .define_singleton_method(
+      "new",
+      *[](std::string &model_path, int n_threads) {
+        auto status = vkcom::Status();
+        vkcom::BaseEncoder encoder(model_path, n_threads, &status);
+        check_status(status);
+        return encoder;
+      });
+}
data/ext/youtokentome/extconf.rb
ADDED
@@ -0,0 +1,12 @@
+require "mkmf-rice"
+
+$CXXFLAGS << " -std=c++11"
+
+ext = File.expand_path(".", __dir__)
+youtokentome = File.expand_path("../../vendor/YouTokenToMe/youtokentome/cpp", __dir__)
+
+$srcs = Dir["{#{ext},#{youtokentome}}/*.{cc,cpp}"]
+$INCFLAGS << " -I#{youtokentome}"
+$VPATH << youtokentome
+
+create_makefile("youtokentome/ext")
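The extconf above gathers C++ sources from both the extension directory and the vendored YouTokenToMe tree with a single brace glob. A minimal, self-contained demonstration of that glob pattern (the directory and file names below are made up for the example):

```ruby
require "tmpdir"

# Dir[] with brace alternation matches .cc and .cpp files under either
# directory in one pattern, which is how extconf.rb builds $srcs.
def collect_sources(ext_dir, vendor_dir)
  Dir["{#{ext_dir},#{vendor_dir}}/*.{cc,cpp}"]
end

Dir.mktmpdir do |root|
  ext = File.join(root, "ext")
  vendor = File.join(root, "vendor")
  [ext, vendor].each { |d| Dir.mkdir(d) }
  File.write(File.join(ext, "ext.cpp"), "")
  File.write(File.join(vendor, "bpe.cpp"), "")
  File.write(File.join(vendor, "utils.cc"), "")

  puts collect_sources(ext, vendor).size  # 3
end
```

The `$VPATH << youtokentome` line then lets make locate those vendored sources when compiling out of tree.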
data/lib/youtokentome/bpe.rb
ADDED
@@ -0,0 +1,54 @@
+module YouTokenToMe
+  class BPE
+    def initialize(model, n_threads: -1)
+      @encoder = Ext::BaseEncoder.new(model, n_threads)
+    end
+
+    def vocab_size
+      @encoder.vocab_size
+    end
+
+    def vocab
+      vocab = @encoder.vocab
+      vocab.each do |v|
+        v.force_encoding(Encoding::UTF_8)
+      end
+      vocab
+    end
+
+    def subword_to_id(subword)
+      @encoder.subword_to_id(subword)
+    end
+
+    def id_to_subword(id)
+      @encoder.id_to_subword(id)
+    end
+
+    def encode(sentences, output_type: :id, bos: false, eos: false, reverse: false, dropout_prob: 0)
+      case output_type
+      when :id
+        @encoder.encode_as_ids(sentences, bos, eos, reverse, dropout_prob)
+      when :subword
+        subwords = @encoder.encode_as_subwords(sentences, bos, eos, reverse, dropout_prob)
+        subwords.each do |s|
+          s.each do |v|
+            v.force_encoding(Encoding::UTF_8)
+          end
+        end
+        subwords
+      else
+        raise ArgumentError, "Unknown output type"
+      end
+    end
+
+    # TODO add ignore_ids
+    def decode(ids)
+      @encoder.decode(ids)
+    end
+
+    def self.train(data:, model:, vocab_size:, coverage: 1.0, n_threads: -1, pad_id: 0, unk_id: 1, bos_id: 2, eos_id: 3)
+      Ext.train_bpe(data, model, vocab_size, coverage, n_threads, pad_id, unk_id, bos_id, eos_id)
+      new(model, n_threads: n_threads)
+    end
+  end
+end
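The `force_encoding(Encoding::UTF_8)` calls in `vocab` and `encode` above exist because strings built from raw bytes in a C extension are typically tagged as binary (ASCII-8BIT). `force_encoding` relabels the bytes in place without copying. A small standalone illustration:

```ruby
# Simulate a string as a C extension would return it: correct UTF-8
# bytes, but tagged as binary. Relabeling fixes the encoding tag.
s = "▁token".dup.force_encoding(Encoding::ASCII_8BIT)
puts s.encoding         # ASCII-8BIT

s.force_encoding(Encoding::UTF_8)
puts s.encoding         # UTF-8
puts s.valid_encoding?  # true
```

This is why the wrapper retags each returned vocab entry and subword before handing it back to Ruby code.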
data/lib/youtokentome/ext.bundle
ADDED
Binary file
data/vendor/YouTokenToMe/LICENSE
ADDED
@@ -0,0 +1,19 @@
+Copyright (c) 2019 VK.com
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
data/vendor/YouTokenToMe/README.md
ADDED
@@ -0,0 +1,304 @@
+![PyPI](https://img.shields.io/pypi/v/youtokentome.svg)
+[![Downloads](https://pepy.tech/badge/youtokentome)](https://pepy.tech/project/youtokentome)
+[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/python/black)
+![GitHub](https://img.shields.io/github/license/vkcom/youtokentome.svg)
+[![Build Status](https://travis-ci.org/VKCOM/YouTokenToMe.svg?branch=master)](https://travis-ci.org/VKCOM/YouTokenToMe)
+
+# YouTokenToMe
+
+YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE) [[Sennrich et al.](https://www.aclweb.org/anthology/P16-1162)].
+Our implementation is much faster in training and tokenization than [Hugging Face](https://github.com/huggingface/tokenizers), [fastBPE](https://github.com/glample/fastBPE)
+and [SentencePiece](https://github.com/google/sentencepiece). In some test cases, it is 90 times faster.
+Check out our [benchmark](benchmark.md) results.
+
+Key advantages:
+
+* Multithreading for training and tokenization
+* The algorithm has `O(N)` complexity, where `N` is the length of the training data
+* Highly efficient implementation in C++
+* Python wrapper and command-line interface
+
+Extra features:
+* BPE-dropout (as described in [Provilkov et al., 2019](https://arxiv.org/abs/1910.13267))
+
+As in the algorithm from the original paper, ours does not consider tokens
+that cross word boundaries. Just like in [SentencePiece](https://github.com/google/sentencepiece), all space symbols are replaced by the meta symbol "▁" (U+2581). This allows sequences of tokens to be converted back to text and word boundaries to be restored.
+
+For example, the phrase ```Blazingly fast tokenization!``` can be tokenized into
+
+`['▁Bl', 'az', 'ingly', '▁fast', '▁token', 'ization', '!']`
+
+## Installation
+
+```bash
+pip install youtokentome
+```
+## Python interface
+
+### Example
+Let's start with a self-contained example.
+
+```python
+import random
+
+import youtokentome as yttm
+
+train_data_path = "train_data.txt"
+model_path = "example.model"
+
+# Generating a random file with training data
+# 10000 lines with 100 characters in each line
+n_lines = 10000
+n_characters = 100
+with open(train_data_path, "w") as fout:
+    for _ in range(n_lines):
+        print("".join([random.choice("abcd ") for _ in range(n_characters)]), file=fout)
+
+# Generating random text
+test_text = "".join([random.choice("abcde ") for _ in range(100)])
+
+# Training model
+yttm.BPE.train(data=train_data_path, vocab_size=5000, model=model_path)
+
+# Loading model
+bpe = yttm.BPE(model=model_path)
+
+# Two types of tokenization
+print(bpe.encode([test_text], output_type=yttm.OutputType.ID))
+print(bpe.encode([test_text], output_type=yttm.OutputType.SUBWORD))
+```
+
+### Training model
+```python
+youtokentome.BPE.train(data, model, vocab_size, coverage, n_threads=-1, pad_id=0, unk_id=1, bos_id=2, eos_id=3)
+```
+Trains a BPE model and saves it to file.
+
+**Args:**
+
+* `data`: string, path to file with training data
+* `model`: string, path to where the trained model will be saved
+* `vocab_size`: int, number of tokens in the final vocabulary
+* `coverage`: float, fraction of characters covered by the model. Must be in the range [0, 1]. A good value to use is about 0.9999.
+* `n_threads`: int, number of parallel threads used to run. If -1 is passed, then all available threads are going to be used. Note that the number of threads is limited to 8 (see [benchmark](benchmark.md#number-of-threads)).
+* `pad_id`: int, reserved id for padding
+* `unk_id`: int, reserved id for unknown symbols
+* `bos_id`: int, reserved id for begin of sentence token
+* `eos_id`: int, reserved id for end of sentence token
+
+**Returns**: Class `youtokentome.BPE` with the loaded model.
+
+### Model loading
+
+```python
+youtokentome.BPE(model, n_threads=-1)
+```
+
+Class constructor. Loads the trained model.
+
+* `model`: string, path to the trained model
+* `n_threads`: int, number of parallel threads used to run.
+  If equal to -1, then the maximum number of threads available will be used.
+
+### Methods
+Class `youtokentome.BPE` has the following methods:
+#### encode
+```python
+encode(self, sentences, output_type=yttm.OutputType.ID, bos=False, eos=False, reverse=False, dropout_prob=0)
+```
+
+**Args:**
+
+* `sentences`: list of strings, sentences for tokenization.
+* `output_type`: enum, a sentence can be tokenized to ids or subwords. Use `OutputType.ID` for ids and `OutputType.SUBWORD` for subwords.
+* `bos`: bool, if True then the "beginning of sentence" token will be added
+* `eos`: bool, if True then the "end of sentence" token will be added
+* `reverse`: bool, if True the output sequence of tokens will be reversed
+* `dropout_prob`: float, BPE-dropout probability (the probability of a merge being dropped). Must be in the range [0, 1].
+
+**Returns:** If `output_type` is equal to `youtokentome.OutputType.ID` or `youtokentome.OutputType.SUBWORD`
+then a list of lists of integers or a list of lists of strings will be returned
+respectively.
+
+#### vocab
+
+```python
+vocab(self)
+```
+
+**Returns:** A list of `vocab_size` strings. The i-th string in the list corresponds
+to the i-th subword.
+
+#### vocab_size
+
+```python
+vocab_size(self)
+```
+
+**Returns:** int. Size of vocabulary.
+
+#### subword_to_id
+
+```python
+subword_to_id(self, subword)
+```
+**Args:**
+* `subword`: string.
+
+**Returns:**
+Integer from the range [0, vocab_size-1]. Id of the subword or,
+if there is no such subword in the vocabulary, `unk_id`.
+
+#### id_to_subword
+
+```python
+id_to_subword(self, id)
+```
+**Args:**
+* `id`: int, must be in the range [0, vocab_size-1]
+
+**Returns:** string. Subword from the vocabulary by id.
+
+#### decode
+```python
+decode(self, ids, ignore_ids=None)
+```
+Converts each id to its subword and concatenates them with the space symbol.
+
+**Args:**
+
+* `ids`: list of lists of integers. All integers must be in the range [0, vocab_size-1]
+* `ignore_ids`: collection of integers. These ids will be ignored during decoding. All integers must be in the range [0, vocab_size-1] [default: None]
+
+**Returns:** List of strings.
+
+## Command line interface
+
+### Example
+
+```bash
+$ yttm bpe --data TRAINING_DATA_FILE --model OUTPUT_MODEL_FILE --vocab_size 2000
+$ yttm encode --model OUTPUT_MODEL_FILE --output_type subword < TEST_DATA_FILE > ENCODED_DATA
+```
+
+### Supported commands
+
+`YouTokenToMe` supports the following commands:
+
+```
+$ yttm --help
+
+Usage: yttm [OPTIONS] COMMAND [ARGS]...
+
+Options:
+  --help  Show this message and exit.
+
+Commands:
+  bpe     Train BPE model.
+  decode  Decode ids to text.
+  encode  Encode text to ids or subwords.
+  vocab   Print list of learned subwords.
+```
+
+Command `bpe` allows you to train a Byte Pair Encoding model based on a text file.
+
+```
+$ yttm bpe --help
+
+Usage: yttm bpe [OPTIONS]
+
+  Train BPE model.
+
+Options:
+  --data PATH           Training data file path.  [required]
+  --model PATH          Output model file path.  [required]
+  --vocab_size INTEGER  Number of tokens in the final vocabulary.  [required]
+  --coverage FLOAT      Fraction of characters covered by the model.  [default: 1.0]
+  --n_threads INTEGER   Number of threads.  [default: -1]
+  --pad_id INTEGER      Padding token id.  [default: 0]
+  --unk_id INTEGER      Unknown token id.  [default: 1]
+  --bos_id INTEGER      'Begin of sentence' token id.  [default: 2]
+  --eos_id INTEGER      'End of sentence' token id.  [default: 3]
+  --help                Show this message and exit.
+```
+
+Command `encode` applies BPE encoding to a corpus of sentences. Use `stdin` for input and `stdout` for output.
+
+By default, encoding works in parallel using `n_threads` threads. The number of threads is limited to
+8 (see [benchmark](benchmark.md#number-of-threads)).
+
+With the `--stream` option, `--n_threads` will be ignored and all sentences will be processed one by one.
+Each sentence will be tokenized and written to `stdout` before the next sentence is read.
+
+```
+$ yttm encode --help
+
+Usage: yttm encode [OPTIONS]
+
+  Encode text to ids or subwords.
+
+Options:
+  --model PATH         Path to file with learned model.  [required]
+  --output_type TEXT   'id' or 'subword'.  [required]
+  --n_threads INTEGER  Number of threads.  [default: -1]
+  --bos                Add 'begin of sentence' token.
+  --eos                Add 'end of sentence' token.
+  --reverse            Reverse output sequence of tokens.
+  --stream             Process each line before reading the next one.
+  --dropout_prob       BPE-dropout probability (the probability of a merge being dropped). [default: 0]
+  --help               Show this message and exit.
+```
+
+Command `vocab` prints the vocabulary. This can be useful for understanding the model.
+
+```
+$ yttm vocab --help
+
+Usage: yttm vocab [OPTIONS]
+
+  Print list of learned subwords.
+
+Options:
+  --model PATH  Path to file with learned model.  [required]
+  --verbose     Add merging rules.
+  --help        Show this message and exit.
+```
+
+Command `decode` converts ids back to text. Use `stdin` for input and `stdout` for output.
+
+```
+$ yttm decode --help
+
+Usage: yttm decode [OPTIONS]
+
+  Decode ids to text.
+
+Options:
+  --model PATH  Path to file with learned model.  [required]
+  --ignore_ids  List of indices to ignore for decoding. Example: --ignore_ids=1,2,3
+  --help        Show this message and exit.
+```
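The vendored README above explains that spaces are replaced by the meta symbol "▁" (U+2581), which is what makes tokenization reversible. A minimal Ruby sketch of that reversal (illustrative only, not the library's actual decoder, and the helper name is made up):

```ruby
# Reverse the meta-symbol convention: concatenate subwords, then turn
# each "▁" (U+2581) back into a space and trim the leading one.
def subwords_to_text(subwords)
  subwords.join.gsub("\u2581", " ").strip
end

puts subwords_to_text(["▁Bl", "az", "ingly", "▁fast", "▁token", "ization", "!"])
# Blazingly fast tokenization!
```

The real `decode` additionally honors `ignore_ids` and maps ids through the vocabulary first; this sketch only shows why the "▁" convention preserves word boundaries.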