RubyGems - informers - Versions diffs - 0.1.1 → 0.2.0 - Mend

informers 0.1.1 → 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (18)

checksums.yaml +4 -4
data/CHANGELOG.md +14 -0
data/README.md +53 -18
data/lib/informers/feature_extraction.rb +59 -0
data/lib/informers/fill_mask.rb +109 -0
data/lib/informers/ner.rb +3 -3
data/lib/informers/question_answering.rb +2 -5
data/lib/informers/sentiment_analysis.rb +3 -2
data/lib/informers/text_generation.rb +54 -0
data/lib/informers/version.rb +1 -1
data/lib/informers.rb +4 -0
data/vendor/LICENSE-gpt2.txt +24 -0
data/vendor/LICENSE-roberta.txt +21 -0
data/vendor/gpt2.bin +0 -0
data/vendor/gpt2.i2w +0 -0
data/vendor/roberta.bin +0 -0
data/vendor/roberta.i2w +0 -0
metadata +25 -58

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 7fab4014ceee446289bf0fb5a3c5b32630462eddba5c1b10e2104c3c42ee43ea
-  data.tar.gz: a09cdc2dc9676a91a5e6c0ab0c3712653b99d2dbae1fe91b1ac55b32777c8cb9
+  metadata.gz: 22f7bcebf0670078b65fdf9cba4d2b937c853a3b10cf36e47f50781e2663225c
+  data.tar.gz: 940c96ec6b749b7e0b0c283456e40bfe9e6cbb3a58e8fa11f6367e87b05d8694
 SHA512:
-  metadata.gz: eb4693b6ff9cd60ccdaf727de793a90b0b8576bdc57376a22e78759032144c4bc470b9750bdc38e2288229370dd43f5ca982b407d553ea7ca73b5fff0d5a9e3a
-  data.tar.gz: 7ad9384587b2c12ff09d21c4f1074b297758c7ef65142bfc19bb730e3034d7b9febd180253068b6181814803d6d599e4685a4ef736c734b1b45c41ee700fd3ec
+  metadata.gz: 4cd8b58aae6e885409e297bc1ba09aedd029bb3dc26a193251f33c2bf6c9f6a8da69cb3727f799296a8c6644b014afc715e783a1e19a1074982af531e40db57b
+  data.tar.gz: 6f63489d0b303e9a7de13df11d5074bd4cb2dfa44febee4061262d5c188eeb62a7c975e89567048f801fa183c8d56925275768fccc9a4b5a48255abeeb379345
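
The digests above can be checked against a locally unpacked gem. A minimal sketch, assuming `data.tar.gz` has already been extracted from the `.gem` archive; the helper name `checksum_matches?` is hypothetical and not part of RubyGems:

```ruby
require "digest"

# Hypothetical helper: compares the SHA256 digest of a file on disk with the
# hex string recorded in checksums.yaml (case-insensitive).
def checksum_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.downcase
end
```

For the 0.2.0 release, `expected_hex` for `data.tar.gz` would be the `940c96ec...` value listed under `SHA256` in the hunk above.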

data/CHANGELOG.md CHANGED

@@ -1,3 +1,17 @@
+## 0.2.0 (2022-09-06)
+- Added support for `optimum` and `transformers.onnx` models
+- Dropped support for Ruby < 2.7
+## 0.1.3 (2021-09-25)
+- Added text generation
+- Added fill mask
+## 0.1.2 (2020-11-24)
+- Added feature extraction
 ## 0.1.1 (2020-10-05)
 - Fixed question answering for Ruby < 2.7

data/README.md CHANGED

@@ -7,24 +7,16 @@ Supports:
 - Sentiment analysis
 - Question answering
 - Named-entity recognition
-- Text generation - *in development*
-- Summarization - *in development*
-- Translation - *in development*
+- Text generation
-[![Build Status](https://travis-ci.org/ankane/informers.svg?branch=master)](https://travis-ci.org/ankane/informers)
+[![Build Status](https://github.com/ankane/informers/workflows/build/badge.svg?branch=master)](https://github.com/ankane/informers/actions)
 ## Installation
 Add this line to your application’s Gemfile:
 ```ruby
-gem 'informers'
-```
-On Mac, also install OpenMP:
-```sh
-brew install libomp
+gem "informers"
 ```
 ## Getting Started
@@ -32,6 +24,9 @@ brew install libomp
 - [Sentiment analysis](#sentiment-analysis)
 - [Question answering](#question-answering)
 - [Named-entity recognition](#named-entity-recognition)
+- [Text generation](#text-generation)
+- [Feature extraction](#feature-extraction)
+- [Fill mask](#fill-mask)
 ### Sentiment Analysis
@@ -58,11 +53,7 @@ model.predict(["This is super cool", "I didn't like it"])
 ### Question Answering
-First, download the [pretrained model](https://github.com/ankane/informers/releases/download/v0.1.0/question-answering.onnx) and add Numo to your application’s Gemfile:
-```ruby
-gem 'numo-narray'
-```
+First, download the [pretrained model](https://github.com/ankane/informers/releases/download/v0.1.0/question-answering.onnx).
 Ask a question with some context
@@ -101,15 +92,59 @@ This returns
 ]
 ```
+### Text Generation
+First, export the [pretrained model](tools/export.md).
+Pass a prompt
+```ruby
+model = Informers::TextGeneration.new("text-generation.onnx")
+model.predict("As far as I am concerned, I will", max_length: 50)
+```
+This returns
+```text
+As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a "free market." I think that the idea of a free market is a bit of a stretch. I think that the idea
+```
+### Feature Extraction
+First, export a [pretrained model](tools/export.md).
+```ruby
+model = Informers::FeatureExtraction.new("feature-extraction.onnx")
+model.predict("This is super cool")
+```
+### Fill Mask
+First, export a [pretrained model](tools/export.md).
+```ruby
+model = Informers::FillMask.new("fill-mask.onnx")
+model.predict("This is a great <mask>")
+```
 ## Models
 Task | Description | Contributor | License | Link
 --- | --- | --- | --- | ---
 Sentiment analysis | DistilBERT fine-tuned on SST-2 | Hugging Face | Apache-2.0 | [Link](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
-Question answering | DistilBERT | Hugging Face | Apache-2.0 | [Link](https://huggingface.co/distilbert-base-cased-distilled-squad)
+Question answering | DistilBERT fine-tuned on SQuAD | Hugging Face | Apache-2.0 | [Link](https://huggingface.co/distilbert-base-cased-distilled-squad)
 Named-entity recognition | BERT fine-tuned on CoNLL03 | Bayerische Staatsbibliothek | In-progress | [Link](https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)
+Text generation | GPT-2 | OpenAI | [Custom](https://github.com/openai/gpt-2/blob/master/LICENSE) | [Link](https://huggingface.co/gpt2)
+Some models are [quantized](https://medium.com/microsoftazure/faster-and-smaller-quantized-nlp-with-hugging-face-and-onnx-runtime-ec5525473bb7) to make them faster and smaller.
-Models are [quantized](https://medium.com/microsoftazure/faster-and-smaller-quantized-nlp-with-hugging-face-and-onnx-runtime-ec5525473bb7) to make them faster and smaller.
+## Deployment
+Check out [Trove](https://github.com/ankane/trove) for deploying models.
+```sh
+trove push sentiment-analysis.onnx
+```
 ## Credits

data/lib/informers/feature_extraction.rb ADDED

@@ -0,0 +1,59 @@
+# Copyright 2018 The HuggingFace Inc. team.
+# Copyright 2020 Andrew Kane.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+module Informers
+  class FeatureExtraction
+    def initialize(model_path)
+      tokenizer_path = File.expand_path("../../vendor/bert_base_cased_tok.bin", __dir__)
+      @tokenizer = BlingFire.load_model(tokenizer_path)
+      @model = OnnxRuntime::Model.new(model_path)
+    end
+    def predict(texts)
+      singular = !texts.is_a?(Array)
+      texts = [texts] if singular
+      # tokenize
+      input_ids =
+        texts.map do |text|
+          tokens = @tokenizer.text_to_ids(text, nil, 100) # unk token
+          tokens.unshift(101) # cls token
+          tokens << 102 # sep token
+          tokens
+        end
+      max_tokens = input_ids.map(&:size).max
+      attention_mask = []
+      input_ids.each do |ids|
+        zeros = [0] * (max_tokens - ids.size)
+        mask = ([1] * ids.size) + zeros
+        attention_mask << mask
+        ids.concat(zeros)
+      end
+      # infer
+      input = {
+        input_ids: input_ids,
+        attention_mask: attention_mask
+      }
+      output = @model.predict(input)
+      scores = output["output_0"] || output["last_hidden_state"]
+      singular ? scores.first : scores
+    end
+  end
+end
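
The padding step in `predict` above right-pads every token sequence to the longest one in the batch and marks real tokens with 1 and padding with 0. A standalone sketch of that step, assuming a pad id of 0 as in the diff; `pad_batch` is a hypothetical helper and the token ids are illustrative, not real tokenizer output. Unlike the code in the diff, it returns new arrays rather than mutating `input_ids` in place:

```ruby
# Hypothetical helper (not part of the gem): pads each sequence to the batch
# maximum and builds the matching attention mask without mutating the input.
def pad_batch(input_ids, pad_id: 0)
  max_tokens = input_ids.map(&:size).max || 0
  padded = input_ids.map { |ids| ids + [pad_id] * (max_tokens - ids.size) }
  attention_mask = input_ids.map do |ids|
    [1] * ids.size + [0] * (max_tokens - ids.size)
  end
  { input_ids: padded, attention_mask: attention_mask }
end

# Illustrative ids only: 101 and 102 stand in for the cls and sep tokens.
batch = pad_batch([[101, 1188, 102], [101, 1188, 1110, 7688, 102]])
```

The resulting hash has the same keys that the diff passes to `@model.predict`.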

data/lib/informers/fill_mask.rb ADDED

@@ -0,0 +1,109 @@
+# Copyright 2018 The HuggingFace Inc. team.
+# Copyright 2021 Andrew Kane.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+module Informers
+  class FillMask
+    def initialize(model_path)
+      encoder_path = File.expand_path("../../vendor/roberta.bin", __dir__)
+      @encoder = BlingFire.load_model(encoder_path, prefix: false)
+      decoder_path = File.expand_path("../../vendor/roberta.i2w", __dir__)
+      @decoder = BlingFire.load_model(decoder_path)
+      @model = OnnxRuntime::Model.new(model_path)
+    end
+    def predict(texts)
+      singular = !texts.is_a?(Array)
+      texts = [texts] if singular
+      mask_token = 50264
+      # tokenize
+      input_ids =
+        texts.map do |text|
+          tokens = @encoder.text_to_ids(text, nil, 3) # unk token
+          # add mask token
+          mask_sequence = [28696, 43776, 15698]
+          masks = []
+          (tokens.size - 2).times do |i|
+            masks << i if tokens[i..(i + 2)] == mask_sequence
+          end
+          masks.reverse.each do |mask|
+            tokens = tokens[0...mask] + [mask_token] + tokens[(mask + 3)..-1]
+          end
+          tokens.unshift(0) # cls token
+          tokens << 2 # sep token
+          tokens
+        end
+      max_tokens = input_ids.map(&:size).max
+      attention_mask = []
+      input_ids.each do |ids|
+        zeros = [0] * (max_tokens - ids.size)
+        mask = ([1] * ids.size) + zeros
+        attention_mask << mask
+        ids.concat(zeros)
+      end
+      input = {
+        input_ids: input_ids,
+        attention_mask: attention_mask
+      }
+      masked_index = input_ids.map { |v| v.each_index.select { |i| v[i] == mask_token } }
+      masked_index.each do |v|
+        raise "No mask_token (<mask>) found on the input" if v.size < 1
+        raise "More than one mask_token (<mask>) is not supported" if v.size > 1
+      end
+      res = @model.predict(input)
+      outputs = res["output_0"] || res["logits"]
+      batch_size = outputs.size
+      results = []
+      batch_size.times do |i|
+        result = []
+        logits = outputs[i][masked_index[i][0]]
+        values = logits.map { |v| Math.exp(v) }
+        sum = values.sum
+        probs = values.map { |v| v / sum }
+        res = probs.each_with_index.sort_by { |v| -v[0] }.first(5)
+        res.each do |(v, p)|
+          tokens = input[:input_ids][i].dup
+          tokens[masked_index[i][0]] = p
+          result << {
+            sequence: @decoder.ids_to_text(tokens),
+            score: v,
+            token: p,
+            # TODO figure out prefix space
+            token_str: @decoder.ids_to_text([p], skip_special_tokens: false)
+          }
+        end
+        results += [result]
+      end
+      singular ? results.first : results
+    end
+  end
+end
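
The scoring step above turns the logits at the masked position into probabilities with a softmax and keeps the five most likely token ids. A minimal sketch of that step in isolation; `softmax` and `top_k` are hypothetical helper names. One deliberate difference from the diff: this version subtracts the largest logit before exponentiating, a common stabilization that leaves the probabilities unchanged but avoids overflow for large logits:

```ruby
# Hypothetical helper: softmax with max subtraction for numerical stability.
def softmax(logits)
  max = logits.max
  exps = logits.map { |v| Math.exp(v - max) }
  sum = exps.sum
  exps.map { |v| v / sum }
end

# Hypothetical helper: returns [probability, token_id] pairs, highest first.
def top_k(logits, k = 5)
  softmax(logits).each_with_index.sort_by { |prob, _id| -prob }.first(k)
end

# Illustrative logits over a four-token vocabulary.
top = top_k([2.0, 1.0, 0.1, 3.0], 2)
```

Each pair has the same `(score, token id)` order that the diff destructures as `|(v, p)|`.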

data/lib/informers/ner.rb CHANGED

@@ -38,12 +38,12 @@ module Informers
           attention_mask: [[1] * tokens.size],
           token_type_ids: [[0] * tokens.size]
         }
-        output = @model.predict(input)
+        res = @model.predict(input)
         # transform
-        entities = output["output_0"][0]
+        output = res["output_0"] || res["logits"]
         score =
-          entities.map do |e|
+          output[0].map do |e|
             values = e.map { |v| Math.exp(v) }
             sum = values.sum
             values.map { |v| v / sum }

data/lib/informers/question_answering.rb CHANGED

@@ -16,9 +16,6 @@
 module Informers
   class QuestionAnswering
     def initialize(model_path)
-      # make sure Numo is available
-      require "numo/narray"
       tokenizer_path = File.expand_path("../../vendor/bert_base_cased_tok.bin", __dir__)
       @tokenizer = BlingFire.load_model(tokenizer_path)
       @model = OnnxRuntime::Model.new(model_path)
@@ -70,8 +67,8 @@ module Informers
       }
       output = @model.predict(input)
-      start = output["output_0"]
-      stop = output["output_1"]
+      start = output["output_0"] || output["start_logits"]
+      stop = output["output_1"] || output["end_logits"]
       # transform
       answers = []

data/lib/informers/sentiment_analysis.rb CHANGED

@@ -50,11 +50,12 @@ module Informers
         input_ids: input_ids,
         attention_mask: attention_mask
       }
-      output = @model.predict(input)
+      res = @model.predict(input)
+      output = res["output_0"] || res["logits"]
       # transform
       scores =
-        output["output_0"].map do |row|
+        output.map do |row|
           mapped = row.map { |v| Math.exp(v) }
           sum = mapped.sum
           mapped.map { |v| v / sum }

data/lib/informers/text_generation.rb ADDED

@@ -0,0 +1,54 @@
+# Copyright 2018 The HuggingFace Inc. team.
+# Copyright 2021 Andrew Kane.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+module Informers
+  class TextGeneration
+    def initialize(model_path)
+      encoder_path = File.expand_path("../../vendor/gpt2.bin", __dir__)
+      @encoder = BlingFire.load_model(encoder_path, prefix: false)
+      decoder_path = File.expand_path("../../vendor/gpt2.i2w", __dir__)
+      @decoder = BlingFire.load_model(decoder_path)
+      @model = OnnxRuntime::Model.new(model_path)
+    end
+    def predict(text, max_length: 50)
+      tokens = @encoder.text_to_ids(text)
+      input = {
+        input_ids: [tokens]
+      }
+      if @model.inputs.any? { |i| i[:name] == "attention_mask" }
+        input[:attention_mask] = [[1] * tokens.size]
+      end
+      output_name =
+        if @model.outputs.any? { |o| o[:name] == "output_0" }
+          "output_0"
+        else
+          "logits"
+        end
+      (max_length - tokens.size).times do |i|
+        output = @model.predict(input, output_type: :numo, output_names: [output_name])
+        # passed to input_ids
+        tokens << output[output_name][0, true, true][-1, true].max_index
+      end
+      @decoder.ids_to_text(tokens)
+    end
+  end
+end
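
`predict` above is a greedy decoding loop: it runs the model on all tokens so far, takes the highest-scoring id at the last position, appends it, and repeats until `max_length` tokens exist. A sketch of the loop with the ONNX model replaced by a stub; `greedy_generate` and `next_logits` are hypothetical names, and the stub's five-token vocabulary is invented for illustration:

```ruby
# Hypothetical stand-in for the model: returns one row of vocabulary logits
# per input position, always favouring the id after the current one (mod 5).
next_logits = lambda do |tokens|
  tokens.map { |t| Array.new(5) { |v| v == (t + 1) % 5 ? 1.0 : 0.0 } }
end

# Greedy decoding: append the argmax of the last position's logits until
# the sequence reaches max_length. The input array is not mutated.
def greedy_generate(tokens, max_length:, &model)
  tokens = tokens.dup
  (max_length - tokens.size).times do
    logits = model.call(tokens) # shape: [sequence, vocab]
    best_id = logits.last.each_with_index.max_by { |score, _id| score }.last
    tokens << best_id
  end
  tokens
end

generated = greedy_generate([0, 1], max_length: 6, &next_logits)
```

One detail worth checking in the diff: `input[:attention_mask]` is built once before the loop, while `tokens` (shared with `input[:input_ids]`) keeps growing, so the mask appears to stay at the prompt length. Whether that matters depends on the exported model's inputs.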

data/lib/informers/version.rb CHANGED

@@ -1,3 +1,3 @@
 module Informers
-  VERSION = "0.1.1"
+  VERSION = "0.2.0"
 end

data/lib/informers.rb CHANGED

@@ -1,9 +1,13 @@
 # dependencies
 require "blingfire"
+require "numo/narray"
 require "onnxruntime"
 # modules
+require "informers/feature_extraction"
+require "informers/fill_mask"
 require "informers/ner"
 require "informers/question_answering"
 require "informers/sentiment_analysis"
+require "informers/text_generation"
 require "informers/version"

data/vendor/LICENSE-gpt2.txt ADDED

@@ -0,0 +1,24 @@
+Modified MIT License
+Software Copyright (c) 2019 OpenAI
+We don’t claim ownership of the content you create with GPT-2, so it is yours to do with as you please.
+We only ask that you use GPT-2 responsibly and clearly indicate your content was created using GPT-2.
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and
+associated documentation files (the "Software"), to deal in the Software without restriction,
+including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
+and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so,
+subject to the following conditions:
+The above copyright notice and this permission notice shall be included
+in all copies or substantial portions of the Software.
+The above copyright notice and this permission notice need not be included
+with content created by the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
+INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
+TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE
+OR OTHER DEALINGS IN THE SOFTWARE.

data/vendor/LICENSE-roberta.txt ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) Facebook, Inc. and its affiliates.
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

data/vendor/gpt2.bin ADDED

Binary file

data/vendor/gpt2.i2w ADDED

Binary file

data/vendor/roberta.bin ADDED

Binary file

data/vendor/roberta.i2w ADDED

Binary file

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: informers
 version: !ruby/object:Gem::Version
-  version: 0.1.1
+  version: 0.2.0
 platform: ruby
 authors:
 - Andrew Kane
-autorequire:
+autorequire:
 bindir: bin
 cert_chain: []
-date: 2020-10-05 00:00:00.000000000 Z
+date: 2022-09-06 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: blingfire
@@ -16,16 +16,16 @@ dependencies:
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
-        version: 0.1.3
+        version: 0.1.7
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
-        version: 0.1.3
+        version: 0.1.7
 - !ruby/object:Gem::Dependency
-  name: onnxruntime
+  name: numo-narray
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - ">="
@@ -39,63 +39,21 @@ dependencies:
       - !ruby/object:Gem::Version
         version: '0'
 - !ruby/object:Gem::Dependency
-  name: bundler
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: '0'
-  type: :development
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: '0'
-- !ruby/object:Gem::Dependency
-  name: rake
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: '0'
-  type: :development
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: '0'
-- !ruby/object:Gem::Dependency
-  name: minitest
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: '5'
-  type: :development
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        version: '5'
-- !ruby/object:Gem::Dependency
-  name: numo-narray
+  name: onnxruntime
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
-        version: '0'
-  type: :development
+        version: 0.5.1
+  type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
-        version: '0'
-description:
-email: andrew@chartkick.com
+        version: 0.5.1
+description:
+email: andrew@ankane.org
 executables: []
 extensions: []
 extra_rdoc_files: []
@@ -104,19 +62,28 @@ files:
 - LICENSE.txt
 - README.md
 - lib/informers.rb
+- lib/informers/feature_extraction.rb
+- lib/informers/fill_mask.rb
 - lib/informers/ner.rb
 - lib/informers/question_answering.rb
 - lib/informers/sentiment_analysis.rb
+- lib/informers/text_generation.rb
 - lib/informers/version.rb
 - vendor/LICENSE-bert.txt
 - vendor/LICENSE-blingfire.txt
+- vendor/LICENSE-gpt2.txt
+- vendor/LICENSE-roberta.txt
 - vendor/bert_base_cased_tok.bin
 - vendor/bert_base_tok.bin
+- vendor/gpt2.bin
+- vendor/gpt2.i2w
+- vendor/roberta.bin
+- vendor/roberta.i2w
 homepage: https://github.com/ankane/informers
 licenses:
 - Apache-2.0
 metadata: {}
-post_install_message:
+post_install_message:
 rdoc_options: []
 require_paths:
 - lib
@@ -124,15 +91,15 @@ required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
-      version: '2.5'
+      version: '2.7'
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.1.2
-signing_key:
+rubygems_version: 3.3.7
+signing_key:
 specification_version: 4
 summary: State-of-the-art natural language processing for Ruby
 test_files: []
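
The `required_ruby_version` change from `2.5` to `2.7` in the metadata above is what enforces the "Dropped support for Ruby < 2.7" changelog entry at install time. A small sketch of how such a constraint evaluates, using the `Gem::Requirement` and `Gem::Version` classes that ship with RubyGems; the `supported?` helper is hypothetical:

```ruby
require "rubygems"

# The constraint recorded in the 0.2.0 gemspec metadata.
REQUIRED_RUBY = Gem::Requirement.new(">= 2.7")

# Hypothetical helper: checks a Ruby version string against the constraint.
def supported?(ruby_version)
  REQUIRED_RUBY.satisfied_by?(Gem::Version.new(ruby_version))
end

supported?("2.6.10") # rejected: below 2.7
supported?("2.7.0")  # accepted
supported?("3.1.2")  # accepted
```

`Gem::Version` compares version segments numerically, so a version such as `2.10` would rank above `2.7`, unlike a plain string comparison.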