RubyGems - informers - Versions diffs - 0.2.0 → 1.0.0 - Mend

informers 0.2.0 → 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (32)

checksums.yaml +4 -4
data/CHANGELOG.md +6 -0
data/README.md +63 -99
data/lib/informers/configs.rb +48 -0
data/lib/informers/env.rb +14 -0
data/lib/informers/model.rb +31 -0
data/lib/informers/models.rb +294 -0
data/lib/informers/pipelines.rb +439 -0
data/lib/informers/tokenizers.rb +141 -0
data/lib/informers/utils/core.rb +7 -0
data/lib/informers/utils/hub.rb +240 -0
data/lib/informers/utils/math.rb +44 -0
data/lib/informers/utils/tensor.rb +26 -0
data/lib/informers/version.rb +1 -1
data/lib/informers.rb +28 -9
metadata +21 -41
data/lib/informers/feature_extraction.rb +0 -59
data/lib/informers/fill_mask.rb +0 -109
data/lib/informers/ner.rb +0 -106
data/lib/informers/question_answering.rb +0 -197
data/lib/informers/sentiment_analysis.rb +0 -72
data/lib/informers/text_generation.rb +0 -54
data/vendor/LICENSE-bert.txt +0 -202
data/vendor/LICENSE-blingfire.txt +0 -21
data/vendor/LICENSE-gpt2.txt +0 -24
data/vendor/LICENSE-roberta.txt +0 -21
data/vendor/bert_base_cased_tok.bin +0 -0
data/vendor/bert_base_tok.bin +0 -0
data/vendor/gpt2.bin +0 -0
data/vendor/gpt2.i2w +0 -0
data/vendor/roberta.bin +0 -0
data/vendor/roberta.i2w +0 -0

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 22f7bcebf0670078b65fdf9cba4d2b937c853a3b10cf36e47f50781e2663225c
-  data.tar.gz: 940c96ec6b749b7e0b0c283456e40bfe9e6cbb3a58e8fa11f6367e87b05d8694
+  metadata.gz: 37ea3d1f5f6e4988731e3c3dd5854ede2fb0211a5dbde18fe70d09a713b12a1c
+  data.tar.gz: ac7b05dc9364e1984d35ccbfc2b7604d8ec9dc76f0f8c1a33f21ba489deed8f4
 SHA512:
-  metadata.gz: 4cd8b58aae6e885409e297bc1ba09aedd029bb3dc26a193251f33c2bf6c9f6a8da69cb3727f799296a8c6644b014afc715e783a1e19a1074982af531e40db57b
-  data.tar.gz: 6f63489d0b303e9a7de13df11d5074bd4cb2dfa44febee4061262d5c188eeb62a7c975e89567048f801fa183c8d56925275768fccc9a4b5a48255abeeb379345
+  metadata.gz: dcd02d4ff94ed472713de26e781cfbf963136eb07da1a9a195c4482c585e1b8ab19875583118f33669b10005bf08f607c09af040b3f53bbed896fb6d19fcf9e4
+  data.tar.gz: 990ea77bf9fdf859354d5532d0a1acefec6576b1d322efb41b27a43aa06b1f0fa2dea81825d0ca3631969bfd0aaf1323091defddc7ed951557370095ab7d209b
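
The hunk above only swaps the recorded SHA256 and SHA512 digests for `metadata.gz` and `data.tar.gz`. As a rough sketch of how such a digest could be checked locally, the snippet below compares a file's SHA256 with an expected hex string. The helper name `digest_matches?` and the throwaway sample file are assumptions for illustration, not part of the gem or of RubyGems.

```ruby
require "digest"
require "tempfile"

# Hypothetical helper: compare a file's SHA256 hex digest with an expected value.
def digest_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.downcase
end

# Demonstration with a throwaway file rather than a real gem member.
sample = Tempfile.new("data")
sample.write("hello")
sample.flush

expected = Digest::SHA256.hexdigest("hello")
puts digest_matches?(sample.path, expected)   # digest recorded for this content
puts digest_matches?(sample.path, "0" * 64)   # any other digest is rejected
```

For a real check, the path would point at an extracted `data.tar.gz` and the expected value would come from the `SHA256` entry in `checksums.yaml`.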

data/CHANGELOG.md CHANGED

@@ -1,3 +1,9 @@
+## 1.0.0 (2024-08-26)
+- Replaced task classes with `pipeline` method
+- Added `Model` class
+- Dropped support for Ruby < 3.1
 ## 0.2.0 (2022-09-06)
 - Added support for `optimum` and `transformers.onnx` models

data/README.md CHANGED

@@ -1,15 +1,10 @@
 # Informers
-:slightly_smiling_face: State-of-the-art natural language processing for Ruby
+:fire: Fast [transformer](https://github.com/xenova/transformers.js) inference for Ruby
-Supports:
+For non-ONNX models, check out [Transformers.rb](https://github.com/ankane/transformers-ruby)
-- Sentiment analysis
-- Question answering
-- Named-entity recognition
-- Text generation
-[![Build Status](https://github.com/ankane/informers/workflows/build/badge.svg?branch=master)](https://github.com/ankane/informers/actions)
+[![Build Status](https://github.com/ankane/informers/actions/workflows/build.yml/badge.svg)](https://github.com/ankane/informers/actions)
 ## Installation
@@ -21,140 +16,111 @@ gem "informers"
 ## Getting Started
-- [Sentiment analysis](#sentiment-analysis)
-- [Question answering](#question-answering)
-- [Named-entity recognition](#named-entity-recognition)
-- [Text generation](#text-generation)
-- [Feature extraction](#feature-extraction)
-- [Fill mask](#fill-mask)
+- [Models](#models)
+- [Pipelines](#pipelines)
-### Sentiment Analysis
+## Models
-First, download the [pretrained model](https://github.com/ankane/informers/releases/download/v0.1.0/sentiment-analysis.onnx).
+### sentence-transformers/all-MiniLM-L6-v2
-Predict sentiment
+[Docs](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
 ```ruby
-model = Informers::SentimentAnalysis.new("sentiment-analysis.onnx")
-model.predict("This is super cool")
-```
+sentences = ["This is an example sentence", "Each sentence is converted"]
-This returns
-```ruby
-{label: "positive", score: 0.999855186578301}
+model = Informers::Model.new("sentence-transformers/all-MiniLM-L6-v2")
+embeddings = model.embed(sentences)
 ```
-Predict multiple at once
+For a quantized version, use:
 ```ruby
-model.predict(["This is super cool", "I didn't like it"])
+model = Informers::Model.new("Xenova/all-MiniLM-L6-v2", quantized: true)
 ```
-### Question Answering
-First, download the [pretrained model](https://github.com/ankane/informers/releases/download/v0.1.0/question-answering.onnx).
-Ask a question with some context
+### Xenova/multi-qa-MiniLM-L6-cos-v1
-```ruby
-model = Informers::QuestionAnswering.new("question-answering.onnx")
-model.predict(
-  question: "Who invented Ruby?",
-  context: "Ruby is a programming language created by Matz"
-)
-```
-This returns
+[Docs](https://huggingface.co/Xenova/multi-qa-MiniLM-L6-cos-v1)
 ```ruby
-{answer: "Matz", score: 0.9980658360049758, start: 42, end: 46}
+query = "How many people live in London?"
+docs = ["Around 9 Million people live in London", "London is known for its financial district"]
+model = Informers::Model.new("Xenova/multi-qa-MiniLM-L6-cos-v1")
+query_embedding = model.embed(query)
+doc_embeddings = model.embed(docs)
+scores = doc_embeddings.map { |e| e.zip(query_embedding).sum { |d, q| d * q } }
+doc_score_pairs = docs.zip(scores).sort_by { |d, s| -s }
 ```
-### Named-Entity Recognition
+### mixedbread-ai/mxbai-embed-large-v1
-First, export the [pretrained model](tools/export.md).
-Get entities
+[Docs](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1)
 ```ruby
-model = Informers::NER.new("ner.onnx")
-model.predict("Nat works at GitHub in San Francisco")
-```
-This returns
-```ruby
-[
-  {text: "Nat",           tag: "person",   score: 0.9840519576513487, start: 0,  end: 3},
-  {text: "GitHub",        tag: "org",      score: 0.9426134775785775, start: 13, end: 19},
-  {text: "San Francisco", tag: "location", score: 0.9952414982243061, start: 23, end: 36}
+def transform_query(query)
+  "Represent this sentence for searching relevant passages: #{query}"
+end
+docs = [
+  transform_query("puppy"),
+  "The dog is barking",
+  "The cat is purring"
 ]
-```
-### Text Generation
+model = Informers::Model.new("mixedbread-ai/mxbai-embed-large-v1")
+embeddings = model.embed(docs)
+```
-First, export the [pretrained model](tools/export.md).
+## Pipelines
-Pass a prompt
+Named-entity recognition
 ```ruby
-model = Informers::TextGeneration.new("text-generation.onnx")
-model.predict("As far as I am concerned, I will", max_length: 50)
+ner = Informers.pipeline("ner")
+ner.("Ruby is a programming language created by Matz")
 ```
-This returns
+Sentiment analysis
-```text
-As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a "free market." I think that the idea of a free market is a bit of a stretch. I think that the idea
+```ruby
+classifier = Informers.pipeline("sentiment-analysis")
+classifier.("We are very happy to show you the 🤗 Transformers library.")
 ```
-### Feature Extraction
-First, export a [pretrained model](tools/export.md).
+Question answering
 ```ruby
-model = Informers::FeatureExtraction.new("feature-extraction.onnx")
-model.predict("This is super cool")
+qa = Informers.pipeline("question-answering")
+qa.("Who invented Ruby?", "Ruby is a programming language created by Matz")
 ```
-### Fill Mask
-First, export a [pretrained model](tools/export.md).
+Feature extraction
 ```ruby
-model = Informers::FillMask.new("fill-mask.onnx")
-model.predict("This is a great <mask>")
+extractor = Informers.pipeline("feature-extraction")
+extractor.("We are very happy to show you the 🤗 Transformers library.")
 ```
-## Models
-Task | Description | Contributor | License | Link
---- | --- | --- | --- | ---
-Sentiment analysis | DistilBERT fine-tuned on SST-2 | Hugging Face | Apache-2.0 | [Link](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
-Question answering | DistilBERT fine-tuned on SQuAD | Hugging Face | Apache-2.0 | [Link](https://huggingface.co/distilbert-base-cased-distilled-squad)
-Named-entity recognition | BERT fine-tuned on CoNLL03 | Bayerische Staatsbibliothek | In-progress | [Link](https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)
-Text generation | GPT-2 | OpenAI | [Custom](https://github.com/openai/gpt-2/blob/master/LICENSE) | [Link](https://huggingface.co/gpt2)
-Some models are [quantized](https://medium.com/microsoftazure/faster-and-smaller-quantized-nlp-with-hugging-face-and-onnx-runtime-ec5525473bb7) to make them faster and smaller.
-## Deployment
+## Credits
-Check out [Trove](https://github.com/ankane/trove) for deploying models.
+This library was ported from [Transformers.js](https://github.com/xenova/transformers.js) and is available under the same license.
-```sh
-trove push sentiment-analysis.onnx
-```
+## Upgrading
-## Credits
+### 1.0
-This project uses many state-of-the-art technologies:
+Task classes have been replaced with the `pipeline` method.
-- [Transformers](https://github.com/huggingface/transformers) for transformer models
-- [Bling Fire](https://github.com/microsoft/BlingFire) and [BERT](https://github.com/google-research/bert) for high-performance text tokenization
-- [ONNX Runtime](https://github.com/Microsoft/onnxruntime) for high-performance inference
+```ruby
+# before
+model = Informers::SentimentAnalysis.new("sentiment-analysis.onnx")
+model.predict("This is super cool")
-Some code was ported from Transformers and is available under the same license.
+# after
+model = Informers.pipeline("sentiment-analysis")
+model.("This is super cool")
+```
 ## History
@@ -175,7 +141,5 @@ To get started with development:
 git clone https://github.com/ankane/informers.git
 cd informers
 bundle install
-export MODELS_PATH=path/to/onnx/models
 bundle exec rake test
 ```
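
The new README's `Xenova/multi-qa-MiniLM-L6-cos-v1` example ranks documents by the dot product of each document embedding with the query embedding. The sketch below runs the same two lines on made-up three-dimensional vectors so the ranking step can be followed without downloading a model. The vector values are assumptions chosen for readability; real embeddings would come from `model.embed`.

```ruby
# Toy embeddings standing in for model.embed output (values are made up).
docs = ["Around 9 Million people live in London", "London is known for its financial district"]
query_embedding = [0.6, 0.8, 0.0]
doc_embeddings = [
  [0.6, 0.8, 0.0],
  [0.0, 0.6, 0.8]
]

# Dot product of each document vector with the query vector.
scores = doc_embeddings.map { |e| e.zip(query_embedding).sum { |d, q| d * q } }

# Sort documents from highest to lowest score.
ranked = docs.zip(scores).sort_by { |_doc, score| -score }
ranked.each { |doc, score| puts format("%.2f  %s", score, doc) }
```

When the embeddings are normalized to unit length, this dot product equals the cosine similarity, so a higher score means a closer match.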

data/lib/informers/configs.rb ADDED

@@ -0,0 +1,48 @@
+module Informers
+  class PretrainedConfig
+    attr_reader :model_type, :problem_type, :id2label
+    def initialize(config_json)
+      @is_encoder_decoder = false
+      @model_type = config_json["model_type"]
+      @problem_type = config_json["problem_type"]
+      @id2label = config_json["id2label"]
+    end
+    def [](key)
+      instance_variable_get("@#{key}")
+    end
+    def self.from_pretrained(
+      pretrained_model_name_or_path,
+      progress_callback: nil,
+      config: nil,
+      cache_dir: nil,
+      local_files_only: false,
+      revision: "main",
+      **kwargs
+    )
+      data = config || load_config(
+        pretrained_model_name_or_path,
+        progress_callback:,
+        config:,
+        cache_dir:,
+        local_files_only:,
+        revision:
+      )
+      new(data)
+    end
+    def self.load_config(pretrained_model_name_or_path, **options)
+      info = Utils::Hub.get_model_json(pretrained_model_name_or_path, "config.json", true, **options)
+      info
+    end
+  end
+  class AutoConfig
+    def self.from_pretrained(...)
+      PretrainedConfig.from_pretrained(...)
+    end
+  end
+end
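
`PretrainedConfig` above does two things worth seeing in isolation: it exposes parsed JSON fields through `[]` by looking up the instance variable of the same name, and it prefers a caller-supplied `config` over loading one. The following is a minimal standalone sketch of that pattern. `TinyConfig` and its loader block are hypothetical stand-ins, and the JSON is invented for the demonstration.

```ruby
require "json"

# Hypothetical stand-in for PretrainedConfig; it reads the same three keys.
class TinyConfig
  attr_reader :model_type, :problem_type, :id2label

  def initialize(config_json)
    @model_type = config_json["model_type"]
    @problem_type = config_json["problem_type"]
    @id2label = config_json["id2label"]
  end

  # Hash-style access that maps a key onto the instance variable of the same name.
  def [](key)
    instance_variable_get("@#{key}")
  end

  # Use a caller-supplied hash when given; otherwise call the loader block.
  def self.from_pretrained(config: nil, &loader)
    new(config || loader.call)
  end
end

raw = JSON.parse('{"model_type": "distilbert", "id2label": {"0": "NEGATIVE", "1": "POSITIVE"}}')
config = TinyConfig.from_pretrained(config: raw)
puts config["model_type"]
puts config.id2label["1"]
```

Keys that were never stored, such as `problem_type` in this sample, come back as `nil` from `[]` rather than raising.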

data/lib/informers/env.rb ADDED

@@ -0,0 +1,14 @@
+module Informers
+  CACHE_HOME = ENV.fetch("XDG_CACHE_HOME", File.join(ENV.fetch("HOME"), ".cache"))
+  DEFAULT_CACHE_DIR = File.expand_path(File.join(CACHE_HOME, "informers"))
+  class << self
+    attr_accessor :allow_remote_models, :remote_host, :remote_path_template, :cache_dir
+  end
+  self.allow_remote_models = ENV["INFORMERS_OFFLINE"].to_s.empty?
+  self.remote_host = "https://huggingface.co/"
+  self.remote_path_template = "{model}/resolve/{revision}/"
+  self.cache_dir = DEFAULT_CACHE_DIR
+end
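
`env.rb` sets the cache directory, the offline switch, and the remote URL pieces. The sketch below reproduces the cache and offline logic against plain hashes so it can be tried without touching the real environment. How the library expands `"{model}/resolve/{revision}/"` is not shown in this part of the diff, so `remote_url` is an assumed, hypothetical expansion rather than the gem's own code.

```ruby
# Same lookup order as env.rb: XDG_CACHE_HOME first, then ~/.cache.
def cache_dir_for(env)
  cache_home = env.fetch("XDG_CACHE_HOME") { File.join(env.fetch("HOME"), ".cache") }
  File.expand_path(File.join(cache_home, "informers"))
end

# Remote loading is allowed only when INFORMERS_OFFLINE is unset or empty.
def allow_remote?(env)
  env["INFORMERS_OFFLINE"].to_s.empty?
end

# Assumed expansion of the "{model}/resolve/{revision}/" template (hypothetical helper).
def remote_url(host, template, model, revision, file)
  path = template.gsub("{model}", model).gsub("{revision}", revision)
  "#{host}#{path}#{file}"
end

puts cache_dir_for({ "HOME" => "/home/me" })
puts allow_remote?({ "INFORMERS_OFFLINE" => "1" })
puts remote_url("https://huggingface.co/", "{model}/resolve/{revision}/",
                "Xenova/all-MiniLM-L6-v2", "main", "config.json")
```

One difference from the diff: the sketch uses the block form of `fetch`, which only looks up `HOME` when `XDG_CACHE_HOME` is missing, whereas the two-argument form in `env.rb` evaluates the `HOME` lookup in every case.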

data/lib/informers/model.rb ADDED

@@ -0,0 +1,31 @@
+module Informers
+  class Model
+    def initialize(model_id, quantized: false)
+      @model_id = model_id
+      @model = Informers.pipeline("feature-extraction", model_id, quantized: quantized)
+      # TODO better pattern
+      if model_id == "sentence-transformers/all-MiniLM-L6-v2"
+        @model.instance_variable_get(:@model).instance_variable_set(:@output_names, ["sentence_embedding"])
+      end
+    end
+    def embed(texts)
+      is_batched = texts.is_a?(Array)
+      texts = [texts] unless is_batched
+      case @model_id
+      when "sentence-transformers/all-MiniLM-L6-v2"
+        output = @model.(texts)
+      when "Xenova/all-MiniLM-L6-v2", "Xenova/multi-qa-MiniLM-L6-cos-v1"
+        output = @model.(texts, pooling: "mean", normalize: true)
+      when "mixedbread-ai/mxbai-embed-large-v1"
+        output = @model.(texts, pooling: "cls")
+      else
+        raise Error, "model not supported: #{@model_id}"
+      end
+      is_batched ? output : output[0]
+    end
+  end
+end
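
`Model#embed` picks a pooling strategy per model: `pooling: "mean"` with `normalize: true` for the Xenova MiniLM models and `pooling: "cls"` for `mixedbread-ai/mxbai-embed-large-v1`. The pipeline code that performs the pooling is outside this part of the diff, so the following is only an illustrative sketch of what those options conventionally mean, using made-up two-dimensional token vectors. All three helper names are hypothetical.

```ruby
# Average token vectors into one sentence vector (mean pooling).
# Assumes every token counts equally; real pipelines typically weight by the attention mask.
def mean_pool(token_vectors)
  dims = token_vectors.first.size
  (0...dims).map { |i| token_vectors.sum { |v| v[i] } / token_vectors.size.to_f }
end

# CLS pooling keeps only the first token's vector.
def cls_pool(token_vectors)
  token_vectors.first
end

# Scale a vector to unit L2 length so dot products behave like cosine similarity.
def l2_normalize(vector)
  norm = Math.sqrt(vector.sum { |x| x * x })
  vector.map { |x| x / norm }
end

tokens = [[1.0, 2.0], [5.0, 6.0]]   # made-up token embeddings
pooled = mean_pool(tokens)           # => [3.0, 4.0]
unit = l2_normalize(pooled)          # => [0.6, 0.8]
p pooled, unit, cls_pool(tokens)
```

Note that `embed` raises for any model ID outside the four listed in its `case` statement, so the choice of pooling is fixed per supported model rather than configurable by the caller.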

data/lib/informers/models.rb ADDED

@@ -0,0 +1,294 @@
+module Informers
+  MODEL_TYPES = {
+    EncoderOnly: 0,
+    EncoderDecoder: 1,
+    Seq2Seq: 2,
+    Vision2Seq: 3,
+    DecoderOnly: 4,
+    MaskGeneration: 5
+  }
+  # NOTE: These will be populated fully later
+  MODEL_TYPE_MAPPING = {}
+  MODEL_NAME_TO_CLASS_MAPPING = {}
+  MODEL_CLASS_TO_NAME_MAPPING = {}
+  class PretrainedMixin
+    def self.from_pretrained(
+      pretrained_model_name_or_path,
+      quantized: true,
+      progress_callback: nil,
+      config: nil,
+      cache_dir: nil,
+      local_files_only: false,
+      revision: "main",
+      model_file_name: nil
+    )
+      options = {
+        quantized:,
+        progress_callback:,
+        config:,
+        cache_dir:,
+        local_files_only:,
+        revision:,
+        model_file_name:
+      }
+      config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **options)
+      if options[:config].nil?
+        # If no config was passed, reuse this config for future processing
+        options[:config] = config
+      end
+      if !const_defined?(:MODEL_CLASS_MAPPINGS)
+        raise Error, "`MODEL_CLASS_MAPPINGS` not implemented for this type of `AutoClass`: #{name}"
+      end
+      const_get(:MODEL_CLASS_MAPPINGS).each do |model_class_mapping|
+        model_info = model_class_mapping[config.model_type]
+        if !model_info
+          next # Item not found in this mapping
+        end
+        return model_info[1].from_pretrained(pretrained_model_name_or_path, **options)
+      end
+      if const_defined?(:BASE_IF_FAIL)
+        warn "Unknown model class #{config.model_type.inspect}, attempting to construct from base class."
+        PreTrainedModel.from_pretrained(pretrained_model_name_or_path, **options)
+      else
+        raise Error, "Unsupported model type: #{config.model_type}"
+      end
+    end
+  end
+  class PreTrainedModel
+    attr_reader :config
+    def initialize(config, session)
+      super()
+      @config = config
+      @session = session
+      @output_names = nil
+      model_name = MODEL_CLASS_TO_NAME_MAPPING[self.class]
+      model_type = MODEL_TYPE_MAPPING[model_name]
+      case model_type
+      when MODEL_TYPES[:DecoderOnly]
+        raise Todo
+      when MODEL_TYPES[:Seq2Seq], MODEL_TYPES[:Vision2Seq]
+        raise Todo
+      when MODEL_TYPES[:EncoderDecoder]
+        raise Todo
+      else
+        @forward = method(:encoder_forward)
+      end
+    end
+    def self.from_pretrained(
+      pretrained_model_name_or_path,
+      quantized: true,
+      progress_callback: nil,
+      config: nil,
+      cache_dir: nil,
+      local_files_only: false,
+      revision: "main",
+      model_file_name: nil
+    )
+      options = {
+        quantized:,
+        progress_callback:,
+        config:,
+        cache_dir:,
+        local_files_only:,
+        revision:,
+        model_file_name:
+      }
+      model_name = MODEL_CLASS_TO_NAME_MAPPING[self]
+      model_type = MODEL_TYPE_MAPPING[model_name]
+      if model_type == MODEL_TYPES[:DecoderOnly]
+        raise Todo
+      elsif model_type == MODEL_TYPES[:Seq2Seq] || model_type == MODEL_TYPES[:Vision2Seq]
+        raise Todo
+      elsif model_type == MODEL_TYPES[:MaskGeneration]
+        raise Todo
+      elsif model_type == MODEL_TYPES[:EncoderDecoder]
+        raise Todo
+      else
+        if model_type != MODEL_TYPES[:EncoderOnly]
+          warn "Model type for '#{model_name || config&.model_type}' not found, assuming encoder-only architecture. Please report this."
+        end
+        info = [
+          AutoConfig.from_pretrained(pretrained_model_name_or_path, **options),
+          construct_session(pretrained_model_name_or_path, options[:model_file_name] || "model", **options)
+        ]
+      end
+      new(*info)
+    end
+    def self.construct_session(pretrained_model_name_or_path, file_name, **options)
+      model_file_name = "onnx/#{file_name}#{options[:quantized] ? "_quantized" : ""}.onnx"
+      path = Utils::Hub.get_model_file(pretrained_model_name_or_path, model_file_name, true, **options)
+      OnnxRuntime::InferenceSession.new(path)
+    end
+    def call(model_inputs)
+      @forward.(model_inputs)
+    end
+    private
+    def encoder_forward(model_inputs)
+      encoder_feeds = {}
+      @session.inputs.each do |input|
+        key = input[:name].to_sym
+        encoder_feeds[key] = model_inputs[key]
+      end
+      if @session.inputs.any? { |v| v[:name] == "token_type_ids" } && !encoder_feeds[:token_type_ids]
+        raise Todo
+      end
+      session_run(@session, encoder_feeds)
+    end
+    def session_run(session, inputs)
+      checked_inputs = validate_inputs(session, inputs)
+      begin
+        output = session.run(@output_names, checked_inputs)
+        output = replace_tensors(output)
+        output
+      rescue => e
+        raise e
+      end
+    end
+    # TODO
+    def replace_tensors(obj)
+      obj
+    end
+    # TODO
+    def validate_inputs(session, inputs)
+      inputs
+    end
+  end
+  class BertPreTrainedModel < PreTrainedModel
+  end
+  class BertModel < BertPreTrainedModel
+  end
+  class BertForSequenceClassification < BertPreTrainedModel
+    def call(model_inputs)
+      SequenceClassifierOutput.new(*super(model_inputs))
+    end
+  end
+  class BertForTokenClassification < BertPreTrainedModel
+    def call(model_inputs)
+      TokenClassifierOutput.new(*super(model_inputs))
+    end
+  end
+  class DistilBertPreTrainedModel < PreTrainedModel
+  end
+  class DistilBertModel < DistilBertPreTrainedModel
+  end
+  class DistilBertForSequenceClassification < DistilBertPreTrainedModel
+    def call(model_inputs)
+      SequenceClassifierOutput.new(*super(model_inputs))
+    end
+  end
+  class DistilBertForQuestionAnswering < DistilBertPreTrainedModel
+    def call(model_inputs)
+      QuestionAnsweringModelOutput.new(*super(model_inputs))
+    end
+  end
+  MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES = {
+    "bert" => ["BertForSequenceClassification", BertForSequenceClassification],
+    "distilbert" => ["DistilBertForSequenceClassification", DistilBertForSequenceClassification]
+  }
+  MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES = {
+    "bert" => ["BertForTokenClassification", BertForTokenClassification]
+  }
+  MODEL_FOR_QUESTION_ANSWERING_MAPPING_NAMES = {
+    "distilbert" => ["DistilBertForQuestionAnswering", DistilBertForQuestionAnswering]
+  }
+  MODEL_CLASS_TYPE_MAPPING = [
+    [MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES, MODEL_TYPES[:EncoderOnly]],
+    [MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES, MODEL_TYPES[:EncoderOnly]],
+    [MODEL_FOR_QUESTION_ANSWERING_MAPPING_NAMES, MODEL_TYPES[:EncoderOnly]]
+  ]
+  MODEL_CLASS_TYPE_MAPPING.each do |mappings, type|
+    mappings.values.each do |name, model|
+      MODEL_TYPE_MAPPING[name] = type
+      MODEL_CLASS_TO_NAME_MAPPING[model] = name
+      MODEL_NAME_TO_CLASS_MAPPING[name] = model
+    end
+  end
+  class AutoModel < PretrainedMixin
+    MODEL_CLASS_MAPPINGS = MODEL_CLASS_TYPE_MAPPING.map { |x| x[0] }
+    BASE_IF_FAIL = true
+  end
+  class AutoModelForSequenceClassification < PretrainedMixin
+    MODEL_CLASS_MAPPINGS = [MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES]
+  end
+  class AutoModelForTokenClassification < PretrainedMixin
+    MODEL_CLASS_MAPPINGS = [MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES]
+  end
+  class AutoModelForQuestionAnswering < PretrainedMixin
+    MODEL_CLASS_MAPPINGS = [MODEL_FOR_QUESTION_ANSWERING_MAPPING_NAMES]
+  end
+  class ModelOutput
+  end
+  class SequenceClassifierOutput < ModelOutput
+    attr_reader :logits
+    def initialize(logits)
+      super()
+      @logits = logits
+    end
+  end
+  class TokenClassifierOutput < ModelOutput
+    attr_reader :logits
+    def initialize(logits)
+      super()
+      @logits = logits
+    end
+  end
+  class QuestionAnsweringModelOutput < ModelOutput
+    attr_reader :start_logits, :end_logits
+    def initialize(start_logits, end_logits)
+      super()
+      @start_logits = start_logits
+      @end_logits = end_logits
+    end
+  end
+end
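
Two small rules in `models.rb` carry most of the loading logic: `PretrainedMixin.from_pretrained` walks a list of `model_type => [name, class]` mappings and uses the first hit, and `construct_session` derives the ONNX file name from the `quantized` flag. The sketch below isolates both with toy classes. `ToyBert`, `ToyDistilBert`, `resolve_model_class`, and `onnx_file_name` are hypothetical names introduced for illustration, and the sketch raises `ArgumentError` where the gem raises its own `Error`.

```ruby
# Toy model classes standing in for BertForSequenceClassification and friends.
class ToyBert; end
class ToyDistilBert; end

# Each mapping goes from a config model_type to [name, class], as in models.rb.
SEQUENCE_MAPPING = {
  "bert" => ["ToyBert", ToyBert],
  "distilbert" => ["ToyDistilBert", ToyDistilBert]
}

# Walk the mappings in order and return the first class registered for model_type.
def resolve_model_class(mappings, model_type)
  mappings.each do |mapping|
    info = mapping[model_type]
    return info[1] if info
  end
  raise ArgumentError, "Unsupported model type: #{model_type}"
end

# Mirror of the file-name rule in construct_session.
def onnx_file_name(file_name, quantized:)
  "onnx/#{file_name}#{quantized ? "_quantized" : ""}.onnx"
end

puts resolve_model_class([SEQUENCE_MAPPING], "distilbert")
puts onnx_file_name("model", quantized: true)
puts onnx_file_name("model", quantized: false)
```

In the diff, `from_pretrained` defaults to `quantized: true` while `Informers::Model.new` defaults to `quantized: false`, so which file is fetched depends on the entry point used.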