RubyGems - informers - Versions diffs - 1.0.2 → 1.0.3 - Mend

informers 1.0.2 → 1.0.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (9)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 4ea317272c5054b01616643e7e0f0b2b2fe0c4a87fe8399350a6b8d0a279c5a1
-  data.tar.gz: 530f8aaab9a5ca71811a82adca0272e2ca84525bcf1f60f2209c394cbd0f9c2a
+  metadata.gz: f5340da0bce9d55a0339fac6b8806f09119df3e89567ecb37a77e1a5921b8fa2
+  data.tar.gz: 66a9d275cb2999ad14ba1cfd900bdcbf9fdc3d26ce29387acdd74452bf2050ef
 SHA512:
-  metadata.gz: 76059b486e6f6c0b0054450f76813dd4bf12845da6f46e8089585cd1a69be7db86a0acf446cc5a18e48108393403324626f6656d09bdb69083f2651abc0d2448
-  data.tar.gz: f466f5382edd76a7092dc6ada349a3e58fe7eedcd481726ca765f8ddfb4543b7269dab96c00a93d10b0fd67f800afd70a619cfb15d78dde494b29cc13d21ef1a
+  metadata.gz: a4a0c3da3d8a3555a6f2debca8f2939b6536ac76386cdd6c7264890b2d00842d537ecfca352021fa349ff9c4636ba49c189f652a66676746d9ec2a8d97eecc2a
+  data.tar.gz: a06aa115b5966fd1b8da7a80d8481d3e61778f31c3bb0da143f329e81ae3f73d4a1d1b2ee01672f4e90742a35d68a23dd5c871c3b68ffad0c16d8e5de480a60f
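
The values in checksums.yaml are hex digests of the two archives packed inside the `.gem` file. A minimal sketch of recomputing them with Ruby's standard `digest` library follows. The helper name `archive_digests` and the file path in the usage note are assumptions for illustration and are not part of the gem.

```ruby
require "digest"

# Hypothetical helper: returns the SHA256 and SHA512 hex digests of a file,
# mirroring the two sections of checksums.yaml.
def archive_digests(path)
  {
    "SHA256" => Digest::SHA256.file(path).hexdigest,
    "SHA512" => Digest::SHA512.file(path).hexdigest
  }
end
```

Assuming the `.gem` archive has been extracted so that `data.tar.gz` sits in the working directory, `archive_digests("data.tar.gz")["SHA256"]` can be compared against the `data.tar.gz` entry listed under `SHA256` above.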

data/CHANGELOG.md CHANGED

@@ -1,3 +1,8 @@
+## 1.0.3 (2024-08-29)
+- Added `model_output` option
+- Improved `model_file_name` option
 ## 1.0.2 (2024-08-28)
 - Added `embedding` pipeline

data/README.md CHANGED

@@ -30,10 +30,15 @@ Embedding
 - [intfloat/e5-base-v2](#intfloate5-base-v2)
 - [nomic-ai/nomic-embed-text-v1](#nomic-ainomic-embed-text-v1)
 - [BAAI/bge-base-en-v1.5](#baaibge-base-en-v15)
+- [jinaai/jina-embeddings-v2-base-en](#jinaaijina-embeddings-v2-base-en)
+- [Snowflake/snowflake-arctic-embed-m-v1.5](#snowflakesnowflake-arctic-embed-m-v15)
+- [Xenova/all-mpnet-base-v2](#xenovaall-mpnet-base-v2)
-Reranking (experimental)
+Reranking
 - [mixedbread-ai/mxbai-rerank-base-v1](#mixedbread-aimxbai-rerank-base-v1)
+- [jinaai/jina-reranker-v1-turbo-en](#jinaaijina-reranker-v1-turbo-en)
+- [BAAI/bge-reranker-base](#baaibge-reranker-base)
 ### sentence-transformers/all-MiniLM-L6-v2
@@ -72,18 +77,16 @@ doc_score_pairs = docs.zip(scores).sort_by { |d, s| -s }
 [Docs](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1)
 ```ruby
-def transform_query(query)
-  "Represent this sentence for searching relevant passages: #{query}"
-end
+query_prefix = "Represent this sentence for searching relevant passages: "
-docs = [
-  transform_query("puppy"),
+input = [
   "The dog is barking",
-  "The cat is purring"
+  "The cat is purring",
+  query_prefix + "puppy"
 ]
 model = Informers.pipeline("embedding", "mixedbread-ai/mxbai-embed-large-v1")
-embeddings = model.(docs)
+embeddings = model.(input)
 ```
 ### Supabase/gte-small
@@ -102,9 +105,12 @@ embeddings = model.(sentences)
 [Docs](https://huggingface.co/intfloat/e5-base-v2)
 ```ruby
+doc_prefix = "passage: "
+query_prefix = "query: "
 input = [
-  "passage: Ruby is a programming language created by Matz",
-  "query: Ruby creator"
+  doc_prefix + "Ruby is a programming language created by Matz",
+  query_prefix + "Ruby creator"
 ]
 model = Informers.pipeline("embedding", "intfloat/e5-base-v2")
@@ -116,9 +122,13 @@ embeddings = model.(input)
 [Docs](https://huggingface.co/nomic-ai/nomic-embed-text-v1)
 ```ruby
+doc_prefix = "search_document: "
+query_prefix = "search_query: "
 input = [
-  "search_document: The dog is barking",
-  "search_query: puppy"
+  doc_prefix + "The dog is barking",
+  doc_prefix + "The cat is purring",
+  query_prefix + "puppy"
 ]
 model = Informers.pipeline("embedding", "nomic-ai/nomic-embed-text-v1")
@@ -130,20 +140,57 @@ embeddings = model.(input)
 [Docs](https://huggingface.co/BAAI/bge-base-en-v1.5)
 ```ruby
-def transform_query(query)
-  "Represent this sentence for searching relevant passages: #{query}"
-end
+query_prefix = "Represent this sentence for searching relevant passages: "
 input = [
-  transform_query("puppy"),
   "The dog is barking",
-  "The cat is purring"
+  "The cat is purring",
+  query_prefix + "puppy"
 ]
 model = Informers.pipeline("embedding", "BAAI/bge-base-en-v1.5")
 embeddings = model.(input)
 ```
+### jinaai/jina-embeddings-v2-base-en
+[Docs](https://huggingface.co/jinaai/jina-embeddings-v2-base-en)
+```ruby
+sentences = ["How is the weather today?", "What is the current weather like today?"]
+model = Informers.pipeline("embedding", "jinaai/jina-embeddings-v2-base-en", model_file_name: "../model")
+embeddings = model.(sentences)
+```
+### Snowflake/snowflake-arctic-embed-m-v1.5
+[Docs](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5)
+```ruby
+query_prefix = "Represent this sentence for searching relevant passages: "
+input = [
+  "The dog is barking",
+  "The cat is purring",
+  query_prefix + "puppy"
+]
+model = Informers.pipeline("embedding", "Snowflake/snowflake-arctic-embed-m-v1.5")
+embeddings = model.(input, model_output: "sentence_embedding", pooling: "none")
+```
+### Xenova/all-mpnet-base-v2
+[Docs](https://huggingface.co/Xenova/all-mpnet-base-v2)
+```ruby
+sentences = ["This is an example sentence", "Each sentence is converted"]
+model = Informers.pipeline("embedding", "Xenova/all-mpnet-base-v2")
+embeddings = model.(sentences)
+```
 ### mixedbread-ai/mxbai-rerank-base-v1
 [Docs](https://huggingface.co/mixedbread-ai/mxbai-rerank-base-v1)
@@ -156,6 +203,30 @@ model = Informers.pipeline("reranking", "mixedbread-ai/mxbai-rerank-base-v1")
 result = model.(query, docs)
 ```
+### jinaai/jina-reranker-v1-turbo-en
+[Docs](https://huggingface.co/jinaai/jina-reranker-v1-turbo-en)
+```ruby
+query = "How many people live in London?"
+docs = ["Around 9 Million people live in London", "London is known for its financial district"]
+model = Informers.pipeline("reranking", "jinaai/jina-reranker-v1-turbo-en")
+result = model.(query, docs)
+```
+### BAAI/bge-reranker-base
+[Docs](https://huggingface.co/BAAI/bge-reranker-base)
+```ruby
+query = "How many people live in London?"
+docs = ["Around 9 Million people live in London", "London is known for its financial district"]
+model = Informers.pipeline("reranking", "BAAI/bge-reranker-base")
+result = model.(query, docs)
+```
 ### Other
 You can use the feature extraction pipeline directly.
@@ -165,7 +236,7 @@ model = Informers.pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2", quan
 embeddings = model.(sentences, pooling: "mean", normalize: true)
 ```
-The model files must include `onnx/model.onnx` or `onnx/model_quantized.onnx` ([example](https://huggingface.co/Xenova/all-MiniLM-L6-v2/tree/main/onnx)).
+The model must include a `.onnx` file ([example](https://huggingface.co/Xenova/all-MiniLM-L6-v2/tree/main/onnx)). If the file is not at `onnx/model.onnx` or `onnx/model_quantized.onnx`, use the `model_file_name` option to specify the location.
 ## Pipelines
@@ -176,7 +247,7 @@ embed = Informers.pipeline("embedding")
 embed.("We are very happy to show you the 🤗 Transformers library.")
 ```
-Reranking (experimental)
+Reranking
 ```ruby
 rerank = Informers.pipeline("reranking")

data/lib/informers/model.rb CHANGED

@@ -6,19 +6,14 @@ module Informers
     end
     def embed(texts)
-      is_batched = texts.is_a?(Array)
-      texts = [texts] unless is_batched
       case @model_id
       when "sentence-transformers/all-MiniLM-L6-v2", "Xenova/all-MiniLM-L6-v2", "Xenova/multi-qa-MiniLM-L6-cos-v1", "Supabase/gte-small"
-        output = @model.(texts)
+        @model.(texts)
       when "mixedbread-ai/mxbai-embed-large-v1"
-        output = @model.(texts, pooling: "cls", normalize: false)
+        @model.(texts, pooling: "cls", normalize: false)
       else
         raise Error, "Use the embedding pipeline for this model: #{@model_id}"
       end
-      is_batched ? output : output[0]
     end
   end
 end
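
The lines removed from `embed` implemented a wrap-and-unwrap pattern: a single string is promoted to a one-element array before the call, and the lone result is unwrapped afterwards. The sketch below isolates that pattern. The helper `with_batching` and the stand-in block are hypothetical, and it is only an assumption (the diff does not show it) that the underlying pipeline now handles this step itself.

```ruby
# Hypothetical helper isolating the pattern the removed lines implemented.
def with_batching(texts)
  is_batched = texts.is_a?(Array)
  texts = [texts] unless is_batched
  output = yield(texts)            # the block always receives an Array
  is_batched ? output : output[0]  # unwrap when a single text was passed
end

# Stand-in for a model call: maps each text to its length.
single = with_batching("puppy") { |batch| batch.map(&:length) }
batch  = with_batching(["puppy", "kitten"]) { |b| b.map(&:length) }

puts single.inspect  # 5
puts batch.inspect   # [5, 6]
```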

data/lib/informers/models.rb CHANGED

@@ -135,7 +135,15 @@ module Informers
     end
     def self.construct_session(pretrained_model_name_or_path, file_name, **options)
-      model_file_name = "onnx/#{file_name}#{options[:quantized] ? "_quantized" : ""}.onnx"
+      prefix = "onnx/"
+      if file_name.start_with?("../")
+        prefix = ""
+        file_name = file_name[3..]
+      elsif file_name.start_with?("/")
+        prefix = ""
+        file_name = file_name[1..]
+      end
+      model_file_name = "#{prefix}#{file_name}#{options[:quantized] ? "_quantized" : ""}.onnx"
       path = Utils::Hub.get_model_file(pretrained_model_name_or_path, model_file_name, true, **options)
       OnnxRuntime::InferenceSession.new(path)
@@ -229,16 +237,37 @@ module Informers
     end
   end
+  class MPNetPreTrainedModel < PreTrainedModel
+  end
+  class MPNetModel < MPNetPreTrainedModel
+  end
+  class XLMRobertaPreTrainedModel < PreTrainedModel
+  end
+  class XLMRobertaModel < XLMRobertaPreTrainedModel
+  end
+  class XLMRobertaForSequenceClassification < XLMRobertaPreTrainedModel
+    def call(model_inputs)
+      SequenceClassifierOutput.new(*super(model_inputs))
+    end
+  end
   MODEL_MAPPING_NAMES_ENCODER_ONLY = {
     "bert" => ["BertModel", BertModel],
     "nomic_bert" => ["NomicBertModel", NomicBertModel],
     "deberta-v2" => ["DebertaV2Model", DebertaV2Model],
-    "distilbert" => ["DistilBertModel", DistilBertModel]
+    "mpnet" => ["MPNetModel", MPNetModel],
+    "distilbert" => ["DistilBertModel", DistilBertModel],
+    "xlm-roberta" => ["XLMRobertaModel", XLMRobertaModel]
   }
   MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES = {
     "bert" => ["BertForSequenceClassification", BertForSequenceClassification],
-    "distilbert" => ["DistilBertForSequenceClassification", DistilBertForSequenceClassification]
+    "distilbert" => ["DistilBertForSequenceClassification", DistilBertForSequenceClassification],
+    "xlm-roberta" => ["XLMRobertaForSequenceClassification", XLMRobertaForSequenceClassification]
   }
   MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES = {
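
To see how the new prefix handling in `construct_session` resolves a `model_file_name`, the path logic from the first hunk of this file can be copied into a standalone method. The method name `resolve_model_file_name` and the `quantized:` keyword are hypothetical; in the gem the logic sits inline and reads `options[:quantized]`.

```ruby
# Standalone copy of the path logic added to construct_session.
# The method name is hypothetical; only the branching mirrors the diff.
def resolve_model_file_name(file_name, quantized: false)
  prefix = "onnx/"
  if file_name.start_with?("../")
    # "../" escapes the default onnx/ directory
    prefix = ""
    file_name = file_name[3..]
  elsif file_name.start_with?("/")
    # a leading "/" is treated as relative to the model repository root
    prefix = ""
    file_name = file_name[1..]
  end
  "#{prefix}#{file_name}#{quantized ? "_quantized" : ""}.onnx"
end

resolve_model_file_name("model")                   # => "onnx/model.onnx"
resolve_model_file_name("model", quantized: true)  # => "onnx/model_quantized.onnx"
resolve_model_file_name("../model")                # => "model.onnx"
resolve_model_file_name("/weights/model")          # => "weights/model.onnx"
```

The third call corresponds to the `model_file_name: "../model"` option used in the README example for `jinaai/jina-embeddings-v2-base-en`.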

data/lib/informers/pipelines.rb CHANGED

@@ -249,7 +249,8 @@ module Informers
       pooling: "none",
       normalize: false,
       quantize: false,
-      precision: "binary"
+      precision: "binary",
+      model_output: nil
     )
       # Run tokenization
       model_inputs = @tokenizer.(texts,
@@ -258,8 +259,10 @@ module Informers
       )
       model_options = {}
-      # optimization for sentence-transformers/all-MiniLM-L6-v2
-      if @model.instance_variable_get(:@output_names) == ["token_embeddings"] && pooling == "mean" && normalize
+      if !model_output.nil?
+        model_options[:output_names] = Array(model_output)
+      elsif @model.instance_variable_get(:@output_names) == ["token_embeddings"] && pooling == "mean" && normalize
+        # optimization for sentence-transformers/all-MiniLM-L6-v2
         model_options[:output_names] = ["sentence_embedding"]
         pooling = "none"
         normalize = false
@@ -271,7 +274,9 @@ module Informers
       # TODO improve
       result =
         if outputs.is_a?(Array)
-          raise Error, "unexpected outputs" if outputs.size != 1
+          # TODO show returned instead of all
+          output_names = @model.instance_variable_get(:@session).outputs.map { |v| v[:name] }
+          raise Error, "unexpected outputs: #{output_names}" if outputs.size != 1
           outputs[0]
         else
           outputs.logits
@@ -285,6 +290,7 @@ module Informers
       when "cls"
         result = result.map(&:first)
       else
+        # TODO raise ArgumentError in 2.0
         raise Error, "Pooling method '#{pooling}' not supported."
       end
@@ -304,9 +310,10 @@ module Informers
     def call(
       texts,
       pooling: "mean",
-      normalize: true
+      normalize: true,
+      model_output: nil
     )
-      super(texts, pooling:, normalize:)
+      super(texts, pooling:, normalize:, model_output:)
     end
   end
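
The second hunk of this file changes how the requested output names are chosen: an explicit `model_output` now takes precedence over the `sentence_embedding` shortcut. A minimal sketch of that branch as a pure function follows. The name `select_outputs` and the `available_outputs` parameter (standing in for the model's `@output_names`) are hypothetical.

```ruby
# Hypothetical standalone version of the output-selection branch.
# available_outputs stands in for the model's @output_names.
def select_outputs(available_outputs, pooling:, normalize:, model_output: nil)
  model_options = {}
  if !model_output.nil?
    # explicit request: pooling and normalize are left as the caller set them
    model_options[:output_names] = Array(model_output)
  elsif available_outputs == ["token_embeddings"] && pooling == "mean" && normalize
    # shortcut: use the precomputed sentence embedding and skip post-processing
    model_options[:output_names] = ["sentence_embedding"]
    pooling = "none"
    normalize = false
  end
  [model_options, pooling, normalize]
end

select_outputs(["token_embeddings"], pooling: "mean", normalize: true)
# => [{:output_names=>["sentence_embedding"]}, "none", false]

select_outputs(["last_hidden_state"], pooling: "none", normalize: true,
               model_output: "sentence_embedding")
# => [{:output_names=>["sentence_embedding"]}, "none", true]
```

Because the explicit branch does not reset `pooling`, a caller who selects an already pooled output has to pass `pooling: "none"` as well, as the README example for `Snowflake/snowflake-arctic-embed-m-v1.5` does.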

data/lib/informers/tokenizers.rb CHANGED

@@ -91,11 +91,23 @@ module Informers
   class DistilBertTokenizer < PreTrainedTokenizer
   end
+  class RobertaTokenizer < PreTrainedTokenizer
+  end
+  class XLMRobertaTokenizer < PreTrainedTokenizer
+  end
+  class MPNetTokenizer < PreTrainedTokenizer
+  end
   class AutoTokenizer
     TOKENIZER_CLASS_MAPPING = {
       "BertTokenizer" => BertTokenizer,
       "DebertaV2Tokenizer" => DebertaV2Tokenizer,
-      "DistilBertTokenizer" => DistilBertTokenizer
+      "DistilBertTokenizer" => DistilBertTokenizer,
+      "RobertaTokenizer" => RobertaTokenizer,
+      "XLMRobertaTokenizer" => XLMRobertaTokenizer,
+      "MPNetTokenizer" => MPNetTokenizer
     }
     def self.from_pretrained(

data/lib/informers/version.rb CHANGED

@@ -1,3 +1,3 @@
 module Informers
-  VERSION = "1.0.2"
+  VERSION = "1.0.3"
 end

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: informers
 version: !ruby/object:Gem::Version
-  version: 1.0.2
+  version: 1.0.3
 platform: ruby
 authors:
 - Andrew Kane
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2024-08-28 00:00:00.000000000 Z
+date: 2024-08-29 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: onnxruntime