informers 1.0.1 → 1.0.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 3abc738d8975839b873bc5e07bb95305d455a9ac1eec94c432415b713411f20b
-  data.tar.gz: b9c36794c33316378752dd816fb517714c6d8186062562a778d3c8539ba7d79a
+  metadata.gz: 4ea317272c5054b01616643e7e0f0b2b2fe0c4a87fe8399350a6b8d0a279c5a1
+  data.tar.gz: 530f8aaab9a5ca71811a82adca0272e2ca84525bcf1f60f2209c394cbd0f9c2a
 SHA512:
-  metadata.gz: ce05bfcdebce333fd6b5abefca703850d3a6d6a50c3c1589bf675e91ae24b424f2e43e6bc0270ad4ea8a520f5be9d636c5e8a5a66deae2c0183adae6cbc517aa
-  data.tar.gz: 6cc9b08b6e0f9e8ea23f306c0c460dc2557e4ee5113ef26300b517608485ea528fcb9254d51f395c37b557bf1728051c2c3dd8a20a25b5bd4826832a4ff30bf8
+  metadata.gz: 76059b486e6f6c0b0054450f76813dd4bf12845da6f46e8089585cd1a69be7db86a0acf446cc5a18e48108393403324626f6656d09bdb69083f2651abc0d2448
+  data.tar.gz: f466f5382edd76a7092dc6ada349a3e58fe7eedcd481726ca765f8ddfb4543b7269dab96c00a93d10b0fd67f800afd70a619cfb15d78dde494b29cc13d21ef1a
data/CHANGELOG.md CHANGED
@@ -1,3 +1,9 @@
+## 1.0.2 (2024-08-28)
+
+- Added `embedding` pipeline
+- Added experimental `reranking` pipeline
+- Added support for `nomic-ai/nomic-embed-text-v1`
+
 ## 1.0.1 (2024-08-27)
 
 - Added support for `Supabase/gte-small` to `Model`
data/README.md CHANGED
@@ -21,6 +21,20 @@ gem "informers"
 
 ## Models
 
+Embedding
+
+- [sentence-transformers/all-MiniLM-L6-v2](#sentence-transformersall-MiniLM-L6-v2)
+- [Xenova/multi-qa-MiniLM-L6-cos-v1](#xenovamulti-qa-MiniLM-L6-cos-v1)
+- [mixedbread-ai/mxbai-embed-large-v1](#mixedbread-aimxbai-embed-large-v1)
+- [Supabase/gte-small](#supabasegte-small)
+- [intfloat/e5-base-v2](#intfloate5-base-v2)
+- [nomic-ai/nomic-embed-text-v1](#nomic-ainomic-embed-text-v1)
+- [BAAI/bge-base-en-v1.5](#baaibge-base-en-v15)
+
+Reranking (experimental)
+
+- [mixedbread-ai/mxbai-rerank-base-v1](#mixedbread-aimxbai-rerank-base-v1)
+
 ### sentence-transformers/all-MiniLM-L6-v2
 
 [Docs](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
@@ -28,14 +42,14 @@ gem "informers"
 ```ruby
 sentences = ["This is an example sentence", "Each sentence is converted"]
 
-model = Informers::Model.new("sentence-transformers/all-MiniLM-L6-v2")
-embeddings = model.embed(sentences)
+model = Informers.pipeline("embedding", "sentence-transformers/all-MiniLM-L6-v2")
+embeddings = model.(sentences)
 ```
 
 For a quantized version, use:
 
 ```ruby
-model = Informers::Model.new("Xenova/all-MiniLM-L6-v2", quantized: true)
+model = Informers.pipeline("embedding", "Xenova/all-MiniLM-L6-v2", quantized: true)
 ```
 
 ### Xenova/multi-qa-MiniLM-L6-cos-v1
@@ -46,9 +60,9 @@ model = Informers::Model.new("Xenova/all-MiniLM-L6-v2", quantized: true)
 query = "How many people live in London?"
 docs = ["Around 9 Million people live in London", "London is known for its financial district"]
 
-model = Informers::Model.new("Xenova/multi-qa-MiniLM-L6-cos-v1")
-query_embedding = model.embed(query)
-doc_embeddings = model.embed(docs)
+model = Informers.pipeline("embedding", "Xenova/multi-qa-MiniLM-L6-cos-v1")
+query_embedding = model.(query)
+doc_embeddings = model.(docs)
 scores = doc_embeddings.map { |e| e.zip(query_embedding).sum { |d, q| d * q } }
 doc_score_pairs = docs.zip(scores).sort_by { |d, s| -s }
 ```
@@ -68,8 +82,8 @@ docs = [
   "The cat is purring"
 ]
 
-model = Informers::Model.new("mixedbread-ai/mxbai-embed-large-v1")
-embeddings = model.embed(docs)
+model = Informers.pipeline("embedding", "mixedbread-ai/mxbai-embed-large-v1")
+embeddings = model.(docs)
 ```
 
 ### Supabase/gte-small
@@ -79,12 +93,96 @@ embeddings = model.embed(docs)
 ```ruby
 sentences = ["That is a happy person", "That is a very happy person"]
 
-model = Informers::Model.new("Supabase/gte-small")
-embeddings = model.embed(sentences)
+model = Informers.pipeline("embedding", "Supabase/gte-small")
+embeddings = model.(sentences)
+```
+
+### intfloat/e5-base-v2
+
+[Docs](https://huggingface.co/intfloat/e5-base-v2)
+
+```ruby
+input = [
+  "passage: Ruby is a programming language created by Matz",
+  "query: Ruby creator"
+]
+
+model = Informers.pipeline("embedding", "intfloat/e5-base-v2")
+embeddings = model.(input)
+```
+
+### nomic-ai/nomic-embed-text-v1
+
+[Docs](https://huggingface.co/nomic-ai/nomic-embed-text-v1)
+
+```ruby
+input = [
+  "search_document: The dog is barking",
+  "search_query: puppy"
+]
+
+model = Informers.pipeline("embedding", "nomic-ai/nomic-embed-text-v1")
+embeddings = model.(input)
+```
+
+### BAAI/bge-base-en-v1.5
+
+[Docs](https://huggingface.co/BAAI/bge-base-en-v1.5)
+
+```ruby
+def transform_query(query)
+  "Represent this sentence for searching relevant passages: #{query}"
+end
+
+input = [
+  transform_query("puppy"),
+  "The dog is barking",
+  "The cat is purring"
+]
+
+model = Informers.pipeline("embedding", "BAAI/bge-base-en-v1.5")
+embeddings = model.(input)
+```
+
+### mixedbread-ai/mxbai-rerank-base-v1
+
+[Docs](https://huggingface.co/mixedbread-ai/mxbai-rerank-base-v1)
+
+```ruby
+query = "How many people live in London?"
+docs = ["Around 9 Million people live in London", "London is known for its financial district"]
+
+model = Informers.pipeline("reranking", "mixedbread-ai/mxbai-rerank-base-v1")
+result = model.(query, docs)
 ```
 
+### Other
+
+You can use the feature extraction pipeline directly.
+
+```ruby
+model = Informers.pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2", quantized: false)
+embeddings = model.(sentences, pooling: "mean", normalize: true)
+```
+
+The model files must include `onnx/model.onnx` or `onnx/model_quantized.onnx` ([example](https://huggingface.co/Xenova/all-MiniLM-L6-v2/tree/main/onnx)).
+
 ## Pipelines
 
+Embedding
+
+```ruby
+embed = Informers.pipeline("embedding")
+embed.("We are very happy to show you the 🤗 Transformers library.")
+```
+
+Reranking (experimental)
+
+```ruby
+rerank = Informers.pipeline("reranking")
+rerank.("Who created Ruby?", ["Matz created Ruby", "Another doc"])
+```
+
 Named-entity recognition
 
 ```ruby
data/lib/informers/model.rb CHANGED
@@ -2,12 +2,7 @@ module Informers
   class Model
     def initialize(model_id, quantized: false)
       @model_id = model_id
-      @model = Informers.pipeline("feature-extraction", model_id, quantized: quantized)
-
-      # TODO better pattern
-      if model_id == "sentence-transformers/all-MiniLM-L6-v2"
-        @model.instance_variable_get(:@model).instance_variable_set(:@output_names, ["sentence_embedding"])
-      end
+      @model = Informers.pipeline("embedding", model_id, quantized: quantized)
     end
 
     def embed(texts)
@@ -15,14 +10,12 @@ module Informers
       texts = [texts] unless is_batched
 
       case @model_id
-      when "sentence-transformers/all-MiniLM-L6-v2"
+      when "sentence-transformers/all-MiniLM-L6-v2", "Xenova/all-MiniLM-L6-v2", "Xenova/multi-qa-MiniLM-L6-cos-v1", "Supabase/gte-small"
        output = @model.(texts)
-      when "Xenova/all-MiniLM-L6-v2", "Xenova/multi-qa-MiniLM-L6-cos-v1", "Supabase/gte-small"
-        output = @model.(texts, pooling: "mean", normalize: true)
       when "mixedbread-ai/mxbai-embed-large-v1"
-        output = @model.(texts, pooling: "cls")
+        output = @model.(texts, pooling: "cls", normalize: false)
       else
-        raise Error, "model not supported: #{@model_id}"
+        raise Error, "Use the embedding pipeline for this model: #{@model_id}"
       end
 
       is_batched ? output : output[0]
data/lib/informers/models.rb CHANGED
@@ -141,13 +141,13 @@ module Informers
       OnnxRuntime::InferenceSession.new(path)
     end
 
-    def call(model_inputs)
-      @forward.(model_inputs)
+    def call(model_inputs, **kwargs)
+      @forward.(model_inputs, **kwargs)
     end
 
     private
 
-    def encoder_forward(model_inputs)
+    def encoder_forward(model_inputs, output_names: nil)
       encoder_feeds = {}
       @session.inputs.each do |input|
         key = input[:name].to_sym
@@ -156,13 +156,13 @@ module Informers
       if @session.inputs.any? { |v| v[:name] == "token_type_ids" } && !encoder_feeds[:token_type_ids]
         raise Todo
       end
-      session_run(@session, encoder_feeds)
+      session_run(@session, encoder_feeds, output_names:)
     end
 
-    def session_run(session, inputs)
+    def session_run(session, inputs, output_names:)
       checked_inputs = validate_inputs(session, inputs)
       begin
-        output = session.run(@output_names, checked_inputs)
+        output = session.run(output_names || @output_names, checked_inputs)
         output = replace_tensors(output)
         output
       rescue => e
@@ -199,6 +199,18 @@ module Informers
     end
   end
 
+  class NomicBertPreTrainedModel < PreTrainedModel
+  end
+
+  class NomicBertModel < NomicBertPreTrainedModel
+  end
+
+  class DebertaV2PreTrainedModel < PreTrainedModel
+  end
+
+  class DebertaV2Model < DebertaV2PreTrainedModel
+  end
+
   class DistilBertPreTrainedModel < PreTrainedModel
   end
 
@@ -217,6 +229,13 @@ module Informers
     end
   end
 
+  MODEL_MAPPING_NAMES_ENCODER_ONLY = {
+    "bert" => ["BertModel", BertModel],
+    "nomic_bert" => ["NomicBertModel", NomicBertModel],
+    "deberta-v2" => ["DebertaV2Model", DebertaV2Model],
+    "distilbert" => ["DistilBertModel", DistilBertModel]
+  }
+
   MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES = {
     "bert" => ["BertForSequenceClassification", BertForSequenceClassification],
     "distilbert" => ["DistilBertForSequenceClassification", DistilBertForSequenceClassification]
@@ -231,6 +250,7 @@ module Informers
   }
 
   MODEL_CLASS_TYPE_MAPPING = [
+    [MODEL_MAPPING_NAMES_ENCODER_ONLY, MODEL_TYPES[:EncoderOnly]],
     [MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES, MODEL_TYPES[:EncoderOnly]],
     [MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES, MODEL_TYPES[:EncoderOnly]],
     [MODEL_FOR_QUESTION_ANSWERING_MAPPING_NAMES, MODEL_TYPES[:EncoderOnly]]
data/lib/informers/pipelines.rb CHANGED
@@ -10,10 +10,6 @@ module Informers
   end
 
   class TextClassificationPipeline < Pipeline
-    def initialize(**options)
-      super(**options)
-    end
-
     def call(texts, top_k: 1)
       # Run tokenization
       model_inputs = @tokenizer.(texts,
@@ -56,10 +52,6 @@ module Informers
   end
 
   class TokenClassificationPipeline < Pipeline
-    def initialize(**options)
-      super(**options)
-    end
-
     def call(
       texts,
       ignore_labels: ["O"],
@@ -200,10 +192,6 @@ module Informers
   end
 
   class QuestionAnsweringPipeline < Pipeline
-    def initialize(**options)
-      super(**options)
-    end
-
     def call(question, context, top_k: 1)
       # Run tokenization
       inputs = @tokenizer.(question,
@@ -256,10 +244,6 @@ module Informers
   end
 
   class FeatureExtractionPipeline < Pipeline
-    def initialize(**options)
-      super(**options)
-    end
-
     def call(
       texts,
       pooling: "none",
@@ -272,12 +256,27 @@ module Informers
         padding: true,
         truncation: true
       )
+      model_options = {}
+
+      # optimization for sentence-transformers/all-MiniLM-L6-v2
+      if @model.instance_variable_get(:@output_names) == ["token_embeddings"] && pooling == "mean" && normalize
+        model_options[:output_names] = ["sentence_embedding"]
+        pooling = "none"
+        normalize = false
+      end
 
       # Run model
-      outputs = @model.(model_inputs)
+      outputs = @model.(model_inputs, **model_options)
+
+      # TODO improve
+      result =
+        if outputs.is_a?(Array)
+          raise Error, "unexpected outputs" if outputs.size != 1
+          outputs[0]
+        else
+          outputs.logits
+        end
 
-      # TODO check outputs.last_hidden_state
-      result = outputs.logits
       case pooling
       when "none"
         # Skip pooling
@@ -301,6 +300,46 @@ module Informers
     end
   end
 
+  class EmbeddingPipeline < FeatureExtractionPipeline
+    def call(
+      texts,
+      pooling: "mean",
+      normalize: true
+    )
+      super(texts, pooling:, normalize:)
+    end
+  end
+
+  class RerankingPipeline < Pipeline
+    def call(
+      query,
+      documents,
+      return_documents: false,
+      top_k: nil
+    )
+      model_inputs = @tokenizer.([query] * documents.size,
+        text_pair: documents,
+        padding: true,
+        truncation: true
+      )
+
+      outputs = @model.(model_inputs)
+
+      result =
+        Utils.sigmoid(outputs[0].map(&:first))
+          .map.with_index { |s, i| {doc_id: i, score: s} }
+          .sort_by { |v| -v[:score] }
+
+      if return_documents
+        result.each do |v|
+          v[:text] = documents[v[:doc_id]]
+        end
+      end
+
+      top_k ? result.first(top_k) : result
+    end
+  end
+
   SUPPORTED_TASKS = {
     "text-classification" => {
       tokenizer: AutoTokenizer,
@@ -337,6 +376,24 @@ module Informers
         model: "Xenova/all-MiniLM-L6-v2"
       },
       type: "text"
+    },
+    "embedding" => {
+      tokenizer: AutoTokenizer,
+      pipeline: EmbeddingPipeline,
+      model: AutoModel,
+      default: {
+        model: "sentence-transformers/all-MiniLM-L6-v2"
+      },
+      type: "text"
+    },
+    "reranking" => {
+      tokenizer: AutoTokenizer,
+      pipeline: RerankingPipeline,
+      model: AutoModel,
+      default: {
+        model: "mixedbread-ai/mxbai-rerank-base-v1"
+      },
+      type: "text"
     }
   }
 
@@ -361,11 +418,13 @@ module Informers
     end
   end
 
+  NO_DEFAULT = Object.new
+
   class << self
     def pipeline(
       task,
       model = nil,
-      quantized: true,
+      quantized: NO_DEFAULT,
       progress_callback: DEFAULT_PROGRESS_CALLBACK,
       config: nil,
       cache_dir: nil,
@@ -373,6 +432,11 @@ module Informers
       revision: "main",
       model_file_name: nil
     )
+      if quantized == NO_DEFAULT
+        # TODO move default to task class
+        quantized = !["embedding", "reranking"].include?(task)
+      end
+
       # Apply aliases
       task = TASK_ALIASES[task] || task
 
@@ -408,6 +472,10 @@ module Informers
       results = load_items(classes, model, pretrained_options)
       results[:task] = task
 
+      if model == "sentence-transformers/all-MiniLM-L6-v2"
+        results[:model].instance_variable_set(:@output_names, ["token_embeddings"])
+      end
+
       Utils.dispatch_callback(progress_callback, {
         status: "ready",
         task: task,
data/lib/informers/tokenizers.rb CHANGED
@@ -83,12 +83,18 @@ module Informers
     # self.return_token_type_ids = true
   end
 
+  class DebertaV2Tokenizer < PreTrainedTokenizer
+    # TODO
+    # self.return_token_type_ids = true
+  end
+
   class DistilBertTokenizer < PreTrainedTokenizer
   end
 
   class AutoTokenizer
     TOKENIZER_CLASS_MAPPING = {
       "BertTokenizer" => BertTokenizer,
+      "DebertaV2Tokenizer" => DebertaV2Tokenizer,
       "DistilBertTokenizer" => DistilBertTokenizer
     }
 
data/lib/informers/version.rb CHANGED
@@ -1,3 +1,3 @@
 module Informers
-  VERSION = "1.0.1"
+  VERSION = "1.0.2"
 end
metadata CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: informers
 version: !ruby/object:Gem::Version
-  version: 1.0.1
+  version: 1.0.2
 platform: ruby
 authors:
 - Andrew Kane