torchtext 0.1.0

checksums.yaml.gz ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: 86469f8148e519b940a643f81b5317d3e180d6ebc031da14cb0599b48e3f6556
+   data.tar.gz: 499079c8a32de3ea6704b58a04ad8511f7a6784cc138b08e87696c69d7835863
+ SHA512:
+   metadata.gz: e3ea0d3719d35a58b757ac3d11adeda30912f35f69f7de37047ef702c556e5384862f950e055565db8396d8495a760b4919fd416affbdc0fd815dc14ed02e3a3
+   data.tar.gz: 16d2817864dc4bba2d54ca4a7288bc609b95ab5d59da51c67eccf70107fd78e67af141b765d01b8eea82c0a112b3dc779c174d7864afee6fb5651a5d787df7c5
CHANGELOG.md ADDED
@@ -0,0 +1,3 @@
+ ## 0.1.0 (2020-08-24)
+
+ - First release
LICENSE.txt ADDED
@@ -0,0 +1,30 @@
+ BSD 3-Clause License
+
+ Copyright (c) James Bradbury and Soumith Chintala 2016,
+ Copyright (c) Andrew Kane 2020,
+ All rights reserved.
+
+ Redistribution and use in source and binary forms, with or without
+ modification, are permitted provided that the following conditions are met:
+
+ * Redistributions of source code must retain the above copyright notice, this
+   list of conditions and the following disclaimer.
+
+ * Redistributions in binary form must reproduce the above copyright notice,
+   this list of conditions and the following disclaimer in the documentation
+   and/or other materials provided with the distribution.
+
+ * Neither the name of the copyright holder nor the names of its
+   contributors may be used to endorse or promote products derived from
+   this software without specific prior written permission.
+
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+ FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
README.md ADDED
@@ -0,0 +1,62 @@
+ # TorchText
+
+ :fire: Data loaders and abstractions for text and NLP - for Ruby
+
+ ## Installation
+
+ Add this line to your application’s Gemfile:
+
+ ```ruby
+ gem 'torchtext'
+ ```
+
+ ## Getting Started
+
+ This library follows the [Python API](https://pytorch.org/text/). Many methods and options are missing at the moment. PRs welcome!
+
+ ## Examples
+
+ Text classification
+
+ - [PyTorch tutorial](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html)
+ - [Ruby code](examples/text_classification)
+
+ ## Datasets
+
+ Load a dataset
+
+ ```ruby
+ train_dataset, test_dataset = TorchText::Datasets::AG_NEWS.load(root: ".data", ngrams: 2)
+ ```
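The two returned objects are `TextClassificationDataset` instances (defined in `lib/torchtext/datasets/text_classification_dataset.rb` later in this diff). A minimal sketch of inspecting them; the values in the comments are illustrative only:

```ruby
train_dataset.length      # number of examples
train_dataset.vocab.size  # vocabulary size, including the "<unk>" and "<pad>" specials
train_dataset.labels      # Set of class ids, e.g. {0, 1, 2, 3} for AG_NEWS
train_dataset[0]          # a [label, tensor_of_token_ids] pair
```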
+
+ Supported datasets are:
+
+ - [AG_NEWS](http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html)
+
+ ## Disclaimer
+
+ This library downloads and prepares public datasets. We don’t host any datasets. Be sure to adhere to the license for each dataset.
+
+ If you’re a dataset owner and wish to update any details or remove it from this project, let us know.
+
+ ## History
+
+ View the [changelog](https://github.com/ankane/torchtext/blob/master/CHANGELOG.md)
+
+ ## Contributing
+
+ Everyone is encouraged to help improve this project. Here are a few ways you can help:
+
+ - [Report bugs](https://github.com/ankane/torchtext/issues)
+ - Fix bugs and [submit pull requests](https://github.com/ankane/torchtext/pulls)
+ - Write, clarify, or fix documentation
+ - Suggest or add new features
+
+ To get started with development:
+
+ ```sh
+ git clone https://github.com/ankane/torchtext.git
+ cd torchtext
+ bundle install
+ bundle exec rake test
+ ```
lib/torchtext.rb ADDED
@@ -0,0 +1,19 @@
+ # dependencies
+ require "torch"
+
+ # stdlib
+ require "csv"
+ require "fileutils"
+ require "net/http" # used by the dataset download helper
+ require "rubygems/package"
+ require "set"
+ require "tmpdir" # Dir.tmpdir is used when downloading datasets
+
+ # modules
+ require "torchtext/data/utils"
+ require "torchtext/datasets/text_classification"
+ require "torchtext/datasets/text_classification_dataset"
+ require "torchtext/vocab"
+ require "torchtext/version"
+
+ module TorchText
+   class Error < StandardError; end
+ end
lib/torchtext/data/utils.rb ADDED
@@ -0,0 +1,60 @@
+ module TorchText
+   module Data
+     module Utils
+       def tokenizer(tokenizer, language: "en")
+         return method(:split_tokenizer) if tokenizer.nil?
+
+         if tokenizer == "basic_english"
+           if language != "en"
+             raise ArgumentError, "Basic normalization is only available for English(en)"
+           end
+           return method(:basic_english_normalize)
+         end
+
+         raise "Not implemented yet"
+       end
+
+       def ngrams_iterator(token_list, ngrams)
+         return enum_for(:ngrams_iterator, token_list, ngrams) unless block_given?
+
+         get_ngrams = lambda do |n|
+           (token_list.size - n + 1).times.map { |i| token_list[i...(i + n)] }
+         end
+
+         token_list.each do |x|
+           yield x
+         end
+         2.upto(ngrams) do |n|
+           get_ngrams.call(n).each do |x|
+             yield x.join(" ")
+           end
+         end
+       end
+
+       private
+
+       def split_tokenizer(x)
+         x.split
+       end
+
+       _patterns = [%r{\'}, %r{\"}, %r{\.}, %r{<br \/>}, %r{,}, %r{\(}, %r{\)}, %r{\!}, %r{\?}, %r{\;}, %r{\:}, %r{\s+}]
+       _replacements = [" \' ", "", " . ", " ", " , ", " ( ", " ) ", " ! ", " ? ", " ", " ", " "]
+
+       PATTERNS_DICT = _patterns.zip(_replacements)
+
+       def basic_english_normalize(line)
+         line = line.downcase
+
+         PATTERNS_DICT.each do |pattern_re, replaced_str|
+           line.gsub!(pattern_re, replaced_str) # gsub! so every occurrence is replaced, not just the first
+         end
+         line.split
+       end
+
+       extend self
+     end
+
+     # TODO only tokenizer method
+     extend Utils
+   end
+ end
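A quick usage sketch of the two public helpers above; the sample sentence and the expected outputs are illustrative only:

```ruby
require "torchtext"

# "basic_english" lowercases and pads punctuation before splitting on whitespace
tokenize = TorchText::Data.tokenizer("basic_english")
tokens = tokenize.call("You can now install TorchText!")
# => ["you", "can", "now", "install", "torchtext", "!"]

# ngrams_iterator yields the original tokens first, then the joined higher-order n-grams
TorchText::Data::Utils.ngrams_iterator(tokens, 2).to_a.last(2)
# => ["install torchtext", "torchtext !"]
```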
lib/torchtext/datasets/text_classification.rb ADDED
@@ -0,0 +1,166 @@
+ module TorchText
+   module Datasets
+     module TextClassification
+       URLS = {
+         "AG_NEWS" => "https://drive.google.com/uc?export=download&id=0Bz8a_Dbh9QhbUDNpeUdjb0wxRms"
+       }
+       PATHS = {
+         "AG_NEWS" => "ag_news_csv"
+       }
+       FILENAMES = {
+         "AG_NEWS" => "ag_news_csv.tar.gz"
+       }
+
+       class << self
+         def ag_news(*args, **kwargs)
+           setup_datasets("AG_NEWS", *args, **kwargs)
+         end
+
+         private
+
+         def setup_datasets(dataset_name, root: ".data", ngrams: 1, vocab: nil, include_unk: false)
+           dataset_tar = download_from_url(URLS[dataset_name], root: root, filename: FILENAMES[dataset_name])
+           to_path = extract_archive(dataset_tar)
+           extracted_files = Dir["#{to_path}/#{PATHS[dataset_name]}/*"]
+
+           train_csv_path = nil
+           test_csv_path = nil
+           extracted_files.each do |fname|
+             if fname.end_with?("train.csv")
+               train_csv_path = fname
+             elsif fname.end_with?("test.csv")
+               test_csv_path = fname
+             end
+           end
+
+           if vocab.nil?
+             vocab = Vocab.build_vocab_from_iterator(_csv_iterator(train_csv_path, ngrams))
+           else
+             unless vocab.is_a?(Vocab)
+               raise ArgumentError, "Passed vocabulary is not of type Vocab"
+             end
+           end
+           train_data, train_labels = _create_data_from_iterator(vocab, _csv_iterator(train_csv_path, ngrams, yield_cls: true), include_unk)
+           test_data, test_labels = _create_data_from_iterator(vocab, _csv_iterator(test_csv_path, ngrams, yield_cls: true), include_unk)
+           if (train_labels ^ test_labels).length > 0
+             raise ArgumentError, "Training and test labels don't match"
+           end
+
+           [
+             TextClassificationDataset.new(vocab, train_data, train_labels),
+             TextClassificationDataset.new(vocab, test_data, test_labels)
+           ]
+         end
+
+         def _csv_iterator(data_path, ngrams, yield_cls: false)
+           return enum_for(:_csv_iterator, data_path, ngrams, yield_cls: yield_cls) unless block_given?
+
+           tokenizer = Data.tokenizer("basic_english")
+           CSV.foreach(data_path) do |row|
+             tokens = row[1..-1].join(" ")
+             tokens = tokenizer.call(tokens)
+             if yield_cls
+               yield row[0].to_i - 1, Data::Utils.ngrams_iterator(tokens, ngrams)
+             else
+               yield Data::Utils.ngrams_iterator(tokens, ngrams)
+             end
+           end
+         end
+
71
+ data = []
72
+ labels = []
73
+ iterator.each do |cls, tokens|
74
+ if include_unk
75
+ tokens = Torch.tensor(tokens.map { |token| vocab[token] })
76
+ else
77
+ token_ids = tokens.map { |token| vocab[token] }.select { |x| x != Vocab::UNK }
78
+ tokens = Torch.tensor(token_ids)
79
+ end
80
+ data << [cls, tokens]
81
+ labels << cls
82
+ end
83
+ [data, Set.new(labels)]
84
+ end
+
+         # extra filename parameter
+         def download_from_url(url, root:, filename:)
+           path = File.join(root, filename)
+           return path if File.exist?(path)
+
+           FileUtils.mkdir_p(root)
+
+           puts "Downloading #{url}..."
+           download_url_to_file(url, path)
+         end
+
+         # follows redirects
+         def download_url_to_file(url, dst)
+           uri = URI(url)
+           tmp = nil
+           location = nil
+
+           Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
+             request = Net::HTTP::Get.new(uri)
+
+             http.request(request) do |response|
+               case response
+               when Net::HTTPRedirection
+                 location = response["location"]
+               when Net::HTTPSuccess
+                 tmp = "#{Dir.tmpdir}/#{Time.now.to_f}" # TODO better name
+                 File.open(tmp, "wb") do |f|
+                   response.read_body do |chunk|
+                     f.write(chunk)
+                   end
+                 end
+               else
+                 raise Error, "Bad response"
+               end
+             end
+           end
+
+           if location
+             download_url_to_file(location, dst)
+           else
+             FileUtils.mv(tmp, dst)
+             dst
+           end
+         end
+
+         # extract_tar_gz doesn't list files, so just return to_path
+         def extract_archive(from_path, to_path: nil, overwrite: nil)
+           to_path ||= File.dirname(from_path)
+
+           if from_path.end_with?(".tar.gz") || from_path.end_with?(".tgz")
+             File.open(from_path, "rb") do |io|
+               Gem::Package.new("").extract_tar_gz(io, to_path)
+             end
+             return to_path
+           end
+
+           raise "Not implemented yet"
+         end
+       end
+
+       DATASETS = {
+         "AG_NEWS" => method(:ag_news)
+       }
+
+       LABELS = {
+         "AG_NEWS" => {
+           0 => "World",
+           1 => "Sports",
+           2 => "Business",
+           3 => "Sci/Tech"
+         }
+       }
+     end
+
+     class AG_NEWS
+       def self.load(*args, **kwargs)
+         TextClassification.ag_news(*args, **kwargs)
+       end
+     end
+   end
+ end
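Putting the pieces above together: each dataset element is a `[label, token_id_tensor]` pair, and `LABELS` maps the zero-based class index back to a name. A minimal sketch; the indexing and output shown are illustrative:

```ruby
train_dataset, test_dataset = TorchText::Datasets::AG_NEWS.load(root: ".data", ngrams: 2)

# each element is a [label, tensor of token ids] pair (see _create_data_from_iterator above)
label, token_ids = train_dataset[0]

# LABELS maps the zero-based class index back to a readable name
puts TorchText::Datasets::TextClassification::LABELS["AG_NEWS"][label]
# => "World", "Sports", "Business", or "Sci/Tech"
```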
lib/torchtext/datasets/text_classification_dataset.rb ADDED
@@ -0,0 +1,29 @@
+ module TorchText
+   module Datasets
+     class TextClassificationDataset < Torch::Utils::Data::Dataset
+       attr_reader :labels, :vocab
+
+       def initialize(vocab, data, labels)
+         super()
+         @data = data
+         @labels = labels
+         @vocab = vocab
+       end
+
+       def [](i)
+         @data[i]
+       end
+
+       def length
+         @data.length
+       end
+       alias_method :size, :length
+
+       def each
+         @data.each do |x|
+           yield x
+         end
+       end
+     end
+   end
+ end
lib/torchtext/version.rb ADDED
@@ -0,0 +1,3 @@
+ module TorchText
+   VERSION = "0.1.0"
+ end
lib/torchtext/vocab.rb ADDED
@@ -0,0 +1,87 @@
+ module TorchText
+   class Vocab
+     UNK = "<unk>"
+
+     def initialize(
+       counter, max_size: nil, min_freq: 1, specials: ["<unk>", "<pad>"],
+       vectors: nil, unk_init: nil, vectors_cache: nil, specials_first: true
+     )
+
+       @freqs = counter
+       counter = counter.dup
+       min_freq = [min_freq, 1].max
+
+       @itos = []
+       @unk_index = nil
+
+       if specials_first
+         @itos = specials.dup # copy so the caller's specials array is not mutated
+         # only extend max size if specials are prepended
+         max_size += specials.size if max_size
+       end
+
+       # frequencies of special tokens are not counted when building vocabulary
+       # in frequency order
+       specials.each do |tok|
+         counter.delete(tok)
+       end
+
+       # sort by frequency, then alphabetically
+       words_and_frequencies = counter.sort_by { |k, v| [-v, k] }
+
+       words_and_frequencies.each do |word, freq|
+         break if freq < min_freq || @itos.length == max_size
+         @itos << word
+       end
+
+       if specials.include?(UNK) # hard-coded for now
+         unk_index = specials.index(UNK) # position in list
+         # account for ordering of specials, set variable
+         @unk_index = specials_first ? unk_index : @itos.length + unk_index
+         @stoi = Hash.new(@unk_index)
+       else
+         @stoi = {}
+       end
+
+       if !specials_first
+         @itos.concat(specials)
+       end
+
+       # stoi is simply a reverse dict for itos
+       @itos.each_with_index do |tok, i|
+         @stoi[tok] = i
+       end
+
+       @vectors = nil
+       if !vectors.nil?
+         # self.load_vectors(vectors, unk_init=unk_init, cache=vectors_cache)
+         raise "Not implemented yet"
+       else
+         raise "Failed assertion" unless unk_init.nil?
+         raise "Failed assertion" unless vectors_cache.nil?
+       end
+     end
+
+     def [](token)
+       @stoi.fetch(token, @stoi.fetch(UNK))
+     end
+
+     def length
+       @itos.length
+     end
+     alias_method :size, :length
+
+     def self.build_vocab_from_iterator(iterator)
+       counter = Hash.new(0)
+       i = 0
+       iterator.each do |tokens|
+         tokens.each do |token|
+           counter[token] += 1
+         end
+         i += 1
+         puts "Processed #{i}" if i % 10000 == 0
+       end
+       Vocab.new(counter)
+     end
+   end
+ end
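A small, self-contained sketch of how this `Vocab` behaves; the token counts below are made up for illustration:

```ruby
counter = {"the" => 5, "cat" => 3, "sat" => 1}
vocab = TorchText::Vocab.new(counter, min_freq: 2)

vocab.size        # => 4 ("<unk>", "<pad>", "the", "cat"; "sat" is below min_freq)
vocab["the"]      # => 2
vocab["aardvark"] # => 0, the <unk> index, since unknown tokens fall back to <unk>

# build_vocab_from_iterator counts tokens from an enumerable of token arrays
vocab2 = TorchText::Vocab.build_vocab_from_iterator([%w[the cat sat], %w[the cat]])
vocab2.size # => 5
```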
metadata ADDED
@@ -0,0 +1,107 @@
+ --- !ruby/object:Gem::Specification
+ name: torchtext
+ version: !ruby/object:Gem::Version
+   version: 0.1.0
+ platform: ruby
+ authors:
+ - Andrew Kane
+ autorequire:
+ bindir: bin
+ cert_chain: []
+ date: 2020-08-24 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: torch-rb
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: 0.3.2
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: 0.3.2
+ - !ruby/object:Gem::Dependency
+   name: bundler
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: rake
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: minitest
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '5'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '5'
+ description:
+ email: andrew@chartkick.com
+ executables: []
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - CHANGELOG.md
+ - LICENSE.txt
+ - README.md
+ - lib/torchtext.rb
+ - lib/torchtext/data/utils.rb
+ - lib/torchtext/datasets/text_classification.rb
+ - lib/torchtext/datasets/text_classification_dataset.rb
+ - lib/torchtext/version.rb
+ - lib/torchtext/vocab.rb
+ homepage: https://github.com/ankane/torchtext
+ licenses:
+ - BSD-3-Clause
+ metadata: {}
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '2.5'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubygems_version: 3.1.2
+ signing_key:
+ specification_version: 4
+ summary: Data loaders and abstractions for text and NLP
+ test_files: []