RubyGems - easy_ml - Versions diffs - 0.1.1 - Mend

easy_ml 0.1.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (65)

checksums.yaml +7 -0
data/README.md +270 -0
data/Rakefile +12 -0
data/app/models/easy_ml/model.rb +59 -0
data/app/models/easy_ml/models/xgboost.rb +9 -0
data/app/models/easy_ml/models.rb +5 -0
data/lib/easy_ml/core/model.rb +29 -0
data/lib/easy_ml/core/model_core.rb +181 -0
data/lib/easy_ml/core/model_evaluator.rb +137 -0
data/lib/easy_ml/core/models/hyperparameters/base.rb +34 -0
data/lib/easy_ml/core/models/hyperparameters/xgboost.rb +19 -0
data/lib/easy_ml/core/models/hyperparameters.rb +8 -0
data/lib/easy_ml/core/models/xgboost.rb +10 -0
data/lib/easy_ml/core/models/xgboost_core.rb +220 -0
data/lib/easy_ml/core/models.rb +10 -0
data/lib/easy_ml/core/tuner/adapters/base_adapter.rb +63 -0
data/lib/easy_ml/core/tuner/adapters/xgboost_adapter.rb +50 -0
data/lib/easy_ml/core/tuner/adapters.rb +10 -0
data/lib/easy_ml/core/tuner.rb +105 -0
data/lib/easy_ml/core/uploaders/model_uploader.rb +24 -0
data/lib/easy_ml/core/uploaders.rb +7 -0
data/lib/easy_ml/core.rb +9 -0
data/lib/easy_ml/core_ext/pathname.rb +9 -0
data/lib/easy_ml/core_ext.rb +5 -0
data/lib/easy_ml/data/dataloader.rb +6 -0
data/lib/easy_ml/data/dataset/data/preprocessor/statistics.json +31 -0
data/lib/easy_ml/data/dataset/data/sample_info.json +1 -0
data/lib/easy_ml/data/dataset/dataset/files/sample_info.json +1 -0
data/lib/easy_ml/data/dataset/splits/file_split.rb +140 -0
data/lib/easy_ml/data/dataset/splits/in_memory_split.rb +49 -0
data/lib/easy_ml/data/dataset/splits/split.rb +98 -0
data/lib/easy_ml/data/dataset/splits.rb +11 -0
data/lib/easy_ml/data/dataset/splitters/date_splitter.rb +43 -0
data/lib/easy_ml/data/dataset/splitters.rb +9 -0
data/lib/easy_ml/data/dataset.rb +430 -0
data/lib/easy_ml/data/datasource/datasource_factory.rb +60 -0
data/lib/easy_ml/data/datasource/file_datasource.rb +40 -0
data/lib/easy_ml/data/datasource/merged_datasource.rb +64 -0
data/lib/easy_ml/data/datasource/polars_datasource.rb +41 -0
data/lib/easy_ml/data/datasource/s3_datasource.rb +89 -0
data/lib/easy_ml/data/datasource.rb +33 -0
data/lib/easy_ml/data/preprocessor/preprocessor.rb +205 -0
data/lib/easy_ml/data/preprocessor/simple_imputer.rb +403 -0
data/lib/easy_ml/data/preprocessor/utils.rb +17 -0
data/lib/easy_ml/data/preprocessor.rb +238 -0
data/lib/easy_ml/data/utils.rb +50 -0
data/lib/easy_ml/data.rb +8 -0
data/lib/easy_ml/deployment.rb +5 -0
data/lib/easy_ml/engine.rb +26 -0
data/lib/easy_ml/initializers/inflections.rb +4 -0
data/lib/easy_ml/logging.rb +38 -0
data/lib/easy_ml/railtie/generators/migration/migration_generator.rb +42 -0
data/lib/easy_ml/railtie/templates/migration/create_easy_ml_models.rb.tt +23 -0
data/lib/easy_ml/support/age.rb +27 -0
data/lib/easy_ml/support/est.rb +1 -0
data/lib/easy_ml/support/file_rotate.rb +23 -0
data/lib/easy_ml/support/git_ignorable.rb +66 -0
data/lib/easy_ml/support/synced_directory.rb +134 -0
data/lib/easy_ml/support/utc.rb +1 -0
data/lib/easy_ml/support.rb +10 -0
data/lib/easy_ml/trainer.rb +92 -0
data/lib/easy_ml/transforms.rb +29 -0
data/lib/easy_ml/version.rb +5 -0
data/lib/easy_ml.rb +23 -0
metadata +353 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: 7a959176791dac2307979438ad0f9a9319b4295fe25e489214df5f3b4c908466
+  data.tar.gz: c665ef3c19fda35197be653d9c14a34e9b2256d3c1b7387be4313086ba5d2c11
+SHA512:
+  metadata.gz: bda834230add0f3de2b57d8df4abde48f79f64275946b372d5069b49c610ab09c8cc1117a89a2d9206100b19b30c8f30bbfbe2d8973ce5db551626bfb612949d
+  data.tar.gz: 3d80ffc4930323f3b8cff51ef2e3475f557589e96d54717c750a63507c3c3ab682650c7ccb4daf2bd1168cb35627c0101579e8bd26b310e86906e134c6626249

data/README.md ADDED

@@ -0,0 +1,270 @@
+<img src="easy_ml.svg" alt="EasyML Logo" style="width: 310px; height: 300px;">
+# EasyML
+EasyML is a Ruby gem designed to simplify the process of building, deploying, and managing the lifecycle of machine learning models within a Ruby on Rails application. It is a plug-and-play, opinionated framework that currently supports XGBoost, with plans to expand support to a variety of models and infrastructures. EasyML aims to make deployment and lifecycle management straightforward and efficient.
+## Features
+- **Plug-and-Play Architecture**: EasyML is designed to be easily extendable, allowing for the integration of various machine learning models and data sources.
+- **Opinionated Framework**: Provides a structured approach to model management, ensuring best practices are followed.
+- **Model Lifecycle On Rails**: Seamlessly integrates with Ruby on Rails, allowing simplified deployment of models to production.
+## Current and Planned Features
+### Models Available
+| XGBoost | LightGBM | TensorFlow | PyTorch |
+| ------- | -------- | ---------- | ------- |
+| ✅      | ❌       | ❌         | ❌      |
+### Datasources Available
+| S3  | File | Polars | SQL Databases | REST APIs |
+| --- | ---- | ------ | ------------- | --------- |
+| ✅  | ✅   | ✅     | ❌            | ❌        |
+_Note: Features marked with ❌ are part of the roadmap and are not yet implemented._
+## Quick Start
+Building a production pipeline is as easy as 1, 2, 3!
+### 1. Create Your Dataset
+```ruby
+class MyDataset < EasyML::Data::Dataset
+  datasource :s3, s3_bucket: "my-bucket" # Every time the data changes, we'll pull new data
+  target "revenue" # What are we trying to predict?
+  splitter :date, date_column: "created_at" # How should we partition data into training, test, and validation datasets?
+  transforms DataPipeline # Class that manages data transformation, adding new columns, etc.
+  preprocessing_steps({
+    training: {
+      annual_revenue: { median: true, clip: { min: 0, max: 500_000 } }
+    }
+  }) # Clip annual revenue to the range 0..500_000, then fill missing values with the median
+end
+```
+### 2. Create a Model
+```ruby
+class MyModel < EasyML::Models::XGBoost
+  dataset MyDataset
+  task :regression # Or classification
+  hyperparameters({
+    max_depth: 5,
+    learning_rate: 0.1,
+    objective: "reg:squarederror"
+  })
+end
+```
+### 3. Create a Trainer
+```ruby
+class MyTrainer < EasyML::Trainer
+  model MyModel
+  evaluator MyMetrics
+end
+class MyMetrics
+  def metric_we_make_money(y_pred, y_true)
+    return true if model_makes_money?
+    return false if model_lose_money?
+  end
+  def metric_sales_team_has_enough_leads(y_pred, y_true)
+    return false if sales_will_be_sitting_on_their_hands?
+  end
+end
+```
+Now you're ready to predict in production!
+```ruby
+MyTrainer.train # Yay, we did it!
+MyTrainer.deploy # Let the production hosts know it's live!
+MyTrainer.predict(customer_data: "I am worth a lot of money")
+# prediction: true!
+```
+## Data Management
+EasyML provides a comprehensive data management system that handles all preprocessing tasks, including splitting data into train, test, and validation sets, and avoiding data leakage. The primary abstraction for data handling is the `Dataset` class, which ensures data is properly managed and prepared for machine learning tasks.
+### Preprocessing Features
+EasyML offers a variety of preprocessing features to prepare your data for machine learning models. Here's a complete list of available preprocessing steps and examples of when to use them:
+- **Mean Imputation**: Replace missing values with the mean of the feature. Use this when you want to maintain the average value of the data.
+  ```ruby
+  annual_revenue: {
+    mean: true
+  }
+  ```
+- **Median Imputation**: Replace missing values with the median of the feature. This is useful when you want to maintain the central tendency of the data without being affected by outliers.
+  ```ruby
+  annual_revenue: {
+    median: true
+  }
+  ```
+- **Forward Fill (ffill)**: Fill missing values with the last observed value. Use this for time series data where the last known value is a reasonable estimate for missing values.
+  ```ruby
+  created_date: {
+    ffill: true
+  }
+  ```
+- **Most Frequent Imputation**: Replace missing values with the most frequently occurring value. This is useful for categorical data where the mode is a reasonable estimate for missing values.
+  ```ruby
+  loan_purpose: {
+    most_frequent: true
+  }
+  ```
+- **Constant Imputation**: Replace missing values with a constant value. Use this when you have a specific value that should be used for missing data.
+  ```ruby
+  loan_purpose: {
+    constant: { fill_value: 'unknown' }
+  }
+  ```
+- **Today Imputation**: Fill missing date values with the current date. Use this for features that should default to the current date.
+  ```ruby
+  created_date: {
+    today: true
+  }
+  ```
+- **One-Hot Encoding**: Convert categorical variables into a set of binary variables. Use this when you have categorical data that needs to be converted into a numerical format for model training.
+  ```ruby
+  loan_purpose: {
+    one_hot: true
+  }
+  ```
+- **Label Encoding**: Convert categorical variables into integer labels. Use this when you have categorical data that can be ordinally encoded.
+  ```ruby
+  loan_purpose: {
+    categorical: {
+      encode_labels: true
+    }
+  }
+  ```
+### Other Dataset Features
+- **Data Splitting**: Automatically split data into train, test, and validation sets using various strategies, such as date-based splitting.
+- **Data Synchronization**: Ensure data is synced from its source, such as S3 or local files.
+- **Batch Processing**: Process data in batches to handle large datasets efficiently.
+- **Null Handling**: Alert and handle null values in datasets to ensure data quality.
+## Installation
+1. **Install Python dependencies** (don't worry, all code is in Ruby; we just call through to Python):
+   ```bash
+   pip install wandb
+   pip install optuna
+   ```
+2. **Install the gem**:
+   ```bash
+   gem install easy_ml
+   ```
+3. **Run the generator to store model versions**:
+   ```bash
+   rails generate easy_ml:migration
+   rails db:migrate
+   ```
+4. **Configure CarrierWave for S3 storage**:
+   Ensure you have CarrierWave configured to use AWS S3. If not, add the following configuration:
+   ```ruby
+   # config/initializers/carrierwave.rb
+   CarrierWave.configure do |config|
+     config.fog_provider = 'fog/aws'
+     config.fog_credentials = {
+       provider: 'AWS',
+       aws_access_key_id: ENV['AWS_ACCESS_KEY_ID'],
+       aws_secret_access_key: ENV['AWS_SECRET_ACCESS_KEY'],
+       region: ENV['AWS_REGION'],
+     }
+     config.fog_directory = ENV['AWS_S3_BUCKET']
+     config.fog_public = false
+     config.storage = :fog
+   end
+   ```
+## Usage
+To use EasyML in your Rails application, follow these steps:
+1. **Define your preprocessing steps** in a configuration hash. For example:
+   ```ruby
+   preprocessing_steps = {
+     training: {
+       annual_revenue: {
+         median: true,
+         clip: { min: 0, max: 1_000_000 }
+       },
+       loan_purpose: {
+         categorical: {
+           categorical_min: 2,
+           one_hot: true
+         }
+       }
+     }
+   }
+   ```
+2. **Create a dataset** using the `EasyML::Data::Dataset` class, providing necessary configurations such as data source, target, and preprocessing steps.
+3. **Train a model** using the `EasyML::Models` module, specifying the model class and configuration.
+4. **Deploy the model** by marking it as live and storing it in the configured S3 bucket.
+## Development
+After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and the created tag, and push the `.gem` file to [rubygems.org](https://rubygems.org).
+## Contributing
+Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/easy_ml. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/[USERNAME]/easy_ml/blob/main/CODE_OF_CONDUCT.md).
+## License
+The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
+## Code of Conduct
+Everyone interacting in the EasyML project's codebases, issue trackers, chat rooms, and mailing lists is expected to follow the [code of conduct](https://github.com/[USERNAME]/easy_ml/blob/main/CODE_OF_CONDUCT.md).
+## Expected Future Enhancements
+- **Support for Additional Models**: Integration with LightGBM, TensorFlow, and PyTorch.
+- **Expanded Data Source Support**: Ability to pull data from SQL databases and REST APIs.
+- **Enhanced Deployment Options**: More flexible deployment strategies and integration with CI/CD pipelines.
+- **Advanced Monitoring and Logging**: Improved tools for monitoring model performance and logging.
+- **User Interface Improvements**: Enhanced UI components for managing models and datasets.
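Reviewer note on the README's preprocessing section: the `median` and `clip` options can be illustrated in plain Ruby. This is a minimal sketch, not the gem's implementation; the helper name `clip_and_impute_median` is hypothetical, and the ordering (clip first, then fill missing values with the median of the clipped values) is an assumption based on the Quick Start comment.

```ruby
# Hypothetical helper, not part of EasyML: clip one column to [min, max],
# then replace nil entries with the median of the clipped, non-nil values.
def clip_and_impute_median(values, min:, max:)
  clipped = values.map { |v| v.nil? ? nil : v.clamp(min, max) }
  present = clipped.compact.sort
  return clipped if present.empty? # nothing to compute a median from

  mid = present.length / 2
  median = present.length.odd? ? present[mid] : (present[mid - 1] + present[mid]) / 2.0
  clipped.map { |v| v.nil? ? median : v }
end

annual_revenue = [120_000, nil, 750_000, -50, 40_000]
imputed = clip_and_impute_median(annual_revenue, min: 0, max: 500_000)
p imputed # => [120000, 80000.0, 500000, 0, 40000]
```

In a real pipeline the median would be computed on the training split only and reused for the test and validation splits, which is one way to avoid the data leakage the README mentions.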

data/Rakefile ADDED

@@ -0,0 +1,12 @@
+# frozen_string_literal: true
+require "bundler/gem_tasks"
+require "rspec/core/rake_task"
+RSpec::Core::RakeTask.new(:spec)
+require "rubocop/rake_task"
+RuboCop::RakeTask.new
+task default: %i[spec rubocop]

data/app/models/easy_ml/model.rb ADDED

@@ -0,0 +1,59 @@
+require_relative "../../../lib/easy_ml/core/model"
+module EasyML
+  class Model < ActiveRecord::Base
+    include EasyML::Core::ModelCore
+    self.table_name = "easy_ml_models"
+    scope :live, -> { where(is_live: true) }
+    attribute :root_dir, :string
+    after_initialize :apply_defaults
+    validate :only_one_model_is_live?
+    def only_one_model_is_live?
+      return if @marking_live
+      if previous_versions.live.count > 1
+        raise "Multiple previous versions of #{name} are live! This should never happen. Update previous versions to is_live=false before proceeding"
+      end
+      return unless previous_versions.live.any? && is_live
+      errors.add(:is_live,
+                 "cannot mark model live when previous version is live. Explicitly use the mark_live method to mark this as the live version")
+    end
+    def mark_live
+      transaction do
+        self.class.where(name: name).where.not(id: id).update_all(is_live: false)
+        self.class.where(id: id).update_all(is_live: true)
+      end
+    end
+    def previous_versions
+      EasyML::Model.where(name: name).order(id: :desc)
+    end
+    private
+    def files_to_keep
+      live_models = self.class.live
+      recent_copies = live_models.flat_map do |live|
+        # Keep the most recent non-live versions sharing each live model's name (4 for this model's name, 5 otherwise)
+        self.class.where(name: live.name).where(is_live: false).order(created_at: :desc).limit(live.name == name ? 4 : 5)
+      end
+      recent_versions = self.class
+                            .where.not(
+                              "EXISTS (SELECT 1 FROM easy_ml_models e2 WHERE e2.name = easy_ml_models.name AND e2.is_live = true)"
+                            )
+                            .where("created_at >= ?", 2.days.ago)
+                            .order(created_at: :desc)
+                            .group_by(&:name)
+                            .flat_map { |_, models| models.take(5) }
+      ([self] + recent_versions + recent_copies + live_models).compact.map(&:file).map(&:path).uniq
+    end
+  end
+end
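Reviewer note on `mark_live`: the method demotes every other version with the same name and then promotes the current one inside a transaction, so at most one version per name is live. The same invariant can be sketched without a database. Everything below (`ModelVersion`, `VersionRegistry`) is a hypothetical in-memory stand-in for the `easy_ml_models` table, not code from the gem.

```ruby
# Hypothetical in-memory stand-in for rows of the easy_ml_models table.
ModelVersion = Struct.new(:id, :name, :is_live)

class VersionRegistry
  def initialize(versions)
    @versions = versions
  end

  # Demote every version sharing the target's name, then promote the target,
  # so that at most one version per name is live afterwards.
  def mark_live(id)
    target = @versions.find { |v| v.id == id }
    raise ArgumentError, "unknown version #{id}" if target.nil?

    @versions.each { |v| v.is_live = false if v.name == target.name }
    target.is_live = true
    target
  end

  # All live versions for a given model name.
  def live(name)
    @versions.select { |v| v.name == name && v.is_live }
  end
end

registry = VersionRegistry.new([
  ModelVersion.new(1, "revenue", true),
  ModelVersion.new(2, "revenue", false),
  ModelVersion.new(3, "churn", true)
])
registry.mark_live(2)
```

After the call, only version 2 is live for "revenue", and the "churn" model is untouched because the demotion is scoped by name.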

data/app/models/easy_ml/models/xgboost.rb ADDED

@@ -0,0 +1,9 @@
+require_relative "../model"
+module EasyML
+  module Models
+    class XGBoost < EasyML::Model
+      include EasyML::Core::Models::XGBoostCore
+    end
+  end
+end

data/app/models/easy_ml/models.rb ADDED

@@ -0,0 +1,5 @@
+module EasyML
+  module Models
+    require_relative "models/xgboost"
+  end
+end

data/lib/easy_ml/core/model.rb ADDED

@@ -0,0 +1,29 @@
+require "carrierwave"
+require_relative "model_core"
+require_relative "uploaders/model_uploader"
+module EasyML
+  module Core
+    class Model
+      include GlueGun::DSL
+      attribute :name, :string
+      attribute :version, :string
+      attribute :task, :string, default: "regression"
+      attribute :metrics, :array
+      attribute :ml_model, :string
+      attribute :file, :string
+      attribute :root_dir, :string
+      attribute :objective
+      attribute :evaluator
+      attribute :evaluator_metric
+      include EasyML::Core::ModelCore
+      def initialize(options = {})
+        super
+        apply_defaults
+      end
+    end
+  end
+end

data/lib/easy_ml/core/model_core.rb ADDED

@@ -0,0 +1,181 @@
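+require "fileutils" # FileUtils.mkdir_p in ensure_directory_exists
+require "time" # Time.parse in files_to_keep
+require "carrierwave"
+require_relative "uploaders/model_uploader"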
+module EasyML
+  module Core
+    module ModelCore
+      attr_accessor :dataset
+      def self.included(base)
+        base.send(:include, GlueGun::DSL)
+        base.send(:extend, CarrierWave::Mount)
+        base.send(:mount_uploader, :file, EasyML::Core::Uploaders::ModelUploader)
+        base.class_eval do
+          validates :task, inclusion: { in: %w[regression classification] }
+          validates :task, presence: true
+          validate :dataset_is_a_dataset?
+          validate :validate_any_metrics?
+          validate :validate_metrics_for_task
+          before_validation :save_model_file, if: -> { fit? }
+        end
+      end
+      def dataset_is_a_dataset?
+        return if dataset.nil?
+        return if dataset.class.ancestors.include?(EasyML::Data::Dataset)
+        errors.add(:dataset, "Must be a subclass of EasyML::Data::Dataset")
+      end
+      def validate_any_metrics?
+        return if metrics.any?
+        errors.add(:metrics, "Must include at least one metric. Allowed metrics are #{allowed_metrics.join(", ")}")
+      end
+      def validate_metrics_for_task
+        nonsensical_metrics = metrics.select do |metric|
+          allowed_metrics.exclude?(metric)
+        end
+        return unless nonsensical_metrics.any?
+        errors.add(:metrics,
+                   "cannot use metrics: #{nonsensical_metrics.join(", ")} for task #{task}. Allowed metrics are: #{allowed_metrics.join(", ")}")
+      end
+      def fit(x_train: nil, y_train: nil, x_valid: nil, y_valid: nil)
+        if x_train.nil?
+          dataset.refresh!
+          train_in_batches
+        else
+          train(x_train, y_train, x_valid, y_valid)
+        end
+        @is_fit = true
+      end
+      def decode_labels(ys, col: nil)
+        dataset.decode_labels(ys, col: col)
+      end
+      def evaluate(y_pred: nil, y_true: nil, x_true: nil, evaluator: nil)
+        evaluator ||= self.evaluator
+        EasyML::Core::ModelEvaluator.evaluate(model: self, y_pred: y_pred, y_true: y_true, x_true: x_true,
+                                              evaluator: evaluator)
+      end
+      def predict(xs)
+        raise NotImplementedError, "Subclasses must implement predict method"
+      end
+      def load
+        raise NotImplementedError, "Subclasses must implement load method"
+      end
+      def _save_model_file
+        raise NotImplementedError, "Subclasses must implement _save_model_file method"
+      end
+      def save
+        super if defined?(super) && self.class.superclass.method_defined?(:save)
+        save_model_file
+      end
+      def save_model_file
+        raise "No trained model! Need to train model before saving (call model.fit)" unless fit?
+        path = File.join(model_dir, "#{version}.json")
+        ensure_directory_exists(File.dirname(path))
+        _save_model_file(path)
+        File.open(path) do |f|
+          self.file = f
+        end
+        file.store!
+        cleanup
+      end
+      def get_params
+        @hyperparameters.to_h
+      end
+      def allowed_metrics
+        return [] unless task.present?
+        case task.to_sym
+        when :regression
+          %w[mean_absolute_error mean_squared_error root_mean_squared_error r2_score]
+        when :classification
+          %w[accuracy_score precision_score recall_score f1_score auc roc_auc]
+        else
+          []
+        end
+      end
+      def cleanup!
+        [file_dir, model_dir].each do |dir|
+          EasyML::FileRotate.new(dir, []).cleanup(extension_allowlist)
+        end
+      end
+      def cleanup
+        [file_dir, model_dir].each do |dir|
+          EasyML::FileRotate.new(dir, files_to_keep).cleanup(extension_allowlist)
+        end
+      end
+      def fit?
+        @is_fit == true
+      end
+      private
+      def file_dir
+        return unless file.path.present?
+        File.dirname(file.path).split("/")[0..-2].join("/")
+      end
+      def extension_allowlist
+        EasyML::Core::Uploaders::ModelUploader.new.extension_allowlist
+      end
+      def _save_model_file(path = nil)
+        raise NotImplementedError, "Subclasses must implement _save_model_file method"
+      end
+      def ensure_directory_exists(dir)
+        FileUtils.mkdir_p(dir) unless File.directory?(dir)
+      end
+      def apply_defaults
+        self.version ||= generate_version_string
+        self.metrics ||= allowed_metrics
+        self.ml_model ||= get_ml_model
+      end
+      def get_ml_model
+        self.class.name.split("::").last.underscore
+      end
+      def generate_version_string
+        timestamp = Time.now.utc.strftime("%Y%m%d%H%M%S")
+        model_name = self.class.name.split("::").last.underscore
+        "#{model_name}_#{timestamp}"
+      end
+      def model_dir
+        File.join(root_dir, "easy_ml_models", name.present? ? name.split.join.underscore : "")
+      end
+      def files_to_keep
+        Dir.glob(File.join(file_dir, "*")).select { |f| File.file?(f) }.sort_by do |filename|
+          Time.parse(filename.split("/").last.gsub(/\D/, ""))
+        end.reverse.take(5)
+      end
+    end
+  end
+end
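Reviewer note on `files_to_keep` and `generate_version_string`: versions are named with a `%Y%m%d%H%M%S` timestamp, and the retention logic keeps the five newest files ordered by the digits in the filename. A minimal sketch of that retention rule follows. The helper `newest_versions` and the example paths are hypothetical, and unlike the method above it skips names that carry no 14-digit timestamp instead of handing them to a time parser.

```ruby
# Hypothetical helper, not part of EasyML: return the newest files by the
# 14-digit timestamp embedded in their names, newest first.
def newest_versions(paths, keep: 5)
  paths
    .select { |path| File.basename(path) =~ /\d{14}/ }  # ignore names without a timestamp
    .sort_by { |path| File.basename(path)[/\d{14}/] }   # fixed-width digits sort chronologically
    .reverse
    .take(keep)
end

# Seven hypothetical version files, one per day, plus one stray file.
paths = (1..7).map { |day| format("models/my_model_202401%02d120000.json", day) }
paths << "models/notes.txt"

kept = newest_versions(paths)
puts kept
```

Because the timestamp is fixed width, comparing the digit strings directly gives the same ordering as parsing them into times, which keeps the sketch free of parsing edge cases.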