RubyGems - lazar - Versions diffs - 1.0.1 → 1.1.0 - Mend

lazar 1.0.1 → 1.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (23)

checksums.yaml +4 -4
data/README.md +98 -2
data/VERSION +1 -1
data/lib/caret.rb +5 -6
data/lib/classification.rb +5 -0
data/lib/crossvalidation.rb +1 -0
data/lib/dataset.rb +1 -1
data/lib/lazar.rb +5 -2
data/lib/leave-one-out-validation.rb +1 -0
data/lib/model.rb +30 -18
data/lib/regression.rb +1 -1
data/lib/train-test-validation.rb +2 -0
data/lib/unique_descriptors.rb +2 -1
data/lib/validation-statistics.rb +24 -46
data/test/dataset.rb +1 -1
data/test/feature.rb +5 -5
data/test/model-classification.rb +5 -3
data/test/model-regression.rb +14 -14
data/test/model-validation.rb +1 -1
data/test/setup.rb +2 -0
data/test/validation-classification.rb +1 -1
data/test/validation-regression.rb +1 -1
metadata +2 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: c17dc3fb7cae4c75aca1be7c0a6286cfbc3f22ce
-  data.tar.gz: 5b9fb4bae6230e427188e0c8e34153fd5a6efa0a
+  metadata.gz: 698fd96821077269f4c31fdfca6bead6beab36f0
+  data.tar.gz: 2fd49abf99e8f5367b83764735c5c2e49caad4d2
 SHA512:
-  metadata.gz: 7cae1ffb410cd9a2d1afd1516ebf99499e2b2447af8707a4381adb652cb59711e1875c11e80cec8fc101f8368224ab21bc378f685b0084ab29c631d798145dca
-  data.tar.gz: d01273022852b6a0b59941a0e881a85ed1400a984912018d97fc137f8ab602cff1fd6f5fb42a65df5d9375cb43ca2809adeefbde7a0e385fd832c189df0da031
+  metadata.gz: ef5d243c765f5d1d6bb4c2dbd53bf2cdcca380df532cef8c48447b907ff01086f1b50c261e1f494827a30e52e7d8e62e2c077334cdbe30370546680ffd018886
+  data.tar.gz: d9bb905832388dcb44bb3211fb6976c4e0894d75eb02fdf60c17fe19f25dec73ba24b5585fea8cb85e77b6793b46518462cf1d9b2488c465332b9db3af132028

data/README.md CHANGED

@@ -59,7 +59,75 @@ Execute the following commands either from an interactive Ruby shell or a Ruby s
 #### Experiment with other algorithms
-  You can pass algorithms parameters to the `Model::Validation.create_from_csv_file` command. The [API documentation](http://rdoc.info/gems/lazar) provides detailed instructions.
+  You can pass algorithm specifications as parameters to the `Model::Validation.create_from_csv_file` and `Model::Lazar.create` commands. Algorithms for descriptors, similarity calculations, feature_selection and local models are specified in the `algorithm` parameter. Unspecified algorithms and parameters are substituted by default values. The example below selects
+  - MP2D fingerprint descriptors
+  - Tanimoto similarity with a threshold of 0.1
+  - no feature selection
+  - weighted majority vote predictions
+  ```
+algorithms = {
+  :descriptors => { # descriptor algorithm
+    :method => "fingerprint", # fingerprint descriptors
+    :type => "MP2D" # fingerprint type, e.g. FP4, MACCS
+  },
+  :similarity => { # similarity algorithm
+    :method => "Algorithm::Similarity.tanimoto",
+    :min => 0.1 # similarity threshold for neighbors
+  },
+  :feature_selection => nil, # no feature selection
+  :prediction => { # local modelling algorithm
+    :method => "Algorithm::Classification.weighted_majority_vote",
+  },
+}
+training_dataset = Dataset.from_csv_file "hamster_carcinogenicity.csv"
+model = Model::Lazar.create  training_dataset: training_dataset, algorithms: algorithms
+  ```
+  The next example creates a regression model with
+  - calculated descriptors from OpenBabel libraries
+  - weighted cosine similarity and a threshold of 0.5
+  - descriptors that are correlated with the endpoint
+  - local partial least squares models from the R caret package
+  ```
+algorithms = {
+  :descriptors => { # descriptor algorithm
+    :method => "calculate_properties",
+    :features => PhysChem.openbabel_descriptors,
+  },
+  :similarity => { # similarity algorithm
+    :method => "Algorithm::Similarity.weighted_cosine",
+    :min => 0.5
+  },
+  :feature_selection => { # feature selection algorithm
+    :method => "Algorithm::FeatureSelection.correlation_filter",
+  },
+  :prediction => { # local modelling algorithm
+    :method => "Algorithm::Caret.pls",
+  },
+}
+training_dataset = Dataset.from_csv_file "EPAFHM_log10.csv"
+model = Model::Lazar.create(training_dataset:training_dataset, algorithms:algorithms)
+    ```
+Please consult the [API documentation](http://rdoc.info/gems/lazar) and [source code](https:://github.com/opentox/lazar) for up to date information about implemented algorithms:
+- Descriptor algorithms
+  - [Compounds](http://www.rubydoc.info/gems/lazar/OpenTox/Compound)
+  - [Nanoparticles](http://www.rubydoc.info/gems/lazar/OpenTox/Nanoparticle)
+- [Similarity algorithms](http://www.rubydoc.info/gems/lazar/OpenTox/Algorithm/Similarity)
+- [Feature selection algorithms](http://www.rubydoc.info/gems/lazar/OpenTox/Algorithm/FeatureSelection)
+- Local models
+  - [Classification](http://www.rubydoc.info/gems/lazar/OpenTox/Algorithm/Classification)
+  - [Regression](http://www.rubydoc.info/gems/lazar/OpenTox/Algorithm/Regression)
+  - [R caret](http://www.rubydoc.info/gems/lazar/OpenTox/Algorithm/Caret)
+You can find more working examples in the `lazar` `model-*.rb` and `validation-*.rb` [tests](https://github.com/opentox/lazar/tree/master/test).
 ### Create and use `lazar` nanoparticle models
@@ -87,7 +155,35 @@ Execute the following commands either from an interactive Ruby shell or a Ruby s
 #### Experiment with other datasets, endpoints and algorithms
-  You can pass training_dataset, prediction_feature and algorithms parameters to the `Model::Validation.create_from_enanomapper` command. The [API documentation](http://rdoc.info/gems/lazar) provides detailed instructions. Detailed documentation and validation results can be found in this [publication](https://github.com/enanomapper/nano-lazar-paper/blob/master/nano-lazar.pdf).
+  You can pass training_dataset, prediction_feature and algorithms parameters to the `Model::Validation.create_from_enanomapper` command. Procedure and options are the same as for compounds. The following commands create and validate a `nano-lazar` model with
+  - measured P-CHEM properties as descriptors
+  - descriptors selected with correlation filter
+  - weighted cosine similarity with a threshold of 0.5
+  - Caret random forests
+```
+algorithms = {
+  :descriptors => {
+    :method => "properties",
+    :categories => ["P-CHEM"],
+  },
+  :similarity => {
+    :method => "Algorithm::Similarity.weighted_cosine",
+    :min => 0.5
+  },
+  :feature_selection => {
+    :method => "Algorithm::FeatureSelection.correlation_filter",
+  },
+  :prediction => {
+    :method => "Algorithm::Caret.rf",
+  },
+}
+validation_model = Model::Validation.from_enanomapper algorithms: algorithms
+```
+  Detailed documentation and validation results for nanoparticle models can be found in this [publication](https://github.com/enanomapper/nano-lazar-paper/blob/master/nano-lazar.pdf).
 Documentation
 -------------

data/VERSION CHANGED

@@ -1 +1 @@
-1.0.1
+1.1.0

data/lib/caret.rb CHANGED

@@ -22,12 +22,11 @@ module OpenTox
         end
         if independent_variables.flatten.uniq == ["NA"] or independent_variables.flatten.uniq == []
           prediction = Algorithm::Regression::weighted_average dependent_variables:dependent_variables, weights:weights
-          prediction[:warning] = "No variables for regression model. Using weighted average of similar substances."
+          prediction[:warnings] << "No variables for regression model. Using weighted average of similar substances."
         elsif
           dependent_variables.size < 3
           prediction = Algorithm::Regression::weighted_average dependent_variables:dependent_variables, weights:weights
-          prediction[:warning] = "Insufficient number of neighbors (#{dependent_variables.size}) for regression model. Using weighted average of similar substances."
+          prediction[:warnings] << "Insufficient number of neighbors (#{dependent_variables.size}) for regression model. Using weighted average of similar substances."
         else
           dependent_variables.each_with_index do |v,i|
             dependent_variables[i] = to_r(v)
@@ -52,7 +51,7 @@ module OpenTox
             $logger.debug dependent_variables
             $logger.debug independent_variables
             prediction = Algorithm::Regression::weighted_average dependent_variables:dependent_variables, weights:weights
-            prediction[:warning] = "R caret model creation error. Using weighted average of similar substances."
+            prediction[:warnings] << "R caret model creation error. Using weighted average of similar substances."
             return prediction
           end
           begin
@@ -73,12 +72,12 @@ module OpenTox
             $logger.debug "R caret prediction error for:"
             $logger.debug self.inspect
             prediction = Algorithm::Regression::weighted_average dependent_variables:dependent_variables, weights:weights
-            prediction[:warning] = "R caret prediction error. Using weighted average of similar substances"
+            prediction[:warnings] << "R caret prediction error. Using weighted average of similar substances"
             return prediction
           end
           if prediction.nil? or prediction[:value].nil?
             prediction = Algorithm::Regression::weighted_average dependent_variables:dependent_variables, weights:weights
-            prediction[:warning] = "Could not create local caret model. Using weighted average of similar substances."
+            prediction[:warnings] << "Empty R caret prediction. Using weighted average of similar substances."
           end
         end
         prediction

data/lib/classification.rb CHANGED

@@ -18,6 +18,11 @@ module OpenTox
         class_weights.each do |a,w|
           probabilities[a] = w.sum/weights.sum
         end
+        # DG: hack to ensure always two probability values
+        if probabilities.keys.uniq.size == 1
+          missing_key = probabilities.keys.uniq[0].match(/^non/) ? probabilities.keys.uniq[0].sub(/non-/,"") : "non-"+probabilities.keys.uniq[0]
+          probabilities[missing_key] = 0.0
+        end
         probabilities = probabilities.collect{|a,p| [a,weights.max*p]}.to_h
         p_max = probabilities.collect{|a,p| p}.max
         prediction = probabilities.key(p_max)
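
The hunk above pads a single-class result with its complementary label at probability 0.0. The following is a minimal standalone sketch of that idea, not lazar code: `complement_label` and `ensure_two_classes` are hypothetical names. It assumes class labels follow the `x` / `non-x` naming convention, and unlike the hunk it only treats a label as negated when it starts with the full `non-` prefix.

```ruby
# Hypothetical helpers illustrating the "always two probability values" idea.
# Assumption: class labels come in pairs such as "carcinogenic"/"non-carcinogenic".

# Return the complementary label by adding or stripping a leading "non-".
def complement_label(label)
  label.start_with?("non-") ? label.sub(/\Anon-/, "") : "non-#{label}"
end

# Add the missing class with probability 0.0 when only one class is present.
def ensure_two_classes(probabilities)
  return probabilities unless probabilities.size == 1
  probabilities.merge(complement_label(probabilities.keys.first) => 0.0)
end

padded = ensure_two_classes({ "carcinogenic" => 0.8 })
padded.each { |label, p| puts "#{label}: #{p}" }
```

Checking for the complete `non-` prefix means a label such as `nonpolar` gets the complement `non-nonpolar` rather than being mapped back onto itself.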

data/lib/crossvalidation.rb CHANGED

@@ -90,6 +90,7 @@ module OpenTox
       field :within_prediction_interval, type: Integer, default:0
       field :out_of_prediction_interval, type: Integer, default:0
       field :correlation_plot_id, type: BSON::ObjectId
+      field :warnings, type: Array
     end
     # Independent repeated crossvalidations

data/lib/dataset.rb CHANGED

@@ -46,7 +46,7 @@ module OpenTox
       if data_entries[substance.to_s] and data_entries[substance.to_s][feature.to_s]
         data_entries[substance.to_s][feature.to_s]
       else
-        nil
+        [nil]
       end
     end

data/lib/lazar.rb CHANGED

@@ -16,16 +16,19 @@ raise "Incorrect lazar environment variable LAZAR_ENV '#{ENV["LAZAR_ENV"]}', ple
 ENV["MONGOID_ENV"] = ENV["LAZAR_ENV"]
 ENV["RACK_ENV"] = ENV["LAZAR_ENV"] # should set sinatra environment
+# search for a central mongo database in use
+# http://opentox.github.io/installation/2017/03/07/use-central-mongodb-in-docker-environment
+CENTRAL_MONGO_IP = `grep -oP '^\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}(?=.*mongodb)' /etc/hosts`.chomp
 Mongoid.load_configuration({
   :clients => {
     :default => {
       :database => ENV["LAZAR_ENV"],
-      :hosts => ["localhost:27017"],
+      :hosts => (CENTRAL_MONGO_IP.blank? ? ["localhost:27017"] : ["#{CENTRAL_MONGO_IP}:27017"]),
     }
   }
 })
 Mongoid.raise_not_found_error = false # return nil if no document is found
-$mongo = Mongo::Client.new("mongodb://127.0.0.1:27017/#{ENV['LAZAR_ENV']}")
+$mongo = Mongo::Client.new("mongodb://#{(CENTRAL_MONGO_IP.blank? ? "127.0.0.1" : CENTRAL_MONGO_IP)}:27017/#{ENV['LAZAR_ENV']}")
 $gridfs = $mongo.database.fs
 # Logger setup
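
The host selection in the hunk above shells out to `grep -oP`, which requires a grep build with PCRE support. As a rough illustration of the same lookup in plain Ruby, the sketch below parses hosts-file text directly. `central_mongo_ip` and `mongo_host` are hypothetical names that do not exist in lazar, and the sample hosts entries are made up. It returns only the first matching line, whereas the grep call would return every matching line.

```ruby
# Hypothetical pure-Ruby variant of the /etc/hosts lookup (not lazar code).

# Return the leading IPv4 address of the first line that mentions "mongodb",
# or nil if there is no such line.
def central_mongo_ip(hosts_text)
  hosts_text.each_line do |line|
    next unless line.include?("mongodb")
    ip = line[/\A\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/]
    return ip if ip
  end
  nil
end

# Fall back to localhost when no central database entry is found.
def mongo_host(hosts_text)
  "#{central_mongo_ip(hosts_text) || 'localhost'}:27017"
end

sample = "127.0.0.1 localhost\n172.17.0.2 mongodb\n"  # made-up hosts content
puts mongo_host(sample)                   # prints 172.17.0.2:27017
puts mongo_host("127.0.0.1 localhost\n")  # prints localhost:27017
```

In real use the text would come from `File.read("/etc/hosts")`, wrapped in error handling for systems where the file is absent.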

data/lib/leave-one-out-validation.rb CHANGED

@@ -58,6 +58,7 @@ module OpenTox
       field :within_prediction_interval, type: Integer, default:0
       field :out_of_prediction_interval, type: Integer, default:0
       field :correlation_plot_id, type: BSON::ObjectId
+      field :warnings, type: Array
     end
   end

data/lib/model.rb CHANGED

@@ -57,7 +57,7 @@ module OpenTox
           model.version = {:warning => "git is not installed"}
         end
-        # set defaults
+        # set defaults#
         substance_classes = training_dataset.substances.collect{|s| s.class.to_s}.uniq
         bad_request_error "Cannot create models for mixed substance classes '#{substance_classes.join ', '}'." unless substance_classes.size == 1
@@ -68,10 +68,6 @@ module OpenTox
               :method => "fingerprint",
               :type => "MP2D",
             },
-            :similarity => {
-              :method => "Algorithm::Similarity.tanimoto",
-              :min => 0.1
-            },
             :feature_selection => nil
           }
@@ -79,9 +75,17 @@ module OpenTox
             model.algorithms[:prediction] = {
                 :method => "Algorithm::Classification.weighted_majority_vote",
             }
+            model.algorithms[:similarity] = {
+              :method => "Algorithm::Similarity.tanimoto",
+              :min => 0.1,
+            }
           elsif model.class == LazarRegression
             model.algorithms[:prediction] = {
-              :method => "Algorithm::Caret.pls",
+              :method => "Algorithm::Caret.rf",
+            }
+            model.algorithms[:similarity] = {
+              :method => "Algorithm::Similarity.tanimoto",
+              :min => 0.5,
             }
           end
@@ -93,7 +97,7 @@ module OpenTox
             },
             :similarity => {
               :method => "Algorithm::Similarity.weighted_cosine",
-              :min => 0.5
+              :min => 0.5,
             },
             :prediction => {
               :method => "Algorithm::Caret.rf",
@@ -141,7 +145,6 @@ module OpenTox
           end
           model.descriptor_ids = model.fingerprints.flatten.uniq
           model.descriptor_ids.each do |d|
-            # resulting model may break BSON size limit (e.g. f Kazius dataset)
             model.independent_variables << model.substance_ids.collect_with_index{|s,i| model.fingerprints[i].include? d} if model.algorithms[:prediction][:method].match /Caret/
           end
         # calculate physchem properties
@@ -191,7 +194,7 @@ module OpenTox
       # Predict a substance (compound or nanoparticle)
       # @param [OpenTox::Substance]
       # @return [Hash]
-      def predict_substance substance
+      def predict_substance substance, threshold = self.algorithms[:similarity][:min]
         @independent_variables = Marshal.load $gridfs.find_one(_id: self.independent_variables_id).data
         case algorithms[:similarity][:method]
@@ -221,20 +224,19 @@ module OpenTox
           bad_request_error "Unknown descriptor type '#{descriptors}' for similarity method '#{similarity[:method]}'."
         end
-        prediction = {}
+        prediction = {:warnings => [], :measurements => []}
+        prediction[:warnings] << "Similarity threshold #{threshold} < #{algorithms[:similarity][:min]}, prediction may be out of applicability domain." if threshold < algorithms[:similarity][:min]
         neighbor_ids = []
         neighbor_similarities = []
         neighbor_dependent_variables = []
         neighbor_independent_variables = []
-        prediction = {}
         # find neighbors
         substance_ids.each_with_index do |s,i|
           # handle query substance
           if substance.id.to_s == s
-            prediction[:measurements] ||= []
             prediction[:measurements] << dependent_variables[i]
-            prediction[:warning] = "Substance '#{substance.name}, id:#{substance.id}' has been excluded from neighbors, because it is identical with the query substance."
+            prediction[:info] = "Substance '#{substance.name}, id:#{substance.id}' has been excluded from neighbors, because it is identical with the query substance."
           else
             if fingerprints?
               neighbor_descriptors = fingerprints[i]
@@ -243,7 +245,7 @@ module OpenTox
               neighbor_descriptors = scaled_variables.collect{|v| v[i]}
             end
             sim = Algorithm.run algorithms[:similarity][:method], [similarity_descriptors, neighbor_descriptors, descriptor_weights]
-            if sim >= algorithms[:similarity][:min]
+            if sim >= threshold
               neighbor_ids << s
               neighbor_similarities << sim
               neighbor_dependent_variables << dependent_variables[i]
@@ -258,17 +260,27 @@ module OpenTox
         measurements = nil
         if neighbor_similarities.empty?
-          prediction.merge!({:value => nil,:warning => "Could not find similar substances with experimental data in the training dataset.",:neighbors => []})
+          prediction[:value] = nil
+          prediction[:warnings] << "Could not find similar substances with experimental data in the training dataset."
         elsif neighbor_similarities.size == 1
-          prediction.merge!({:value => dependent_variables.first, :probabilities => nil, :warning => "Only one similar compound in the training set. Predicting its experimental value.", :neighbors => [{:id => neighbor_ids.first, :similarity => neighbor_similarities.first}]})
+          prediction[:value] = nil
+          prediction[:warnings] << "Cannot create prediction: Only one similar compound in the training set."
+          prediction[:neighbors] = [{:id => neighbor_ids.first, :similarity => neighbor_similarities.first}]
         else
           query_descriptors.collect!{|d| d ? 1 : 0} if algorithms[:feature_selection] and algorithms[:descriptors][:method] == "fingerprint"
           # call prediction algorithm
           result = Algorithm.run algorithms[:prediction][:method], dependent_variables:neighbor_dependent_variables,independent_variables:neighbor_independent_variables ,weights:neighbor_similarities, query_variables:query_descriptors
           prediction.merge! result
           prediction[:neighbors] = neighbor_ids.collect_with_index{|id,i| {:id => id, :measurement => neighbor_dependent_variables[i], :similarity => neighbor_similarities[i]}}
+          #if neighbor_similarities.max < algorithms[:similarity][:warn_min]
+            #prediction[:warnings] << "Closest neighbor has similarity < #{algorithms[:similarity][:warn_min]}. Prediction may be out of applicability domain."
+          #end
+        end
+        if prediction[:warnings].empty? or threshold < algorithms[:similarity][:min] or threshold <= 0.2
+          prediction
+        else # try again with a lower threshold
+          predict_substance substance, 0.2
         end
-        prediction
       end
       # Predict a substance (compound or nanoparticle), an array of substances or a dataset
@@ -300,7 +312,7 @@ module OpenTox
         # serialize result
         if object.is_a? Substance
           prediction = predictions[substances.first.id.to_s]
-          prediction[:neighbors].sort!{|a,b| b[1] <=> a[1]} # sort according to similarity
+          prediction[:neighbors].sort!{|a,b| b[1] <=> a[1]} if prediction[:neighbors]# sort according to similarity
           return prediction
         elsif object.is_a? Array
           return predictions
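
The `predict_substance` change above retries once with a similarity threshold of 0.2 whenever the first attempt produced any warning. The control flow can be sketched in isolation as follows. Everything here is a simplified stand-in, not lazar code: `predict_with_fallback` is a hypothetical name, the neighbor search is replaced by a plain array of similarity values, and the local model is replaced by an unweighted mean.

```ruby
# Hypothetical stand-in for the retry logic in Model::Lazar#predict_substance.
FALLBACK_THRESHOLD = 0.2

def predict_with_fallback(similarities, min:, threshold: min)
  prediction = { :warnings => [] }
  if threshold < min
    prediction[:warnings] << "Similarity threshold #{threshold} < #{min}, prediction may be out of applicability domain."
  end

  neighbors = similarities.select { |sim| sim >= threshold }
  if neighbors.size < 2
    prediction[:value] = nil
    prediction[:warnings] << "Not enough similar substances."
  else
    prediction[:value] = neighbors.sum / neighbors.size # placeholder for the local model
  end

  # Return if the prediction is clean, if this already is the retry,
  # or if the threshold cannot be lowered any further.
  if prediction[:warnings].empty? || threshold < min || threshold <= FALLBACK_THRESHOLD
    prediction
  else
    predict_with_fallback(similarities, min: min, threshold: FALLBACK_THRESHOLD)
  end
end

result = predict_with_fallback([0.3, 0.4, 0.6], min: 0.5)
puts result[:value].round(2)
puts result[:warnings]
```

With these made-up similarities the first pass finds a single neighbor at threshold 0.5, so the call recurses once with 0.2 and returns a value together with the applicability-domain warning. Because the second call has `threshold < min`, the recursion cannot repeat.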

data/lib/regression.rb CHANGED

@@ -17,7 +17,7 @@ module OpenTox
           sim_sum += weights[i]
         end if dependent_variables
         sim_sum == 0 ? prediction = nil : prediction = weighted_sum/sim_sum
-        {:value => prediction}
+        {:value => prediction, :warnings => ["Weighted average prediction, no prediction interval available."]}
       end
     end
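
Returning a `:warnings` array from the weighted average is what lets the `caret.rb` fallbacks append with `prediction[:warnings] << ...` instead of assigning a single `:warning` string. A minimal self-contained sketch of such a function follows. The loop body is an assumption reconstructed from the visible context lines (`sim_sum`, `weighted_sum`), and the sketch is a top-level method, not the `Algorithm::Regression` module method.

```ruby
# Sketch of a similarity-weighted average that returns a warnings array.
# Assumption: each dependent variable is weighted by its neighbor similarity.
def weighted_average(dependent_variables:, weights:)
  weighted_sum = 0.0
  sim_sum = 0.0
  dependent_variables.each_with_index do |value, i|
    weighted_sum += value * weights[i]
    sim_sum += weights[i]
  end
  value = sim_sum == 0 ? nil : weighted_sum / sim_sum
  { :value => value, :warnings => ["Weighted average prediction, no prediction interval available."] }
end

prediction = weighted_average(dependent_variables: [1.0, 3.0], weights: [0.5, 0.5])
# Callers can append further warnings without overwriting earlier ones.
prediction[:warnings] << "R caret prediction error. Using weighted average of similar substances"
puts prediction[:value]
puts prediction[:warnings].size
```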

data/lib/train-test-validation.rb CHANGED

@@ -27,6 +27,8 @@ module OpenTox
           end
         end
         predictions.select!{|cid,p| p[:value] and p[:measurements]}
+        # hack to avoid mongos file size limit error on large datasets
+        #predictions.each{|cid,p| p[:neighbors] = []} if model.training_dataset.name.match(/mutagenicity/i)
         validation = self.new(
           :model_id => validation_model.id,
           :test_dataset_id => test_set.id,

data/lib/unique_descriptors.rb CHANGED

@@ -48,7 +48,8 @@ UNIQUEDESCRIPTORS = [
   #"Cdk.HBondAcceptorCount", #Descriptor that calculates the number of hydrogen bond acceptors.
   #"Cdk.HBondDonorCount", #Descriptor that calculates the number of hydrogen bond donors.
   "Cdk.HybridizationRatio", #Characterizes molecular complexity in terms of carbon hybridization states.
-  "Cdk.IPMolecularLearning", #Descriptor that evaluates the ionization potential.
+  # TODO check why the next descriptor is not present in the CDK_DESCRIPTIONS variable.
+  #"Cdk.IPMolecularLearning", #Descriptor that evaluates the ionization potential.
   "Cdk.KappaShapeIndices", #Descriptor that calculates Kier and Hall kappa molecular shape indices.
   "Cdk.KierHallSmarts", #Counts the number of occurrences of the E-state fragments
   "Cdk.LargestChain", #Returns the number of atoms in the largest chain

data/lib/validation-statistics.rb CHANGED

@@ -111,6 +111,7 @@ module OpenTox
       # Get statistics
       # @return [Hash]
       def statistics
+        self.warnings = []
         self.rmse = 0
         self.mae = 0
         self.within_prediction_interval = 0
@@ -132,8 +133,10 @@ module OpenTox
               end
             end
           else
-            warnings << "No training activities for #{Compound.find(compound_id).smiles} in training dataset #{model.training_dataset_id}."
-            $logger.debug "No training activities for #{Compound.find(compound_id).smiles} in training dataset #{model.training_dataset_id}."
+            trd_id = model.training_dataset_id
+            smiles = Compound.find(cid).smiles
+            self.warnings << "No training activities for #{smiles} in training dataset #{trd_id}."
+            $logger.debug "No training activities for #{smiles} in training dataset #{trd_id}."
           end
         end
         R.assign "measurement", x
@@ -146,6 +149,7 @@ module OpenTox
         $logger.debug "RMSE #{rmse}"
         $logger.debug "MAE #{mae}"
         $logger.debug "#{percent_within_prediction_interval.round(2)}% of measurements within prediction interval"
+        $logger.debug "#{warnings}"
         save
         {
           :mae => mae,
@@ -179,8 +183,12 @@ module OpenTox
           R.assign "prediction", y
           R.eval "all = c(measurement,prediction)"
           R.eval "range = c(min(all), max(all))"
-          title = feature.name
-          title += "[#{feature.unit}]" if feature.unit and !feature.unit.blank?
+          if feature.name.match /Net cell association/ # ad hoc fix for awkward units
+            title = "log2(Net cell association [mL/ug(Mg)])"
+          else
+            title = feature.name
+            title += " [#{feature.unit}]" if feature.unit and !feature.unit.blank?
+          end
           R.eval "image = qplot(prediction,measurement,main='#{title}',xlab='Prediction',ylab='Measurement',asp=1,xlim=range, ylim=range)"
           R.eval "image = image + geom_abline(intercept=0, slope=1)"
           R.eval "ggsave(file='#{tmpfile}', plot=image)"
@@ -191,51 +199,21 @@ module OpenTox
         $gridfs.find_one(_id: correlation_plot_id).data
       end
-      # Get predictions with the largest difference between predicted and measured values
-      # @params [Fixnum] number of predictions
-      # @params [TrueClass,FalseClass,nil] include neighbors
-      # @params [TrueClass,FalseClass,nil] show common descriptors
+      # Get predictions with measurements outside of the prediction interval
       # @return [Hash]
-      def worst_predictions n: 5, show_neigbors: true, show_common_descriptors: false
-        worst_predictions = predictions.sort_by{|sid,p| -(p["value"] - p["measurements"].median).abs}[0,n]
-        worst_predictions.collect do |p|
-          substance = Substance.find(p.first)
-          prediction = p[1]
-          if show_neigbors
-            neighbors = prediction["neighbors"].collect do |n|
-              common_descriptors = []
-              if show_common_descriptors
-                common_descriptors = n["common_descriptors"].collect do |d|
-                  f=Feature.find(d)
-                  {
-                    :id => f.id.to_s,
-                    :name => "#{f.name} (#{f.conditions})",
-                    :p_value => d[:p_value],
-                    :r_squared => d[:r_squared],
-                  }
-                end
-              else
-                common_descriptors = n["common_descriptors"].size
-              end
-              {
-                :name => Substance.find(n["_id"]).name,
-                :id => n["_id"].to_s,
-                :common_descriptors => common_descriptors
-              }
-            end
-          else
-            neighbors = prediction["neighbors"].size
+      def worst_predictions
+        worst_predictions = predictions.select do |sid,p|
+          p["prediction_interval"] and p["value"] and (p["measurements"].max < p["prediction_interval"][0] or p["measurements"].min > p["prediction_interval"][1])
+        end.compact.to_h
+        worst_predictions.each do |sid,p|
+          p["error"] = (p["value"] - p["measurements"].median).abs
+          if p["measurements"].max < p["prediction_interval"][0]
+            p["distance_prediction_interval"] = (p["measurements"].max - p["prediction_interval"][0]).abs
+          elsif p["measurements"].min > p["prediction_interval"][1]
+            p["distance_prediction_interval"] = (p["measurements"].min - p["prediction_interval"][1]).abs
           end
-          {
-            :id => substance.id.to_s,
-            :name => substance.name,
-            :feature => Feature.find(prediction["prediction_feature_id"]).name,
-            :error => (prediction["value"] - prediction["measurements"].median).abs,
-            :prediction => prediction["value"],
-            :measurements => prediction["measurements"],
-            :neighbors => neighbors
-          }
         end
+        worst_predictions.sort_by{|sid,p| p["distance_prediction_interval"] }.to_h
       end
     end
   end
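
The rewritten `worst_predictions` keeps only predictions whose measurements all lie outside the prediction interval and orders them by their distance to that interval. The sketch below reproduces this selection on plain hashes so it can be run without a database. The `median` helper and the sample data are assumptions for illustration; lazar provides its own `Array#median`.

```ruby
# Hypothetical median helper (lazar extends Array with its own implementation).
def median(values)
  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Select predictions whose measurements fall entirely outside the interval
# and sort them by ascending distance to the nearest interval bound.
def worst_predictions(predictions)
  outside = predictions.select do |_id, p|
    interval = p["prediction_interval"]
    interval && p["value"] &&
      (p["measurements"].max < interval[0] || p["measurements"].min > interval[1])
  end
  outside.each do |_id, p|
    low, high = p["prediction_interval"]
    p["error"] = (p["value"] - median(p["measurements"])).abs
    p["distance_prediction_interval"] =
      if p["measurements"].max < low
        low - p["measurements"].max   # all measurements below the interval
      else
        p["measurements"].min - high  # all measurements above the interval
      end
  end
  outside.sort_by { |_id, p| p["distance_prediction_interval"] }.to_h
end

# Made-up predictions keyed by substance id.
predictions = {
  "a" => { "value" => 5.0, "prediction_interval" => [4.0, 6.0], "measurements" => [5.5] },
  "b" => { "value" => 5.0, "prediction_interval" => [4.0, 6.0], "measurements" => [9.0] },
  "c" => { "value" => 5.0, "prediction_interval" => [4.0, 6.0], "measurements" => [3.0, 3.5] },
  "d" => { "value" => nil, "prediction_interval" => nil, "measurements" => [1.0] },
}
worst_predictions(predictions).each do |id, p|
  puts "#{id}: distance #{p['distance_prediction_interval']}, error #{p['error']}"
end
```

As in the hunk, the sort is ascending, so the entry farthest from its prediction interval comes last. The method also annotates the selected prediction hashes in place.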

data/test/dataset.rb CHANGED

@@ -160,7 +160,7 @@ class DatasetTest < MiniTest::Test
         if v.numeric?
           assert_equal v.to_f, serialized[inchi][i].to_f
         else
-          assert_equal v, serialized[inchi][i]
+          assert_equal v.to_s, serialized[inchi][i].to_s
         end
       end

data/test/feature.rb CHANGED

@@ -57,20 +57,20 @@ class FeatureTest < MiniTest::Test
   def test_physchem_description
     assert_equal 346, PhysChem.descriptors.size
     assert_equal 15, PhysChem.openbabel_descriptors.size
-    assert_equal 295, PhysChem.cdk_descriptors.size
+    assert_equal 286, PhysChem.cdk_descriptors.size
     assert_equal 45, PhysChem.joelib_descriptors.size
-    assert_equal 310, PhysChem.unique_descriptors.size
+    assert_equal 309, PhysChem.unique_descriptors.size
   end
   def test_physchem
     assert_equal 346, PhysChem.descriptors.size
     c = Compound.from_smiles "CC(=O)CC(C)C"
     logP = PhysChem.find_or_create_by :name => "Openbabel.logP"
-    assert_equal 1.6215, logP.calculate(c)
+    assert_equal 1.6215, c.calculate_properties([logP]).first
     jlogP = PhysChem.find_or_create_by :name => "Joelib.LogP"
-    assert_equal 3.5951, jlogP.calculate(c)
+    assert_equal 3.5951, c.calculate_properties([jlogP]).first
     alogP = PhysChem.find_or_create_by :name => "Cdk.ALOGP.ALogP"
-    assert_equal 0.35380000000000034, alogP.calculate(c)
+    assert_equal 0.35380000000000034, c.calculate_properties([alogP]).first
   end
 end

data/test/model-classification.rb CHANGED

@@ -46,12 +46,14 @@ class LazarClassificationTest < MiniTest::Test
     assert_equal compound_dataset.compounds, prediction_dataset.compounds
     cid = prediction_dataset.compounds[7].id.to_s
-    assert_equal "Could not find similar substances with experimental data in the training dataset.", prediction_dataset.predictions[cid][:warning]
+    assert_equal "Could not find similar substances with experimental data in the training dataset.", prediction_dataset.predictions[cid][:warnings][0]
+    expectations = ["Cannot create prediction: Only one similar compound in the training set.",
+    "Could not find similar substances with experimental data in the training dataset."]
     prediction_dataset.predictions.each do |cid,pred|
-      assert_equal "Could not find similar substances with experimental data in the training dataset.", pred[:warning] if pred[:value].nil?
+      assert_includes expectations, pred[:warnings][0] if pred[:value].nil?
     end
     cid = Compound.from_smiles("CCOC(=O)N").id.to_s
-    assert_match "excluded", prediction_dataset.predictions[cid][:warning]
+    assert_match "excluded", prediction_dataset.predictions[cid][:info]
     # cleanup
     [training_dataset,model,compound_dataset,prediction_dataset].each{|o| o.delete}
   end

data/test/model-regression.rb CHANGED

@@ -10,21 +10,21 @@ class LazarRegressionTest < MiniTest::Test
       },
       :similarity => {
         :method => "Algorithm::Similarity.tanimoto",
-        :min => 0.1
+        :min => 0.5
       },
       :prediction => {
-        :method => "Algorithm::Caret.pls",
+        :method => "Algorithm::Caret.rf",
       },
       :feature_selection => nil,
     }
-    training_dataset = Dataset.from_csv_file File.join(DATA_DIR,"EPAFHM.medi_log10.csv")
+    training_dataset = Dataset.from_csv_file File.join(DATA_DIR,"EPAFHM_log10.csv")
     model = Model::Lazar.create  training_dataset: training_dataset
     assert_kind_of Model::LazarRegression, model
     assert_equal algorithms, model.algorithms
-    substance = training_dataset.substances[10]
+    substance = training_dataset.substances[145]
     prediction = model.predict substance
     assert_includes prediction[:prediction_interval][0]..prediction[:prediction_interval][1], prediction[:measurements].median, "This assertion assures that measured values are within the prediction interval. It may fail in 5% of the predictions."
-    substance = Compound.from_smiles "NC(=O)OCCC"
+    substance = Compound.from_smiles "c1ccc(cc1)Oc1ccccc1"
     prediction = model.predict substance
     refute_nil prediction[:value]
     refute_nil prediction[:prediction_interval]
@@ -59,8 +59,8 @@ class LazarRegressionTest < MiniTest::Test
     model = Model::Lazar.create training_dataset: training_dataset, algorithms: algorithms
     compound = Compound.from_smiles "CCCSCCSCC"
     prediction = model.predict compound
-    assert_equal 4, prediction[:neighbors].size
-    assert_equal 1.37, prediction[:value].round(2)
+    assert_equal 3, prediction[:neighbors].size
+    assert prediction[:value].round(2) > 1.37, "Prediction value (#{prediction[:value].round(2)}) should be larger than 1.37."
   end
   def test_local_physchem_regression
@@ -112,12 +112,12 @@ class LazarRegressionTest < MiniTest::Test
         :method => "Algorithm::Similarity.cosine",
       }
     }
-    training_dataset = Dataset.from_csv_file File.join(DATA_DIR,"EPAFHM.mini_log10.csv")
+    training_dataset = Dataset.from_csv_file File.join(DATA_DIR,"EPAFHM.medi_log10.csv")
     model = Model::Lazar.create  training_dataset: training_dataset, algorithms: algorithms
     assert_kind_of Model::LazarRegression, model
-    assert_equal "Algorithm::Caret.pls", model.algorithms[:prediction][:method]
+    assert_equal "Algorithm::Caret.rf", model.algorithms[:prediction][:method]
     assert_equal "Algorithm::Similarity.cosine", model.algorithms[:similarity][:method]
-    assert_equal 0.1, model.algorithms[:similarity][:min]
+    assert_equal 0.5, model.algorithms[:similarity][:min]
     algorithms[:descriptors].delete :features
     assert_equal algorithms[:descriptors], model.algorithms[:descriptors]
     prediction = model.predict training_dataset.substances[10]
@@ -130,14 +130,14 @@ class LazarRegressionTest < MiniTest::Test
         :method => "Algorithm::FeatureSelection.correlation_filter",
       },
     }
-    training_dataset = Dataset.from_csv_file File.join(DATA_DIR,"EPAFHM.mini_log10.csv")
+    training_dataset = Dataset.from_csv_file File.join(DATA_DIR,"EPAFHM_log10.csv")
     model = Model::Lazar.create  training_dataset: training_dataset, algorithms: algorithms
     assert_kind_of Model::LazarRegression, model
-    assert_equal "Algorithm::Caret.pls", model.algorithms[:prediction][:method]
+    assert_equal "Algorithm::Caret.rf", model.algorithms[:prediction][:method]
     assert_equal "Algorithm::Similarity.tanimoto", model.algorithms[:similarity][:method]
-    assert_equal 0.1, model.algorithms[:similarity][:min]
+    assert_equal 0.5, model.algorithms[:similarity][:min]
     assert_equal algorithms[:feature_selection][:method], model.algorithms[:feature_selection][:method]
-    prediction = model.predict training_dataset.substances[10]
+    prediction = model.predict training_dataset.substances[145]
     refute_nil prediction[:value]
   end

data/test/model-validation.rb CHANGED

@@ -12,7 +12,7 @@ class ValidationModelTest < MiniTest::Test
     m.crossvalidations.each do |cv|
       assert cv.accuracy > 0.74, "Crossvalidation accuracy (#{cv.accuracy}) should be larger than 0.75. This may happen due to an unfavorable training/test set split."
     end
-    prediction = m.predict Compound.from_smiles("CCCC(NN)C")
+    prediction = m.predict Compound.from_smiles("OCC(CN(CC(O)C)N=O)O")
     assert_equal "true", prediction[:value]
     m.delete
   end

data/test/setup.rb CHANGED

@@ -3,6 +3,8 @@ require 'minitest/autorun'
 require_relative '../lib/lazar.rb'
 #require 'lazar'
 include OpenTox
+#$mongo.database.drop
+#$gridfs = $mongo.database.fs # recreate GridFS indexes
 TEST_DIR ||= File.expand_path(File.dirname(__FILE__))
 DATA_DIR ||= File.join(TEST_DIR,"data")
 training_dataset = Dataset.where(:name => "Protein Corona Fingerprinting Predicts the Cellular Interaction of Gold and Silver Nanoparticles").first

data/test/validation-classification.rb CHANGED

@@ -47,7 +47,7 @@ class ValidationClassificationTest < MiniTest::Test
     dataset = Dataset.from_csv_file "#{DATA_DIR}/hamster_carcinogenicity.csv"
     model = Model::Lazar.create training_dataset: dataset
     loo = ClassificationLeaveOneOut.create model
-    assert_equal 14, loo.nr_unpredicted
+    assert_equal 24, loo.nr_unpredicted
     refute_empty loo.confusion_matrix
     assert loo.accuracy > 0.77
     assert loo.weighted_accuracy > loo.accuracy, "Weighted accuracy (#{loo.weighted_accuracy}) should be larger than accuracy (#{loo.accuracy})."

data/test/validation-regression.rb CHANGED

@@ -84,7 +84,7 @@ class ValidationRegressionTest < MiniTest::Test
     repeated_cv = RepeatedCrossValidation.create model
     repeated_cv.crossvalidations.each do |cv|
       assert cv.r_squared > 0.34, "R^2 (#{cv.r_squared}) should be larger than 0.034"
-      assert_operator cv.accuracy, :>, 0.7, "model accuracy < 0.7, this may happen by chance due to an unfavorable training/test set split"
+      assert cv.rmse < 0.5, "RMSE (#{cv.rmse}) should be smaller than 0.5"
     end
   end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: lazar
 version: !ruby/object:Gem::Version
-  version: 1.0.1
+  version: 1.1.0
 platform: ruby
 authors:
 - Christoph Helma, Martin Guetlein, Andreas Maunz, Micha Rautenberg, David Vorgrimmler,
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2017-01-18 00:00:00.000000000 Z
+date: 2017-05-26 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler