lazar 1.0.1 → 1.1.0

Files changed (23)

checksums.yaml +4 -4
data/README.md +98 -2
data/VERSION +1 -1
data/lib/caret.rb +5 -6
data/lib/classification.rb +5 -0
data/lib/crossvalidation.rb +1 -0
data/lib/dataset.rb +1 -1
data/lib/lazar.rb +5 -2
data/lib/leave-one-out-validation.rb +1 -0
data/lib/model.rb +30 -18
data/lib/regression.rb +1 -1
data/lib/train-test-validation.rb +2 -0
data/lib/unique_descriptors.rb +2 -1
data/lib/validation-statistics.rb +24 -46
data/test/dataset.rb +1 -1
data/test/feature.rb +5 -5
data/test/model-classification.rb +5 -3
data/test/model-regression.rb +14 -14
data/test/model-validation.rb +1 -1
data/test/setup.rb +2 -0
data/test/validation-classification.rb +1 -1
data/test/validation-regression.rb +1 -1
metadata +2 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: c17dc3fb7cae4c75aca1be7c0a6286cfbc3f22ce
-  data.tar.gz: 5b9fb4bae6230e427188e0c8e34153fd5a6efa0a
+  metadata.gz: 698fd96821077269f4c31fdfca6bead6beab36f0
+  data.tar.gz: 2fd49abf99e8f5367b83764735c5c2e49caad4d2
 SHA512:
-  metadata.gz: 7cae1ffb410cd9a2d1afd1516ebf99499e2b2447af8707a4381adb652cb59711e1875c11e80cec8fc101f8368224ab21bc378f685b0084ab29c631d798145dca
-  data.tar.gz: d01273022852b6a0b59941a0e881a85ed1400a984912018d97fc137f8ab602cff1fd6f5fb42a65df5d9375cb43ca2809adeefbde7a0e385fd832c189df0da031
+  metadata.gz: ef5d243c765f5d1d6bb4c2dbd53bf2cdcca380df532cef8c48447b907ff01086f1b50c261e1f494827a30e52e7d8e62e2c077334cdbe30370546680ffd018886
+  data.tar.gz: d9bb905832388dcb44bb3211fb6976c4e0894d75eb02fdf60c17fe19f25dec73ba24b5585fea8cb85e77b6793b46518462cf1d9b2488c465332b9db3af132028

data/README.md CHANGED

@@ -59,7 +59,75 @@ Execute the following commands either from an interactive Ruby shell or a Ruby s
 #### Experiment with other algorithms
-  You can pass algorithms parameters to the `Model::Validation.create_from_csv_file` command. The [API documentation](http://rdoc.info/gems/lazar) provides detailed instructions.
+  You can pass algorithm specifications as parameters to the `Model::Validation.create_from_csv_file` and `Model::Lazar.create` commands. Algorithms for descriptors, similarity calculations, feature_selection and local models are specified in the `algorithm` parameter. Unspecified algorithms and parameters are substituted by default values. The example below selects
+  - MP2D fingerprint descriptors
+  - Tanimoto similarity with a threshold of 0.1
+  - no feature selection
+  - weighted majority vote predictions
+  ```
+algorithms = {
+  :descriptors => { # descriptor algorithm
+    :method => "fingerprint", # fingerprint descriptors
+    :type => "MP2D" # fingerprint type, e.g. FP4, MACCS
+  },
+  :similarity => { # similarity algorithm
+    :method => "Algorithm::Similarity.tanimoto",
+    :min => 0.1 # similarity threshold for neighbors
+  },
+  :feature_selection => nil, # no feature selection
+  :prediction => { # local modelling algorithm
+    :method => "Algorithm::Classification.weighted_majority_vote",
+  },
+}
+training_dataset = Dataset.from_csv_file "hamster_carcinogenicity.csv"
+model = Model::Lazar.create  training_dataset: training_dataset, algorithms: algorithms
+  ```
+  The next example creates a regression model with
+  - calculated descriptors from OpenBabel libraries
+  - weighted cosine similarity and a threshold of 0.5
+  - descriptors that are correlated with the endpoint
+  - local partial least squares models from the R caret package
+  ```
+algorithms = {
+  :descriptors => { # descriptor algorithm
+    :method => "calculate_properties",
+    :features => PhysChem.openbabel_descriptors,
+  },
+  :similarity => { # similarity algorithm
+    :method => "Algorithm::Similarity.weighted_cosine",
+    :min => 0.5
+  },
+  :feature_selection => { # feature selection algorithm
+    :method => "Algorithm::FeatureSelection.correlation_filter",
+  },
+  :prediction => { # local modelling algorithm
+    :method => "Algorithm::Caret.pls",
+  },
+}
+training_dataset = Dataset.from_csv_file "EPAFHM_log10.csv"
+model = Model::Lazar.create(training_dataset:training_dataset, algorithms:algorithms)
+    ```
+Please consult the [API documentation](http://rdoc.info/gems/lazar) and [source code](https://github.com/opentox/lazar) for up-to-date information about implemented algorithms:
+- Descriptor algorithms
+  - [Compounds](http://www.rubydoc.info/gems/lazar/OpenTox/Compound)
+  - [Nanoparticles](http://www.rubydoc.info/gems/lazar/OpenTox/Nanoparticle)
+- [Similarity algorithms](http://www.rubydoc.info/gems/lazar/OpenTox/Algorithm/Similarity)
+- [Feature selection algorithms](http://www.rubydoc.info/gems/lazar/OpenTox/Algorithm/FeatureSelection)
+- Local models
+  - [Classification](http://www.rubydoc.info/gems/lazar/OpenTox/Algorithm/Classification)
+  - [Regression](http://www.rubydoc.info/gems/lazar/OpenTox/Algorithm/Regression)
+  - [R caret](http://www.rubydoc.info/gems/lazar/OpenTox/Algorithm/Caret)
+You can find more working examples in the `lazar` `model-*.rb` and `validation-*.rb` [tests](https://github.com/opentox/lazar/tree/master/test).
 ### Create and use `lazar` nanoparticle models
@@ -87,7 +155,35 @@ Execute the following commands either from an interactive Ruby shell or a Ruby s
 #### Experiment with other datasets, endpoints and algorithms
-  You can pass training_dataset, prediction_feature and algorithms parameters to the `Model::Validation.create_from_enanomapper` command. The [API documentation](http://rdoc.info/gems/lazar) provides detailed instructions. Detailed documentation and validation results can be found in this [publication](https://github.com/enanomapper/nano-lazar-paper/blob/master/nano-lazar.pdf).
+  You can pass training_dataset, prediction_feature and algorithms parameters to the `Model::Validation.create_from_enanomapper` command. Procedure and options are the same as for compounds. The following commands create and validate a `nano-lazar` model with
+  - measured P-CHEM properties as descriptors
+  - descriptors selected with correlation filter
+  - weighted cosine similarity with a threshold of 0.5
+  - Caret random forests
+```
+algorithms = {
+  :descriptors => {
+    :method => "properties",
+    :categories => ["P-CHEM"],
+  },
+  :similarity => {
+    :method => "Algorithm::Similarity.weighted_cosine",
+    :min => 0.5
+  },
+  :feature_selection => {
+    :method => "Algorithm::FeatureSelection.correlation_filter",
+  },
+  :prediction => {
+    :method => "Algorithm::Caret.rf",
+  },
+}
+validation_model = Model::Validation.from_enanomapper algorithms: algorithms
+```
+  Detailed documentation and validation results for nanoparticle models can be found in this [publication](https://github.com/enanomapper/nano-lazar-paper/blob/master/nano-lazar.pdf).
 Documentation
 -------------
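
Note on the README hunk above: it states that unspecified algorithms and parameters are substituted by default values. The diff does not show how lazar performs this substitution, so the following is only a minimal sketch, assuming a per-key merge. The helper names `default_algorithms` and `with_defaults` are hypothetical; the default values are copied from the 1.1.0 regression defaults in the `data/lib/model.rb` hunk of this diff.

```ruby
# Hypothetical defaults, copied from the LazarRegression branch in model.rb (1.1.0).
def default_algorithms
  {
    descriptors: { method: "fingerprint", type: "MP2D" },
    similarity: { method: "Algorithm::Similarity.tanimoto", min: 0.5 },
    prediction: { method: "Algorithm::Caret.rf" },
    feature_selection: nil
  }
end

# Hypothetical helper: user-supplied entries win, missing parameters
# inside a supplied algorithm hash fall back to the defaults.
def with_defaults(algorithms)
  default_algorithms.merge(algorithms) do |_key, default, given|
    default.is_a?(Hash) && given.is_a?(Hash) ? default.merge(given) : given
  end
end

merged = with_defaults({ similarity: { method: "Algorithm::Similarity.cosine" } })
puts merged[:similarity].inspect
puts merged[:prediction].inspect
```

In this sketch, overriding only the similarity method keeps the default `:min` threshold, which matches what the regression test in `data/test/model-regression.rb` asserts for a model created with cosine similarity.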

data/VERSION CHANGED

@@ -1 +1 @@
-1.0.1
+1.1.0

data/lib/caret.rb CHANGED

@@ -22,12 +22,11 @@ module OpenTox
         end
         if independent_variables.flatten.uniq == ["NA"] or independent_variables.flatten.uniq == []
           prediction = Algorithm::Regression::weighted_average dependent_variables:dependent_variables, weights:weights
-          prediction[:warning] = "No variables for regression model. Using weighted average of similar substances."
+          prediction[:warnings] << "No variables for regression model. Using weighted average of similar substances."
         elsif
           dependent_variables.size < 3
           prediction = Algorithm::Regression::weighted_average dependent_variables:dependent_variables, weights:weights
-          prediction[:warning] = "Insufficient number of neighbors (#{dependent_variables.size}) for regression model. Using weighted average of similar substances."
+          prediction[:warnings] << "Insufficient number of neighbors (#{dependent_variables.size}) for regression model. Using weighted average of similar substances."
         else
           dependent_variables.each_with_index do |v,i|
             dependent_variables[i] = to_r(v)
@@ -52,7 +51,7 @@ module OpenTox
             $logger.debug dependent_variables
             $logger.debug independent_variables
             prediction = Algorithm::Regression::weighted_average dependent_variables:dependent_variables, weights:weights
-            prediction[:warning] = "R caret model creation error. Using weighted average of similar substances."
+            prediction[:warnings] << "R caret model creation error. Using weighted average of similar substances."
             return prediction
           end
           begin
@@ -73,12 +72,12 @@ module OpenTox
             $logger.debug "R caret prediction error for:"
             $logger.debug self.inspect
             prediction = Algorithm::Regression::weighted_average dependent_variables:dependent_variables, weights:weights
-            prediction[:warning] = "R caret prediction error. Using weighted average of similar substances"
+            prediction[:warnings] << "R caret prediction error. Using weighted average of similar substances"
             return prediction
           end
           if prediction.nil? or prediction[:value].nil?
             prediction = Algorithm::Regression::weighted_average dependent_variables:dependent_variables, weights:weights
-            prediction[:warning] = "Could not create local caret model. Using weighted average of similar substances."
+            prediction[:warnings] << "Empty R caret prediction. Using weighted average of similar substances."
           end
         end
         prediction
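
Note on the hunks above: the single `:warning` string becomes a `:warnings` array that is appended to with `<<`. This only works if the fallback prediction already carries an array under `:warnings`, which the `data/lib/regression.rb` change in this diff provides. A minimal sketch of that interplay follows; `weighted_average` here is a simplified, hypothetical stand-in and not the gem's implementation.

```ruby
# Simplified stand-in for Algorithm::Regression.weighted_average (assumption):
# returns the similarity-weighted mean plus an initial warnings array.
def weighted_average(dependent_variables:, weights:)
  sim_sum = weights.sum
  weighted_sum = dependent_variables.each_with_index.sum { |v, i| v * weights[i] }
  value = sim_sum.zero? ? nil : weighted_sum / sim_sum
  { value: value, warnings: ["Weighted average prediction, no prediction interval available."] }
end

prediction = weighted_average(dependent_variables: [1.0, 3.0], weights: [0.5, 0.5])
# Appending keeps the earlier warning instead of overwriting it.
prediction[:warnings] << "Insufficient number of neighbors (2) for regression model. Using weighted average of similar substances."
puts prediction[:warnings]
```

With the old `prediction[:warning] = ...` assignment, each fallback could report only one message; with the array, callers see every reason a prediction was degraded.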

data/lib/classification.rb CHANGED

@@ -18,6 +18,11 @@ module OpenTox
         class_weights.each do |a,w|
           probabilities[a] = w.sum/weights.sum
         end
+        # DG: hack to ensure always two probability values
+        if probabilities.keys.uniq.size == 1
+          missing_key = probabilities.keys.uniq[0].match(/^non/) ? probabilities.keys.uniq[0].sub(/non-/,"") : "non-"+probabilities.keys.uniq[0]
+          probabilities[missing_key] = 0.0
+        end
         probabilities = probabilities.collect{|a,p| [a,weights.max*p]}.to_h
         p_max = probabilities.collect{|a,p| p}.max
         prediction = probabilities.key(p_max)
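
Note on the hunk above: when all neighbors share one class, the added lines derive the missing class name by adding or stripping a `non-` prefix and give it probability 0.0. The sketch below isolates that logic in a hypothetical helper, `ensure_two_classes`, so its behavior on different label styles can be inspected; it copies the input instead of mutating it, which is a deviation from the original.

```ruby
# Hypothetical helper mirroring the one-class hack from classification.rb.
def ensure_two_classes(probabilities)
  result = probabilities.dup
  if result.keys.uniq.size == 1
    key = result.keys.first
    missing_key = key.match(/^non/) ? key.sub(/non-/, "") : "non-" + key
    result[missing_key] = 0.0
  end
  result
end

puts ensure_two_classes({ "carcinogen" => 1.0 }).inspect
puts ensure_two_classes({ "non-mutagenic" => 0.8 }).inspect
puts ensure_two_classes({ "true" => 1.0 }).inspect
```

The naming rule assumes class labels of the form `x` / `non-x`. For labels such as `"true"` and `"false"` the derived key is `"non-true"`, not `"false"`, which is worth keeping in mind when reading probabilities from such models.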

data/lib/crossvalidation.rb CHANGED

@@ -90,6 +90,7 @@ module OpenTox
       field :within_prediction_interval, type: Integer, default:0
       field :out_of_prediction_interval, type: Integer, default:0
       field :correlation_plot_id, type: BSON::ObjectId
+      field :warnings, type: Array
     end
     # Independent repeated crossvalidations

data/lib/dataset.rb CHANGED

@@ -46,7 +46,7 @@ module OpenTox
       if data_entries[substance.to_s] and data_entries[substance.to_s][feature.to_s]
         data_entries[substance.to_s][feature.to_s]
       else
-        nil
+        [nil]
       end
     end

data/lib/lazar.rb CHANGED

@@ -16,16 +16,19 @@ raise "Incorrect lazar environment variable LAZAR_ENV '#{ENV["LAZAR_ENV"]}', ple
 ENV["MONGOID_ENV"] = ENV["LAZAR_ENV"]
 ENV["RACK_ENV"] = ENV["LAZAR_ENV"] # should set sinatra environment
+# search for a central mongo database in use
+# http://opentox.github.io/installation/2017/03/07/use-central-mongodb-in-docker-environment
+CENTRAL_MONGO_IP = `grep -oP '^\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}(?=.*mongodb)' /etc/hosts`.chomp
 Mongoid.load_configuration({
   :clients => {
     :default => {
       :database => ENV["LAZAR_ENV"],
-      :hosts => ["localhost:27017"],
+      :hosts => (CENTRAL_MONGO_IP.blank? ? ["localhost:27017"] : ["#{CENTRAL_MONGO_IP}:27017"]),
     }
   }
 })
 Mongoid.raise_not_found_error = false # return nil if no document is found
-$mongo = Mongo::Client.new("mongodb://127.0.0.1:27017/#{ENV['LAZAR_ENV']}")
+$mongo = Mongo::Client.new("mongodb://#{(CENTRAL_MONGO_IP.blank? ? "127.0.0.1" : CENTRAL_MONGO_IP)}:27017/#{ENV['LAZAR_ENV']}")
 $gridfs = $mongo.database.fs
 # Logger setup
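
Note on the hunk above: the new code shells out to `grep -oP` to find an IPv4 address on an `/etc/hosts` line that also mentions `mongodb`, and falls back to localhost otherwise. The same lookup can be sketched in plain Ruby, which avoids the dependency on GNU grep. The helper names `central_mongo_ip` and `mongo_host` are hypothetical, and the hosts content is passed in as a string so the example does not read the real file.

```ruby
# Hypothetical pure-Ruby equivalent of the grep call: returns the first IPv4
# address at the start of a line that also contains "mongodb", or nil.
def central_mongo_ip(hosts_text)
  hosts_text[/^\d{1,3}(?:\.\d{1,3}){3}(?=.*mongodb)/]
end

# Chooses the host entry the same way the Mongoid configuration above does.
def mongo_host(hosts_text)
  ip = central_mongo_ip(hosts_text)
  ip.nil? ? "localhost:27017" : "#{ip}:27017"
end

hosts = "127.0.0.1 localhost\n172.17.0.2 mongodb\n"
puts mongo_host(hosts)
puts mongo_host("127.0.0.1 localhost\n")
```

Unlike `grep -o`, which prints every matching line, this sketch returns only the first match, so several `mongodb` entries in the hosts file cannot produce a multi-line host string.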

data/lib/leave-one-out-validation.rb CHANGED

@@ -58,6 +58,7 @@ module OpenTox
       field :within_prediction_interval, type: Integer, default:0
       field :out_of_prediction_interval, type: Integer, default:0
       field :correlation_plot_id, type: BSON::ObjectId
+      field :warnings, type: Array
     end
   end

data/lib/model.rb CHANGED

@@ -57,7 +57,7 @@ module OpenTox
           model.version = {:warning => "git is not installed"}
         end
-        # set defaults
+        # set defaults#
         substance_classes = training_dataset.substances.collect{|s| s.class.to_s}.uniq
         bad_request_error "Cannot create models for mixed substance classes '#{substance_classes.join ', '}'." unless substance_classes.size == 1
@@ -68,10 +68,6 @@ module OpenTox
               :method => "fingerprint",
               :type => "MP2D",
             },
-            :similarity => {
-              :method => "Algorithm::Similarity.tanimoto",
-              :min => 0.1
-            },
             :feature_selection => nil
           }
@@ -79,9 +75,17 @@ module OpenTox
             model.algorithms[:prediction] = {
                 :method => "Algorithm::Classification.weighted_majority_vote",
             }
+            model.algorithms[:similarity] = {
+              :method => "Algorithm::Similarity.tanimoto",
+              :min => 0.1,
+            }
           elsif model.class == LazarRegression
             model.algorithms[:prediction] = {
-              :method => "Algorithm::Caret.pls",
+              :method => "Algorithm::Caret.rf",
+            }
+            model.algorithms[:similarity] = {
+              :method => "Algorithm::Similarity.tanimoto",
+              :min => 0.5,
             }
           end
@@ -93,7 +97,7 @@ module OpenTox
             },
             :similarity => {
               :method => "Algorithm::Similarity.weighted_cosine",
-              :min => 0.5
+              :min => 0.5,
             },
             :prediction => {
               :method => "Algorithm::Caret.rf",
@@ -141,7 +145,6 @@ module OpenTox
           end
           model.descriptor_ids = model.fingerprints.flatten.uniq
           model.descriptor_ids.each do |d|
-            # resulting model may break BSON size limit (e.g. f Kazius dataset)
             model.independent_variables << model.substance_ids.collect_with_index{|s,i| model.fingerprints[i].include? d} if model.algorithms[:prediction][:method].match /Caret/
           end
         # calculate physchem properties
@@ -191,7 +194,7 @@ module OpenTox
       # Predict a substance (compound or nanoparticle)
       # @param [OpenTox::Substance]
       # @return [Hash]
-      def predict_substance substance
+      def predict_substance substance, threshold = self.algorithms[:similarity][:min]
         @independent_variables = Marshal.load $gridfs.find_one(_id: self.independent_variables_id).data
         case algorithms[:similarity][:method]
@@ -221,20 +224,19 @@ module OpenTox
           bad_request_error "Unknown descriptor type '#{descriptors}' for similarity method '#{similarity[:method]}'."
         end
-        prediction = {}
+        prediction = {:warnings => [], :measurements => []}
+        prediction[:warnings] << "Similarity threshold #{threshold} < #{algorithms[:similarity][:min]}, prediction may be out of applicability domain." if threshold < algorithms[:similarity][:min]
         neighbor_ids = []
         neighbor_similarities = []
         neighbor_dependent_variables = []
         neighbor_independent_variables = []
-        prediction = {}
         # find neighbors
         substance_ids.each_with_index do |s,i|
           # handle query substance
           if substance.id.to_s == s
-            prediction[:measurements] ||= []
             prediction[:measurements] << dependent_variables[i]
-            prediction[:warning] = "Substance '#{substance.name}, id:#{substance.id}' has been excluded from neighbors, because it is identical with the query substance."
+            prediction[:info] = "Substance '#{substance.name}, id:#{substance.id}' has been excluded from neighbors, because it is identical with the query substance."
           else
             if fingerprints?
               neighbor_descriptors = fingerprints[i]
@@ -243,7 +245,7 @@ module OpenTox
               neighbor_descriptors = scaled_variables.collect{|v| v[i]}
             end
             sim = Algorithm.run algorithms[:similarity][:method], [similarity_descriptors, neighbor_descriptors, descriptor_weights]
-            if sim >= algorithms[:similarity][:min]
+            if sim >= threshold
               neighbor_ids << s
               neighbor_similarities << sim
               neighbor_dependent_variables << dependent_variables[i]
@@ -258,17 +260,27 @@ module OpenTox
         measurements = nil
         if neighbor_similarities.empty?
-          prediction.merge!({:value => nil,:warning => "Could not find similar substances with experimental data in the training dataset.",:neighbors => []})
+          prediction[:value] = nil
+          prediction[:warnings] << "Could not find similar substances with experimental data in the training dataset."
         elsif neighbor_similarities.size == 1
-          prediction.merge!({:value => dependent_variables.first, :probabilities => nil, :warning => "Only one similar compound in the training set. Predicting its experimental value.", :neighbors => [{:id => neighbor_ids.first, :similarity => neighbor_similarities.first}]})
+          prediction[:value] = nil
+          prediction[:warnings] << "Cannot create prediction: Only one similar compound in the training set."
+          prediction[:neighbors] = [{:id => neighbor_ids.first, :similarity => neighbor_similarities.first}]
         else
           query_descriptors.collect!{|d| d ? 1 : 0} if algorithms[:feature_selection] and algorithms[:descriptors][:method] == "fingerprint"
           # call prediction algorithm
           result = Algorithm.run algorithms[:prediction][:method], dependent_variables:neighbor_dependent_variables,independent_variables:neighbor_independent_variables ,weights:neighbor_similarities, query_variables:query_descriptors
           prediction.merge! result
           prediction[:neighbors] = neighbor_ids.collect_with_index{|id,i| {:id => id, :measurement => neighbor_dependent_variables[i], :similarity => neighbor_similarities[i]}}
+          #if neighbor_similarities.max < algorithms[:similarity][:warn_min]
+            #prediction[:warnings] << "Closest neighbor has similarity < #{algorithms[:similarity][:warn_min]}. Prediction may be out of applicability domain."
+          #end
+        end
+        if prediction[:warnings].empty? or threshold < algorithms[:similarity][:min] or threshold <= 0.2
+          prediction
+        else # try again with a lower threshold
+          predict_substance substance, 0.2
         end
-        prediction
       end
       # Predict a substance (compound or nanoparticle), an array of substances or a dataset
@@ -300,7 +312,7 @@ module OpenTox
         # serialize result
         if object.is_a? Substance
           prediction = predictions[substances.first.id.to_s]
-          prediction[:neighbors].sort!{|a,b| b[1] <=> a[1]} # sort according to similarity
+          prediction[:neighbors].sort!{|a,b| b[1] <=> a[1]} if prediction[:neighbors]# sort according to similarity
           return prediction
         elsif object.is_a? Array
           return predictions
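
Note on the `predict_substance` hunks above: a prediction that ends with warnings is retried once with a similarity threshold of 0.2, and the retried result is flagged as possibly outside the applicability domain. The control flow can be sketched on a toy model as follows. `predict_with_fallback` is a hypothetical function, the `similarities` hash stands in for the neighbor search, and `:placeholder` stands in for the real local model output.

```ruby
# Toy version of the retry logic: similarities maps neighbor id => similarity.
def predict_with_fallback(similarities, min:, threshold: min, fallback: 0.2)
  prediction = { warnings: [] }
  if threshold < min
    prediction[:warnings] << "Similarity threshold #{threshold} < #{min}, prediction may be out of applicability domain."
  end
  neighbors = similarities.select { |_id, sim| sim >= threshold }
  if neighbors.empty?
    prediction[:value] = nil
    prediction[:warnings] << "Could not find similar substances with experimental data in the training dataset."
  elsif neighbors.size == 1
    prediction[:value] = nil
    prediction[:warnings] << "Cannot create prediction: Only one similar compound in the training set."
  else
    prediction[:value] = :placeholder # the real code calls the local model here
    prediction[:neighbors] = neighbors.keys
  end
  # Return if clean, if this already is the relaxed pass, or if relaxing cannot help.
  if prediction[:warnings].empty? || threshold < min || threshold <= fallback
    prediction
  else
    predict_with_fallback(similarities, min: min, threshold: fallback, fallback: fallback)
  end
end

relaxed = predict_with_fallback({ "a" => 0.6, "b" => 0.3, "c" => 0.25 }, min: 0.5)
strict = predict_with_fallback({ "a" => 0.9, "b" => 0.8 }, min: 0.5)
puts relaxed.inspect
puts strict.inspect
```

Two consequences follow from the termination condition: the recursion runs at most one extra pass, and a model whose configured minimum is already 0.2 or lower (for example the classification default of 0.1) is never retried.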

data/lib/regression.rb CHANGED

@@ -17,7 +17,7 @@ module OpenTox
           sim_sum += weights[i]
         end if dependent_variables
         sim_sum == 0 ? prediction = nil : prediction = weighted_sum/sim_sum
-        {:value => prediction}
+        {:value => prediction, :warnings => ["Weighted average prediction, no prediction interval available."]}
       end
     end

data/lib/train-test-validation.rb CHANGED

@@ -27,6 +27,8 @@ module OpenTox
           end
         end
         predictions.select!{|cid,p| p[:value] and p[:measurements]}
+        # hack to avoid mongos file size limit error on large datasets
+        #predictions.each{|cid,p| p[:neighbors] = []} if model.training_dataset.name.match(/mutagenicity/i)
         validation = self.new(
           :model_id => validation_model.id,
           :test_dataset_id => test_set.id,

data/lib/unique_descriptors.rb CHANGED

@@ -48,7 +48,8 @@ UNIQUEDESCRIPTORS = [
   #"Cdk.HBondAcceptorCount", #Descriptor that calculates the number of hydrogen bond acceptors.
   #"Cdk.HBondDonorCount", #Descriptor that calculates the number of hydrogen bond donors.
   "Cdk.HybridizationRatio", #Characterizes molecular complexity in terms of carbon hybridization states.
-  "Cdk.IPMolecularLearning", #Descriptor that evaluates the ionization potential.
+  # TODO check why the next descriptor is not present in the CDK_DESCRIPTIONS variable.
+  #"Cdk.IPMolecularLearning", #Descriptor that evaluates the ionization potential.
   "Cdk.KappaShapeIndices", #Descriptor that calculates Kier and Hall kappa molecular shape indices.
   "Cdk.KierHallSmarts", #Counts the number of occurrences of the E-state fragments
   "Cdk.LargestChain", #Returns the number of atoms in the largest chain

data/lib/validation-statistics.rb CHANGED

@@ -111,6 +111,7 @@ module OpenTox
       # Get statistics
       # @return [Hash]
       def statistics
+        self.warnings = []
         self.rmse = 0
         self.mae = 0
         self.within_prediction_interval = 0
@@ -132,8 +133,10 @@ module OpenTox
               end
             end
           else
-            warnings << "No training activities for #{Compound.find(compound_id).smiles} in training dataset #{model.training_dataset_id}."
-            $logger.debug "No training activities for #{Compound.find(compound_id).smiles} in training dataset #{model.training_dataset_id}."
+            trd_id = model.training_dataset_id
+            smiles = Compound.find(cid).smiles
+            self.warnings << "No training activities for #{smiles} in training dataset #{trd_id}."
+            $logger.debug "No training activities for #{smiles} in training dataset #{trd_id}."
           end
         end
         R.assign "measurement", x
@@ -146,6 +149,7 @@ module OpenTox
         $logger.debug "RMSE #{rmse}"
         $logger.debug "MAE #{mae}"
         $logger.debug "#{percent_within_prediction_interval.round(2)}% of measurements within prediction interval"
+        $logger.debug "#{warnings}"
         save
         {
           :mae => mae,
@@ -179,8 +183,12 @@ module OpenTox
           R.assign "prediction", y
           R.eval "all = c(measurement,prediction)"
           R.eval "range = c(min(all), max(all))"
-          title = feature.name
-          title += "[#{feature.unit}]" if feature.unit and !feature.unit.blank?
+          if feature.name.match /Net cell association/ # ad hoc fix for awkward units
+            title = "log2(Net cell association [mL/ug(Mg)])"
+          else
+            title = feature.name
+            title += " [#{feature.unit}]" if feature.unit and !feature.unit.blank?
+          end
           R.eval "image = qplot(prediction,measurement,main='#{title}',xlab='Prediction',ylab='Measurement',asp=1,xlim=range, ylim=range)"
           R.eval "image = image + geom_abline(intercept=0, slope=1)"
           R.eval "ggsave(file='#{tmpfile}', plot=image)"
@@ -191,51 +199,21 @@ module OpenTox
         $gridfs.find_one(_id: correlation_plot_id).data
       end
-      # Get predictions with the largest difference between predicted and measured values
-      # @params [Fixnum] number of predictions
-      # @params [TrueClass,FalseClass,nil] include neighbors
-      # @params [TrueClass,FalseClass,nil] show common descriptors
+      # Get predictions with measurements outside of the prediction interval
       # @return [Hash]
-      def worst_predictions n: 5, show_neigbors: true, show_common_descriptors: false
-        worst_predictions = predictions.sort_by{|sid,p| -(p["value"] - p["measurements"].median).abs}[0,n]
-        worst_predictions.collect do |p|
-          substance = Substance.find(p.first)
-          prediction = p[1]
-          if show_neigbors
-            neighbors = prediction["neighbors"].collect do |n|
-              common_descriptors = []
-              if show_common_descriptors
-                common_descriptors = n["common_descriptors"].collect do |d|
-                  f=Feature.find(d)
-                  {
-                    :id => f.id.to_s,
-                    :name => "#{f.name} (#{f.conditions})",
-                    :p_value => d[:p_value],
-                    :r_squared => d[:r_squared],
-                  }
-                end
-              else
-                common_descriptors = n["common_descriptors"].size
-              end
-              {
-                :name => Substance.find(n["_id"]).name,
-                :id => n["_id"].to_s,
-                :common_descriptors => common_descriptors
-              }
-            end
-          else
-            neighbors = prediction["neighbors"].size
+      def worst_predictions
+        worst_predictions = predictions.select do |sid,p|
+          p["prediction_interval"] and p["value"] and (p["measurements"].max < p["prediction_interval"][0] or p["measurements"].min > p["prediction_interval"][1])
+        end.compact.to_h
+        worst_predictions.each do |sid,p|
+          p["error"] = (p["value"] - p["measurements"].median).abs
+          if p["measurements"].max < p["prediction_interval"][0]
+            p["distance_prediction_interval"] = (p["measurements"].max - p["prediction_interval"][0]).abs
+          elsif p["measurements"].min > p["prediction_interval"][1]
+            p["distance_prediction_interval"] = (p["measurements"].min - p["prediction_interval"][1]).abs
           end
-          {
-            :id => substance.id.to_s,
-            :name => substance.name,
-            :feature => Feature.find(prediction["prediction_feature_id"]).name,
-            :error => (prediction["value"] - prediction["measurements"].median).abs,
-            :prediction => prediction["value"],
-            :measurements => prediction["measurements"],
-            :neighbors => neighbors
-          }
         end
+        worst_predictions.sort_by{|sid,p| p["distance_prediction_interval"] }.to_h
       end
     end
   end
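
Note on the `worst_predictions` rewrite above: the method no longer returns the n largest errors but every prediction whose measurements all fall outside the prediction interval, annotated with the error and the distance to the interval. The sketch below reproduces that selection on plain hashes. It is a standalone approximation: `median` is defined locally because core Ruby arrays have no such method (lazar is assumed to provide its own), and the sample data is invented for illustration.

```ruby
# Local median helper (assumption: lazar supplies Array#median itself).
def median(values)
  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Selects predictions whose measurements lie entirely below or above the interval.
def worst_predictions(predictions)
  outside = predictions.select do |_sid, p|
    p["prediction_interval"] && p["value"] &&
      (p["measurements"].max < p["prediction_interval"][0] ||
       p["measurements"].min > p["prediction_interval"][1])
  end
  outside.each do |_sid, p|
    low, high = p["prediction_interval"]
    p["error"] = (p["value"] - median(p["measurements"])).abs
    p["distance_prediction_interval"] =
      if p["measurements"].max < low
        (p["measurements"].max - low).abs
      else
        (p["measurements"].min - high).abs
      end
  end
  outside.sort_by { |_sid, p| p["distance_prediction_interval"] }.to_h
end

sample = {
  "s1" => { "value" => 2.0, "measurements" => [2.1], "prediction_interval" => [1.5, 2.5] },
  "s2" => { "value" => 2.0, "measurements" => [4.0, 5.0], "prediction_interval" => [1.0, 3.0] },
  "s3" => { "value" => 3.0, "measurements" => [0.5], "prediction_interval" => [1.0, 5.0] },
  "s4" => { "value" => nil, "measurements" => [1.0], "prediction_interval" => nil }
}
worst = worst_predictions(sample)
puts worst.keys.inspect
```

As in the hunk, the result is sorted by ascending distance, so the entries farthest from their prediction interval come last.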

data/test/dataset.rb CHANGED

@@ -160,7 +160,7 @@ class DatasetTest < MiniTest::Test
         if v.numeric?
           assert_equal v.to_f, serialized[inchi][i].to_f
         else
-          assert_equal v, serialized[inchi][i]
+          assert_equal v.to_s, serialized[inchi][i].to_s
         end
       end

data/test/feature.rb CHANGED

@@ -57,20 +57,20 @@ class FeatureTest < MiniTest::Test
   def test_physchem_description
     assert_equal 346, PhysChem.descriptors.size
     assert_equal 15, PhysChem.openbabel_descriptors.size
-    assert_equal 295, PhysChem.cdk_descriptors.size
+    assert_equal 286, PhysChem.cdk_descriptors.size
     assert_equal 45, PhysChem.joelib_descriptors.size
-    assert_equal 310, PhysChem.unique_descriptors.size
+    assert_equal 309, PhysChem.unique_descriptors.size
   end
   def test_physchem
     assert_equal 346, PhysChem.descriptors.size
     c = Compound.from_smiles "CC(=O)CC(C)C"
     logP = PhysChem.find_or_create_by :name => "Openbabel.logP"
-    assert_equal 1.6215, logP.calculate(c)
+    assert_equal 1.6215, c.calculate_properties([logP]).first
     jlogP = PhysChem.find_or_create_by :name => "Joelib.LogP"
-    assert_equal 3.5951, jlogP.calculate(c)
+    assert_equal 3.5951, c.calculate_properties([jlogP]).first
     alogP = PhysChem.find_or_create_by :name => "Cdk.ALOGP.ALogP"
-    assert_equal 0.35380000000000034, alogP.calculate(c)
+    assert_equal 0.35380000000000034, c.calculate_properties([alogP]).first
   end
 end

data/test/model-classification.rb CHANGED

@@ -46,12 +46,14 @@ class LazarClassificationTest < MiniTest::Test
     assert_equal compound_dataset.compounds, prediction_dataset.compounds
     cid = prediction_dataset.compounds[7].id.to_s
-    assert_equal "Could not find similar substances with experimental data in the training dataset.", prediction_dataset.predictions[cid][:warning]
+    assert_equal "Could not find similar substances with experimental data in the training dataset.", prediction_dataset.predictions[cid][:warnings][0]
+    expectations = ["Cannot create prediction: Only one similar compound in the training set.",
+    "Could not find similar substances with experimental data in the training dataset."]
     prediction_dataset.predictions.each do |cid,pred|
-      assert_equal "Could not find similar substances with experimental data in the training dataset.", pred[:warning] if pred[:value].nil?
+      assert_includes expectations, pred[:warnings][0] if pred[:value].nil?
     end
     cid = Compound.from_smiles("CCOC(=O)N").id.to_s
-    assert_match "excluded", prediction_dataset.predictions[cid][:warning]
+    assert_match "excluded", prediction_dataset.predictions[cid][:info]
     # cleanup
     [training_dataset,model,compound_dataset,prediction_dataset].each{|o| o.delete}
   end

data/test/model-regression.rb CHANGED

@@ -10,21 +10,21 @@ class LazarRegressionTest < MiniTest::Test
       },
       :similarity => {
         :method => "Algorithm::Similarity.tanimoto",
-        :min => 0.1
+        :min => 0.5
       },
       :prediction => {
-        :method => "Algorithm::Caret.pls",
+        :method => "Algorithm::Caret.rf",
       },
       :feature_selection => nil,
     }
-    training_dataset = Dataset.from_csv_file File.join(DATA_DIR,"EPAFHM.medi_log10.csv")
+    training_dataset = Dataset.from_csv_file File.join(DATA_DIR,"EPAFHM_log10.csv")
     model = Model::Lazar.create  training_dataset: training_dataset
     assert_kind_of Model::LazarRegression, model
     assert_equal algorithms, model.algorithms
-    substance = training_dataset.substances[10]
+    substance = training_dataset.substances[145]
     prediction = model.predict substance
     assert_includes prediction[:prediction_interval][0]..prediction[:prediction_interval][1], prediction[:measurements].median, "This assertion assures that measured values are within the prediction interval. It may fail in 5% of the predictions."
-    substance = Compound.from_smiles "NC(=O)OCCC"
+    substance = Compound.from_smiles "c1ccc(cc1)Oc1ccccc1"
     prediction = model.predict substance
     refute_nil prediction[:value]
     refute_nil prediction[:prediction_interval]
@@ -59,8 +59,8 @@ class LazarRegressionTest < MiniTest::Test
     model = Model::Lazar.create training_dataset: training_dataset, algorithms: algorithms
     compound = Compound.from_smiles "CCCSCCSCC"
     prediction = model.predict compound
-    assert_equal 4, prediction[:neighbors].size
-    assert_equal 1.37, prediction[:value].round(2)
+    assert_equal 3, prediction[:neighbors].size
+    assert prediction[:value].round(2) > 1.37, "Prediction value (#{prediction[:value].round(2)}) should be larger than 1.37."
   end
   def test_local_physchem_regression
@@ -112,12 +112,12 @@ class LazarRegressionTest < MiniTest::Test
         :method => "Algorithm::Similarity.cosine",
       }
     }
-    training_dataset = Dataset.from_csv_file File.join(DATA_DIR,"EPAFHM.mini_log10.csv")
+    training_dataset = Dataset.from_csv_file File.join(DATA_DIR,"EPAFHM.medi_log10.csv")
     model = Model::Lazar.create  training_dataset: training_dataset, algorithms: algorithms
     assert_kind_of Model::LazarRegression, model
-    assert_equal "Algorithm::Caret.pls", model.algorithms[:prediction][:method]
+    assert_equal "Algorithm::Caret.rf", model.algorithms[:prediction][:method]
     assert_equal "Algorithm::Similarity.cosine", model.algorithms[:similarity][:method]
-    assert_equal 0.1, model.algorithms[:similarity][:min]
+    assert_equal 0.5, model.algorithms[:similarity][:min]
     algorithms[:descriptors].delete :features
     assert_equal algorithms[:descriptors], model.algorithms[:descriptors]
     prediction = model.predict training_dataset.substances[10]
@@ -130,14 +130,14 @@ class LazarRegressionTest < MiniTest::Test
         :method => "Algorithm::FeatureSelection.correlation_filter",
       },
     }
-    training_dataset = Dataset.from_csv_file File.join(DATA_DIR,"EPAFHM.mini_log10.csv")
+    training_dataset = Dataset.from_csv_file File.join(DATA_DIR,"EPAFHM_log10.csv")
     model = Model::Lazar.create  training_dataset: training_dataset, algorithms: algorithms
     assert_kind_of Model::LazarRegression, model
-    assert_equal "Algorithm::Caret.pls", model.algorithms[:prediction][:method]
+    assert_equal "Algorithm::Caret.rf", model.algorithms[:prediction][:method]
     assert_equal "Algorithm::Similarity.tanimoto", model.algorithms[:similarity][:method]
-    assert_equal 0.1, model.algorithms[:similarity][:min]
+    assert_equal 0.5, model.algorithms[:similarity][:min]
     assert_equal algorithms[:feature_selection][:method], model.algorithms[:feature_selection][:method]
-    prediction = model.predict training_dataset.substances[10]
+    prediction = model.predict training_dataset.substances[145]
     refute_nil prediction[:value]
   end

data/test/model-validation.rb CHANGED

@@ -12,7 +12,7 @@ class ValidationModelTest < MiniTest::Test
     m.crossvalidations.each do |cv|
       assert cv.accuracy > 0.74, "Crossvalidation accuracy (#{cv.accuracy}) should be larger than 0.75. This may happen due to an unfavorable training/test set split."
     end
-    prediction = m.predict Compound.from_smiles("CCCC(NN)C")
+    prediction = m.predict Compound.from_smiles("OCC(CN(CC(O)C)N=O)O")
     assert_equal "true", prediction[:value]
     m.delete
   end

data/test/setup.rb CHANGED

@@ -3,6 +3,8 @@ require 'minitest/autorun'
 require_relative '../lib/lazar.rb'
 #require 'lazar'
 include OpenTox
+#$mongo.database.drop
+#$gridfs = $mongo.database.fs # recreate GridFS indexes
 TEST_DIR ||= File.expand_path(File.dirname(__FILE__))
 DATA_DIR ||= File.join(TEST_DIR,"data")
 training_dataset = Dataset.where(:name => "Protein Corona Fingerprinting Predicts the Cellular Interaction of Gold and Silver Nanoparticles").first

data/test/validation-classification.rb CHANGED

@@ -47,7 +47,7 @@ class ValidationClassificationTest < MiniTest::Test
     dataset = Dataset.from_csv_file "#{DATA_DIR}/hamster_carcinogenicity.csv"
     model = Model::Lazar.create training_dataset: dataset
     loo = ClassificationLeaveOneOut.create model
-    assert_equal 14, loo.nr_unpredicted
+    assert_equal 24, loo.nr_unpredicted
     refute_empty loo.confusion_matrix
     assert loo.accuracy > 0.77
     assert loo.weighted_accuracy > loo.accuracy, "Weighted accuracy (#{loo.weighted_accuracy}) should be larger than accuracy (#{loo.accuracy})."

data/test/validation-regression.rb CHANGED

@@ -84,7 +84,7 @@ class ValidationRegressionTest < MiniTest::Test
     repeated_cv = RepeatedCrossValidation.create model
     repeated_cv.crossvalidations.each do |cv|
       assert cv.r_squared > 0.34, "R^2 (#{cv.r_squared}) should be larger than 0.034"
-      assert_operator cv.accuracy, :>, 0.7, "model accuracy < 0.7, this may happen by chance due to an unfavorable training/test set split"
+      assert cv.rmse < 0.5, "RMSE (#{cv.rmse}) should be smaller than 0.5"
     end
   end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: lazar
 version: !ruby/object:Gem::Version
-  version: 1.0.1
+  version: 1.1.0
 platform: ruby
 authors:
 - Christoph Helma, Martin Guetlein, Andreas Maunz, Micha Rautenberg, David Vorgrimmler,
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2017-01-18 00:00:00.000000000 Z
+date: 2017-05-26 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler