lazar 1.0.0 → 1.0.1
- checksums.yaml +4 -4
- data/README.md +64 -1
- data/VERSION +1 -1
- data/lib/algorithm.rb +1 -0
- data/lib/caret.rb +11 -2
- data/lib/classification.rb +6 -1
- data/lib/compound.rb +32 -23
- data/lib/crossvalidation.rb +22 -0
- data/lib/dataset.rb +30 -3
- data/lib/feature.rb +7 -0
- data/lib/feature_selection.rb +4 -1
- data/lib/import.rb +5 -1
- data/lib/leave-one-out-validation.rb +6 -0
- data/lib/model.rb +77 -3
- data/lib/nanoparticle.rb +19 -0
- data/lib/overwrite.rb +46 -11
- data/lib/physchem.rb +23 -5
- data/lib/regression.rb +5 -0
- data/lib/rest-client-wrapper.rb +1 -0
- data/lib/similarity.rb +22 -2
- data/lib/substance.rb +1 -0
- data/lib/train-test-validation.rb +12 -0
- data/lib/validation-statistics.rb +19 -0
- data/lib/validation.rb +3 -0
- data/test/feature.rb +2 -2
- data/test/model-nanoparticle.rb +7 -0
- data/test/nanomaterial-model-validation.rb +2 -3
- data/test/setup.rb +1 -5
- data/test/validation-regression.rb +2 -3
- metadata +50 -5
- data/lib/experiment.rb +0 -99
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c17dc3fb7cae4c75aca1be7c0a6286cfbc3f22ce
+  data.tar.gz: 5b9fb4bae6230e427188e0c8e34153fd5a6efa0a
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 7cae1ffb410cd9a2d1afd1516ebf99499e2b2447af8707a4381adb652cb59711e1875c11e80cec8fc101f8368224ab21bc378f685b0084ab29c631d798145dca
+  data.tar.gz: d01273022852b6a0b59941a0e881a85ed1400a984912018d97fc137f8ab602cff1fd6f5fb42a65df5d9375cb43ca2809adeefbde7a0e385fd832c189df0da031
data/README.md
CHANGED
@@ -26,10 +26,73 @@ Installation
 
 The output should give you more verbose information that can help in debugging (e.g. to identify missing libraries).
 
+Tutorial
+--------
+
+Execute the following commands either from an interactive Ruby shell or a Ruby script:
+
+### Create and use `lazar` models for small molecules
+
+#### Create a training dataset
+
+Create a CSV file with two columns. The first line should contain either SMILES or InChI (first column) and the endpoint (second column). The first column should contain either the SMILES or InChI of the training compounds, the second column the training compounds toxic activities (qualitative or quantitative). Use -log10 transformed values for regression datasets. Add metadata to a JSON file with the same basename containing the fields "species", "endpoint", "source" and "unit" (regression only). You can find example training data at [Github](https://github.com/opentox/lazar-public-data).
+
+#### Create and validate a `lazar` model with default algorithms and parameters
+
+`validated_model = Model::Validation.create_from_csv_file EPAFHM_log10.csv`
+
+This command will create a `lazar` model and validate it with three independent 10-fold crossvalidations.
+
+#### Inspect crossvalidation results
+
+`validated_model.crossvalidations`
+
+#### Predict a new compound
+
+Create a compound
+
+`compound = Compound.from_smiles "NC(=O)OCCC"`
+
+Predict Fathead Minnow Acute Toxicity
+
+`validated_model.predict compound`
+
+#### Experiment with other algorithms
+
+You can pass algorithms parameters to the `Model::Validation.create_from_csv_file` command. The [API documentation](http://rdoc.info/gems/lazar) provides detailed instructions.
+
+### Create and use `lazar` nanoparticle models
+
+#### Create and validate a `nano-lazar` model from eNanoMapper with default algorithms and parameters
+
+`validated_model = Model::Validation.create_from_enanomapper`
+
+This command will mirror the eNanoMapper database in the local database, create a `nano-lazar` model and validate it with five independent 10-fold crossvalidations.
+
+#### Inspect crossvalidation results
+
+`validated_model.crossvalidations`
+
+#### Predict nanoparticle toxicities
+
+Choose a random nanoparticle from the "Potein Corona" dataset
+```
+training_dataset = Dataset.where(:name => "Protein Corona Fingerprinting Predicts the Cellular Interaction of Gold and Silver Nanoparticles").first
+nanoparticle = training_dataset.substances.shuffle.first
+```
+
+Predict the "Net Cell Association" endpoint
+
+`validated_model.predict nanoparticle`
+
+#### Experiment with other datasets, endpoints and algorithms
+
+You can pass training_dataset, prediction_feature and algorithms parameters to the `Model::Validation.create_from_enanomapper` command. The [API documentation](http://rdoc.info/gems/lazar) provides detailed instructions. Detailed documentation and validation results can be found in this [publication](https://github.com/enanomapper/nano-lazar-paper/blob/master/nano-lazar.pdf).
+
 Documentation
 -------------
 * [API documentation](http://rdoc.info/gems/lazar)
 
 Copyright
 ---------
-Copyright (c) 2009-
+Copyright (c) 2009-2017 Christoph Helma, Martin Guetlein, Micha Rautenberg, Andreas Maunz, David Vorgrimmler, Denis Gebele. See LICENSE for details.
|
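The tutorial added to the README asks for a two-column CSV file with -log10 transformed values for regression data. The following is a minimal sketch of preparing such a table in plain Ruby. The compounds, the concentrations, the column header `LC50_mmol` and the helpers `neg_log10` and `training_csv` are hypothetical and only illustrate the transformation; they are not part of `lazar`.

```ruby
# Hypothetical measurements (mmol/L); illustrative values, not real data.
MEASUREMENTS = {
  "CCO"      => 100.0,
  "c1ccccc1" => 0.1
}

# -log10 transformation that the README requests for regression endpoints.
def neg_log10(value)
  raise ArgumentError, "value must be positive" unless value.positive?
  -Math.log10(value)
end

# Build the two-column table: header line first, then one row per compound.
# No quoting is applied; this assumes the structure strings contain no commas.
def training_csv(measurements, endpoint)
  rows = [["SMILES", endpoint]]
  measurements.each { |smiles, value| rows << [smiles, neg_log10(value).round(3)] }
  rows.map { |row| row.join(",") }.join("\n") + "\n"
end

csv_text = training_csv(MEASUREMENTS, "LC50_mmol")
puts csv_text
```

The text could then be saved with `File.write("my_training_data.csv", csv_text)`. In a Ruby script the file name would presumably be passed to `Model::Validation.create_from_csv_file` as a quoted string.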
data/VERSION
CHANGED
@@ -1 +1 @@
-1.0.
+1.0.1
data/lib/algorithm.rb
CHANGED
data/lib/caret.rb
CHANGED
@@ -1,9 +1,17 @@
 module OpenTox
   module Algorithm
 
+    # Ruby interface for the R caret package
+    # Caret model list: https://topepo.github.io/caret/modelList.html
     class Caret
-      # model list: https://topepo.github.io/caret/modelList.html
 
+      # Create a local R caret model and make a prediction
+      # @param [Array<Float,Bool>] dependent_variables
+      # @param [Array<Array<Float,Bool>>] independent_variables
+      # @param [Array<Float>] weights
+      # @param [String] Caret method
+      # @param [Array<Float,Bool>] query_variables
+      # @return [Hash]
       def self.create_model_and_predict dependent_variables:, independent_variables:, weights:, method:, query_variables:
         remove = []
         # remove independent_variables with single values
@@ -77,12 +85,13 @@ module OpenTox
 
       end
 
-      #
+      # Call caret methods dynamically, e.g. Caret.pls
       def self.method_missing(sym, *args, &block)
         args.first[:method] = sym.to_s
         self.create_model_and_predict args.first
       end
 
+      # Convert Ruby values to R values
       def self.to_r v
         return "F" if v == false
         return "T" if v == true
|
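The new comment "Call caret methods dynamically, e.g. Caret.pls" describes a `method_missing` dispatch: the unknown method name becomes the `method:` argument of `create_model_and_predict`. A self-contained sketch of that pattern is shown below. `CaretSketch` and its stub `create_model_and_predict` are hypothetical stand-ins that only record their arguments; no R model is built. Unlike the code in the diff, the sketch passes the options with `**`, because Ruby 3 no longer converts a positional hash into keyword arguments implicitly.

```ruby
# Stand-in for OpenTox::Algorithm::Caret; it only records what it was asked to do.
class CaretSketch
  # Hypothetical stub: the real method builds an R caret model and predicts.
  def self.create_model_and_predict(method:, **variables)
    { method: method, inputs: variables.keys }
  end

  # Any unknown class method called with an options hash is treated
  # as the name of a caret method.
  def self.method_missing(sym, *args, &block)
    return super unless args.first.is_a?(Hash)
    create_model_and_predict(**args.first.merge(method: sym.to_s))
  end
end

result = CaretSketch.pls(dependent_variables: [1.0, 2.0], query_variables: [0.5])
p result
```

Calls without an options hash fall through to `super` and raise `NoMethodError` as usual, which keeps typos from being silently interpreted as caret method names.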
data/lib/classification.rb
CHANGED
@@ -1,9 +1,14 @@
 module OpenTox
   module Algorithm
 
+    # Classification algorithms
     class Classification
 
-
+      # Weighted majority vote
+      # @param [Array<TrueClass,FalseClass>] dependent_variables
+      # @param [Array<Float>] weights
+      # @return [Hash]
+      def self.weighted_majority_vote dependent_variables:, independent_variables:nil, weights:, query_variables:nil
         class_weights = {}
         dependent_variables.each_with_index do |v,i|
           class_weights[v] ||= []
|
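The hunk above shows only the first lines of `weighted_majority_vote`, so the rest of the method is not visible here. As an illustration of the general technique, the sketch below sums the weights per class and normalises them into probabilities. The aggregation, the return keys `:value` and `:probabilities`, and the example weights are assumptions, not the gem's implementation.

```ruby
# Generic weighted majority vote: each neighbour votes for its class
# with its (similarity) weight; the class with the largest sum wins.
def weighted_majority_vote(dependent_variables:, weights:)
  class_weights = Hash.new(0.0)
  dependent_variables.each_with_index { |v, i| class_weights[v] += weights[i] }

  total = class_weights.values.sum
  probabilities = class_weights.transform_values { |w| total.zero? ? 0.0 : w / total }
  winner = class_weights.max_by { |_, w| w }&.first

  { value: winner, probabilities: probabilities }
end

vote = weighted_majority_vote(
  dependent_variables: [true, false, true],
  weights: [0.9, 0.5, 0.2]
)
p vote
```

In this example the two `true` neighbours carry a combined weight of 1.1 against 0.5 for `false`, so `true` wins.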
data/lib/compound.rb
CHANGED
@@ -2,6 +2,7 @@ CACTUS_URI="https://cactus.nci.nih.gov/chemical/structure/"
 
 module OpenTox
 
+  # Small molecules with defined chemical structures
   class Compound < Substance
     require_relative "unique_descriptors.rb"
     DEFAULT_FINGERPRINT = "MP2D"
@@ -28,6 +29,9 @@ module OpenTox
       compound
     end
 
+    # Create chemical fingerprint
+    # @param [String] fingerprint type
+    # @return [Array<String>]
     def fingerprint type=DEFAULT_FINGERPRINT
       unless fingerprints[type]
         return [] unless self.smiles
@@ -75,6 +79,9 @@ module OpenTox
       fingerprints[type]
     end
 
+    # Calculate physchem properties
+    # @param [Array<Hash>] list of descriptors
+    # @return [Array<Float>]
     def calculate_properties descriptors=PhysChem::OPENBABEL
       calculated_ids = properties.keys
       # BSON::ObjectId instances are not allowed as keys in a BSON document.
@@ -96,6 +103,10 @@ module OpenTox
       descriptors.collect{|d| properties[d.id.to_s]}
     end
 
+    # Match a SMARTS substructure
+    # @param [String] smarts
+    # @param [TrueClass,FalseClass] count matches or return true/false
+    # @return [TrueClass,FalseClass,Fixnum]
     def smarts_match smarts, count=false
       obconversion = OpenBabel::OBConversion.new
       obmol = OpenBabel::OBMol.new
@@ -116,8 +127,8 @@ module OpenTox
     # Create a compound from smiles string
     # @example
     # compound = OpenTox::Compound.from_smiles("c1ccccc1")
-    # @param [String] smiles
-    # @return [OpenTox::Compound]
+    # @param [String] smiles
+    # @return [OpenTox::Compound]
     def self.from_smiles smiles
       if smiles.match(/\s/) # spaces seem to confuse obconversion and may lead to invalid smiles
         $logger.warn "SMILES parsing failed for '#{smiles}'', SMILES string contains whitespaces."
@@ -132,9 +143,9 @@ module OpenTox
       end
     end
 
-    # Create a compound from
-    # @param
-    # @return [OpenTox::Compound]
+    # Create a compound from InChI string
+    # @param [String] InChI
+    # @return [OpenTox::Compound]
     def self.from_inchi inchi
       #smiles = `echo "#{inchi}" | "#{File.join(File.dirname(__FILE__),"..","openbabel","bin","babel")}" -iinchi - -ocan`.chomp.strip
       smiles = obconversion(inchi,"inchi","can")
@@ -145,9 +156,9 @@ module OpenTox
       end
     end
 
-    # Create a compound from
-    # @param
-    # @return [OpenTox::Compound]
+    # Create a compound from SDF
+    # @param [String] SDF
+    # @return [OpenTox::Compound]
     def self.from_sdf sdf
       # do not store sdf because it might be 2D
       Compound.from_smiles obconversion(sdf,"sdf","can")
@@ -156,40 +167,38 @@ module OpenTox
     # Create a compound from name. Relies on an external service for name lookups.
     # @example
     # compound = OpenTox::Compound.from_name("Benzene")
-    # @param
-    # @return [OpenTox::Compound]
+    # @param [String] name, can be also an InChI/InChiKey, CAS number, etc
+    # @return [OpenTox::Compound]
     def self.from_name name
       Compound.from_smiles RestClientWrapper.get(File.join(CACTUS_URI,URI.escape(name),"smiles"))
     end
 
     # Get InChI
-    # @return [String]
+    # @return [String]
     def inchi
       unless self["inchi"]
-
         result = obconversion(smiles,"smi","inchi")
-        #result = `echo "#{self.smiles}" | "#{File.join(File.dirname(__FILE__),"..","openbabel","bin","babel")}" -ismi - -oinchi`.chomp
         update(:inchi => result.chomp) if result and !result.empty?
       end
       self["inchi"]
     end
 
     # Get InChIKey
-    # @return [String]
+    # @return [String]
     def inchikey
       update(:inchikey => obconversion(smiles,"smi","inchikey")) unless self["inchikey"]
       self["inchikey"]
     end
 
     # Get (canonical) smiles
-    # @return [String]
+    # @return [String]
     def smiles
       update(:smiles => obconversion(self["smiles"],"smi","can")) unless self["smiles"]
       self["smiles"]
     end
 
-    # Get
-    # @return [String]
+    # Get SDF
+    # @return [String]
     def sdf
       if self.sdf_id.nil?
         sdf = obconversion(smiles,"smi","sdf")
@@ -209,7 +218,6 @@ module OpenTox
         update(:svg_id => $gridfs.insert_one(file))
       end
       $gridfs.find_one(_id: self.svg_id).data
-
     end
 
     # Get png image
@@ -223,26 +231,27 @@ module OpenTox
         update(:png_id => $gridfs.insert_one(file))
       end
       Base64.decode64($gridfs.find_one(_id: self.png_id).data)
-
     end
 
     # Get all known compound names. Relies on an external service for name lookups.
     # @example
     # names = compound.names
-    # @return [String]
+    # @return [Array<String>]
     def names
       update(:names => RestClientWrapper.get("#{CACTUS_URI}#{inchi}/names").split("\n")) unless self["names"]
       self["names"]
     end
 
-    #
+    # Get PubChem Compound Identifier (CID), obtained via REST call to PubChem
+    # @return [String]
     def cid
       pug_uri = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/"
       update(:cid => RestClientWrapper.post(File.join(pug_uri, "compound", "inchi", "cids", "TXT"),{:inchi => inchi}).strip) unless self["cid"]
       self["cid"]
     end
 
-    #
+    # Get ChEMBL database compound id, obtained via REST call to ChEMBL
+    # @return [String]
     def chemblid
       # https://www.ebi.ac.uk/chembldb/ws#individualCompoundByInChiKey
       uri = "https://www.ebi.ac.uk/chemblws/compounds/smiles/#{smiles}.json"
@@ -292,7 +301,7 @@ module OpenTox
       mg.to_f/molecular_weight
     end
 
-    # Calculate molecular weight of Compound with OB and store it in object
+    # Calculate molecular weight of Compound with OB and store it in compound object
     # @return [Float] molecular weight
     def molecular_weight
       mw_feature = PhysChem.find_or_create_by(:name => "Openbabel.MW")
|
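The context line `mg.to_f/molecular_weight` in the last hunk appears to convert a mass in mg into an amount in mmol, since dividing mg by a molecular weight in g/mol yields mmol. A small worked sketch of that arithmetic follows. The helper names `mg_to_mmol` and `mmol_to_mg` and the approximate molecular weight of benzene are assumptions made for illustration.

```ruby
# Hypothetical helpers; molecular_weight is given in g/mol.
def mg_to_mmol(mg, molecular_weight)
  raise ArgumentError, "molecular weight must be positive" unless molecular_weight.positive?
  mg.to_f / molecular_weight
end

# Inverse conversion: mmol back to mg.
def mmol_to_mg(mmol, molecular_weight)
  mmol.to_f * molecular_weight
end

benzene_mw = 78.11 # g/mol, approximate value
puts mg_to_mmol(78.11, benzene_mw) # prints 1.0
```

The same molecular weight is needed for both directions, which is presumably why the diff stores it in the compound object after the first calculation.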
data/lib/crossvalidation.rb
CHANGED
@@ -1,10 +1,16 @@
 module OpenTox
 
   module Validation
+
+    # Crossvalidation
     class CrossValidation < Validation
       field :validation_ids, type: Array, default: []
       field :folds, type: Integer, default: 10
 
+      # Create a crossvalidation
+      # @param [OpenTox::Model::Lazar]
+      # @param [Fixnum] number of folds
+      # @return [OpenTox::Validation::CrossValidation]
       def self.create model, n=10
         $logger.debug model.algorithms
         klass = ClassificationCrossValidation if model.is_a? Model::LazarClassification
@@ -41,14 +47,20 @@ module OpenTox
         cv
       end
 
+      # Get execution time
+      # @return [Fixnum]
       def time
         finished_at - created_at
       end
 
+      # Get individual validations
+      # @return [Array<OpenTox::Validation>]
       def validations
         validation_ids.collect{|vid| TrainTest.find vid}
       end
 
+      # Get predictions for all compounds
+      # @return [Array<Hash>]
       def predictions
         predictions = {}
         validations.each{|v| predictions.merge!(v.predictions)}
@@ -56,6 +68,7 @@ module OpenTox
       end
     end
 
+    # Crossvalidation of classification models
     class ClassificationCrossValidation < CrossValidation
       include ClassificationStatistics
       field :accept_values, type: Array
@@ -68,6 +81,7 @@ module OpenTox
       field :probability_plot_id, type: BSON::ObjectId
     end
 
+    # Crossvalidation of regression models
     class RegressionCrossValidation < CrossValidation
       include RegressionStatistics
       field :rmse, type: Float, default:0
@@ -78,10 +92,16 @@ module OpenTox
       field :correlation_plot_id, type: BSON::ObjectId
     end
 
+    # Independent repeated crossvalidations
     class RepeatedCrossValidation < Validation
       field :crossvalidation_ids, type: Array, default: []
       field :correlation_plot_id, type: BSON::ObjectId
 
+      # Create repeated crossvalidations
+      # @param [OpenTox::Model::Lazar]
+      # @param [Fixnum] number of folds
+      # @param [Fixnum] number of repeats
+      # @return [OpenTox::Validation::RepeatedCrossValidation]
       def self.create model, folds=10, repeats=3
         repeated_cross_validation = self.new
         repeats.times do |n|
@@ -92,6 +112,8 @@ module OpenTox
         repeated_cross_validation
       end
 
+      # Get crossvalidations
+      # @return [OpenTox::Validation::CrossValidation]
       def crossvalidations
         crossvalidation_ids.collect{|id| CrossValidation.find(id)}
       end
|
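`RepeatedCrossValidation.create` takes a number of folds (default 10) and a number of repeats (default 3), but the diff does not show how substances are assigned to folds. The sketch below illustrates the general idea of repeated k-fold splitting on a plain array. The random shuffle, the seeding scheme and the helper names are assumptions and may differ from what `lazar` does internally.

```ruby
# Split items into n folds; each fold serves exactly once as the test set.
def crossvalidation_splits(items, n = 10, seed: 1)
  groups = Array.new(n) { [] }
  items.shuffle(random: Random.new(seed)).each_with_index do |item, i|
    groups[i % n] << item
  end
  groups.each_index.map do |i|
    # Training set: every fold except the one used for testing.
    training = groups.each_with_index.reject { |_, j| j == i }.flat_map(&:first)
    { training: training, test: groups[i] }
  end
end

# Repeat the whole procedure with a different shuffle each time.
def repeated_crossvalidation_splits(items, folds = 10, repeats = 3)
  (1..repeats).map { |r| crossvalidation_splits(items, folds, seed: r) }
end

runs = repeated_crossvalidation_splits((1..20).to_a, 5, 3)
puts runs.size       # number of repeats
puts runs.first.size # number of folds per repeat
```

Repeating the split with different shuffles gives several independent estimates of the validation statistics instead of a single one that depends on one particular partition.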
data/lib/dataset.rb
CHANGED
@@ -3,32 +3,43 @@ require 'tempfile'
 
 module OpenTox
 
+  # Collection of substances and features
   class Dataset
 
     field :data_entries, type: Hash, default: {}
 
     # Readers
 
+    # Get all compounds
+    # @return [Array<OpenTox::Compound>]
     def compounds
       substances.select{|s| s.is_a? Compound}
     end
 
+    # Get all nanoparticles
+    # @return [Array<OpenTox::Nanoparticle>]
     def nanoparticles
       substances.select{|s| s.is_a? Nanoparticle}
     end
 
     # Get all substances
+    # @return [Array<OpenTox::Substance>]
     def substances
       @substances ||= data_entries.keys.collect{|id| OpenTox::Substance.find id}.uniq
       @substances
     end
 
     # Get all features
+    # @return [Array<OpenTox::Feature>]
     def features
       @features ||= data_entries.collect{|sid,data| data.keys.collect{|id| OpenTox::Feature.find(id)}}.flatten.uniq
       @features
     end
 
+    # Get all values for a given substance and feature
+    # @param [OpenTox::Substance,BSON::ObjectId,String] substance or substance id
+    # @param [OpenTox::Feature,BSON::ObjectId,String] feature or feature id
+    # @return [TrueClass,FalseClass,Float]
     def values substance,feature
       substance = substance.id if substance.is_a? Substance
       feature = feature.id if feature.is_a? Feature
@@ -41,6 +52,10 @@ module OpenTox
 
     # Writers
 
+    # Add a value for a given substance and feature
+    # @param [OpenTox::Substance,BSON::ObjectId,String] substance or substance id
+    # @param [OpenTox::Feature,BSON::ObjectId,String] feature or feature id
+    # @param [TrueClass,FalseClass,Float]
     def add(substance,feature,value)
       substance = substance.id if substance.is_a? Substance
       feature = feature.id if feature.is_a? Feature
@@ -87,7 +102,7 @@ module OpenTox
 
     # Serialisation
 
-    #
+    # Convert dataset to csv format including compound smiles as first column, other column headers are feature names
     # @return [String]
     def to_csv(inchi=false)
       CSV.generate() do |csv|
@@ -130,6 +145,9 @@ module OpenTox
     #end
 
     # Create a dataset from CSV file
+    # @param [File]
+    # @param [TrueClass,FalseClass] accept or reject empty values
+    # @return [OpenTox::Dataset]
     def self.from_csv_file file, accept_empty_values=false
       source = file
       name = File.basename(file,".*")
@@ -145,8 +163,10 @@ module OpenTox
       dataset
     end
 
-    #
-    #
+    # Parse data in tabular format (e.g. from csv)
+    # does a lot of guesswork in order to determine feature types
+    # @param [Array<Array>]
+    # @param [TrueClass,FalseClass] accept or reject empty values
     def parse_table table, accept_empty_values
 
       # features
@@ -225,6 +245,7 @@ module OpenTox
       save
     end
 
+    # Delete dataset
     def delete
       compounds.each{|c| c.dataset_ids.delete id.to_s}
       super
@@ -238,14 +259,20 @@ module OpenTox
     field :prediction_feature_id, type: BSON::ObjectId
     field :predictions, type: Hash, default: {}
 
+    # Get prediction feature
+    # @return [OpenTox::Feature]
     def prediction_feature
       Feature.find prediction_feature_id
     end
 
+    # Get all compounds
+    # @return [Array<OpenTox::Compound>]
     def compounds
       substances.select{|s| s.is_a? Compound}
     end
 
+    # Get all substances
+    # @return [Array<OpenTox::Substance>]
     def substances
       predictions.keys.collect{|id| Substance.find id}
     end
|
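The newly documented `values` and `add` methods both work on the `data_entries` hash, which `substances` and `features` read as substance ids mapping to feature ids. The bodies of the two methods are not part of the diff, so the following is only a sketch of how such a nested hash could be maintained. `DatasetSketch`, the string keys and the choice to collect values in arrays are assumptions. String keys are used because a comment in `compound.rb` notes that `BSON::ObjectId` instances are not allowed as keys in a BSON document.

```ruby
# In-memory stand-in for the data_entries field of OpenTox::Dataset.
class DatasetSketch
  attr_reader :data_entries

  def initialize
    @data_entries = {} # { substance_id => { feature_id => [values] } }
  end

  # Append a value for a substance/feature pair, creating levels on demand.
  def add(substance_id, feature_id, value)
    entry = (@data_entries[substance_id.to_s] ||= {})
    (entry[feature_id.to_s] ||= []) << value
  end

  # All stored values for the pair, or an empty array if there are none.
  def values(substance_id, feature_id)
    @data_entries.dig(substance_id.to_s, feature_id.to_s) || []
  end
end

dataset = DatasetSketch.new
dataset.add("substance-1", "feature-1", 1.5)
dataset.add("substance-1", "feature-1", 2.5)
p dataset.values("substance-1", "feature-1")
```

Keeping an array per pair lets a dataset hold replicate measurements for the same substance and feature.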
data/lib/feature.rb
CHANGED
@@ -8,10 +8,14 @@ module OpenTox
     field :unit, type: String
     field :conditions, type: Hash
 
+    # Is it a nominal feature
+    # @return [TrueClass,FalseClass]
     def nominal?
       self.class == NominalFeature
     end
 
+    # Is it a numeric feature
+    # @return [TrueClass,FalseClass]
     def numeric?
       self.class == NumericFeature
     end
@@ -30,6 +34,9 @@ module OpenTox
   class Smarts < NominalFeature
     field :smarts, type: String
     index "smarts" => 1
+    # Create feature from SMARTS string
+    # @param [String]
+    # @return [OpenTox::Feature]
     def self.from_smarts smarts
       self.find_or_create_by :smarts => smarts
     end
data/lib/feature_selection.rb
CHANGED
@@ -1,13 +1,16 @@
 module OpenTox
   module Algorithm
 
+    # Feature selection algorithms
     class FeatureSelection
 
+      # Select features correlated to the models prediction feature
+      # @param [OpenTox::Model::Lazar]
       def self.correlation_filter model
         relevant_features = {}
         R.assign "dependent", model.dependent_variables.collect{|v| to_r(v)}
         model.descriptor_weights = []
-        selected_variables = []
+        selected_variables = []
         selected_descriptor_ids = []
         model.independent_variables.each_with_index do |v,i|
           v.collect!{|n| to_r(n)}
CHANGED
@@ -1,12 +1,14 @@
 module OpenTox
 
+  # Import data from external databases
   module Import
 
     class Enanomapper
       include OpenTox
 
-      #
+      # Import from eNanoMapper
       def self.import
+        # time critical step: JSON parsing (>99%), Oj brings only minor speed gains (~1%)
         datasets = {}
         bundles = JSON.parse(RestClientWrapper.get('https://data.enanomapper.net/bundle?media=application%2Fjson'))["dataset"]
         bundles.each do |bundle|
@@ -20,6 +22,7 @@ module OpenTox
             uri = c["component"]["compound"]["URI"]
             uri = CGI.escape File.join(uri,"&media=application/json")
             data = JSON.parse(RestClientWrapper.get "https://data.enanomapper.net/query/compound/url/all?media=application/json&search=#{uri}")
+            source = data["dataEntry"][0]["compound"]["URI"]
             smiles = data["dataEntry"][0]["values"]["https://data.enanomapper.net/feature/http%3A%2F%2Fwww.opentox.org%2Fapi%2F1.1%23SMILESDefault"]
             names = []
             names << data["dataEntry"][0]["values"]["https://data.enanomapper.net/feature/http%3A%2F%2Fwww.opentox.org%2Fapi%2F1.1%23ChemicalNameDefault"]
@@ -31,6 +34,7 @@ module OpenTox
             else
               compound = Compound.find_or_create_by(:name => names.first,:names => names.compact)
             end
+            compound.source = source
             compound.save
             if c["relation"] == "HAS_CORE"
               core_id = compound.id.to_s
data/lib/leave-one-out-validation.rb
CHANGED
@@ -2,8 +2,12 @@ module OpenTox
 
   module Validation
 
+    # Leave one out validation
     class LeaveOneOut < Validation
 
+      # Create a leave one out validation
+      # @param [OpenTox::Model::Lazar]
+      # @return [OpenTox::Validation::LeaveOneOut]
       def self.create model
         bad_request_error "Cannot create leave one out validation for models with supervised feature selection. Please use crossvalidation instead." if model.algorithms[:feature_selection]
         $logger.debug "#{model.name}: LOO validation started"
@@ -32,6 +36,7 @@ module OpenTox
 
     end
 
+    # Leave one out validation for classification models
     class ClassificationLeaveOneOut < LeaveOneOut
       include ClassificationStatistics
       field :accept_values, type: Array
@@ -44,6 +49,7 @@ module OpenTox
       field :confidence_plot_id, type: BSON::ObjectId
     end
 
+    # Leave one out validation for regression models
     class RegressionLeaveOneOut < LeaveOneOut
       include RegressionStatistics
       field :rmse, type: Float, default: 0