pets 0.2.3 → 0.2.4

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: d5b4a8a64787d15e3741fac94c70855625e040b670d03aec8b5df2cf1f7d6a95
4
- data.tar.gz: a6fdec46047c84df897d2f63aee4ebd9acfb888c33f77a90c38b5e19821558cf
3
+ metadata.gz: 7dad87a5083408e6049bd3edeac19e8bc106d93304453e9030841a620c2d7c3a
4
+ data.tar.gz: 2176d3f49726443447c0d7e9d8f031bed39c998db02ddf36ef2cb3d28d41f2c3
5
5
  SHA512:
6
- metadata.gz: 1956197b3c36a3f4bd34722ed201f1a50cb30fb0b40976ef330186d238039e0d392b0a68b99f6478b13c288c0201f6b17859e8efe88700f05cac4f43301c1e84
7
- data.tar.gz: 96465011731776d68719b53e548cf70475f6e4467c261a70da040b22d418eff556f9c0f09092f030209fba35d41cb02ab8ce9b3819f8b4be04a30b360f569a84
6
+ metadata.gz: 0755027e17a0a986895ef6bf2ed5da7ff38c2de66486a8f41e42a8c5d189337381a167c85ec2ab4e5f3905af608ba4e5648a91b5b26737b483e94acaae8928f7
7
+ data.tar.gz: f106584e515da71e1224f6f453c0e39694bccdf9b76d172d2b28e08eea5bc8929871070f7645833e145b9ddcfe5e6b52b4ba4b18c339141cef1150b48eb8a7ba
data/Gemfile CHANGED
@@ -4,3 +4,5 @@ source "https://rubygems.org"
4
4
  gemspec
5
5
  semtools_dev_path = File.expand_path('~/dev_gems/semtools')
6
6
  gem "semtools", github: "seoanezonjic/semtools", branch: "master" if Dir.exists?(semtools_dev_path)
7
+ expcalc_dev_path = File.expand_path('~/dev_gems/expcalc')
8
+ gem "expcalc", github: "seoanezonjic/expcalc", branch: "master" if Dir.exist?(expcalc_dev_path)
data/README.md CHANGED
@@ -1,10 +1,11 @@
1
- # Pets
1
+ # PETS
2
2
 
3
- Pets (Patient exploration tools suite) include tools for the analysis of cohorts of patients with pathological phenotypes described in terms of the Human Phenotype Ontology (HPO) and the position their genomic variants clinically determined.
3
+ PETS (Patient Exploration Tools Suite) include three different tools for the analysis of cohorts of patients with pathological phenotypes described in terms of the Human Phenotype Ontology (HPO) and the position their genomic variants clinically determined.
4
4
 
5
- Pets include tools to (1) perform cohort analysis (coPatReporter.rb), (2) searching for pathological phenotypes associated with a genomic region of interest (reg2phen.rb) and (3) predict regions of the genome that potentially lead to the pathological phenotypes observed in a patient (phen2reg.rb).
5
+ It can (1) determine the quality of information within a patient cohort with Cohort Analyzer (coPatReporter.rb); (2) associate genomic regions with their pathological phenotypes based on the cohort data with Reg2Phen (reg2phen.rb), and (3) predict the possible genetic variants that cause the clinically observed pathological phenotypes using phenotype-genotype association values with Phen2Reg (phen2reg.rb).
6
+
7
+ This tool has been developed to be used by the clinical community, to facilitate patient characterisation, help identify where data quality can be improved within a cohort and help diagnose patients with complex disease. Please cite us as Rojano E., Seoane-Zonjic P., Jabato F.M., Perkins J.R., Ranea J.A.G. (2020) Comprehensive Analysis of Patients with Undiagnosed Genetic Diseases Using the Patient Exploration Tools Suite (PETS). In: Rojas I., Valenzuela O., Rojas F., Herrera L., Ortuño F. (eds) Bioinformatics and Biomedical Engineering. IWBBIO 2020. Lecture Notes in Computer Science, vol 12108. Springer, Cham. https://doi.org/10.1007/978-3-030-45385-5_69.
6
8
 
7
- Associations between pathological phenotypes and genomic regions (using genomic coordinates from GRCh37 human assembly) are previously calculated using NetAnalyzer (https://rubygems.org/gems/NetAnalyzer). Please cite us as Rojano E. et al (2017). Revealing the Relationship Between Human Genome Regions and Pathological Phenotypes Through Network Analysis. LNCS, 10208:197-207.
8
9
 
9
10
  ## Installation
10
11
 
@@ -22,9 +23,82 @@ Or install it yourself as:
22
23
 
23
24
  $ gem install pets
24
25
 
26
+
27
+ After installing PETS Gem, R dependencies must be installed. For this, the user must run the following command:
28
+
29
+ $ install_deps.rb
30
+
25
31
  ## Usage
26
32
 
27
- TODO: Write usage instructions here
33
+ ### 1) Cohort Analyzer
34
+
35
+ Cohort Analyzer measures the phenotyping quality of patient and disease cohorts by calculating multiple statistics to give a general overview of the cohort status in terms of the depth and breadth of phenotyping. It can work with cohorts defined exclusively with HPO terms or with both HPO terms and genomic coordinates.
36
+
37
+ #### Basic usage of Cohort Analyzer:
38
+
39
+ We provide an example of use of Cohort Analyzer with a dataset from Vulto-van Silfhout, A.T.; Hehir-Kwa, J.Y.; van Bon, B.W.M.; Schuurs-Hoeijmakers, J.H.M.; Meader, S.; Hellebrekers, C.J.M.; Thoonen, I.J.M.; de Brouwer, A.P.M.; Brunner, H.G.; Webber, C.; Pfundt, R.; de Leeuw, N.; De Vries, B.B.A. Clinical Significance of De Novo and Inherited Copy-Number Variation. Human Mutation 2013, 34, 1679–1687. doi:10.1002/humu.22442.
40
+
41
+ This dataset includes de novo and inherited CNVs to phenotypes related to intellectual disability/developmental delay occurring alongside multiple congenital anomalies. An example of an input file is available in the example_datasets folder within this repository and the code to execute its analysis is provided below:
42
+
43
+ ```
44
+ coPatReporter.rb -i hummu_congenital_full_dataset.txt -o results -p phenotypes -c chr -d patient_id -s start -e stop -m lin
45
+ ```
46
+
47
+ Where:
48
+
49
+ - -i -> Input cohort, a tab file with patient identifiers and the list of HPOs characterised for each patient.
50
+ - -o -> Output path.
51
+ - -p -> Column name with phenotypes.
52
+ - -c -> Column name with chromosomes.
53
+ - -d -> Column name with patient identifiers.
54
+ - -s -> Column name with start genomic coordinate.
55
+ - -e -> Column name with final genomic coordinate.
56
+ - -m -> Semantic similarity measure method.
57
+ - -C -> Maximum number of clusters to display.
58
+
59
+ Further information with all Cohort Analyzer capabilities for setup can be queried as follows:
60
+
61
+ ```
62
+ coPatReporter.rb --help
63
+ ```
64
+
65
+ ### 2) Reg2Phen
66
+
67
+ This tool is a search engine that finds phenotypes associated with genomic regions or genes of interest. It uses two input files, one with phenotype-genotype associations previously calculated, and a list of genomic coordinates or gene identifiers to find their HPO associated. We provide an example of use in the example_datasets folder within this repository and the code to execute its analysis is provided below:
68
+
69
+ ```
70
+ reg2phen.rb -t associations_file.txt -p genes.txt -b hpo_file -P -g -H -o results/patient1Genes.txt -F $current/results/patient1Genes.html
71
+ ```
72
+ Where:
73
+
74
+ - -t -> Input phenotype-genotype associations file.
75
+ - -p -> List of genes to find HPOs associated.
76
+ - -b -> HPO obo file.
77
+ - -P -> Transform association values in P-values.
78
+ - -g -> Set if genes identifiers are provided instead of genome coordinates.
79
+ - -H -> Activate HTML reporting.
80
+ - -o -> Output folder.
81
+ - -F -> Semantic similarity measure method.
82
+
83
+ Associations between pathological phenotypes and genomic regions provided in this example were calculated with NetAnalyzer (https://rubygems.org/gems/NetAnalyzer, Rojano E. et al (2017). Revealing the Relationship Between Human Genome Regions and Pathological Phenotypes Through Network Analysis. LNCS, 10208:197-207) using randomised DECIPHER data (coordinates in the GRCh37 human genome assembly) and the hypergeometric association method.
84
+
85
+ ### 3) Phen2Reg
86
+
87
+ Phen2Reg analyses the pathological phenotypes observed in a patient and predicts putative causal genomic regions. As in the case of Reg2Phen, it uses phenotype-genotype associations previously calculated. We provide an example of use in the example_datasets folder within this repository and the code to execute its analysis is provided below:
88
+
89
+ ```
90
+ phen2reg.rb -t associations_file.txt -p example_patient_hpos.txt -i hpo2ci.txt -f hpo_file -T -Q > single_phens.txt
91
+ ```
92
+ Where:
93
+
94
+ - -t -> Input phenotype-genotype associations file.
95
+ - -p -> List of HPOs characterised for a patient.
96
+ - -i -> HPO information coefficients (IC) file.
97
+ - -f -> HPO obo file.
98
+ - -T -> Deactivate HTML reporting.
99
+ - -Q -> Deactivate quality control.
100
+
101
+ Results are saved in the single_phens.txt output file.
28
102
 
29
103
  ## Development
30
104
 
data/bin/coPatReporter.rb CHANGED
@@ -1,45 +1,13 @@
1
1
  #! /usr/bin/env ruby
2
2
 
3
3
  ROOT_PATH = File.dirname(__FILE__)
4
- REPORT_FOLDER = File.expand_path(File.join(ROOT_PATH, '..', 'templates'))
5
- EXTERNAL_DATA = File.expand_path(File.join(ROOT_PATH, '..', 'external_data'))
6
- EXTERNAL_CODE = File.expand_path(File.join(ROOT_PATH, '..', 'external_code'))
7
- HPO_FILE = File.join(EXTERNAL_DATA, 'hp.json')
8
- IC_FILE = File.join(EXTERNAL_DATA, 'uniq_hpo_with_CI.txt')
9
4
  $: << File.expand_path(File.join(ROOT_PATH, '..', 'lib', 'pets'))
10
5
 
11
6
  require 'benchmark'
12
7
  require 'parallel'
13
8
  require 'optparse'
14
- require 'csv'
15
- require 'npy'
16
- require 'generalMethods.rb'
17
- require 'coPatReporterMethods.rb'
18
9
  require 'report_html'
19
- require 'semtools'
20
-
21
- #Expand class (semtools modifications if necessary):
22
- class Ontology
23
-
24
- end
25
-
26
- ##########################
27
- # FUNCTIONS
28
- ##########################
29
-
30
- def translate_codes(clusters, hpo)
31
- translated_clusters = []
32
- clusters.each do |clusterID, num_of_pats, patientIDs_ary, patient_hpos_ary|
33
- translate_codes = patient_hpos_ary.map{|patient_hpos| patient_hpos.map{|hpo_code| hpo.translate_id(hpo_code)}}
34
- translated_clusters << [clusterID,
35
- num_of_pats,
36
- patientIDs_ary,
37
- patient_hpos_ary,
38
- translate_codes
39
- ]
40
- end
41
- return translated_clusters
42
- end
10
+ require 'pets'
43
11
 
44
12
  ##########################
45
13
  #OPT-PARSER
@@ -69,9 +37,9 @@ OptionParser.new do |opts|
69
37
  options[:chromosome_col] = data
70
38
  end
71
39
 
72
- options[:pat_id_col] = nil
40
+ options[:id_col] = nil
73
41
  opts.on("-d", "--pat_id_col INTEGER/STRING", "Column name if header is true, otherwise 0-based position of the column with the patient id") do |data|
74
- options[:pat_id_col] = data
42
+ options[:id_col] = data
75
43
  end
76
44
 
77
45
  options[:excluded_hpo] = nil
@@ -120,9 +88,9 @@ OptionParser.new do |opts|
120
88
  options[:clustering_methods] = data.split(',')
121
89
  end
122
90
 
123
- options[:hpo_names] = false
91
+ options[:names] = false
124
92
  opts.on("-n", "--hpo_names", "Define if the input HPO are human readable names. Default false") do
125
- options[:hpo_names] = true
93
+ options[:names] = true
126
94
  end
127
95
 
128
96
  options[:output_file] = nil
@@ -135,14 +103,14 @@ OptionParser.new do |opts|
135
103
  options[:hpo_file] = value
136
104
  end
137
105
 
138
- options[:hpo_col] = nil
106
+ options[:ont_col] = nil
139
107
  opts.on("-p", "--hpo_term_col INTEGER/STRING", "Column name if header true or 0-based position of the column with the HPO terms") do |data|
140
- options[:hpo_col] = data
108
+ options[:ont_col] = data
141
109
  end
142
110
 
143
- options[:hpo_separator] = '|'
111
+ options[:separator] = '|'
144
112
  opts.on("-S", "--hpo_separator STRING", "Set which character must be used to split the HPO profile. Default '|'") do |data|
145
- options[:hpo_separator] = data
113
+ options[:separator] = data
146
114
  end
147
115
 
148
116
  options[:start_col] = nil
@@ -165,7 +133,15 @@ OptionParser.new do |opts|
165
133
  options[:threads] = data.to_i
166
134
  end
167
135
 
136
+ options[:reference_profiles] = nil
137
+ opts.on("--reference_profiles PATH", "Path to file tabulated file with first column as id profile and second column with ontology terms separated by separator. ") do |opt|
138
+ options[:reference_profiles] = opt
139
+ end
168
140
 
141
+ options[:sim_thr] = nil
142
+ opts.on("--sim_thr FLOAT", "Keep pairs with similarity value >= FLOAT. ") do |opt|
143
+ options[:sim_thr] = opt.to_f
144
+ end
169
145
 
170
146
  opts.on_tail("-h", "--help", "Show this message") do
171
147
  puts opts
@@ -203,80 +179,68 @@ cluster_ic_data_file = File.join(temp_folder, 'cluster_ic_data.txt')
203
179
  cluster_chromosome_data_file = File.join(temp_folder, 'cluster_chromosome_data.txt')
204
180
  coverage_to_plot_file = File.join(temp_folder, 'coverage_data.txt')
205
181
  sor_coverage_to_plot_file = File.join(temp_folder, 'sor_coverage_data.txt')
182
+ ronto_file = File.join(temp_folder, 'hpo_freq_colour')
183
+
206
184
 
207
185
  Dir.mkdir(temp_folder) if !File.exists?(temp_folder)
208
186
 
209
187
  hpo_file = !ENV['hpo_file'].nil? ? ENV['hpo_file'] : HPO_FILE
210
- hpo = load_hpo_ontology(hpo_file, options[:excluded_hpo])
188
+ Cohort.load_ontology(:hpo, hpo_file, options[:excluded_hpo])
189
+ Cohort.act_ont = :hpo
211
190
 
212
- patient_data = load_patient_cohort(options)
191
+ patient_data, rejected_hpos_L, rejected_patients_L = Cohort_Parser.load(options)
192
+ rejected_hpos_C, rejected_patients_C = patient_data.check
193
+ rejected_hpos = rejected_hpos_L | rejected_hpos_C
194
+ rejected_patients = rejected_patients_L + rejected_patients_C
195
+ File.open(rejected_file, 'w'){|f| f.puts (rejected_patients).join("\n")}
213
196
 
214
- rejected_hpos, rejected_patients = format_patient_data(patient_data, options, hpo)
215
- File.open(rejected_file, 'w'){|f| f.puts rejected_patients.join("\n")}
216
- patient_data.select!{|pat_id, patient_record| !rejected_patients.include?(pat_id)}
217
- patient_uniq_profiles, equivalence = get_uniq_hpo_profiles(patient_data)
218
- hpo.load_profiles(patient_uniq_profiles)
197
+ patient_data.link2ont(Cohort.act_ont) # TODO: check if method load should call to this and use the semtools checking methods (take care to only remove invalid terms)
219
198
 
220
- profile_sizes, parental_hpos_per_profile = get_profile_redundancy(hpo)
221
- clean_patient_profiles(hpo, patient_uniq_profiles)
222
- cohort_hpos, suggested_childs, fraction_terms_specific_childs = compute_hpo_list_and_childs(patient_uniq_profiles, hpo)
223
- ontology_levels, distribution_percentage = get_profile_ontology_distribution_tables(hpo)
199
+ profile_sizes, parental_hpos_per_profile = patient_data.get_profile_redundancy
200
+ patient_data.check(hard=true)
201
+ hpo_stats = patient_data.get_profiles_terms_frequency() # hpo NAME, freq
202
+ hpo_stats.each{ |stat| stat[1] = stat[1]*100}
203
+ File.open(hpo_frequency_file, 'w') do |f|
204
+ patient_data.get_profiles_terms_frequency(translate: false).each do |hpo_code, freq| # hpo CODE, freq
205
+ f.puts "#{hpo_code.to_s}\t#{freq}"
206
+ end
207
+ end
208
+ suggested_childs, fraction_terms_specific_childs = patient_data.compute_term_list_and_childs()
209
+ ontology_levels, distribution_percentage = patient_data.get_profile_ontology_distribution_tables()
210
+ onto_ic, freq_ic, onto_ic_profile, freq_ic_profile = patient_data.get_ic_analysis()
224
211
 
225
- onto_ic, freq_ic = hpo.get_observed_ics_by_onto_and_freq # IC for TERMS
226
- onto_ic_profile, freq_ic_profile = hpo.get_profiles_resnik_dual_ICs # IC for PROFILES
227
212
  if options[:ic_stats] == 'freq_internal'
228
- ic_file = ENV['ic_file']
229
- ic_file = IC_FILE if ic_file.nil?
213
+ ic_file = !ENV['ic_file'].nil? ? ENV['ic_file'] : IC_FILE
230
214
  freq_ic = load_hpo_ci_values(ic_file)
231
215
  phenotype_ic = freq_ic
232
216
  freq_ic_profile = {}
233
- patient_uniq_profiles.each do |pat_id, phenotypes|
217
+ patient_data.each_profile do |pat_id, phenotypes|
234
218
  freq_ic_profile[pat_id] = get_profile_ic(phenotypes, phenotype_ic)
235
219
  end
236
- else
237
- if options[:ic_stats] == 'freq'
238
- phenotype_ic = freq_ic
239
- elsif options[:ic_stats] == 'onto'
240
- phenotype_ic = onto_ic
241
- end
220
+ elsif options[:ic_stats] == 'freq'
221
+ phenotype_ic = freq_ic
222
+ elsif options[:ic_stats] == 'onto'
223
+ phenotype_ic = onto_ic
242
224
  end
243
- clustered_patients = cluster_patients(patient_uniq_profiles, cohort_hpos, matrix_file, clustered_patients_file)
244
- all_ics, profile_lengths, cluster_data_by_chromosomes, top_cluster_phenotypes, multi_chromosome_patients = process_clustered_patients(options, clustered_patients, patient_uniq_profiles, patient_data, equivalence, hpo, phenotype_ic, options[:pat_id_col])
245
- get_patient_hpo_frequency(patient_uniq_profiles, hpo_frequency_file)
246
225
 
247
- summary_stats = get_summary_stats(patient_uniq_profiles, rejected_patients, cohort_hpos, hpo)
248
- summary_stats << ['Percentage of HPO with more specific children', (fraction_terms_specific_childs * 100).round(4)]
249
- summary_stats << ['DsI for uniq HP terms', hpo.get_dataset_specifity_index('uniq')]
250
- summary_stats << ['DsI for frequency weigthed HP terms', hpo.get_dataset_specifity_index('weigthed')]
226
+ clustered_patients = dummy_cluster_patients(patient_data.profiles, matrix_file, clustered_patients_file)
227
+ all_ics, prof_lengths, clust_by_chr, top_clust_phen, multi_chr_pats = process_dummy_clustered_patients(options, clustered_patients, patient_data, phenotype_ic)
251
228
 
252
- hpo_stats = hpo.get_profiles_terms_frequency()
253
- hpo_stats.each{ |stat| stat[1] = stat[1]*100}
254
- summary_stats << ['Number of unknown phenotypes', rejected_hpos.length]
229
+ summary_stats = get_summary_stats(patient_data, rejected_patients, hpo_stats, fraction_terms_specific_childs, rejected_hpos)
255
230
 
256
231
  all_cnvs_length = []
257
232
  if !options[:chromosome_col].nil?
258
- summary_stats << ['Number of clusters with mutations accross > 1 chromosomes', multi_chromosome_patients]
233
+ summary_stats << ['Number of clusters with mutations accross > 1 chromosomes', multi_chr_pats]
259
234
 
260
235
  #----------------------------------
261
236
  # Prepare data to plot coverage
262
237
  #----------------------------------
263
238
  if options[:coverage_analysis]
264
- processed_patient_data = process_patient_data(patient_data)
265
- cnv_sizes = []
266
- processed_patient_data.each do |chr, metadata|
267
- metadata.each do |patientID, start, stop|
268
- cnv_sizes << stop - start
269
- end
270
- end
271
- cnv_size_average = cnv_sizes.inject{ |sum, el| sum + el }.fdiv(cnv_sizes.length.to_f)
272
- patients_by_cluster, sors = generate_cluster_regions(processed_patient_data, 'A', 0)
273
- total_patients_sharing_sors = []
274
- all_patients = patients_by_cluster.keys
275
- all_patients.each do |identifier|
276
- total_patients_sharing_sors << identifier.split('_i').first
277
- end
278
- all_cnvs_length = get_cnvs_length(patient_data)
279
-
239
+ patient_data.index_vars
240
+ all_cnvs_length = patient_data.get_vars_sizes(true)
241
+ cnv_size_average = get_mean_size(all_cnvs_length)
242
+ patients_by_cluster, sors = patient_data.generate_cluster_regions(:reg_overlap, 'A', 0)
243
+
280
244
  ###1. Process CNVs
281
245
  raw_coverage, n_cnv, nt, pats_per_region = calculate_coverage(sors)
282
246
  summary_stats << ['Average variant size', cnv_size_average.round(4)]
@@ -288,7 +252,7 @@ if !options[:chromosome_col].nil?
288
252
  ###2. Process SORs
289
253
  raw_sor_coverage, n_sor, nt, pats_per_region = calculate_coverage(sors, options[:patients_filter] - 1)
290
254
  summary_stats << ["Number of genome window shared by >= #{options[:patients_filter]} patients", n_sor]
291
- summary_stats << ["Number of patients with at least 1 SOR", total_patients_sharing_sors.uniq.length]
255
+ summary_stats << ["Number of patients with at least 1 SOR", patients_by_cluster.length]
292
256
  summary_stats << ['Nucleotides affected by mutations', nt]
293
257
  # summary_stats << ['Patient average per region', pats_per_region]
294
258
  sor_coverage_to_plot = get_final_coverage(raw_sor_coverage, options[:bin_size])
@@ -304,20 +268,16 @@ write_detailed_hpo_profile_evaluation(suggested_childs, detailed_profile_evaluat
304
268
  write_arrays4scatterplot(onto_ic.values, freq_ic.values, hpo_ic_file, 'OntoIC', 'FreqIC') # hP terms
305
269
  write_arrays4scatterplot(onto_ic_profile.values, freq_ic_profile.values, hpo_profile_ic_file, 'OntoIC', 'FreqIC') #HP profiles
306
270
  write_arrays4scatterplot(profile_sizes, parental_hpos_per_profile, parents_per_term_file, 'ProfileSize', 'ParentTerms')
271
+ write_cluster_ic_data(all_ics, prof_lengths, cluster_ic_data_file, options[:clusters2graph])
307
272
 
308
273
  system_call(EXTERNAL_CODE, 'plot_scatterplot_simple.R', "-i #{hpo_ic_file} -o #{File.join(temp_folder, 'hpo_ics.pdf')} -x 'OntoIC' -y 'FreqIC' --x_tag 'HP Ontology IC' --y_tag 'HP Frequency based IC' --x_lim '0,4.5' --y_lim '0,4.5'") if !File.exists?(File.join(temp_folder, 'hpo_ics.pdf'))
309
274
  system_call(EXTERNAL_CODE, 'plot_scatterplot_simple.R', "-i #{hpo_profile_ic_file} -o #{File.join(temp_folder, 'hpo_profile_ics.pdf')} -x 'OntoIC' -y 'FreqIC' --x_tag 'HP Ontology Profile IC' --y_tag 'HP Frequency based Profile IC' --x_lim '0,4.5' --y_lim '0,4.5'") if !File.exists?(File.join(temp_folder, 'hpo_profile_ics.pdf'))
310
275
  system_call(EXTERNAL_CODE, 'plot_scatterplot_simple.R', "-i #{parents_per_term_file} -o #{File.join(temp_folder, 'parents_per_term.pdf')} -x 'ProfileSize' -y 'ParentTerms' --x_tag 'Patient HPO profile size' --y_tag 'Parent HPO terms within the profile'")
311
-
312
- ###Cohort frequency calculation
313
- ronto_file = File.join(temp_folder, 'hpo_freq_colour')
314
- system_call(EXTERNAL_CODE, 'ronto_plotter.R', "-i #{hpo_frequency_file} -o #{ronto_file} --root_node #{options[:root_node]} -O #{hpo_file.gsub('.json','.obo')}") if !File.exist?(ronto_file + '.png')
315
-
316
- write_cluster_ic_data(all_ics, profile_lengths, cluster_ic_data_file, options[:clusters2graph])
276
+ system_call(EXTERNAL_CODE, 'ronto_plotter.R', "-i #{hpo_frequency_file} -o #{ronto_file} --root_node #{options[:root_node]} -O #{hpo_file.gsub('.json','.obo')}") if !File.exist?(ronto_file + '.png') ###Cohort frequency calculation
317
277
  system_call(EXTERNAL_CODE, 'plot_boxplot.R', "#{cluster_ic_data_file} #{temp_folder} cluster_id ic 'Cluster size/id' 'Information coefficient' 'Plen' 'Profile size'")
318
278
 
319
279
  if !options[:chromosome_col].nil?
320
- write_cluster_chromosome_data(cluster_data_by_chromosomes, cluster_chromosome_data_file, options[:clusters2graph])
280
+ write_cluster_chromosome_data(clust_by_chr, cluster_chromosome_data_file, options[:clusters2graph])
321
281
  system_call(EXTERNAL_CODE, 'plot_scatterplot.R', "#{cluster_chromosome_data_file} #{temp_folder} cluster_id chr count 'Cluster size/id' 'Chromosome' 'Patients'")
322
282
  if options[:coverage_analysis]
323
283
  ###1. Process CNVs
@@ -332,69 +292,16 @@ end
332
292
  #----------------------------------
333
293
  # CLUSTER COHORT ANALYZER REPORT
334
294
  #----------------------------------
335
- Parallel.each(options[:clustering_methods], in_processes: options[:threads] ) do |method_name|
336
- matrix_filename = File.join(temp_folder, "similarity_matrix_#{method_name}.npy")
337
- axis_file = matrix_filename.gsub('.npy','.lst')
338
- profiles_similarity_filename = File.join(temp_folder, ['profiles_similarity', method_name].join('_').concat('.txt'))
339
- clusters_distribution_filename = File.join(temp_folder, ['clusters_distribution', method_name].join('_').concat('.txt'))
340
- if !File.exists?(matrix_filename)
341
- profiles_similarity = hpo.compare_profiles(sim_type: method_name.to_sym)
342
- write_profile_pairs(profiles_similarity, profiles_similarity_filename)
343
- similarity_matrix, axis_names = format_profiles_similarity_data_numo(profiles_similarity)
344
- File.open(axis_file, 'w'){|f| f.print axis_names.join("\n") }
345
- Npy.save(matrix_filename, similarity_matrix)
346
- end
347
- ext_var = ''
348
- if method_name == 'resnik'
349
- ext_var = '-m max'
350
- elsif method_name == 'lin'
351
- ext_var = '-m comp1'
352
- end
353
- out_file = File.join(temp_folder, method_name)
354
- system_call(EXTERNAL_CODE, 'plot_heatmap.R', "-y #{axis_file} -d #{matrix_filename} -o #{out_file} -M #{options[:minClusterProportion]} -t dynamic -H #{ext_var}") if !File.exists?(out_file + '_heatmap.png')
355
- clusters_codes, clusters_info = parse_clusters_file(File.join(temp_folder, "#{method_name}_clusters.txt"), patient_uniq_profiles)
356
- get_cluster_metadata(clusters_info, clusters_distribution_filename)
357
- out_file = File.join(temp_folder, ['clusters_distribution', method_name].join('_'))
358
- system_call(EXTERNAL_CODE, 'xyplot_graph.R', "-d #{clusters_distribution_filename} -o #{out_file} -x PatientsNumber -y HPOAverage") if !File.exists?(out_file)
359
- clusters = translate_codes(clusters_codes, hpo)
360
-
361
- container = {
362
- :temp_folder => temp_folder,
363
- :cluster_name => method_name,
364
- :clusters => clusters,
365
- :hpo => hpo
366
- }
367
-
368
- template = File.open(File.join(REPORT_FOLDER, 'cluster_report.erb')).read
369
- report = Report_html.new(container, 'Patient clusters report')
370
- report.build(template)
371
- report.write(options[:output_file]+"_#{method_name}_clusters.html")
372
- end
373
-
374
- system_call(EXTERNAL_CODE, 'generate_boxpot.R', "-i #{temp_folder} -o #{File.join(temp_folder, 'sim_boxplot')}") if !File.exists?(File.join(temp_folder, 'sim_boxplot.png'))
375
-
295
+ get_semantic_similarity_clustering(options, patient_data, temp_folder)
376
296
 
377
297
  #----------------------------------
378
298
  # GENERAL COHORT ANALYZER REPORT
379
299
  #----------------------------------
380
- total_patients = 0
381
- new_cluster_phenotypes = {}
382
- phenotypes_frequency = Hash.new(0)
383
- top_cluster_phenotypes.each_with_index do |cluster, clusterID|
384
- total_patients = cluster.length
385
- cluster.each do |phenotypes|
386
- phenotypes.each do |p|
387
- phenotypes_frequency[p] += 1
388
- end
389
- end
390
- new_cluster_phenotypes[clusterID] = [total_patients, phenotypes_frequency.keys, phenotypes_frequency.values.map{|v| v.fdiv(total_patients) * 100}]
391
- phenotypes_frequency = Hash.new(0)
392
- end
393
-
300
+ new_cluster_phenotypes = get_top_dummy_clusters_stats(top_clust_phen)
394
301
 
395
302
  container = {
396
303
  :temp_folder => temp_folder,
397
- # :top_cluster_phenotypes => top_cluster_phenotypes.length,
304
+ # :top_clust_phen => top_clust_phen.length,
398
305
  :summary_stats => summary_stats,
399
306
  :clustering_methods => options[:clustering_methods],
400
307
  :hpo_stats => hpo_stats,
@@ -413,8 +320,8 @@ new_cluster_phenotypes.each do |clusterID, info|
413
320
  container["clust_#{clusterID}"] = clust_info
414
321
  clust_info = []
415
322
  end
416
-
417
323
  template = File.open(File.join(REPORT_FOLDER, 'cohort_report.erb')).read
418
324
  report = Report_html.new(container, 'Cohort quality report')
419
325
  report.build(template)
420
- report.write(options[:output_file]+'.html')
326
+ report.write(options[:output_file]+'.html')
327
+
data/bin/comPatMondo.rb CHANGED
@@ -4,15 +4,12 @@
4
4
  # @author Fernando Moreno Jabato <jabato(at)uma(dot)es>
5
5
 
6
6
  ROOT_PATH = File.dirname(__FILE__)
7
- EXTERNAL_DATA = File.expand_path(File.join(ROOT_PATH, '..', 'external_data'))
8
- MONDO_FILE = File.join(EXTERNAL_DATA, 'mondo.obo')
9
- HPO_FILE = File.join(EXTERNAL_DATA, 'hp.obo')
10
- EXTERNAL_CODE = File.expand_path(File.join(ROOT_PATH, '..', 'external_code'))
11
7
  $: << File.expand_path(File.join(ROOT_PATH, '..', 'lib', 'pets'))
12
8
 
13
9
  require 'optparse'
14
10
  require 'semtools'
15
11
  require 'csv'
12
+ require 'constants.rb'
16
13
  require 'coPatReporterMethods.rb'
17
14
 
18
15
  ##########################