metient 0.1.1.dev2__tar.gz → 0.1.1.dev3__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (23)
  1. {metient-0.1.1.dev2 → metient-0.1.1.dev3}/PKG-INFO +1 -1
  2. {metient-0.1.1.dev2 → metient-0.1.1.dev3}/README.md +5 -3
  3. {metient-0.1.1.dev2 → metient-0.1.1.dev3}/metient/lib/vertex_labeling.py +1 -1
  4. {metient-0.1.1.dev2 → metient-0.1.1.dev3}/metient/util/data_extraction_util.py +82 -104
  5. {metient-0.1.1.dev2 → metient-0.1.1.dev3}/metient.egg-info/PKG-INFO +1 -1
  6. {metient-0.1.1.dev2 → metient-0.1.1.dev3}/setup.py +1 -1
  7. {metient-0.1.1.dev2 → metient-0.1.1.dev3}/metient/__init__.py +0 -0
  8. {metient-0.1.1.dev2 → metient-0.1.1.dev3}/metient/lib/__init__.py +0 -0
  9. {metient-0.1.1.dev2 → metient-0.1.1.dev3}/metient/metient.py +0 -0
  10. {metient-0.1.1.dev2 → metient-0.1.1.dev3}/metient/util/__init__.py +0 -0
  11. {metient-0.1.1.dev2 → metient-0.1.1.dev3}/metient/util/create_conf_intervals_from_reads.py +0 -0
  12. {metient-0.1.1.dev2 → metient-0.1.1.dev3}/metient/util/deprecated.py +0 -0
  13. {metient-0.1.1.dev2 → metient-0.1.1.dev3}/metient/util/eval_util.py +0 -0
  14. {metient-0.1.1.dev2 → metient-0.1.1.dev3}/metient/util/extra.py +0 -0
  15. {metient-0.1.1.dev2 → metient-0.1.1.dev3}/metient/util/globals.py +0 -0
  16. {metient-0.1.1.dev2 → metient-0.1.1.dev3}/metient/util/pairtree_data_extraction_util.py +0 -0
  17. {metient-0.1.1.dev2 → metient-0.1.1.dev3}/metient/util/plotting_util.py +0 -0
  18. {metient-0.1.1.dev2 → metient-0.1.1.dev3}/metient/util/vertex_labeling_util.py +0 -0
  19. {metient-0.1.1.dev2 → metient-0.1.1.dev3}/metient.egg-info/SOURCES.txt +0 -0
  20. {metient-0.1.1.dev2 → metient-0.1.1.dev3}/metient.egg-info/dependency_links.txt +0 -0
  21. {metient-0.1.1.dev2 → metient-0.1.1.dev3}/metient.egg-info/requires.txt +0 -0
  22. {metient-0.1.1.dev2 → metient-0.1.1.dev3}/metient.egg-info/top_level.txt +0 -0
  23. {metient-0.1.1.dev2 → metient-0.1.1.dev3}/setup.cfg +0 -0
{metient-0.1.1.dev2 → metient-0.1.1.dev3}/PKG-INFO
@@ -1,6 +1,6 @@
  Metadata-Version: 1.0
  Name: metient
- Version: 0.1.1.dev2
+ Version: 0.1.1.dev3
  Summary: UNKNOWN
  Home-page: https://github.com/divyakoyy/metient.git
  Author: UNKNOWN
{metient-0.1.1.dev2 → metient-0.1.1.dev3}/README.md
@@ -23,14 +23,16 @@ Each row in the tsv should correspond to the reference and variant read counts a
  The required fields for the tsv file:
  | Column name | Description |
  |----------|----------|
- | **anatomical_site_index** | Zero-based index for anatomical_site_label column |
+ | **anatomical_site_index** | Zero-based index for the anatomical_site_label column. Rows with the same anatomical_site_index and cluster_index will be pooled together. |
  | **anatomical_site_label** | Name of the anatomical site |
+ | **character_index** | Zero-based index for the character_label column |
  | **character_label** | Name of the mutation or cluster of mutations. This is used in visualizations, so it should be short. NOTE: due to graphing dependencies, this string cannot contain colons. |
- | **cluster_index** | If using a clustering method, the cluster index that this mutation belongs to. NOTE: this must correspond to the indices used in the tree txt file. |
+ | **cluster_index** | If using a clustering method, the cluster index that this mutation belongs to. NOTE: this must correspond to the indices used in the tree txt file. Rows with the same anatomical_site_index and cluster_index will be pooled together. |
  | **ref** | The number of reads that map to the reference allele for this mutation or mutation cluster in this anatomical site. |
  | **var** | The number of reads that map to the variant allele for this mutation or mutation cluster in this anatomical site. |
- | **var_read_prob** | Let j = character_index. This is the probabilty of observing a read from the variant allele for mutation at j in a cell bearing the mutation. Thus, if mutation at j occurred at a diploid locus, this should be 0.5. In a haploid cell (e.g., male sex chromosome), this should be 1.0. If a copy-number aberration (CNA) duplicated the reference allele in the lineage bearing mutation j prior to j occurring, there will be two reference alleles and a single variant allele in all cells bearing j, such that var_read_prob = 0.3333. This gives Metient the ability to correct for the effect CNAs have on the relationship between VAF (i.e., the proportion of alleles bearing the mutation) and subclonal frequency (i.e., the proportion of cells bearing the mutation). This is modeled around PairTree's VAF correction, see S2.2 of [PairTree's supplementary info](https://aacr.silverchair-cdn.com/aacr/content_public/journal/bloodcancerdiscov/3/3/10.1158_2643-3230.bcd-21-0092/9/bcd-21-0092_supplementary_information_suppsm_sf1-sf21.pdf?Expires=1709221974&Signature=dJH6~Dg-6gEb-S88i0wDGW28QZn16keQj34Vo2tAvJL2cUJrQo48afpHPp-a2zAwQa~ET6SDgw3hb3ITacB06GDUc3GYCdCgYtfPMjFGwygFj-Q9xf-c44VAvwiyliwsBXK1shZmURlFMwSjzkwRwasuWu50sMNmeJSoVyX3nQ-rRBlK93aDR5s9c0l-p4aGvTi6QmfKJPsxXaHB4Lz5yXSl3Xd~JPK-Y~ltC14epDRb~MiSPWUFCAiYetUXcQ7J7vd6b4XQKT9PnYkjQtUq55tLSoUkOGe5JkJ32NXCeoT~l-XD97pCeDYVDOYzAuOkAG0tDYrPebEh2TGTA3fnbA__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA) for more details. |
  | **site_category** | Must be one of `primary` or `metastasis`. If multiple primaries are specified (i.e., the true primary is not known), we will run Metient multiple times with each primary used as the true primary. Output files are saved with the suffix `_{anatomical_site_label}` to indicate which primary was used in that run. |
+ | **var_read_prob** | This gives Metient the ability to correct for the effect copy number alterations (CNAs) have on the relationship between VAF (i.e., the proportion of alleles bearing the mutation) and subclonal frequency (i.e., the proportion of cells bearing the mutation). Let j = character_index. This is the probability of observing a read from the variant allele for mutation at j in a cell bearing the mutation. Thus, if mutation at j occurred at a diploid locus, this should be 0.5. In a haploid cell (e.g., male sex chromosome), this should be 1.0. If a CNA duplicated the reference allele in the lineage bearing mutation j prior to j occurring, there will be two reference alleles and a single variant allele in all cells bearing j, such that var_read_prob = 0.3333. If using a CN caller that reports major and minor CN: `var_read_prob = (p*maj)/(p*(maj+min)+(1-p)*2)`, where `p` is tumor purity, `maj` is major CN, `min` is minor CN, and we're assuming the variant allele has major CN. For more information, see S2.2 of [PairTree's supplementary info](https://aacr.silverchair-cdn.com/aacr/content_public/journal/bloodcancerdiscov/3/3/10.1158_2643-3230.bcd-21-0092/9/bcd-21-0092_supplementary_information_suppsm_sf1-sf21.pdf?Expires=1709221974&Signature=dJH6~Dg-6gEb-S88i0wDGW28QZn16keQj34Vo2tAvJL2cUJrQo48afpHPp-a2zAwQa~ET6SDgw3hb3ITacB06GDUc3GYCdCgYtfPMjFGwygFj-Q9xf-c44VAvwiyliwsBXK1shZmURlFMwSjzkwRwasuWu50sMNmeJSoVyX3nQ-rRBlK93aDR5s9c0l-p4aGvTi6QmfKJPsxXaHB4Lz5yXSl3Xd~JPK-Y~ltC14epDRb~MiSPWUFCAiYetUXcQ7J7vd6b4XQKT9PnYkjQtUq55tLSoUkOGe5JkJ32NXCeoT~l-XD97pCeDYVDOYzAuOkAG0tDYrPebEh2TGTA3fnbA__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA) for more details. |
+

  anatomical_site_index anatomical_site_label cluster_index character_label ref var var_read_prob site_category num_mutations
  0 breast 0 HER2 982 78 0.43 primary 54
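The major/minor-CN formula quoted in the new `var_read_prob` row can be sketched as a small helper. This is an illustrative sketch only; the function name is hypothetical and not part of Metient's API:

```python
def var_read_prob(major_cn, minor_cn, purity):
    """Probability of sampling a variant-allele read, assuming the variant
    allele sits on the major copy (per the README formula / PairTree S2.2)."""
    p = purity
    # variant-bearing copies / total copies across tumor cells (purity p)
    # and contaminating normal diploid cells (1 - p)
    return (p * major_cn) / (p * (major_cn + minor_cn) + (1 - p) * 2)

# Diploid locus, fully pure tumor: 1 variant copy out of 2 -> 0.5
print(var_read_prob(1, 1, 1.0))
```

A sanity check against the table: with no CNA and purity 1.0 the value is the diploid 0.5; lower purity dilutes the variant signal because normal cells contribute two reference copies.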
{metient-0.1.1.dev2 → metient-0.1.1.dev3}/metient/lib/vertex_labeling.py
@@ -919,7 +919,7 @@ def calibrate(tree_fns, tsv_fns, print_config, output_dir, run_names,
               Os, batch_size, custom_colors, bias_weights, solve_polytomies):

      if not (len(tree_fns) == len(tsv_fns) == len(run_names)):
-         raise ValueError("Inputs Ts, tsv_fns, primary_sites and run_names must have equal length (length = number of patients in cohort")
+         raise ValueError("Inputs Ts, tsv_fns, and run_names must have equal length (length = number of patients in cohort)")

      if isinstance(tree_fns[0], str):
          Ts = []
{metient-0.1.1.dev2 → metient-0.1.1.dev3}/metient/util/data_extraction_util.py
@@ -2,7 +2,6 @@ import csv
  import numpy as np
  import os
  import torch
- import copy
  from collections import OrderedDict
  import pandas as pd
  import sys
@@ -14,7 +13,6 @@ print("CUDA GPU:",torch.cuda.is_available())
  if torch.cuda.is_available():
      torch.set_default_tensor_type(torch.cuda.FloatTensor)

-
  def get_adjacency_matrix_from_txt_edge_list(txt_file):
      edges = []
      max_idx = -1
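The hunk above truncates `get_adjacency_matrix_from_txt_edge_list` after its first lines (collecting edges and tracking the maximum node index). The core idea can be sketched in pure Python; this is a simplified stand-in, not the package's exact implementation, which reads the edge list from a txt file and uses torch tensors:

```python
def adjacency_matrix_from_edges(edges):
    """Build a dense 0/1 adjacency matrix from (parent, child) index pairs,
    sized by the largest node index seen in the edge list."""
    max_idx = max(max(u, v) for u, v in edges)
    n = max_idx + 1
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = 1  # directed edge u -> v
    return A

# Tree with root 0 and children 1, 2
print(adjacency_matrix_from_edges([(0, 1), (0, 2)]))
```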
@@ -44,28 +42,14 @@ def get_mut_to_cluster_map_from_pyclone_output(pyclone_cluster_fn, min_mut_thres

  def pool_input_tsv(tsv_fn, output_dir, run_name):
      '''
+     Pool reads from the same anatomical site index and cluster index
      '''
-     df = pd.read_csv(tsv_fn, delimiter="\t")
-     mut_name_to_cluster_id = {}
-     cluster_id_to_mut_names = {}
-     for _,row in df.iterrows():
-         cid = int(row['cluster_index'])
-         mut_name_to_cluster_id[row['character_label']] = cid
-         if cid not in cluster_id_to_mut_names:
-             cluster_id_to_mut_names[cid] = []
-         cluster_id_to_mut_names[cid].append(row['character_label'])
-
-     cluster_id_to_cluster_name = {k:";".join(list(v)) for k,v in cluster_id_to_mut_names.items()}
-     cols = df.columns
-
-     df = df.drop(columns=["cluster_index"])
-     _, output_fn = write_pooled_tsv_from_clusters(df, mut_name_to_cluster_id, cluster_id_to_cluster_name, {},
-                                                  output_dir, run_name, ";", None)
+     df = pd.read_csv(tsv_fn, delimiter="\t")
+     assert(set(df['site_category'])==set(['primary', 'metastasis']))
+     _, output_fn = write_pooled_tsv_from_clusters(df, {}, output_dir, run_name, None)
      return output_fn

- def write_pooled_tsv_from_clusters(df, mut_name_to_cluster_id, cluster_id_to_cluster_name,
-                                    aggregation_rules, output_dir, patient_id, cluster_sep,
-                                    mutation_sep):
+ def write_pooled_tsv_from_clusters(df, aggregation_rules, output_dir, patient_id, mutation_sep):
      '''
      After clustering with any clustering algorithm, prepares tsvs by pooling mutations belonging
      to the same cluster
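The pooling that this docstring and the new README wording describe (rows sharing an anatomical site index and cluster index are merged, with read counts summed) can be sketched with a pandas groupby. Column names follow the README; the toy values and the variable names are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "anatomical_site_index": [0, 0, 1],
    "cluster_index":         [0, 0, 0],
    "ref":                   [100, 50, 80],
    "var":                   [10, 5, 20],
    "var_read_prob":         [0.5, 0.5, 0.5],
})

# Rows with the same (cluster_index, anatomical_site_index) are pooled:
# read counts are summed, var_read_prob is taken from the first row.
pooled = df.groupby(["cluster_index", "anatomical_site_index"], as_index=False).agg(
    {"ref": "sum", "var": "sum", "var_read_prob": "first"}
)
print(pooled)
```

The two rows for site 0 / cluster 0 collapse into one with ref=150 and var=15, while the site 1 row passes through unchanged, mirroring step 2 of the rewritten function below.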
@@ -74,7 +58,7 @@ def write_pooled_tsv_from_clusters(df, mut_name_to_cluster_id, cluster_id_to_clu
      df: pandas DataFrame with reqiured columns: [character_label,
      anatomical_site_index, anatomical_site_label, ref, var, var_read_prob, site_category]

-     mut_name_to_cluster_id: dictionary mapping mutation name ('character_label' in input df)
+     mut_idx_to_cluster_id: dictionary mapping mutation index ('character_index' in input df)
      to a cluster id

      cluster_id_to_cluster_name: dictionary mapping a cluster id to the name in the output tsv
@@ -94,8 +78,8 @@ def write_pooled_tsv_from_clusters(df, mut_name_to_cluster_id, cluster_id_to_clu

      '''

-     required_cols = ["anatomical_site_index", "anatomical_site_label",
-                      "character_label", "ref","var",
+     required_cols = ["anatomical_site_index", "anatomical_site_label", "cluster_index",
+                      "character_index", "character_label", "ref","var",
                       "var_read_prob", "site_category"]

      all_cols = required_cols + list(aggregation_rules.keys())
@@ -116,33 +100,34 @@ def write_pooled_tsv_from_clusters(df, mut_name_to_cluster_id, cluster_id_to_clu
      df['ref'] = df.apply(lambda row: row['total_reads_corrected']-row['var'], axis=1)
      df['var_read_prob'] = 0.5

-     df['cluster'] = df.apply(lambda row: mut_name_to_cluster_id[row['character_label']] if row['character_label'] in mut_name_to_cluster_id else np.nan, axis=1)
-     df = df.dropna(subset=['cluster'])
+     #df['cluster'] = df.apply(lambda row: mut_idx_to_cluster_id[row['character_index']] if row['character_index'] in mut_idx_to_cluster_id else np.nan, axis=1)
+     df = df.dropna(subset=['cluster_index'])
+
+     # Save the number of mutations in each cluster before pooling
+     cluster_id_to_num_muts = df.groupby('cluster_index')['character_index'].nunique().to_dict()
+     cluster_id_to_mut_names = df.groupby('cluster_index')['character_label'].unique().apply(list).to_dict()

      # 2. Pool reference and variant allele counts from all mutations within a cluster
-     pooled_df = df.drop(['character_label', 'anatomical_site_index'], axis=1)
+     pooled_df = df.drop(['character_label','character_index'], axis=1) # we're going to add this back in later

      # TODO: validate that site_category is the same for each anatomical site
-     ref_var_rules = {'ref': np.sum, 'var': np.sum,'total_reads_corrected': np.sum, "var_read_prob": 'first', 'site_category':'first'}
+     ref_var_rules = {'ref': np.sum, 'var': np.sum,'total_reads_corrected': np.sum, "var_read_prob": 'first',
+                      'site_category':'first', 'anatomical_site_label':lambda x: ';'.join(set(x)),}
+
+     pooled_df = pooled_df.groupby(['cluster_index', 'anatomical_site_index'], as_index=False).agg({**ref_var_rules, **aggregation_rules})

-     pooled_df = pooled_df.groupby(['cluster', 'anatomical_site_label'], as_index=False).agg({**ref_var_rules, **aggregation_rules})
-
-     pooled_df['character_label'] = pooled_df.apply(lambda row: cluster_id_to_cluster_name[row['cluster']], axis=1)
-
      # 3. Add indices for mutations, samples and anatomical sites as needed for input format
-     pooled_df['character_index'] = pooled_df['cluster'].astype(int)
+     pooled_df['character_index'] = pooled_df['cluster_index'].tolist()
      pooled_df['anatomical_site_index'] = pooled_df.apply(lambda row: list(pooled_df['anatomical_site_label'].unique()).index(row["anatomical_site_label"]), axis=1)
-     all_cols.append("total_reads_corrected")
-     all_cols.append("character_index")
-     pooled_df = pooled_df[all_cols]

      # 4. Do some post-processing, e.g. adding number of mutations and shortening character label for display
-     pooled_df['num_mutations'] = pooled_df.apply(lambda row: len(row['character_label'].split(cluster_sep)), axis=1)
-     pooled_df['full_label'] = pooled_df['character_label']
-     pooled_df['character_label'] = pooled_df.apply(lambda row:get_pruned_mut_label(row['character_label'], cluster_sep, mutation_sep), axis=1)
+     pooled_df['num_mutations'] = pooled_df.apply(lambda row:cluster_id_to_num_muts[int(row['character_index'])], axis=1)
+     pooled_df['character_label'] = pooled_df.apply(lambda row:get_pruned_mut_label(cluster_id_to_mut_names[row['cluster_index']], mutation_sep), axis=1)
      pooled_df['total_reads_corrected'] = pooled_df['total_reads_corrected'].round(0).astype(int)
      pooled_df['var'] = pooled_df['var'].round(0).astype(int)
      pooled_df['ref'] = pooled_df.apply(lambda row: row['total_reads_corrected']-row['var'], axis=1)
+     all_cols.append('num_mutations')
+     pooled_df = pooled_df[all_cols]

      # Save
      output_fn = os.path.join(output_dir, f"{patient_id}_clustered_SNVs.tsv")
@@ -162,71 +147,70 @@ def calc_var_read_prob(major_cn, minor_cn, purity):
      var_read_prob = (p*major_cn)/x
      return var_read_prob

- def write_pooled_tsv_from_conipher_pyclone_clusters(input_data_tsv_fn, clusters_tsv_fn,
-                                                     aggregation_rules, output_dir, patient_id,
-                                                     cluster_sep, mutation_sep):
+ # def write_pooled_tsv_from_conipher_pyclone_clusters(input_data_tsv_fn, clusters_tsv_fn,
+ #                                                     aggregation_rules, output_dir, patient_id,
+ #                                                     cluster_sep, mutation_sep):

-     '''
-     After clustering with PyClone using CONIPHER (https://github.com/McGranahanLab/CONIPHER-wrapper),
-     prepares tsvs by pooling mutations belonging to the same cluster
+ #     '''
+ #     After clustering with PyClone using CONIPHER (https://github.com/McGranahanLab/CONIPHER-wrapper),
+ #     prepares tsvs by pooling mutations belonging to the same cluster

-     Args:
-     tsv_fn: path to tsv with reqiured columns: [#sample_index, sample_label,
-     character_index, character_label, anatomical_site_index, anatomical_site_label,
-     ref, var]
+ #     Args:
+ #     tsv_fn: path to tsv with reqiured columns: [#sample_index, sample_label,
+ #     character_index, character_label, anatomical_site_index, anatomical_site_label,
+ #     ref, var]

-     clusters_tsv_fn: PyClone results tsv that maps each mutation to a cluster id
+ #     clusters_tsv_fn: PyClone results tsv that maps each mutation to a cluster id

-     aggregation_rules: dictionary indicating how to aggregate any extra columns that are not in
-     the required columns (specificied above). e.g. "first" aggregation rules can be used for columns
-     shared by rows with the same anatomical_site_label, since we are aggregating within an
-     anatomical_site_label.
+ #     aggregation_rules: dictionary indicating how to aggregate any extra columns that are not in
+ #     the required columns (specificied above). e.g. "first" aggregation rules can be used for columns
+ #     shared by rows with the same anatomical_site_label, since we are aggregating within an
+ #     anatomical_site_label.

-     output_dir: where to save clustered tsv
+ #     output_dir: where to save clustered tsv

-     patient_id: name of patient used in tsv filename
+ #     patient_id: name of patient used in tsv filename

-     cluster_sep: string that separates names of mutations when creating a cluster name.
-     e.g. cluster name for mutations ABC:4:3 and DEF:1:2 with cluster_sep=";" will be
-     "ABC:4:3;DEF:1:2"
+ #     cluster_sep: string that separates names of mutations when creating a cluster name.
+ #     e.g. cluster name for mutations ABC:4:3 and DEF:1:2 with cluster_sep=";" will be
+ #     "ABC:4:3;DEF:1:2"

-     mutation_sep: string that separates information about a variant, e.g.":" for a mutation
-     named ABC:4:3. Can be None if variants are not formatted this way.
+ #     mutation_sep: string that separates information about a variant, e.g.":" for a mutation
+ #     named ABC:4:3. Can be None if variants are not formatted this way.


-     Outputs:
+ #     Outputs:

-     Saves pooled tsv at {output_dir}/{patient_id}_clustered_SNVs.tsv
+ #     Saves pooled tsv at {output_dir}/{patient_id}_clustered_SNVs.tsv

-     '''
+ #     '''

-     df = pd.read_csv(input_data_tsv_fn, delimiter="\t", index_col=0)
-     pyclone_df = pd.read_csv(clusters_tsv_fn, delimiter="\t")
-     mut_name_to_cluster_id = dict()
-     cluster_id_to_mut_names = dict()
-     # 1. Get mapping between mutation names and PyClone cluster ids
-     for _, row in df.iterrows():
-         mut_items = row['character_label'].split(":")
-         cluster_id = pyclone_df[(pyclone_df['CHR']==int(mut_items[1]))&(pyclone_df['POS']==int(mut_items[2]))&(pyclone_df['REF']==mut_items[3])]['treeCLUSTER'].unique()
-         assert(len(cluster_id) <= 1)
-         if len(cluster_id) == 1:
-             cluster_id = int(cluster_id.item())
-             mut_name_to_cluster_id[row['character_label']] = cluster_id
-             if cluster_id not in cluster_id_to_mut_names:
-                 cluster_id_to_mut_names[cluster_id] = set()
-             else:
-                 cluster_id_to_mut_names[cluster_id].add(row['character_label'])
+ #     df = pd.read_csv(input_data_tsv_fn, delimiter="\t", index_col=0)
+ #     pyclone_df = pd.read_csv(clusters_tsv_fn, delimiter="\t")
+ #     mut_idx_to_cluster_id = dict()
+ #     cluster_id_to_mut_names = dict()
+ #     # 1. Get mapping between mutation names and PyClone cluster ids
+ #     for _, row in df.iterrows():
+ #         mut_items = row['character_label'].split(":")
+ #         cluster_id = pyclone_df[(pyclone_df['CHR']==int(mut_items[1]))&(pyclone_df['POS']==int(mut_items[2]))&(pyclone_df['REF']==mut_items[3])]['treeCLUSTER'].unique()
+ #         assert(len(cluster_id) <= 1)
+ #         if len(cluster_id) == 1:
+ #             cluster_id = int(cluster_id.item())
+ #             mut_idx_to_cluster_id[int(row['character_index'])] = cluster_id
+ #             if cluster_id not in cluster_id_to_mut_names:
+ #                 cluster_id_to_mut_names[cluster_id] = set()
+ #             else:
+ #                 cluster_id_to_mut_names[cluster_id].add(row['character_label'])

-     # 2. Set new names for clustered mutations
-     cluster_id_to_cluster_name = {k:cluster_sep.join(list(v)) for k,v in cluster_id_to_mut_names.items()}
+ #     # 2. Set new names for clustered mutations
+ #     cluster_id_to_cluster_name = {k:cluster_sep.join(list(v)) for k,v in cluster_id_to_mut_names.items()}

-     print(patient_id, {k:v for k,v in list(mut_name_to_cluster_id.items())[:3]})
-     df['var_read_prob'] = df.apply(lambda row: calc_var_read_prob(row['major_cn'], row['minor_cn'], row['purity']), axis=1)
-     df['site_category'] = df.apply(lambda row: 'primary' if 'primary' in row['anatomical_site_label'] else 'metastasis', axis=1)
+ #     df['var_read_prob'] = df.apply(lambda row: calc_var_read_prob(row['major_cn'], row['minor_cn'], row['purity']), axis=1)
+ #     df['site_category'] = df.apply(lambda row: 'primary' if 'primary' in row['anatomical_site_label'] else 'metastasis', axis=1)

-     # 3. Pool mutations and write to file
-     return write_pooled_tsv_from_clusters(df, mut_name_to_cluster_id, cluster_id_to_cluster_name,
-                                           aggregation_rules, output_dir, patient_id, cluster_sep, mutation_sep)
+ #     # 3. Pool mutations and write to file
+ #     return write_pooled_tsv_from_clusters(df, mut_idx_to_cluster_id, cluster_id_to_cluster_name,
+ #                                           aggregation_rules, output_dir, patient_id, cluster_sep, mutation_sep)

  def is_resolved_polytomy_cluster(cluster_label):
      '''
@@ -399,23 +383,26 @@ def get_ref_var_omega_matrices(tsv_filepaths):

      return ref_matrices, var_matrices, omega_matrices, ordered_sites, idx_to_label_dicts, idx_to_num_muts

- def get_pruned_mut_label(mut_label, cluster_sep, mutation_sep, k=2):
+ def get_pruned_mut_label(mut_names, mutation_sep, k=2):
+
+     # If mutation names contain colons (e.g. LOC1:9:12312321), take everything before the first
+     # colon for displaying since otherwise we can't display the label w/ our graphing library dependencies
      if mutation_sep != None:
-         gene_names = [item.split(mutation_sep)[0] for item in mut_label.split(cluster_sep)]
+         gene_names = [item.split(mutation_sep)[0] for item in mut_names]
      else:
-         gene_names = [item for item in mut_label.split(cluster_sep)]
+         gene_names = [item for item in mut_names]

-     gene_candidates = []
+     gene_candidates = set()
      for gene in gene_names:
          gene = gene.upper()
          if gene in CANCER_DRIVER_GENES:
-             gene_candidates.append(gene)
+             gene_candidates.add(gene)
          elif gene in ENSEMBLE_TO_GENE_MAP:
-             gene_candidates.append(ENSEMBLE_TO_GENE_MAP[gene])
+             gene_candidates.add(ENSEMBLE_TO_GENE_MAP[gene])
      final_genes = gene_names if len(gene_candidates) == 0 else gene_candidates

      k = k if len(final_genes) > k else len(final_genes)
-     return "_".join(final_genes[:k])
+     return "_".join(list(final_genes)[:k])

  def get_primary_sites(tsv_filepath):
      # For tsvs with very large fields
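The reworked `get_pruned_mut_label` above shortens a cluster's display label: strip each mutation name down to the part before the first separator, then join at most `k` names. A simplified standalone sketch of that behavior (without the driver-gene filtering, which depends on the package's `CANCER_DRIVER_GENES` and `ENSEMBLE_TO_GENE_MAP` tables):

```python
def pruned_label(mut_names, mutation_sep=":", k=2):
    """Shorten a cluster label: keep the text before the first separator of
    each mutation name, then join at most k of them with underscores."""
    genes = [name.split(mutation_sep)[0] if mutation_sep else name
             for name in mut_names]
    return "_".join(genes[:k])

# Three colon-formatted mutation names collapse to a two-gene label
print(pruned_label(["TP53:17:7578406", "KRAS:12:25398284", "EGFR:7:55259515"]))
```

Since the real function deduplicates candidates via a set, its output order for the kept genes is not guaranteed; this sketch keeps input order for clarity.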
@@ -579,15 +566,6 @@ def _get_adj_matrix_from_machina_tree(tree_edges, idx_to_character_label, remove
      ret_dict = {v:k for k,v in ret_dict.items()}
      return T, ret_dict

- def get_index_to_cluster_label_from_corrected_sim_tsv(ref_var_fn):
-     df = pd.read_csv(ref_var_fn, sep="\t")
-     idx_to_label = {}
-     labels = df['character_label'].unique()
-     for label in labels:
-         idx = df[df['character_label']==label]['character_index'].unique().item()
-         idx_to_label[idx] = label
-     return idx_to_label
-
  def get_adj_matrices_from_spruce_mutation_trees(mut_trees_filename, idx_to_character_label, is_sim_data=False):
      '''
      When running MACHINA's generatemutationtrees executable (SPRUCE), it provides a txt file with
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 1.0
2
2
  Name: metient
3
- Version: 0.1.1.dev2
3
+ Version: 0.1.1.dev3
4
4
  Summary: UNKNOWN
5
5
  Home-page: https://github.com/divyakoyy/metient.git
6
6
  Author: UNKNOWN
{metient-0.1.1.dev2 → metient-0.1.1.dev3}/setup.py
@@ -4,4 +4,4 @@ with open('requirements.txt') as f:
      requirements = f.read().splitlines()
  #print(requirements)

- setup(name='metient', version='0.1.1.dev2', url="https://github.com/divyakoyy/metient.git", packages=['metient', 'metient.util', 'metient.lib'], install_requires=requirements,)
+ setup(name='metient', version='0.1.1.dev3', url="https://github.com/divyakoyy/metient.git", packages=['metient', 'metient.util', 'metient.lib'], install_requires=requirements,)