bio-exominer 0.0.3

@@ -0,0 +1,5 @@
1
+ lib/**/*.rb
2
+ bin/*
3
+ -
4
+ features/**/*.feature
5
+ LICENSE.txt
data/.rspec ADDED
@@ -0,0 +1 @@
1
+ --color
@@ -0,0 +1,14 @@
1
+ language: ruby
2
+ rvm:
3
+ - 1.9.3
4
+ - 2.1.0
5
+ - ruby-head
6
+ # - jruby-19mode # JRuby in 1.9 mode - no support for msgpack
7
+
8
+ # - rbx-19mode
9
+ # - 1.8.7
10
+ # - jruby-18mode # JRuby in 1.8 mode
11
+ # - rbx-18mode
12
+
13
+ # uncomment this line if your project needs to run something other than `rake`:
14
+ # script: bundle exec rspec spec
data/Gemfile ADDED
@@ -0,0 +1,17 @@
1
+ source "http://rubygems.org"
2
+ # Add dependencies required to use your gem here.
3
+ # Example:
4
+ # gem "activesupport", ">= 2.3.5"
5
+
6
+ gem 'msgpack'
7
+
8
+ # Add dependencies to develop your gem here.
9
+ # Include everything needed to run rake, tests, features, etc.
10
+ group :development do
11
+ gem "minitest", "~> 5.0.7"
12
+ gem "rspec"
13
+ gem "cucumber"
14
+ gem "bundler"
15
+ gem "jeweler", "~> 2.0.0"
16
+ gem "rdoc"
17
+ end
@@ -0,0 +1,20 @@
1
+ Copyright (c) 2013 Cuppen Group and Pjotr Prins
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,413 @@
1
+ # bio-exominer
2
+
3
+ [![Build Status](https://secure.travis-ci.org/pjotrp/bioruby-exominer.png)](http://travis-ci.org/pjotrp/bioruby-exominer)
4
+
5
+ Exominer helps build a list of genes from publications.
6
+
7
+ Such a gene list may be used for identifying candidate genes connected to
8
+ a specific disease, but also may be used to compile a targeted
9
+ exome design for sequencing.
10
+
11
+ A quick example of a result for a search for pancreatic cancer genes
12
+ that were not listed in an exome design can be seen
13
+ [here](http://biobeat.org/examples/pancreatic_minus_new_design.html).
14
+
15
+ | gene | textmatch | description | context | resource | doi |
16
+ | ----- | --------- | ------------------------------------- | ------- | --- | --- |
17
+ | AKAP8L | HAP95 | A kinase (PRKA) anchor protein 8-like | A cancer-associated RING finger protein, RNF43, is a ubiquitin ligase that interacts with a nuclear protein, HAP95 | Whole-exome sequencing of neoplastic cysts of the pancreas reveals recurrent mutations in components of ubiquitin-dependent pathways | doi:10.1073/pnas.1118046108 |
18
+
19
+ Here, the second column shows the fuzzy text match, the first column the
20
+ official HUGO name, the third column a description of the gene, the
21
+ fourth column the textual context in the publication, the fifth column
22
+ the title of the publication and the sixth column the DOI. The second
23
+ entry for AM is a false positive; quickly seen by checking the
24
+ context in the fourth column. This output is generated by a SPARQL
25
+ query and a lot of flexibility in combining resources and generating
26
+ output is possible. Note that this is just one example.
27
+
28
+ The input for Exominer consists of a list of PubMed IDs with text files (PDF,
29
+ HTML, Word, Excel have to be exported to plain text first). Exominer
30
+ harvests gene names from these documents using a default symbol list
31
+ with aliases. Ideally, all texts would only contain HUGO symbols,
32
+ i.e. the over 30K standardized gene names by the HUGO Gene
33
+ Nomenclature Committee (HGNC). In reality, scientific authors take
34
+ liberties and the search for names is 'fuzzy'. Therefore Exominer
35
+ also mines for the roughly 12 million symbols and aliases
36
+ that are known through NCBI.
37
+
38
+ All matches are written out with their sources, symbol frequencies,
39
+ publication year, and user-provided keywords and impact
40
+ scores.
41
+
42
+ Exominer also exports to RDF, so that the gene symbols can be stored
43
+ into a triple-store graph database and link out to Bio2rdf resources.
44
+ The latter allows, for example, harvesting of pathways.
45
+
46
+ Every RDF export contains full information on the origin of symbols.
47
+ Over time designs can be compared against each other and a historical
48
+ record is maintained. It is a good idea to store the textual versions
49
+ of the files too.
50
+
51
+ The initial symbol list with aliases can be fetched/generated from external
52
+ sources, such as NCBI, Biomart and/or Bio2rdf. Some examples are listed in this
53
+ README and related scripts are in ./scripts. For a more specific treatment of
54
+ design and input/output of exominer, see ./doc/design.md.
55
+
56
+ Questions to ask from the RDF
57
+
58
+ * What genes are mentioned in a paper?
59
+ * What papers refer to certain genes?
60
+ * What genes are mentioned most in papers?
61
+ * What genes are mentioned only in one paper?
62
+ * What genes are mentioned since 2011?
63
+ * What genes are linked to a certain disease subtype?
64
+ * What genes are linked to some author or lab?
65
+ * What genes exist in a design?
66
+ * What are the genes in a design that are non-HUGO named?
67
+ * What are the genes in a paper that are non-HUGO named?
68
+ * How do designs differ?
69
+ * What genes mentioned since 2010 are not in a design?
70
+
71
+ When linking out to TCGA and bio2rdf we can get mutation information and gene sizes
72
+
73
+ * Give the mutations and sizes of the genes listed in a paper
74
+ * Give the mutations and sizes of the genes listed in a design
75
+
76
+ The TCGA (maf) data was provided by Will's Ruby publisci RDF module. We can ask
77
+ patient related questions
78
+
79
+ * How many patients are in the TCGA database?
80
+ * How many patients are in the TCGA per tumor type?
81
+
82
+ And mutation related questions
83
+
84
+ * Rank patients on number of mutations
85
+ * How many genes show at least one mutation per patient?
86
+ * What genes in what patients show more than X mutations (normalized for gene length)?
87
+ * Rank genes on number of mutations (normalized for gene length)
88
+ * List mutated genes per patient
89
+ * List patients per mutated gene
90
+ * List all mutations that have exactly the same start position and matching variant type (SNP, INS, DEL)
91
+
92
+ These questions are answered through SPARQL queries below.
93
+
94
+ Note: this software is under active development!
95
+
96
+ ## Installation
97
+
98
+ ```sh
99
+ gem install bio-exominer
100
+ ```
101
+
102
+ ## Quick start
103
+
104
+ List all genes in a paper. Visit the paper with your browser and save
105
+ it as HTML or text to 'paper.txt'
106
+
107
+ ## Command line interface (CLI)
108
+
109
+ ### Adding NCBI symbols and aliases
110
+
111
+ NCBI provides a current list of all NCBI used symbols in one large file at
112
+
113
+ ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
114
+ gzip -d gene_info.gz
115
+
116
+ Fetch this file and unpack it. Note: unpacked, this is a 1.4 GB file; do not
117
+ check this file into a GitHub repository! Create the symbol/alias list for
118
+ exominer with
119
+
120
+ ncbi_exominer_symbols gene_info > ncbi_symbols.tab
121
+
122
+ That makes for some 14 million symbols + aliases(!).
123
+
124
+ The ncbi_symbols.tab file contains entries, synonyms and descriptions, such as
125
+
126
+ repA1 pLeuDn_01 putative replication-associated protein
127
+ repA2 pLeuDn_03 putative replication-associated protein
128
+ leuA pLeuDn_04 2-isopropylmalate synthase
129
+ leuB pLeuDn_05 3-isopropylmalate dehydrogenase
130
+
131
+ You can remove the original gene_info file again after generating the ncbi_symbols file.
132
+
133
+ Next to the ncbi_symbols.tab file a frequency file is generated named
134
+ ncbi_exominer_symbols.freq, which contains the frequency of every
135
+ character used in symbol names:
136
+
137
+ p: 1255137
138
+ L: 1907635
139
+ e: 1334974
140
+ u: 465711
141
+ D: 2110781
142
+ n: 533637
143
+ _: 11942258
144
+
145
+ and a list of all characters
146
+
147
+ "#%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]_`abcdefghijklmnopqrstuvwxyz{}
148
+
149
+ In this list some gene symbols and gene names include dashes and dots
150
+ and other characters. Some gene names even contain spaces - we skip
151
+ these for further processing.
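
The character frequency table described above could be reproduced along the following lines. This is only a minimal Ruby sketch and not the code behind ncbi_exominer_symbols; the method name `char_frequencies` and the short sample list are assumptions made for illustration.

```ruby
# Hypothetical helper (not from the exominer source): tally how often each
# character occurs across a list of symbol names.
def char_frequencies(symbols)
  freq = Hash.new(0)
  symbols.each do |symbol|
    symbol.each_char { |c| freq[c] += 1 }
  end
  freq
end

# Sample names borrowed from the ncbi_symbols.tab excerpt
symbols = %w[repA1 repA2 leuA leuB pLeuDn_01]
freq = char_frequencies(symbols)

# Print in the same "char: count" layout as the .freq file
freq.sort.each { |char, count| puts "#{char}: #{count}" }

# All distinct characters as one sorted string
puts freq.keys.sort.join
```

A table like this makes it easy to see which non-alphanumeric characters a tokeniser has to tolerate inside symbol names.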
152
+
153
+ Later, not all of the millions of NCBI symbols and aliases are written to a
154
+ triple-store. Only those symbols that are mined from the documents
155
+ get stored.
156
+
157
+ ### Adding HUGO symbols and aliases
158
+
159
+ To make sure all recent HUGO symbols are added, download the HUGO symbols file
160
+ from EBI and parse that
161
+
162
+ ```sh
163
+ wget ftp://ftp.ebi.ac.uk/pub/databases/genenames/reference_genome_set.txt.gz
164
+ gzip -d reference_genome_set.txt.gz
165
+ hugo_exominer_symbols reference_genome_set.txt > hugo_symbols.tab
166
+ ```
167
+
168
+ The hugo_symbols.tab is included with the gem (in test/data/input/hugo_symbols) and will
169
+ always be loaded if you use the --hugo switch without specifying a symbol file. It contains
170
+ entries, synonyms and descriptions, such as
171
+
172
+ ERAP2 L-RAP|LRAP endoplasmic reticulum aminopeptidase 2
173
+ ERAS HRAS2|HRASP ES cell expressed Ras
174
+ ERBB2 NEU|HER-2|CD340|HER2|NGL v-erb-b2 avian erythroblastic leukemia viral oncogene homolog 2
175
+ ERBB2IP ERBIN|LAP2 erbb2 interacting protein
176
+
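
To make the layout of these symbol files concrete, here is a small Ruby sketch that reads such lines into a lookup table. It assumes three tab-separated columns (symbol, aliases, description) with aliases joined by `|`; the names `SymbolEntry`, `parse_symbol_line` and `build_lookup` are hypothetical and do not come from the gem.

```ruby
# Hypothetical record type for one line of a symbols file
SymbolEntry = Struct.new(:symbol, :aliases, :description)

# Assumes three tab-separated columns with aliases joined by '|'
def parse_symbol_line(line)
  symbol, aliases, description = line.chomp.split("\t", 3)
  SymbolEntry.new(symbol, (aliases || "").split("|"), description || "")
end

# Build a lookup from every symbol and alias to its entry
def build_lookup(lines)
  lookup = {}
  lines.each do |line|
    entry = parse_symbol_line(line)
    # The official symbol takes precedence over an alias of the same name
    lookup[entry.symbol] = entry
    entry.aliases.each { |a| lookup[a] ||= entry }
  end
  lookup
end

lines = [
  "ERBB2\tNEU|HER-2|CD340|HER2|NGL\tv-erb-b2 avian erythroblastic leukemia viral oncogene homolog 2\n",
  "ERBB2IP\tERBIN|LAP2\terbb2 interacting protein\n"
]
lookup = build_lookup(lines)

# An alias resolves to the official symbol
puts lookup["HER2"].symbol
```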
177
+ ### Making a text file of your document
178
+
179
+ Save HTML/Word/Excel/PDF files in a textual format. Command line
180
+ tools, such as lynx, antiword and pdftotext exist for this purpose. An
181
+ example of a textual version of an online Nature paper can be made with
182
+
183
+ lynx --dump http://www.nature.com/nature/journal/v490/n7418/full/nature11412.html >> tcga_bc.txt
184
+
185
+ Warning: do not check this file into any public repository! Nature Publishing
186
+ Group will not be amused.
187
+
188
+ ### Using Exominer to mine a text file for symbols
189
+
190
+ Pass the symbol file on the command line and pipe in the textual file, e.g.
191
+
192
+ exominer -s ncbi_symbols.tab --hugo hugo_symbols.tab < tcga_bc.txt
193
+
194
+ This results in a list of symbols and aliases found in the paper, with
195
+ their tally. For example
196
+
197
+ 35 FOXA1 forkhead box A1
198
+ 36 cas CRISPR associated Cas2 family protein
199
+ 36 AKT1 v-akt murine thymoma viral oncogene homolog 1
200
+ 37 BRCA2 hypothetical protein
201
+ 37 BRAF v-raf murine sarcoma viral oncogene homolog B1
202
+ 37 BRCA1 breast cancer 1, early onset
203
+ 38 A replication gene A protein
204
+ 38 AFF2 Ady2-Fun34 like Family, similar to S. cerevisiae FUN34 (YNR002C) and ADY2 (YCR010C); similar to Yarrowia glyoxalate pathway regulator, possible transmembrane acetate facilitator/sensor
205
+ 39 PDGFRA platelet-derived growth factor receptor, alpha polypeptide
206
+ 39 RAD51C Rad51 DNA recombinase 3
207
+ 39 MAP3K1 mitogen-activated protein kinase kinase kinase 1, E3 ubiquitin protein ligase
208
+ 41 AKT3 v-akt murine thymoma viral oncogene homolog 3 (protein kinase B, gamma)
209
+ 43 ATM hypothetical protein
210
+ 90 can carbonic anhydrase 2 Can
211
+
212
+ Out of a total of 12,774,630 symbols and 3,201,281 aliases scanned
213
+
214
+ This is not an authoritative list but because it is such a comprehensive
215
+ list of symbols and aliases there should be few false negatives.
216
+ Obviously the last one is a false positive, but these should be easy
217
+ to spot and weed out. The idea is to end up with a list of candidate
218
+ exome targets. So the possible next step (when not using a
219
+ triple-store) allows for subtracting symbols already in a design (not
220
+ yet implemented/NYI):
221
+
222
+ exominer -s ncbi_symbols.tab --ignore list.tab < tcga_bc.txt
223
+
224
+ where list.tab contains a list of symbols to ignore. These symbols
225
+ *with* their aliases are skipped in the text mining step.
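
The tallying and the ignore list described above can be sketched in a few lines of Ruby. This is an illustration under stated assumptions, not exominer's implementation: `tally_symbols` is a hypothetical name, the tokenisation rule is a guess, and unlike the behaviour described for `--ignore`, this sketch skips only the names listed rather than resolving their aliases.

```ruby
# Hypothetical sketch: count occurrences of known symbols in a text.
# `symbols` maps symbol or alias => description; `ignore` lists names to skip.
def tally_symbols(text, symbols, ignore = [])
  skip = {}
  ignore.each { |name| skip[name] = true }
  tally = Hash.new(0)
  # Split on anything that is not a letter, digit, dash, dot or underscore;
  # the real tool's tokenisation may well differ.
  text.split(/[^A-Za-z0-9_.\-]+/).each do |token|
    next if token.empty? || skip[token]
    tally[token] += 1 if symbols.key?(token)
  end
  tally
end

symbols = {
  "BRCA1" => "breast cancer 1, early onset",
  "AKT1"  => "v-akt murine thymoma viral oncogene homolog 1",
  "can"   => "carbonic anhydrase 2 Can"
}
text = "Mutations in BRCA1 and AKT1 can be found; BRCA1 is frequent."
tally = tally_symbols(text, symbols, ["can"])

# Print sorted by count, lowest first, as in the listing above
tally.sort_by { |name, count| [count, name] }.each do |name, count|
  puts "#{count}\t#{name}\t#{symbols[name]}"
end
```

Without the ignore list, the ordinary English word "can" would be counted as a gene symbol, which is the kind of false positive the listing above shows.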
226
+
227
+ This can be useful when mining one paper at a time. Mining multiple papers is better,
228
+ because there will be more evidence on gene names and symbols. Exominer can
229
+ export results to RDF for powerful querying. More on that below.
230
+
231
+ Also when you have an existing exome design, it is possible to add
232
+ a prepared exome list and accompanying design to an
233
+ RDF triple store for further exploration.
234
+
235
+ ## Speeding up text search
236
+
237
+ To speed things up you can create a binary version of the symbols
238
+ table with
239
+
240
+ pack_exominer_symbols ncbi_symbols.tab
241
+
242
+ and rename that file to
243
+
244
+ mv symbols.bin ncbi_symbols.bin
245
+
246
+ Now use the bin file instead with exominer's -s switch.
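
Why a binary table loads faster can be seen from a small round trip: the parsed structure is stored as is and needs no text parsing on reload. The sketch below uses Ruby's built-in `Marshal` purely as a stand-in. The Gemfile lists msgpack as a dependency, but how pack_exominer_symbols actually lays out its .bin file is not shown in this README, so treat the format here as an assumption.

```ruby
# Stand-in sketch: serialise a symbol lookup to a binary blob and load it
# back. Marshal is used only because it ships with Ruby; the format that
# pack_exominer_symbols really writes is not documented here.
symbols = {
  "ERAS"  => ["HRAS2|HRASP", "ES cell expressed Ras"],
  "ERAP2" => ["L-RAP|LRAP", "endoplasmic reticulum aminopeptidase 2"]
}

packed = Marshal.dump(symbols)    # binary String, could be written to a .bin file
restored = Marshal.load(packed)   # no tab or pipe splitting needed on reload

puts restored["ERAS"].last
```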
247
+
248
+ ## Using exominer with a triple-store
249
+
250
+ exominer supports RDF! This means that you can use a triple-store as a
251
+ 'back-end' and add results of multiple runs incrementally. For every
252
+ symbol it is possible to track back the publication and even mine
253
+ extra information, such as publication date, journal type, and whether
254
+ a symbol exists in one or more stored designs. We can even link
255
+ aliases to HUGO symbols and link out
256
+ to fetch gene information, such as the length of the nucleotide
257
+ sequence. Welcome to the world of the semantic web!
258
+
259
+ When parsing a publication or other resource we want the result set
260
+ to refer back to it. Ideally a DOI is used, which can be turned into a
261
+ URI through http://crossref.org/, e.g. doi:10.1038/171737a0 becomes
262
+ http://dx.doi.org/10.1038/171737a0 and can be queried, as explained
263
+ [here](http://inkdroid.org/journal/2011/04/25/dois-as-linked-data/).
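
The DOI-to-URI mapping in the example above is a simple string transformation. A minimal Ruby sketch follows; `doi_to_uri` is a hypothetical helper name, and the validity check is a loose assumption rather than a full DOI grammar.

```ruby
# Hypothetical helper: turn a "doi:" identifier into a resolvable URI,
# following the doi:10.1038/171737a0 example in the text.
def doi_to_uri(doi)
  id = doi.strip.sub(/\Adoi:/i, "")
  # Loose sanity check: a DOI starts with "10.", a registrant code and a slash
  raise ArgumentError, "not a DOI: #{doi}" unless id =~ %r{\A10\.[\d.]+/\S+\z}
  "http://dx.doi.org/#{id}"
end

puts doi_to_uri("doi:10.1038/171737a0")
```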
264
+
265
+ If no DOI exists, one can use a URL to a web publication, or even
266
+ simply the file name with the year and some tags for describing
267
+ the target of the publication, such as species or disease type.
268
+
269
+ The DOI describing the file:
270
+
271
+ exominer --rdf -s ncbi_symbols.tab --hugo hugo_symbols.tab \
272
+ --doi doi:10.1038/nature11412 < tcga_bc.txt
273
+
274
+ allows for mining title and publication date for every
275
+ symbol found. To add some meta information you could add semi-colon
276
+ separated tags
277
+
278
+ exominer --rdf -s ncbi_symbols.tab --hugo hugo_symbols.tab \
279
+ --doi doi:10.1038/nature11412 --tag 'species=human;type=breast cancer' < tcga_bc.txt
280
+
281
+ which helps mining data later on. If no DOI exists, you may just add
282
+ title and year:
283
+
284
+ exominer --rdf -s ncbi_symbols.tab --tag 'title=Comprehensive molecular portraits of human breast tumours' \
285
+ --tag 'year=2012;species=human;type=breast cancer' < tcga_bc.txt
286
+
287
+ Multiple tags are also allowed.
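
The tag syntax used in these commands (semi-colon separated `key=value` pairs, possibly spread over several `--tag` switches) can be parsed as sketched below. `parse_tags` is a hypothetical name, and letting a later value overwrite an earlier one for the same key is an assumption, not documented behaviour.

```ruby
# Hypothetical parser for --tag values such as 'species=human;type=breast cancer'.
# Accepts one string per --tag switch and merges them into a single Hash.
def parse_tags(*tag_strings)
  tags = {}
  tag_strings.each do |str|
    str.split(";").each do |pair|
      key, value = pair.split("=", 2)
      # Skip malformed pairs such as a stray trailing ';'
      next if key.nil? || key.strip.empty? || value.nil?
      tags[key.strip] = value.strip
    end
  end
  tags
end

tags = parse_tags("title=Comprehensive molecular portraits of human breast tumours",
                  "year=2012;species=human;type=breast cancer")
tags.each { |key, value| puts "#{key}: #{value}" }
```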
288
+
289
+ exominer generates RDF which can be added to a triple-store. If you
290
+ want to add a design (old or new) treat it as a publication and use something like
291
+
292
+ exominer --rdf --hugo hugo_symbols.tab --tag 'design=Targeted exome;year=2013;' < design.txt
293
+
294
+ These commands create Turtle RDF with the --rdf switch. Save the output
295
+ to a file and upload it to the triple-store with
296
+
297
+ curl -T file.rdf -H 'Content-Type: application/x-turtle' http://localhost:8081/data/exominer.rdf
298
+
299
+ The URI can be a little more descriptive, e.g.:
300
+
301
+ curl -T design2012.rdf -H 'Content-Type: application/x-turtle' http://localhost:8081/data/design2012.rdf
302
+
303
+ Finally, to support multiple searches and make it easier to
304
+ dereference sources you can supply a unique name to each result set
305
+ with the --name switch. E.g.
306
+
307
+ exominer --rdf --name tcga_bc -s ncbi_symbols.tab --hugo hugo_symbols.tab --doi doi:10.1038/nature11412 --tag 'species=human;type=breast cancer' < tcga_bc.txt
308
+
309
+ ## Context
310
+
311
+ When a gene name gets mined from a text, it is nice to see where it is
312
+ coming from. exominer provides context for this reason by including
313
+ the text around the gene name with every reference. This is also a
314
+ great way to weed out false positives! If the context for a gene named
315
+ SE says: 'Department of Oncology, Lund University, SE-221 85 Lund,
316
+ Sweden' - you may think twice about including it into your design.
317
+
318
+ Computers are not always good at automated text mining. The human eye
319
+ can pick these mistakes up quickly, so exominer makes use of human
320
+ recognition. The RDF output contains this context by default. To switch
321
+ context off, you can either add a CLI switch or pass in a tag
322
+ saying 'context=false'.
323
+
324
+ One extra (interesting) facility for context is the --context=line
325
+ switch. This sets the context to the full line in a text file
326
+ (from LF to LF). This can be very useful when parsing tabular
327
+ data (Excel dumps, for example).
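
The two kinds of context described in this section, a stretch of text around the match and the full line containing it, can be illustrated with the Ruby sketch below. `context_for`, its `mode` argument and the window width are hypothetical; exominer's own context length and matching rules may differ, and this sketch only looks at the first occurrence of a symbol.

```ruby
# Hypothetical sketch of two context modes: a character window around
# the first match, or the full line that contains it.
def context_for(text, symbol, mode = :window, width = 30)
  pos = text.index(symbol)
  return nil if pos.nil?
  if mode == :line
    # Everything between the surrounding line feeds (LF to LF)
    start = text.rindex("\n", pos)
    start = start.nil? ? 0 : start + 1
    stop = text.index("\n", pos) || text.length
    text[start...stop]
  else
    # `width` characters on either side, with whitespace runs collapsed
    from = [pos - width, 0].max
    text[from, (pos - from) + symbol.length + width].gsub(/\s+/, " ").strip
  end
end

text = "Affiliations\n" \
       "Department of Oncology, Lund University, SE-221 85 Lund, Sweden\n" \
       "Results"

puts context_for(text, "SE", :line)
puts context_for(text, "SE", :window, 10)
```

Seeing the address line next to the match makes it obvious that "SE" here is a postal code prefix and not a gene.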
328
+
329
+ ## Vocabularies
330
+
331
+ In addition to the standard W3C vocabularies, exominer uses the
332
+ [journal archiving and interchange tag set
333
+ (JATS)](http://jats.nlm.nih.gov/archiving/) for describing
334
+ publications. Another is [Bibliontology](http://bibliontology.com/).
335
+ The British Library vocabulary may be
336
+ [useful](http://www.bl.uk/bibliographic/datasamples.html) too.
337
+
338
+ ## Installing a triple-store
339
+
340
+ If you intend to use exominer with a triple-store you need to install
341
+ one. In principle you can use bio-rdf with any RDF triple store.
342
+ Instructions for installing [4store](http://4store.org/) can be found on
343
+ [bioruby-rdf](https://github.com/pjotrp/bioruby-rdf). You can add
344
+ a new triple-store with
345
+
346
+ ```sh
347
+ 4s-backend-setup exominer
348
+ 4s-backend exominer
349
+ 4s-httpd -p 8081 exominer
350
+ ```
351
+
352
+ and check the webserver is running on http://localhost:8081/status/.
353
+ Again, check bioruby-rdf for instructions on installing 4store and
354
+ sparql-query and examples.
355
+
356
+ ## Mining gene symbols with SPARQL
357
+
358
+ ### Looking for all database information in the triple-store
359
+
360
+ ```sparql
361
+ SELECT * WHERE { ?s ?p ?o }
362
+ ```
363
+
364
+ This can be run with the sparql-query tool
365
+
366
+ ```sh
367
+ sparql-query http://localhost:8081/sparql/ 'SELECT * WHERE { ?s ?p ?o } LIMIT 10'
368
+ ```
369
+
370
+
371
+
372
+ For a non-HUGO symbol, geneid information can be fetched with
373
+
374
+ ```sparql
375
+ SELECT ?type1, ?label1, count(*)
376
+ WHERE {
377
+ ?s1 ?p1 ?o1 .
378
+ ?o1 bif:contains "HK1" .
379
+ ?s1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type1 .
380
+ ?s1 <http://www.w3.org/2000/01/rdf-schema#label> ?label1 .
381
+ }
382
+ ORDER BY DESC (count(*))
383
+ ```
384
+
385
+ This renders a list of gene IDs. Follow up with, for example,
386
+ http://bio2rdf.org/geneid:100036759
387
+
388
+ ## Project home page
389
+
390
+ For information on the source tree, documentation, examples, issues and
391
+ how to contribute, see
392
+
393
+ http://github.com/pjotrp/bioruby-exominer
394
+
395
+ ## TODO
396
+
397
+ * Fix doi to make full URI
398
+
399
+ ## Cite
400
+
401
+ If you use this software, please cite one of
402
+
403
+ * [BioRuby: bioinformatics software for the Ruby programming language](http://dx.doi.org/10.1093/bioinformatics/btq475)
404
+ * [Biogem: an effective tool-based approach for scaling up open source software development in bioinformatics](http://dx.doi.org/10.1093/bioinformatics/bts080)
405
+
406
+ ## Biogems.info
407
+
408
+ This Biogem is published at (http://biogems.info/index.html#bio-exominer)
409
+
410
+ ## Copyright
411
+
412
+ Copyright (c) 2013,2014 Cuppen Group and Pjotr Prins. See LICENSE.txt for further details.
413
+