bio-exominer 0.0.3

@@ -0,0 +1,5 @@
1
+ lib/**/*.rb
2
+ bin/*
3
+ -
4
+ features/**/*.feature
5
+ LICENSE.txt
data/.rspec ADDED
@@ -0,0 +1 @@
1
+ --color
@@ -0,0 +1,14 @@
1
+ language: ruby
2
+ rvm:
3
+ - 1.9.3
4
+ - 2.1.0
5
+ - ruby-head
6
+ # - jruby-19mode # JRuby in 1.9 mode - no support for msgpack
7
+
8
+ # - rbx-19mode
9
+ # - 1.8.7
10
+ # - jruby-18mode # JRuby in 1.8 mode
11
+ # - rbx-18mode
12
+
13
+ # uncomment this line if your project needs to run something other than `rake`:
14
+ # script: bundle exec rspec spec
data/Gemfile ADDED
@@ -0,0 +1,17 @@
1
+ source "http://rubygems.org"
2
+ # Add dependencies required to use your gem here.
3
+ # Example:
4
+ # gem "activesupport", ">= 2.3.5"
5
+
6
+ gem 'msgpack'
7
+
8
+ # Add dependencies to develop your gem here.
9
+ # Include everything needed to run rake, tests, features, etc.
10
+ group :development do
11
+ gem "minitest", "~> 5.0.7"
12
+ gem "rspec"
13
+ gem "cucumber"
14
+ gem "bundler"
15
+ gem "jeweler", "~> 2.0.0"
16
+ gem "rdoc"
17
+ end
@@ -0,0 +1,20 @@
1
+ Copyright (c) 2013 Cuppen Group and Pjotr Prins
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,413 @@
1
+ # bio-exominer
2
+
3
+ [![Build Status](https://secure.travis-ci.org/pjotrp/bioruby-exominer.png)](http://travis-ci.org/pjotrp/bioruby-exominer)
4
+
5
+ Exominer helps build a list of genes from publications.
6
+
7
+ Such a gene list may be used for identifying candidate genes connected to
8
+ a specific disease, but also may be used to compile a targeted
9
+ exome design for sequencing.
10
+
11
+ A quick example of a result for a search for pancreatic cancer genes
12
+ that were not listed in an exome design can be seen
13
+ [here](http://biobeat.org/examples/pancreatic_minus_new_design.html).
14
+
15
+ | gene | textmatch | description | context | resource | doi |
16
+ | ----- | --------- | ------------------------------------- | ------- | --- | --- |
17
+ | AKP8L | HAP95 | A kinase (PRKA) anchor protein 8-like | A cancer-associated RING finger protein, RNF43, is a ubiquitin ligase that interacts with a nuclear protein, HAP95 | Whole-exome sequencing of neoplastic cysts of the pancreas reveals recurrent mutations in components of ubiquitin-dependent pathways | doi:10.1073/pnas.1118046108 |
18
+
19
+ Here, the second column shows the fuzzy text match, the first column the
20
+ official HUGO name, the third column a description of the gene, the
21
+ fourth column the textual context in the publication, the fifth column
22
+ the title of the publication and the sixth column the DOI. The second
23
+ entry for AM is a false positive; quickly seen by checking the
24
+ context in the fourth column. This output is generated by a SPARQL
25
+ query and a lot of flexibility in combining resources and generating
26
+ output is possible. Note that this is just one example.
27
+
28
+ The input for Exominer consists of a list of PubMed IDs with text files (PDF,
29
+ HTML, Word, Excel have to be exported to plain text first). Exominer
30
+ harvests gene names from these documents using a default symbol list
31
+ with aliases. Ideally, all texts would only contain HUGO symbols,
32
+ i.e. the over 30K standardized gene names by the HUGO Gene
33
+ Nomenclature Committee (HGNC). In reality, scientific authors take
34
+ liberties and the search for names is 'fuzzy'. Therefore Exominer
35
+ also searches the 12-odd million symbols and aliases
36
+ that are known through NCBI.
37
+
38
+ All matches are written out with their sources, symbol frequencies,
39
+ publication year, and user-provided keywords and impact
40
+ scores.
41
+
42
+ Exominer also exports to RDF, so that the gene symbols can be stored
43
+ into a triple-store graph database and link out to Bio2rdf resources.
44
+ The latter allows, for example, harvesting of pathways.
45
+
46
+ Every RDF export contains full information on the origin of symbols.
47
+ Over time designs can be compared against each other and a historical
48
+ record is maintained. It is a good idea to store the textual versions
49
+ of the files too.
50
+
51
+ The initial symbol list with aliases can be fetched/generated from external
52
+ sources, such as NCBI, Biomart and/or Bio2rdf. Some examples are listed in this
53
+ README and related scripts are in ./scripts. For a more specific treatment of
54
+ design and input/output of exominer, see ./doc/design.md.
55
+
56
+ Questions to ask from the RDF
57
+
58
+ * What genes are mentioned in a paper?
59
+ * What papers refer to certain genes?
60
+ * What genes are mentioned most in papers?
61
+ * What genes are mentioned only in one paper?
62
+ * What genes are mentioned since 2011?
63
+ * What genes are linked to a certain disease subtype?
64
+ * What genes are linked to some author or lab?
65
+ * What genes exist in a design?
66
+ * What are the genes in a design that are non-HUGO named?
67
+ * What are the genes in a paper that are non-HUGO named?
68
+ * How do designs differ?
69
+ * What genes mentioned since 2010 are not in a design?
70
+
71
+ When linking out to TCGA and bio2rdf we can get mutation information and gene sizes
72
+
73
+ * Give mutations and sizes of the genes listed in a paper
74
+ * Give mutations and sizes of the genes listed in a design
75
+
76
+ The TCGA (maf) data was provided by Will's Ruby publisci RDF module. We can ask
77
+ patient related questions
78
+
79
+ * How many patients are in the TCGA database?
80
+ * How many patients are in the TCGA per tumor type?
81
+
82
+ And mutation related questions
83
+
84
+ * Rank patients on number of mutations
85
+ * How many genes show at least one mutation per patient
86
+ * What genes in what patients show more than X mutations (normalized for gene length)
87
+ * Rank genes on number of mutations (normalized for gene length)
88
+ * List mutated genes per patient
89
+ * List patient per mutated gene
90
+ * List all mutations that have exactly the same start position and matching variant type (SNP, INS, DEL)
91
+
92
+ These questions are answered through SPARQL queries below.
93
+
94
+ Note: this software is under active development!
95
+
96
+ ## Installation
97
+
98
+ ```sh
99
+ gem install bio-exominer
100
+ ```
101
+
102
+ ## Quick start
103
+
104
+ List all genes in a paper. Visit the paper with your browser and save
105
+ it as HTML or text to 'paper.txt'
106
+
107
+ ## Command line interface (CLI)
108
+
109
+ ### Adding NCBI symbols and aliases
110
+
111
+ NCBI provides a current list of all NCBI used symbols in one large file at
112
+
113
+ ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
114
+ gzip -d gene_info.gz
115
+
116
+ Fetch this file and unpack. Note: unpacked this is a 1.4Gb file; do not
117
+ check this file into a github repository! Create the symbol/alias list for
118
+ exominer with
119
+
120
+ ncbi_exominer_symbols gene_info > ncbi_symbols.tab
121
+
122
+ That makes for some 14 million symbols + aliases(!).
123
+
124
+ The ncbi_symbols.tab file contains entries, synonyms and descriptions, such as
125
+
126
+ repA1 pLeuDn_01 putative replication-associated protein
127
+ repA2 pLeuDn_03 putative replication-associated protein
128
+ leuA pLeuDn_04 2-isopropylmalate synthase
129
+ leuB pLeuDn_05 3-isopropylmalate dehydrogenase
130
+
131
+ You can remove the original gene_info file again after generating the ncbi_symbols file.
132
+
133
+ Next to the ncbi_symbols.tab file a frequency file is generated named
134
+ ncbi_exominer_symbols.freq, which contains the frequency of every
135
+ character used in symbol names:
136
+
137
+ p: 1255137
138
+ L: 1907635
139
+ e: 1334974
140
+ u: 465711
141
+ D: 2110781
142
+ n: 533637
143
+ _: 11942258
144
+
145
+ and a list of all characters
146
+
147
+ "#%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]_`abcdefghijklmnopqrstuvwxyz{}
148
+
149
+ In this list some gene symbols and gene names include dashes and dots
150
+ and other characters. Some gene names even contain spaces - we skip
151
+ these for further processing.
152
+
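The character tally described above can be sketched in a few lines of Ruby. This is only an illustration and not the code used by ncbi_exominer_symbols: the method names `character_frequencies` and `usable_names` are hypothetical, and it assumes that names containing whitespace are dropped before counting.

```ruby
# Hypothetical sketch, not exominer's implementation.
# Tally how often each character occurs across a list of symbol names.
def character_frequencies(names)
  freq = Hash.new(0)
  names.each { |name| name.each_char { |c| freq[c] += 1 } }
  freq
end

# Assumption: names that contain whitespace are skipped before processing.
def usable_names(names)
  names.reject { |name| name =~ /\s/ }
end

names = ["repA1", "pLeuDn_01", "leu A"]
freq = character_frequencies(usable_names(names))

# One "char: count" line per character, followed by the sorted character set.
freq.each { |char, count| puts "#{char}: #{count}" }
puts freq.keys.sort.join
```

A table like this makes it easy to see which punctuation characters a tokenizer has to accept as part of a symbol.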
153
+ Later, not all of the millions of NCBI symbols and aliases are written to a
154
+ triple-store. Only those symbols that are mined from the
155
+ documents get stored.
156
+
157
+ ### Adding HUGO symbols and aliases
158
+
159
+ To make sure all recent HUGO symbols are added, download the HUGO symbols file
160
+ from EBI and parse that
161
+
162
+ ```sh
163
+ wget ftp://ftp.ebi.ac.uk/pub/databases/genenames/reference_genome_set.txt.gz
164
+ gzip -d reference_genome_set.txt.gz
165
+ hugo_exominer_symbols reference_genome_set.txt > hugo_symbols.tab
166
+ ```
167
+
168
+ The hugo_symbols.tab is included with the gem (in test/data/input/hugo_symbols) and will
169
+ always be loaded if you use the --hugo switch without specifying a symbol file. It contains
170
+ entries, synonyms and descriptions, such as
171
+
172
+ ERAP2 L-RAP|LRAP endoplasmic reticulum aminopeptidase 2
173
+ ERAS HRAS2|HRASP ES cell expressed Ras
174
+ ERBB2 NEU|HER-2|CD340|HER2|NGL v-erb-b2 avian erythroblastic leukemia viral oncogene homolog 2
175
+ ERBB2IP ERBIN|LAP2 erbb2 interacting protein
176
+
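To make the table layout concrete, here is a minimal Ruby sketch that reads such lines into a lookup index. It is not exominer's own loader: `SymbolEntry`, `parse_symbol_line` and `build_symbol_index` are hypothetical names, and the sketch assumes three TAB-separated columns (symbol, aliases joined by `|`, description), which the `.tab` extension suggests but the listing above does not prove.

```ruby
# Hypothetical sketch, not exominer's loader.
# Assumes TAB-separated columns: symbol, '|'-joined aliases, description.
SymbolEntry = Struct.new(:symbol, :aliases, :description)

def parse_symbol_line(line)
  symbol, aliases, description = line.chomp.split("\t", 3)
  SymbolEntry.new(symbol, (aliases || "").split("|"), description || "")
end

# Map every symbol and every alias to its entry (first definition wins).
def build_symbol_index(lines)
  index = {}
  lines.each do |line|
    next if line.strip.empty?
    entry = parse_symbol_line(line)
    ([entry.symbol] + entry.aliases).each { |name| index[name] ||= entry }
  end
  index
end

sample = [
  "ERBB2\tNEU|HER-2|CD340|HER2|NGL\tv-erb-b2 avian erythroblastic leukemia viral oncogene homolog 2",
  "ERBB2IP\tERBIN|LAP2\terbb2 interacting protein"
]
index = build_symbol_index(sample)
puts index["HER2"].symbol   # prints ERBB2
```

Indexing aliases alongside official symbols is what lets a fuzzy match such as HER2 be reported under its HUGO name.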
177
+ ### Making a text file of your document
178
+
179
+ Save HTML/Word/Excel/PDF files in a textual format. Command line
180
+ tools, such as lynx, antiword and pdftotext exist for this purpose. An
181
+ example of a textual version of an online Nature paper can be made with
182
+
183
+ lynx --dump http://www.nature.com/nature/journal/v490/n7418/full/nature11412.html >> tcga_bc.txt
184
+
185
+ Warning: do not check this file into any public repository! Nature publishing
186
+ group will not be amused.
187
+
188
+ ### Using Exominer to mine a text file for symbols
189
+
190
+ Pass the symbol file on the command line and pipe in the textual file, e.g.
191
+
192
+ exominer -s ncbi_symbols.tab --hugo hugo_symbols.tab < tcga_bc.txt
193
+
194
+ This results in a list of symbols and aliases found in the paper, with
195
+ their tally. For example
196
+
197
+ 35 FOXA1 forkhead box A1
198
+ 36 cas CRISPR associated Cas2 family protein
199
+ 36 AKT1 v-akt murine thymoma viral oncogene homolog 1
200
+ 37 BRCA2 hypothetical protein
201
+ 37 BRAF v-raf murine sarcoma viral oncogene homolog B1
202
+ 37 BRCA1 breast cancer 1, early onset
203
+ 38 A replication gene A protein
204
+ 38 AFF2 Ady2-Fun34 like Family, similar to S. cerevisiae FUN34 (YNR002C) and ADY2 (YCR010C); similar to Yarrowia glyoxalate pathway regulator, possible transmembrane acetate facilitator/sensor
205
+ 39 PDGFRA platelet-derived growth factor receptor, alpha polypeptide
206
+ 39 RAD51C Rad51 DNA recombinase 3
207
+ 39 MAP3K1 mitogen-activated protein kinase kinase kinase 1, E3 ubiquitin protein ligase
208
+ 41 AKT3 v-akt murine thymoma viral oncogene homolog 3 (protein kinase B, gamma)
209
+ 43 ATM hypothetical protein
210
+ 90 can carbonic anhydrase 2 Can
211
+
212
+ Out of a total of 12,774,630 symbols and 3,201,281 aliases scanned
213
+
214
+ This is not an authoritative list, but because it is such a comprehensive
215
+ list of symbols and aliases there should be few false negatives.
216
+ Obviously the last one is a false positive, but these should be easy
217
+ to spot and weed out. The idea is to end up with a list of candidate
218
+ exome targets. So the possible next step (when not using a
219
+ triple-store) allows for subtracting symbols already in a design (not
220
+ yet implemented/NYI):
221
+
222
+ exominer -s ncbi_symbols.tab --ignore list.tab < tcga_bc.txt
223
+
224
+ where list.tab contains a list of symbols to ignore. These symbols
225
+ *with* their aliases are skipped in the text mining step.
226
+
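The tally and the (not yet implemented) ignore list can be illustrated with a short Ruby sketch. It is an assumption about how such a step could work, not exominer's matching code: `tally_symbols` is a hypothetical method, `index` is assumed to map every symbol and alias to its official symbol, and the tokenizer is deliberately simple.

```ruby
# Hypothetical sketch of the tally step; exominer's real matching may differ.
# index maps every symbol and alias to its official symbol.
def tally_symbols(text, index, ignore = [])
  # Ignoring a name also ignores the official symbol behind it,
  # and therefore all of its aliases.
  ignored = ignore.map { |name| index.fetch(name, name) }
  counts = Hash.new(0)
  # Tokens are runs of letters, digits, underscores and dashes.
  text.scan(/[A-Za-z0-9][A-Za-z0-9_-]*/).each do |token|
    official = index[token]
    next if official.nil? || ignored.include?(official)
    counts[official] += 1
  end
  # Ascending by count, like the listing above.
  counts.sort_by { |symbol, count| [count, symbol] }
end

index = {
  "ERBB2" => "ERBB2", "HER2" => "ERBB2", "HER-2" => "ERBB2",
  "BRCA1" => "BRCA1", "can" => "can"
}
text = "HER2 and BRCA1 can interact. Amplified HER-2 (ERBB2) can be targeted."

tally_symbols(text, index).each { |symbol, count| puts "#{count}\t#{symbol}" }
tally_symbols(text, index, ["HER2"]).each { |symbol, count| puts "#{count}\t#{symbol}" }
```

In this toy run the word "can" is counted as a symbol, which mirrors the false positive discussed above.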
227
+ This can be useful when mining one paper at a time. Multiple papers are better,
228
+ because there will be more evidence on gene names and symbols. Exominer can
229
+ export results to RDF for powerful querying. More on that below.
230
+
231
+ Also, when you have an existing exome design, it is possible to add
232
+ a prepared exome list and accompanying design to an
233
+ RDF triple store for further exploration.
234
+
235
+ ## Speeding up text search
236
+
237
+ To speed things up you can create a binary version of the symbols
238
+ table with
239
+
240
+ pack_exominer_symbols ncbi_symbols.tab
241
+
242
+ and rename that file to
243
+
244
+ mv symbols.bin ncbi_symbols.bin
245
+
246
+ Now use the bin file instead with exominer's -s switch.
247
+
248
+ ## Using exominer with a triple-store
249
+
250
+ exominer supports RDF! This means that you can use a triple-store as a
251
+ 'back-end' and add results of multiple runs incrementally. For every
252
+ symbol it is possible to track back the publication and even mine
253
+ extra information, such as publication date, journal type, and whether
254
+ a symbol exists in one or more stored designs. We can even link
255
+ aliases to HUGO symbols, link out,
256
+ and fetch gene information, such as the length of the nucleotide
257
+ sequence. Welcome to the world of the semantic web!
258
+
259
+ When parsing a publication or other resource we want the result set to
260
+ refer back to it. Ideally a DOI is used which can be turned into a
261
+ URI through http://crossref.org/, e.g. doi:10.1038/171737a0 becomes
262
+ http://dx.doi.org/10.1038/171737a0 and can be queried, as explained
263
+ [here](http://inkdroid.org/journal/2011/04/25/dois-as-linked-data/).
264
+
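The DOI-to-URI rewrite in the example above is mechanical, as this small Ruby sketch shows. The helper name `doi_to_uri` is hypothetical and is not part of exominer; the sketch only covers identifiers written with a `doi:` prefix.

```ruby
# Hypothetical helper, not part of exominer: turn a "doi:" identifier
# into the dx.doi.org URI form used in the example above.
def doi_to_uri(doi)
  unless doi =~ /\Adoi:(10\.[\d.]+\/\S+)\z/i
    raise ArgumentError, "not a DOI: #{doi.inspect}"
  end
  "http://dx.doi.org/#{Regexp.last_match(1)}"
end

puts doi_to_uri("doi:10.1038/171737a0")   # prints http://dx.doi.org/10.1038/171737a0
```

Rejecting anything that does not look like a DOI keeps malformed identifiers out of the RDF.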
265
+ If no URI exists, one can use a URL to a web publication, or even
266
+ simply the file name with the year and some tags for describing
267
+ the target of the publication, such as species or disease type.
268
+
269
+ The DOI describing the file:
270
+
271
+ exominer --rdf -s ncbi_symbols.tab --hugo hugo_symbols.tab \
272
+ --doi doi:10.1038/nature11412 < tcga_bc.txt
273
+
274
+ allows for mining title and publication date for every
275
+ symbol found. To add some meta information you could add semi-colon
276
+ separated tags
277
+
278
+ exominer --rdf -s ncbi_symbols.tab --hugo hugo_symbols.tab \
279
+ --doi doi:10.1038/nature11412 --tag 'species=human;type=breast cancer' < tcga_bc.txt
280
+
281
+ which helps mining data later on. If no doi exists, you may just add
282
+ title and year:
283
+
284
+ exominer --rdf -s ncbi_symbols.tab --tag 'title=Comprehensive molecular portraits of human breast tumours' \
285
+ --tag 'year=2012;species=human;type=breast cancer' < tcga_bc.txt
286
+
287
+ Multiple tags are also allowed.
288
+
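As an illustration of the tag syntax, the following Ruby sketch merges one or more semicolon-separated tag strings into a hash. `parse_tags` is a hypothetical helper and not exominer's option parser; it assumes that later tags override earlier ones with the same key and that a trailing semicolon is harmless.

```ruby
# Hypothetical sketch, not exominer's option parser.
# Merge one or more 'key=value;key=value' tag strings into a single hash.
def parse_tags(*tag_strings)
  tags = {}
  tag_strings.each do |str|
    str.split(";").each do |pair|
      next if pair.strip.empty?            # tolerate a trailing semicolon
      key, value = pair.split("=", 2)      # values may themselves contain '='
      tags[key.strip] = value.to_s.strip
    end
  end
  tags
end

p parse_tags("title=Comprehensive molecular portraits of human breast tumours",
             "year=2012;species=human;type=breast cancer")
```

Values are kept as strings, so a year such as 2012 has to be converted explicitly if it is compared numerically later on.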
289
+ exominer generates RDF which can be added to a triple-store. If you
290
+ want to add a design (old or new) treat it as a publication and use something like
291
+
292
+ exominer --rdf --hugo hugo_symbols.tab --tag 'design=Targeted exome;year=2013;' < design.txt
293
+
294
+ These commands create Turtle RDF with the --rdf switch. Save the
295
+ output to a file and upload it to the triple-store with
296
+
297
+ curl -T file.rdf -H 'Content-Type: application/x-turtle' http://localhost:8081/data/exominer.rdf
298
+
299
+ The URI can be a little more descriptive, e.g.:
300
+
301
+ curl -T design2012.rdf -H 'Content-Type: application/x-turtle' http://localhost:8081/data/design2012.rdf
302
+
303
+ Finally, to support multiple searches and make it easier to
304
+ dereference sources you can supply a unique name to each result set
305
+ with the --name switch. E.g.
306
+
307
+ exominer --rdf --name tcga_bc -s ncbi_symbols.tab --hugo hugo_symbols.tab --doi doi:10.1038/nature11412 --tag 'species=human;type=breast cancer' < tcga_bc.txt
308
+
309
+ ## Context
310
+
311
+ When a gene name gets mined from a text, it is nice to see where it is
312
+ coming from. exominer provides context for this reason by including
313
+ the text around the gene name with every reference. This is also a
314
+ great way to weed out false positives! If the context for a gene named
315
+ SE says: 'Department of Oncology, Lund University, SE-221 85 Lund,
316
+ Sweden' - you may think twice about including it into your design.
317
+
318
+ Computers are not always good at automated text mining. The human eye
319
+ can pick these mistakes up quickly, so exominer makes use of human
320
+ recognition. The RDF output contains this context by default. To switch
321
+ context off, you can either add a CLI switch or pass in a tag
322
+ saying 'context=false'.
323
+
324
+ One extra (interesting) facility for context is the --context=line
325
+ switch. This sets the context to the full line in a text file
326
+ (from LF to LF). This can be very useful when parsing tabular
327
+ data (Excel dumps, for example).
328
+
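The two kinds of context described above can be sketched in Ruby as follows. This is an assumption about the behaviour, not exominer's code: `contexts` is a hypothetical method, the default window of 30 characters is arbitrary, and a match is only counted when the symbol is not embedded in a longer word.

```ruby
# Hypothetical sketch, not exominer's implementation.
# mode :window returns `width` characters on either side of each match;
# mode :line returns the complete line that contains the match.
def contexts(text, symbol, mode: :window, width: 30)
  # Only match the symbol when it is not part of a longer word.
  pattern = /(?<![A-Za-z0-9_])#{Regexp.escape(symbol)}(?![A-Za-z0-9_])/
  if mode == :line
    text.each_line.select { |line| line =~ pattern }.map(&:chomp)
  else
    results = []
    text.scan(pattern) do
      m = Regexp.last_match
      from = [m.begin(0) - width, 0].max
      snippet = text[from...(m.end(0) + width)]
      results << snippet.gsub(/\s+/, " ").strip
    end
    results
  end
end

text = "Department of Oncology, Lund University, SE-221 85 Lund,\nSweden\n"
puts contexts(text, "SE", width: 20)
puts contexts(text, "SE", mode: :line)
```

Run on the address from the example above, both modes show at a glance that "SE" is a postal code prefix and not a gene.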
329
+ ## Vocabularies
330
+
331
+ In addition to the standard W3C vocabularies, exominer uses the
332
+ [journal archiving and interchange tag set
333
+ (JATS)](http://jats.nlm.nih.gov/archiving/) for describing
334
+ publications. Another is [Bibliontology](http://bibliontology.com/).
335
+ The British Library vocabulary may be
336
+ [useful](http://www.bl.uk/bibliographic/datasamples.html) too.
337
+
338
+ ## Installing a triple-store
339
+
340
+ If you intend to use exominer with a triple-store you need to install
341
+ one. In principle you can use bio-rdf with any RDF triple store.
342
+ Instructions for installing [4store](http://4store.org/) can be found on
343
+ [bioruby-rdf](https://github.com/pjotrp/bioruby-rdf). You can add
344
+ a new triple-store with
345
+
346
+ ```sh
347
+ 4s-backend-setup exominer
348
+ 4s-backend exominer
349
+ 4s-httpd -p 8081 exominer
350
+ ```
351
+
352
+ and check the webserver is running on http://localhost:8081/status/.
353
+ Again, check bioruby-rdf for instructions on installing 4store and
354
+ sparql-query and examples.
355
+
356
+ ## Mining gene symbols with SPARQL
357
+
358
+ ### Looking for all database information in the triple-store
359
+
360
+ ```sparql
361
+ SELECT * WHERE { ?s ?p ?o }
362
+ ```
363
+
364
+ This can be run with the sparql-query tool
365
+
366
+ ```sh
367
+ sparql-query http://localhost:8081/sparql/ 'SELECT * WHERE { ?s ?p ?o } LIMIT 10'
368
+ ```
369
+
370
+
371
+
372
+ For a non-HUGO symbol, gene ID information can be fetched with
373
+
374
+ ```sparql
375
+ SELECT ?type1, ?label1, count(*)
376
+ WHERE {
377
+ ?s1 ?p1 ?o1 .
378
+ ?o1 bif:contains "HK1" .
379
+ ?s1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type1 .
380
+ ?s1 <http://www.w3.org/2000/01/rdf-schema#label> ?label1 .
381
+ }
382
+ ORDER BY DESC (count(*))
383
+ ```
384
+
385
+ This renders a list of gene IDs. Follow up with, for example,
386
+ http://bio2rdf.org/geneid:100036759
387
+
388
+ ## Project home page
389
+
390
+ For information on the source tree, documentation, examples, issues and
391
+ how to contribute, see
392
+
393
+ http://github.com/pjotrp/bioruby-exominer
394
+
395
+ ## TODO
396
+
397
+ * Fix doi to make full URI
398
+
399
+ ## Cite
400
+
401
+ If you use this software, please cite one of
402
+
403
+ * [BioRuby: bioinformatics software for the Ruby programming language](http://dx.doi.org/10.1093/bioinformatics/btq475)
404
+ * [Biogem: an effective tool-based approach for scaling up open source software development in bioinformatics](http://dx.doi.org/10.1093/bioinformatics/bts080)
405
+
406
+ ## Biogems.info
407
+
408
+ This Biogem is published at (http://biogems.info/index.html#bio-exominer)
409
+
410
+ ## Copyright
411
+
412
+ Copyright (c) 2013,2014 Cuppen Group and Pjotr Prins. See LICENSE.txt for further details.
413
+