bio-exominer 0.0.3
- data/.document +5 -0
- data/.rspec +1 -0
- data/.travis.yml +14 -0
- data/Gemfile +17 -0
- data/LICENSE.txt +20 -0
- data/README.md +413 -0
- data/Rakefile +58 -0
- data/VERSION +1 -0
- data/bin/exominer +250 -0
- data/bin/hugo_exominer_symbols +74 -0
- data/bin/ncbi_exominer_symbols +79 -0
- data/bin/pack_exominer_symbols +38 -0
- data/features/bio-exominer.feature +9 -0
- data/features/step_definitions/bio-exominer_steps.rb +0 -0
- data/features/support/env.rb +13 -0
- data/lib/bio-exominer.rb +14 -0
- data/lib/bio-exominer/exominer.rb +3 -0
- data/lib/bio-exominer/rdf.rb +38 -0
- data/lib/bio-exominer/symbols.rb +49 -0
- data/lib/bio-exominer/textparser.rb +124 -0
- data/scripts/4store.sh +30 -0
- data/scripts/example.sh +9 -0
- data/scripts/example_rdf.sh +7 -0
- data/scripts/load_rdf.sh +15 -0
- data/spec/bio-exominer_spec.rb +8 -0
- data/spec/rdf_spec.rb +28 -0
- data/spec/spec_helper.rb +19 -0
- data/spec/text_parser_spec.rb +59 -0
- data/test/data/input/hugo_symbols +38106 -0
- metadata +195 -0
data/.document
ADDED
data/.rspec
ADDED
@@ -0,0 +1 @@
+--color
data/.travis.yml
ADDED
@@ -0,0 +1,14 @@
+language: ruby
+rvm:
+  - 1.9.3
+  - 2.1.0
+  - ruby-head
+# - jruby-19mode # JRuby in 1.9 mode - no support for msgpack
+
+# - rbx-19mode
+# - 1.8.7
+# - jruby-18mode # JRuby in 1.8 mode
+# - rbx-18mode
+
+# uncomment this line if your project needs to run something other than `rake`:
+# script: bundle exec rspec spec
data/Gemfile
ADDED
@@ -0,0 +1,17 @@
+source "http://rubygems.org"
+# Add dependencies required to use your gem here.
+# Example:
+# gem "activesupport", ">= 2.3.5"
+
+gem 'msgpack'
+
+# Add dependencies to develop your gem here.
+# Include everything needed to run rake, tests, features, etc.
+group :development do
+  gem "minitest", "~> 5.0.7"
+  gem "rspec"
+  gem "cucumber"
+  gem "bundler"
+  gem "jeweler", "~> 2.0.0"
+  gem "rdoc"
+end
data/LICENSE.txt
ADDED
@@ -0,0 +1,20 @@
+Copyright (c) 2013 Cuppen Group and Pjotr Prins
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,413 @@
+# bio-exominer
+
+[![Build Status](https://secure.travis-ci.org/pjotrp/bioruby-exominer.png)](http://travis-ci.org/pjotrp/bioruby-exominer)
+
+Exominer helps build a list of genes from publications.
+
+Such a gene list may be used for identifying candidate genes connected to
+a specific disease, but also may be used to compile a targeted
+exome design for sequencing.
+
+A quick example of a result for a search for pancreatic cancer genes
+that were not listed in an exome design can be seen
+[here](http://biobeat.org/examples/pancreatic_minus_new_design.html).
+
+| gene  | textmatch | description                           | context | resource | doi |
+| ----- | --------- | ------------------------------------- | ------- | --- | --- |
+| AKP8L | HAP95 | A kinase (PRKA) anchor protein 8-like | A cancer-associated RING finger protein, RNF43, is a ubiquitin ligase that interacts with a nuclear protein, HAP95 | Whole-exome sequencing of neoplastic cysts of the pancreas reveals recurrent mutations in components of ubiquitin-dependent pathways | doi:10.1073/pnas.1118046108 |
+
+Here, the second column shows the fuzzy text match, the first column the
+official HUGO name, the third column a description of the gene, the
+fourth column the textual context in the publication, the fifth column
+the title of the publication and the sixth column the DOI. The second
+entry for AM is a false positive; quickly seen by checking the
+context in the fourth column. This output is generated by a SPARQL
+query and a lot of flexibility in combining resources and generating
+output is possible. Note that this is just one example.
+
+The input for Exominer consists of a list of Pubmed IDs with text files (PDF,
+HTML, Word, Excel have to be exported to plain text first). Exominer
+harvests gene names from these documents using a default symbol list
+with aliases. Ideally, all texts would only contain HUGO symbols,
+i.e. the more than 30K standardized gene names from the HUGO Gene
+Nomenclature Committee (HGNC). In reality, scientific authors take
+liberties and the search for names is 'fuzzy'. Therefore Exominer
+also mines for the 12-odd million symbols and aliases
+that are known through NCBI.
+
+All matches are written out with their sources, symbol frequencies,
+publication year, and user-provided keywords and impact
+scores.
+
+Exominer also exports to RDF, so that the gene symbols can be stored
+into a triple-store graph database and link out to Bio2rdf resources.
+The latter allows, for example, harvesting of pathways.
+
+Every RDF export contains full information on the origin of symbols.
+Over time designs can be compared against each other and a historical
+record is maintained. It is a good idea to store the textual versions
+of the files too.
+
+The initial symbol list with aliases can be fetched/generated from external
+sources, such as NCBI, Biomart and/or Bio2rdf. Some examples are listed in this
+README and related scripts are in ./scripts. For a more specific treatment of
+design and input/output of exominer, see ./doc/design.md.
+
+Questions to ask from the RDF
+
+* What genes are mentioned in a paper?
+* What papers refer to certain genes?
+* What genes are mentioned most in papers?
+* What genes are mentioned only in one paper?
+* What genes are mentioned since 2011?
+* What genes are linked to a certain disease subtype?
+* What genes are linked to some author or lab?
+* What genes exist in a design?
+* What are the genes in a design that are non-HUGO named?
+* What are the genes in a paper that are non-HUGO named?
+* How do designs differ?
+* What genes are not in a design mentioned since 2010?
+
+When linking out to TCGA and bio2rdf we can get mutation information and gene sizes
+
+* Give mutations of genes and their sizes of those listed in a paper
+* Give mutations of genes and their sizes of those listed in a design
+
+The TCGA (maf) data was provided by Will's Ruby publisci RDF module. We can ask
+patient related questions
+
+* How many patients are in the TCGA database?
+* How many patients are in the TCGA per tumor type?
+
+And mutation related questions
+
+* Rank patients on number of mutations
+* How many genes show at least one mutation per patient?
+* What genes in what patients show more than X mutations (normalized for gene length)?
+* Rank genes on number of mutations (normalized for gene length)
+* List mutated genes per patient
+* List patient per mutated gene
+* List all mutations that have exactly the same start position and matching variant type (SNP, INS, DEL)
+
+These questions are answered through SPARQL queries below.
+
+Note: this software is under active development!
+
+## Installation
+
+```sh
+gem install bio-exominer
+```
+
+## Quick start
+
+List all genes in a paper. Visit the paper with your browser and save
+it as HTML or text to 'paper.txt'
+
+## Command line interface (CLI)
+
+### Adding NCBI symbols and aliases
+
+NCBI provides a current list of all NCBI used symbols in one large file at
+
+    ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
+    gzip -d gene_info.gz
+
+Fetch this file and unpack it. Note: unpacked, this is a 1.4 GB file; do not
+check this file into a GitHub repository! Create the symbol/alias list for
+exominer with
+
+    ncbi_exominer_symbols gene_info > ncbi_symbols.tab
+
+That makes for some 14 million symbols + aliases(!).
+
+The ncbi_symbols.tab file contains entries, synonyms and descriptions, such as
+
+    repA1 pLeuDn_01 putative replication-associated protein
+    repA2 pLeuDn_03 putative replication-associated protein
+    leuA pLeuDn_04 2-isopropylmalate synthase
+    leuB pLeuDn_05 3-isopropylmalate dehydrogenase
+
+You can remove the original gene_info file again after generating the ncbi_symbols file.
+
+Next to the ncbi_symbols.tab file a frequency file is generated named
+ncbi_exominer_symbols.freq, which contains the frequency of every
+character used in symbol names:
+
+    p: 1255137
+    L: 1907635
+    e: 1334974
+    u: 465711
+    D: 2110781
+    n: 533637
+    _: 11942258
+
+and a list of all characters
+
+    "#%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]_`abcdefghijklmnopqrstuvwxyz{}
+
+In this list some gene symbols and gene names include dashes and dots
+and other characters. Some gene names even contain spaces - we skip
+these for further processing.
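
The character tally described above can be reproduced in a few lines of Ruby. This is a minimal sketch under assumptions, not the code in `ncbi_exominer_symbols`: the method name `char_frequencies` and the sample symbols are chosen for illustration only.

```ruby
# Count how often each character occurs across a list of symbol names.
# Hypothetical helper; the gem's own implementation may differ.
def char_frequencies(symbols)
  freq = Hash.new(0)
  symbols.each do |symbol|
    symbol.each_char { |c| freq[c] += 1 }
  end
  freq
end

symbols = %w[repA1 repA2 leuA leuB pLeuDn_01]
char_frequencies(symbols).sort.each do |char, count|
  puts "#{char}: #{count}"
end

# Names containing whitespace are dropped before further processing
usable = symbols.reject { |s| s =~ /\s/ }
puts "#{usable.size} usable symbols"
```

A tally like this makes it easy to see which punctuation characters a tokenizer has to treat as part of a symbol.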
+
+Later, not all of the millions of NCBI symbols and aliases are written to a
+triple-store. Only those symbols that are mined from the documents
+get stored.
+
+### Adding HUGO symbols and aliases
+
+To make sure all recent HUGO symbols are added, download the HUGO symbols file
+from EBI and parse that
+
+```sh
+wget ftp://ftp.ebi.ac.uk/pub/databases/genenames/reference_genome_set.txt.gz
+gzip -d reference_genome_set.txt.gz
+hugo_exominer_symbols reference_genome_set.txt > hugo_symbols.tab
+```
+
+The hugo_symbols.tab is included with the gem (in test/data/input/hugo_symbols) and will
+always be loaded if you use the --hugo switch without specifying a symbol file. It contains
+entries, synonyms and descriptions, such as
+
+    ERAP2 L-RAP|LRAP endoplasmic reticulum aminopeptidase 2
+    ERAS HRAS2|HRASP ES cell expressed Ras
+    ERBB2 NEU|HER-2|CD340|HER2|NGL v-erb-b2 avian erythroblastic leukemia viral oncogene homolog 2
+    ERBB2IP ERBIN|LAP2 erbb2 interacting protein
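
To illustrate the layout of such a symbols table, the sketch below parses one entry into its parts. It assumes three tab-separated columns (symbol, `|`-separated aliases, description), which the `.tab` extension suggests but the listing above does not confirm; `SymbolEntry` and `parse_symbol_line` are hypothetical names, not part of the gem.

```ruby
# One row of a symbols table: official symbol, aliases, free-text description.
SymbolEntry = Struct.new(:symbol, :aliases, :description)

# Parse a single line. Assumes tab-separated columns with '|' between aliases.
def parse_symbol_line(line)
  symbol, aliases, description = line.chomp.split("\t", 3)
  SymbolEntry.new(symbol, (aliases || '').split('|'), description)
end

line = "ERBB2\tNEU|HER-2|CD340|HER2|NGL\t" \
       'v-erb-b2 avian erythroblastic leukemia viral oncogene homolog 2'
entry = parse_symbol_line(line)
puts entry.symbol
puts entry.aliases.join(', ')
puts entry.description
```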
+
+### Making a text file of your document
+
+Save HTML/Word/Excel/PDF files in a textual format. Command line
+tools, such as lynx, antiword and pdftotext exist for this purpose. An
+example of a textual version of an online Nature paper can be made with
+
+    lynx --dump http://www.nature.com/nature/journal/v490/n7418/full/nature11412.html >> tcga_bc.txt
+
+Warning: do not check this file into any public repository! Nature publishing
+group will not be amused.
+
+### Using Exominer to mine a text file for symbols
+
+Pass the symbol file on the command line and pipe in the textual file, e.g.
+
+    exominer -s ncbi_symbols.tab --hugo hugo_symbols.tab < tcga_bc.txt
+
+This results in a list of symbols and aliases found in the paper, with
+their tally. For example
+
+    35 FOXA1 forkhead box A1
+    36 cas CRISPR associated Cas2 family protein
+    36 AKT1 v-akt murine thymoma viral oncogene homolog 1
+    37 BRCA2 hypothetical protein
+    37 BRAF v-raf murine sarcoma viral oncogene homolog B1
+    37 BRCA1 breast cancer 1, early onset
+    38 A replication gene A protein
+    38 AFF2 Ady2-Fun34 like Family, similar to S. cerevisiae FUN34 (YNR002C) and ADY2 (YCR010C); similar to Yarrowia glyoxalate pathway regulator, possible transmembrane acetate facilitator/sensor
+    39 PDGFRA platelet-derived growth factor receptor, alpha polypeptide
+    39 RAD51C Rad51 DNA recombinase 3
+    39 MAP3K1 mitogen-activated protein kinase kinase kinase 1, E3 ubiquitin protein ligase
+    41 AKT3 v-akt murine thymoma viral oncogene homolog 3 (protein kinase B, gamma)
+    43 ATM hypothetical protein
+    90 can carbonic anhydrase 2 Can
+
+    Out of a total of 12,774,630 symbols and 3,201,281 aliases scanned
+
+This is not an authoritative list but because it is such a comprehensive
+list of symbols and aliases there should be few false negatives.
+Obviously the last one is a false positive, but these should be easy
+to spot and weed out. The idea is to end up with a list of candidate
+exome targets. So the possible next step (when not using a
+triple-store) allows for subtracting symbols already in a design (not
+yet implemented/NYI):
+
+    exominer -s ncbi_symbols.tab --ignore list.tab < tcga_bc.txt
+
+where list.tab contains a list of symbols to ignore. These symbols
+*with* their aliases are skipped in the text mining step.
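
The tally and the proposed ignore list can be sketched as follows. This is an illustration under assumptions rather than exominer's own matching code: it matches tokens exactly (the real search is described as fuzzy), and the method name `tally_symbols`, the `table` layout and the `ignore:` keyword are all hypothetical.

```ruby
require 'set'

# Tally how often known symbols or aliases occur in a text.
# `table` maps each official symbol to its aliases (illustrative layout).
# `ignore` lists official symbols to skip together with their aliases.
def tally_symbols(text, table, ignore: [])
  skip = Set.new(ignore)
  lookup = {}
  table.each do |symbol, aliases|
    next if skip.include?(symbol)
    ([symbol] + aliases).each { |name| lookup[name] ||= symbol }
  end
  counts = Hash.new(0)
  # A token is a run of letters, digits, dashes or underscores
  text.scan(/[A-Za-z0-9_-]+/) do |token|
    symbol = lookup[token]
    counts[symbol] += 1 if symbol
  end
  counts
end

table = { 'ERBB2' => %w[NEU HER-2 HER2], 'BRCA1' => [] }
text  = 'HER2 amplification and BRCA1 loss; ERBB2 (HER-2) was also mutated.'

p tally_symbols(text, table)
p tally_symbols(text, table, ignore: ['ERBB2'])
```

Counting alias hits under the official symbol keeps the tally comparable across papers that use different names for the same gene.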
+
+This can be useful when mining a paper at a time. Mining multiple papers is better,
+because there will be more evidence on gene names and symbols. Exominer can
+export results to RDF for powerful querying. More on that below.
+
+Also when you have an existing exome design, it is possible to add
+a prepared exome list and accompanying design to an
+RDF triple store for further exploration.
+
+## Speeding up text search
+
+To speed things up you can create a binary version of the symbols
+table with
+
+    pack_exominer_symbols ncbi_symbols.tab
+
+and rename that file to
+
+    mv symbols.bin ncbi_symbols.bin
+
+Now use the bin file instead with exominer's -s switch.
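
The idea behind a binary symbols file is that a pre-parsed table loads faster than re-splitting millions of text lines. The sketch below shows the round trip with `Marshal` from the Ruby standard library, used here only as a stand-in: the actual on-disk format written by `pack_exominer_symbols` is not described in this README (the Gemfile lists msgpack as a dependency), and `pack_symbols` / `unpack_symbols` are hypothetical names.

```ruby
require 'tmpdir'

# Serialize a symbol table to a binary file.
# Marshal is a stand-in for whatever format the gem really uses.
def pack_symbols(table, path)
  File.open(path, 'wb') { |f| f.write(Marshal.dump(table)) }
end

# Load the table back from the binary file.
def unpack_symbols(path)
  File.open(path, 'rb') { |f| Marshal.load(f.read) }
end

table = { 'ERAS' => %w[HRAS2 HRASP], 'ERBB2IP' => %w[ERBIN LAP2] }
Dir.mktmpdir do |dir|
  path = File.join(dir, 'symbols.bin')
  pack_symbols(table, path)
  puts unpack_symbols(path) == table
end
```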
+
+## Using exominer with a triple-store
+
+exominer supports RDF! This means that you can use a triple-store as a
+'back-end' and add results of multiple runs incrementally. For every
+symbol it is possible to track back the publication and even mine
+extra information, such as publication date, journal type, and whether
+a symbol exists in one or more stored designs. We can even link
+aliases to Hugo symbols and link-out
+and fetch gene information, such as the length of the nucleotide
+sequence. Welcome to the world of the semantic web!
+
+When parsing a publication or other resource we want to refer the
+result set to that. Ideally a DOI is used which can be turned into a
+URI through http://crossref.org/, e.g. doi:10.1038/171737a0 becomes
+http://dx.doi.org/10.1038/171737a0 and can be queried, as explained
+[here](http://inkdroid.org/journal/2011/04/25/dois-as-linked-data/).
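
The DOI-to-URI mapping in the example above is a simple string rewrite. A minimal sketch, assuming the `doi:` prefix form used in this README and the `http://dx.doi.org/` resolver base from the example; `doi_to_uri` is a hypothetical helper, not a function of the gem.

```ruby
# Turn a DOI such as 'doi:10.1038/171737a0' into a resolvable URI.
# The resolver base follows the example in the text.
def doi_to_uri(doi)
  id = doi.sub(/\Adoi:/i, '')
  # A DOI starts with '10.', a registrant code, a slash and a suffix
  raise ArgumentError, "not a DOI: #{doi}" unless id =~ %r{\A10\.[\d.]+/\S+\z}
  "http://dx.doi.org/#{id}"
end

puts doi_to_uri('doi:10.1038/171737a0')
puts doi_to_uri('10.1038/nature11412')
```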
+
+If no URI exists, one can use a URL to a web publication, or even
+simply the file name with the year and some tags for describing
+the target of the publication, such as species or disease type.
+
+The DOI describing the file:
+
+    exominer --rdf -s ncbi_symbols.tab --hugo hugo_symbols.tab \
+      --doi doi:10.1038/nature11412 < tcga_bc.txt
+
+allows for mining title and publication date for every
+symbol found. To add some meta information you could add
+semicolon-separated tags
+
+    exominer --rdf -s ncbi_symbols.tab --hugo hugo_symbols.tab \
+      --doi doi:10.1038/nature11412 --tag 'species=human;type=breast cancer' < tcga_bc.txt
+
+which helps mining data later on. If no DOI exists, you may just add
+title and year:
+
+    exominer --rdf -s ncbi_symbols.tab --tag 'title=Comprehensive molecular portraits of human breast tumours' \
+      --tag 'year=2012;species=human;type=breast cancer' < tcga_bc.txt
+
+Multiple tags are also allowed.
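
The tag syntax used in these commands (`key=value` pairs separated by semicolons, possibly spread over several `--tag` switches) can be turned into a single lookup table as sketched below. How exominer itself parses tags is not shown in this README, so `parse_tags` is a hypothetical helper for illustration.

```ruby
# Merge one or more tag strings such as 'species=human;type=breast cancer'
# into a single Hash. Later values overwrite earlier ones for the same key.
def parse_tags(*tag_strings)
  tag_strings.each_with_object({}) do |tags, result|
    tags.split(';').each do |pair|
      key, value = pair.split('=', 2)
      next if key.nil? || key.strip.empty?
      result[key.strip] = value.to_s.strip
    end
  end
end

tags = parse_tags('title=Comprehensive molecular portraits of human breast tumours',
                  'year=2012;species=human;type=breast cancer')
tags.each { |key, value| puts "#{key}: #{value}" }
```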
+
+exominer generates RDF which can be added to a triple-store. If you
+want to add a design (old or new) treat it as a publication and use something like
+
+    exominer --rdf --hugo hugo_symbols.tab --tag 'design=Targeted exome;year=2013;' < design.txt
+
+These commands create Turtle RDF with the --rdf switch. Pipe
+the output into the triple-store with
+
+    curl -T file.rdf -H 'Content-Type: application/x-turtle' http://localhost:8081/data/exominer.rdf
+
+The URI can be a little more descriptive, e.g.:
+
+    curl -T design2012.rdf -H 'Content-Type: application/x-turtle' http://localhost:8081/data/design2012.rdf
+
+Finally, to support multiple searches and make it easier to
+dereference sources you can supply a unique name to each result set
+with the --name switch. E.g.
+
+    exominer --rdf --name tcga_bc -s ncbi_symbols.tab --hugo hugo_symbols.tab --doi doi:10.1038/nature11412 --tag 'species=human;type=breast cancer' < tcga_bc.txt
+
+## Context
+
+When a gene name gets mined from a text, it is nice to see where it is
+coming from. exominer provides context for this reason by including
+the text around the gene name with every reference. This is also a
+great way to weed out false positives! If the context for a gene named
+SE says: 'Department of Oncology, Lund University, SE-221 85 Lund,
+Sweden' - you may think twice about including it into your design.
+
+Computers are not always good at automated text mining. The human eye
+can pick these mistakes up quickly; exominer makes use of human
+recognition. The RDF output contains this context by default. To switch
+context off, you can either add a CLI switch or pass in a tag
+saying 'context=false'.
+
+One extra (interesting) facility for context is the --context=line
+switch. This will set the context to the full line in a text file
+(from LF to LF). This can be very useful when parsing tabular
+data (Excel dumps, for example).
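
The two kinds of context described above (a window of text around the match, or the full line from LF to LF) can be sketched like this. The method name `context_for`, its keyword arguments and the window width are assumptions made for the example; exominer's own context extraction may work differently.

```ruby
# Return the text surrounding the first occurrence of `symbol` in `text`.
# mode :window keeps `width` characters on either side of the match;
# mode :line returns the full line (LF to LF) containing the match.
def context_for(text, symbol, mode: :window, width: 30)
  pos = text.index(symbol)
  return nil unless pos

  if mode == :line
    start = text.rindex("\n", pos)
    start = start ? start + 1 : 0
    stop = text.index("\n", pos) || text.length
    text[start...stop]
  else
    from = [pos - width, 0].max
    text[from, (pos - from) + symbol.length + width].strip
  end
end

address = 'Department of Oncology, Lund University, SE-221 85 Lund, Sweden'
puts context_for(address, 'SE', width: 20)

table = "gene\tnote\nBRCA1\tbreast cancer 1, early onset\nATM\thypothetical protein\n"
puts context_for(table, 'BRCA1', mode: :line)
```

For tabular input the line mode keeps a whole row together, so the neighbouring columns serve as context.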
+
+## Vocabularies
+
+In addition to the standard W3C vocabularies, exominer uses the
+[journal archiving and interchange tag set
+(JAT)](http://jats.nlm.nih.gov/archiving/) for describing
+publications. Another is [Bibliontology](http://bibliontology.com/).
+The British Library vocabulary may be
+[useful](http://www.bl.uk/bibliographic/datasamples.html) too.
+
+## Using exominer with a triple-store
+
+If you intend to use exominer with a triple-store you need to install
+one. In principle you can use bio-rdf with any RDF triple store.
+Instructions for installing [4store](http://4store.org/) can be found on
+[bioruby-rdf](https://github.com/pjotrp/bioruby-rdf). You can add
+a new triple-store with
+
+```sh
+4s-backend-setup exominer
+4s-backend exominer
+4s-httpd -p 8081 exominer
+```
+
+and check the webserver is running on http://localhost:8081/status/.
+Again, check bioruby-rdf for instructions on installing 4store and
+sparql-query and examples.
+
+## Mining gene symbols with SPARQL
+
+### Looking for all database information in the triple-store
+
+```sparql
+SELECT * WHERE { ?s ?p ?o }
+```
+
+This can be run with the sparql-query tool
+
+```
+sparql-query http://localhost:8081/sparql/ 'SELECT * WHERE { ?s ?p ?o } LIMIT 10'
+```
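
The same query can also be sent from Ruby over HTTP. The sketch below only builds the request URL; it assumes the endpoint accepts the standard SPARQL protocol `query` parameter on GET requests, and `sparql_query_uri` is a hypothetical helper name. The endpoint address is the one used in the example above.

```ruby
require 'uri'

# Build the GET URL for a SPARQL query against an HTTP endpoint.
# The SPARQL protocol passes the query text in the `query` parameter.
def sparql_query_uri(endpoint, query)
  uri = URI(endpoint)
  uri.query = URI.encode_www_form(query: query)
  uri
end

uri = sparql_query_uri('http://localhost:8081/sparql/',
                       'SELECT * WHERE { ?s ?p ?o } LIMIT 10')
puts uri
# With a server running, Net::HTTP.get(uri) (after require 'net/http')
# would send the request and return the response body.
```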
+
+
+
+With a non-HUGO gene ID, information can be fetched with
+
+```sparql
+SELECT ?type1, ?label1, count(*)
+WHERE {
+  ?s1 ?p1 ?o1 .
+  ?o1 bif:contains "HK1" .
+  ?s1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type1 .
+  ?s1 <http://www.w3.org/2000/01/rdf-schema#label> ?label1 .
+}
+ORDER BY DESC (count(*))
+```
+
+will render a list of gene IDs. Follow up with, for example,
+http://bio2rdf.org/geneid:100036759
+
+## Project home page
+
+For information on the source tree, documentation, examples, issues and
+how to contribute, see
+
+    http://github.com/pjotrp/bioruby-exominer
+
+## TODO
+
+* Fix doi to make full URI
+
+## Cite
+
+If you use this software, please cite one of
+
+* [BioRuby: bioinformatics software for the Ruby programming language](http://dx.doi.org/10.1093/bioinformatics/btq475)
+* [Biogem: an effective tool-based approach for scaling up open source software development in bioinformatics](http://dx.doi.org/10.1093/bioinformatics/bts080)
+
+## Biogems.info
+
+This Biogem is published at (http://biogems.info/index.html#bio-exominer)
+
+## Copyright
+
+Copyright (c) 2013,2014 Cuppen Group and Pjotr Prins. See LICENSE.txt for further details.
+