RubyGems - bio-exominer - Versions diffs - 0.0.3 - Mend

bio-exominer 0.0.3

Files changed (30)

data/.document +5 -0
data/.rspec +1 -0
data/.travis.yml +14 -0
data/Gemfile +17 -0
data/LICENSE.txt +20 -0
data/README.md +413 -0
data/Rakefile +58 -0
data/VERSION +1 -0
data/bin/exominer +250 -0
data/bin/hugo_exominer_symbols +74 -0
data/bin/ncbi_exominer_symbols +79 -0
data/bin/pack_exominer_symbols +38 -0
data/features/bio-exominer.feature +9 -0
data/features/step_definitions/bio-exominer_steps.rb +0 -0
data/features/support/env.rb +13 -0
data/lib/bio-exominer.rb +14 -0
data/lib/bio-exominer/exominer.rb +3 -0
data/lib/bio-exominer/rdf.rb +38 -0
data/lib/bio-exominer/symbols.rb +49 -0
data/lib/bio-exominer/textparser.rb +124 -0
data/scripts/4store.sh +30 -0
data/scripts/example.sh +9 -0
data/scripts/example_rdf.sh +7 -0
data/scripts/load_rdf.sh +15 -0
data/spec/bio-exominer_spec.rb +8 -0
data/spec/rdf_spec.rb +28 -0
data/spec/spec_helper.rb +19 -0
data/spec/text_parser_spec.rb +59 -0
data/test/data/input/hugo_symbols +38106 -0
metadata +195 -0

data/.document ADDED

@@ -0,0 +1,5 @@
+lib/**/*.rb
+bin/*
+-
+features/**/*.feature
+LICENSE.txt

data/.rspec ADDED

@@ -0,0 +1 @@
+--color

data/.travis.yml ADDED

@@ -0,0 +1,14 @@
+language: ruby
+rvm:
+  - 1.9.3
+  - 2.1.0
+  - ruby-head
+#  - jruby-19mode # JRuby in 1.9 mode - no support for msgpack
+# - rbx-19mode
+#  - 1.8.7
+#  - jruby-18mode # JRuby in 1.8 mode
+#  - rbx-18mode
+# uncomment this line if your project needs to run something other than `rake`:
+# script: bundle exec rspec spec

data/Gemfile ADDED

@@ -0,0 +1,17 @@
+source "http://rubygems.org"
+# Add dependencies required to use your gem here.
+# Example:
+#   gem "activesupport", ">= 2.3.5"
+gem 'msgpack'
+# Add dependencies to develop your gem here.
+# Include everything needed to run rake, tests, features, etc.
+group :development do
+  gem "minitest", "~> 5.0.7"
+  gem "rspec"
+  gem "cucumber"
+  gem "bundler"
+  gem "jeweler", "~> 2.0.0"
+  gem "rdoc"
+end

data/LICENSE.txt ADDED

@@ -0,0 +1,20 @@
+Copyright (c) 2013 Cuppen Group and Pjotr Prins
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,413 @@
+# bio-exominer
+[![Build Status](https://secure.travis-ci.org/pjotrp/bioruby-exominer.png)](http://travis-ci.org/pjotrp/bioruby-exominer)
+Exominer helps build a list of genes from publications.
+Such a gene list may be used for identifying candidate genes connected to
+a specific disease, but also may be used to compile a targeted
+exome design for sequencing.
+A quick example of a result for a search for pancreatic cancer genes
+that were not listed in an exome design can be seen
+[here](http://biobeat.org/examples/pancreatic_minus_new_design.html).
+| gene  | textmatch | description                           | context | resource | doi |
+| ----- | --------- | ------------------------------------- | ------- | --- | --- |
+| AKAP8L | HAP95     | A kinase (PRKA) anchor protein 8-like | A cancer-associated RING finger protein, RNF43, is a ubiquitin ligase that interacts with a nuclear protein, HAP95 | Whole-exome sequencing of neoplastic cysts of the pancreas reveals recurrent mutations in components of ubiquitin-dependent pathways | doi:10.1073/pnas.1118046108 |
+Here, the first column shows the official HUGO name, the second column
+the fuzzy text match, the third column a description of the gene, the
+fourth column the textual context in the publication, the fifth column
+the title of the publication and the sixth column the DOI. The second
+entry for AM is a false positive, which is quickly seen by checking the
+context in the fourth column. This output is generated by a SPARQL
+query, and a lot of flexibility in combining resources and generating
+output is possible. Note that this is just one example.
+The input for Exominer consists of a list of PubMed IDs with text files (PDF,
+HTML, Word and Excel files have to be exported to plain text first). Exominer
+harvests gene names from these documents using a default symbol list
+with aliases. Ideally, all texts would only contain HUGO symbols,
+i.e. the over 30K gene names standardized by the HUGO Gene
+Nomenclature Committee (HGNC). In reality, scientific authors take
+liberties, so the search for names has to be 'fuzzy'. Therefore
+Exominer also mines for the 12-odd million symbols and aliases
+that are known through NCBI.
+All matches are written out with their sources, symbol frequencies,
+publication year, and user-provided keywords and impact scores.
+Exominer also exports to RDF, so that the gene symbols can be stored
+into a triple-store graph database and link out to Bio2rdf resources.
+The latter allows, for example, harvesting of pathways.
+Every RDF export contains full information on the origin of symbols.
+Over time designs can be compared against each other and a historical
+record is maintained. It is a good idea to store the textual versions
+of the files too.
+The initial symbol list with aliases can be fetched/generated from external
+sources, such as NCBI, Biomart and/or Bio2rdf. Some examples are listed in this
+README and related scripts are in ./scripts. For a more specific treatment of
+design and input/output of exominer, see ./doc/design.md.
+Questions to ask from the RDF
+* What genes are mentioned in a paper?
+* What papers refer to certain genes?
+* What genes are mentioned most in papers?
+* What genes are mentioned only in one paper?
+* What genes are mentioned since 2011?
+* What genes are linked to a certain disease subtype?
+* What genes are linked to some author or lab?
+* What genes exist in a design?
+* What are the genes in a design that are non-HUGO named?
+* What are the genes in a paper that are non-HUGO named?
+* How do designs differ?
+* What genes are not in a design mentioned since 2010?
+When linking out to TCGA and bio2rdf we can get mutation information and gene sizes
+* Give mutations of genes and their sizes of those listed in a paper
+* Give mutations of genes and their sizes of those listed in a design
+The TCGA (maf) data was provided by Will's Ruby publisci RDF module. We can ask
+patient related questions
+* How many patients are in the TCGA database?
+* How many patients are in the TCGA per tumor type?
+And mutation related questions
+* Rank patients on number of mutations
+* How many genes show at least one mutation per patient?
+* What genes in what patients show more than X mutations (normalized for gene length)?
+* Rank genes on number of mutations (normalized for gene length)
+* List mutated genes per patient
+* List patient per mutated gene
+* List all mutations that have exactly the same start position and matching variant type (SNP, INS, DEL)
+These questions are answered through SPARQL queries below.
+Note: this software is under active development!
+## Installation
+```sh
+gem install bio-exominer
+```
+## Quick start
+List all genes in a paper. Visit the paper with your browser and save
+it as HTML or text to 'paper.txt'
+## Command line interface (CLI)
+### Adding NCBI symbols and aliases
+NCBI provides a current list of all NCBI used symbols in one large file at
+  ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
+  gzip -d gene_info.gz
+Fetch this file and unpack it. Note: unpacked, this is a 1.4 GB file; do not
+check this file into a GitHub repository! Create the symbol/alias list for
+exominer with
+  ncbi_exominer_symbols gene_info > ncbi_symbols.tab
+That makes for some 14 million symbols + aliases(!).
+The ncbi_symbols.tab file contains entries, synonyms and descriptions, such as
+  repA1   pLeuDn_01       putative replication-associated protein
+  repA2   pLeuDn_03       putative replication-associated protein
+  leuA    pLeuDn_04       2-isopropylmalate synthase
+  leuB    pLeuDn_05       3-isopropylmalate dehydrogenase
+You can remove the original gene_info file again after generating the ncbi_symbols file.
+Next to the ncbi_symbols.tab file a frequency file is generated named
+ncbi_exominer_symbols.freq, which contains the frequency of every
+character used in symbol names:
+  p: 1255137
+  L: 1907635
+  e: 1334974
+  u: 465711
+  D: 2110781
+  n: 533637
+  _: 11942258
+and a list of all characters
+   "#%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]_`abcdefghijklmnopqrstuvwxyz{}
+In this list some gene symbols and gene names include dashes and dots
+and other characters. Some gene names even contain spaces - we skip
+these for further processing.
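The character tally described above can be approximated in a few lines of Ruby. This is a minimal sketch and not the code in ncbi_exominer_symbols: it assumes that the first two tab-separated fields of each line hold the symbol and its alias, and `char_frequencies` is a hypothetical helper name.

```ruby
# Tally how often each character occurs in symbol and alias names.
# Assumption: the first two tab-separated fields are symbol and alias.
def char_frequencies(lines)
  freq = Hash.new(0)
  lines.each do |line|
    names = line.chomp.split("\t")[0, 2]
    names.each do |name|
      next if name.empty?
      next if name.include?(' ') # names containing spaces are skipped
      name.each_char { |c| freq[c] += 1 }
    end
  end
  freq
end

sample = [
  "repA1\tpLeuDn_01\tputative replication-associated protein",
  "leuA\tpLeuDn_04\t2-isopropylmalate synthase",
  "some gene\tsg1\tname containing a space"
]
freq = char_frequencies(sample)
freq.each { |char, count| puts "#{char}: #{count}" }
puts freq.keys.sort.join # the list of all characters seen
```

Skipping names with spaces at this stage keeps the tokenizer simple later on, at the cost of missing multi-word gene names.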
+Later on, not all of the millions of NCBI symbols and aliases are written to a
+triple-store. Only those symbols that are mined from the
+documents get stored.
+### Adding HUGO symbols and aliases
+To make sure all recent HUGO symbols are added, download the HUGO symbols file
+from EBI and parse that
+```sh
+  wget ftp://ftp.ebi.ac.uk/pub/databases/genenames/reference_genome_set.txt.gz
+  gzip -d reference_genome_set.txt.gz
+  hugo_exominer_symbols reference_genome_set.txt > hugo_symbols.tab
+```
+The hugo_symbols.tab file is included with the gem (in test/data/input/hugo_symbols) and will
+always be loaded if you use the --hugo switch without specifying a symbol file. It contains
+entries, synonyms and descriptions, such as
+  ERAP2 L-RAP|LRAP  endoplasmic reticulum aminopeptidase 2
+  ERAS  HRAS2|HRASP ES cell expressed Ras
+  ERBB2 NEU|HER-2|CD340|HER2|NGL  v-erb-b2 avian erythroblastic leukemia viral oncogene homolog 2
+  ERBB2IP ERBIN|LAP2  erbb2 interacting protein
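To illustrate how such a symbol file can be used, the sketch below parses entries and builds a lookup from every alias to its official symbol. It is only an illustration: it assumes the columns are tab-separated and the aliases are separated by '|', and the names `SymbolEntry`, `parse_symbol_line` and `alias_index` are hypothetical, not part of the gem.

```ruby
# Hypothetical record type for one line of a symbol file.
SymbolEntry = Struct.new(:symbol, :aliases, :description)

# Parse one line; assumes tab-separated columns and '|'-separated aliases.
def parse_symbol_line(line)
  symbol, aliases, description = line.chomp.split("\t", 3)
  SymbolEntry.new(symbol, (aliases || '').split('|'), description)
end

# Map every symbol and alias to the official symbol.
# The first entry that claims an alias wins.
def alias_index(entries)
  index = {}
  entries.each do |entry|
    index[entry.symbol] = entry.symbol
    entry.aliases.each { |name| index[name] ||= entry.symbol }
  end
  index
end

lines = [
  "ERBB2\tNEU|HER-2|CD340|HER2|NGL\tv-erb-b2 avian erythroblastic leukemia viral oncogene homolog 2",
  "ERBB2IP\tERBIN|LAP2\terbb2 interacting protein"
]
index = alias_index(lines.map { |line| parse_symbol_line(line) })
puts index['HER2']
puts index['ERBIN']
```

A flat hash keeps the lookup cheap, which matters when the NCBI list with its millions of aliases is loaded next to the HUGO list.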
+### Making a text file of your document
+Save HTML/Word/Excel/PDF files in a textual format. Command line
+tools, such as lynx, antiword and pdftotext exist for this purpose. An
+example of a textual version of an online Nature paper can be made with
+  lynx --dump http://www.nature.com/nature/journal/v490/n7418/full/nature11412.html >> tcga_bc.txt
+Warning: do not check this file into any public repository! Nature publishing
+group will not be amused.
+### Using Exominer to mine a text file for symbols
+Pass the symbol file on the command line and pipe in the textual file, e.g.
+  exominer -s ncbi_symbols.tab --hugo hugo_symbols.tab < tcga_bc.txt
+This results in a list of symbols and aliases found in the paper, with
+their tally. For example
+    35      FOXA1   forkhead box A1
+    36      cas     CRISPR associated Cas2 family protein
+    36      AKT1    v-akt murine thymoma viral oncogene homolog 1
+    37      BRCA2   hypothetical protein
+    37      BRAF    v-raf murine sarcoma viral oncogene homolog B1
+    37      BRCA1   breast cancer 1, early onset
+    38      A       replication gene A protein
+    38      AFF2    Ady2-Fun34 like Family, similar to S. cerevisiae FUN34 (YNR002C) and ADY2 (YCR010C); similar to Yarrowia glyoxalate pathway regulator, possible transmembrane acetate facilitator/sensor
+    39      PDGFRA  platelet-derived growth factor receptor, alpha polypeptide
+    39      RAD51C  Rad51 DNA recombinase 3
+    39      MAP3K1  mitogen-activated protein kinase kinase kinase 1, E3 ubiquitin protein ligase
+    41      AKT3    v-akt murine thymoma viral oncogene homolog 3 (protein kinase B, gamma)
+    43      ATM     hypothetical protein
+    90      can     carbonic anhydrase 2 Can
+  Out of a total of 12,774,630 symbols and 3,201,281 aliases scanned
+This is not an authoritative list, but because it is such a comprehensive
+list of symbols and aliases there should be few false negatives.
+Obviously the last one ('can') is a false positive, but these should be easy
+to spot and weed out. The idea is to end up with a list of candidate
+exome targets. A possible next step (when not using a
+triple-store) is to subtract symbols already in a design (not
+yet implemented/NYI):
+  exominer -s ncbi_symbols.tab --ignore list.tab < tcga_bc.txt
+where list.tab contains a list of symbols to ignore. These symbols
+*with* their aliases are skipped in the text mining step.
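The tallying and ignoring steps described above can be sketched as follows. This is an assumption-laden illustration and not exominer's own matcher: the tokenizer (runs of letters, digits, '.', '_' and '-') is a guess, `tally_symbols` is a hypothetical name, and the ignore set here only skips exact names, not their aliases.

```ruby
require 'set'

# Count occurrences of known symbols in a text, skipping ignored ones.
# Assumption: a token starts with a letter or digit and may continue
# with letters, digits, '.', '_' and '-'.
def tally_symbols(text, symbols, ignore = Set.new)
  counts = Hash.new(0)
  text.scan(/[A-Za-z0-9][A-Za-z0-9._-]*/) do |token|
    token = token.sub(/[.-]+\z/, '') # drop trailing sentence punctuation
    next unless symbols.include?(token)
    next if ignore.include?(token)
    counts[token] += 1
  end
  # Sort by tally, lowest first, like the example listing.
  counts.sort_by { |symbol, n| [n, symbol] }
end

text = 'BRCA1 and BRCA2 were mutated. BRCA1 can interact with ATM.'
symbols = Set['BRCA1', 'BRCA2', 'ATM', 'can']
tally_symbols(text, symbols, Set['can']).each do |symbol, n|
  puts format('%6d      %s', n, symbol)
end
```

Passing 'can' in the ignore set removes the kind of false positive seen in the listing above before it reaches the result.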
+This can be useful when mining one paper at a time. Mining multiple papers is better,
+because there will be more evidence on gene names and symbols. Exominer can
+export results to RDF for powerful querying. More on that below.
+Also, when you have an existing exome design, it is possible to add
+a prepared exome list and accompanying design to an
+RDF triple store for further exploration.
+## Speeding up text search
+To speed things up you can create a binary version of the symbols
+table with
+  pack_exominer_symbols ncbi_symbols.tab
+and rename that file to
+  mv symbols.bin ncbi_symbols.bin
+Now use the bin file instead with exominer's -s switch.
+## Using exominer with a triple-store
+exominer supports RDF! This means that you can use a triple-store as a
+'back-end' and add results of multiple runs incrementally. For every
+symbol it is possible to track back the publication and even mine
+extra information, such as publication date, journal type, and whether
+a symbol exists in one or more stored designs. We can even link
+aliases to HUGO symbols and link out
+and fetch gene information, such as the length of the nucleotide
+sequence. Welcome to the world of the semantic web!
+When parsing a publication or other resource we want the
+result set to refer back to it. Ideally a DOI is used, which can be turned into a
+URI through http://crossref.org/, e.g. doi:10.1038/171737a0 becomes
+http://dx.doi.org/10.1038/171737a0 and can be queried, as explained
+[here](http://inkdroid.org/journal/2011/04/25/dois-as-linked-data/).
+If no DOI exists, one can use a URL to a web publication, or even
+simply the file name with the year and some tags for describing
+the target of the publication, such as species or disease type.
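The DOI-to-URI step mentioned above is a simple string transformation. A minimal sketch, assuming the dx.doi.org resolver prefix from the example; `doi_to_uri` is a hypothetical helper, not a function of the gem:

```ruby
# Turn a 'doi:' identifier into a resolvable URI.
# Assumption: the http://dx.doi.org/ resolver prefix is wanted.
def doi_to_uri(doi)
  id = doi.strip.sub(/\Adoi:/i, '')
  # A DOI starts with '10.', a registrant code, a slash and a suffix.
  unless id =~ %r{\A10\.\d[\d.]*/\S+\z}
    raise ArgumentError, "not a DOI: #{doi}"
  end
  "http://dx.doi.org/#{id}"
end

puts doi_to_uri('doi:10.1038/171737a0') # prints http://dx.doi.org/10.1038/171737a0
```

Rejecting anything that does not look like a DOI makes it easy to fall back on a URL or file name, as described above.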
+The DOI describing the file:
+  exominer --rdf -s ncbi_symbols.tab --hugo hugo_symbols.tab \
+    --doi doi:10.1038/nature11412 < tcga_bc.txt
+allows for mining title and publication date for every
+symbol found. To add some meta information you could add semi-colon
+separated tags
+  exominer --rdf -s ncbi_symbols.tab --hugo hugo_symbols.tab \
+    --doi doi:10.1038/nature11412 --tag 'species=human;type=breast cancer' < tcga_bc.txt
+which helps mining data later on. If no doi exists, you may just add
+title and year:
+  exominer --rdf -s ncbi_symbols.tab --tag 'title=Comprehensive molecular portraits of human breast tumours' \
+    --tag 'year=2012;species=human;type=breast cancer' < tcga_bc.txt
+Multiple tags are also allowed.
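The tag syntax shown in these commands (key=value pairs separated by semicolons, possibly spread over several --tag switches) could be parsed along these lines. This is a sketch under the assumption that later values override earlier ones; `parse_tags` is a hypothetical name and not necessarily how exominer handles tags internally.

```ruby
# Parse one or more tag strings such as 'species=human;type=breast cancer'
# into a single hash. Empty or malformed pairs are ignored.
def parse_tags(*tag_strings)
  tags = {}
  tag_strings.each do |str|
    str.split(';').each do |pair|
      key, value = pair.split('=', 2)
      next if key.nil? || key.strip.empty? || value.nil?
      tags[key.strip] = value.strip
    end
  end
  tags
end

tags = parse_tags(
  'title=Comprehensive molecular portraits of human breast tumours',
  'year=2012;species=human;type=breast cancer'
)
tags.each { |key, value| puts "#{key}: #{value}" }
```

Splitting on the first '=' only means that values may themselves contain an equals sign.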
+exominer generates RDF which can be added to a triple-store. If you
+want to add a design (old or new) treat it as a publication and use something like
+  exominer --rdf --hugo hugo_symbols.tab --tag 'design=Targeted exome;year=2013;' < design.txt
+These commands create turtle RDF with the --rdf switch. Pipe
+the output into the triple-store with
+  curl -T file.rdf -H 'Content-Type: application/x-turtle' http://localhost:8081/data/exominer.rdf
+The URI can be a little more descriptive, e.g.:
+  curl -T design2012.rdf -H 'Content-Type: application/x-turtle' http://localhost:8081/data/design2012.rdf
+Finally, to support multiple searches and make it easier to
+dereference sources you can supply a unique name to each result set
+with the --name switch. E.g.
+  exominer --rdf --name tcga_bc -s ncbi_symbols.tab --hugo hugo_symbols.tab --doi doi:10.1038/nature11412 --tag 'species=human;type=breast cancer' < tcga_bc.txt
+## Context
+When a gene name gets mined from a text, it is nice to see where it is
+coming from. exominer provides context for this reason by including
+the text around the gene name with every reference. This is also a
+great way to weed out false positives! If the context for a gene named
+SE says: 'Department of Oncology, Lund University, SE-221 85 Lund,
+Sweden' - you may think twice about including it into your design.
+Computers are not always good at automated text mining. The human eye
+can pick these mistakes up quickly, and exominer makes use of human
+recognition. The RDF output contains this context by default. To switch
+context off, either add a CLI switch or pass in a tag
+saying 'context=false'.
+One extra (interesting) facility for context is the --context=line
+switch. This will set the context to the full line in a text file
+(from LF to LF). This can be very useful when parsing tabular
+data (Excel dumps, for example).
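The two context modes described in this section can be sketched as follows. This is an illustration only: the window width of 40 characters, the handling of only the first occurrence, and the name `context_for` are assumptions, not exominer's actual behaviour.

```ruby
# Return the text surrounding the first occurrence of a symbol.
# mode :window gives up to `width` characters on either side;
# mode :line gives the full line (from LF to LF).
def context_for(text, symbol, mode: :window, width: 40)
  pos = text.index(symbol)
  return nil if pos.nil?

  if mode == :line
    start = text.rindex("\n", pos)
    start = start.nil? ? 0 : start + 1
    stop = text.index("\n", pos) || text.length
    text[start...stop]
  else
    start = [pos - width, 0].max
    length = (pos - start) + symbol.length + width
    text[start, length].gsub(/\s+/, ' ') # flatten line breaks
  end
end

address = 'Department of Oncology, Lund University, SE-221 85 Lund, Sweden'
puts context_for(address, 'SE', width: 20)

table = "gene\tscore\nSE\t0.5\nTP53\t0.9\n"
puts context_for(table, 'TP53', mode: :line)
```

For tabular dumps the line mode keeps all columns of a row together, which a fixed character window would cut off.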
+## Vocabularies
+In addition to the standard W3C vocabularies, exominer uses the
+[Journal Archiving and Interchange Tag Set
+(JATS)](http://jats.nlm.nih.gov/archiving/) for describing
+publications. Another is [Bibliontology](http://bibliontology.com/).
+The British Library vocabulary may be
+[useful](http://www.bl.uk/bibliographic/datasamples.html) too.
+## Using exominer with a triple-store
+If you intend to use exominer with a triple-store you need to install
+one. In principle you can use bio-rdf with any RDF triple store.
+Instructions for installing [4store](http://4store.org/) can be found on
+[bioruby-rdf](https://github.com/pjotrp/bioruby-rdf). You can add
+a new triple-store with
+```sh
+4s-backend-setup exominer
+4s-backend exominer
+4s-httpd -p 8081 exominer
+```
+and check the webserver is running on http://localhost:8081/status/.
+Again, check bioruby-rdf for instructions on installing 4store and
+sparql-query and examples.
+## Mining gene symbols with SPARQL
+### Looking for all database information in the triple-store
+```sparql
+SELECT * WHERE { ?s ?p ?o }
+```
+This can be run with the sparql-query tool
+```
+sparql-query http://localhost:8081/sparql/ 'SELECT * WHERE { ?s ?p ?o } LIMIT 10'
+```
+For a non-HUGO gene ID, information can be fetched with the following query:
+```sparql
+SELECT ?type1, ?label1, count(*)
+WHERE {
+?s1 ?p1 ?o1 .
+?o1 bif:contains "HK1" .
+?s1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type1 .
+?s1 <http://www.w3.org/2000/01/rdf-schema#label> ?label1 .
+}
+ORDER BY DESC (count(*))
+```
+This query renders a list of gene IDs. Follow up with, for example,
+http://bio2rdf.org/geneid:100036759
+## Project home page
+For information on the source tree, documentation, examples, issues and
+how to contribute, see
+  http://github.com/pjotrp/bioruby-exominer
+## TODO
+* Fix doi to make full URI
+## Cite
+If you use this software, please cite one of
+* [BioRuby: bioinformatics software for the Ruby programming language](http://dx.doi.org/10.1093/bioinformatics/btq475)
+* [Biogem: an effective tool-based approach for scaling up open source software development in bioinformatics](http://dx.doi.org/10.1093/bioinformatics/bts080)
+## Biogems.info
+This Biogem is published at (http://biogems.info/index.html#bio-exominer)
+## Copyright
+Copyright (c) 2013,2014 Cuppen Group and Pjotr Prins. See LICENSE.txt for further details.