RubyGems - bio-exominer - Versions diffs - 0.0.3 - Mend

bio-exominer 0.0.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (30) hide show

data/.document +5 -0
data/.rspec +1 -0
data/.travis.yml +14 -0
data/Gemfile +17 -0
data/LICENSE.txt +20 -0
data/README.md +413 -0
data/Rakefile +58 -0
data/VERSION +1 -0
data/bin/exominer +250 -0
data/bin/hugo_exominer_symbols +74 -0
data/bin/ncbi_exominer_symbols +79 -0
data/bin/pack_exominer_symbols +38 -0
data/features/bio-exominer.feature +9 -0
data/features/step_definitions/bio-exominer_steps.rb +0 -0
data/features/support/env.rb +13 -0
data/lib/bio-exominer.rb +14 -0
data/lib/bio-exominer/exominer.rb +3 -0
data/lib/bio-exominer/rdf.rb +38 -0
data/lib/bio-exominer/symbols.rb +49 -0
data/lib/bio-exominer/textparser.rb +124 -0
data/scripts/4store.sh +30 -0
data/scripts/example.sh +9 -0
data/scripts/example_rdf.sh +7 -0
data/scripts/load_rdf.sh +15 -0
data/spec/bio-exominer_spec.rb +8 -0
data/spec/rdf_spec.rb +28 -0
data/spec/spec_helper.rb +19 -0
data/spec/text_parser_spec.rb +59 -0
data/test/data/input/hugo_symbols +38106 -0
metadata +195 -0

data/.document ADDED

@@ -0,0 +1,5 @@
+lib/**/*.rb
+bin/*
+-
+features/**/*.feature
+LICENSE.txt

data/.rspec ADDED

	@@ -0,0 +1 @@
1	+ --color

data/.travis.yml ADDED

@@ -0,0 +1,14 @@
+language: ruby
+rvm:
+  - 1.9.3
+  - 2.1.0
+  - ruby-head
+#  - jruby-19mode # JRuby in 1.9 mode - no support for msgpack
+# - rbx-19mode
+#  - 1.8.7
+#  - jruby-18mode # JRuby in 1.8 mode
+#  - rbx-18mode
+# uncomment this line if your project needs to run something other than `rake`:
+# script: bundle exec rspec spec

data/Gemfile ADDED

@@ -0,0 +1,17 @@
+source "http://rubygems.org"
+# Add dependencies required to use your gem here.
+# Example:
+#   gem "activesupport", ">= 2.3.5"
+gem 'msgpack'
+# Add dependencies to develop your gem here.
+# Include everything needed to run rake, tests, features, etc.
+group :development do
+  gem "minitest", "~> 5.0.7"
+  gem "rspec"
+  gem "cucumber"
+  gem "bundler"
+  gem "jeweler", "~> 2.0.0"
+  gem "rdoc"
+end

data/LICENSE.txt ADDED

@@ -0,0 +1,20 @@
+Copyright (c) 2013 Cuppen Group and Pjotr Prins
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,413 @@
+# bio-exominer
+[![Build Status](https://secure.travis-ci.org/pjotrp/bioruby-exominer.png)](http://travis-ci.org/pjotrp/bioruby-exominer)
+Exominer helps build a list of genes from publications.
+Such a gene list may be used for identifying candidate genes connected to
+a specific disease, but also may be used to compile a targeted
+exome design for sequencing.
+A quick example of a result for a search for pancreatic cancer genes
+that were not listed in an exome design can be seen
+[here](http://biobeat.org/examples/pancreatic_minus_new_design.html).
+| gene  | textmatch | description                           | context | resource | doi |
+| ----- | --------- | ------------------------------------- | ------- | --- | --- |
+| AKP8L | HAP95     | A kinase (PRKA) anchor protein 8-like | A cancer-associated RING finger protein, RNF43, is a ubiquitin ligase that interacts with a nuclear protein, HAP95 | Whole-exome sequencing of neoplastic cysts of the pancreas reveals recurrent mutations in components of ubiquitin-dependent pathways | doi:10.1073/pnas.1118046108 |
+Here, the second column shows the fuzzy text match, the first column the
+official HUGO name, the third column a description of the gene, the
+fourth column the textual context in the publication, the fifth column
+the title of the publication and the sixth column the DOI. The second
+entry for AM is a false positive; quickly seen by checking the
+context in the fourth column. This output is generated by a SPARQL
+query and a lot of flexibility in combining resources and generating
+output is possible.  Note that this is just one example.
+The inputs for Exominer consists of a list of Pubmed IDs with text files (PDF,
+HTML, Word, Excel have to be exported to plain text first).  Exominer
+harvests gene names from these documents using a default symbol list
+with aliases.  Ideally, all texts would only contain HUGO symbols,
+i.e. the over 30K standardized gene names by the HUGO Gene
+Nomenclature Committee (HGNC). In reality, scientific authors take
+liberties and the search for names is 'fuzzy'. Therefore the search
+for Exominer also mines for the 12 odd million symbols and aliases
+that are known through NCBI.
+All matches are written with their sources, symbol frequencies,
+publication year, and user provided keywords and impact scores and
+written out.
+Exominer also exports to RDF, so that the gene symbols can be stored
+into a triple-store graph database and link out to Bio2rdf resources.
+The latter allows, for example, harvesting of pathways.
+Every RDF export contains full information on the origin of symbols.
+Over time designs can be compared against each other and a historical
+record is maintained. It is a good idea to store the textual versions
+of the files too.
+The initial symbol list with aliases can be fetched/generated from external
+sources, such as NCBI, Biomart and/or Bio2rdf. Some examples are listed in this
+README and related scripts are in ./scripts. For a more specific treatment of
+design and input/output of exominer, see ./doc/design.md.
+Questions to ask from the RDF
+* What genes are mentioned in a paper?
+* What papers refer to certain genes?
+* What genes are mentioned most in papers?
+* What genes are mentioned only in one paper?
+* What genes are mentioned since 2011?
+* What genes are linked to a certain disease subtype?
+* What genes are linked to some author or lab?
+* What genes exist in a design?
+* What are the genes in a design that are non-HUGO named
+* What are the genes in a paper that are non-HUGO named
+* How do designs differ?
+* What genes are not in a design mentioned since 2010?
+When linking out to TCGA and bio2rdf we can get mutation information and gene sizes
+* Give mutations of genes and their sizes of those listed in a paper
+* Give mutations of genes and their sizes of those listed in a design
+The TCGA (maf) data was provided by Will's Ruby publisci RDF module. We can ask
+patient related questions
+* How many patients are in the TCGA database?
+* How many patients are in the TCGA per tumor type?
+And mutation related questions
+* Rank patients on number of mutations
+* How many genes show at least one mutation per patient
+* What genes in what patients show more than X mutations (normalized for gene length)
+* Rank genes on number of mutations (normalized for gene length)
+* List mutated genes per patient
+* List patient per mutated gene
+* List all mutations that have exactly the same start position and matching variant type (SNP, INS, DEL)
+These questions are answered through SPARQL queries below.
+Note: this software is under active development!
+## Installation
+```sh
+gem install bio-exominer
+```
+## Quick start
+List all genes in a paper. Visit the paper with your browser and save
+it as HTML or text to 'paper.txt'
+## Command line interface (CLI)
+### Adding NCBI symbols and aliases
+NCBI provides a current list of all NCBI used symbols in one large file at
+  ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
+  gzip -d gene_info.gz
+Fetch this file and unpack. Note: unpacked this is a 1.4Gb file; do not
+check this file into a github repository! Create the symbol/alias list for
+exominer with
+  ncbi_exominer_symbols gene_info > ncbi_symbols.tab
+That makes for some 14 million symbols + aliases(!).
+The ncbi_symbols.tab file contains entries, synonyms and descriptsions, such as
+  repA1   pLeuDn_01       putative replication-associated protein
+  repA2   pLeuDn_03       putative replication-associated protein
+  leuA    pLeuDn_04       2-isopropylmalate synthase
+  leuB    pLeuDn_05       3-isopropylmalate dehydrogenase
+You can remove the original gene_info file again after generating the ncbi_symbols file.
+Next to the ncbi_symbols.tab file a frequency file is generated named
+ncbi_exominer_symbols.freq, which contains the frequency of every
+character used in symbol names:
+  p: 1255137
+  L: 1907635
+  e: 1334974
+  u: 465711
+  D: 2110781
+  n: 533637
+  _: 11942258
+and a list of all characters
+   "#%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]_`abcdefghijklmnopqrstuvwxyz{}
+In this list some gene symbols and gene names include dashes and dots
+and other characters. Some gene names even contain spaces - we skip
+these for further processing.
+Later, the millions of NCBI symbols and aliases do not all write to a
+triple-store. Only those symbols get stored that are mined from the
+documents.
+### Adding HUGO symbols and aliases
+To make sure all recent HUGO symbols are added, download the HUGO symbols file
+from EBI and parse that
+```sh
+  wget ftp://ftp.ebi.ac.uk/pub/databases/genenames/reference_genome_set.txt.gz
+  gzip -d reference_genome_set.txt.gz
+  hugo_exominer_symbols reference_genome_set.txt > hugo_symbols.tab
+```
+The hugo_symbols.tab is included with the gem (in test/data/input/hugo_symbols) and will
+always be loaded if you use the --hugo switch without specifying a symbol file. It contains
+entries, synonyms and discriptions, such as
+  ERAP2 L-RAP|LRAP  endoplasmic reticulum aminopeptidase 2
+  ERAS  HRAS2|HRASP ES cell expressed Ras
+  ERBB2 NEU|HER-2|CD340|HER2|NGL  v-erb-b2 avian erythroblastic leukemia viral oncogene homolog 2
+  ERBB2IP ERBIN|LAP2  erbb2 interacting protein
+### Making a text file of your document
+Save HTML/Word/Excel/PDF files in a textual format. Command line
+tools, such as lynx, antiword and pdftotext exist for this purpose. An
+example of a textual version of an online Nature paper can be made with
+  lynx --dump http://www.nature.com/nature/journal/v490/n7418/full/nature11412.html >> tcga_bc.txt
+Warning: do not check this file into any public repository! Nature publishing
+group will not be amused.
+### Using Exominer to mine a text file for symbols
+Pass the symbol file on the command line and pipe in the textual file, e.g.
+  exominer -s ncbi_symbols.tab --hugo hugo_symbols.tab < tcga_bc.txt
+This results in a list of symbols and aliases found in the paper, with
+their tally. For example
+    35      FOXA1   forkhead box A1
+    36      cas     CRISPR associated Cas2 family protein
+    36      AKT1    v-akt murine thymoma viral oncogene homolog 1
+    37      BRCA2   hypothetical protein
+    37      BRAF    v-raf murine sarcoma viral oncogene homolog B1
+    37      BRCA1   breast cancer 1, early onset
+    38      A       replication gene A protein
+    38      AFF2    Ady2-Fun34 like Family, similar to S. cerevisiae FUN34 (YNR002C) and ADY2 (YCR010C); similar to Yarrowia glyoxalate pathway regulator, possible transmembrane acetate facilitator/sensor
+    39      PDGFRA  platelet-derived growth factor receptor, alpha polypeptide
+    39      RAD51C  Rad51 DNA recombinase 3
+    39      MAP3K1  mitogen-activated protein kinase kinase kinase 1, E3 ubiquitin protein ligase
+    41      AKT3    v-akt murine thymoma viral oncogene homolog 3 (protein kinase B, gamma)
+    43      ATM     hypothetical protein
+    90      can     carbonic anhydrase 2 Can
+  Out of a total of 12,774,630 symbols and 3,201,281 aliases scanned
+This is not an authorative list but because it is such a comprehensive
+list of symbols and aliases there should be few false negatives.
+Obviously the last one is a false positive, but these should be easy
+to spot and weed out. The idea is to end up with a list of candidate
+exome targets. So the possible next step (when not using using a
+triple-store) allows for subtracting symbols already in a design (not
+yet implemented/NYI):
+  exominer -s ncbi_symbols.tab --ignore list.tab < tcga_bc.txt
+where list.tab contains a list of symbols to ignore. These symbols
+*with* their aliases are skipped in the text mining step.
+This can be useful when mining a paper at a time. Mulitible papers is better,
+because there will be more evidence on gene names and symbols. Exominer can
+export results to RDF for powerful querying. More on that below.
+Also when you have an existing exome design, is is possible to add
+a prepared exome list and accompanying design to an
+RDF triple store for further exploration.
+## Speeding up text search
+To speed things up you can create a binary version of the symbols
+table with
+  pack_exominer_symbols ncbi_symbols.tab
+and rename that file to
+  mv symbols.bin ncbi_symbols.bin
+Now use the bin file instead with exominer's -s switch.
+## Using exominer with a triple-store
+exominer supports RDF! This means that you can use a triple-store as a
+'back-end' and add results of multiple runs incrementally. For every
+symbol it is possible to track back the publication and even mine
+extra information, such as publication date, journal type, and whether
+a symbol exists in one or more stored designs. We can even link
+aliases to Hugo symbols and link-out
+and fetch gene information, such as the length of the nucleotide
+sequence. Welcome to the world of the semantic web!
+When parsing a publication or other resource we want to refer the
+result set to that. Ideally a DOI is used which can be turned into a
+URI through http://crossref.org/, e.g. doi:10.1038/171737a0 becomes
+http://dx.doi.org/10.1038/171737a0 and can be queried, as explained
+[here](http://inkdroid.org/journal/2011/04/25/dois-as-linked-data/).
+If no URI exists, one can use a URL to a web publication, or even
+simply the file name with the year and some tags for describing
+the target of the publication, such as species or disease type.
+The DOI describing the file:
+  exominer --rdf -s ncbi_symbols.tab --hugo hugo_symbols.tab \
+    --doi doi:10.1038/nature11412 < tcga_bc.txt
+allows for mining title and publication date for every
+symbol found. To add some meta information you could add semi-colon
+separated tags
+  exominer --rdf -s ncbi_symbols.tab --hugo hugo_symbols.tab \
+    --doi doi:10.1038/nature11412 --tag 'species=human;type=breast cancer' < tcga_bc.txt
+which helps mining data later on. If no doi exists, you may just add
+title and year:
+  exominer --rdf -s ncbi_symbols.tab --tag 'title=Comprehensive molecular portraits of human breast tumours' \
+    --tag 'year=2012;species=human;type=breast cancer' < tcga_bc.txt
+multiple tags are also allowed.
+exominer generates RDF which can be added to a triple-store. If you
+want to add a design (old or new) treat it as a publication and use something like
+  exominer --rdf --hugo hugo_symbols.tab --tag 'design=Targeted exome;year=2013;' < design.txt
+These commands create turtle RDF with the --rdf switch. Pipe
+the output into the triple-store with
+  curl -T file.rdf -H 'Content-Type: application/x-turtle' http://localhost:8081/data/exominer.rdf
+The URI can be a little more descriptive, e.g.:
+  curl -T design2012.rdf -H 'Content-Type: application/x-turtle' http://localhost:8081/data/design2012.rdf
+Finally, to support multiple searches and make it easier to
+dereference sources you can supply a unique name to each result set
+with the --name switch. E.g.
+  exominer --rdf --name tcga_bc -s ncbi_symbols.tab --hugo hugo_symbols.tab --doi doi:10.1038/nature11412 --tag 'species=human;type=breast cancer' < tcga_bc.txt
+## Context
+When a gene name gets mined from a text, it is nice to see where it is
+coming from. exominer provides context for this reason by including
+the text around the gene name with every reference. This is also a
+great way to weed out false positives! If the context for a gene named
+SE says: 'Department of Oncology, Lund University, SE-221 85 Lund,
+Sweden' - you may think twice about including it into your design.
+Computers are not always good at automated text mining. The human eye
+can pick these mistakes up quickly, exominer makes use of human
+recognition. The RDF output contains this context by default. To switch
+context off, simply you can either add a CLI switch, or pass in a tag
+saying 'context=false'.
+One extra (interesting) facility for context is the --context=line
+command. This will set the context to the full line in a text file
+(from LF to LF). This can be very useful when parsing tabular
+data (Excel dumps, for example).
+## Vocabularies
+In addition to the standard W3C vocabularies, exominer uses the
+[journal archiving and interchange tag set
+(JAT)](http://jats.nlm.nih.gov/archiving/) for describing
+publications. Another is [Bibliontology](http://bibliontology.com/).
+The British Library vocabulary may be
+[useful](http://www.bl.uk/bibliographic/datasamples.html) too.
+## Using exominer with a triple-store
+If you intend to use exominer with a triple-store you need to install
+one. In principle you can use bio-rdf with any RDF triple store.
+Instructions for installing [4store](http://4store.org/) can be found on
+[bioruby-rdf](https://github.com/pjotrp/bioruby-rdf). You can add
+a new triple-store with
+```sh
+4s-backend-setup exominer
+4s-backend exominer
+4s-httpd -p 8081 exominer
+```
+and check the webserver is running on http://localhost:8081/status/.
+Again, check bioruby-rdf for instructions on installing 4store and
+sparql-query and examples.
+## Mining gene symbols with SPARQL
+### Looking for all database information in the triple-store
+```sparql
+SELECT * WHERE { ?s ?p ?o }
+```
+This can be run with the sparql-query tool
+```
+sparql-query http://localhost:8081/sparql/ 'SELECT * WHERE { ?s ?p ?o } LIMIT 10'
+```
+With a non-HUGO geneid information can be fetched with
+```sparql
+SELECT ?type1, ?label1, count(*)
+WHERE {
+?s1 ?p1 ?o1 .
+?o1 bif:contains "HK1" .
+?s1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type1 .
+?s1 <http://www.w3.org/2000/01/rdf-schema#label> ?label1 .
+}
+ORDER BY DESC (count(*))
+```
+will render a list of gene id's. Follow up with, for example,
+http://bio2rdf.org/geneid:100036759
+## Project home page
+Information on the source tree, documentation, examples, issues and
+how to contribute, see
+  http://github.com/pjotrp/bioruby-exominer
+## TODO
+* Fix doi to make full URI
+## Cite
+If you use this software, please cite one of
+* [BioRuby: bioinformatics software for the Ruby programming language](http://dx.doi.org/10.1093/bioinformatics/btq475)
+* [Biogem: an effective tool-based approach for scaling up open source software development in bioinformatics](http://dx.doi.org/10.1093/bioinformatics/bts080)
+## Biogems.info
+This Biogem is published at (http://biogems.info/index.html#bio-exominer)
+## Copyright
+Copyright (c) 2013,2014 Cuppen Group and Pjotr Prins. See LICENSE.txt for further details.