ruby-ensembl-api 0.9.6
- data/TUTORIAL.rdoc +623 -0
- data/bin/ensembl +40 -0
- data/lib/ensembl.rb +64 -0
- data/lib/ensembl/core/activerecord.rb +1914 -0
- data/lib/ensembl/core/collection.rb +60 -0
- data/lib/ensembl/core/project.rb +264 -0
- data/lib/ensembl/core/slice.rb +693 -0
- data/lib/ensembl/core/transcript.rb +425 -0
- data/lib/ensembl/core/transform.rb +97 -0
- data/lib/ensembl/db_connection.rb +216 -0
- data/lib/ensembl/variation/activerecord.rb +253 -0
- data/lib/ensembl/variation/variation.rb +163 -0
- data/test/unit/data/seq_c6qbl.fa +10 -0
- data/test/unit/data/seq_cso19_coding.fa +16 -0
- data/test/unit/data/seq_cso19_transcript.fa +28 -0
- data/test/unit/data/seq_drd3_gene.fa +838 -0
- data/test/unit/data/seq_drd3_transcript.fa +22 -0
- data/test/unit/data/seq_drd4_transcript.fa +24 -0
- data/test/unit/data/seq_forward_composite.fa +1669 -0
- data/test/unit/data/seq_par_boundary.fa +169 -0
- data/test/unit/data/seq_rnd3_transcript.fa +47 -0
- data/test/unit/data/seq_ub2r1_coding.fa +13 -0
- data/test/unit/data/seq_ub2r1_gene.fa +174 -0
- data/test/unit/data/seq_ub2r1_transcript.fa +26 -0
- data/test/unit/data/seq_y.fa +2 -0
- data/test/unit/ensembl_genomes/test_collection.rb +51 -0
- data/test/unit/ensembl_genomes/test_gene.rb +52 -0
- data/test/unit/ensembl_genomes/test_slice.rb +71 -0
- data/test/unit/ensembl_genomes/test_variation.rb +17 -0
- data/test/unit/release_50/core/test_project.rb +215 -0
- data/test/unit/release_50/core/test_project_human.rb +58 -0
- data/test/unit/release_50/core/test_relationships.rb +66 -0
- data/test/unit/release_50/core/test_sequence.rb +175 -0
- data/test/unit/release_50/core/test_slice.rb +121 -0
- data/test/unit/release_50/core/test_transcript.rb +108 -0
- data/test/unit/release_50/core/test_transform.rb +223 -0
- data/test/unit/release_50/variation/test_activerecord.rb +143 -0
- data/test/unit/release_50/variation/test_variation.rb +84 -0
- data/test/unit/release_53/core/test_gene.rb +66 -0
- data/test/unit/release_53/core/test_project.rb +96 -0
- data/test/unit/release_53/core/test_project_human.rb +65 -0
- data/test/unit/release_53/core/test_slice.rb +47 -0
- data/test/unit/release_53/core/test_transform.rb +63 -0
- data/test/unit/release_53/variation/test_activerecord.rb +145 -0
- data/test/unit/release_53/variation/test_variation.rb +71 -0
- data/test/unit/release_56/core/test_gene.rb +66 -0
- data/test/unit/release_56/core/test_project.rb +96 -0
- data/test/unit/release_56/core/test_slice.rb +54 -0
- data/test/unit/release_56/core/test_transform.rb +63 -0
- data/test/unit/release_56/variation/test_activerecord.rb +142 -0
- data/test/unit/release_56/variation/test_variation.rb +68 -0
- data/test/unit/test_connection.rb +66 -0
- data/test/unit/test_releases.rb +136 -0
- metadata +128 -0
data/TUTORIAL.rdoc
ADDED
@@ -0,0 +1,623 @@
|
|
1
|
+
= Ruby Ensembl Core API tutorial
|
2
|
+
By Jan Aerts. Copy-paste-modified from the excellent perl API tutorial at
|
3
|
+
http://www.ensembl.org/info/software/core/core_tutorial.html (with permission of the core Ensembl team).
|
4
|
+
|
5
|
+
Based on release 50.
|
6
|
+
|
7
|
+
== Introduction
|
8
|
+
This tutorial describes how to use the Ensembl Core Ruby API. It is intended to be an introduction and demonstration of the general API concepts. This tutorial is not comprehensive, but it will hopefully enable the reader to become productive quickly, and facilitate a rapid understanding of the core system. This tutorial assumes at least some familiarity with Ruby.
|
9
|
+
|
10
|
+
The Ruby API provides a level of abstraction over the Ensembl Core databases. To external users the API may be useful to automate the extraction of particular data. As a brief introduction this tutorial focuses primarily on the retrieval of data from the Ensembl Core databases.
|
11
|
+
|
12
|
+
The Ruby API is only one of many ways of accessing the data stored in Ensembl. Additionally there is a genome browser web interface, and the BioMart system. BioMart may be a more appropriate tool for certain types of data mining.
|
13
|
+
|
14
|
+
This API is for read-only querying of the database.
|
15
|
+
|
16
|
+
== Other sources of information
|
17
|
+
The Ensembl Core API has a decent set of code documentation in the form of standard Ruby RDoc. This documentation is mixed in with the actual code, but can be automatically extracted and formatted using some software tools. One version of this documentation is available at the website you're looking at.
|
18
|
+
|
19
|
+
If you have your RUBYLIB environment variable set correctly, you can use the command ri. For example the following command will bring up some documentation about the Slice class and each of its methods:
|
20
|
+
|
21
|
+
ri Ensembl::Core::Slice
|
22
|
+
|
23
|
+
For additional information you can contact Jan Aerts (jan.aerts@sanger.ac.uk) or preferably send an email to the bioruby mailing list (see http://www.bioruby.org).
|
24
|
+
|
25
|
+
== Obtaining and installing the code
|
26
|
+
The Ensembl Ruby API is made available as a gem. See the github site for more information (http://github.com/jandot/ruby-ensembl-api/wikis/home).
|
27
|
+
|
28
|
+
Basically, it comes down to:
|
29
|
+
sudo gem install jandot-ruby-ensembl-api --source http://gems.github.com
|
30
|
+
|
31
|
+
== Code conventions
|
32
|
+
Several naming conventions are used throughout the API. Learning these conventions will aid in your understanding of the code.
|
33
|
+
|
34
|
+
Variable names are underscore-separated all lower-case words.
|
35
|
+
slice_1
|
36
|
+
exon_1
|
37
|
+
gene_a
|
38
|
+
|
39
|
+
Class and package names are CamelCase words that begin with capital letters.
|
40
|
+
|
41
|
+
Ensembl::Core::Gene
|
42
|
+
Ensembl::Core::Exon
|
43
|
+
Ensembl::Core::CoordSystem
|
44
|
+
Ensembl::Core::SeqRegion
|
45
|
+
|
46
|
+
Method names are entirely lower-case, underscore separated words. Methods are called on an object or class by appending a period to that object or class and adding the method name.
|
47
|
+
|
48
|
+
Ensembl::Core::Slice.genes
|
49
|
+
transcript_a.five_prime_utr_seq
|
50
|
+
|
51
|
+
Class methods are responsible for the creation of various objects. Most of this is standard ActiveRecord behaviour and will be discussed below.
|
52
|
+
|
53
|
+
== ActiveRecord
|
54
|
+
|
55
|
+
Most of the API is based on ActiveRecord to get data from the database. In general, each table is described by a class with the same name: the coord_system table is covered by the Ensembl::Core::CoordSystem class, the seq_region table is covered by the Ensembl::Core::SeqRegion class, etc. As a result, accessors are available for all columns in each table. For example, the seq_region table has the following columns: seq_region_id, name, coord_system_id and length. Through ActiveRecord, these column names become available as attributes of Ensembl::Core::SeqRegion objects:
|
56
|
+
|
57
|
+
puts my_seq_region.seq_region_id
|
58
|
+
puts my_seq_region.name
|
59
|
+
puts my_seq_region.coord_system_id
|
60
|
+
puts my_seq_region.length.to_s
|
61
|
+
|
62
|
+
ActiveRecord makes it easy to extract data from those tables using the collection of find methods. There are three types of find methods (e.g. for the Ensembl::Core::CoordSystem class):
|
63
|
+
|
64
|
+
* find based on primary key in table:
|
65
|
+
|
66
|
+
my_coord_system = CoordSystem.find(5)
|
67
|
+
|
68
|
+
* find_by_sql:
|
69
|
+
|
70
|
+
my_coord_system = CoordSystem.find_by_sql("SELECT * FROM coord_system WHERE name = 'chromosome'")
|
71
|
+
|
72
|
+
* find_by_<insert_your_column_name_here>
|
73
|
+
|
74
|
+
my_coord_system1 = CoordSystem.find_by_name('chromosome')
|
75
|
+
my_coord_system2 = CoordSystem.find_by_rank(3)
|
76
|
+
|
77
|
+
To find out which find_by_<column> methods are available, you can list the column names using the column_names class method:
|
78
|
+
|
79
|
+
puts Ensembl::Core::CoordSystem.column_names.join("\t")
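To make the link between column names and finder names concrete, here is a minimal sketch that needs no database connection. The helper finder_names and the hard-coded column list are assumptions for illustration only; the real list is whatever column_names returns for your release.

```ruby
# Build the dynamic finder names that correspond to a list of columns.
# (finder_names is a hypothetical helper, not part of the API.)
def finder_names(column_names)
  column_names.map { |column| "find_by_#{column}" }
end

# Hypothetical column list standing in for CoordSystem.column_names.
columns = %w[coord_system_id name version rank attrib]
puts finder_names(columns).join("\n")
```

Each printed name would then be callable on the class with a single value, as in the find_by_name and find_by_rank examples above.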
|
80
|
+
|
81
|
+
For more information on the find methods, see ar.rubyonrails.org/classes/ActiveRecord/Base.html#M000344
|
82
|
+
|
83
|
+
The relationships between different tables are accessible through the classes as well. For example, to loop over all seq_regions belonging to a coord_system (a coord_system "has many" seq_regions):
|
84
|
+
|
85
|
+
chr_coord_system = CoordSystem.find_by_name('chromosome')
|
86
|
+
chr_coord_system.seq_regions.each do |seq_region|
|
87
|
+
puts seq_region.name
|
88
|
+
end
|
89
|
+
|
90
|
+
Of course, you can go the other way as well (a seq_region "belongs to" a coord_system):
|
91
|
+
|
92
|
+
chr4 = SeqRegion.find_by_name('4')
|
93
|
+
puts chr4.coord_system.name #--> 'chromosome'
|
94
|
+
|
95
|
+
To find out what relationships exist for a given class, you can use the reflect_on_all_associations class method:
|
96
|
+
|
97
|
+
puts SeqRegion.reflect_on_all_associations(:has_many).collect{|a| a.name.to_s}.join("\n")
|
98
|
+
puts SeqRegion.reflect_on_all_associations(:has_one).collect{|a| a.name.to_s}.join("\n")
|
99
|
+
puts SeqRegion.reflect_on_all_associations(:belongs_to).collect{|a| a.name.to_s}.join("\n")
|
100
|
+
|
101
|
+
== Connecting to the Ensembl database and a minimal script
|
102
|
+
|
103
|
+
All data used and created by Ensembl is stored in MySQL relational databases. If you want to access this database the first thing you have to do is to connect to it. This is done behind the scenes using the ActiveRecord module.
|
104
|
+
|
105
|
+
First, we need to tell our computer where it can find the API code. This information is contained in the RUBYLIB environment variable. Suppose you have saved the API in /usr/local/lib/ruby/ensembl-api (with subdirectories lib/, test/, samples/, ...), you could set the environment variable on a bash shell like this:
|
106
|
+
export RUBYLIB=$RUBYLIB:/usr/local/lib/ruby/ensembl-api/lib
|
107
|
+
|
108
|
+
Next, we need to import all Ruby modules that we will be using. Every Ensembl script that you will write will contain a require statement like the following:
|
109
|
+
|
110
|
+
require 'ensembl'
|
111
|
+
|
112
|
+
Alternatively, if you installed the API as a gem, you would write:
|
113
|
+
|
114
|
+
require 'rubygems'
|
115
|
+
require 'ensembl'
|
116
|
+
|
117
|
+
|
118
|
+
Ensembl stores its data in a separate database for each species and each release of that species. The Ruby Ensembl API does a lot automatically, so you only have to know the species name to connect to the release 45 version of its core database. This name should be provided in snake_case (all lowercase connected by underscore):
|
119
|
+
|
120
|
+
Ensembl::Core::CoreDBConnection.connect('homo_sapiens')
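If your species names come from user input or a spreadsheet, they may need converting to snake_case first. The following is a small sketch of such a conversion; the helper name species_to_snake_case is hypothetical and not part of the API.

```ruby
# Convert a free-text species name into snake_case: trim it,
# lower-case it, and join the words with underscores.
# (Hypothetical helper, not provided by the Ensembl API.)
def species_to_snake_case(name)
  name.strip.downcase.gsub(/\s+/, '_')
end

puts species_to_snake_case('Homo sapiens')
puts species_to_snake_case(' Bos  taurus ')
```

The result can then be passed as the species argument of the connect call.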
|
121
|
+
|
122
|
+
With the connection established, you'll be able to get objects from the database, e.g.
|
123
|
+
|
124
|
+
chromosome_4 = Ensembl::Core::SeqRegion.find_by_name('4')
|
125
|
+
|
126
|
+
You have to prefix every class name with 'Ensembl::Core::'. However, if you include the line
|
127
|
+
|
128
|
+
include Ensembl::Core
|
129
|
+
|
130
|
+
just after you "require 'ensembl'", you don't have to anymore. The rest of this tutorial expects you to have done the include command. So a very short but complete ruby script could look like this:
|
131
|
+
|
132
|
+
require 'ensembl'
|
133
|
+
include Ensembl::Core
|
134
|
+
CoreDBConnection.connect('homo_sapiens')
|
135
|
+
chromosome_4 = SeqRegion.find_by_name('4')
|
136
|
+
puts chromosome_4.name
|
137
|
+
|
138
|
+
== Slices
|
139
|
+
|
140
|
+
A Slice object represents a single continuous region of a genome. Slices can be used to obtain sequence, features or other information from a particular region of interest. There are several ways to obtain a slice, but we will start with the Ensembl::Core::Slice.fetch_by_region method which is the most commonly used. This class method takes numerous arguments but most of them are optional. In order, the arguments are: coord_system_name, seq_region_name, start, end, strand, coord_system_version. The following are several examples of how to use the Ensembl::Core::Slice.fetch_by_region method:
|
141
|
+
|
142
|
+
* Obtain a slice covering the entire chromosome X
|
143
|
+
|
144
|
+
slice = Slice.fetch_by_region('chromosome', 'X')
|
145
|
+
|
146
|
+
* Obtain a slice covering the entire clone AL359765.6
|
147
|
+
|
148
|
+
slice = Slice.fetch_by_region('clone', 'AL359765.6')
|
149
|
+
|
150
|
+
* Obtain a slice covering an entire NT contig
|
151
|
+
|
152
|
+
slice = Slice.fetch_by_region('supercontig', 'NT_011333')
|
153
|
+
|
154
|
+
* Obtain a slice covering the region from 1MB to 2MB (inclusively) of chromosome 20
|
155
|
+
|
156
|
+
slice = Slice.fetch_by_region('chromosome', '20', 1000000, 2000000)
|
157
|
+
|
158
|
+
Another useful way to obtain a slice is with respect to a gene, e.g. with 5kb flanking sequence:
|
159
|
+
|
160
|
+
slice = Slice.fetch_by_gene_stable_id('ENSG00000099889', 5000)
|
161
|
+
|
162
|
+
This will return a slice that contains the sequence of the gene specified by its stable Ensembl ID. It also returns 5000bp of flanking sequence at both the 5' and 3' ends, which is useful if you are interested in the environs that a gene inhabits. You needn't have the flanking sequence if you don't want it: in this case set the number of flanking bases to zero or simply omit the second argument entirely. Note that the fetch_by_gene_stable_id() method always returns a slice on the forward strand even if the gene is on the reverse strand.
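The arithmetic behind the flanking argument can be sketched as follows. This is only an illustration of the description above, not the library's implementation: the helper flanked_region is hypothetical, the gene coordinates are made up, and no clamping at the ends of the sequence region is attempted.

```ruby
# Coordinates of a region around a gene with the same amount of
# flanking sequence on both sides. The strand is always reported
# as 1 (forward), mirroring the behaviour described in the text.
# (Hypothetical helper; gene coordinates below are invented.)
def flanked_region(gene_start, gene_stop, flanking = 0)
  [gene_start - flanking, gene_stop + flanking, 1]
end

start, stop, strand = flanked_region(20_000, 30_000, 5_000)
puts "#{start}-#{stop} (#{strand})"
```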
|
163
|
+
|
164
|
+
To retrieve a set of slices from a particular coordinate system the fetch_all method can be used:
|
165
|
+
|
166
|
+
* Retrieve slices of every chromosome in the database
|
167
|
+
|
168
|
+
slices = Slice.fetch_all('chromosome')
|
169
|
+
|
170
|
+
* Retrieve slices of every BAC clone in the database
|
171
|
+
|
172
|
+
slices = Slice.fetch_all('clone')
|
173
|
+
|
174
|
+
For certain types of analysis it is necessary to break up regions into smaller manageable pieces. The method Slice#split can be used to break up larger slices into smaller component slices. The following code creates an array of subslices of chromosome 1, with the (maximal) length of each slice 100000 bp and an overlap of 250 bp.
|
175
|
+
|
176
|
+
big_slice = Slice.fetch_by_region('chromosome', '1')
|
177
|
+
subslices = big_slice.split(100000, 250)
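To see what splitting with an overlap means for the coordinates, here is a self-contained sketch that works on plain numbers. It is one possible way to compute the windows and should not be read as the actual Slice#split implementation; the helper split_coordinates is hypothetical.

```ruby
# Split the inclusive range start..stop into windows of at most
# max_size positions; consecutive windows share `overlap` positions.
# (Hypothetical helper, written for illustration only.)
def split_coordinates(start, stop, max_size, overlap = 0)
  raise ArgumentError, 'overlap must be smaller than max_size' if overlap >= max_size

  windows = []
  window_start = start
  loop do
    # The last window may be shorter than max_size.
    window_stop = [window_start + max_size - 1, stop].min
    windows << [window_start, window_stop]
    break if window_stop == stop

    # Step back by the overlap before starting the next window.
    window_start = window_stop - overlap + 1
  end
  windows
end

split_coordinates(1, 25, 10, 2).each { |s, e| puts "#{s}-#{e}" }
```

With a maximal length of 10 and an overlap of 2, the windows are 1-10, 9-18 and 17-25: each window starts two positions before the previous one ends.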
|
178
|
+
|
179
|
+
To obtain sequence from a slice the Slice#seq method can be used:
|
180
|
+
|
181
|
+
seq = slice.seq
|
182
|
+
puts seq
|
183
|
+
|
184
|
+
We can query the slice for information about itself:
|
185
|
+
|
186
|
+
seq_region = slice.seq_region.name
|
187
|
+
coord_system = slice.seq_region.coord_system.name
|
188
|
+
start = slice.start
|
189
|
+
stop = slice.stop
|
190
|
+
strand = slice.strand
|
191
|
+
|
192
|
+
puts "Slice: #{coord_system} #{seq_region} #{start}-#{stop} (#{strand})"
|
193
|
+
|
194
|
+
Many classes can provide a set of features which overlap a slice. The slice itself also provides a means to obtain features which overlap its region. To obtain a list of genes which overlap a slice:
|
195
|
+
|
196
|
+
slice_a = Slice.fetch_by_region('chromosome','X')
|
197
|
+
genes = slice_a.genes
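What "overlap" means in terms of coordinates can be sketched without a database. The RegionStub struct and the overlaps? helper below are hypothetical stand-ins; they only carry a name and inclusive start/stop positions.

```ruby
# Minimal stand-in for anything with a location (hypothetical).
RegionStub = Struct.new(:name, :start, :stop)

# Two inclusive ranges overlap when each one starts
# no later than the other one ends.
def overlaps?(a, b)
  a.start <= b.stop && b.start <= a.stop
end

region = RegionStub.new('region', 1_000, 2_000)
features = [
  RegionStub.new('gene_a', 500, 1_200),   # overlaps the left edge
  RegionStub.new('gene_b', 1_500, 1_600), # lies fully inside
  RegionStub.new('gene_c', 2_001, 3_000)  # starts just after the region
]

features.select { |f| overlaps?(f, region) }.each { |f| puts f.name }
```

Note that a partial overlap is enough: gene_a is reported even though it starts before the region.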
|
198
|
+
|
199
|
+
*CAUTION*: The slice concept is a little bit different from that in the perl API. If you ask a gene for its slice using the perl API, you get a slice covering the _whole_ of the chromosome. In contrast, the slice created by the ruby API only contains that bit covered by the gene. The Ensembl::Core::SeqRegion class is used to refer to whole things. I just found it much more intuitive like that...
|
200
|
+
|
201
|
+
== Features
|
202
|
+
|
203
|
+
Features are objects in the database which have a defined location on the genome. All features in Ensembl include the Ensembl::Core::Sliceable mixin and have the following location defining attributes: start, stop, strand, slice.
|
204
|
+
|
205
|
+
All feature objects can be retrieved using the find method of their class or any of the generic find_by_ methods (see the ActiveRecord bit of this tutorial). The following example illustrates how Transcript features and DnaAlignFeature features can be obtained from the database. All features in the database can be retrieved in similar ways through their own classes.
|
206
|
+
|
207
|
+
* Get a slice of chromosome 20, 10MB-11MB
|
208
|
+
|
209
|
+
slice = Slice.fetch_by_region('chromosome', '20', 10000000, 11000000 )
|
210
|
+
|
211
|
+
* Fetch all of the transcripts overlapping chromosome 20, 10MB-11MB
|
212
|
+
|
213
|
+
transcripts = slice.transcripts
|
214
|
+
transcripts.each do |transcript|
|
215
|
+
name = transcript.stable_id
|
216
|
+
internal_id = transcript.id
|
217
|
+
start = transcript.start
|
218
|
+
stop = transcript.stop
|
219
|
+
strand = transcript.strand
|
220
|
+
|
221
|
+
puts "Transcript #{name} [#{internal_id}] #{start}-#{stop} (#{strand})"
|
222
|
+
end
|
223
|
+
|
224
|
+
* Fetch all of the DNA-DNA alignments overlapping chromosome 20, 10MB-11MB
|
225
|
+
|
226
|
+
dafs = slice.dna_align_features
|
227
|
+
dafs.each do |daf|
|
228
|
+
name = daf.hit_name
|
229
|
+
internal_id = daf.id
|
230
|
+
start = daf.start
|
231
|
+
stop = daf.stop
|
232
|
+
strand = daf.strand
|
233
|
+
|
234
|
+
puts "DNA alignment #{name} [#{internal_id}] #{start}-#{stop} (#{strand})"
|
235
|
+
end
|
236
|
+
|
237
|
+
* Fetch a transcript by its internal identifier
|
238
|
+
|
239
|
+
transcript = Transcript.find(100)
|
240
|
+
|
241
|
+
* Fetch a DnaAlignFeature by its internal identifier
|
242
|
+
|
243
|
+
daf = DnaAlignFeature.find(100)
|
244
|
+
|
245
|
+
All features also have the transform method, which is described in detail in a later section of this tutorial.
|
246
|
+
|
247
|
+
=== Features across coordinate systems
|
248
|
+
|
249
|
+
In the Ensembl database, some features might be related to one coordinate system, while other features are related to another one (for more information on coordinate systems, see below). For example, there are three coordinate systems in cow: contigs, scaffolds and chromosomes. Scaffold Chr4.003.122 does not have any simple_features on it. However, the equivalent regions in the contig and chromosome coordinate systems have 37 and 85 simple_features, respectively (122 in total). If you therefore ask that scaffold to list its simple_features, you wouldn't get any. A workaround for this is to first create a slice for this scaffold, and ask that _slice_ for its simple_features.
|
250
|
+
|
251
|
+
scaffold = SeqRegion.find_by_name('Chr4.003.122')
|
252
|
+
puts scaffold.simple_features.length #--> 0
|
253
|
+
slice = Slice.fetch_by_region('scaffold','Chr4.003.122')
|
254
|
+
puts slice.simple_features.length #--> 122
|
255
|
+
|
256
|
+
or even:
|
257
|
+
puts scaffold.slice.simple_features.length #--> 122
|
258
|
+
|
259
|
+
The reason this works is that any retrieval for a slice also checks what coordinate systems that type of feature is annotated on.
|
260
|
+
|
261
|
+
== Genes, Transcripts, and Exons
|
262
|
+
|
263
|
+
Genes, exons and transcripts are also features and can be treated in the same way as any other feature within Ensembl. A transcript in Ensembl is a grouping of exons. A gene in Ensembl is a grouping of transcripts which share any overlapping (or partially overlapping) exons. Transcripts also have an associated Translation object which defines the UTR and CDS composition of the transcript. Introns are not defined explicitly in the database but can be obtained by the Ensembl::Core::Transcript#introns method (not implemented yet).
|
264
|
+
|
265
|
+
Important: like all Ensembl features the start of an exon is always less than or equal to the end of the exon, regardless of the strand it is on. The start of the transcript is the start of the first exon of a transcript on the forward strand or the start of the last exon of a transcript on the reverse strand. The start and end of a gene are defined to be the lowest start value of its transcripts and the highest end value respectively.
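The rule for gene boundaries can be written down in a few lines. The sketch below uses a hypothetical TranscriptLocation struct with made-up coordinates rather than real Transcript objects.

```ruby
# Stand-in for a transcript; only its location matters here (hypothetical).
TranscriptLocation = Struct.new(:start, :stop)

# A gene starts at the lowest transcript start and
# ends at the highest transcript stop.
def gene_bounds(transcripts)
  [transcripts.map(&:start).min, transcripts.map(&:stop).max]
end

transcripts = [
  TranscriptLocation.new(1_200, 5_000),
  TranscriptLocation.new(1_000, 4_500),
  TranscriptLocation.new(1_800, 5_600)
]

gene_start, gene_stop = gene_bounds(transcripts)
puts "gene: #{gene_start}-#{gene_stop}"
```

Because start is never greater than stop for any feature, the same calculation holds for genes on either strand.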
|
266
|
+
|
267
|
+
Genes, translations, transcripts and exons all have stable identifiers. These are identifiers that are assigned to Ensembl's predictions, and maintained in subsequent releases. For example, if a transcript (or a sufficiently similar transcript) is re-predicted in a future release then it will be assigned the same stable identifier as its predecessor.
|
268
|
+
|
269
|
+
The following is an example of the retrieval of a set of genes, transcripts and exons:
|
270
|
+
|
271
|
+
slice = Slice.fetch_by_region('chromosome','X',1000000,10000000)
|
272
|
+
puts slice.display_name
|
273
|
+
|
274
|
+
slice.genes.each do |gene|
|
275
|
+
puts "\t" + gene.stable_id
|
276
|
+
|
277
|
+
gene.transcripts.each do |transcript|
|
278
|
+
puts "\t\t" + transcript.stable_id
|
279
|
+
|
280
|
+
transcript.exons.each do |exon|
|
281
|
+
puts "\t\t\t" + exon.id.to_s
|
282
|
+
end
|
283
|
+
end
|
284
|
+
end
|
285
|
+
|
286
|
+
In addition to the methods which are present on every feature, the transcript class has many other methods which are commonly used. Several methods can be used to obtain transcript related sequences. At the time of writing this tutorial, these methods return strings rather than bioruby Bio::Sequence objects. The following example demonstrates the use of some of these methods:
|
287
|
+
|
288
|
+
* The Ensembl::Core::Transcript#seq method returns the concatenation of the exon sequences. This is the cDNA of the transcript:
|
289
|
+
|
290
|
+
puts "cDNA: " + transcript.seq
|
291
|
+
|
292
|
+
* The Ensembl::Core::Transcript#cds_seq method returns only the CDS of the transcript
|
293
|
+
|
294
|
+
puts "CDS: " + transcript.cds_seq
|
295
|
+
|
296
|
+
* UTR sequences are obtained via the five_prime_utr_seq and three_prime_utr_seq methods
|
297
|
+
|
298
|
+
fiv_utr = transcript.five_prime_utr_seq
|
299
|
+
thr_utr = transcript.three_prime_utr_seq
|
300
|
+
|
301
|
+
puts "5' UTR: " + ( fiv_utr.nil? ? 'None' : fiv_utr )
|
302
|
+
puts "3' UTR: " + ( thr_utr.nil? ? 'None' : thr_utr )
|
303
|
+
|
304
|
+
* The peptide sequence is obtained from the Ensembl::Core::Transcript#protein_seq method. If the transcript is non-coding, nil is returned.
|
305
|
+
|
306
|
+
peptide = transcript.protein_seq
|
307
|
+
|
308
|
+
puts "Translation: " + ( peptide.nil? ? 'None' : peptide )
|
309
|
+
|
310
|
+
== Translations and ProteinFeatures
|
311
|
+
|
312
|
+
Translation objects and peptide sequence can be extracted from a Transcript object. It is important to remember that some Ensembl transcripts are non-coding (pseudo-genes, ncRNAs, etc.) and have no translation. The primary purpose of a Translation object is to define the CDS and UTRs of its associated Transcript object. Peptide sequence is obtained directly from a Transcript object, not from a Translation object as might be expected. The following example obtains the stable identifier of an Ensembl::Core::Transcript and of its Ensembl::Core::Translation:
|
313
|
+
|
314
|
+
stable_id = 'ENST00000044768'
|
315
|
+
|
316
|
+
transcript = Transcript.find_by_stable_id(stable_id)
|
317
|
+
|
318
|
+
puts transcript.stable_id
|
319
|
+
puts transcript.translation.stable_id
|
320
|
+
|
321
|
+
--
|
322
|
+
NOTE TO SELF: the following bit is not implemented yet...
|
323
|
+
|
324
|
+
ProteinFeatures are features which are on an amino acid sequence rather than a nucleotide sequence. The method get_all_ProteinFeatures() can be used to obtain a set of protein features from a Translation object.
|
325
|
+
|
326
|
+
$translation = $transcript->translation();
|
327
|
+
|
328
|
+
my @pfeatures = @{ $translation->get_all_ProteinFeatures() };
|
329
|
+
while ( my $pfeature = shift @pfeatures ) {
|
330
|
+
my $logic_name = $pfeature->analysis()->logic_name();
|
331
|
+
|
332
|
+
printf(
|
333
|
+
"%d-%d %s %s %s\n",
|
334
|
+
$pfeature->start(), $pfeature->end(), $logic_name,
|
335
|
+
$pfeature->interpro_ac(),
|
336
|
+
$pfeature->idesc()
|
337
|
+
);
|
338
|
+
}
|
339
|
+
|
340
|
+
If only the protein features created by a particular analysis are desired the name of the analysis can be provided as an argument. To obtain the subset of features which are considered to be 'domain' features the convenience method get_all_DomainFeatures() can be used:
|
341
|
+
|
342
|
+
my $seg_features = $translation->get_all_ProteinFeatures('Seg');
|
343
|
+
my $domain_features = $translation->get_all_DomainFeatures();
|
344
|
+
++
|
345
|
+
|
346
|
+
== PredictionTranscripts
|
347
|
+
|
348
|
+
PredictionTranscripts are the results of ab initio gene finding programs that are stored in Ensembl. Example programs include Genscan and SNAP. Prediction transcripts have the same interface as normal transcripts and thus they can be used in the same way.
|
349
|
+
|
350
|
+
prediction_transcripts = slice.prediction_transcripts
|
351
|
+
prediction_transcripts.each do |pt|
|
352
|
+
exons = pt.prediction_exons
|
353
|
+
type = pt.analysis.logic_name
|
354
|
+
|
355
|
+
puts "#{type} prediction has #{exons.length.to_s} exons"
|
356
|
+
|
357
|
+
exons.each do |exon|
|
358
|
+
puts exon.to_yaml
|
359
|
+
end
|
360
|
+
end
|
361
|
+
|
362
|
+
== Alignment Features
|
363
|
+
|
364
|
+
Two types of alignments are stored in the core Ensembl database: alignments of DNA sequence to the genome and alignments of peptide sequence to the genome. These can be retrieved as Ensembl::Core::DnaAlignFeatures and Ensembl::Core::ProteinAlignFeatures respectively. A single gapped alignment is represented by a single feature with a cigar line. A cigar line is a compact representation of a gapped alignment as a single string containing the letters M (match), D (deletion) and I (insertion), each prefixed by an integer length (the number may be omitted if it is 1).
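To illustrate how a cigar line is read, here is a small parser written for this tutorial. It is a sketch only: the helper parse_cigar is hypothetical and handles just the M, D and I letters described above.

```ruby
# Parse a cigar line into [length, operation] pairs.
# A missing length is read as 1. (Hypothetical helper.)
def parse_cigar(cigar)
  cigar.scan(/(\d*)([MDI])/).map do |count, operation|
    [count.empty? ? 1 : count.to_i, operation]
  end
end

parse_cigar('3MD2I').each do |length, operation|
  puts "#{operation}: #{length}"
end
```

The cigar line '3MD2I' therefore reads as three matches, one deletion and two insertions.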
|
365
|
+
--
|
366
|
+
NOTE TO SELF: not implemented yet
|
367
|
+
A gapped alignment feature can be broken into its component ungapped alignments by the method ungapped_features() which returns a list of FeaturePair objects.
|
368
|
+
++
|
369
|
+
The following example shows the retrieval of some alignment features.
|
370
|
+
|
371
|
+
* Retrieve dna-dna alignment features from the slice region
|
372
|
+
|
373
|
+
features = slice.dna_align_features('Vertrna')
|
374
|
+
features.each do |f|
|
375
|
+
puts f.to_yaml
|
376
|
+
end
|
377
|
+
|
378
|
+
* Retrieve protein-dna alignment features from the slice region
|
379
|
+
|
380
|
+
features = slice.protein_align_features('Swall')
|
381
|
+
features.each do |f|
|
382
|
+
puts f.to_yaml
|
383
|
+
end
|
384
|
+
|
385
|
+
== Repeats
|
386
|
+
|
387
|
+
Repetitive regions found by RepeatMasker and TRF (Tandem Repeat Finder) are represented in the Ensembl database as RepeatFeatures. Low-complexity regions are found by the program Dust and are also stored as RepeatFeatures. RepeatFeatures can be retrieved and used in the same way as other Ensembl features.
|
388
|
+
|
389
|
+
repeats = slice.repeats
|
390
|
+
repeats.each do |r|
|
391
|
+
puts r.display_id + "\t" + r.start.to_s + "\t" + r.stop.to_s
|
392
|
+
end
|
393
|
+
|
394
|
+
--
|
395
|
+
NOTE TO SELF: not implemented yet
|
396
|
+
RepeatFeatures are used to perform repeat masking of the genomic sequence. Hard or soft-masked genomic sequence can be retrieved from Slice objects using the Slice#repeatmasked_seq method. Hard-masking replaces sequence in repeat regions with Ns. Soft-masking replaces sequence in repeat regions with lower-case sequence.
|
397
|
+
|
398
|
+
unmasked_seq = slice.seq
|
399
|
+
hardmasked_seq = slice.repeatmasked_seq
|
400
|
+
softmasked_seq = slice.repeatmasked_seq(nil, 1)
|
401
|
+
|
402
|
+
* Soft-mask sequence using TRF results only
|
403
|
+
|
404
|
+
tandem_masked_seq = slice.repeatmasked_seq(['TRF'], 1)
|
405
|
+
++
|
406
|
+
|
407
|
+
== Markers
|
408
|
+
|
409
|
+
Markers are imported into the Ensembl database from UniSTS and several other sources. A marker in Ensembl consists of a pair of primer sequences, an expected product size and a set of associated identifiers known as synonyms. Markers are placed on the genome electronically using an analysis program such as ePCR and their genomic positions are retrievable as MarkerFeatures. Map locations (genetic, radiation hybrid and in situ hybridization) for markers obtained from actual experimental evidence are also accessible.
|
410
|
+
|
411
|
+
Markers can be fetched by their name. The Marker#find_all_by_name returns an array, and Marker#find_by_name returns the first element of that array, i.e. a marker object.
|
412
|
+
|
413
|
+
marker = Marker.find_by_name('D9S1038E')
|
414
|
+
|
415
|
+
* Display the various names associated with the same marker
|
416
|
+
|
417
|
+
marker.marker_synonyms.each do |ms|
|
418
|
+
if ms.source.nil?
|
419
|
+
puts ms.name
|
420
|
+
else
|
421
|
+
puts ms.source + ':' + ms.name
|
422
|
+
end
|
423
|
+
end
|
424
|
+
|
425
|
+
* Display the primer info
|
426
|
+
|
427
|
+
puts "left primer: " + marker.left_primer.to_s
|
428
|
+
puts "right primer: " + marker.right_primer.to_s
|
429
|
+
puts "product size: " + marker.min_primer_dist.to_s + '-' + marker.max_primer_dist.to_s
|
430
|
+
|
431
|
+
* Display genetic/RH/FISH map information
|
432
|
+
|
433
|
+
puts "Map locations:"
|
434
|
+
marker.marker_map_locations.each do |mapping|
|
435
|
+
puts mapping.map.map_name + "\t" + mapping.chromosome_name + "\t" + mapping.position.to_s
|
436
|
+
end
|
437
|
+
|
438
|
+
MarkerFeatures, which represent genomic positions of markers, can be retrieved and manipulated in the same way as other Ensembl features.
|
439
|
+
|
440
|
+
* Obtain the positions for an already retrieved marker
|
441
|
+
|
442
|
+
marker.marker_features.each do |mf|
|
443
|
+
puts mf.slice.display_name
|
444
|
+
end
|
445
|
+
|
446
|
+
* Retrieve all marker features in a given region
|
447
|
+
|
448
|
+
marker_features = slice.marker_features
|
449
|
+
marker_features.each do |mf|
|
450
|
+
puts mf.slice.display_name
|
451
|
+
end
|
452
|
+
|
453
|
+
== MiscFeatures
|
454
|
+
|
455
|
+
MiscFeatures are features with arbitrary attributes which are placed into arbitrary groupings. MiscFeatures can be retrieved as any other feature and are classified into distinct sets by a set code. Generally it only makes sense to retrieve all features which have a particular set code because very diverse types of MiscFeatures are stored in the database.
|
456
|
+
|
457
|
+
MiscFeature attributes are represented by attribute objects and can be retrieved via the misc_attribs method.
|
458
|
+
|
459
|
+
The following example retrieves all MiscFeatures representing ENCODE regions on a given slice and prints out their attributes:
|
460
|
+
|
461
|
+
encode_regions = slice.misc_features('encode')
|
462
|
+
encode_regions.each do |er|
|
463
|
+
attributes = er.misc_attribs
|
464
|
+
attributes.each do |a|
|
465
|
+
puts a.to_s
|
466
|
+
end
|
467
|
+
end
|
468
|
+
|
469
|
+
This example retrieves all misc features representing a BAC clone via its name and prints out their location and other information:
|
470
|
+
|
471
|
+
clones = MiscFeature.find_all_by_attribute_type_value('name', 'RP11-62N12')
|
472
|
+
clones.each do |clone|
|
473
|
+
slice = clone.slice
|
474
|
+
puts slice.to_yaml
|
475
|
+
|
476
|
+
attributes = clone.misc_attribs
|
477
|
+
attributes.each do |a|
|
478
|
+
puts a.to_s
|
479
|
+
end
|
480
|
+
end
|
481
|
+
|
482
|
+
== External References
|
483
|
+
|
484
|
+
Ensembl cross references its genes, transcripts and translations with identifiers from other databases. A cross reference is represented by an Xref object. The following code snippet retrieves and prints Xrefs for a gene, its transcripts and its translations:
|
485
|
+
|
486
|
+
* Get the 'COG6' gene from human
|
487
|
+
|
488
|
+
cog6 = Gene.find_by_name('COG6')
|
489
|
+
puts 'GENE: ' + cog6.stable_id + " (internal id: " + cog6.id.to_s + ")"
|
490
|
+
|
491
|
+
cog6.xrefs.each do |x|
|
492
|
+
puts x.to_s
|
493
|
+
end
|
494
|
+
|
495
|
+
cog6.transcripts.each do |t|
|
496
|
+
puts 'TRANSCRIPT: ' + t.stable_id
|
497
|
+
t.xrefs.each do |x|
|
498
|
+
puts "\s\s" + x.to_s
|
499
|
+
end
|
500
|
+
|
501
|
+
# Watch out: pseudogenes have no translation
|
502
|
+
if ! t.translation.nil?
|
503
|
+
translation = t.translation
|
504
|
+
puts "\tTRANSLATION: " + translation.stable_id
|
505
|
+
translation.xrefs.each do |x|
|
506
|
+
puts "\t\s\s" + x.to_s
|
507
|
+
end
|
508
|
+
end
|
509
|
+
end
|
510
|
+
|
511
|
+
Often it is useful to obtain all of the Xrefs associated with a gene and its associated transcripts and translations, as in the above example. As a shortcut to calling #xrefs on each of these objects, the Gene#all_xrefs method can be used instead. The above example could be shortened by using the following:
|
512
|
+
|
513
|
+
cog6.all_xrefs.each do |x|
|
514
|
+
puts x.to_s
|
515
|
+
end
|
516
|
+
|
517
|
+
This returns all xrefs for the gene itself, including those for all transcripts and translations.
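Conceptually, collecting every xref amounts to concatenating the gene's own xrefs with those of each transcript and, where present, its translation. The following is a minimal plain-Ruby sketch of that idea using stand-in structs. DemoGene, DemoTranscript, DemoTranslation and collect_all_xrefs are hypothetical names introduced for illustration, and this is not necessarily how Gene#all_xrefs is implemented:

```ruby
# Stand-in objects for illustration only; these are NOT the Ensembl classes.
DemoTranslation = Struct.new(:xrefs)
DemoTranscript  = Struct.new(:xrefs, :translation)
DemoGene        = Struct.new(:xrefs, :transcripts)

# Gather the xrefs of a gene, its transcripts and their translations
# into one flat array (hypothetical helper).
def collect_all_xrefs(gene)
  all = gene.xrefs.dup
  gene.transcripts.each do |t|
    all.concat(t.xrefs)
    # Transcripts of pseudogenes have no translation.
    all.concat(t.translation.xrefs) unless t.translation.nil?
  end
  all
end

gene = DemoGene.new(
  ['gene_xref'],
  [
    DemoTranscript.new(['transcript_xref_1'], DemoTranslation.new(['translation_xref_1'])),
    DemoTranscript.new(['transcript_xref_2'], nil)
  ]
)
collect_all_xrefs(gene).each { |x| puts x }
```

The nil check mirrors the one in the longer example: a transcript without a translation simply contributes no translation xrefs.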
|
518
|
+
|
519
|
+
== Coordinates
|
520
|
+
|
521
|
+
We have already discussed the fact that slices and features have coordinates, but we have not defined exactly what these coordinates mean.
|
522
|
+
|
523
|
+
Ensembl, and many other bioinformatics applications, use inclusive coordinates which start at 1. The first nucleotide of a DNA sequence is 1 and the first amino acid of a peptide sequence is also 1. The length of a sequence is defined as end - start + 1.
|
524
|
+
|
525
|
+
In some rare cases inserts are specified with a start which is one greater than the end. For example a feature with a start of 10 and an end of 9 would be a zero length feature between base pairs 9 and 10.
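These conventions can be checked with a few lines of plain Ruby. The sketch below does not use the Ensembl API; feature_length and subseq are hypothetical helper names introduced only for illustration:

```ruby
# Hypothetical helpers illustrating 1-based, inclusive coordinates.
# They are not part of the Ensembl API.

# Length of a feature: end - start + 1.
# An insert with start == end + 1 therefore has length 0.
def feature_length(start, stop)
  stop - start + 1
end

# Extract the bases start..stop (1-based, inclusive) from a Ruby string,
# which is itself indexed from 0.
def subseq(sequence, start, stop)
  return '' if feature_length(start, stop) <= 0
  sequence[(start - 1)..(stop - 1)]
end

dna = 'ACGTACGTAC'
puts feature_length(1, 10)   # 10: the whole sequence
puts subseq(dna, 1, 4)       # ACGT: the first four bases
puts feature_length(10, 9)   # 0: an insert between bases 9 and 10
```

Note the shift by one when indexing the Ruby string: Ensembl position 1 corresponds to string index 0.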
|
526
|
+
|
527
|
+
Slice coordinates are relative to the start of the underlying DNA sequence region (an Ensembl::Core::SeqRegion object). The strand of the slice represents its orientation relative to the default orientation of the sequence region. By convention the start of the slice is always less than the end, and does not vary with its strandedness. Most slices you will encounter will have a strand of 1, and this is what we will consider in our examples. It is legal to create a slice which extends past the boundaries of a sequence region.
|
528
|
+
|
529
|
+
== Coordinate Systems, Sequence Regions and Slices
|
530
|
+
|
531
|
+
Sequences stored in Ensembl are associated with coordinate systems. Which coordinate systems exist varies from species to species. For example, the homo_sapiens database has the following coordinate systems: contig, clone, supercontig, chromosome. Sequences and features may be retrieved from any coordinate system despite the fact that they are only stored internally in a single coordinate system. The database stores the relationship between these coordinate systems and the API provides means to convert between them. The API has an Ensembl::Core::CoordSystem object and object adaptor; however, these are most often used internally. The following example fetches a chromosome coordinate system object from the database:
|
532
|
+
|
533
|
+
chr_coord_system = CoordSystem.find_by_name('chromosome')
|
534
|
+
puts "Coordinate system: " + chr_coord_system.name + ":" + chr_coord_system.version
|
535
|
+
|
536
|
+
A coordinate system is uniquely defined by its name and version. Most coordinate systems do not have a version, and the ones that do have a default version, so it is usually sufficient to use only the name when requesting a coordinate system. For example, chromosome coordinate systems have a version which is the assembly that defined the construction of the coordinate system. The version of the human chromosome coordinate system might be something like NCBI35 or NCBI36, depending on the version of the Core databases used.
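Because most coordinate systems have no version, code that prints a name and version together should allow for a missing version. The following plain-Ruby sketch shows one way to do that. CoordSystemStub and coord_system_label are hypothetical names used only for illustration; they stand in for an object that responds to #name and #version:

```ruby
# Stand-in for an object with #name and #version; not the Ensembl class.
CoordSystemStub = Struct.new(:name, :version)

# Build a "name:version" label, leaving out the version when it is nil
# (hypothetical helper).
def coord_system_label(coord_system)
  [coord_system.name, coord_system.version].compact.join(':')
end

puts coord_system_label(CoordSystemStub.new('chromosome', 'NCBI36'))  # chromosome:NCBI36
puts coord_system_label(CoordSystemStub.new('contig', nil))           # contig
```

Using Array#compact avoids the TypeError that String#+ raises when it is handed nil.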
|
537
|
+
|
538
|
+
Ensembl::Core::SeqRegion objects have an associated Ensembl::Core::CoordSystem object and a #name method; together, the coordinate system and the name uniquely identify a sequence region. You may have noticed that the coordinate system of the sequence region was specified when obtaining a slice in the #fetch_by_region method. Similarly the version may also be specified (though it can almost always be omitted):
|
539
|
+
|
540
|
+
slice = Slice.fetch_by_region('chromosome', 'X', 1000000, 10000000, 'NCBI36')
|
541
|
+
|
542
|
+
To obtain all sequence regions for a given coordinate system, just call the Ensembl::Core::CoordSystem#seq_regions method.
|
543
|
+
|
544
|
+
coord_system = CoordSystem.find_by_name('chromosome')
|
545
|
+
chromosomes = coord_system.seq_regions
|
546
|
+
chromosomes.each do |chr|
|
547
|
+
puts chr.name
|
548
|
+
end
|
549
|
+
|
550
|
+
Sometimes it is useful to obtain full slices of every sequence region in a given coordinate system; this may be done using the Slice#fetch_all method:
|
551
|
+
|
552
|
+
chromosomes = Slice.fetch_all('chromosome')
|
553
|
+
clones = Slice.fetch_all('clone')
|
554
|
+
|
555
|
+
Now suppose that you wish to write code which is independent of the species used. Not all species have the same coordinate systems; the available coordinate systems depend on the style of assembly used for that species (WGS, clone-based, etc.). You can obtain the list of available coordinate systems for a species using the Ensembl::Core::CoordSystem#find(:all) method. There is also a special pseudo-coordinate system named toplevel. The toplevel coordinate system is not a real coordinate system, but is used to refer to the highest level coordinate system in a given region. It is particularly useful in genomes that are incompletely assembled. For example, the latest zebrafish genome consists of a set of assembled chromosomes, and a set of supercontigs that are not part of any chromosome. In this example, the toplevel coordinate system sometimes refers to the chromosome coordinate system and sometimes to the supercontig coordinate system, depending on the region it is used in.
|
556
|
+
|
557
|
+
* List all coordinate systems in this database:
|
558
|
+
|
559
|
+
coord_systems = CoordSystem.find(:all)
|
560
|
+
coord_systems.each do |coord_system|
|
561
|
+
puts coord_system.name + "\t" + coord_system.version.to_s
|
562
|
+
end
|
563
|
+
|
564
|
+
* Get all slices on the highest coordinate system:
|
565
|
+
|
566
|
+
slices = Slice.fetch_all('top_level')
|
567
|
+
|
568
|
+
== Transform
|
569
|
+
|
570
|
+
Features on a seq_region in a given coordinate system may be moved to another coordinate system. This is useful if you are working with a particular coordinate system but you are interested in obtaining the feature's coordinates in another coordinate system.
|
571
|
+
|
572
|
+
The Ensembl::Core::Sliceable#transform method (available to all features) can be used to move a feature to any coordinate system which is in the database. The returned feature is a copy of the original feature, but with a different seq_region associated with it, as well as different seq_region_start, seq_region_end and seq_region_strand values.
|
573
|
+
|
574
|
+
# Suppose original_feature is on the 'chromosome' coordinate system
|
575
|
+
new_feature = original_feature.transform('clone')
|
576
|
+
if new_feature.nil?
|
577
|
+
puts "Feature is not defined in clonal coordinate system"
|
578
|
+
else
|
579
|
+
puts "Feature's clonal position:"
|
580
|
+
puts new_feature.seq_region.name
|
581
|
+
puts new_feature.seq_region_start.to_s + ".." + new_feature.seq_region_end.to_s
|
582
|
+
end
|
583
|
+
|
584
|
+
To print out the position of a feature (i.e. the concatenated seq_region name, start and end), it is easier to create a slice of it first and then call the Ensembl::Core::Slice#display_name method:
|
585
|
+
|
586
|
+
puts new_feature.slice.display_name
|
587
|
+
|
588
|
+
The transform method returns a copy of the original feature in the new coordinate system, or nil if the feature is not defined in that coordinate system. A feature is considered to be undefined in a coordinate system if it overlaps an undefined region or if it crosses a coordinate system boundary. Take for example the tiling path relationship between chromosome and contig coordinate systems:
|
589
|
+
|
590
|
+
|~~~~~~~| (Feature A) |~~~~| (Feature B)
|
591
|
+
|
592
|
+
(ctg 1) [=============]
|
593
|
+
(ctg 2) (------==========] (ctg 2)
|
594
|
+
(ctg 3) (--============] (ctg3)
|
595
|
+
|
596
|
+
Both Feature A and Feature B are defined in the chromosomal coordinate system described by the tiling path of contigs. However, Feature A is not defined in the contig coordinate system because it spans both Contig 1 and Contig 2. Feature B, on the other hand, is still defined in the contig coordinate system.
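The rule can be illustrated with a small plain-Ruby sketch that does not touch the database. It assumes a made-up tiling path of three forward-oriented contigs; Segment, TILING_PATH and transform_to_contig are hypothetical names, and the logic is only a simplified model of the behaviour described here, not the library's implementation:

```ruby
# One piece of a tiling path: chromosome range chr_start..chr_end is
# covered by a contig, starting at contig position ctg_start.
# Hypothetical structure; forward strand only, to keep the sketch short.
Segment = Struct.new(:contig, :chr_start, :chr_end, :ctg_start)

# Made-up tiling path for illustration.
TILING_PATH = [
  Segment.new('ctg1', 1,   100, 1),
  Segment.new('ctg2', 101, 200, 41),  # the first 40 bases of ctg2 are unused
  Segment.new('ctg3', 201, 300, 11)   # the first 10 bases of ctg3 are unused
]

# Returns [contig, start, end] in contig coordinates, or nil when the
# feature is not wholly contained in a single segment.
def transform_to_contig(chr_start, chr_end, path = TILING_PATH)
  seg = path.find { |s| chr_start >= s.chr_start && chr_end <= s.chr_end }
  return nil if seg.nil?
  offset = seg.ctg_start - seg.chr_start
  [seg.contig, chr_start + offset, chr_end + offset]
end

feature_a = transform_to_contig(90, 110)   # spans ctg1 and ctg2
feature_b = transform_to_contig(220, 240)  # lies inside ctg3

puts feature_a.nil? ? 'Feature A is not defined on contigs' : feature_a.inspect
puts feature_b.inspect
```

In this model Feature A yields nil because no single segment contains it, while Feature B maps cleanly onto ctg3.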
|
597
|
+
|
598
|
+
The special toplevel coordinate system can also be used in this instance to move the feature to the highest possible coordinate system in a given region:
|
599
|
+
|
600
|
+
new_feature = original_feature.transform('toplevel')
|
601
|
+
puts new_feature.slice.display_name
|
602
|
+
|
603
|
+
*NOTE*: In contrast to the Perl API, there is no #transfer method.
|
604
|
+
|
605
|
+
== Project
|
606
|
+
|
607
|
+
When moving features between coordinate systems it is usually sufficient to use the Ensembl::Core::Sliceable#transform method. Sometimes, however, it is necessary to obtain coordinates in another coordinate system even when a coordinate system boundary is crossed. Even though the feature is considered to be undefined in this case, the feature's coordinates can still be obtained in the requested coordinate system using the Slice#project method.
|
608
|
+
|
609
|
+
While #transform is a method only available to features, both slices and features have their own #project methods, which take the same arguments and have the same return values. The #project method takes a coordinate system name as an argument and returns an array of Slice and Gap objects. The following example illustrates the use of the #project method on a slice. The #project method on a feature can be used in the same way. As with the feature #transform method the pseudo coordinate system toplevel can be used to indicate you wish to project to the highest possible level.
|
610
|
+
|
611
|
+
original_slice = Slice.fetch_by_region('chromosome', '4', 329500, 380000)
|
612
|
+
target_slices = original_slice.project('contig')
|
613
|
+
target_slices.each do |ts|
|
614
|
+
puts ts.display_name
|
615
|
+
end
|
616
|
+
|
617
|
+
The above returns (for Bos taurus):
|
618
|
+
contig::AAFC03092598:60948:61145:1
|
619
|
+
contig::AAFC03118261:25411:37082:1
|
620
|
+
contig::AAFC03092594:1:3622:-1
|
621
|
+
contig:gap:50
|
622
|
+
contig::AAFC03092597:820:35709:-1
|
623
|
+
contig::AAFC03032210:13347:13415:1
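To show how a projection can yield several pieces and a gap, here is a plain-Ruby sketch that splits a chromosome range along a made-up tiling path. Tile and project_to_contigs are hypothetical names, only the forward strand is handled, and the code is a simplified model rather than the implementation of #project:

```ruby
# One piece of a tiling path (hypothetical structure, forward strand only):
# chromosome range chr_start..chr_end maps to a contig from ctg_start on.
Tile = Struct.new(:contig, :chr_start, :chr_end, :ctg_start)

# Split chr_start..chr_end into contig pieces and gaps.
# Returns an array of strings such as "ctg1:90:100" or "gap:50".
def project_to_contigs(chr_start, chr_end, path)
  pieces = []
  position = chr_start
  path.sort_by(&:chr_start).each do |tile|
    next if tile.chr_end < position      # tile lies before the range
    break if tile.chr_start > chr_end    # tile lies after the range
    # An uncovered stretch before this tile becomes a gap.
    if tile.chr_start > position
      pieces << "gap:#{tile.chr_start - position}"
      position = tile.chr_start
    end
    piece_end = [tile.chr_end, chr_end].min
    offset = tile.ctg_start - tile.chr_start
    pieces << "#{tile.contig}:#{position + offset}:#{piece_end + offset}"
    position = piece_end + 1
  end
  # Anything left after the last tile is also a gap.
  pieces << "gap:#{chr_end - position + 1}" if position <= chr_end
  pieces
end

path = [
  Tile.new('ctg1', 1,   100, 1),
  Tile.new('ctg2', 151, 250, 41)   # chromosome bases 101..150 are not covered
]
puts project_to_contigs(90, 160, path)
```

For this made-up path the range 90..160 comes back as three pieces: the tail of ctg1, a gap of 50 bases, and the start of the used part of ctg2.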
|