ruby-ensembl-api 0.9.6

Files changed (54)
  1. data/TUTORIAL.rdoc +623 -0
  2. data/bin/ensembl +40 -0
  3. data/lib/ensembl.rb +64 -0
  4. data/lib/ensembl/core/activerecord.rb +1914 -0
  5. data/lib/ensembl/core/collection.rb +60 -0
  6. data/lib/ensembl/core/project.rb +264 -0
  7. data/lib/ensembl/core/slice.rb +693 -0
  8. data/lib/ensembl/core/transcript.rb +425 -0
  9. data/lib/ensembl/core/transform.rb +97 -0
  10. data/lib/ensembl/db_connection.rb +216 -0
  11. data/lib/ensembl/variation/activerecord.rb +253 -0
  12. data/lib/ensembl/variation/variation.rb +163 -0
  13. data/test/unit/data/seq_c6qbl.fa +10 -0
  14. data/test/unit/data/seq_cso19_coding.fa +16 -0
  15. data/test/unit/data/seq_cso19_transcript.fa +28 -0
  16. data/test/unit/data/seq_drd3_gene.fa +838 -0
  17. data/test/unit/data/seq_drd3_transcript.fa +22 -0
  18. data/test/unit/data/seq_drd4_transcript.fa +24 -0
  19. data/test/unit/data/seq_forward_composite.fa +1669 -0
  20. data/test/unit/data/seq_par_boundary.fa +169 -0
  21. data/test/unit/data/seq_rnd3_transcript.fa +47 -0
  22. data/test/unit/data/seq_ub2r1_coding.fa +13 -0
  23. data/test/unit/data/seq_ub2r1_gene.fa +174 -0
  24. data/test/unit/data/seq_ub2r1_transcript.fa +26 -0
  25. data/test/unit/data/seq_y.fa +2 -0
  26. data/test/unit/ensembl_genomes/test_collection.rb +51 -0
  27. data/test/unit/ensembl_genomes/test_gene.rb +52 -0
  28. data/test/unit/ensembl_genomes/test_slice.rb +71 -0
  29. data/test/unit/ensembl_genomes/test_variation.rb +17 -0
  30. data/test/unit/release_50/core/test_project.rb +215 -0
  31. data/test/unit/release_50/core/test_project_human.rb +58 -0
  32. data/test/unit/release_50/core/test_relationships.rb +66 -0
  33. data/test/unit/release_50/core/test_sequence.rb +175 -0
  34. data/test/unit/release_50/core/test_slice.rb +121 -0
  35. data/test/unit/release_50/core/test_transcript.rb +108 -0
  36. data/test/unit/release_50/core/test_transform.rb +223 -0
  37. data/test/unit/release_50/variation/test_activerecord.rb +143 -0
  38. data/test/unit/release_50/variation/test_variation.rb +84 -0
  39. data/test/unit/release_53/core/test_gene.rb +66 -0
  40. data/test/unit/release_53/core/test_project.rb +96 -0
  41. data/test/unit/release_53/core/test_project_human.rb +65 -0
  42. data/test/unit/release_53/core/test_slice.rb +47 -0
  43. data/test/unit/release_53/core/test_transform.rb +63 -0
  44. data/test/unit/release_53/variation/test_activerecord.rb +145 -0
  45. data/test/unit/release_53/variation/test_variation.rb +71 -0
  46. data/test/unit/release_56/core/test_gene.rb +66 -0
  47. data/test/unit/release_56/core/test_project.rb +96 -0
  48. data/test/unit/release_56/core/test_slice.rb +54 -0
  49. data/test/unit/release_56/core/test_transform.rb +63 -0
  50. data/test/unit/release_56/variation/test_activerecord.rb +142 -0
  51. data/test/unit/release_56/variation/test_variation.rb +68 -0
  52. data/test/unit/test_connection.rb +66 -0
  53. data/test/unit/test_releases.rb +136 -0
  54. metadata +128 -0
@@ -0,0 +1,623 @@
1
+ = Ruby Ensembl Core API tutorial
2
+ By Jan Aerts. Copy-paste-modified from the excellent perl API tutorial at
3
+ http://www.ensembl.org/info/software/core/core_tutorial.html (with permission of the core Ensembl team).
4
+
5
+ Based on release 50.
6
+
7
+ == Introduction
8
+ This tutorial describes how to use the Ensembl Core Ruby API. It is intended to be an introduction and demonstration of the general API concepts. This tutorial is not comprehensive, but it will hopefully enable the reader to become productive quickly, and facilitate a rapid understanding of the core system. This tutorial assumes at least some familiarity with Ruby.
9
+
10
+ The Ruby API provides a level of abstraction over the Ensembl Core databases. To external users the API may be useful to automate the extraction of particular data. As a brief introduction this tutorial focuses primarily on the retrieval of data from the Ensembl Core databases.
11
+
12
+ The Ruby API is only one of many ways of accessing the data stored in Ensembl. Additionally there is a genome browser web interface, and the BioMart system. BioMart may be a more appropriate tool for certain types of data mining.
13
+
14
+ This API is for read-only querying of the database.
15
+
16
+ == Other sources of information
17
+ The Ensembl Core API has a decent set of code documentation in the form of standard Ruby RDoc. This documentation is mixed in with the actual code, but can be automatically extracted and formatted using some software tools. One version of this documentation is available at the website you're looking at.
18
+
19
+ If you have your RUBYLIB environment variable set correctly, you can use the command ri. For example the following command will bring up some documentation about the Slice class and each of its methods:
20
+
21
+ ri Ensembl::Core::Slice
22
+
23
+ For additional information you can contact Jan Aerts (jan.aerts@sanger.ac.uk) or preferably send an email to the bioruby mailing list (see http://www.bioruby.org).
24
+
25
+ == Obtaining and installing the code
26
+ The Ensembl Ruby API is made available as a gem. See the github site for more information (http://github.com/jandot/ruby-ensembl-api/wikis/home).
27
+
28
+ Basically, it comes down to:
29
+ sudo gem install jandot-ruby-ensembl-api --source http://gems.github.com
30
+
31
+ == Code conventions
32
+ Several naming conventions are used throughout the API. Learning these conventions will aid in your understanding of the code.
33
+
34
+ Variable names are underscore-separated all lower-case words.
35
+ slice_1
36
+ exon_1
37
+ gene_a
38
+
39
+ Class and package names are CamelCase words that begin with capital letters.
40
+
41
+ Ensembl::Core::Gene
42
+ Ensembl::Core::Exon
43
+ Ensembl::Core::CoordSystem
44
+ Ensembl::Core::SeqRegion
45
+
46
+ Method names are entirely lower-case, underscore separated words. Methods are called on an object or class by appending a period to that object or class and adding the method name.
47
+
48
+ Ensembl::Core::Slice.genes
49
+ transcript_a.five_prime_utr_seq
50
+
51
+ Class methods are responsible for the creation of various objects. Most of this is standard ActiveRecord behaviour and will be discussed below.
52
+
53
+ == ActiveRecord
54
+
55
+ Most of the API is based on ActiveRecord to get data from the database. In general, each table is described by a class with the same name: the coord_system table is covered by the Ensembl::Core::CoordSystem class, the seq_region table is covered by the Ensembl::Core::SeqRegion class, etc. As a result, accessors are available for all columns in each table. For example, the seq_region table has the following columns: seq_region_id, name, coord_system_id and length. Through ActiveRecord, these column names become available as attributes of Ensembl::Core::SeqRegion objects:
56
+
57
+ puts my_seq_region.seq_region_id
58
+ puts my_seq_region.name
59
+ puts my_seq_region.coord_system_id
60
+ puts my_seq_region.length.to_s
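The table-to-class naming rule described above can be illustrated in plain Ruby. This is only a sketch of the convention: the helper name class_name_for is made up for this example and is not part of the Ensembl API or of ActiveRecord.

```ruby
# Hypothetical helper (not part of the API): turn a snake_case table name
# into the CamelCase class name that the naming convention suggests.
def class_name_for(table_name)
  table_name.split('_').map { |word| word.capitalize }.join
end

puts class_name_for('coord_system')  # CoordSystem
puts class_name_for('seq_region')    # SeqRegion
```

The resulting name would then be looked up inside the Ensembl::Core module, e.g. Ensembl::Core::CoordSystem.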
61
+
62
+ ActiveRecord makes it easy to extract data from those tables using the collection of find methods. There are three types of find methods (e.g. for the Ensembl::Core::CoordSystem class):
63
+
64
+ * find based on primary key in table:
65
+
66
+ my_coord_system = CoordSystem.find(5)
67
+
68
+ * find_by_sql:
69
+
70
+ my_coord_system = CoordSystem.find_by_sql("SELECT * FROM coord_system WHERE name = 'chromosome'")
71
+
72
+ * find_by_<insert_your_column_name_here>
73
+
74
+ my_coord_system1 = CoordSystem.find_by_name('chromosome')
75
+ my_coord_system2 = CoordSystem.find_by_rank(3)
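To give a feeling for how find_by_<column> methods can follow from the column names, here is a toy in-memory table written in plain Ruby. It is a sketch under stated assumptions, not ActiveRecord's implementation: the ToyTable class and the row values are invented for this illustration, and no database is involved.

```ruby
# Toy in-memory table; NOT ActiveRecord. It only illustrates how
# find_by_<column> style methods can be derived from column names.
class ToyTable
  def initialize(rows)
    @rows = rows
  end

  # Column names are taken from the keys of the first row.
  def column_names
    @rows.first.keys.map { |key| key.to_s }
  end

  # Answer find_by_<column>(value) for every known column.
  def method_missing(name, *args)
    match = /\Afind_by_(\w+)\z/.match(name.to_s)
    if match && column_names.include?(match[1])
      column = match[1].to_sym
      @rows.find { |row| row[column] == args.first }
    else
      super
    end
  end

  def respond_to_missing?(name, include_private = false)
    match = /\Afind_by_(\w+)\z/.match(name.to_s)
    (match && column_names.include?(match[1])) || super
  end
end

# Made-up rows, loosely modelled on the coord_system table.
coord_systems = ToyTable.new([
  { :coord_system_id => 1, :name => 'chromosome', :rank => 1 },
  { :coord_system_id => 2, :name => 'contig',     :rank => 4 }
])

puts coord_systems.find_by_name('chromosome')[:coord_system_id]  # 1
puts coord_systems.find_by_rank(4)[:name]                        # contig
```

As with the real finders, asking for a column that does not exist raises a NoMethodError, which is why listing column_names first is useful.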
76
+
77
+ To find out which find_by_<column> methods are available, you can list the column names using the column_names class method:
78
+
79
+ puts Ensembl::Core::CoordSystem.column_names.join("\t")
80
+
81
+ For more information on the find methods, see ar.rubyonrails.org/classes/ActiveRecord/Base.html#M000344
82
+
83
+ The relationships between different tables are accessible through the classes as well. For example, to loop over all seq_regions belonging to a coord_system (a coord_system "has many" seq_regions):
84
+
85
+ chr_coord_system = CoordSystem.find_by_name('chromosome')
86
+ chr_coord_system.seq_regions.each do |seq_region|
87
+ puts seq_region.name
88
+ end
89
+
90
+ Of course, you can go the other way as well (a seq_region "belongs to" a coord_system):
91
+
92
+ chr4 = SeqRegion.find_by_name('4')
93
+ puts chr4.coord_system.name #--> 'chromosome'
94
+
95
+ To find out what relationships exist for a given class, you can use the reflect_on_all_associations class method:
96
+
97
+ puts SeqRegion.reflect_on_all_associations(:has_many).collect{|a| a.name.to_s}.join("\n")
98
+ puts SeqRegion.reflect_on_all_associations(:has_one).collect{|a| a.name.to_s}.join("\n")
99
+ puts SeqRegion.reflect_on_all_associations(:belongs_to).collect{|a| a.name.to_s}.join("\n")
100
+
101
+ == Connecting to the Ensembl database and a minimal script
102
+
103
+ All data used and created by Ensembl is stored in MySQL relational databases. If you want to access this database the first thing you have to do is to connect to it. This is done behind the scenes using the ActiveRecord module.
104
+
105
+ First, we need to tell our computer where it can find the API code. This information is contained in the RUBYLIB environment variable. Suppose you have saved the API in /usr/local/lib/ruby/ensembl-api (with subdirectories lib/, test/, samples/, ...); you could then set the environment variable in a bash shell like this:
106
+ export RUBYLIB=$RUBYLIB:/usr/local/lib/ruby/ensembl-api/lib
107
+
108
+ Next, we need to import all Ruby modules that we will be using. Every Ensembl script that you write will contain a require statement like the following:
109
+
110
+ require 'ensembl'
111
+
112
+ Alternatively, if you installed the API as a gem, you would write:
113
+
114
+ require 'rubygems'
115
+ require 'ensembl'
116
+
117
+
118
+ Ensembl stores its data in a separate database for each species and each release of that species. The Ruby Ensembl API does a lot automatically, so you only have to know the species name to connect to the release 45 version of its core database. This name should be provided in snake_case (all lowercase connected by underscore):
119
+
120
+ Ensembl::Core::CoreDBConnection.connect('homo_sapiens')
121
+
122
+ With the connection established, you'll be able to get objects from the database, e.g.
123
+
124
+ chromosome_4 = Ensembl::Core::SeqRegion.find_by_name('4')
125
+
126
+ You have to include the 'Ensembl::Core::' bit in every call to a class. However, if you include the line
127
+
128
+ include Ensembl::Core
129
+
130
+ just after you "require 'ensembl'", you don't have to anymore. The rest of this tutorial expects you to have done the include command. So a very short but complete ruby script could look like this:
131
+
132
+ require 'ensembl'
133
+ include Ensembl::Core
134
+ CoreDBConnection.connect('homo_sapiens')
135
+ chromosome_4 = SeqRegion.find_by_name('4')
136
+ puts chromosome_4.name
137
+
138
+ == Slices
139
+
140
+ A Slice object represents a single continuous region of a genome. Slices can be used to obtain sequence, features or other information from a particular region of interest. There are several ways to obtain a slice, but we will start with the Ensembl::Core::Slice#fetch_by_region method which is the most commonly used. This class method takes numerous arguments but most of them are optional. In order, the arguments are: coord_system_name, seq_region_name, start, end, strand, coord_system_version. The following are several examples of how to use the Ensembl::Core::Slice#fetch_by_region method:
141
+
142
+ * Obtain a slice covering the entire chromosome X
143
+
144
+ slice = Slice.fetch_by_region('chromosome', 'X')
145
+
146
+ * Obtain a slice covering the entire clone AL359765.6
147
+
148
+ slice = Slice.fetch_by_region('clone', 'AL359765.6')
149
+
150
+ * Obtain a slice covering an entire NT contig
151
+
152
+ slice = Slice.fetch_by_region('supercontig', 'NT_011333')
153
+
154
+ * Obtain a slice covering the region from 1MB to 2MB (inclusively) of chromosome 20
155
+
156
+ slice = Slice.fetch_by_region('chromosome', '20', 1000000, 2000000)
157
+
158
+ Another useful way to obtain a slice is with respect to a gene, e.g. with 5kb flanking sequence:
159
+
160
+ slice = Slice.fetch_by_gene_stable_id('ENSG00000099889', 5000)
161
+
162
+ This will return a slice that contains the sequence of the gene specified by its stable Ensembl ID. It also returns 5000bp of flanking sequence at both the 5' and 3' ends, which is useful if you are interested in the environs that a gene inhabits. You needn't have the flanking sequence if you don't want it: in that case set the number of flanking bases to zero or simply omit the second argument entirely. Note that the fetch_by_gene_stable_id() method always returns a slice on the forward strand even if the gene is on the reverse strand.
163
+
164
+ To retrieve a set of slices from a particular coordinate system the fetch_all method can be used:
165
+
166
+ * Retrieve slices of every chromosome in the database
167
+
168
+ slices = Slice.fetch_all('chromosome')
169
+
170
+ * Retrieve slices of every BAC clone in the database
171
+
172
+ slices = Slice.fetch_all('clone')
173
+
174
+ For certain types of analysis it is necessary to break up regions into smaller manageable pieces. The method Slice#split can be used to break up larger slices into smaller component slices. The following code creates an array of subslices of chromosome 1, with the (maximal) length of each slice 100000 bp and an overlap of 250 bp.
175
+
176
+ big_slice = Slice.fetch_by_region('chromosome', '1')
177
+ subslices = big_slice.split(100000, 250)
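The arithmetic behind splitting with an overlap can be sketched on bare coordinates, without a database connection. This is an illustration only: split_coordinates is a hypothetical helper, and the exact boundaries chosen by the real Slice#split may differ from the ones computed here.

```ruby
# Sketch only: split the inclusive range start..stop into pieces of at most
# max_size bases, where consecutive pieces share `overlap` bases.
def split_coordinates(start, stop, max_size, overlap)
  raise ArgumentError, 'overlap must be smaller than max_size' if overlap >= max_size

  pieces = []
  piece_start = start
  loop do
    # The last piece is cut short at the end of the region.
    piece_stop = [piece_start + max_size - 1, stop].min
    pieces << [piece_start, piece_stop]
    break if piece_stop == stop
    # Step back by `overlap` bases so neighbouring pieces share them.
    piece_start = piece_stop - overlap + 1
  end
  pieces
end

split_coordinates(1, 250_000, 100_000, 250).each do |piece_start, piece_stop|
  puts "#{piece_start}-#{piece_stop}"
end
# 1-100000
# 99751-199750
# 199501-250000
```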
178
+
179
+ To obtain sequence from a slice the Slice#seq method can be used:
180
+
181
+ seq = slice.seq
182
+ puts seq
183
+
184
+ We can query the slice for information about itself:
185
+
186
+ seq_region = slice.seq_region.name
187
+ coord_system = slice.seq_region.coord_system.name
188
+ start = slice.start
189
+ stop = slice.stop
190
+ strand = slice.strand
191
+
192
+ puts "Slice: #{coord_system} #{seq_region} #{start}-#{stop} (#{strand})"
193
+
194
+ Many classes can provide a set of features which overlap a slice. The slice itself also provides a means to obtain features which overlap its region. To obtain a list of genes which overlap a slice:
195
+
196
+ slice_a = Slice.fetch_by_region('chromosome','X')
197
+ genes = slice_a.genes
198
+
199
+ *CAUTION*: The slice concept is a little bit different from that in the perl API. If you ask a gene for its slice using the perl API, you get a slice covering the _whole_ of the chromosome. In contrast, the slice created by the ruby API only contains that bit covered by the gene. The Ensembl::Core::SeqRegion class is used to refer to whole things. I just found it much more intuitive like that...
200
+
201
+ == Features
202
+
203
+ Features are objects in the database which have a defined location on the genome. All features in Ensembl include the Ensembl::Core::Sliceable mixin and have the following location defining attributes: start, stop, strand, slice.
204
+
205
+ All feature objects can be retrieved using the #find method of their class or any of the generic #find_by_() methods (see the ActiveRecord bit of this tutorial). The following example illustrates how Transcript features and DnaAlignFeature features can be obtained from the database. All features in the database can be retrieved in similar ways through their own classes.
206
+
207
+ * Get a slice of chromosome 20, 10MB-11MB
208
+
209
+ slice = Slice.fetch_by_region('chromosome', '20', 10000000, 11000000 )
210
+
211
+ * Fetch all of the transcripts overlapping chromosome 20, 10MB-11MB
212
+
213
+ transcripts = slice.transcripts
214
+ transcripts.each do |transcript|
215
+ name = transcript.stable_id
216
+ internal_id = transcript.id
217
+ start = transcript.start
218
+ stop = transcript.stop
219
+ strand = transcript.strand
220
+
221
+ puts "Transcript #{name} [#{internal_id}] #{start}-#{stop} (#{strand})"
222
+ end
223
+
224
+ * Fetch all of the DNA-DNA alignments overlapping chromosome 20, 10MB-11MB
225
+
226
+ dafs = slice.dna_align_features
227
+ dafs.each do |daf|
228
+ name = daf.hit_name
229
+ internal_id = daf.id
230
+ start = daf.start
231
+ stop = daf.stop
232
+ strand = daf.strand
233
+
234
+ puts "DNA alignment #{name} [#{internal_id}] #{start}-#{stop} (#{strand})"
235
+ end
236
+
237
+ * Fetch a transcript by its internal identifier
238
+
239
+ transcript = Transcript.find(100)
240
+
241
+ * Fetch a DnaAlignFeature by its internal identifier
242
+
243
+ daf = DnaAlignFeature.find(100)
244
+
245
+ All features also have the transform method, which is described in detail in a later section of this tutorial.
246
+
247
+ === Features across coordinate systems
248
+
249
+ In the Ensembl database, some features might be related to one coordinate system, while other features are related to another one (for more information on coordinate systems, see below). For example, there are three coordinate systems in cow: contigs, scaffolds and chromosomes. Scaffold Chr4.003.122 does not have any simple_features on it. However, the equivalent regions in the contig and chromosome coordinate systems have 37 and 85 (a total of 122), respectively. If you therefore ask that scaffold to list its simple_features, you won't get any. A workaround is to first create a slice for this scaffold, and ask that _slice_ for its simple_features.
250
+
251
+ scaffold = SeqRegion.find_by_name('Chr4.003.122')
252
+ puts scaffold.simple_features.length #--> 0
253
+ slice = Slice.fetch_by_region('scaffold','Chr4.003.122')
254
+ puts slice.simple_features.length #--> 122
255
+
256
+ or even:
257
+ puts scaffold.slice.simple_features.length #--> 122
258
+
259
+ The reason this works, is that any retrieval for a slice also checks what coordinate systems that type of feature is annotated on.
260
+
261
+ == Genes, Transcripts, and Exons
262
+
263
+ Genes, exons and transcripts are also features and can be treated in the same way as any other feature within Ensembl. A transcript in Ensembl is a grouping of exons. A gene in Ensembl is a grouping of transcripts which share any overlapping (or partially overlapping) exons. Transcripts also have an associated Translation object which defines the UTR and CDS composition of the transcript. Introns are not defined explicitly in the database but can be obtained by the Ensembl::Core::Transcript#introns method (not implemented yet).
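Since introns are not stored explicitly, their coordinates have to be derived from the exons of a transcript. The following plain-Ruby sketch shows one way to do that. It assumes non-overlapping exons with inclusive coordinates; the ToyExon struct, the introns_between helper and the coordinates are invented for this illustration and are not the (unimplemented) Transcript#introns method.

```ruby
# Minimal stand-in for an exon: only the coordinates matter here.
ToyExon = Struct.new(:start, :stop)

# An intron spans the gap between two neighbouring exons:
# from one base after the left exon to one base before the right exon.
def introns_between(exons)
  sorted = exons.sort_by { |exon| exon.start }
  sorted.each_cons(2).map do |left, right|
    [left.stop + 1, right.start - 1]
  end
end

exons = [ToyExon.new(300, 400), ToyExon.new(100, 200), ToyExon.new(500, 650)]
p introns_between(exons)  # [[201, 299], [401, 499]]
```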
264
+
265
+ Important: like all Ensembl features the start of an exon is always less than or equal to the end of the exon, regardless of the strand it is on. The start of the transcript is the start of the first exon of a transcript on the forward strand or the start of the last exon of a transcript on the reverse strand. The start and end of a gene are defined to be the lowest start value of its transcripts and the highest end value respectively.
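The rule for gene boundaries can be written down in a few lines of plain Ruby. This is a sketch of the definition given above, not the API's own code: ToyTranscript, gene_boundaries and the coordinates are made up for the example.

```ruby
# Minimal stand-in for a transcript; start <= stop regardless of strand.
ToyTranscript = Struct.new(:start, :stop, :strand)

# A gene starts at the lowest transcript start and ends at the
# highest transcript end.
def gene_boundaries(transcripts)
  gene_start = transcripts.map { |transcript| transcript.start }.min
  gene_stop  = transcripts.map { |transcript| transcript.stop }.max
  [gene_start, gene_stop]
end

transcripts = [
  ToyTranscript.new(1200, 5400, -1),
  ToyTranscript.new(1000, 4800, -1),
  ToyTranscript.new(1500, 6000, -1)
]
p gene_boundaries(transcripts)  # [1000, 6000]
```

Note that the strand plays no role in the calculation: even for these reverse-strand transcripts, start remains the smaller coordinate.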
266
+
267
+ Genes, translations, transcripts and exons all have stable identifiers. These are identifiers that are assigned to Ensembl's predictions, and maintained in subsequent releases. For example, if a transcript (or a sufficiently similar transcript) is re-predicted in a future release then it will be assigned the same stable identifier as its predecessor.
268
+
269
+ The following is an example of the retrieval of a set of genes, transcripts and exons:
270
+
271
+ slice = Slice.fetch_by_region('chromosome','X',1000000,10000000)
272
+ puts slice.display_name
273
+
274
+ slice.genes.each do |gene|
275
+ puts "\t" + gene.stable_id
276
+
277
+ gene.transcripts.each do |transcript|
278
+ puts "\t\t" + transcript.stable_id
279
+
280
+ transcript.exons.each do |exon|
281
+ puts "\t\t\t" + exon.id.to_s
282
+ end
283
+ end
284
+ end
285
+
286
+ In addition to the methods which are present on every feature, the transcript class has many other methods which are commonly used. Several methods can be used to obtain transcript related sequences. At the time of writing this tutorial, these methods return strings rather than bioruby Bio::Sequence objects. The following example demonstrates the use of some of these methods:
287
+
288
+ * The Ensembl::Core::Transcript#seq method returns the concatenation of the exon sequences. This is the cDNA of the transcript:
289
+
290
+ puts "cDNA: " + transcript.seq
291
+
292
+ * The Ensembl::Core::Transcript#cds_seq method returns only the CDS of the transcript
293
+
294
+ puts "CDS: " + transcript.cds_seq
295
+
296
+ * UTR sequences are obtained via the five_prime_utr_seq and three_prime_utr_seq methods
297
+
298
+ fiv_utr = transcript.five_prime_utr_seq
299
+ thr_utr = transcript.three_prime_utr_seq
300
+
301
+ puts "5' UTR: " + ( fiv_utr.nil? ? 'None' : fiv_utr )
302
+ puts "3' UTR: " + ( thr_utr.nil? ? 'None' : thr_utr )
303
+
304
+ * The peptide sequence is obtained from the Ensembl::Core::Transcript#protein_seq method. If the transcript is non-coding, nil is returned.
305
+
306
+ peptide = transcript.protein_seq
307
+
308
+ puts "Translation: " + ( peptide.nil? ? 'None' : peptide )
309
+
310
+ == Translations and ProteinFeatures
311
+
312
+ Translation objects and peptide sequence can be extracted from a Transcript object. It is important to remember that some Ensembl transcripts are non-coding (pseudo-genes, ncRNAs, etc.) and have no translation. The primary purpose of a Translation object is to define the CDS and UTRs of its associated Transcript object. Peptide sequence is obtained directly from a Transcript object, not from a Translation object as might be expected. The following example obtains the peptide sequence of an Ensembl::Core::Transcript and the Ensembl::Core::Translation's stable identifier:
313
+
314
+ stable_id = 'ENST00000044768'
315
+
316
+ transcript = Transcript.find_by_stable_id(stable_id)
317
+
318
+ puts transcript.stable_id
319
+ puts transcript.translation.stable_id
320
+
321
+ --
322
+ NOTE TO SELF: the following bit is not implemented yet...
323
+
324
+ ProteinFeatures are features which are on an amino acid sequence rather than a nucleotide sequence. The method get_all_ProteinFeatures() can be used to obtain a set of protein features from a Translation object.
325
+
326
+ $translation = $transcript->translation();
327
+
328
+ my @pfeatures = @{ $translation->get_all_ProteinFeatures() };
329
+ while ( my $pfeature = shift @pfeatures ) {
330
+ my $logic_name = $pfeature->analysis()->logic_name();
331
+
332
+ printf(
333
+ "%d-%d %s %s %s\n",
334
+ $pfeature->start(), $pfeature->end(), $logic_name,
335
+ $pfeature->interpro_ac(),
336
+ $pfeature->idesc()
337
+ );
338
+ }
339
+
340
+ If only the protein features created by a particular analysis are desired the name of the analysis can be provided as an argument. To obtain the subset of features which are considered to be 'domain' features the convenience method get_all_DomainFeatures() can be used:
341
+
342
+ my $seg_features = $translation->get_all_ProteinFeatures('Seg');
343
+ my $domain_features = $translation->get_all_DomainFeatures();
344
+ ++
345
+
346
+ == PredictionTranscripts
347
+
348
+ PredictionTranscripts are the results of ab initio gene finding programs that are stored in Ensembl. Example programs include Genscan and SNAP. Prediction transcripts have the same interface as normal transcripts and thus they can be used in the same way.
349
+
350
+ prediction_transcripts = slice.prediction_transcripts
351
+ prediction_transcripts.each do |pt|
352
+ exons = pt.prediction_exons
353
+ type = pt.analysis.logic_name
354
+
355
+ puts "#{type} prediction has #{exons.length.to_s} exons"
356
+
357
+ exons.each do |exon|
358
+ puts exon.to_yaml
359
+ end
360
+ end
361
+
362
+ == Alignment Features
363
+
364
+ Two types of alignments are stored in the core Ensembl database: alignments of DNA sequence to the genome and alignments of peptide sequence to the genome. These can be retrieved as Ensembl::Core::DnaAlignFeatures and Ensembl::Core::ProteinAlignFeatures respectively. A single gapped alignment is represented by a single feature with a cigar line. A cigar line is a compact representation of a gapped alignment as a single string containing the letters M (match), D (deletion) and I (insertion), each prefixed by an integer length (the number may be omitted if it is 1).
365
+ --
366
+ NOTE TO SELF: not implemented yet
367
+ A gapped alignment feature can be broken into its component ungapped alignments by the method ungapped_features() which returns a list of FeaturePair objects.
368
+ ++
369
+ The following example shows the retrieval of some alignment features.
370
+
371
+ * Retrieve dna-dna alignment features from the slice region
372
+
373
+ features = slice.dna_align_features('Vertrna')
374
+ features.each do |f|
375
+ puts f.to_yaml
376
+ end
377
+
378
+ * Retrieve protein-dna alignment features from the slice region
379
+
380
+ features = slice.protein_align_features('Swall')
381
+ features.each do |f|
382
+ puts f.to_yaml
383
+ end
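The cigar line format described at the start of this section is easy to take apart with a regular expression. The following is a minimal sketch in plain Ruby: parse_cigar is a hypothetical helper, not a method of the API, and the cigar string is made up.

```ruby
# Split a cigar line into [operation, length] pairs.
# A missing number means a length of 1.
def parse_cigar(cigar_line)
  cigar_line.scan(/(\d*)([MDI])/).map do |count, operation|
    [operation, count.empty? ? 1 : count.to_i]
  end
end

p parse_cigar('10M3DM2I5M')
# [["M", 10], ["D", 3], ["M", 1], ["I", 2], ["M", 5]]
```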
384
+
385
+ == Repeats
386
+
387
+ Repetitive regions found by RepeatMasker and TRF (Tandem Repeat Finder) are represented in the Ensembl database as RepeatFeatures. Low-complexity regions are found by the program Dust and are also stored as RepeatFeatures. RepeatFeatures can be retrieved and used in the same way as other Ensembl features.
388
+
389
+ repeats = slice.repeats
390
+ repeats.each do |r|
391
+ puts r.display_id + "\t" + r.start.to_s + "\t" + r.stop.to_s
392
+ end
393
+
394
+ --
395
+ NOTE TO SELF: not implemented yet
396
+ RepeatFeatures are used to perform repeat masking of the genomic sequence. Hard or soft-masked genomic sequence can be retrieved from Slice objects using the Slice#repeatmasked_seq method. Hard-masking replaces sequence in repeat regions with Ns. Soft-masking replaces sequence in repeat regions with lower-case sequence.
397
+
398
+ unmasked_seq = slice.seq
399
+ hardmasked_seq = slice.repeatmasked_seq
400
+ softmasked_seq = slice.repeatmasked_seq(nil, 1)
401
+
402
+ * Soft-mask sequence using TRF results only
403
+
404
+ tandem_masked_seq = slice.repeatmasked_seq(['TRF'], 1)
405
+ ++
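Although repeat masking is not available through the API yet, the difference between hard and soft masking can be shown on a plain string. This is a sketch under stated assumptions: mask_repeats is a hypothetical helper, repeat regions are given as inclusive 1-based [start, stop] pairs, and the sequence and regions are made up.

```ruby
# Mask repeat regions in a copy of the sequence.
# Hard masking (default) writes Ns; soft masking lower-cases the bases.
def mask_repeats(sequence, repeat_regions, soft = false)
  masked = sequence.dup
  repeat_regions.each do |start, stop|
    length = stop - start + 1
    # Coordinates are 1-based, Ruby string indices are 0-based.
    replacement = soft ? masked[start - 1, length].downcase : 'N' * length
    masked[start - 1, length] = replacement
  end
  masked
end

seq     = 'ACGTACGTACGT'
repeats = [[3, 5], [9, 10]]

puts mask_repeats(seq, repeats)        # ACNNNCGTNNGT
puts mask_repeats(seq, repeats, true)  # ACgtaCGTacGT
```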
406
+
407
+ == Markers
408
+
409
+ Markers are imported into the Ensembl database from UniSTS and several other sources. A marker in Ensembl consists of a pair of primer sequences, an expected product size and a set of associated identifiers known as synonyms. Markers are placed on the genome electronically using an analysis program such as ePCR and their genomic positions are retrievable as MarkerFeatures. Map locations (genetic, radiation hybrid and in situ hybridization) for markers obtained from actual experimental evidence are also accessible.
410
+
411
+ Markers can be fetched by their name. The Marker#find_all_by_name returns an array, and Marker#find_by_name returns the first element of that array, i.e. a marker object.
412
+
413
+ marker = Marker.find_by_name('D9S1038E')
414
+
415
+ * Display the various names associated with the same marker
416
+
417
+ marker.marker_synonyms.each do |ms|
418
+ if ms.source.nil?
419
+ puts ms.name
420
+ else
421
+ puts ms.source + ':' + ms.name
422
+ end
423
+ end
424
+
425
+ * Display the primer info
426
+
427
+ puts "left primer: " + marker.left_primer.to_s
428
+ puts "right primer: " + marker.right_primer.to_s
429
+ puts "product size: " + marker.min_primer_dist.to_s + '-' + marker.max_primer_dist.to_s
430
+
431
+ * Display genetic/RH/FISH map information
432
+
433
+ puts "Map locations:"
434
+ marker.marker_map_locations.each do |mapping|
435
+ puts mapping.map.map_name + "\t" + mapping.chromosome_name + "\t" + mapping.position.to_s
436
+ end
437
+
438
+ MarkerFeatures, which represent genomic positions of markers, can be retrieved and manipulated in the same way as other Ensembl features.
439
+
440
+ * Obtain the positions for an already retrieved marker
441
+
442
+ marker.marker_features.each do |mf|
443
+ puts mf.slice.display_name
444
+ end
445
+
446
+ * Retrieve all marker features in a given region
447
+
448
+ marker_features = slice.marker_features
449
+ marker_features.each do |mf|
450
+ puts mf.slice.display_name
451
+ end
452
+
453
+ == MiscFeatures
454
+
455
+ MiscFeatures are features with arbitrary attributes which are placed into arbitrary groupings. MiscFeatures can be retrieved as any other feature and are classified into distinct sets by a set code. Generally it only makes sense to retrieve all features which have a particular set code because very diverse types of MiscFeatures are stored in the database.
456
+
457
+ MiscFeature attributes are represented by Attribute objects and can be retrieved via the #misc_attribs method.
458
+
459
+ The following example retrieves all MiscFeatures representing ENCODE regions on a given slice and prints out their attributes:
460
+
461
+ encode_regions = slice.misc_features('encode')
462
+ encode_regions.each do |er|
463
+ attributes = er.misc_attribs
464
+ attributes.each do |a|
465
+ puts a.to_s
466
+ end
467
+ end
468
+
469
+ This example retrieves all misc features representing a BAC clone via its name and prints out their location and other information:
470
+
471
+ clones = MiscFeature.find_all_by_attribute_type_value('name', 'RP11-62N12')
472
+ clones.each do |clone|
473
+ slice = clone.slice
474
+ puts slice.to_yaml
475
+
476
+ attributes = clone.misc_attribs
477
+ attributes.each do |a|
478
+ puts a.to_s
479
+ end
480
+ end
481
+
482
+ == External References
483
+
484
+ Ensembl cross references its genes, transcripts and translations with identifiers from other databases. A cross reference is referenced by a Xref object. The following code snippet retrieves and prints Xrefs for a gene, its transcripts and its translations:
485
+
486
+ * Get the 'COG6' gene from human
487
+
488
+ cog6 = Gene.find_by_name('COG6')
489
+ puts 'GENE: ' + cog6.stable_id + " (internal id: " + cog6.id.to_s + ")"
490
+
491
+ cog6.xrefs.each do |x|
492
+ puts x.to_s
493
+ end
494
+
495
+ cog6.transcripts.each do |t|
496
+ puts 'TRANSCRIPT: ' + t.stable_id
497
+ t.xrefs.each do |x|
498
+ puts "\s\s" + x.to_s
499
+ end
500
+
501
+ # Watch out: pseudogenes have no translation
502
+ if ! t.translation.nil?
503
+ translation = t.translation
504
+ puts "\tTRANSLATION: " + translation.stable_id
505
+ translation.xrefs.each do |x|
506
+ puts "\t\s\s" + x.to_s
507
+ end
508
+ end
509
+ end
510
+
511
+ Often it is useful to obtain all of the Xrefs associated with a gene and its associated transcripts and translation as in the above example. As a shortcut to calling #xrefs on all of the above objects the Gene#all_xrefs method can be used instead. The above example could be shortened by using the following:
512
+
513
+ cog6.all_xrefs.each do |x|
514
+ puts x.to_s
515
+ end
516
+
517
+ This returns all xrefs for the gene itself, including those for all transcripts and translations.
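What such a shortcut has to do can be sketched with a few structs in plain Ruby. This is not the implementation of Gene#all_xrefs: the XrefGene, XrefTranscript and XrefTranslation structs, the collect_all_xrefs helper and the placeholder xref labels are all invented for this illustration.

```ruby
# Minimal stand-ins for the three object types that carry xrefs.
XrefTranslation = Struct.new(:xrefs)
XrefTranscript  = Struct.new(:xrefs, :translation)
XrefGene        = Struct.new(:xrefs, :transcripts)

# Gather the xrefs of the gene, its transcripts and their translations.
def collect_all_xrefs(gene)
  collected = gene.xrefs.dup
  gene.transcripts.each do |transcript|
    collected.concat(transcript.xrefs)
    # Non-coding transcripts have no translation.
    unless transcript.translation.nil?
      collected.concat(transcript.translation.xrefs)
    end
  end
  collected
end

coding     = XrefTranscript.new(['transcript_xref_1'], XrefTranslation.new(['translation_xref_1']))
non_coding = XrefTranscript.new(['transcript_xref_2'], nil)
gene       = XrefGene.new(['gene_xref_1'], [coding, non_coding])

puts collect_all_xrefs(gene)
```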
518
+
519
+ == Coordinates
520
+
521
+ We have already discussed the fact that slices and features have coordinates, but we have not defined exactly what these coordinates mean.
522
+
523
+ Ensembl, and many other bioinformatics applications, use inclusive coordinates which start at 1. The first nucleotide of a DNA sequence is 1 and the first amino acid of a peptide sequence is also 1. The length of a sequence is defined as end - start + 1.
524
+
525
+ In some rare cases inserts are specified with a start which is one greater than the end. For example a feature with a start of 10 and an end of 9 would be a zero length feature between base pairs 9 and 10.
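The coordinate rules above translate directly into a little arithmetic. The sketch below uses plain Ruby strings rather than API objects; feature_length and subsequence are hypothetical helpers written for this example.

```ruby
# Length of a feature with inclusive, 1-based coordinates.
def feature_length(start, stop)
  stop - start + 1
end

# Extract the bases start..stop from a sequence string.
# Ruby strings are 0-based, so the start is shifted by one.
def subsequence(sequence, start, stop)
  sequence[start - 1, feature_length(start, stop)]
end

puts feature_length(1, 10)          # 10
puts feature_length(10, 9)          # 0 (an insert between bases 9 and 10)
puts subsequence('ACGTACGT', 3, 5)  # GTA
```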
526
+
527
+ Slice coordinates are relative to the start of the underlying DNA sequence region (an Ensembl::Core::SeqRegion object). The strand of the slice represents its orientation relative to the default orientation of the sequence region. By convention the start of the slice is always less than the end, and does not vary with its strandedness. Most slices you will encounter will have a strand of 1, and this is what we will consider in our examples. It is legal to create a slice which extends past the boundaries of a sequence region.
528
+
529
+ == Coordinate Systems, Sequence Regions and Slices
530
+
531
+ Sequences stored in Ensembl are associated with coordinate systems, and the set of coordinate systems varies from species to species. For example, the homo_sapiens database has the following coordinate systems: contig, clone, supercontig, chromosome. Sequence and features may be retrieved from any coordinate system even though they are stored internally in only one. The database stores the relationships between these coordinate systems and the API provides means to convert between them. The API has an Ensembl::Core::CoordSystem object and object adaptor; however, these are most often used internally. The following example fetches a chromosome coordinate system object from the database:
532
+
533
+ chr_coord_system = CoordSystem.find_by_name('chromosome')
534
+ puts "Coordinate system: " + chr_coord_system.name + ":" + chr_coord_system.version.to_s
535
+
536
+ A coordinate system is uniquely defined by its name and version. Most coordinate systems do not have a version, and the ones that do have a default version, so it is usually sufficient to use only the name when requesting a coordinate system. For example, chromosome coordinate systems have a version which is the assembly that defined the construction of the coordinate system. The version of the human chromosome coordinate system might be something like NCBI35 or NCBI36, depending on the version of the Core databases used.
537
+
538
+ Ensembl::Core::SeqRegion objects have an associated Ensembl::Core::CoordSystem object and a #name method. The name, together with the coordinate system, uniquely identifies the sequence region. You may have noticed that the coordinate system of the sequence region was specified when obtaining a slice with the #fetch_by_region method. Similarly, the version may also be specified (though it can almost always be omitted):
539
+
540
+ slice = Slice.fetch_by_region('chromosome', 'X', 1000000, 10000000, 'NCBI36')
541
+
542
+ To obtain all sequence regions for a given coordinate system, just call the Ensembl::Core::CoordSystem#seq_regions method.
543
+
544
+ coord_system = CoordSystem.find_by_name('chromosome')
545
+ chromosomes = coord_system.seq_regions
546
+ chromosomes.each do |chr|
547
+   puts chr.name
548
+ end
549
+
550
+ Sometimes it is useful to obtain full slices of every sequence region in a given coordinate system; this may be done using the Slice#fetch_all method:
551
+
552
+ chromosomes = Slice.fetch_all('chromosome')
553
+ clones = Slice.fetch_all('clone')
554
+
555
+ Now suppose that you wish to write code which is independent of the species used. Not all species have the same coordinate systems; the available coordinate systems depend on the style of assembly used for that species (WGS, clone-based, etc.). You can obtain the list of available coordinate systems for a species using the Ensembl::Core::CoordSystem#find(:all) method. There is also a special pseudo-coordinate system named toplevel. The toplevel coordinate system is not a real coordinate system, but is used to refer to the highest level coordinate system in a given region. It is particularly useful in genomes that are incompletely assembled. For example, the latest zebrafish genome consists of a set of assembled chromosomes and a set of supercontigs that are not part of any chromosome. In this example, the toplevel coordinate system sometimes refers to the chromosome coordinate system and sometimes to the supercontig coordinate system, depending on the region it is used in.
556
+
557
+ * List all coordinate systems in this database:
558
+
559
+ coord_systems = CoordSystem.find(:all)
560
+ coord_systems.each do |coord_system|
561
+   puts coord_system.name + "\t" + coord_system.version.to_s
562
+ end
563
+
564
+ * Get all slices on the highest coordinate system:
565
+
566
+ slices = Slice.fetch_all('top_level')
567
+
568
+ == Transform
569
+
570
+ Features on a seq_region in a given coordinate system may be moved to another coordinate system. This is useful if you are working with a particular coordinate system but are interested in obtaining the feature's coordinates in another coordinate system.
571
+
572
+ The Ensembl::Core::Sliceable#transform method (available to all features) can be used to move a feature to any coordinate system which is in the database. The returned feature is a copy of the original feature, but with a different seq_region associated with it, as well as different seq_region_start, seq_region_end and seq_region_strand values.
573
+
574
+ # Suppose original_feature is on the 'chromosome' coordinate system
575
+ new_feature = original_feature.transform('clone')
576
+ if new_feature.nil?
577
+   puts "Feature is not defined in clonal coordinate system"
578
+ else
579
+   puts "Feature's clonal position:"
580
+   puts new_feature.seq_region.name
581
+   puts new_feature.seq_region_start.to_s + ".." + new_feature.seq_region_end.to_s
582
+ end
583
+
584
+ To print out the position of a feature (i.e. the concatenation of the seq_region name, start and end), it is easier to create a slice of it first and then call the Ensembl::Core::Slice#display_name method:
585
+
586
+ puts new_feature.slice.display_name
587
+
588
+ The transform method returns a copy of the original feature in the new coordinate system, or nil if the feature is not defined in that coordinate system. A feature is considered to be undefined in a coordinate system if it overlaps an undefined region or if it crosses a coordinate system boundary. Take for example the tiling path relationship between chromosome and contig coordinate systems:
589
+
590
+                   |~~~~~~~| (Feature A) |~~~~| (Feature B)
591
+
592
+ (ctg 1) [=============]
593
+         (ctg 2) (------==========] (ctg 2)
594
+                        (ctg 3) (--============] (ctg 3)
595
+
596
+ Both Feature A and Feature B are defined in the chromosomal coordinate system described by the tiling path of contigs. However, Feature A is not defined in the contig coordinate system because it spans both Contig 1 and Contig 2. Feature B, on the other hand, is still defined in the contig coordinate system.
597
+
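The rule illustrated by the diagram can be sketched as a toy model in plain Ruby. This is not the Ensembl API: Contig, TILING_PATH and toy_transform are hypothetical names, the chromosomal ranges are made-up numbers, and the sketch assumes every contig is used from its first base on the forward strand.

```ruby
# Toy model of a tiling path: plain Ruby, not the Ensembl API.
# Contig and toy_transform are hypothetical names; the chromosomal
# ranges below are made-up numbers chosen to resemble the diagram.
Contig = Struct.new(:name, :chr_start, :chr_end)

TILING_PATH = [
  Contig.new('ctg1', 1, 100),
  Contig.new('ctg2', 101, 200),
  Contig.new('ctg3', 201, 300)
]

# Returns [contig_name, start, end] in contig coordinates, or nil when the
# feature is not fully contained in a single contig.
def toy_transform(feat_start, feat_end, path = TILING_PATH)
  contig = path.find { |c| feat_start >= c.chr_start && feat_end <= c.chr_end }
  return nil if contig.nil?
  offset = contig.chr_start - 1
  [contig.name, feat_start - offset, feat_end - offset]
end

p toy_transform(90, 120)   # like Feature A: spans ctg1 and ctg2, so nil
p toy_transform(230, 250)  # like Feature B: lies entirely inside ctg3
```

In this model a feature that crosses a contig boundary yields nil, which mirrors the behaviour described for #transform.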
598
+ The special toplevel coordinate system can also be used in this instance to move the feature to the highest possible coordinate system in a given region:
599
+
600
+ new_feature = original_feature.transform('toplevel')
601
+ puts new_feature.slice.display_name
602
+
603
+ *NOTE*: In contrast to the Perl API, there is no #transfer method.
604
+
605
+ == Project
606
+
607
+ When moving features between coordinate systems it is usually sufficient to use the Ensembl::Core::Sliceable#transform method. Sometimes, however, it is necessary to obtain coordinates in another coordinate system even when a coordinate system boundary is crossed. Even though the feature is considered to be undefined in this case, the feature's coordinates can still be obtained in the requested coordinate system using the Slice#project method.
608
+
609
+ While #transform is a method only available to features, both slices and features have their own #project methods, which take the same arguments and have the same return values. The #project method takes a coordinate system name as an argument and returns an array of Slice and Gap objects. The following example illustrates the use of the #project method on a slice. The #project method on a feature can be used in the same way. As with the feature #transform method, the pseudo-coordinate system toplevel can be used to indicate that you wish to project to the highest possible level.
610
+
611
+ original_slice = Slice.fetch_by_region('chromosome', '4', 329500, 380000)
612
+ target_slices = original_slice.project('contig')
613
+ target_slices.each do |ts|
614
+   puts ts.display_name
615
+ end
616
+
617
+ The above returns (for Bos taurus):
618
+ contig::AAFC03092598:60948:61145:1
619
+ contig::AAFC03118261:25411:37082:1
620
+ contig::AAFC03092594:1:3622:-1
621
+ contig:gap:50
622
+ contig::AAFC03092597:820:35709:-1
623
+ contig::AAFC03032210:13347:13415:1
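
How a projection yields a mixed list of slices and gaps can be sketched with a toy model in plain Ruby. This is not the Ensembl API and not how the library implements #project: Piece, Gap, TOY_PATH and toy_project are hypothetical names, the coordinates are made up, and strand is ignored.

```ruby
# Toy model of projection: plain Ruby, not the Ensembl API.
# Piece, Gap and toy_project are hypothetical names; coordinates are made up.
Piece = Struct.new(:contig, :start, :stop)
Gap   = Struct.new(:gap_length)

# Each entry: [contig name, chromosomal start, chromosomal end].
# Chromosomal positions 201..250 are covered by no contig.
TOY_PATH = [['ctg1', 1, 100], ['ctg2', 101, 200], ['ctg3', 251, 400]]

# Splits the chromosomal region from..to into contig pieces and gaps.
def toy_project(from, to, path = TOY_PATH)
  result = []
  pos = from
  path.each do |name, c_start, c_end|
    next if c_end < pos      # contig lies before the remaining region
    break if c_start > to    # contig lies after the region
    result << Gap.new(c_start - pos) if c_start > pos
    lo = [pos, c_start].max
    hi = [to, c_end].min
    result << Piece.new(name, lo - c_start + 1, hi - c_start + 1)
    pos = hi + 1
  end
  result << Gap.new(to - pos + 1) if pos <= to
  result
end

toy_project(90, 260).each do |part|
  if part.is_a?(Gap)
    puts "gap:#{part.gap_length}"
  else
    puts "#{part.contig}:#{part.start}:#{part.stop}"
  end
end
```

Unlike the toy transform idea, where a region crossing a boundary has no answer, the projection always returns something: one piece per contig touched, with a gap entry wherever no contig covers the region.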