bioroebe 0.10.80 → 0.11.24
Potentially problematic release.
This version of bioroebe might be problematic.
- checksums.yaml +4 -4
- data/README.md +1204 -772
- data/bioroebe.gemspec +3 -3
- data/doc/README.gen +1203 -771
- data/doc/todo/bioroebe_todo.md +391 -365
- data/lib/bioroebe/aminoacids/aminoacid_substitution.rb +1 -9
- data/lib/bioroebe/aminoacids/codon_percentage.rb +1 -9
- data/lib/bioroebe/aminoacids/deduce_aminoacid_sequence.rb +1 -9
- data/lib/bioroebe/aminoacids/display_aminoacid_table.rb +1 -0
- data/lib/bioroebe/aminoacids/show_hydrophobicity.rb +1 -6
- data/lib/bioroebe/base/colours_for_base/colours_for_base.rb +18 -8
- data/lib/bioroebe/base/commandline_application/commandline_arguments.rb +13 -11
- data/lib/bioroebe/base/commandline_application/misc.rb +18 -8
- data/lib/bioroebe/base/misc.rb +16 -0
- data/lib/bioroebe/base/prototype/misc.rb +1 -1
- data/lib/bioroebe/codons/show_codon_tables.rb +6 -2
- data/lib/bioroebe/codons/show_codon_usage.rb +2 -1
- data/lib/bioroebe/constants/aminoacids_and_proteins.rb +1 -0
- data/lib/bioroebe/constants/database_constants.rb +1 -1
- data/lib/bioroebe/constants/files_and_directories.rb +20 -1
- data/lib/bioroebe/constants/misc.rb +20 -0
- data/lib/bioroebe/count/count_amount_of_nucleotides.rb +3 -0
- data/lib/bioroebe/crystal/README.md +2 -0
- data/lib/bioroebe/crystal/to_rna.cr +19 -0
- data/lib/bioroebe/data/README.md +11 -8
- data/lib/bioroebe/data/electron_microscopy/pos_example.pos +396 -0
- data/lib/bioroebe/data/electron_microscopy/test_particles.star +36 -0
- data/lib/bioroebe/{shell/tk.rb → electron_microscopy/electron_microscopy_module.rb} +15 -10
- data/lib/bioroebe/electron_microscopy/simple_star_file_generator.rb +4 -9
- data/lib/bioroebe/fasta_and_fastq/show_fasta_headers.rb +27 -12
- data/lib/bioroebe/genome/README.md +4 -0
- data/lib/bioroebe/genome/genome.rb +67 -0
- data/lib/bioroebe/gui/gtk3/protein_to_DNA/protein_to_DNA.rb +18 -18
- data/lib/bioroebe/gui/gtk3/random_sequence/random_sequence.rb +19 -11
- data/lib/bioroebe/gui/shared_code/protein_to_DNA/protein_to_DNA_module.rb +14 -14
- data/lib/bioroebe/misc/ruler.rb +1 -0
- data/lib/bioroebe/parsers/genbank_parser.rb +353 -24
- data/lib/bioroebe/parsers/gff.rb +1 -9
- data/lib/bioroebe/pdb/parse_pdb_file.rb +1 -9
- data/lib/bioroebe/project/project.rb +1 -1
- data/lib/bioroebe/python/README.md +1 -0
- data/lib/bioroebe/python/__pycache__/mymodule.cpython-39.pyc +0 -0
- data/lib/bioroebe/python/gui/gtk3/all_in_one.css +4 -0
- data/lib/bioroebe/python/gui/gtk3/all_in_one.py +59 -0
- data/lib/bioroebe/python/gui/gtk3/widget1.py +20 -0
- data/lib/bioroebe/python/gui/tkinter/all_in_one.py +91 -0
- data/lib/bioroebe/python/mymodule.py +8 -0
- data/lib/bioroebe/python/protein_to_dna.py +33 -0
- data/lib/bioroebe/python/shell/shell.py +19 -0
- data/lib/bioroebe/python/to_rna.py +14 -0
- data/lib/bioroebe/python/toplevel_methods/open_in_browser.py +20 -0
- data/lib/bioroebe/python/toplevel_methods/palindromes.py +42 -0
- data/lib/bioroebe/python/toplevel_methods/rds.py +13 -0
- data/lib/bioroebe/python/toplevel_methods/three_delimiter.py +34 -0
- data/lib/bioroebe/python/toplevel_methods/time_and_date.py +43 -0
- data/lib/bioroebe/python/toplevel_methods/to_camelcase.py +11 -0
- data/lib/bioroebe/requires/require_the_bioroebe_project.rb +3 -1
- data/lib/bioroebe/sequence/nucleotide_module/nucleotide_module.rb +28 -25
- data/lib/bioroebe/sequence/protein.rb +105 -3
- data/lib/bioroebe/sequence/sequence.rb +61 -2
- data/lib/bioroebe/shell/menu.rb +3451 -3366
- data/lib/bioroebe/shell/misc.rb +51 -4311
- data/lib/bioroebe/shell/readline/readline.rb +1 -1
- data/lib/bioroebe/shell/shell.rb +11192 -28
- data/lib/bioroebe/siRNA/siRNA.rb +81 -1
- data/lib/bioroebe/string_matching/find_longest_substring.rb +3 -2
- data/lib/bioroebe/taxonomy/class_methods.rb +3 -8
- data/lib/bioroebe/taxonomy/constants.rb +4 -3
- data/lib/bioroebe/taxonomy/edit.rb +2 -1
- data/lib/bioroebe/taxonomy/help/help.rb +10 -10
- data/lib/bioroebe/taxonomy/info/check_available.rb +15 -9
- data/lib/bioroebe/taxonomy/info/info.rb +17 -2
- data/lib/bioroebe/taxonomy/info/is_dna.rb +46 -36
- data/lib/bioroebe/taxonomy/interactive.rb +139 -95
- data/lib/bioroebe/taxonomy/menu.rb +27 -18
- data/lib/bioroebe/taxonomy/parse_fasta.rb +3 -1
- data/lib/bioroebe/taxonomy/shared.rb +1 -0
- data/lib/bioroebe/taxonomy/taxonomy.rb +1 -0
- data/lib/bioroebe/toplevel_methods/aminoacids_and_proteins.rb +31 -24
- data/lib/bioroebe/toplevel_methods/databases.rb +1 -1
- data/lib/bioroebe/toplevel_methods/fasta_and_fastq.rb +101 -63
- data/lib/bioroebe/toplevel_methods/misc.rb +17 -16
- data/lib/bioroebe/toplevel_methods/nucleotides.rb +22 -5
- data/lib/bioroebe/toplevel_methods/open_in_browser.rb +2 -0
- data/lib/bioroebe/toplevel_methods/palindromes.rb +1 -2
- data/lib/bioroebe/toplevel_methods/taxonomy.rb +2 -2
- data/lib/bioroebe/toplevel_methods/to_camelcase.rb +5 -0
- data/lib/bioroebe/utility_scripts/align_open_reading_frames.rb +1 -9
- data/lib/bioroebe/utility_scripts/check_for_mismatches/check_for_mismatches.rb +1 -9
- data/lib/bioroebe/utility_scripts/compacter.rb +1 -9
- data/lib/bioroebe/utility_scripts/compseq/compseq.rb +1 -9
- data/lib/bioroebe/utility_scripts/create_batch_entrez_file.rb +1 -9
- data/lib/bioroebe/utility_scripts/dot_alignment.rb +1 -9
- data/lib/bioroebe/utility_scripts/move_file_to_its_correct_location.rb +1 -4
- data/lib/bioroebe/utility_scripts/showorf/constants.rb +0 -5
- data/lib/bioroebe/utility_scripts/showorf/reset.rb +1 -4
- data/lib/bioroebe/version/version.rb +2 -2
- data/lib/bioroebe/www/embeddable_interface.rb +101 -52
- data/lib/bioroebe/www/sinatra/sinatra.rb +186 -70
- data/lib/bioroebe/yaml/aminoacids/amino_acids_long_name_to_one_letter.yml +2 -2
- data/lib/bioroebe/yaml/configuration/browser.yml +1 -1
- data/lib/bioroebe/yaml/genomes/README.md +3 -4
- data/lib/bioroebe/yaml/restriction_enzymes/restriction_enzymes.yml +3 -3
- metadata +32 -35
- data/doc/setup.rb +0 -1655
- data/lib/bioroebe/genbank/genbank_parser.rb +0 -291
- data/lib/bioroebe/shell/add.rb +0 -108
- data/lib/bioroebe/shell/assign.rb +0 -360
- data/lib/bioroebe/shell/chop_and_cut.rb +0 -281
- data/lib/bioroebe/shell/constants.rb +0 -166
- data/lib/bioroebe/shell/download.rb +0 -335
- data/lib/bioroebe/shell/enable_and_disable.rb +0 -158
- data/lib/bioroebe/shell/enzymes.rb +0 -310
- data/lib/bioroebe/shell/fasta.rb +0 -345
- data/lib/bioroebe/shell/gtk.rb +0 -76
- data/lib/bioroebe/shell/history.rb +0 -132
- data/lib/bioroebe/shell/initialize.rb +0 -217
- data/lib/bioroebe/shell/loop.rb +0 -74
- data/lib/bioroebe/shell/prompt.rb +0 -107
- data/lib/bioroebe/shell/random.rb +0 -289
- data/lib/bioroebe/shell/reset.rb +0 -335
- data/lib/bioroebe/shell/scan_and_parse.rb +0 -135
- data/lib/bioroebe/shell/search.rb +0 -337
- data/lib/bioroebe/shell/sequences.rb +0 -200
- data/lib/bioroebe/shell/show_report_and_display.rb +0 -2901
- data/lib/bioroebe/shell/startup.rb +0 -127
- data/lib/bioroebe/shell/taxonomy.rb +0 -14
- data/lib/bioroebe/shell/user_input.rb +0 -88
- data/lib/bioroebe/shell/xorg.rb +0 -45
data/README.md
CHANGED
@@ -2,13 +2,13 @@
|
|
2
2
|
[![forthebadge](https://forthebadge.com/images/badges/made-with-ruby.svg)](https://www.ruby-lang.org/en/)
|
3
3
|
[![Gem Version](https://badge.fury.io/rb/bioroebe.svg)](https://badge.fury.io/rb/bioroebe)
|
4
4
|
|
5
|
-
This gem was <b>last updated</b> on the <span style="color: darkblue; font-weight: bold">
|
5
|
+
This gem was <b>last updated</b> on the <span style="color: darkblue; font-weight: bold">03.08.2022</span> (dd.mm.yyyy notation), at <span style="color: steelblue; font-weight: bold">23:23:28</span> o'clock.
|
6
6
|
|
7
7
|
# The Bioroebe Project
|
8
8
|
|
9
9
|
## Bioroebe
|
10
10
|
|
11
|
-
<img src="
|
11
|
+
<img src="https://i.imgur.com/mAoP7AP.png">
|
12
12
|
<img src="https://i.imgur.com/YqYxRBZ.png" style="margin: 4px; margin-left: 12px;"/>
|
13
13
|
<img src="https://i.imgur.com/k7mMlg2.png" style="margin: 4px; margin-left: 12px;"/>
|
14
14
|
|
@@ -335,41 +335,6 @@ so I opted to go the yaml route. But if people want to use a hash
|
|
335
335
|
instead, they can do so, too - see the <b>API</b> for codon tables
|
336
336
|
later on. Simply define your own constants and pass them to the
|
337
337
|
appropriate methods.
|
338
|
-
|
339
|
-
## Support for other programming languages
|
340
|
-
|
341
|
-
The main programming language for the bioroebe project is **ruby**.
|
342
|
-
Ruby, from a language design point of view, is a great programming
|
343
|
-
language - not necessarily all of ruby, but the subset that I use.
|
344
|
-
It is very easy to quickly prototype ideas via ruby.
|
345
|
-
|
346
|
-
However had, ruby is known to **not** be among the fastest programming
|
347
|
-
languages about on this planet; so, it makes sense to use other
|
348
|
-
languages too from this point of view. Additionally there are some
|
349
|
-
software stacks in use in **other** programming languages, such as
|
350
|
-
matplotlib and various more.
|
351
|
-
|
352
|
-
Thus, it is important to **support other programming languages** as
|
353
|
-
well, if there are useful libraries. The bioroebe project, after
|
354
|
-
all, tries to be **practical**: it focuses on getting things done,
|
355
|
-
no matter the language.
|
356
|
-
|
357
|
-
This means that support for other programming languages can be
|
358
|
-
found in this project as well, often using system() or similar
|
359
|
-
functionality to tap into these other programming languages. Do
|
360
|
-
not be surprised when that happens - the bioroebe project will
|
361
|
-
also try to act as a **practical glue** towards functionality
|
362
|
-
enabled via other projects. We want to get things done, no
|
363
|
-
matter the programming language at hand!
|
364
|
-
|
365
|
-
Whenever possible, though, the bioroebe project will try to be
|
366
|
-
flexible in this regard, so ideally the same solution should
|
367
|
-
work for many different programming languages.
|
368
|
-
|
369
|
-
While Ruby is the primary language for this project, since as
|
370
|
-
of 2021 I will try to officially support **java**, **jruby**
|
371
|
-
and the **GraalVM**. This is on my TODO list, though - stay
|
372
|
-
tuned for more updates in this regard.
|
373
338
|
|
374
339
|
## Readline support in the BioRoebe project
|
375
340
|
|
@@ -553,16 +518,16 @@ the DNA-to-Protein translation is somewhat simply kept as a
|
|
553
518
|
Once you are inside a **running Bioshell**, you can do other **commands**
|
554
519
|
such as this one here:
|
555
520
|
|
556
|
-
random # ← This will generate a random DNA sequence.
|
521
|
+
random # ← This will generate a random DNA sequence. Each nucleotide has the same chance to be added.
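The uniform choice mentioned in the comment above can be sketched in plain ruby. This is only an illustration and not the code the bioshell runs; the constant `NUCLEOTIDES` and the method name `random_dna_sketch` are assumptions made up for this example.

```ruby
# Hypothetical sketch, independent of the Bioroebe API.
NUCLEOTIDES = %w[A C G T].freeze

# Build a DNA string of the given length; Array#sample picks each
# of the four nucleotides with the same chance.
def random_dna_sketch(length = 20)
  Array.new(length) { NUCLEOTIDES.sample }.join
end

puts random_dna_sketch(30)
```

Weighted randomness would need a different selection step; the sketch covers the uniform case only.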
|
557
522
|
|
558
523
|
To **assign** a DNA sequence, do:
|
559
524
|
|
560
525
|
assign ATAGGGCTTTT
|
561
526
|
|
562
|
-
Note that since the year 2016
|
563
|
-
the one above, without any other commands/words, then we will assume
|
527
|
+
Note that as of the year <b>2016</b>, if you input a nucleotide sequence
|
528
|
+
like the one above, without any other commands/words, then we will assume
|
564
529
|
that you did mean to do an assignment as-is anyway. The "assign" part
|
565
|
-
then becomes superfluous.
|
530
|
+
then becomes superfluous and can be omitted.
|
566
531
|
|
567
532
|
This is how this is simply done, by omitting the "assign" part of the
|
568
533
|
above instruction altogether:
|
@@ -1073,18 +1038,18 @@ The text **banana** thus has the following suffixes:
|
|
1073
1038
|
|
1074
1039
|
This subsection deals with some aspects of **HMMs**.
|
1075
1040
|
|
1076
|
-
Why are HMMs useful in biology? They can be used to represent protein
|
1077
|
-
families
|
1041
|
+
Why are HMMs useful in biology? They can be used to <b>represent protein
|
1042
|
+
families</b>, for example (via <b>pHMMs</b> - profile hidden Markov models).
|
1078
1043
|
|
1079
1044
|
Furthermore, they can show some bias in the mutation rate that can be
|
1080
1045
|
observed. Different genomes are known to have different hotspots where
|
1081
|
-
mutations are more likely to happen. These are
|
1082
|
-
may be useful.
|
1046
|
+
mutations are more likely to happen, for various reasons. These are
|
1047
|
+
examples where an HMM may be useful.
|
1083
1048
|
|
1084
|
-
HMMs are usually based on the Shannon model where you assign different
|
1049
|
+
HMMs are usually based on the <b>Shannon model</b> where you assign different
|
1085
1050
|
probabilities to "change" events. An example that was mentioned back
|
1086
|
-
in 1948 was the english alphabet - some letters, and combinations
|
1087
|
-
letters, are more commonly seen. Shannon gave the example of "E"
|
1051
|
+
in <b>1948</b> was the English alphabet - some letters, and combinations
|
1052
|
+
of letters, are more commonly seen. Shannon gave the example of "E"
|
1088
1053
|
versus "W", as shown in the following graph (a **finite state
|
1089
1054
|
graph**):
|
1090
1055
|
|
@@ -1098,40 +1063,47 @@ DNA sequence, a 10-mer would be equivalent to **10 base pairs**.
|
|
1098
1063
|
The individual transition states are based on an assumption of
|
1099
1064
|
"randomness", but ensuring that these are truly random is not
|
1100
1065
|
necessarily trivial. Computers do not really 'generate' true
|
1101
|
-
randomness, at the least not when they are working solo
|
1102
|
-
can even 'predict' some randomness here or there
|
1103
|
-
|
1104
|
-
|
1105
|
-
|
1106
|
-
|
1107
|
-
of
|
1108
|
-
|
1109
|
-
given position, but this is not
|
1110
|
-
|
1111
|
-
|
1112
|
-
|
1113
|
-
|
1114
|
-
|
1115
|
-
|
1066
|
+
randomness, at the least not when they are working solo, "on
|
1067
|
+
their own". You can even 'predict' some randomness here or there
|
1068
|
+
via various techniques - see vulnerabilities such as <b>Spectre</b>
|
1069
|
+
or similar variants where software can read from areas of the
|
1070
|
+
memory that should be inaccessible to them. Some of this is based
|
1071
|
+
on co-predictions. For distributed computers, you may often use
|
1072
|
+
random noise or decay of atoms as 'a source of randomness'. For
|
1073
|
+
any DNA nucleotide sequence, we would assume that each base pair
|
1074
|
+
has a 25% chance to exist at any given position, but this is not
|
1075
|
+
necessarily true, again for various reasons.
|
1076
|
+
|
1077
|
+
An interesting thought is ... why is <b>ATP</b> so important?
|
1078
|
+
Yes, of course due to it being 'the energy currency in a cell' but ..
|
1079
|
+
why is this ATP, aka adenine? Why not GTP, aka guanine or any of
|
1080
|
+
the other two nucleotides? (GTP is used too, but why? Why not
|
1081
|
+
CTP and TTP?) I cannot answer this question; there may
|
1082
|
+
be many reasons, including differential chemical storage power
|
1083
|
+
as well as a mere random chance event in evolution, but for whatever
|
1116
1084
|
the reason, you will not find a complete 25% percentage value
|
1117
1085
|
for every given "slot" in DNA, depending on the organism.
|
1118
1086
|
|
1119
1087
|
From a practical point of view, how can we approach Hidden Markov
|
1120
|
-
Models?
|
1088
|
+
Models and use them?
|
1121
1089
|
|
1122
|
-
Let's take the following sequence:
|
1090
|
+
Let's take the following simple sequence:
|
1123
1091
|
|
1124
1092
|
ACGTACGC
|
1125
1093
|
|
1126
1094
|
From this sequence we can see that the <b>3-mer</b> "ACG"
|
1127
1095
|
is followed by either a T, or a C. Have a look at the sequence
|
1128
|
-
to see if you can identify the two ACG subsequences
|
1096
|
+
again to see if you can identify the two ACG subsequences
|
1097
|
+
there. You can see one at the start, and the other one
|
1098
|
+
following a bit later, hence why we come to the conclusion
|
1099
|
+
that either a T or a C will follow this <b>3-mer</b>.
|
1129
1100
|
|
1130
|
-
The probability of either T or C
|
1131
|
-
for A and G to follow there
|
1132
|
-
be ignored.
|
1101
|
+
The probability of either T or C to occur on <b>that</b>
|
1102
|
+
position, thus, is 0.5 (50%); for A and G to follow there
|
1103
|
+
is 0% so the latter two can be ignored.
|
1133
1104
|
|
1134
|
-
Thus, we could use a ruby Hash as follows
|
1105
|
+
Thus, we could use a ruby Hash as follows that should
|
1106
|
+
describe these probabilities:
|
1135
1107
|
|
1136
1108
|
probabilities = {'T': 0.5, 'C': 0.5} # ignoring A and G here, but we could denote them via 0 as well
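Such a Hash does not have to be written by hand. The following is a minimal sketch, in plain ruby and independent of the Bioroebe API, that counts which nucleotide follows each occurrence of a k-mer and converts the counts into probabilities. The method name `followup_probabilities` is hypothetical, and the result uses String keys rather than the Symbol keys of the literal shown above.

```ruby
# Hypothetical helper, not part of Bioroebe: determine how often each
# nucleotide follows the given k-mer within the sequence.
def followup_probabilities(sequence, kmer)
  counts = Hash.new(0)
  # Only visit start positions that still leave one trailing character.
  (0..(sequence.size - kmer.size - 1)).each do |index|
    next unless sequence[index, kmer.size] == kmer
    counts[sequence[index + kmer.size]] += 1
  end
  total = counts.values.sum
  # Convert the raw counts into relative frequencies.
  counts.transform_values { |count| count.to_f / total }
end

# For ACGTACGC and the 3-mer ACG, both T and C end up with 0.5.
p followup_probabilities('ACGTACGC', 'ACG')
```

Nucleotides that never follow the k-mer (A and G here) are simply absent from the result, which matches the choice to ignore them.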
|
1137
1109
|
|
@@ -1217,34 +1189,6 @@ each edge.
|
|
1217
1189
|
Parsimony assumes that substitutions are rare and that back-mutations
|
1218
1190
|
do not occur.
|
1219
1191
|
|
1220
|
-
## Random stuff
|
1221
|
-
|
1222
|
-
You can generate random DNA sequences in the shell:
|
1223
|
-
|
1224
|
-
random dna 20
|
1225
|
-
random dna 25
|
1226
|
-
random dna 30
|
1227
|
-
|
1228
|
-
This will generate random DNA sequences, with a length
|
1229
|
-
of 20, 25, 30, respectively. This may not be very useful
|
1230
|
-
but it was important that this functionality is made
|
1231
|
-
available somewhere.
|
1232
|
-
|
1233
|
-
You can also use some toplevel-methods to generate, e. g.
|
1234
|
-
20 random aminoacids:
|
1235
|
-
|
1236
|
-
Bioroebe.random_aminoacid? 20 # => "UAVHYQQESWUYAOVESEIY"
|
1237
|
-
|
1238
|
-
Note that there may exist other APIs within the Bioroebe project
|
1239
|
-
that do the same as well.
|
1240
|
-
|
1241
|
-
If you would like to use a ruby-gtk3 widget have a look
|
1242
|
-
at **RandomSequence**, under **bioroebe/gtk3/random_sequence/**.
|
1243
|
-
It works with aminoacids, DNA and RNA, and allows the user to
|
1244
|
-
create random sequences. (If you need weighted randomness then
|
1245
|
-
you currently have to use the commandline variant. Perhaps I may
|
1246
|
-
add support into the GUI directly for this one day.)
|
1247
|
-
|
1248
1192
|
## Displaying the main sequence with delimiter characters
|
1249
1193
|
|
1250
1194
|
From within the <b>bioshell</b>, you can use some alternative ways to
|
@@ -1486,24 +1430,9 @@ You can simulate this via the following API:
|
|
1486
1430
|
Bioroebe.cleave_with_trypsin(sequence_goes_in_here)
|
1487
1431
|
Bioroebe.cleave :with_trypsin, sequence_goes_in_here
|
1488
1432
|
|
1489
|
-
Currently (July 2021) only support for Trypsin is included, but
|
1433
|
+
Currently (<b>July 2021</b>) only support for Trypsin is included, but
|
1490
1434
|
in the long run the goal is to add as many digestive (peptide-bond
|
1491
1435
|
cleaving) enzymes here as possible.
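How Bioroebe implements the cleavage internally is not shown here. As a rough illustration, the commonly cited trypsin rule (cut after lysine K or arginine R, unless the next residue is proline P) can be sketched in plain ruby. The method name `cleave_with_trypsin_sketch` is an assumption for this example and is not the Bioroebe API.

```ruby
# Hypothetical sketch of the textbook trypsin rule, independent of Bioroebe:
# cleave C-terminal to K or R, but not when the following residue is P.
def cleave_with_trypsin_sketch(sequence)
  fragments = []
  current = +''
  residues = sequence.upcase.chars
  residues.each_with_index do |aminoacid, index|
    current << aminoacid
    if %w[K R].include?(aminoacid) && residues[index + 1] != 'P'
      fragments << current
      current = +''
    end
  end
  # Keep whatever remains after the last cleavage site.
  fragments << current unless current.empty?
  fragments
end

p cleave_with_trypsin_sketch('AKRPGKLM')
```

In this example the R is followed by P, so no cut happens there and the result has three fragments: AK, RPGK and LM.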
|
1492
|
-
|
1493
|
-
## Freezing the main sequence - and unfreezing it again
|
1494
|
-
|
1495
|
-
You can **freeze** the BioShell, meaning that it will no longer allow
|
1496
|
-
for the main sequence to be modified, via:
|
1497
|
-
|
1498
|
-
freeze
|
1499
|
-
|
1500
|
-
To unfreeze again, issue:
|
1501
|
-
|
1502
|
-
unfreeze
|
1503
|
-
|
1504
|
-
This functionality has been added because the shell may sometimes be
|
1505
|
-
quite eager to change the main sequence, so we needed a way to disable
|
1506
|
-
any further modifications (until "unfreeze" is issued that is).
|
1507
1436
|
|
1508
1437
|
## MUMmer
|
1509
1438
|
|
@@ -2714,18 +2643,6 @@ This may look as follows:
|
|
2714
2643
|
|
2715
2644
|
<img src="https://i.imgur.com/gAZg8qG.png" style="margin: 1em; margin-left: 3em">
|
2716
2645
|
|
2717
|
-
## Obtaining a subsequence from a Bioroebe::Sequence object
|
2718
|
-
|
2719
|
-
Say that you have the DNA sequence **ATGCATGCAAAA**.
|
2720
|
-
|
2721
|
-
There are several ways how to obtain a subsequence from
|
2722
|
-
this. One variant will be shown next, by making use of
|
2723
|
-
the method called **.subseq()**.
|
2724
|
-
|
2725
|
-
Example:
|
2726
|
-
|
2727
|
-
seq = Bioroebe::Sequence.new("ATGCATGCAAAA"); seq.subseq(1,3) # => "ATG"
|
2728
|
-
|
2729
2646
|
## Bioroebe::Protein
|
2730
2647
|
|
2731
2648
|
This class is a subclass of class **Bioroebe::Sequence**. The
|
@@ -2740,15 +2657,26 @@ functionality is also available in another method.
|
|
2740
2657
|
For now keep this in mind; at some later point I may decide whether
|
2741
2658
|
this class is to be kept or not.
|
2742
2659
|
|
2743
|
-
|
2660
|
+
In July 2022 I noticed that the bio-gem has the following method:
|
2744
2661
|
|
2745
|
-
|
2746
|
-
any of the following:
|
2662
|
+
p Bio::AminoAcid['A'] # => "Ala"
|
2747
2663
|
|
2748
|
-
|
2749
|
-
|
2750
|
-
|
2751
|
-
|
2664
|
+
I liked this functionality, but class Bioroebe::Protein already
|
2665
|
+
has a [] method which is used to instantiate a new
|
2666
|
+
instance of class Bioroebe::Protein. So, a toplevel method
|
2667
|
+
was added instead.
|
2668
|
+
|
2669
|
+
Usage example:
|
2670
|
+
|
2671
|
+
Bioroebe::Aminoacids.one_to_three('A') # => Ala
|
2672
|
+
|
2673
|
+
So this is the equivalent to what the bio-gem does, more or
|
2674
|
+
less.
|
2675
|
+
|
2676
|
+
If you want to find out the name of a one-letter aminoacid
|
2677
|
+
you can also use this method:
|
2678
|
+
|
2679
|
+
Bioroebe::Protein.name('A') # => "alanine"
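The idea behind such a one-letter to three-letter conversion is a plain table lookup. The following sketch shows this for the twenty standard aminoacids; the module name `AminoacidSketch` is made up for this example and does not reflect how Bioroebe stores its data (the changed files in this release suggest Bioroebe keeps such tables in yaml files).

```ruby
# Hypothetical sketch, independent of Bioroebe and of the bio-gem.
module AminoacidSketch
  ONE_TO_THREE = {
    'A' => 'Ala', 'R' => 'Arg', 'N' => 'Asn', 'D' => 'Asp', 'C' => 'Cys',
    'Q' => 'Gln', 'E' => 'Glu', 'G' => 'Gly', 'H' => 'His', 'I' => 'Ile',
    'L' => 'Leu', 'K' => 'Lys', 'M' => 'Met', 'F' => 'Phe', 'P' => 'Pro',
    'S' => 'Ser', 'T' => 'Thr', 'W' => 'Trp', 'Y' => 'Tyr', 'V' => 'Val'
  }.freeze

  # Returns the three-letter code, or nil for unknown input.
  def self.one_to_three(letter)
    ONE_TO_THREE[letter.to_s.upcase]
  end
end

puts AminoacidSketch.one_to_three('A') # prints Ala
```

Using a module-level method rather than `[]` mirrors the reasoning above: `[]` stays free for other purposes.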
|
2752
2680
|
|
2753
2681
|
## Decoding aminoacids
|
2754
2682
|
|
@@ -2934,27 +2862,6 @@ Note that presently (April 2020) not all of PROSITE may be supported
|
|
2934
2862
|
via this regex, but in the long run the plan is to support all
|
2935
2863
|
of PROSITE's regex expression.
|
2936
2864
|
|
2937
|
-
## Determining how many stop codons exist in a given sequence
|
2938
|
-
|
2939
|
-
You can use **bin/n_stop_codons_in_this_sequence** to determine
|
2940
|
-
how many stop codons exist in a given sequence at hand.
|
2941
|
-
|
2942
|
-
Usage example from the commandline:
|
2943
|
-
|
2944
|
-
n_stop_codons_in_this_sequence ATGACGTACGTCAGTCAGTGATAGTAA # => 4
|
2945
|
-
|
2946
|
-
You can also separate these via a ' ' spacer on the commandline of
|
2947
|
-
course:
|
2948
|
-
|
2949
|
-
n_stop_codons_in_this_sequence ATG ACG TAC GTC AGT CAG TGA TAG TAA # => 4
|
2950
|
-
|
2951
|
-
Internally this makes use of the method called
|
2952
|
-
<b>Bioroebe.n_stop_codons_in_this_sequence?</b> or one of its
|
2953
|
-
aliased names. Usage example for the method, just as in the
|
2954
|
-
first example shown above:
|
2955
|
-
|
2956
|
-
Bioroebe.n_stop_codons_in_this_sequence "ATGACGTACGTCAGTCAGTGATAGTAA" # => 4
|
2957
|
-
|
2958
2865
|
## AT and GC content
|
2959
2866
|
![alt text][cat1]
|
2960
2867
|
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
@@ -3176,47 +3083,45 @@ can try to use:
|
|
3176
3083
|
On class Bioroebe::Sequence. More customizability may be added
|
3177
3084
|
to that method in this regard, if users need this.
|
3178
3085
|
|
3179
|
-
|
3086
|
+
### Obtaining a subsequence from a Bioroebe::Sequence object
|
3180
3087
|
|
3181
|
-
|
3182
|
-
the **bioshell**.
|
3088
|
+
Say that you have the DNA sequence **ATGCATGCAAAA**.
|
3183
3089
|
|
3184
|
-
|
3090
|
+
There are several ways how to obtain a subsequence from
|
3091
|
+
this. One variant will be shown next, by making use of
|
3092
|
+
the method called **.subseq()**.
|
3185
3093
|
|
3186
|
-
|
3094
|
+
Example:
|
3187
3095
|
|
3188
|
-
|
3096
|
+
seq = Bioroebe::Sequence.new("ATGCATGCAAAA"); seq.subseq(1,3) # => "ATG"
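Judging from the example above, the positions appear to be 1-based and inclusive: positions 1 to 3 yield the first three nucleotides. A minimal sketch of that convention on a plain ruby String follows; the method name `subseq_sketch` is hypothetical and this is not the actual implementation in Bioroebe::Sequence.

```ruby
# Hypothetical sketch: 1-based, inclusive subsequence on a plain String.
def subseq_sketch(sequence, start_position, end_position)
  # Ruby strings are 0-based, so shift both positions by one.
  sequence[(start_position - 1)..(end_position - 1)]
end

puts subseq_sketch('ATGCATGCAAAA', 1, 3) # prints ATG
```

Biologists usually count nucleotides from 1, which is why the shift by one is hidden inside the method.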
|
3189
3097
|
|
3190
|
-
You can
|
3191
|
-
code:
|
3098
|
+
You can also randomize the sequence, via .randomize().
|
3192
3099
|
|
3193
|
-
|
3100
|
+
Example:
|
3194
3101
|
|
3195
|
-
|
3196
|
-
returned representing that nucleotide sequence.
|
3102
|
+
x = Bioroebe::Sequence.new; x.randomize
|
3197
3103
|
|
3198
|
-
|
3199
|
-
should be generated.
|
3104
|
+
This is similar to the method in Bioruby here:
|
3200
3105
|
|
3201
|
-
|
3106
|
+
https://github.com/bioruby/bioruby/blob/master/lib/bio/sequence/common.rb#L243
|
3202
3107
|
|
3203
|
-
|
3204
|
-
such as by issuing the following command:
|
3108
|
+
## The Hydropathy index
|
3205
3109
|
|
3206
|
-
|
3110
|
+
You can display the hydropathy index for aminoacids from within
|
3111
|
+
the **bioshell**.
|
3207
3112
|
|
3208
|
-
|
3113
|
+
Simply issue:
|
3209
3114
|
|
3210
|
-
|
3115
|
+
hydropathy?
|
3211
3116
|
|
3212
|
-
|
3117
|
+
## The GFF file format
|
3213
3118
|
|
3214
|
-
|
3119
|
+
From within the **bioshell** you can analyze .gff and .gff3 files,
|
3120
|
+
such as by issuing the following command:
|
3215
3121
|
|
3216
|
-
|
3122
|
+
gff3? foobar.gff3
|
3217
3123
|
|
3218
|
-
|
3219
|
-
compositions of the same nucleotide.
|
3124
|
+
Evidently for this to work the file at hand has to exist.
|
3220
3125
|
|
3221
3126
|
## The NCBI Taxonomy database (the Taxonomy submodule of the Bioroebe project)
|
3222
3127
|
|
@@ -3353,47 +3258,6 @@ nucleotides by issuing:
|
|
3353
3258
|
|
3354
3259
|
show_individual_weight_of_the_four_dna_nucleotides
|
3355
3260
|
|
3356
|
-
## Truncating output in the bioroebe-shell
|
3357
|
-
![alt text][cat1]
|
3358
|
-
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
3359
|
-
|
3360
|
-
**DNA/RNA sequences** can become very long and then become
|
3361
|
-
quite difficult to view, read and handle on the commandline.
|
3362
|
-
|
3363
|
-
Normally the bioroebe shell will truncate output of DNA sequences
|
3364
|
-
that are "too long". This is mostly done so that working with
|
3365
|
-
very long sequences becomes a bit more convenient.
|
3366
|
-
|
3367
|
-
Sometimes this can become an antifeature, though, so the user
|
3368
|
-
must be able to toggle this at his or her own discretion.
|
3369
|
-
|
3370
|
-
By default, the bioroebe-shell (bioshell) will always try
|
3371
|
-
to truncate output, but you can toggle this behaviour by
|
3372
|
-
issuing:
|
3373
|
-
|
3374
|
-
do not truncate
|
3375
|
-
|
3376
|
-
In theory, other "do not" actions are also supported, or will
|
3377
|
-
be supported in the future; right now (Oct 2019) this is a bit
|
3378
|
-
limited.
|
3379
|
-
|
3380
|
-
From the toplevel, you can use this method:
|
3381
|
-
|
3382
|
-
Bioroebe.do_not_truncate
|
3383
|
-
|
3384
|
-
The above instruction will toggle the truncate behaviour
|
3385
|
-
to not truncate, ever.
|
3386
|
-
|
3387
|
-
If you need to do so within the bioshell, this is the way:
|
3388
|
-
|
3389
|
-
no_truncate
|
3390
|
-
|
3391
|
-
Or simply
|
3392
|
-
|
3393
|
-
truncate
|
3394
|
-
|
3395
|
-
This will toggle, like a switch.
|
3396
|
-
|
3397
3261
|
## Rosalind Challenges
|
3398
3262
|
![alt text][cat1]
|
3399
3263
|
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
@@ -3530,31 +3394,6 @@ investing more time into Rosalind. Let's focus on solving
|
|
3530
3394
|
real, existing problems instead - at the least as far as
|
3531
3395
|
the Bioroebe project is concerned.
|
3532
3396
|
|
3533
|
-
## Numbers as input in the bioshell
|
3534
|
-
![alt text][cat1]
|
3535
|
-
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
3536
|
-
|
3537
|
-
You can input a number in the **BioShell** such as <b style="color: darkblue">3</b>.
|
3538
|
-
|
3539
|
-
This will attempt to <b>display the first 3 nucleotides</b> of
|
3540
|
-
the assigned **main sequence**. It will only work if you have
|
3541
|
-
assigned a sequence prior to that, though.
|
3542
|
-
|
3543
|
-
Examples:
|
3544
|
-
|
3545
|
-
3
|
3546
|
-
33
|
3547
|
-
15
|
3548
|
-
|
3549
|
-
## transeq
|
3550
|
-
![alt text][cat1]
|
3551
|
-
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
3552
|
-
|
3553
|
-
You can convert a DNA sequence into an aminoacid sequence by
|
3554
|
-
doing this:
|
3555
|
-
|
3556
|
-
transeq
|
3557
|
-
|
3558
3397
|
## Align two different sequences
|
3559
3398
|
![alt text][cat1]
|
3560
3399
|
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
@@ -3866,22 +3705,6 @@ does not (yet?) have support for comparing two genomes to
|
|
3866
3705
|
one another and generate a visual map indicating the findings
|
3867
3706
|
there.
|
3868
3707
|
|
3869
|
-
## Do not create directories on startup of the shell
|
3870
|
-
|
3871
|
-
By default the bioshell will try to create some directories
|
3872
|
-
on startup. This may not always be desired by the user
|
3873
|
-
though, so an option has to exist to disable this functionality.
|
3874
|
-
|
3875
|
-
Internally the variable @internal_hash[:create_directories_on_startup_of_the_shell]
|
3876
|
-
keeps track of whether directories on startup of the shell will
|
3877
|
-
be created.
|
3878
|
-
|
3879
|
-
To disable this behaviour on startup of the bioshell, try
|
3880
|
-
something like this:
|
3881
|
-
|
3882
|
-
bioshell --do-not-create-directories-on-startup
|
3883
|
-
bioshell --do-not-create-directories
|
3884
|
-
|
3885
3708
|
## class Bioroebe::MoveFileToItsCorrectLocation
|
3886
3709
|
|
3887
3710
|
This class will move a bio-file to its "correct" location, with respect
|
@@ -3924,15 +3747,6 @@ synonymous, aka aliases):
|
|
3924
3747
|
ruler2 25 # ← use 25 characters per line
|
3925
3748
|
ruler2 50 # ← use 50 characters per line
|
3926
3749
|
|
3927
|
-
## Generating a random nucleotide sequence based on frequencies
|
3928
|
-
|
3929
|
-
If you ever need to generate a nucleotide frequency then you can use
|
3930
|
-
the following method:
|
3931
|
-
|
3932
|
-
Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies
|
3933
|
-
Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies 100
|
3934
|
-
Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies 500
|
3935
|
-
|
3936
3750
|
## The Mouse
|
3937
3751
|
|
3938
3752
|
This subsection is about the **mouse**, in particular relevant
|
@@ -4050,57 +3864,24 @@ has". Genes in itself are not that well-defined, so they are not necessarily
|
|
4050
3864
|
the primary means of complexity. Think of this more as an interactome,
|
4051
3865
|
where RNAs play a major dynamic role as well.
|
4052
3866
|
|
4053
|
-
## Bioroebe::
|
3867
|
+
## class Bioroebe::DisplayOpenReadingFrames
|
4054
3868
|
|
4055
|
-
|
4056
|
-
|
4057
|
-
|
3869
|
+
**class Bioroebe::DisplayOpenReadingFrames**, created in **May 2020**,
|
3870
|
+
will eventually replace the older **class Bioroebe::ShowOrf**. Thus,
|
3871
|
+
**class Bioroebe::DisplayOpenReadingFrames** will have to remain quite
|
3872
|
+
flexible. It shall also support **sixpack** and **showorf** from the
|
3873
|
+
**Emboss online tools**. (In fact, supporting these two use cases
|
3874
|
+
was the original reason as to why this class has been created.)
|
4058
3875
|
|
4059
|
-
|
4060
|
-
HMMs (Hidden Markov Models) one day.
|
3876
|
+
Where does the code to this class reside?
|
4061
3877
|
|
4062
|
-
|
3878
|
+
It can be found here:
|
4063
3879
|
|
4064
|
-
|
4065
|
-
|
3880
|
+
bioroebe/utility_scripts/display_open_reading_frames/
|
3881
|
+
require 'bioroebe/utility_scripts/display_open_reading_frames/display_open_reading_frames.rb'
|
4066
3882
|
|
4067
|
-
|
4068
|
-
|
4069
|
-
the Hash into the method generate_sequence_based_on_this_profile() -
|
4070
|
-
or you use the default Hash, which is stored in the constant
|
4071
|
-
called **PER_POSITION_HASH**.
|
4072
|
-
|
4073
|
-
That profile should be a Hash, with keys pointing to A, T, C, G
|
4074
|
-
and the values being an Array of likelihood chance there,
|
4075
|
-
as a number, such as 140. These values are also called
|
4076
|
-
**scores**. Each score contains a number for each position
|
4077
|
-
that indicates how likely it is to find the given
|
4078
|
-
nucleotide at that location.
|
4079
|
-
|
4080
|
-
You can also use this class to generate a random DNA string,
|
4081
|
-
similar to the method called
|
4082
|
-
**Bioroebe.generate_random_dna_sequence()**. The difference
|
4083
|
-
is that class ProfilePattern allows for a bit more fine-tuned
|
4084
|
-
control. The class will likely be extended in the future too.
|
4085
|
-
|
4086
|
-
## class Bioroebe::DisplayOpenReadingFrames
|
4087
|
-
|
4088
|
-
**class Bioroebe::DisplayOpenReadingFrames**, created in **May 2020**,
|
4089
|
-
will eventually replace the older **class Bioroebe::ShowOrf**. Thus,
|
4090
|
-
**class Bioroebe::DisplayOpenReadingFrames** will have to remain quite
|
4091
|
-
flexible. It shall also support **sixpack** and **showorf** from the
|
4092
|
-
**Emboss online tools**. (In fact, supporting these two use cases
|
4093
|
-
was the original reason why this class was created.)
|
4094
|
-
|
4095
|
-
Where does the code to this class reside?
|
4096
|
-
|
4097
|
-
It can be found here:
|
4098
|
-
|
4099
|
-
bioroebe/utility_scripts/display_open_reading_frames/
|
4100
|
-
require 'bioroebe/utility_scripts/display_open_reading_frames/display_open_reading_frames.rb'
|
4101
|
-
|
4102
|
-
The display of this class is typically aimed at the commandline,
|
4103
|
-
but it is planned to use the class on the www too (via sinatra).
|
3883
|
+
The display of this class is typically aimed at the commandline,
|
3884
|
+
but it is planned to use the class on the www too (via sinatra).
|
4104
3885
|
|
4105
3886
|
Take note that this class also reports how many ORFs (open reading
|
4106
3887
|
frames) have been found. The number displayed here differs from
|
@@ -4462,28 +4243,6 @@ the BioRoebe-Shell, then you can use either of the following:
|
|
4462
4243
|
|
4463
4244
|
seq?
|
4464
4245
|
seq_with_tab?
|
4465
|
-
|
4466
|
-
## Prompt (the shell prompt)
|
4467
|
-
|
4468
|
-
You can set a <b>custom prompt</b>, via the keywords
|
4469
|
-
"prompt" or "set_prompt".
|
4470
|
-
|
4471
|
-
To display the <b>current working directory</b>, do:
|
4472
|
-
|
4473
|
-
prompt pwd
|
4474
|
-
|
4475
|
-
To revert to the old default again, do this:
|
4476
|
-
|
4477
|
-
prompt REVERT
|
4478
|
-
prompt revert
|
4479
|
-
prompt DEFAULT
|
4480
|
-
prompt default
|
4481
|
-
|
4482
|
-
If you do not want to set any prompt, do:
|
4483
|
-
|
4484
|
-
prompt none
|
4485
|
-
|
4486
|
-
|
4487
4246
|
|
4488
4247
|
## Leader and Trailer
|
4489
4248
|
|
@@ -4971,17 +4730,17 @@ For now, here is the list:
|
|
4971
4730
|
|
4972
4731
|
## The T-Bacteriophages
|
4973
4732
|
|
4974
|
-
The following table only shows a short summary for the
|
4733
|
+
The following table only shows a short summary for the <b>T-phages</b>.
|
4975
4734
|
|
4976
|
-
name of the phage | Plaque size | phage-head diameter (nm) | tail diameter | latent period (in minutes) | Burst size
|
4977
|
-
|
4978
|
-
T1 | medium | 50 | 150 x 15 | 13 | 180
|
4979
|
-
T2 | small | 65 x 80 | 120 x 20 | 21 | 120
|
4980
|
-
T3 | large | 45 | invisible | 13 | 300
|
4981
|
-
T4 | small | 65 x 80 | 120 x 20 | 23.5 | 300
|
4982
|
-
T5 | small | 100 | tiny | 40 | 300
|
4983
|
-
T6 | small | 65 x 80 | 120 x 20 | 25.5 | 200-300
|
4984
|
-
T7 | large | 45 | invisible | 13 | 300
|
4735
|
+
name of the phage | Plaque size | phage-head diameter (nm) | tail diameter | latent period (in minutes) | Burst size | n genes
|
4736
|
+
-------------------|--------------|---------------------------|----------------|----------------------------|-------------|------------
|
4737
|
+
T1 | medium | 50 | 150 x 15 | 13 | 180 |
|
4738
|
+
T2 | small | 65 x 80 | 120 x 20 | 21 | 120 |
|
4739
|
+
T3 | large | 45 | invisible | 13 | 300 |
|
4740
|
+
T4 | small | 65 x 80 | 120 x 20 | 23.5 | 300 | 300
|
4741
|
+
T5 | small | 100 | tiny | 40 | 300 |
|
4742
|
+
T6 | small | 65 x 80 | 120 x 20 | 25.5 | 200-300 |
|
4743
|
+
T7 | large | 45 | invisible | 13 | 300 |
|
4985
4744
|
|
4986
4745
|
The next table will show some phage genomes.
|
4987
4746
|
|
@@ -5392,215 +5151,6 @@ that format.
|
|
5392
5151
|
Presently (**May 2020**) there is no support for the mmCIF format
|
5393
5152
|
in the Bioroebe project, but this will eventually change.
|
5394
5153
|
|
5395
|
-
## Working with PDB files (.pdb)
|
5396
|
-
![alt text][cat1]
|
5397
|
-
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
5398
|
-
|
5399
|
-
The **PDB**, founded in the year **1971**, holds lots of **atomic
|
5400
|
-
structures of proteins**.
|
5401
|
-
|
5402
|
-
In **July 2016** it contained **121000 structures**.
|
5403
|
-
|
5404
|
-
In **February 2018** it contained **~124000 structures**
|
5405
|
-
(from X-ray crystallography), and about **~12000 NMR
|
5406
|
-
structures**. <b>NMR</b> is limited to about <b>350 amino
|
5407
|
-
acids maximum length</b>, give or take.
|
5408
|
-
|
5409
|
-
In **April 2020** the PDB contained **163141 structures**.
|
5410
|
-
|
5411
|
-
We can see that more and more structures are available
|
5412
|
-
nowadays - a trend that will most likely continue or
|
5413
|
-
even accelerate. (Let's hope the quality also remains
|
5414
|
-
high.)
|
5415
|
-
|
5416
|
-
A typical .pdb file contains entries such as this:
|
5417
|
-
|
5418
|
-
RTyp Num Atm Res Ch ResN X Y Z Occ Temp PDB Line
|
5419
|
-
ATOM 1 N ASP L 1 4.060 7.307 5.186 1.00 51.58 1FDL 93
|
5420
|
-
ATOM 2 CA ASP L 1 4.042 7.776 6.553 1.00 48.05 1FDL 94
|
5421
|
-
ATOM 3 N VAL A 25 32.433 16.336 57.540 1.00 11.92 A1 N
|
5422
|
-
ATOM 4 CA VAL A 25 31.132 16.439 58.160 1.00 11.85 A1 C
|
5423
|
-
ATOM 5 C VAL A 25 30.447 15.105 58.363 1.00 12.34 A1 C
|
5424
|
-
|
5425
|
-
(Note the first line; **RTyp** is just an explanation for the ATOM
|
5426
|
-
entries below that line).
|
5427
|
-
|
5428
|
-
The sequence starts from the N-terminal residue for proteins; see
|
5429
|
-
the <b>Atm</b> entry at <b>Num 1</b>.
|
5430
|
-
|
5431
|
-
The **meaning of these entries** is as follows:
|
5432
|
-
|
5433
|
-
1) RTyp: Record Type
|
5434
|
-
2) Num: Serial number of the atom. Each atom has a unique serial number.
|
5435
|
-
3) Atm: Atom name (in IUPAC format).
|
5436
|
-
4) Res: Residue name (IUPAC format).
|
5437
|
-
5) Ch: Chain to which the atom belongs (in this case, L for light chain of an antibody).
|
5438
|
-
6) ResN: Residue sequence number. This will be incremental, e.g. 1, 2, 3, 4 and so forth.
|
5439
|
-
7,8,9) X, Y, Z: Cartesian coordinates specifying atomic position in space.
|
5440
|
-
10) Occ: Occupancy factor
|
5441
|
-
11) Temp: Temperature factor (atoms disordered in the crystal have high
|
5442
|
-
temperature factors; they are "wobbly" with a high factor.
|
5443
|
-
This is also called the B-factor).
|
5444
|
-
12) PDB: The PDB data file unique identifier.
|
5445
|
-
13) Line: Line (record) number in the data file.
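
As a rough illustration of the thirteen fields listed above, the following sketch splits one of the sample lines on whitespace. The helper name `parse_atom_line` and the field symbols are made up for this example and are not part of Bioroebe. Real .pdb files use fixed column positions, so whitespace splitting is only a simplification that happens to work for the sample line shown here.

```ruby
# Hypothetical helper, not part of Bioroebe: split a whitespace-separated
# ATOM line (as printed in the example above) into named fields.
ATOM_FIELDS = %i[rtyp num atm res ch resn x y z occ temp pdb line].freeze

def parse_atom_line(line)
  entry = ATOM_FIELDS.zip(line.split).to_h
  # Convert the numeric columns; everything else stays a String.
  %i[num resn].each { |key| entry[key] = Integer(entry[key]) }
  %i[x y z occ temp].each { |key| entry[key] = Float(entry[key]) }
  entry
end

atom = parse_atom_line('ATOM 1 N ASP L 1 4.060 7.307 5.186 1.00 51.58 1FDL 93')
puts atom[:res] # residue name
puts atom[:x]   # x coordinate as a Float
```

A production parser would slice by column instead, as described in the wwPDB format specification.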
|
5446
|
-
|
5447
|
-
Typically the rightmost entry, the last one, specifies
|
5448
|
-
which atom it is. A **H** stands for a hydrogen atom; the other atoms
|
5449
|
-
are "heavy" atoms (heavier than hydrogen most definitely).
|
5450
|
-
|
5451
|
-
Most .pdb files will contain **SEQRES** entries. These entries will list
|
5452
|
-
the primary sequence of the polymeric molecules present in the entry.
|
5453
|
-
You can notice this by looking at the standard 3-character code
|
5454
|
-
used by SEQRES here, for the canonical amino acids. So, for instance,
|
5455
|
-
the amino acids that will be mentioned in a SEQRES entry are
|
5456
|
-
ALA, CYS, ASP, GLU, PHE, GLY, HIS, ILE, LYS, LEU, MET, ASN,
|
5457
|
-
PRO, GLN, ARG, SER, THR, VAL, TRP and TYR. You can use the
|
5458
|
-
method **Bioroebe.three_to_one()** to convert back to the
|
5459
|
-
one-letter chain such as follows:
|
5460
|
-
|
5461
|
-
Bioroebe.three_to_one('PHE') # => "F"
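
To show what such a conversion amounts to for a whole SEQRES-style list, here is a minimal plain-Ruby sketch. The constant `THREE_TO_ONE` and the helper `seqres_to_one_letter` are assumptions made for this example; Bioroebe.three_to_one() may well be implemented differently.

```ruby
# Illustrative lookup table for the 20 canonical amino acids listed above.
THREE_TO_ONE = {
  'ALA' => 'A', 'CYS' => 'C', 'ASP' => 'D', 'GLU' => 'E', 'PHE' => 'F',
  'GLY' => 'G', 'HIS' => 'H', 'ILE' => 'I', 'LYS' => 'K', 'LEU' => 'L',
  'MET' => 'M', 'ASN' => 'N', 'PRO' => 'P', 'GLN' => 'Q', 'ARG' => 'R',
  'SER' => 'S', 'THR' => 'T', 'VAL' => 'V', 'TRP' => 'W', 'TYR' => 'Y'
}.freeze

# Hypothetical helper: turn an Array of three-letter codes into a
# one-letter String. Unknown codes raise a KeyError via fetch.
def seqres_to_one_letter(residues)
  residues.map { |residue| THREE_TO_ONE.fetch(residue.upcase) }.join
end

puts seqres_to_one_letter(%w[MET LEU SER ASP])
```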
|
5462
|
-
|
5463
|
-
The data in a .pdb file need not necessarily only be a protein, with
|
5464
|
-
a specific aminoacid sequence. It may also include DNA. An example
|
5465
|
-
for such a molecule is
|
5466
|
-
<b><a href="http://rcsb.org/pdb/explore/explore.do?structureId=2DGC">2dgc</a></b>,
|
5467
|
-
which includes a protein chain and a DNA chain.
|
5468
|
-
|
5469
|
-
As far as the **bioroebe project** is concerned, you can parse .pdb files
|
5470
|
-
via the following class:
|
5471
|
-
|
5472
|
-
Bioroebe::ParsePdbFile.new
|
5473
|
-
Bioroebe::ParsePdbFile.new(path_to_the_pdb_file_here)
|
5474
|
-
Bioroebe::ParsePdbFile.new('/foo/bar/ack.pdb')
|
5475
|
-
|
5476
|
-
This class also allows some shortcuts for integrated .pdb files,
|
5477
|
-
that is files that are bundled with the bioroebe project:
|
5478
|
-
|
5479
|
-
Bioroebe::ParsePdbFile.new ':1fat'
|
5480
|
-
|
5481
|
-
This requires a String because ruby symbols may not start with
|
5482
|
-
a number. Note that this also works through the commandline,
|
5483
|
-
such as:
|
5484
|
-
|
5485
|
-
parse_pdb_file :1fat
|
5486
|
-
|
5487
|
-
A shell such as bash does not understand ruby symbols, so instead
|
5488
|
-
a string will be passed in, being :1fat. The ParsePdbFile will
|
5489
|
-
handle this correctly internally.
|
5490
|
-
|
5491
|
-
Note that a small bug was fixed in the file parse_pdb_file.rb;
|
5492
|
-
some entries were skipped due to an erroneous loop in the ruby
|
5493
|
-
file. This was corrected in **May 2020**.
|
5494
|
-
|
5495
|
-
In **March 2021** the ability to use entries such as ':1fat'
|
5496
|
-
was removed again; the code remains though. The reason why
|
5497
|
-
this was removed was that the .pdb files are quite large,
|
5498
|
-
so distributing them via the bioroebe project makes no real
|
5499
|
-
sense. Consider simply downloading the .pdb files; you
|
5500
|
-
can use this from the bioshell or via something
|
5501
|
-
like:
|
5502
|
-
|
5503
|
-
pdb 5TIM
|
5504
|
-
|
5505
|
-
Note that you can also return the aminoacid-sequence from a
|
5506
|
-
.pdb file directly, as of **May 2020**.
|
5507
|
-
|
5508
|
-
Example for this:
|
5509
|
-
|
5510
|
-
Bioroebe.return_aminoacid_sequence_from_this_pdb_file "1VII.pdb" # => "MLSDEDFKAVFGMTRSAFANLPLWKQQNLKKEKGLF"
|
5511
|
-
|
5512
|
-
The first argument should be **the path to the (local)
|
5513
|
-
.pdb file at hand**. (In theory support for remote .pdb
|
5514
|
-
files could also be added easily, but right now this
|
5515
|
-
is not possible, so you have to download it first.)
|
5516
|
-
|
5517
|
-
The **specification for .pdb files** can be read at the following
|
5518
|
-
two remote resources:
|
5519
|
-
|
5520
|
-
http://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html
|
5521
|
-
http://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#ATOM
|
5522
|
-
|
5523
|
-
Note that the parse_pdb_file.rb can also do some additional
|
5524
|
-
things, such as calculating the maximum distance between
|
5525
|
-
atoms in that file, via the method
|
5526
|
-
**.try_to_determine_the_max_distance_between_the_atoms_in_this_protein()**.
|
5527
|
-
|
5528
|
-
If you wish to report the secondary structures from a given .pdb file
|
5529
|
-
then you can use the following class:
|
5530
|
-
|
5531
|
-
require 'bioroebe/pdb/report_secondary_structures_from_this_pdb_file.rb'
|
5532
|
-
|
5533
|
-
Bioroebe::ReportSecondaryStructuresFromThisPdbFile.new
|
5534
|
-
Bioroebe::ReportSecondaryStructuresFromThisPdbFile.new('foobar.pdb')
|
5535
|
-
|
5536
|
-
If you wish to obtain the FASTA sequence of a particular remote
|
5537
|
-
.pdb file then you can use this API:
|
5538
|
-
|
5539
|
-
x = Bioroebe.return_fasta_sequence_from_this_pdb_file "2bts" # => "MLSDEDFKAVFGMTRSAFANLPLWKQQNLKKEKGLF"
|
5540
|
-
|
5541
|
-
Keep in mind that this is the FASTA sequence; the .pdb file itself
|
5542
|
-
has another format, and contains a lot more information, such as
|
5543
|
-
the various ATOM entries.
|
5544
|
-
|
5545
|
-
As of **June 2020**, the command **fetch** also works from
|
5546
|
-
within the Bioshell, similar to how pymol **works**. This allows
|
5547
|
-
us to quickly download a remote .pdb file.
|
5548
|
-
|
5549
|
-
fetch 2BTS
|
5550
|
-
|
5551
|
-
You can also use the following toplevel-API to download a remote
|
5552
|
-
.pdb file:
|
5553
|
-
|
5554
|
-
Bioroebe.download_this_pdb
|
5555
|
-
Bioroebe.download_this_pdb '355D'
|
5556
|
-
Bioroebe.download_this_pdb '1K4R' # This is the Dengue Virus
|
5557
|
-
|
5558
|
-
Note that this will be automatically moved to the "correct" default
|
5559
|
-
position in the bioroebe-project, under the **pdb/** subdirectory.
|
5560
|
-
|
5561
|
-
You can also invoke this script from the commandline via
|
5562
|
-
**bin/download_this_pdb**, like in this way:
|
5563
|
-
|
5564
|
-
download_this_pdb 355D
|
5565
|
-
|
5566
|
-
This works with several .pdb files in one go as well:
|
5567
|
-
|
5568
|
-
download_this_pdb 1NR6 2F9Q 3TDA 2HI4 2V0M
|
5569
|
-
|
5570
|
-
They would all be downloaded one after the other. Be aware that
|
5571
|
-
this will overwrite the old .pdb files on that position, so
|
5572
|
-
if you don't want this, I recommend making a backup of the
|
5573
|
-
**pdb/** subdirectory before invoking the above call.
|
5574
|
-
|
5575
|
-
You can also turn the FASTA sequence stored in a .pdb file into
|
5576
|
-
a .fasta file, via **--create-fasta-file**.
|
5577
|
-
|
5578
|
-
Usage examples:
|
5579
|
-
|
5580
|
-
parsedb 1NR6 --create-fasta-file
|
5581
|
-
parsedb 2F9Q --create-fasta-file
|
5582
|
-
parsedb 3TDA --create-fasta-file
|
5583
|
-
parsedb 2HI4 --create-fasta-file
|
5584
|
-
parsedb 2V0M --create-fasta-file
|
5585
|
-
|
5586
|
-
So if you have a file called <b>1NR6.pdb</b> and you use
|
5587
|
-
the first input, a .fasta file will be created. If such
|
5588
|
-
a .pdb file does not exist then this will not work, so
|
5589
|
-
make sure to download the .pdb file before invoking
|
5590
|
-
this commandline-flag.
|
5591
|
-
|
5592
|
-
Last but not least, the following table shall document the
|
5593
|
-
PDB format - it is not yet complete, but it is intended
|
5594
|
-
to add the remaining datasets eventually:
|
5595
|
-
|
5596
|
-
Record Name Describes
|
5597
|
-
MODRES Modifications to standard residues
|
5598
|
-
HET Nonstandard residues (as well as ligands, ions and water)
|
5599
|
-
HETNAM Full chemical name of the residue
|
5600
|
-
HETSYM Synonyms for the residue
|
5601
|
-
FORMUL Chemical formula of the residue
|
5602
|
-
KEYWDS specifies keywords, such as "FK506 BINDING PROTEIN, FKBP12, CIS-TRANS PROLYL-ISOMERASE, ROTAMASE"
|
5603
|
-
|
5604
5154
|
## Sugars and glyco-patterns
|
5605
5155
|
|
5606
5156
|
I am currently having to do an assignment related to glyco-patterns
|
@@ -5764,6 +5314,9 @@ like this:
|
|
5764
5314
|
|
5765
5315
|
<img src="https://i.imgur.com/vr2kEBz.png" style="margin: 1em; margin-left: 3em">
|
5766
5316
|
|
5317
|
+
As of <b>July 2022</b>, invalid amino acids will be automatically
|
5318
|
+
filtered away before being assigned to the input.
|
5319
|
+
|
5767
5320
|
## Colourizing hydrophilic and hydrophobic aminoacids on the commandline
|
5768
5321
|
|
5769
5322
|
Via class **Bioroebe::ColourizeHydrophilicAndHydrophobicAminoacids** you
|
@@ -5777,35 +5330,36 @@ Example output for this:
|
|
5777
5330
|
|
5778
5331
|
This subsection contains some information about proteases.
|
5779
5332
|
|
5780
|
-
|
5333
|
+
Trypsin:
|
5781
5334
|
https://en.wikipedia.org/wiki/Trypsin
|
5782
|
-
cuts at
|
5335
|
+
<b>cuts at</b>: Trypsin cuts peptide chains mainly at the carboxyl
|
5783
5336
|
side of the amino acids lysine or arginine.
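
The cleavage rule for trypsin can be sketched in a few lines of plain Ruby. The helper `trypsin_fragments` is hypothetical and not a Bioroebe method, and it deliberately applies only the simple rule stated above (cut after every K or R), ignoring any exceptions a real digestion tool would handle.

```ruby
# Hypothetical helper: cut a one-letter protein sequence after every
# lysine (K) or arginine (R), i.e. on their carboxyl side.
def trypsin_fragments(sequence)
  # Each match is either "anything up to and including K/R" or the
  # K/R-free remainder at the very end of the sequence.
  sequence.upcase.scan(/[^KR]*[KR]|[^KR]+\z/)
end

p trypsin_fragments('MKAARGLKTT')
```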
|
5784
5337
|
|
5785
|
-
|
5338
|
+
Chymotrypsin:
|
5786
5339
|
https://en.wikipedia.org/wiki/Chymotrypsin
|
5787
|
-
cuts at
|
5340
|
+
<b>cuts at</b>: Chymotrypsin preferentially cleaves peptide amide
|
5788
5341
|
bonds where the side chain of the amino acid N-terminal
|
5789
|
-
to the scissile amide bond is a large hydrophobic amino
|
5790
|
-
acid (tyrosine, tryptophan, and phenylalanine).
|
5342
|
+
to the scissile amide bond is <b>a large hydrophobic amino</b>
|
5343
|
+
acid (specifically: tyrosine, tryptophan, and phenylalanine).
|
5344
|
+
Chymotrypsin will cleave proteins on the <b>carboxyl side</b>
|
5345
|
+
of aromatic or large hydrophobic amino acids.
|
5791
5346
|
|
5792
|
-
|
5347
|
+
Thrombin:
|
5793
5348
|
https://en.wikipedia.org/wiki/Thrombin
|
5794
|
-
cuts at
|
5349
|
+
<b>cuts at</b>: Thrombin acts as a serine protease that converts
|
5795
5350
|
soluble fibrinogen into insoluble strands of fibrin. It
|
5796
5351
|
catalyzes the hydrolysis of <b>Arg-Gly</b> bonds in
|
5797
5352
|
particular peptide sequences only.
|
5798
5353
|
|
5799
|
-
|
5354
|
+
Plasmin:
|
5800
5355
|
https://en.wikipedia.org/wiki/Plasmin
|
5801
|
-
cuts at
|
5356
|
+
<b>cuts at</b>: Plasmin is a serine protease.
|
5802
5357
|
|
5803
|
-
|
5358
|
+
Papain:
|
5804
5359
|
https://en.wikipedia.org/wiki/Papain
|
5805
|
-
cuts at
|
5806
|
-
|
5807
|
-
|
5808
|
-
not followed by a valine.
|
5360
|
+
<b>cuts at</b>: Papain prefers to cleave after an arginine or
|
5361
|
+
lysine preceded by a hydrophobic unit (Ala, Val, Leu, Ile,
|
5362
|
+
Phe, Trp, Tyr) and not followed by a valine.
|
5809
5363
|
|
5810
5364
|
factor Xa:
|
5811
5365
|
|
@@ -5817,8 +5371,8 @@ Some proteins may permanently reside in the lumen of the
|
|
5817
5371
|
Often such proteins will have a special signal sequence attached
|
5818
5372
|
to their **C-terminal part**, such as **KDEL** (Lys-Asp-Glu-Leu).
|
5819
5373
|
|
5820
|
-
KDEL is not the only signal that may be used, though. Some
|
5821
|
-
may use different signals, such as:
|
5374
|
+
<b>KDEL</b> is not the only signal that may be used, though. Some
|
5375
|
+
species may use different signals, such as:
|
5822
5376
|
|
5823
5377
|
aminoacids | species
|
5824
5378
|
-------------|------------------------------------------------------------
|
@@ -5828,8 +5382,9 @@ may use different signals, such as:
|
|
5828
5382
|
ADEL | Schizosaccharomyces pombe (fission yeast)
|
5829
5383
|
SDEL | Plasmodium falciparum
|
5830
5384
|
|
5831
|
-
If you work with the bioshell then you can simply use this
|
5832
|
-
to query whether the given aminoacid sequence has a KDEL
|
5385
|
+
If you work with the <b>bioshell</b> then you can simply use this
|
5386
|
+
method to query whether the given aminoacid sequence has a KDEL
|
5387
|
+
sequence:
|
5833
5388
|
|
5834
5389
|
KDEL?
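
Outside of the bioshell, the same kind of check is easy to sketch in plain Ruby. The constant and the helper `er_retention_signal` below are assumptions for illustration only; the list contains just KDEL plus the ADEL and SDEL variants from the table above.

```ruby
# Illustrative subset of C-terminal retention signals.
ER_RETENTION_SIGNALS = %w[KDEL ADEL SDEL].freeze

# Hypothetical helper: return the matching C-terminal signal, or nil
# if the aminoacid sequence does not end with any of them.
def er_retention_signal(sequence)
  ER_RETENTION_SIGNALS.find { |signal| sequence.upcase.end_with?(signal) }
end

p er_retention_signal('MAAAGGKDEL')
p er_retention_signal('KDELMAAAGG')
```

Only the C-terminal end is inspected, since the signal is described above as being attached to the C-terminal part.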
|
5835
5390
|
|
@@ -6240,8 +5795,6 @@ Next, do something such as this:
|
|
6240
5795
|
This will show the distribution of the oligos.
|
6241
5796
|
|
6242
5797
|
## Number of chromosomes in different species
|
6243
|
-
![alt text][cat1]
|
6244
|
-
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
6245
5798
|
|
6246
5799
|
Name of the organism | Latin name | Number of chromosomes
|
6247
5800
|
---------------------|--------------|-----------------------
|
@@ -6319,112 +5872,6 @@ So this is what would be returned:
|
|
6319
5872
|
|
6320
5873
|
Bioroebe::DetectMinimalCodon[["TTT", "TTC"]] # => ["TTY"]
|
6321
5874
|
|
6322
|
-
## Codon Usage
|
6323
|
-
|
6324
|
-
This **paragraph** deals with some aspects of **codon usage** in different
|
6325
|
-
organisms.
|
6326
|
-
|
6327
|
-
Let us first define the term <b>codon usage</b>. In order to do so,
|
6328
|
-
we also have to define what a <b>codon</b> is, so let's start with that.
|
6329
|
-
|
6330
|
-
A <span style="color: darkgreen; font-weight: bold">codon</span> is
|
6331
|
-
essentially the basic code used in DNA to denote which particular
|
6332
|
-
**aminoacid** corresponds to these (three) nucleotide base pairs.
|
6333
|
-
A codon is thus **a series of three nucleotides**, also called
|
6334
|
-
a <b>triplet</b>.
|
6335
|
-
|
6336
|
-
When we use the term <b>base pairs</b>, we refer to **double-stranded DNA**,
|
6337
|
-
abbreviated as <b>dsDNA</b>. The codon is, however, only found
|
6338
|
-
in a single stranded molecule, even within dsDNA. Since some parts of
|
6339
|
-
a **dsDNA** in any given genome gives rise to a, more or less, complementary
|
6340
|
-
copy into **mRNA**, the codons that are actually used, are found in the
|
6341
|
-
corresponding mRNA. (Remember that mRNA differs from DNA in that there
|
6342
|
-
will be Uracil rather than Thymine; otherwise it is the same, sequence-wise.
|
6343
|
-
Of course it uses another sugar (Ribose), but remember we are here mostly
|
6344
|
-
interested in the **information-containing part**, not the full chemical
|
6345
|
-
structure.)
|
6346
|
-
|
6347
|
-
The codon is thus found on the mRNA and since mRNA is mostly
|
6348
|
-
single-stranded, the codon is a component of the mRNA. It is
|
6349
|
-
where the two subunits of the ribosome are assembled (or more
|
6350
|
-
accurately, the smaller subunit scans along the mRNA until it
|
6351
|
-
detects a start codon). Mind you, this subsection will not go into
|
6352
|
-
all relevant details, so just keep in mind that the codon is the
|
6353
|
-
part that will eventually be "translated" at the ribosome into
|
6354
|
-
a corresponding aminoacid, excluding stop codons at the end.
|
6355
|
-
|
6356
|
-
Now - different organisms use **different frequencies of codons**.
|
6357
|
-
**Codon usage** thus describes the fact that many proteins in
|
6358
|
-
these different organisms make use of certain codons with a
|
6359
|
-
**substantially higher frequency than other codons**. We can
|
6360
|
-
use statistics to infer this on a global (proteome) level
|
6361
|
-
too.
|
6362
|
-
|
6363
|
-
Remember that the genetic code is **degenerate**, meaning that
|
6364
|
-
you have a few aminoacids that are encoded only by one codon
|
6365
|
-
(<b>Tryptophan</b> and <b>Methionine</b>), whereas the other
|
6366
|
-
aminoacids are encoded by more than one codon - thus, at the
|
6367
|
-
very least two codons. Note that the latter codons, if they
|
6368
|
-
code for the **same** aminoacid, are also called <b>synonymous
|
6369
|
-
codons</b>.
|
6370
|
-
|
6371
|
-
This means that if you have any given aminoacid chain, you can have
|
6372
|
-
several different sequences (and codons in these sequences, which
|
6373
|
-
ultimately means that you can have different DNA sequences code for
|
6374
|
-
the very same aminoacid chain).
|
6375
|
-
|
6376
|
-
Usually the third base of a codon has the least influence on
|
6377
|
-
codon meaning. This is also called <b>wobbling</b> - since
|
6378
|
-
the anticodon loop on the tRNA is in the reverse direction,
|
6379
|
-
and the wobble position refers to the tRNA, this means that
|
6380
|
-
the wobble-position is at the 5'-end of the tRNA anticodon.
|
6381
|
-
|
6382
|
-
Now a few words about functionality related to codons and codon
|
6383
|
-
usage in the Bioroebe project.
|
6384
|
-
|
6385
|
-
Say that you have a long DNA sequence; let's pick a sample
|
6386
|
-
for now, such as:
|
6387
|
-
|
6388
|
-
ATGGGCGGGGTGATGGCAATGATGCCCCCGATGATG
|
6389
|
-
|
6390
|
-
You can analyze the codons used via class **ShowCodonUsage**:
|
6391
|
-
|
6392
|
-
show_codon_usage ATGGGCGGGGTGATGGCAATGATGCCCCCGATGATG
|
6393
|
-
|
6394
|
-
This class can be found at <b>bioroebe/codons/show_codon_usage.rb</b>.
|
6395
|
-
It will report the top 5 codons in use and also output the
|
6396
|
-
frequency hash on the commandline.
|
6397
|
-
|
6398
|
-
You can use this from ruby too, via this toplevel method:
|
6399
|
-
|
6400
|
-
Bioroebe.codon_frequencies_of_this_sequence(ARGV)
|
6401
|
-
|
6402
|
-
If you want to look at the actual codon frequencies used
|
6403
|
-
by different organisms, have a look here:
|
6404
|
-
|
6405
|
-
http://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=11076&aa=9&style=N
|
6406
|
-
|
6407
|
-
This is an excellent resource.
|
6408
|
-
|
6409
|
-
## Determining the frequencies of aminoacids in a given aminoacid (protein) sequence
|
6410
|
-
|
6411
|
-
If you quickly wish to determine the aminoacid composition, as a
|
6412
|
-
Hash, you can use **bin/aminoacid_frequencies**.
|
6413
|
-
|
6414
|
-
Example from the commandline for this:
|
6415
|
-
|
6416
|
-
aminoacid_frequencies MVTDEGAIYFTKDAARNWKAAVEETVSATLNRTVSSGITGASYYTGTFST
|
6417
|
-
|
6418
|
-
Example from within bioroebe itself (and thus ruby):
|
6419
|
-
|
6420
|
-
require 'bioroebe/frequencies.rb'
|
6421
|
-
|
6422
|
-
Bioroebe.aminoacid_frequencies('MVTDEGAIYFTKDAARNWKAAVEETVSATLNRTVSSGITGASYYTGTFST')
|
6423
|
-
|
6424
|
-
The latter will return a Hash that you can then make further use of, such as:
|
6425
|
-
|
6426
|
-
{"M"=>1, "V"=>4, "T"=>9, "D"=>2, "E"=>3, "G"=>4, "A"=>7, "I"=>2, "Y"=>3, "F"=>2, "K"=>2, "R"=>2, "N"=>2, "W"=>1, "S"=>5, "L"=>1}
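
A Hash of this shape can also be produced with plain Ruby, which may help when Bioroebe is not available. The helper `aminoacid_tally` is a made-up name for this sketch and is not claimed to be how Bioroebe.aminoacid_frequencies() works internally; it relies on `Enumerable#tally`, which needs Ruby 2.7 or newer.

```ruby
# Hypothetical helper: count how often each one-letter aminoacid occurs.
# Keys appear in order of first occurrence in the sequence.
def aminoacid_tally(sequence)
  sequence.upcase.each_char.tally
end

counts = aminoacid_tally('MVTDEGAIYFTKDAARNWKAAVEETVSATLNRTVSSGITGASYYTGTFST')
p counts
```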
|
6427
|
-
|
6428
5875
|
## The Levenshtein distance
|
6429
5876
|
|
6430
5877
|
The <b>Levenshtein distance</b> - also called a '**string metric**' - was formulated
|
@@ -6842,6 +6289,34 @@ change A: teal or C: slateblue to some other colour; these are HTML
|
|
6842
6289
|
colours, so it is recommended to use the names of these HTML
|
6843
6290
|
colours).
|
6844
6291
|
|
6292
|
+
In <b>July 2022</b> the method <b>Bioroebe.colourize_this_fasta_sequence</b>
|
6293
|
+
was extended slightly. You can now attach a "ruler" to the output, that
|
6294
|
+
is a numbered series that shows the nucleotide position, on the commandline.
|
6295
|
+
|
6296
|
+
Example for this:
|
6297
|
+
|
6298
|
+
puts Bioroebe.colourize_this_fasta_sequence('ATGAAATCGCGCGTGCCGCGCGCGC'\
|
6299
|
+
'GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCTGCGCGCGCGCGCGCGCGCGCG'\
|
6300
|
+
'TGCCGCGCGCAGGCGGCGGCGGCGGCGGCGGCG'
|
6301
|
+
) { :with_ruler }
|
6302
|
+
|
6303
|
+
By default this will use a white colour on black background. If you want to
|
6304
|
+
modify the foreground colour you can pass the colour name to the method,
|
6305
|
+
such as via:
|
6306
|
+
|
6307
|
+
puts Bioroebe.colourize_this_fasta_sequence('ATGAAATCGCGCGTGCCGCGCGCGC'\
|
6308
|
+
'GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCTGCGCGCGCGCGCGCGCGCGCG'\
|
6309
|
+
'TGCCGCGCGCAGGCGGCGGCGGCGGCGGCGGCG'
|
6310
|
+
) { :with_ruler_steelblue_colour }
|
6311
|
+
|
6312
|
+
The following image shows how this can be used on the commandline:
|
6313
|
+
|
6314
|
+
<img src="https://i.imgur.com/ucVEVnK.png" style="margin: 1em; border: 3px solid black">
|
6315
|
+
|
6316
|
+
At a later time this may be extended to allow for use in a webpage,
|
6317
|
+
that is to embed these strings directly into HTML or .php or
|
6318
|
+
.cgi.
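
To make the idea of a "ruler" concrete, here is a small plain-Ruby sketch that prints position labels above a sequence. This is not the code behind Bioroebe.colourize_this_fasta_sequence; the helper name `plain_ruler` and the default step of 10 are assumptions, and no colours are involved.

```ruby
# Hypothetical helper: build a line of the given length in which every
# step-th position is labelled, with the label ending at that position.
def plain_ruler(length, step = 10)
  line = ' ' * length
  (step..length).step(step) do |position|
    label = position.to_s
    line[position - label.size, label.size] = label
  end
  line
end

sequence = 'ATGAAATCGCGCGTGCCGCGCGCGC'
puts plain_ruler(sequence.size)
puts sequence
```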
|
6319
|
+
|
6845
6320
|
If you wish to show a **chunked display** of the dataset (nucleotides
|
6846
6321
|
normally) then you can use the following API:
|
6847
6322
|
|
@@ -7365,16 +6840,6 @@ This would notify the bioshell that only nucleotides from position
|
|
7365
6840
|
51 to (including) position 3251 will be colourized, when doing another
|
7366
6841
|
"ORF?" invocation.
|
7367
6842
|
|
7368
|
-
## Longest substring
|
7369
|
-
|
7370
|
-
Within the Bioroebe::Shell you can determine the longest substring,
|
7371
|
-
including gaps, like so:
|
7372
|
-
|
7373
|
-
longest_substring? ATTATTGTT | ATTATTCTT
|
7374
|
-
|
7375
|
-
Note that this will make use of the diff-lcs gem, which uses
|
7376
|
-
the McIlroy-Hunt algorithm.
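
For readers who want to see what "longest substring, including gaps" means in practice, here is a textbook dynamic-programming sketch of the longest common subsequence. It is deliberately not the McIlroy-Hunt approach and does not use the diff-lcs gem; the method name is chosen for this example only.

```ruby
# Hypothetical helper: classic O(n*m) longest common subsequence.
def longest_common_subsequence(a, b)
  # table[i][j] holds the LCS length of a[0, i] and b[0, j].
  table = Array.new(a.size + 1) { Array.new(b.size + 1, 0) }
  a.each_char.with_index(1) do |char_a, i|
    b.each_char.with_index(1) do |char_b, j|
      table[i][j] =
        if char_a == char_b
          table[i - 1][j - 1] + 1
        else
          [table[i - 1][j], table[i][j - 1]].max
        end
    end
  end
  # Walk back through the table to recover one subsequence.
  result = String.new
  i = a.size
  j = b.size
  while i > 0 && j > 0
    if a[i - 1] == b[j - 1]
      result << a[i - 1]
      i -= 1
      j -= 1
    elsif table[i - 1][j] >= table[i][j - 1]
      i -= 1
    else
      j -= 1
    end
  end
  result.reverse
end

puts longest_common_subsequence('ATTATTGTT', 'ATTATTCTT')
```

For the two sequences from the example, the single mismatching position is skipped and the remaining eight nucleotides form the result.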
|
7377
|
-
|
7378
6843
|
## Restriction Enzymes
|
7379
6844
|
|
7380
6845
|
This **subsection** will eventually be expanded to explain various things about
|
@@ -8733,6 +8198,22 @@ The images that can be generated via this may look as follows:
|
|
8733
8198
|
|
8734
8199
|
<img src="https://i.imgur.com/fWwD1fj.png" style="margin: 1em; margin-left: 2em">
|
8735
8200
|
|
8201
|
+
Let's look at another example.
|
8202
|
+
|
8203
|
+
Say you input the following sequences there:
|
8204
|
+
|
8205
|
+
AGVV
|
8206
|
+
AGVV
|
8207
|
+
AGVV
|
8208
|
+
AGVV
|
8209
|
+
AGGV
|
8210
|
+
AGGV
|
8211
|
+
AGGV
|
8212
|
+
|
8213
|
+
The resulting image that is generated is:
|
8214
|
+
|
8215
|
+
<img src="https://i.imgur.com/3wWApIQ.png" style="margin: 1em; margin-left: 2em">
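
The numbers behind such an image can be approximated with a short plain-Ruby sketch that tallies the residues per column. The helper `column_frequencies` is hypothetical and is not the Bioroebe code that draws the image; it assumes all input sequences have the same length.

```ruby
# Hypothetical helper: count residues per alignment column.
# Array#transpose raises an IndexError if the sequences differ in length.
def column_frequencies(sequences)
  sequences.map(&:chars).transpose.map(&:tally)
end

sequences = %w[AGVV AGVV AGVV AGVV AGGV AGGV AGGV]
column_frequencies(sequences).each_with_index do |counts, index|
  puts "position #{index + 1}: #{counts}"
end
```

For the seven sequences above, only the third position is mixed (four V and three G); the other three columns are fully conserved.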
|
8216
|
+
|
8736
8217
|
## The Kozak Sequence
|
8737
8218
|
|
8738
8219
|
The ribosome usually scans for an **AUG** codon. But there are
|
@@ -8872,85 +8353,6 @@ Usage Example:
|
|
8872
8353
|
|
8873
8354
|
pfasta insulin_mRNA.fasta --toprotein
|
8874
8355
|
|
8875
|
-
## Determining the codon frequencies from the commandline
|
8876
|
-
|
8877
|
-
In April 2022 I noticed that one use case is to show the codon
|
8878
|
-
frequencies of a given sequence - typically a nucleotide sequence.
|
8879
|
-
|
8880
|
-
For aminoacids there already was an executable, at **bin/aminoacid_frequencies**.
|
8881
|
-
So, following that logic, a new executable was added at
|
8882
|
-
**bin/codon_frequency**. This will show the Hash of the codon
|
8883
|
-
frequencies, as a String, on the commandline.
|
8884
|
-
|
8885
|
-
Usage example:
|
8886
|
-
|
8887
|
-
codon_frequency ATTCGTACGATCGACTGACTGACAGTCATTCGT
|
8888
|
-
|
8889
|
-
The output of this would be the following:
|
8890
|
-
|
8891
|
-
AUU: 2
|
8892
|
-
CGU: 2
|
8893
|
-
ACG: 1
|
8894
|
-
AUC: 1
|
8895
|
-
GAC: 1
|
8896
|
-
UGA: 1
|
8897
|
-
CUG: 1
|
8898
|
-
ACA: 1
|
8899
|
-
GUC: 1
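
The counts above can be reproduced with a few lines of plain Ruby. The helper `codon_counts` is an assumption made for this sketch and is not the code behind **bin/codon_frequency**; it reads the sequence in frame 1, converts T to U, and silently drops an incomplete trailing codon.

```ruby
# Hypothetical helper: count codons in reading frame 1, reported as RNA.
def codon_counts(dna)
  rna = dna.upcase.tr('T', 'U')
  rna.scan(/.{3}/).tally # needs Ruby 2.7+ for Enumerable#tally
end

codon_counts('ATTCGTACGATCGACTGACTGACAGTCATTCGT').each do |codon, count|
  puts "#{codon}: #{count}"
end
```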
|
8900
|
-
|
8901
|
-
## Showing the codon frequency via countcodon
|
8902
|
-
|
8903
|
-
https://www.kazusa.or.jp/codon/countcodon.html offers a rather useful
|
8904
|
-
functionality via a simple web-interface, in that you can pass in a mRNA
|
8905
|
-
sequence, and it will then show the codon frequency/likelihood of that
|
8906
|
-
sequence - all codons in that sequence, that is. This can be extended
|
8907
|
-
to all protein-coding genes in a given genome, and will thus be useful
|
8908
|
-
for a researcher who may be interested in determining the codon frequency
|
8909
|
-
in general, across all genes in that given genome.
|
8910
|
-
|
8911
|
-
You can test it with an input sequence. For instance, the following
|
8912
|
-
sequence:
|
8913
|
-
|
8914
|
-
ATTCGTACGATCGACTGACTGACAGTCATTCGTAGTACGATCGACTGACTGACAGTCATTCGTACGATCGACTGACTGACAAGTCATTCGTACGATCGACTGACTTGACAGTCATAA
|
8915
|
-
|
8916
|
-
Would yield this result:
|
8917
|
-
|
8918
|
-
fields: [triplet] [frequency: per thousand] ([number])
|
8919
|
-
|
8920
|
-
UUU 0.0( 0) UCU 0.0( 0) UAU 0.0( 0) UGU 0.0( 0)
|
8921
|
-
UUC 0.0( 0) UCC 0.0( 0) UAC 25.6( 1) UGC 0.0( 0)
|
8922
|
-
UUA 0.0( 0) UCA 25.6( 1) UAA 25.6( 1) UGA102.6( 4)
|
8923
|
-
UUG 0.0( 0) UCG 25.6( 1) UAG 0.0( 0) UGG 0.0( 0)
|
8924
|
-
|
8925
|
-
CUU 0.0( 0) CCU 0.0( 0) CAU 25.6( 1) CGU 76.9( 3)
|
8926
|
-
CUC 0.0( 0) CCC 0.0( 0) CAC 0.0( 0) CGC 0.0( 0)
|
8927
|
-
CUA 0.0( 0) CCA 0.0( 0) CAA 0.0( 0) CGA 25.6( 1)
|
8928
|
-
CUG102.6( 4) CCG 0.0( 0) CAG 25.6( 1) CGG 0.0( 0)
|
8929
|
-
|
8930
|
-
AUU 76.9( 3) ACU 25.6( 1) AAU 0.0( 0) AGU 51.3( 2)
|
8931
|
-
AUC 76.9( 3) ACC 0.0( 0) AAC 0.0( 0) AGC 0.0( 0)
|
8932
|
-
AUA 0.0( 0) ACA 76.9( 3) AAA 0.0( 0) AGA 0.0( 0)
|
8933
|
-
AUG 0.0( 0) ACG 76.9( 3) AAG 0.0( 0) AGG 0.0( 0)
|
8934
|
-
|
8935
|
-
GUU 0.0( 0) GCU 0.0( 0) GAU 25.6( 1) GGU 0.0( 0)
|
8936
|
-
GUC 51.3( 2) GCC 0.0( 0) GAC 76.9( 3) GGC 0.0( 0)
|
8937
|
-
GUA 0.0( 0) GCA 0.0( 0) GAA 0.0( 0) GGA 0.0( 0)
|
8938
|
-
GUG 0.0( 0) GCG 0.0( 0) GAG 0.0( 0) GGG 0.0( 0)
|
8939
|
-
|
8940
|
-
At any rate, the individual functionality for that is also available
|
8941
|
-
within the Bioroebe project as of **April 2022**.
|
8942
|
-
|
8943
|
-
The method that does so is:
|
8944
|
-
|
8945
|
-
Bioroebe.frequency_per_thousand
|
8946
|
-
Bioroebe.frequency_per_thousand('ATTCGTACGATCGACTGACTGACAGTCATTCGTAGTACGATCGACTGACTGACAGTCATTCGTACGATCGACTGACTGACAAGTCATTCGTACGATCGACTGACTTGACAGTCATAA') # Usage example here.
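
The "per thousand" value is simply the codon count divided by the total number of codons, multiplied by 1000. In the table above, a codon seen once among the 39 counted codons is reported as 25.6. The sketch below shows that arithmetic in plain Ruby; `codons_per_thousand` is a made-up helper and not the implementation of Bioroebe.frequency_per_thousand.

```ruby
# Hypothetical helper: codon frequency per thousand codons, frame 1,
# rounded to one decimal place. Incomplete trailing codons are ignored.
def codons_per_thousand(dna)
  codons = dna.upcase.tr('T', 'U').scan(/.{3}/)
  total = codons.size
  codons.tally.transform_values { |count| (count * 1000.0 / total).round(1) }
end

p codons_per_thousand('ATGATGAAA')
p((1 * 1000.0 / 39).round(1)) # one codon out of 39
```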
|
8947
|
-
|
8948
|
-
At a later time sinatra-bindings as well as ruby-gtk3 bindings will
|
8949
|
-
be added, and possibly ruby-libui bindings as well, for windows
|
8950
|
-
support. What is missing is support for different codon tables in
|
8951
|
-
different species, but that may be added at a later time as well
|
8952
|
-
- for now it seemed more important to offer the functionality.
|
8953
|
-
|
8954
8356
|
## class Bioroebe::Protein
|
8955
8357
|
|
8956
8358
|
**class Bioroebe::Protein** can be used to store a protein sequence.
|
@@ -9183,6 +8585,1036 @@ time being it is what it is. At a later point in time test cases
|
|
9183
8585
|
may be added to check whether it performs correctly or whether it
|
9184
8586
|
does not.
|
9185
8587
|
|
8588
|
+
The other rules, also published in 2004, are the Reynolds rules. Code
|
8589
|
+
support was added to the Bioroebe project in <b>June 2022</b>, but
|
8590
|
+
it was not tested yet, so the implementation may be incorrect.
|
8591
|
+
|
8592
|
+
## The Bioroebe::Shell interface
|
8593
|
+
|
8594
|
+
The following subsection specifically handles information
|
8595
|
+
pertaining to the <b>Bioroebe::Shell</b> interface of the
|
8596
|
+
<b>bioroebe project</b>. It is also called <b>bioshell</b>,
|
8597
|
+
to simplify spelling it.
|
8598
|
+
|
8599
|
+
### Numbers as input in the bioshell
|
8600
|
+
![alt text][cat1]
|
8601
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8602
|
+
|
8603
|
+
You can input a number in the **BioShell** such as <b style="color: darkblue">3</b>.
|
8604
|
+
|
8605
|
+
This will attempt to <b>display the first 3 nucleotides</b> of
|
8606
|
+
the assigned **main sequence**. It will only work if you have
|
8607
|
+
assigned a sequence prior to that, though.
|
8608
|
+
|
8609
|
+
Examples:
|
8610
|
+
|
8611
|
+
3
|
8612
|
+
33
|
8613
|
+
15
|
8614
|
+
|
8615
|
+
### transeq
|
8616
|
+
![alt text][cat1]
|
8617
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8618
|
+
|
8619
|
+
You can convert a DNA sequence into an aminoacid sequence by
|
8620
|
+
doing this:
|
8621
|
+
|
8622
|
+
transeq
|
8623
|
+
|
8624
|
+
### Shuffling the DNA/RNA string in the bioshell
|
8625
|
+
![alt text][cat1]
|
8626
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8627
|
+
|
8628
|
+
Via
|
8629
|
+
|
8630
|
+
shuffle
|
8631
|
+
|
8632
|
+
you can <b>randomly rearrange the main DNA/RNA string</b>
|
8633
|
+
that is used by the <b>Bioroebe::Shell</b>.
|
8634
|
+
|
8635
|
+
This can be useful if you just wish to quickly "test"
|
8636
|
+
new arrangements of the same nucleotides.
|
8637
|
+
|
8638
|
+
### Permanently disabling showing the startup-introduction of the Bioshell
|
8639
|
+
![alt text][cat1]
|
8640
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8641
|
+
|
8642
|
+
If you do not want to see the start-up intro, you can try
|
8643
|
+
any of the following:
|
8644
|
+
|
8645
|
+
bioshell --permanently-disable-startup-intro
|
8646
|
+
bioshell --permanently-disable-startup-notice
|
8647
|
+
bioshell --permanently-no-startup-intro
|
8648
|
+
bioshell --permanently-no-startup-info
|
8649
|
+
|
8650
|
+
### Longest substring
|
8651
|
+
![alt text][cat1]
|
8652
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8653
|
+
|
8654
|
+
Within the Bioroebe::Shell you can determine the longest substring,
|
8655
|
+
including gaps, like so:
|
8656
|
+
|
8657
|
+
longest_substring? ATTATTGTT | ATTATTCTT
|
8658
|
+
|
8659
|
+
Note that this will make use of the diff-lcs gem, which uses
|
8660
|
+
the McIlroy-Hunt algorithm.
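
To illustrate what "longest substring, including gaps" means, here is a minimal plain-ruby sketch of a longest-common-subsequence search via dynamic programming. It assumes that this is the quantity the bioshell reports; the method name `longest_common_subsequence` is hypothetical, and this is neither the diff-lcs implementation nor the McIlroy-Hunt algorithm itself.

```ruby
# Longest common subsequence via classic dynamic programming.
# Illustration only; the method name is hypothetical and this is
# not how the diff-lcs gem is implemented.
def longest_common_subsequence(a, b)
  # table[i][j] holds the LCS length of a[0, i] and b[0, j].
  table = Array.new(a.size + 1) { Array.new(b.size + 1, 0) }
  a.each_char.with_index(1) do |char_a, i|
    b.each_char.with_index(1) do |char_b, j|
      table[i][j] =
        if char_a == char_b
          table[i - 1][j - 1] + 1
        else
          [table[i - 1][j], table[i][j - 1]].max
        end
    end
  end
  # Walk back through the table to recover one subsequence.
  result = []
  i = a.size
  j = b.size
  while i > 0 && j > 0
    if a[i - 1] == b[j - 1]
      result.unshift(a[i - 1])
      i -= 1
      j -= 1
    elsif table[i - 1][j] >= table[i][j - 1]
      i -= 1
    else
      j -= 1
    end
  end
  result.join
end

puts longest_common_subsequence('ATTATTGTT', 'ATTATTCTT') # prints "ATTATTTT"
```

The two input strings differ only at one position (G versus C), so the shared subsequence simply skips that position.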
|
8661
|
+
|
8662
|
+
### Do not create directories on startup of the shell
|
8663
|
+
![alt text][cat1]
|
8664
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8665
|
+
|
8666
|
+
By default the <b>bioshell</b> will try to create some directories
|
8667
|
+
on startup. This may not always be desired by the user, though,
|
8668
|
+
so an option has to exist to <b>disable</b> this functionality.
|
8669
|
+
|
8670
|
+
Internally the variable @internal_hash[:create_directories_on_startup_of_the_shell]
|
8671
|
+
keeps track of whether directories on startup of the shell will
|
8672
|
+
be created.
|
8673
|
+
|
8674
|
+
To disable this behaviour on startup of the bioshell, try
|
8675
|
+
something like this:
|
8676
|
+
|
8677
|
+
bioshell --do-not-create-directories-on-startup
|
8678
|
+
bioshell --do-not-create-directories
|
8679
|
+
|
8680
|
+
### Generating and assigning a random amount of nucleotides
|
8681
|
+
![alt text][cat1]
|
8682
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8683
|
+
|
8684
|
+
Via:
|
8685
|
+
|
8686
|
+
random 555
|
8687
|
+
|
8688
|
+
you can "generate" 555 random nucleotides (DNA that is) and
|
8689
|
+
assign it to the main sequence in use by the bioshell. This
|
8690
|
+
is mostly a convenience feature, if you want to debug something
|
8691
|
+
quickly.
|
8692
|
+
|
8693
|
+
### Determining the log directory for the Bioroebe::Shell component
|
8694
|
+
![alt text][cat1]
|
8695
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8696
|
+
|
8697
|
+
Via:
|
8698
|
+
|
8699
|
+
bioshell_log_dir?
|
8700
|
+
|
8701
|
+
you can determine the log-directory output for the bioshell
|
8702
|
+
component. On my home system this will default to
|
8703
|
+
<b>/home/Temp/bioroebe/bioshell/</b>.
|
8704
|
+
|
8705
|
+
### Prompt (the shell prompt of the bioshell)
|
8706
|
+
![alt text][cat1]
|
8707
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8708
|
+
|
8709
|
+
You can set a <b>custom prompt</b> in the bioshell, via
|
8710
|
+
the keywords "<b>prompt</b>" or "<b>set_prompt</b>".
|
8711
|
+
|
8712
|
+
To display the <b>current working directory</b>, do:
|
8713
|
+
|
8714
|
+
prompt pwd
|
8715
|
+
|
8716
|
+
To revert to the old default again, do this:
|
8717
|
+
|
8718
|
+
prompt REVERT
|
8719
|
+
prompt revert
|
8720
|
+
prompt DEFAULT
|
8721
|
+
prompt default
|
8722
|
+
|
8723
|
+
If you do not want to set any prompt, do:
|
8724
|
+
|
8725
|
+
prompt none
|
8726
|
+
|
8727
|
+
### Random stuff - generating random DNA sequences in the bioshell
|
8728
|
+
![alt text][cat1]
|
8729
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8730
|
+
|
8731
|
+
You can <b>generate random DNA sequences</b> in the
|
8732
|
+
<b>bioshell</b> via:
|
8733
|
+
|
8734
|
+
random dna 20
|
8735
|
+
random dna 25
|
8736
|
+
random dna 30
|
8737
|
+
# or simpler
|
8738
|
+
random 20
|
8739
|
+
random 25
|
8740
|
+
random 30
|
8741
|
+
|
8742
|
+
This will generate random DNA sequences, with a length
|
8743
|
+
of 20, 25, 30, respectively. This may not be very useful
|
8744
|
+
but it was important that this functionality is made
|
8745
|
+
available somewhere. Sometimes you may not even care
|
8746
|
+
about the sequence and just use a "filler" sequence,
|
8747
|
+
so randomness has to be part of the Bioroebe project
|
8748
|
+
as well.
|
8749
|
+
|
8750
|
+
You can also use some toplevel-methods to generate, e. g.
|
8751
|
+
20 random aminoacids. Have a look at the following
|
8752
|
+
<b>toplevel API</b>:
|
8753
|
+
|
8754
|
+
Bioroebe.random_aminoacid? 20 # => "UAVHYQQESWUYAOVESEIY"
|
8755
|
+
|
8756
|
+
Note that there may exist other APIs within the Bioroebe project
|
8757
|
+
that do the same as well.
|
8758
|
+
|
8759
|
+
If you would like to use a ruby-gtk3 widget have a look
|
8760
|
+
at **RandomSequence**, under **bioroebe/gtk3/random_sequence/**.
|
8761
|
+
It works with aminoacids, DNA and RNA, and allows the user to
|
8762
|
+
create random sequences. (If you need weighted randomness then
|
8763
|
+
you currently have to use the commandline variant. Perhaps I may
|
8764
|
+
add support into the GUI directly for this one day.)
|
8765
|
+
|
8766
|
+
### Deprecations within the Bioroebe::Shell
|
8767
|
+
![alt text][cat1]
|
8768
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8769
|
+
|
8770
|
+
Over the years the Bioroebe::Shell changed quite a bit.
|
8771
|
+
|
8772
|
+
This subsection here will list a few of these changes
|
8773
|
+
or rather, the deprecations.
|
8774
|
+
|
8775
|
+
**raw_sequence**: removed in June 2022 completely. It is
|
8776
|
+
simpler to handle sequences via Bioroebe::Sequence
|
8777
|
+
instead.
|
8778
|
+
|
8779
|
+
<b>@internal_hash[:array_sequences]</b> was no longer in
|
8780
|
+
use, so it was removed in July 2022.
|
8781
|
+
|
8782
|
+
### Chop off nucleotides within the Bioroebe::Shell
|
8783
|
+
![alt text][cat1]
|
8784
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8785
|
+
|
8786
|
+
You can use the following syntax to chop away until you find
|
8787
|
+
a particular substring, in the bioshell:
|
8788
|
+
|
8789
|
+
chop_to ATG
|
8790
|
+
|
8791
|
+
This functionality was specifically added to find the first
|
8792
|
+
ATG codon.
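
The idea behind chop_to can be sketched in a few lines of plain ruby. This is only an illustration under two assumptions: that the matched substring itself is kept, and that the sequence is left alone when nothing matches. The helper below is hypothetical and not the code used by the bioshell.

```ruby
# Hypothetical helper: drop everything before the first occurrence
# of +substring+. Returns the sequence unchanged when nothing matches.
def chop_to(sequence, substring = 'ATG')
  position = sequence.index(substring)
  position ? sequence[position..-1] : sequence
end

puts chop_to('CCGTAATGGCATGA') # prints "ATGGCATGA"
```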
|
8793
|
+
|
8794
|
+
### Truncating output in the bioroebe-shell
|
8795
|
+
![alt text][cat1]
|
8796
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8797
|
+
|
8798
|
+
**DNA/RNA sequences** can become very long and then become
|
8799
|
+
quite difficult to view, read and handle on the commandline.
|
8800
|
+
|
8801
|
+
Normally the bioroebe shell will truncate output of DNA sequences
|
8802
|
+
that are "too long". This is mostly done so that working with
|
8803
|
+
very long sequences becomes a bit more convenient.
|
8804
|
+
|
8805
|
+
Sometimes this can become an antifeature, though, so the user
|
8806
|
+
must be able to toggle this at his or her own discretion.
|
8807
|
+
|
8808
|
+
By default, the bioroebe-shell (bioshell) will always try
|
8809
|
+
to truncate output, but you can toggle this behaviour by
|
8810
|
+
issuing:
|
8811
|
+
|
8812
|
+
do not truncate
|
8813
|
+
|
8814
|
+
In theory, other "do not" actions are also supported, or will
|
8815
|
+
be supported in the future; right now (Oct 2019) this is a bit
|
8816
|
+
limited.
|
8817
|
+
|
8818
|
+
From the toplevel, you can use this method:
|
8819
|
+
|
8820
|
+
Bioroebe.do_not_truncate
|
8821
|
+
|
8822
|
+
The above instruction will toggle the truncate behaviour
|
8823
|
+
to not truncate, ever.
|
8824
|
+
|
8825
|
+
If you need to do so within the bioshell, this is the way:
|
8826
|
+
|
8827
|
+
no_truncate
|
8828
|
+
|
8829
|
+
Or simply
|
8830
|
+
|
8831
|
+
truncate
|
8832
|
+
|
8833
|
+
This will toggle, like a switch.
|
8834
|
+
|
8835
|
+
### Working with .pdb files in the bioshell
|
8836
|
+
![alt text][cat1]
|
8837
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8838
|
+
|
8839
|
+
This subsection only very briefly mentions how to work with
|
8840
|
+
.pdb files in the bioshell. See other parts of this
|
8841
|
+
document for a more extensive overview how you can work
|
8842
|
+
with .pdb files via the Bioroebe project.
|
8843
|
+
|
8844
|
+
If you input a file name that ends with .pdb, such as:
|
8845
|
+
|
8846
|
+
1fat.pdb
|
8847
|
+
|
8848
|
+
And if no such file currently exists at
|
8849
|
+
/home/Temp/bioroebe/pdb/1fat.pdb then it will be
|
8850
|
+
downloaded and moved towards
|
8851
|
+
**/home/Temp/bioroebe/pdb/**.
|
8852
|
+
|
8853
|
+
This feature exists just to simplify using the
|
8854
|
+
**bioshell**.
|
8855
|
+
|
8856
|
+
### Showing the stop codons in frame1, frame2 and frame3 in the bioshell
|
8857
|
+
![alt text][cat1]
|
8858
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8859
|
+
|
8860
|
+
When you have a given sequence assigned to the bioshell, such
|
8861
|
+
as via "random 99", you can then show all stop codons in
|
8862
|
+
frame1, frame2 and frame3.
|
8863
|
+
|
8864
|
+
The corresponding input for this will be:
|
8865
|
+
|
8866
|
+
stop_frame1?
|
8867
|
+
stop_frame2?
|
8868
|
+
stop_frame3?
|
8869
|
+
|
8870
|
+
An image shows this next, where we first did input "random 120",
|
8871
|
+
before issuing the above-mentioned instructions one after
|
8872
|
+
the other:
|
8873
|
+
|
8874
|
+
<img src="https://i.imgur.com/HpHF4jq.png" style="margin: 1em; border: 1px solid black">
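
For readers who want to see the logic behind these three commands, here is a minimal plain-ruby sketch. It assumes the three standard stop codons (TAA, TAG, TGA) and reports 0-based start positions; the constant and the method name are hypothetical and do not reflect the internal bioshell code.

```ruby
# The three stop codons of the standard genetic code, as DNA.
STOP_CODONS = %w[TAA TAG TGA].freeze

# Returns the 0-based start positions of all stop codons in the
# given reading frame (1, 2 or 3) of a DNA string.
# Hypothetical helper, for illustration only.
def stop_codon_positions(sequence, frame)
  offset = frame - 1
  positions = []
  # Step through the sequence three nucleotides at a time.
  offset.step(sequence.size - 3, 3) do |index|
    positions << index if STOP_CODONS.include?(sequence[index, 3])
  end
  positions
end

sequence = 'ATGTAACTGATAG'
(1..3).each do |frame|
  puts "frame#{frame}: #{stop_codon_positions(sequence, frame).inspect}"
end
```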
|
8875
|
+
|
8876
|
+
### Freezing the main sequence in the bioshell - and unfreezing it again
|
8877
|
+
![alt text][cat1]
|
8878
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8879
|
+
|
8880
|
+
You can **freeze** the BioShell, meaning that it will no longer
|
8881
|
+
allow for the main sequence to be modified, via the following
|
8882
|
+
command:
|
8883
|
+
|
8884
|
+
freeze
|
8885
|
+
|
8886
|
+
To <b>unfreeze</b> the sequence again, issue:
|
8887
|
+
|
8888
|
+
unfreeze
|
8889
|
+
|
8890
|
+
This functionality has been added because the shell may sometimes be
|
8891
|
+
quite eager to change the main sequence, so we needed a way to
|
8892
|
+
disable any further modifications (until "unfreeze" is issued
|
8893
|
+
that is).
|
8894
|
+
|
8895
|
+
## Support for other programming languages
|
8896
|
+
|
8897
|
+
The main programming language for the bioroebe project is **ruby**.
|
8898
|
+
Ruby, from a language design point of view, is a great programming
|
8899
|
+
language - not necessarily all of ruby, but the subset that I use.
|
8900
|
+
It is very easy to quickly prototype ideas via ruby.
|
8901
|
+
|
8902
|
+
However, ruby is known to **not** be among the fastest programming
|
8903
|
+
languages on this planet; so, it makes sense to use other
|
8904
|
+
languages too from this point of view. Additionally there are some
|
8905
|
+
software stacks in use in **other** programming languages, such as
|
8906
|
+
matplotlib and various more.
|
8907
|
+
|
8908
|
+
Thus, it is important to **support other programming languages** as
|
8909
|
+
well, if there are useful libraries. The bioroebe project, after
|
8910
|
+
all, tries to be **practical**: it focuses on getting things done,
|
8911
|
+
no matter the language.
|
8912
|
+
|
8913
|
+
This means that support for other programming languages can be
|
8914
|
+
found in this project as well, often using system() or similar
|
8915
|
+
functionality to tap into these other programming languages. Do
|
8916
|
+
not be surprised when that happens - the bioroebe project will
|
8917
|
+
also try to act as a **practical glue** towards functionality
|
8918
|
+
enabled via other projects. We want to get things done, no
|
8919
|
+
matter the programming language at hand!
|
8920
|
+
|
8921
|
+
Whenever possible, though, the bioroebe project will try to be
|
8922
|
+
flexible in this regard, so ideally the same solution should
|
8923
|
+
work for many different programming languages.
|
8924
|
+
|
8925
|
+
While Ruby is the primary language for this project, as
|
8926
|
+
of 2021 I will try to officially support **java**, **jruby**
|
8927
|
+
and the **GraalVM**. This is on my TODO list, though - stay
|
8928
|
+
tuned for more updates in this regard. See also the
|
8929
|
+
subsection <b>Support for Python</b>.
|
8930
|
+
|
8931
|
+
## Support for Python
|
8932
|
+
|
8933
|
+
In <b>June 2022</b> I decided to add support for Python to bioroebe.
|
8934
|
+
|
8935
|
+
While people can - and should - easily use <b>biopython</b> instead,
|
8936
|
+
I simply wanted to see how much python-support I can add to
|
8937
|
+
bioroebe. This may lag behind some years compared to biopython,
|
8938
|
+
but I wanted to extend python support as well, so there you go.
|
8939
|
+
It is simply an additional option for the bioroebe project.
|
8940
|
+
<b>Ruby</b> will remain the primary language for the project,
|
8941
|
+
though, at the least for now.
|
8942
|
+
|
8943
|
+
## Bioroebe::ProfilePattern
|
8944
|
+
|
8945
|
+
This class can be used to generate nucleotide sequences that
|
8946
|
+
are not quite "random". For example, to generate sequences
|
8947
|
+
that may "simulate" a TATA box.
|
8948
|
+
|
8949
|
+
The idea for this class is to be extended into allowing
|
8950
|
+
HMMs (Hidden Markov Models) one day.
|
8951
|
+
|
8952
|
+
Usage example:
|
8953
|
+
|
8954
|
+
_ = Bioroebe::ProfilePattern.new(ARGV, :do_not_run_yet)
|
8955
|
+
_.generate_sequence_based_on_this_profile
|
8956
|
+
|
8957
|
+
Such a profile specifies the preferred sequence
|
8958
|
+
letters for each position in a section of DNA. You have to provide
|
8959
|
+
the Hash to the method generate_sequence_based_on_this_profile() -
|
8960
|
+
or you use the default Hash, which is stored in the constant
|
8961
|
+
called **PER_POSITION_HASH**.
|
8962
|
+
|
8963
|
+
That profile should be a Hash, with keys pointing to A, T, C, G
|
8964
|
+
and the values being an Array of likelihoods, each given
|
8965
|
+
as a number, such as 140. These values are also called
|
8966
|
+
**scores**. Each score contains a number for each position
|
8967
|
+
that indicates how likely it is to find the given
|
8968
|
+
nucleotide at that location.
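
To make the shape of such a profile more concrete, here is a minimal plain-ruby sketch. The scores are made up for illustration and are not the values stored in PER_POSITION_HASH; the constant `TATA_LIKE_PROFILE` and the method `sequence_from_profile` are hypothetical and not part of the class.

```ruby
# Hypothetical profile: one score per position for each nucleotide.
# These numbers are invented; they are not PER_POSITION_HASH.
TATA_LIKE_PROFILE = {
  'A' => [ 10, 140,  10, 140],
  'T' => [140,  10, 140,  10],
  'C' => [  5,   5,   5,   5],
  'G' => [  5,   5,   5,   5]
}.freeze

# Picks one nucleotide per position, weighted by the scores there.
def sequence_from_profile(profile, rng = Random.new)
  length = profile.values.first.size
  (0...length).map do |position|
    total = profile.sum { |_nucleotide, scores| scores[position] }
    ticket = rng.rand(total) # integer in 0...total
    profile.each do |nucleotide, scores|
      ticket -= scores[position]
      break nucleotide if ticket < 0
    end
  end.join
end

puts sequence_from_profile(TATA_LIKE_PROFILE) # most often "TATA"
```

A higher score at a position simply means more "tickets" for that nucleotide, so the output is biased but still random.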
|
8969
|
+
|
8970
|
+
You can also use this class to generate a random DNA string,
|
8971
|
+
similar to the method called
|
8972
|
+
**Bioroebe.generate_random_dna_sequence()**. The difference
|
8973
|
+
is that class ProfilePattern allows for a bit more fine-tuned
|
8974
|
+
control. The class will likely be extended in the future too.
|
8975
|
+
|
8976
|
+
## Generate DNA via Bioroebe.random_dna
|
8977
|
+
|
8978
|
+
You can "generate" random DNA strings by making use of the
|
8979
|
+
following code:
|
8980
|
+
|
8981
|
+
x = Bioroebe.random_dna 50 # => "AGACATCCGGCTTGGATACCTCATAAGTCATATCAGCATCGTCGGACATT"
|
8982
|
+
|
8983
|
+
As can be seen in the example above, after the #, a String will be
|
8984
|
+
returned representing that nucleotide sequence. In the case above
|
8985
|
+
it'll be 50 nucleotides in length.
|
8986
|
+
|
8987
|
+
The number given to <b>.random_dna()</b> tells the method how many
|
8988
|
+
nucleotides should be generated.
|
8989
|
+
|
8990
|
+
The method accepts a second argument, which should be a Hash.
|
8991
|
+
If it is a hash then the generated DNA will be based on the
|
8992
|
+
**probabilities** given to that Hash.
|
8993
|
+
|
8994
|
+
Let's look at a specific example here:
|
8995
|
+
|
8996
|
+
Bioroebe.random_dna(50, { A: 10, T: 10, C: 10, G: 70}) # => "GGGGTGGGGAGGGTATGCGGAGGAAGGGCGGGAAGGGCGGGGGCTGGGCG"
|
8997
|
+
|
8998
|
+
As you can see, in the Hash defined above, the likelihood for
|
8999
|
+
incorporating a Guanine is much higher than for Adenine
|
9000
|
+
(70 : 10). This will be reflected in the generated DNA
|
9001
|
+
sequence which, as can be seen, contains many more
|
9002
|
+
Guanines than Adenines.
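
How such a weighted choice may work can be sketched in plain ruby. This is only an illustration of the technique, assuming integer weights; `weighted_random_dna` is a hypothetical stand-in and not the implementation behind Bioroebe.random_dna.

```ruby
# Hypothetical stand-in for a weighted random DNA generator.
# Weights are assumed to be non-negative integers.
def weighted_random_dna(length, weights = { A: 25, T: 25, C: 25, G: 25 }, rng = Random.new)
  # Expand the weights into a pool, e.g. { A: 1, G: 3 } => %w[A G G G].
  pool = weights.flat_map { |nucleotide, weight| [nucleotide.to_s] * weight }
  # Draw +length+ nucleotides from the pool, with replacement.
  Array.new(length) { pool[rng.rand(pool.size)] }.join
end

puts weighted_random_dna(50, { A: 10, T: 10, C: 10, G: 70 })
```

Expanding the weights into a pool is wasteful for very large weights, but it keeps the sketch short and easy to follow.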
|
9003
|
+
|
9004
|
+
There is yet a third use case for the above. If you pass a **String**
|
9005
|
+
as the second argument rather than a Hash, then that String will be
|
9006
|
+
used as basis for generating the DNA string at hand.
|
9007
|
+
|
9008
|
+
Again, let's look at a specific example here:
|
9009
|
+
|
9010
|
+
Bioroebe.random_dna(10, 'ATCGATCGGG')
|
9011
|
+
|
9012
|
+
Here we add more G than A, T or C, so the new DNA sequence should
|
9013
|
+
contain more G than the other nucleotides as well.
|
9014
|
+
|
9015
|
+
More usage examples in this regard:
|
9016
|
+
|
9017
|
+
Bioroebe.random_dna(20, 'ATGGGGGGGG') # => "TGAGGGGGGGGGTGGGAGGG"
|
9018
|
+
Bioroebe.random_dna(20, 'ATGGGGGGGG') # => "GGTAGGGGGGGGTAGGGGGG"
|
9019
|
+
|
9020
|
+
Note that this is similar to the .randomize() method in the bioruby
|
9021
|
+
project:
|
9022
|
+
|
9023
|
+
hash = {'a'=>1,'c'=>2,'g'=>3,'t'=>4}
|
9024
|
+
puts Bio::Sequence::NA.randomize(hash) # => "ggcttgttac" (for example)
|
9025
|
+
|
9026
|
+
## Generating a random nucleotide sequence based on frequencies
|
9027
|
+
|
9028
|
+
If you ever need to generate a nucleotide sequence based on given frequencies then you can use
|
9029
|
+
the following method:
|
9030
|
+
|
9031
|
+
Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies
|
9032
|
+
Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies 100
|
9033
|
+
Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies 500
|
9034
|
+
|
9035
|
+
## Parsing genbank (.gbk) files
|
9036
|
+
|
9037
|
+
You could use class <b>Bioroebe::GenbankParser</b> to parse .gbk files, at
|
9038
|
+
the least if you want to obtain the raw sequence, in FASTA format.
|
9039
|
+
|
9040
|
+
Example for this:
|
9041
|
+
|
9042
|
+
require 'bioroebe/genbank/genbank_parser.rb'
|
9043
|
+
result = Bioroebe::GenbankParser.new('/home/Temp/bioroebe/ls_orchid.gbk')
|
9044
|
+
result.dataset? # This method call will return the FASTA sequence.
|
9045
|
+
|
9046
|
+
Note that this currently (<b>July 2022</b>) only grabs one entry. In
|
9047
|
+
the upcoming rewrite in the future the parser will be able to parse
|
9048
|
+
all entries, and then present them to the user. Stay tuned in this
|
9049
|
+
regard.
|
9050
|
+
|
9051
|
+
## Parsers in general
|
9052
|
+
|
9053
|
+
The bioroebe project will store most parsers in the parsers/ subdirectory
|
9054
|
+
as of <b>July 2022</b>.
|
9055
|
+
|
9056
|
+
Prior to that date different parsers were stored in different subdirectories,
|
9057
|
+
such as the parser for genbank-files being stored in the genbank/
|
9058
|
+
subdirectory. As I found this situation confusing, I settled for
|
9059
|
+
the parsers/ subdirectory as of <b>July 2022</b>.
|
9060
|
+
|
9061
|
+
## Coomassie staining of proteins
|
9062
|
+
|
9063
|
+
Coomassie staining is typically done on proteins, giving them a blue
|
9064
|
+
or blueish colour. <b>Coomassie staining</b> is <b>the most popular
|
9065
|
+
anionic protein dye</b>.
|
9066
|
+
|
9067
|
+
This may look like this:
|
9068
|
+
|
9069
|
+
<img src="https://i.imgur.com/6eUN7HR.png" style="margin: 1em; border: 1px solid black">
|
9070
|
+
|
9071
|
+
This picture shows five different bands. The molecular weight of the
|
9072
|
+
marker can be seen on the very left hand side, in <b>kDa</b>. The
|
9073
|
+
larger fragments can be seen on top, so the farther the band has
|
9074
|
+
moved, the smaller the fragment must be (in kDa). That means that
|
9075
|
+
the larger proteins can be found on top; the smaller proteins on
|
9076
|
+
the bottom.
|
9077
|
+
|
9078
|
+
Some bands are missing, and this gives information - that is
|
9079
|
+
that a particular protein is missing. Probably it was not
|
9080
|
+
synthesized in the given tissue at hand.
|
9081
|
+
|
9082
|
+
The staining for a Coomassie Blue stain is typically done
|
9083
|
+
via G-250, with a 0.5% concentration prepared in
|
9084
|
+
50% methanol and 10% acetic acid. The staining duration is
|
9085
|
+
usually 5 minutes.
|
9086
|
+
|
9087
|
+
Note that the G-250 stain is the dimethyl derivative of
|
9088
|
+
R-250 - the <b>R</b> stands for <b>red</b> or <b>reddish</b>.
|
9089
|
+
Both dyes will bind via electrostatic interaction with <b>protonated
|
9090
|
+
basic amino acids</b>: that is <b>lysine</b>, <b>arginine</b>,
|
9091
|
+
and <b>histidine</b>. They can also bind via hydrophobic
|
9092
|
+
associations to aromatic residues.
|
9093
|
+
|
9094
|
+
Coomassie stains are in principle reversible. They are not
|
9095
|
+
as sensitive as silver staining, but significantly cheaper,
|
9096
|
+
which is one reason why they have become so popular.
|
9097
|
+
|
9098
|
+
Not every protein has all aminoacids, so staining may be difficult.
|
9099
|
+
For instance, the <b>glycomacropeptide</b> is the only known
|
9100
|
+
naturally occurring protein that contains no Phe (Phenylalanine; F).
|
9101
|
+
|
9102
|
+
A protein that lacks lysine, arginine, histidine or aromatic
|
9103
|
+
amino acids may be undetectable via Coomassie staining. However,
|
9104
|
+
this does not seem to be a universal rule; some groups report
|
9105
|
+
that they even managed to stain "unstainable" proteins via
|
9106
|
+
Coomassie staining.
|
9107
|
+
|
9108
|
+
The paper at https://www.jbc.org/article/S0021-9258(17)39198-6/pdf,
|
9109
|
+
titled "Why Does Coomassie Brilliant Blue R Interact Differently
|
9110
|
+
with Different Proteins?" and published in the year 1985, tries
|
9111
|
+
to give some explanations to different groups yielding different
|
9112
|
+
results via Coomassie staining.
|
9113
|
+
|
9114
|
+
They specifically point out that "there is a striking correlation
|
9115
|
+
between intensity of response to Coomassie dyes and the basicity
|
9116
|
+
of a protein which depends on the number of lysine, histidine,
|
9117
|
+
and arginine residues, as well as the NH₂-terminal amino group"
|
9118
|
+
(aka the aminoterminus of the protein at hand). The concluding
|
9119
|
+
remark from that paper is that <b>"Coomassie R Interacts
|
9120
|
+
Differently with Different Proteins"</b>.
|
9121
|
+
|
9122
|
+
On class <b>Bioroebe::Protein</b> you can determine whether
|
9123
|
+
a given protein can be stained via coomassie through the
|
9124
|
+
following method:
|
9125
|
+
|
9126
|
+
.can_be_stained_via_coomassie?
|
9127
|
+
|
9128
|
+
This isn't an ideal check, so don't rely on it. It will simply
|
9129
|
+
check whether the sequence has at the least one lysine,
|
9130
|
+
or one histidine, or one arginine, or any of the aromatic
|
9131
|
+
amino acids.
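
The check described above can be sketched in plain ruby as follows. This assumes one-letter aminoacid codes and treats F, W and Y as the aromatic residues; the constant and method name are hypothetical and this is not the code of Bioroebe::Protein.

```ruby
# One-letter codes: K, R, H (basic) and F, W, Y (aromatic).
COOMASSIE_RELEVANT_RESIDUES = 'KRHFWY'

# Hypothetical helper: true if at least one relevant residue occurs.
def stainable_via_coomassie?(protein_sequence)
  protein_sequence.upcase.count(COOMASSIE_RELEVANT_RESIDUES) > 0
end

puts stainable_via_coomassie?('MVTDEGAIYF') # prints true (contains Y and F)
puts stainable_via_coomassie?('GASGAS')     # prints false
```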
|
9132
|
+
|
9133
|
+
## Codon Usage
|
9134
|
+
|
9135
|
+
This **section** deals with some aspects of **codon usage** in different
|
9136
|
+
organisms.
|
9137
|
+
|
9138
|
+
Let us first define the term <b>codon usage</b> so we can base any further
|
9139
|
+
analysis on this definition. In order to do so, we also have to define
|
9140
|
+
what a <b>codon</b> is, so let's start with that actually.
|
9141
|
+
|
9142
|
+
A <span style="color: darkgreen; font-weight: bold">codon</span> is
|
9143
|
+
essentially the basic code used in DNA to denote which particular
|
9144
|
+
**aminoacid** corresponds to these (three) nucleotide base pairs.
|
9145
|
+
A codon is thus <b>a series of three nucleotides</b>, also called
|
9146
|
+
a <b>triplet</b>, such as <b>ATG</b>.
|
9147
|
+
|
9148
|
+
When we use the term <b>base pairs</b>, we refer to **double-stranded DNA**,
|
9149
|
+
abbreviated as <b>dsDNA</b>. The codon is, however, only found
|
9150
|
+
in a single stranded molecule, even within dsDNA. Since some parts of
|
9151
|
+
a **dsDNA** in any given genome give rise to a, more or less, complementary
|
9152
|
+
copy into **mRNA**, the codons that are actually used, are found in the
|
9153
|
+
corresponding mRNA as well, excluding the codon that codes for a stop
|
9154
|
+
signal (a so-called <b>stop codon</b>). (Remember that mRNA differs from
|
9155
|
+
DNA in that there will be Uracil rather than Thymine; otherwise it is
|
9156
|
+
the same, sequence-wise. Of course it uses another sugar (Ribose), but
|
9157
|
+
remember we are here mostly interested in the **information-containing
|
9158
|
+
part**, not the full chemical structure.)
|
9159
|
+
|
9160
|
+
The <b>codon</b> is thus found on the mRNA and since mRNA is mostly
|
9161
|
+
single-stranded, the codon is a component of the mRNA. The two subunits
|
9162
|
+
of the ribosome are assembled on a mRNA, at the least in prokaryotes (or
|
9163
|
+
more accurately, the smaller subunit scans along the mRNA until it
|
9164
|
+
<b>detects</b> a start codon). Mind you, this subsection will not go
|
9165
|
+
into all relevant details, so just keep in mind that the codon is the
|
9166
|
+
part that will eventually be "<i>translated</i>" at the ribosome into
|
9167
|
+
a corresponding aminoacid, excluding stop codons at the end.
|
9168
|
+
|
9169
|
+
Now - different organisms use **different frequencies of codons**.
|
9170
|
+
<b style="color:darkblue">Codon usage</b> thus describes the fact
|
9171
|
+
that many proteins in these different organisms make use of certain
|
9172
|
+
codons with a **substantially higher frequency than other codons**.
|
9173
|
+
We can use statistics to infer this on a global (proteome)
|
9174
|
+
level too.
|
9175
|
+
|
9176
|
+
Remember that the genetic code is **degenerate**, meaning that
|
9177
|
+
you have a few aminoacids that are encoded only by one codon
|
9178
|
+
(<b>Tryptophan</b> and <b>Methionine</b>), whereas the other
|
9179
|
+
aminoacids are encoded by more than one codon - thus, at the
|
9180
|
+
very least two codons. Note that the latter codons, if they
|
9181
|
+
code for the **same** aminoacid, are also called
|
9182
|
+
<b style="font-style: italic">synonymous codons</b>.
|
9183
|
+
|
9184
|
+
This means that if you have any given aminoacid chain, you can have
|
9185
|
+
several different sequences that would yield the very same
|
9186
|
+
amino acid chain (and codons in these sequences, which
|
9187
|
+
ultimately means that you can have different DNA sequences
|
9188
|
+
code for the very same aminoacid chain).
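
As a small worked example of this degeneracy, the following plain-ruby sketch counts how many different coding sequences could encode a given aminoacid chain, assuming the standard genetic code and ignoring the stop codon. The constant and method name are hypothetical and not part of the Bioroebe API.

```ruby
# Number of codons per aminoacid in the standard genetic code
# (61 sense codons in total).
CODONS_PER_AMINOACID = {
  'M' => 1, 'W' => 1,
  'F' => 2, 'Y' => 2, 'H' => 2, 'Q' => 2, 'N' => 2,
  'K' => 2, 'D' => 2, 'E' => 2, 'C' => 2,
  'I' => 3,
  'V' => 4, 'P' => 4, 'T' => 4, 'A' => 4, 'G' => 4,
  'L' => 6, 'S' => 6, 'R' => 6
}.freeze

# Multiplies the number of synonymous codons of every residue.
def number_of_coding_sequences(peptide)
  peptide.each_char.reduce(1) do |product, aminoacid|
    product * CODONS_PER_AMINOACID.fetch(aminoacid)
  end
end

puts number_of_coding_sequences('MW')   # prints 1
puts number_of_coding_sequences('MLSR') # prints 216
```

Even a short chain of four aminoacids can thus be encoded by hundreds of different DNA sequences.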
|
9189
|
+
|
9190
|
+
Usually the third base of a codon has the least influence on
|
9191
|
+
codon meaning. This is also called <b>wobbling</b> - since
|
9192
|
+
the anticodon loop on the tRNA is in the reverse direction,
|
9193
|
+
and the wobble position refers to the tRNA, this means that
|
9194
|
+
the wobble-position is at the 5'-end of the tRNA anticodon.
|
9195
|
+
|
9196
|
+
Now a few words about functionality related to codons and codon
|
9197
|
+
usage in the Bioroebe project.
|
9198
|
+
|
9199
|
+
Say that you have a long DNA sequence; let's pick a sample
|
9200
|
+
for now, such as:
|
9201
|
+
|
9202
|
+
ATGGGCGGGGTGATGGCAATGATGCCCCCGATGATG
|
9203
|
+
|
9204
|
+
You can analyze the codons used via class **ShowCodonUsage**
|
9205
|
+
and the corresponding entry at <b>bin/show_codon_usage</b>:
|
9206
|
+
|
9207
|
+
show_codon_usage ATGGGCGGGGTGATGGCAATGATGCCCCCGATGATG
|
9208
|
+
|
9209
|
+
This class can be found at <b>bioroebe/codons/show_codon_usage.rb</b>.
|
9210
|
+
It will report the top 5 codons in use and also output the
|
9211
|
+
frequency hash on the commandline.
|
9212
|
+
|
9213
|
+
On my computer at home the output it yields via the commandline,
|
9214
|
+
on a KDE konsole terminal, looks like this:
|
9215
|
+
|
9216
|
+
<img src="https://i.imgur.com/h55Thdu.png" style="margin: 1em; border: 3px solid black">
|
9217
|
+
|
9218
|
+
You can use this from within ruby code too, via the following
|
9219
|
+
toplevel method:
|
9220
|
+
|
9221
|
+
Bioroebe.codon_frequencies_of_this_sequence(ARGV)
|
9222
|
+
|
9223
|
+
To get the hash of the codon frequencies you can use the .hash? method:
|
9224
|
+
|
9225
|
+
hash = Bioroebe.codon_frequencies_of_this_sequence('ATGGGCGGGGTGATGGCAATGATGCCCCCGATGATG').hash?
|
9226
|
+
|
9227
|
+
If you want to look at the actual codon frequencies used
|
9228
|
+
by different organisms, have a look here:
|
9229
|
+
|
9230
|
+
http://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=11076&aa=9&style=N
|
9231
|
+
|
9232
|
+
This is an excellent resource.
|
9233
|
+
|
9234
|
+
For instance, the <i>E. coli</i> K strain can be found here:
|
9235
|
+
|
9236
|
+
https://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=83333&aa=9&style=N
|
9237
|
+
|
9238
|
+
## Determining the frequencies of aminoacids in a given aminoacid (protein) sequence
|
9239
|
+
|
9240
|
+
If you quickly wish to determine the aminoacid composition, as a
|
9241
|
+
Hash, you can use **bin/aminoacid_frequencies**.
|
9242
|
+
|
9243
|
+
Example from the commandline for this:
|
9244
|
+
|
9245
|
+
aminoacid_frequencies MVTDEGAIYFTKDAARNWKAAVEETVSATLNRTVSSGITGASYYTGTFST
|
9246
|
+
|
9247
|
+
Example from within bioroebe itself (and thus ruby):
|
9248
|
+
|
9249
|
+
require 'bioroebe/frequencies.rb'
|
9250
|
+
|
9251
|
+
Bioroebe.aminoacid_frequencies('MVTDEGAIYFTKDAARNWKAAVEETVSATLNRTVSSGITGASYYTGTFST')
|
9252
|
+
|
9253
|
+
The latter will return a Hash that you can then further make use of, such as:
|
9254
|
+
|
9255
|
+
{"M"=>1, "V"=>4, "T"=>9, "D"=>2, "E"=>3, "G"=>4, "A"=>7, "I"=>2, "Y"=>3, "F"=>2, "K"=>2, "R"=>2, "N"=>2, "W"=>1, "S"=>5, "L"=>1}
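
If you are curious how such a Hash can be built, here is a minimal plain-ruby sketch of the counting step. The method name `aminoacid_counts` is hypothetical; it is not the implementation of Bioroebe.aminoacid_frequencies.

```ruby
# Counts every one-letter aminoacid code in the given sequence.
# Hypothetical helper, for illustration only.
def aminoacid_counts(sequence)
  sequence.each_char.each_with_object(Hash.new(0)) do |aminoacid, counts|
    counts[aminoacid] += 1
  end
end

counts = aminoacid_counts('MVTDEGAIYFTKDAARNWKAAVEETVSATLNRTVSSGITGASYYTGTFST')
puts counts.inspect
# The most common aminoacid and how often it occurs:
puts counts.max_by { |_aminoacid, count| count }.inspect # prints ["T", 9]
```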
|
9256
|
+
|
9257
|
+
## Determining the codon frequencies from the commandline
|
9258
|
+
|
9259
|
+
In <b>April 2022</b> I noticed that one use case is to show the
|
9260
|
+
codon frequencies of a given sequence - typically a nucleotide sequence.
|
9261
|
+
|
9262
|
+
For aminoacids there already was an executable, at **bin/aminoacid_frequencies**.
|
9263
|
+
|
9264
|
+
So, following that logic, a new executable was added at
|
9265
|
+
**bin/codon_frequency**. This will show the Hash of the codon
|
9266
|
+
frequencies, as a String, on the commandline.
|
9267
|
+
|
9268
|
+
Usage example:
|
9269
|
+
|
9270
|
+
codon_frequency ATTCGTACGATCGACTGACTGACAGTCATTCGT
|
9271
|
+
|
9272
|
+
The output of this would be the following:
|
9273
|
+
|
9274
|
+
AUU: 2
|
9275
|
+
CGU: 2
|
9276
|
+
ACG: 1
|
9277
|
+
AUC: 1
|
9278
|
+
GAC: 1
|
9279
|
+
UGA: 1
|
9280
|
+
CUG: 1
|
9281
|
+
ACA: 1
|
9282
|
+
GUC: 1
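
The counting behind this output can be sketched in plain ruby. The sketch assumes that the sequence is read in frame 1, that T is shown as U, and that leftover nucleotides at the end are ignored; `codon_counts` is a hypothetical helper, not the code of bin/codon_frequency.

```ruby
# Counts codons in frame 1 and reports them as RNA triplets.
# Hypothetical helper, for illustration only.
def codon_counts(dna)
  rna = dna.upcase.tr('T', 'U')
  codons = rna.scan(/.{3}/) # trailing 1-2 nucleotides are ignored
  counts = codons.each_with_object(Hash.new(0)) { |codon, hash| hash[codon] += 1 }
  # Highest count first; ties keep their order of first appearance.
  counts.each_with_index
        .sort_by { |(_codon, count), index| [-count, index] }
        .map(&:first)
end

codon_counts('ATTCGTACGATCGACTGACTGACAGTCATTCGT').each do |codon, count|
  puts "#{codon}: #{count}"
end
```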
|
9283
|
+
|
9284
|
+
## Showing the codon frequency via countcodon
|
9285
|
+
|
9286
|
+
The excellent website at https://www.kazusa.or.jp/codon/countcodon.html offers
|
9287
|
+
a rather useful functionality via a simple web-interface, in that you can pass
|
9288
|
+
in a mRNA sequence, and it will then show the codon frequency/likelihood of
|
9289
|
+
that sequence - all codons in that sequence, that is. This can be extended
|
9290
|
+
to <b>all protein-coding genes in a given genome</b>, and will thus be
|
9291
|
+
useful for a researcher who may be interested in determining the codon
|
9292
|
+
frequency in general, across all genes in that given genome.

You can test it with an input sequence.

For instance, the following sequence:

    ATTCGTACGATCGACTGACTGACAGTCATTCGTAGTACGATCGACTGACTGACAGTCATTCGTACGATCGACTGACTGACAAGTCATTCGTACGATCGACTGACTTGACAGTCATAA

Would yield this result:

    fields: [triplet] [frequency: per thousand] ([number])

    UUU   0.0(  0)  UCU   0.0(  0)  UAU   0.0(  0)  UGU   0.0(  0)
    UUC   0.0(  0)  UCC   0.0(  0)  UAC  25.6(  1)  UGC   0.0(  0)
    UUA   0.0(  0)  UCA  25.6(  1)  UAA  25.6(  1)  UGA 102.6(  4)
    UUG   0.0(  0)  UCG  25.6(  1)  UAG   0.0(  0)  UGG   0.0(  0)

    CUU   0.0(  0)  CCU   0.0(  0)  CAU  25.6(  1)  CGU  76.9(  3)
    CUC   0.0(  0)  CCC   0.0(  0)  CAC   0.0(  0)  CGC   0.0(  0)
    CUA   0.0(  0)  CCA   0.0(  0)  CAA   0.0(  0)  CGA  25.6(  1)
    CUG 102.6(  4)  CCG   0.0(  0)  CAG  25.6(  1)  CGG   0.0(  0)

    AUU  76.9(  3)  ACU  25.6(  1)  AAU   0.0(  0)  AGU  51.3(  2)
    AUC  76.9(  3)  ACC   0.0(  0)  AAC   0.0(  0)  AGC   0.0(  0)
    AUA   0.0(  0)  ACA  76.9(  3)  AAA   0.0(  0)  AGA   0.0(  0)
    AUG   0.0(  0)  ACG  76.9(  3)  AAG   0.0(  0)  AGG   0.0(  0)

    GUU   0.0(  0)  GCU   0.0(  0)  GAU  25.6(  1)  GGU   0.0(  0)
    GUC  51.3(  2)  GCC   0.0(  0)  GAC  76.9(  3)  GGC   0.0(  0)
    GUA   0.0(  0)  GCA   0.0(  0)  GAA   0.0(  0)  GGA   0.0(  0)
    GUG   0.0(  0)  GCG   0.0(  0)  GAG   0.0(  0)  GGG   0.0(  0)

The same functionality is also available within the Bioroebe
project, as of **April 2022**.

The method that does so is:

    Bioroebe.frequency_per_thousand
    Bioroebe.frequency_per_thousand('ATTCGTACGATCGACTGACTGACAGTCATTCGTAGTACGATCGACTGACTGACAGTCATTCGTACGATCGACTGACTGACAAGTCATTCGTACGATCGACTGACTTGACAGTCATAA') # Usage example here.
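
The "per thousand" value is simply the count of a codon divided by the total number of codons, multiplied by 1000. A minimal sketch of that arithmetic in plain Ruby follows; the method name `per_thousand` is hypothetical, and the real Bioroebe.frequency_per_thousand may well work differently internally.

```ruby
# Hypothetical helper, not the bioroebe implementation: frequency per
# thousand codons, in the style of the countcodon table.
# Returns a Hash of codon => [frequency per thousand, count].
def per_thousand(sequence)
  codons = sequence.upcase.tr('T', 'U').scan(/.{3}/)
  total = codons.size
  codons.tally.transform_values { |n| [(n * 1000.0 / total).round(1), n] }
end

sequence = 'ATTCGTACGATCGACTGACTGACAGTCATTCGTAGTACGATCGACTGACTGACAGTCATTCGT' \
           'ACGATCGACTGACTGACAAGTCATTCGTACGATCGACTGACTTGACAGTCATAA'
per_thousand(sequence).each { |codon, (frequency, n)|
  puts format('%s %5.1f(%3d)', codon, frequency, n)
}
```

The sequence consists of 39 codons, so a codon that occurs four times (such as UGA) ends up at 4 * 1000 / 39, which rounds to 102.6. Codons that do not occur are simply absent from the Hash in this sketch.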

Sinatra-bindings for this functionality exist as of July 2022,
but they are not very well-polished. Ruby-gtk3 bindings may be
added at a later time, and possibly ruby-libui bindings as well, for
Windows support. What is still missing is support for the different codon
tables of different species; that may be added at a later time as well. For now
it seemed more important to offer the basic functionality.

## Working with PDB files (.pdb)

The **PDB**, founded in the year **1971**, holds lots of **atomic
structures of proteins**.

For instance, in **July 2016** it contained **121000 structures**.

In **February 2018** it contained **~124000 structures**
(from X-ray crystallography), and **~12000 NMR
structures**. <b>NMR</b> is limited to about <b>350 amino
acids maximum length</b>, give or take.

In **April 2020** the PDB contained **163141 structures**.

We can see that more and more structures are available nowadays -
a trend that will most likely continue or even accelerate.
(Let's hope the quality also remains high.)

A typical .pdb file contains entries such as this:

    RTyp  Num  Atm  Res  Ch  ResN       X       Y       Z   Occ   Temp  PDB   Line
    ATOM    1  N    ASP  L      1   4.060   7.307   5.186  1.00  51.58  1FDL  93
    ATOM    2  CA   ASP  L      1   4.042   7.776   6.553  1.00  48.05  1FDL  94
    ATOM    3  N    VAL  A     25  32.433  16.336  57.540  1.00  11.92  A1    N
    ATOM    4  CA   VAL  A     25  31.132  16.439  58.160  1.00  11.85  A1    C
    ATOM    5  C    VAL  A     25  30.447  15.105  58.363  1.00  12.34  A1    C

(Note the first line: **RTyp** and the labels after it are just an
explanation for the ATOM entries below that line.)

The sequence starts from the N-terminal residue for proteins; see
the <b>Atm</b> entry at <b>Num 1</b>.

The **meaning of these entries** is as follows:

1) RTyp: Record type.
2) Num: Serial number of the atom. Each atom has a unique serial number.
3) Atm: Atom name (in IUPAC format).
4) Res: Residue name (IUPAC format).
5) Ch: Chain to which the atom belongs (in this case, L for light chain of an antibody).
6) ResN: Residue sequence number. This will be incremental, e. g. 1, 2, 3, 4 and so forth.
7,8,9) X, Y, Z: Cartesian coordinates specifying the atomic position in space.
10) Occ: Occupancy factor.
11) Temp: Temperature factor, also called the B-factor. Atoms that are
disordered in the crystal have high temperature factors; they are "wobbly".
12) PDB: The PDB data file unique identifier.
13) Line: Line (record) number in the data file.
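
In an actual .pdb file the ATOM records are fixed-width, so the fields are more safely read by column position than by splitting on whitespace. The following is a minimal sketch using the column ranges from the wwPDB format description (version 3.3); the method name `parse_atom_line` and the sample record are hypothetical and are not taken from bioroebe's parse_pdb_file.rb.

```ruby
# Hypothetical helper, not the bioroebe parser: read one fixed-width
# ATOM record. String#[] takes a 0-based start index and a length.
def parse_atom_line(line)
  {
    record_type: line[0, 6].strip,       # columns  1-6
    serial:      line[6, 5].to_i,        # columns  7-11
    atom_name:   line[12, 4].strip,      # columns 13-16
    residue:     line[17, 3].strip,      # columns 18-20
    chain:       line[21, 1],            # column  22
    residue_seq: line[22, 4].to_i,       # columns 23-26
    x:           line[30, 8].to_f,       # columns 31-38
    y:           line[38, 8].to_f,       # columns 39-46
    z:           line[46, 8].to_f,       # columns 47-54
    occupancy:   line[54, 6].to_f,       # columns 55-60
    temp_factor: line[60, 6].to_f,       # columns 61-66
    element:     line[76, 2].to_s.strip  # columns 77-78, may be absent in old files
  }
end

# Build a sample record with format() so that the column widths are exact.
sample = format('%-6s%5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f%6.2f%6.2f%10s%2s',
                'ATOM', 1, ' N', '', 'ASP', 'L', 1, '',
                4.060, 7.307, 5.186, 1.00, 51.58, '', 'N')
atom = parse_atom_line(sample)
puts "#{atom[:residue]} #{atom[:residue_seq]}, chain #{atom[:chain]}, B-factor #{atom[:temp_factor]}"
```

Reading by column matters because adjacent fields may touch without a separating space, for example when a coordinate fills its whole eight-character field.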

Typically the rightmost entry, the last one, specifies
which atom it is. An **H** stands for a hydrogen atom; the other atoms
are "heavy" atoms (heavier than hydrogen).

Most .pdb files will contain **SEQRES** entries. These entries list
the primary sequence of the polymeric molecules present in the entry.
SEQRES uses the standard 3-character code
for the canonical amino acids. So, for instance,
the amino acids that may be mentioned in a SEQRES entry are
ALA, CYS, ASP, GLU, PHE, GLY, HIS, ILE, LYS, LEU, MET, ASN,
PRO, GLN, ARG, SER, THR, VAL, TRP and TYR. You can use the
method **Bioroebe.three_to_one()** to convert these back to the
one-letter code, as follows:

    Bioroebe.three_to_one('PHE') # => "F"
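
To illustrate what such a conversion involves, here is a small standalone sketch that turns the residues of one SEQRES line into a one-letter string. The constant `THREE_TO_ONE`, the method `seqres_to_one_letter` and the sample line are all made up for this illustration; mapping unknown residues to "X" is an assumption, and the sketch also assumes that the chain identifier is not blank.

```ruby
# Hypothetical lookup table, not taken from bioroebe: the 20 canonical
# amino acids, three-letter code => one-letter code.
THREE_TO_ONE = {
  'ALA' => 'A', 'CYS' => 'C', 'ASP' => 'D', 'GLU' => 'E', 'PHE' => 'F',
  'GLY' => 'G', 'HIS' => 'H', 'ILE' => 'I', 'LYS' => 'K', 'LEU' => 'L',
  'MET' => 'M', 'ASN' => 'N', 'PRO' => 'P', 'GLN' => 'Q', 'ARG' => 'R',
  'SER' => 'S', 'THR' => 'T', 'VAL' => 'V', 'TRP' => 'W', 'TYR' => 'Y'
}.freeze

# Convert the residues of one SEQRES line into a one-letter string.
# Fields: SEQRES, serial number, chain, residue count, then the residues.
def seqres_to_one_letter(line)
  residues = line.split.drop(4)
  residues.map { |residue| THREE_TO_ONE.fetch(residue, 'X') }.join
end

# Sample line constructed for this illustration.
puts seqres_to_one_letter('SEQRES   1 A   36  MET LEU SER ASP GLU ASP PHE LYS ALA VAL PHE GLY MET')
# => MLSDEDFKAVFGM
```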

The data in a .pdb file need not describe only a protein with
a specific aminoacid sequence. It may also include DNA. An example
of such an entry is
<b><a href="http://rcsb.org/pdb/explore/explore.do?structureId=2DGC">2dgc</a></b>,
which includes a protein chain and a DNA chain.

As far as the **bioroebe project** is concerned, you can parse .pdb files
via the following class:

    Bioroebe::ParsePdbFile.new
    Bioroebe::ParsePdbFile.new(path_to_the_pdb_file_here)
    Bioroebe::ParsePdbFile.new('/foo/bar/ack.pdb')

This class also allows some shortcuts for integrated .pdb files,
that is, files that are bundled with the bioroebe project:

    Bioroebe::ParsePdbFile.new ':1fat'

This requires a String because ruby symbols may not start with
a number. Note that this also works through the commandline,
such as:

    parse_pdb_file :1fat

A shell such as bash does not understand ruby symbols, so a
string, here :1fat, will be passed in instead. ParsePdbFile will
handle this correctly internally.

Note that a small bug was fixed in the file parse_pdb_file.rb:
some entries were skipped due to an erroneous loop in the ruby
file. This was corrected in **May 2020**.

In **March 2021** the ability to use entries such as ':1fat'
was removed again; the code remains though. The reason for
the removal was that the .pdb files are quite large,
so distributing them via the bioroebe project makes no real
sense. Consider simply downloading the .pdb files; you
can do this from the bioshell or via something
like:

    pdb 5TIM

Note that you can also return the aminoacid sequence from a
.pdb file directly, as of **May 2020**.

Example for this:

    Bioroebe.return_aminoacid_sequence_from_this_pdb_file "1VII.pdb" # => "MLSDEDFKAVFGMTRSAFANLPLWKQQNLKKEKGLF"

The first argument should be **the path to the (local)
.pdb file at hand**. (In theory support for remote .pdb
files could also be added easily, but right now this
is not possible, so you have to download the file first.)

The **specification for .pdb files** can be read at the following
two remote resources:

http://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html
http://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#ATOM

Note that parse_pdb_file.rb can also do some additional
things, such as calculating the maximum distance between
atoms in that file, via the method
**.try_to_determine_the_max_distance_between_the_atoms_in_this_protein()**.
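
How such a maximum distance can be determined is sketched below: compute the Euclidean distance for every pair of atoms and keep the largest one. This is a brute-force illustration only; the method name `max_distance` and the coordinates are made up, and the bioroebe method may use a different approach.

```ruby
# Hypothetical helper, not the bioroebe implementation: brute-force search
# for the largest Euclidean distance between any two atoms.
# Each atom is given as an [x, y, z] Array.
def max_distance(coordinates)
  coordinates.combination(2).map { |a, b|
    Math.sqrt(a.zip(b).sum { |p, q| (p - q)**2 })
  }.max # nil when fewer than two atoms are given
end

# Made-up coordinates, for illustration only.
atoms = [
  [0.0, 0.0, 0.0],
  [3.0, 4.0, 0.0],
  [1.0, 1.0, 1.0]
]
puts max_distance(atoms) # => 5.0
```

The number of pairs grows quadratically with the number of atoms, so for very large structures this simple variant can become slow.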

If you wish to report the secondary structures from a given .pdb file
then you can use the following class:

    require 'bioroebe/pdb/report_secondary_structures_from_this_pdb_file.rb'

    Bioroebe::ReportSecondaryStructuresFromThisPdbFile.new
    Bioroebe::ReportSecondaryStructuresFromThisPdbFile.new('foobar.pdb')

If you wish to obtain the FASTA sequence of a particular remote
.pdb file then you can use this API:

    x = Bioroebe.return_fasta_sequence_from_this_pdb_file "2bts" # => "MLSDEDFKAVFGMTRSAFANLPLWKQQNLKKEKGLF"

Keep in mind that this is the FASTA sequence; the .pdb file itself
has another format, and contains a lot more information, such as
the various ATOM entries.

As of **June 2020** the command **fetch** also works from
within the Bioshell, similar to how it works in **pymol**. This allows
us to quickly download a remote .pdb file.

    fetch 2BTS

You can also use the following toplevel-API to download a remote
.pdb file:

    Bioroebe.download_this_pdb
    Bioroebe.download_this_pdb '355D'
    Bioroebe.download_this_pdb '1K4R' # This is the Dengue Virus
    Bioroebe.download_this_pdb '1fat.pdb' # Lectin Phytohemagglutinin

This will refer to a remote URL such as
https://files.rcsb.org/view/1FAT.pdb.
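
A rough sketch of what such a download involves is shown next, using only Ruby's standard library. The method names `remote_pdb_url` and `download_pdb`, the way the input is normalized, and the default target directory are all assumptions made for this illustration; this is not the code behind Bioroebe.download_this_pdb.

```ruby
require 'net/http'
require 'uri'
require 'fileutils'

# Hypothetical helpers, not the bioroebe implementation.
# Normalize inputs such as '1fat', '1FAT' or '1fat.pdb' into the remote URL.
def remote_pdb_url(id)
  name = File.basename(id.to_s, '.*').upcase
  "https://files.rcsb.org/view/#{name}.pdb"
end

# Download one entry into target_directory and return the local path.
# No error handling: a wrong id would store the server's error page.
def download_pdb(id, target_directory = 'pdb')
  url = remote_pdb_url(id)
  body = Net::HTTP.get(URI(url)) # performs a network request
  FileUtils.mkdir_p(target_directory)
  path = File.join(target_directory, File.basename(url))
  File.write(path, body) # overwrites an existing file of the same name
  path
end

puts remote_pdb_url('1fat.pdb') # => https://files.rcsb.org/view/1FAT.pdb
```

Only the URL helper is invoked here, so running the snippet does not touch the network; call `download_pdb('1FAT')` to actually fetch a file.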

Note that the downloaded file will automatically be moved to the "correct" default
position in the bioroebe project, under the **pdb/** subdirectory.

You can also invoke this script from the commandline via
**bin/download_this_pdb**, like this:

    download_this_pdb 355D

This works with several .pdb files in one go as well:

    download_this_pdb 1NR6 2F9Q 3TDA 2HI4 2V0M

They would all be downloaded one after the other. Be aware that
this will overwrite any old .pdb files at that position, so
if you don't want this, I recommend backing up the
**pdb/** subdirectory before invoking the above call.

You can also turn the FASTA sequence stored in a .pdb file into
a .fasta file, via **--create-fasta-file**.

Usage examples:

    parsedb 1NR6 --create-fasta-file
    parsedb 2F9Q --create-fasta-file
    parsedb 3TDA --create-fasta-file
    parsedb 2HI4 --create-fasta-file
    parsedb 2V0M --create-fasta-file

So if you have a file called <b>1NR6.pdb</b> and you use
the first command, a .fasta file will be created. If such
a .pdb file does not exist then this will not work, so
make sure to download the .pdb file before using
this commandline-flag.

Last but not least, the following table documents the
PDB format. It is not yet complete; the remaining
record types are intended to be added eventually:

    Record Name  Describes
    MODRES       Modifications to standard residues
    HET          Nonstandard residues (as well as ligands, ions and water)
    HETNAM       Full chemical name of the residue
    HETSYN       Synonyms for the residue
    FORMUL       Chemical formula of the residue
    KEYWDS       Keywords, such as "FK506 BINDING PROTEIN, FKBP12, CIS-TRANS PROLYL-ISOMERASE, ROTAMASE"


## Determining how many stop codons exist in a given sequence

You can use **bin/n_stop_codons_in_this_sequence** to determine
how many stop codons exist in a given sequence.

Usage example from the commandline:

    n_stop_codons_in_this_sequence ATGACGTACGTCAGTCAGTGATAGTAA # => 4

You can also separate the codons via a ' ' spacer on the commandline, of
course:

    n_stop_codons_in_this_sequence ATG ACG TAC GTC AGT CAG TGA TAG TAA # => 4

Internally this makes use of the method called
<b>Bioroebe.n_stop_codons_in_this_sequence?</b> or one of its
aliased names. Usage example for the method, with the same input
as in the first example shown above:

    Bioroebe.n_stop_codons_in_this_sequence "ATGACGTACGTCAGTCAGTGATAGTAA" # => 4
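
The result of 4 is consistent with counting the triplets TAA, TAG and TGA at any position of the sequence: besides the three codons at the end, the sequence begins with ATGA, which contains TGA. Whether bioroebe counts exactly this way is an assumption on my part. The following standalone sketch, with hypothetical method names, shows this variant next to a strict count in reading frame 1:

```ruby
STOP_CODONS = %w[TAA TAG TGA].freeze

# Hypothetical helpers, not the bioroebe implementation.
# Count stop triplets at any position, irrespective of the reading frame.
def stop_codons_anywhere(sequence)
  sequence.delete(' ').upcase.scan(/TAA|TAG|TGA/).size
end

# Count only the stop codons in reading frame 1.
def stop_codons_in_frame(sequence)
  sequence.delete(' ').upcase.scan(/.{3}/).count { |codon| STOP_CODONS.include?(codon) }
end

sequence = 'ATGACGTACGTCAGTCAGTGATAGTAA'
puts stop_codons_anywhere(sequence) # => 4
puts stop_codons_in_frame(sequence) # => 3
```

Two stop triplets can never overlap each other (they all start with T and end with A or G), so a plain `scan` does not miss any occurrence.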

## The Aliphatic Index of Globular Proteins

In a short paper from 1980, titled "Thermostability and aliphatic
index of globular proteins", Atsushi Ikai provided a formula with
which one can calculate the aliphatic index of a globular protein
(<b>PMID: 7462208</b>,
<a href="https://www.jstage.jst.go.jp/article/biochemistry1922/88/6/88_6_1895/_article">
see here</a>).

Ikai provided a statistical analysis of proteins, and determined
that the aliphatic index - which is defined as the relative volume
of a protein occupied by <b>aliphatic side chains</b> (alanine, valine,
isoleucine, and leucine) - of proteins of thermophilic bacteria
is significantly higher than that of ordinary proteins.

Ikai reasoned that the index may be regarded as a positive
factor for the <b>increase of thermostability of globular
proteins</b>. The enzymes of some organisms, in particular
those with <b>thermostable proteins</b>, are more stable
at higher temperatures than the enzymes of other organisms.

Thus, there is a good correlation between the "aliphatic
index" on the one hand, and the thermostability of proteins
on the other hand.

Ikai gave the following formula for calculating this:

    Aliphatic Index = XA + a * XV + b * (XI + XL)

XA, XV, XI and XL are the mole percent values (100 x mole fraction)
of the four aminoacids Alanine, Valine, Isoleucine and Leucine.
The two coefficients a and b are the volumes of the side chains
of Valine (a) and of Isoleucine and Leucine (b), relative to the
side chain of Alanine. a has a value range of 2.8-3.0 and
b has a value range of 3.8-4.0.

The method called <b>.aliphatic_index()</b> makes use of that
formula. The values <b>2.9</b> and <b>3.9</b> are used for
a and b, respectively. The code in the bioroebe project
for this has been inspired by: https://github.com/wwood/bioruby-aliphatic_index

That project gives the following usage example for bioruby:

    Bio::Sequence::AA.new('MVKSYDRYEYEDCLGIVNSKSSNCVFLNNA').aliphatic_index # => 71.33333

In bioroebe, the equivalent would be:

    Bioroebe::Protein.new('MVKSYDRYEYEDCLGIVNSKSSNCVFLNNA').aliphatic_index # => 71.33333
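
The formula itself is easy to follow in a few lines of plain Ruby. The sketch below is a standalone illustration with a hypothetical method; it is not the code of Bioroebe::Protein, but it uses the same coefficients of 2.9 and 3.9 and treats the X values as mole percent.

```ruby
# Hypothetical helper, not the bioroebe implementation: aliphatic index
# after Ikai (1980), with a = 2.9 and b = 3.9.
def aliphatic_index(sequence, a: 2.9, b: 3.9)
  residues = sequence.upcase.chars
  # Mole percent of one amino acid, i.e. 100 * count / sequence length.
  mole_percent = ->(letter) { 100.0 * residues.count(letter) / residues.size }
  mole_percent.call('A') + a * mole_percent.call('V') +
    b * (mole_percent.call('I') + mole_percent.call('L'))
end

puts aliphatic_index('MVKSYDRYEYEDCLGIVNSKSSNCVFLNNA').round(5) # => 71.33333
```

The sequence has 30 residues with 1 alanine, 3 valines, 1 isoleucine and 2 leucines, so the sum is 3.333 + 2.9 * 10 + 3.9 * 10 = 71.333.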

## Possibly useful links in regards to molecular biology and science in general
|