bioroebe 0.10.80 → 0.11.24
Potentially problematic release.
This version of bioroebe might be problematic.
- checksums.yaml +4 -4
- data/README.md +1204 -772
- data/bioroebe.gemspec +3 -3
- data/doc/README.gen +1203 -771
- data/doc/todo/bioroebe_todo.md +391 -365
- data/lib/bioroebe/aminoacids/aminoacid_substitution.rb +1 -9
- data/lib/bioroebe/aminoacids/codon_percentage.rb +1 -9
- data/lib/bioroebe/aminoacids/deduce_aminoacid_sequence.rb +1 -9
- data/lib/bioroebe/aminoacids/display_aminoacid_table.rb +1 -0
- data/lib/bioroebe/aminoacids/show_hydrophobicity.rb +1 -6
- data/lib/bioroebe/base/colours_for_base/colours_for_base.rb +18 -8
- data/lib/bioroebe/base/commandline_application/commandline_arguments.rb +13 -11
- data/lib/bioroebe/base/commandline_application/misc.rb +18 -8
- data/lib/bioroebe/base/misc.rb +16 -0
- data/lib/bioroebe/base/prototype/misc.rb +1 -1
- data/lib/bioroebe/codons/show_codon_tables.rb +6 -2
- data/lib/bioroebe/codons/show_codon_usage.rb +2 -1
- data/lib/bioroebe/constants/aminoacids_and_proteins.rb +1 -0
- data/lib/bioroebe/constants/database_constants.rb +1 -1
- data/lib/bioroebe/constants/files_and_directories.rb +20 -1
- data/lib/bioroebe/constants/misc.rb +20 -0
- data/lib/bioroebe/count/count_amount_of_nucleotides.rb +3 -0
- data/lib/bioroebe/crystal/README.md +2 -0
- data/lib/bioroebe/crystal/to_rna.cr +19 -0
- data/lib/bioroebe/data/README.md +11 -8
- data/lib/bioroebe/data/electron_microscopy/pos_example.pos +396 -0
- data/lib/bioroebe/data/electron_microscopy/test_particles.star +36 -0
- data/lib/bioroebe/{shell/tk.rb → electron_microscopy/electron_microscopy_module.rb} +15 -10
- data/lib/bioroebe/electron_microscopy/simple_star_file_generator.rb +4 -9
- data/lib/bioroebe/fasta_and_fastq/show_fasta_headers.rb +27 -12
- data/lib/bioroebe/genome/README.md +4 -0
- data/lib/bioroebe/genome/genome.rb +67 -0
- data/lib/bioroebe/gui/gtk3/protein_to_DNA/protein_to_DNA.rb +18 -18
- data/lib/bioroebe/gui/gtk3/random_sequence/random_sequence.rb +19 -11
- data/lib/bioroebe/gui/shared_code/protein_to_DNA/protein_to_DNA_module.rb +14 -14
- data/lib/bioroebe/misc/ruler.rb +1 -0
- data/lib/bioroebe/parsers/genbank_parser.rb +353 -24
- data/lib/bioroebe/parsers/gff.rb +1 -9
- data/lib/bioroebe/pdb/parse_pdb_file.rb +1 -9
- data/lib/bioroebe/project/project.rb +1 -1
- data/lib/bioroebe/python/README.md +1 -0
- data/lib/bioroebe/python/__pycache__/mymodule.cpython-39.pyc +0 -0
- data/lib/bioroebe/python/gui/gtk3/all_in_one.css +4 -0
- data/lib/bioroebe/python/gui/gtk3/all_in_one.py +59 -0
- data/lib/bioroebe/python/gui/gtk3/widget1.py +20 -0
- data/lib/bioroebe/python/gui/tkinter/all_in_one.py +91 -0
- data/lib/bioroebe/python/mymodule.py +8 -0
- data/lib/bioroebe/python/protein_to_dna.py +33 -0
- data/lib/bioroebe/python/shell/shell.py +19 -0
- data/lib/bioroebe/python/to_rna.py +14 -0
- data/lib/bioroebe/python/toplevel_methods/open_in_browser.py +20 -0
- data/lib/bioroebe/python/toplevel_methods/palindromes.py +42 -0
- data/lib/bioroebe/python/toplevel_methods/rds.py +13 -0
- data/lib/bioroebe/python/toplevel_methods/three_delimiter.py +34 -0
- data/lib/bioroebe/python/toplevel_methods/time_and_date.py +43 -0
- data/lib/bioroebe/python/toplevel_methods/to_camelcase.py +11 -0
- data/lib/bioroebe/requires/require_the_bioroebe_project.rb +3 -1
- data/lib/bioroebe/sequence/nucleotide_module/nucleotide_module.rb +28 -25
- data/lib/bioroebe/sequence/protein.rb +105 -3
- data/lib/bioroebe/sequence/sequence.rb +61 -2
- data/lib/bioroebe/shell/menu.rb +3451 -3366
- data/lib/bioroebe/shell/misc.rb +51 -4311
- data/lib/bioroebe/shell/readline/readline.rb +1 -1
- data/lib/bioroebe/shell/shell.rb +11192 -28
- data/lib/bioroebe/siRNA/siRNA.rb +81 -1
- data/lib/bioroebe/string_matching/find_longest_substring.rb +3 -2
- data/lib/bioroebe/taxonomy/class_methods.rb +3 -8
- data/lib/bioroebe/taxonomy/constants.rb +4 -3
- data/lib/bioroebe/taxonomy/edit.rb +2 -1
- data/lib/bioroebe/taxonomy/help/help.rb +10 -10
- data/lib/bioroebe/taxonomy/info/check_available.rb +15 -9
- data/lib/bioroebe/taxonomy/info/info.rb +17 -2
- data/lib/bioroebe/taxonomy/info/is_dna.rb +46 -36
- data/lib/bioroebe/taxonomy/interactive.rb +139 -95
- data/lib/bioroebe/taxonomy/menu.rb +27 -18
- data/lib/bioroebe/taxonomy/parse_fasta.rb +3 -1
- data/lib/bioroebe/taxonomy/shared.rb +1 -0
- data/lib/bioroebe/taxonomy/taxonomy.rb +1 -0
- data/lib/bioroebe/toplevel_methods/aminoacids_and_proteins.rb +31 -24
- data/lib/bioroebe/toplevel_methods/databases.rb +1 -1
- data/lib/bioroebe/toplevel_methods/fasta_and_fastq.rb +101 -63
- data/lib/bioroebe/toplevel_methods/misc.rb +17 -16
- data/lib/bioroebe/toplevel_methods/nucleotides.rb +22 -5
- data/lib/bioroebe/toplevel_methods/open_in_browser.rb +2 -0
- data/lib/bioroebe/toplevel_methods/palindromes.rb +1 -2
- data/lib/bioroebe/toplevel_methods/taxonomy.rb +2 -2
- data/lib/bioroebe/toplevel_methods/to_camelcase.rb +5 -0
- data/lib/bioroebe/utility_scripts/align_open_reading_frames.rb +1 -9
- data/lib/bioroebe/utility_scripts/check_for_mismatches/check_for_mismatches.rb +1 -9
- data/lib/bioroebe/utility_scripts/compacter.rb +1 -9
- data/lib/bioroebe/utility_scripts/compseq/compseq.rb +1 -9
- data/lib/bioroebe/utility_scripts/create_batch_entrez_file.rb +1 -9
- data/lib/bioroebe/utility_scripts/dot_alignment.rb +1 -9
- data/lib/bioroebe/utility_scripts/move_file_to_its_correct_location.rb +1 -4
- data/lib/bioroebe/utility_scripts/showorf/constants.rb +0 -5
- data/lib/bioroebe/utility_scripts/showorf/reset.rb +1 -4
- data/lib/bioroebe/version/version.rb +2 -2
- data/lib/bioroebe/www/embeddable_interface.rb +101 -52
- data/lib/bioroebe/www/sinatra/sinatra.rb +186 -70
- data/lib/bioroebe/yaml/aminoacids/amino_acids_long_name_to_one_letter.yml +2 -2
- data/lib/bioroebe/yaml/configuration/browser.yml +1 -1
- data/lib/bioroebe/yaml/genomes/README.md +3 -4
- data/lib/bioroebe/yaml/restriction_enzymes/restriction_enzymes.yml +3 -3
- metadata +32 -35
- data/doc/setup.rb +0 -1655
- data/lib/bioroebe/genbank/genbank_parser.rb +0 -291
- data/lib/bioroebe/shell/add.rb +0 -108
- data/lib/bioroebe/shell/assign.rb +0 -360
- data/lib/bioroebe/shell/chop_and_cut.rb +0 -281
- data/lib/bioroebe/shell/constants.rb +0 -166
- data/lib/bioroebe/shell/download.rb +0 -335
- data/lib/bioroebe/shell/enable_and_disable.rb +0 -158
- data/lib/bioroebe/shell/enzymes.rb +0 -310
- data/lib/bioroebe/shell/fasta.rb +0 -345
- data/lib/bioroebe/shell/gtk.rb +0 -76
- data/lib/bioroebe/shell/history.rb +0 -132
- data/lib/bioroebe/shell/initialize.rb +0 -217
- data/lib/bioroebe/shell/loop.rb +0 -74
- data/lib/bioroebe/shell/prompt.rb +0 -107
- data/lib/bioroebe/shell/random.rb +0 -289
- data/lib/bioroebe/shell/reset.rb +0 -335
- data/lib/bioroebe/shell/scan_and_parse.rb +0 -135
- data/lib/bioroebe/shell/search.rb +0 -337
- data/lib/bioroebe/shell/sequences.rb +0 -200
- data/lib/bioroebe/shell/show_report_and_display.rb +0 -2901
- data/lib/bioroebe/shell/startup.rb +0 -127
- data/lib/bioroebe/shell/taxonomy.rb +0 -14
- data/lib/bioroebe/shell/user_input.rb +0 -88
- data/lib/bioroebe/shell/xorg.rb +0 -45
data/doc/README.gen
CHANGED
@@ -5,7 +5,7 @@ ADD_TIME_STAMP
|
|
5
5
|
|
6
6
|
## Bioroebe
|
7
7
|
|
8
|
-
<img src="
|
8
|
+
<img src="https://i.imgur.com/mAoP7AP.png">
|
9
9
|
<img src="https://i.imgur.com/YqYxRBZ.png" style="margin: 4px; margin-left: 12px;"/>
|
10
10
|
<img src="https://i.imgur.com/k7mMlg2.png" style="margin: 4px; margin-left: 12px;"/>
|
11
11
|
|
@@ -332,41 +332,6 @@ so I opted to go the yaml route. But if people want to use a hash
|
|
332
332
|
instead, they can do so, too - see the <b>API</b> for codon tables
|
333
333
|
lateron. Simply define your own constants and pass them to the
|
334
334
|
appropriate methods.
|
335
|
-
|
336
|
-
## Support for other programming languages
|
337
|
-
|
338
|
-
The main programming language for the bioroebe project is **ruby**.
|
339
|
-
Ruby, from a language design point of view, is a great programming
|
340
|
-
language - not necessarily all of ruby, but the subset that I use.
|
341
|
-
It is very easy to quickly prototype ideas via ruby.
|
342
|
-
|
343
|
-
However had, ruby is known to **not** be among the fastest programming
|
344
|
-
languages about on this planet; so, it makes sense to use other
|
345
|
-
languages too from this point of view. Additionally there are some
|
346
|
-
software stacks in use in **other** programming languages, such as
|
347
|
-
matplotlib and various more.
|
348
|
-
|
349
|
-
Thus, it is important to **support other programming languages** as
|
350
|
-
well, if there are useful libraries. The bioroebe project, after
|
351
|
-
all, tries to be **practical**: it focuses on getting things done,
|
352
|
-
no matter the language.
|
353
|
-
|
354
|
-
This means that support for other programming languages can be
|
355
|
-
found in this project as well, often using system() or similar
|
356
|
-
functionality to tap into these other programming languages. Do
|
357
|
-
not be surprised when that happens - the bioroebe project will
|
358
|
-
also try to act as a **practical glue** towards functionality
|
359
|
-
enabled via other projects. We want to get things done, no
|
360
|
-
matter the programming language at hand!
|
361
|
-
|
362
|
-
Whenever possible, though, the bioroebe project will try to be
|
363
|
-
flexible in this regard, so ideally the same solution should
|
364
|
-
work for many different programming languages.
|
365
|
-
|
366
|
-
While Ruby is the primary language for this project, since as
|
367
|
-
of 2021 I will try to officially support **java**, **jruby**
|
368
|
-
and the **GraalVM**. This is on my TODO list, though - stay
|
369
|
-
tuned for more updates in this regard.
|
370
335
|
|
371
336
|
## Readline support in the BioRoebe project
|
372
337
|
|
@@ -550,16 +515,16 @@ the DNA-to-Protein translation is somewhat simply kept as a
|
|
550
515
|
Once you are inside a **running Bioshell**, you can do other **commands**
|
551
516
|
such as this one here:
|
552
517
|
|
553
|
-
random # ← This will generate a random DNA sequence.
|
518
|
+
random # ← This will generate a random DNA sequence. Each nucleotide has the same chance to be added.
|
554
519
|
|
555
520
|
To **assign** a DNA sequence, do:
|
556
521
|
|
557
522
|
assign ATAGGGCTTTT
|
558
523
|
|
559
|
-
Note that since the year 2016
|
560
|
-
the one above, without any other commands/words, then we will assume
|
524
|
+
Note that since as of the year <b>2016</b>, if you input a nucleotide sequence
|
525
|
+
like the one above, without any other commands/words, then we will assume
|
561
526
|
that you did mean to do an assignment as-is anyway. The "assign" part
|
562
|
-
then becomes superfluous.
|
527
|
+
then becomes superfluous and can be omitted.
|
563
528
|
|
564
529
|
This is how this is simply done, by omitting the "assign" part of the
|
565
530
|
above instruction altogether:
|
@@ -1070,18 +1035,18 @@ The text **banana** thus has the following suffixes:
|
|
1070
1035
|
|
1071
1036
|
This subsection deals with some aspects of **HMMs**.
|
1072
1037
|
|
1073
|
-
Why are HMMs useful in biology? They can be used to represent protein
|
1074
|
-
families
|
1038
|
+
Why are HMMs useful in biology? They can be used to <b>represent protein
|
1039
|
+
families</b>, for example (via <b>pHMMs</b> - profile hidden markov models).
|
1075
1040
|
|
1076
1041
|
Furthermore, they can show some bias in the mutation rate that can be
|
1077
1042
|
observed. Different genomes are known to have different hotspots where
|
1078
|
-
mutations are more likely to happen. These are
|
1079
|
-
may be useful.
|
1043
|
+
mutations are more likely to happen, for various reasons. These are
|
1044
|
+
examples where a HMM may be useful.
|
1080
1045
|
|
1081
|
-
HMMs are usually based on the Shannon model where you assign different
|
1046
|
+
HMMs are usually based on the <b>Shannon model</b> where you assign different
|
1082
1047
|
probabilities to "change" events. An example that was mentioned back
|
1083
|
-
in 1948 was the english alphabet - some letters, and combinations
|
1084
|
-
letters, are more commonly seen. Shannon gave the example of "E"
|
1048
|
+
in <b>1948</b> was the english alphabet - some letters, and combinations
|
1049
|
+
of letters, are more commonly seen. Shannon gave the example of "E"
|
1085
1050
|
versus "W", as shown in the following graph (a **finite state
|
1086
1051
|
graph**):
|
1087
1052
|
|
@@ -1095,40 +1060,47 @@ DNA sequence, a 10-mer would be equivalent to **10 base pairs**.
|
|
1095
1060
|
The individual transition states are based on an assumption of
|
1096
1061
|
"randomness", but ensuring that these are truly random is not
|
1097
1062
|
necessarily trivial. Computers do not really 'generate' true
|
1098
|
-
randomness, at the least not when they are working solo
|
1099
|
-
can even 'predict' some randomness here or there
|
1100
|
-
|
1101
|
-
|
1102
|
-
|
1103
|
-
|
1104
|
-
of
|
1105
|
-
|
1106
|
-
given position, but this is not
|
1107
|
-
|
1108
|
-
|
1109
|
-
|
1110
|
-
|
1111
|
-
|
1112
|
-
|
1063
|
+
randomness, at the least not when they are working solo, "on
|
1064
|
+
their own". You can even 'predict' some randomness here or there
|
1065
|
+
via various techniques - see vulnerabilities such as <b>Specter</b>
|
1066
|
+
or similar variants where software can read from areas of the
|
1067
|
+
memory that should be inaccessible to them. Some of this is based
|
1068
|
+
on co-predictions. For distributed computers, you may often use
|
1069
|
+
random noise or decay of atoms as 'a source of randomness'. For
|
1070
|
+
any DNA nucleotide sequence, we would assume that each base pair
|
1071
|
+
has a 25% chance to exist at any given position, but this is not
|
1072
|
+
necessarily true, again for various reasons.
|
1073
|
+
|
1074
|
+
An interesting thought is ... why is <b>ATP</b> so important?
|
1075
|
+
Yes, of course due to it being 'the energy currency in a cell' but ..
|
1076
|
+
why is this ATP, aka adenine? Why not GTP, aka guanine or any of
|
1077
|
+
the other two nucleotides? (GTP is used too, but why? Why not
|
1078
|
+
CTP and TTP?) I can not answer this question; there may
|
1079
|
+
be many reasons, including differential chemical storage power
|
1080
|
+
as well as mere random chance event in evolution, but for whatever
|
1113
1081
|
the reason, you will not find a complete 25% percentage value
|
1114
1082
|
for every given "slot" in DNA, depending on the organism.
|
1115
1083
|
|
1116
1084
|
From a practical point of view, how can we approach Hidden Markov
|
1117
|
-
Models?
|
1085
|
+
Models and use them?
|
1118
1086
|
|
1119
|
-
Let's take the following sequence:
|
1087
|
+
Let's take the following simple sequence:
|
1120
1088
|
|
1121
1089
|
ACGTACGC
|
1122
1090
|
|
1123
1091
|
From this sequence we can see that the <b>3-mer</b> "ACG"
|
1124
1092
|
is followed by either a T, or a C. Have a look at the sequence
|
1125
|
-
to see if you can identify the two ACG subsequences
|
1093
|
+
again to see if you can identify the two ACG subsequences
|
1094
|
+
there. You can see one at the start, and the other one
|
1095
|
+
following a bit later, hence why we come to the conclusion
|
1096
|
+
that either a T or a C will follow this <b>3-mer</b>.
|
1126
1097
|
|
1127
|
-
The probability of either T or C
|
1128
|
-
for A and G to follow there
|
1129
|
-
be ignored.
|
1098
|
+
The probability of either T or C to occur on <b>that</b>
|
1099
|
+
position, thus, is 0.5 (50%); for A and G to follow there
|
1100
|
+
is 0% so the latter two can be ignored.
|
1130
1101
|
|
1131
|
-
Thus, we could use a ruby Hash as follows
|
1102
|
+
Thus, we could use a ruby Hash as follows that should
|
1103
|
+
describe these probabilities:
|
1132
1104
|
|
1133
1105
|
probabilities = {'T': 0.5, 'C': 0.5} # ignoring A and G here, but we could denote them via 0 as well
|
1134
1106
|
|
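Note on the HMM hunk above: the 50% / 50% split for T and C after the 3-mer "ACG" can be derived mechanically from the sequence. The following is a minimal sketch in plain Ruby; the method name `next_nucleotide_probabilities` is hypothetical and not part of Bioroebe. It uses String keys, whereas the README literal `{'T': 0.5, 'C': 0.5}` creates Symbol keys.

```ruby
# Hypothetical helper, not part of Bioroebe: for every occurrence of
# +kmer+ in +sequence+, record which nucleotide follows it, then turn
# the counts into relative frequencies.
def next_nucleotide_probabilities(sequence, kmer)
  counts = Hash.new(0)
  k = kmer.size
  # Only positions that still leave room for one following nucleotide.
  (0..(sequence.size - k - 1)).each do |i|
    next unless sequence[i, k] == kmer
    counts[sequence[i + k]] += 1
  end
  total = counts.values.sum
  counts.transform_values { |n| n.to_f / total }
end

probabilities = next_nucleotide_probabilities('ACGTACGC', 'ACG')
p probabilities # T and C each map to 0.5; A and G are absent
```

Nucleotides that never follow the k-mer are simply missing from the result, which matches the README remark that A and G "can be ignored".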
@@ -1214,34 +1186,6 @@ each edge.
|
|
1214
1186
|
Parsimony assumes that substitutions are rare and that back-mutations
|
1215
1187
|
do not occur.
|
1216
1188
|
|
1217
|
-
## Random stuff
|
1218
|
-
|
1219
|
-
You can generate random DNA sequences in the shell:
|
1220
|
-
|
1221
|
-
random dna 20
|
1222
|
-
random dna 25
|
1223
|
-
random dna 30
|
1224
|
-
|
1225
|
-
This will generate random DNA sequences, with a length
|
1226
|
-
of 20, 25, 30, respectively. This may not be very useful
|
1227
|
-
but it was important that this functionality is made
|
1228
|
-
available somewhere.
|
1229
|
-
|
1230
|
-
You can also use some toplevel-methods to generate, e. g.
|
1231
|
-
20 random aminoacids:
|
1232
|
-
|
1233
|
-
Bioroebe.random_aminoacid? 20 # => "UAVHYQQESWUYAOVESEIY"
|
1234
|
-
|
1235
|
-
Note that there may exist other APIs within the Bioroebe project
|
1236
|
-
that do the same as well.
|
1237
|
-
|
1238
|
-
If you would like to use a ruby-gtk3 widget have a look
|
1239
|
-
at **RandomSequence**, under **bioroebe/gtk3/random_sequence/**.
|
1240
|
-
It works with aminoacids, DNA and RNA, and allows the user to
|
1241
|
-
create random sequences. (If you need weighted randomness then
|
1242
|
-
you currently have to use the commandline variant. Perhaps I may
|
1243
|
-
add support into the GUI directly for this one day.)
|
1244
|
-
|
1245
1189
|
## Displaying the main sequence with delimiter characters
|
1246
1190
|
|
1247
1191
|
From within the <b>bioshell</b>, you can use some alternative ways to
|
@@ -1483,24 +1427,9 @@ You can simulate this via the following API:
|
|
1483
1427
|
Bioroebe.cleave_with_trypsin(sequence_goes_in_here)
|
1484
1428
|
Bioroebe.cleave :with_trypsin, sequence_goes_in_here
|
1485
1429
|
|
1486
|
-
Currently (July 2021) only support for Trypsin is included, but
|
1430
|
+
Currently (<b>July 2021</b>) only support for Trypsin is included, but
|
1487
1431
|
in the long run the goal is to add as many digestive (peptide-bond
|
1488
1432
|
cleaving) enzymes here as possible.
|
1489
|
-
|
1490
|
-
## Freezing the main sequence - and unfreezing it again
|
1491
|
-
|
1492
|
-
You can **freeze** the BioShell, meaning that it will no longer allow
|
1493
|
-
for the main sequence to be modified, via:
|
1494
|
-
|
1495
|
-
freeze
|
1496
|
-
|
1497
|
-
To unfreeze again, issue:
|
1498
|
-
|
1499
|
-
unfreeze
|
1500
|
-
|
1501
|
-
This functionality has been added because the shell may sometimes be
|
1502
|
-
quite eager to change the main sequence, so we needed a way to disable
|
1503
|
-
any further modifications (until "unfreeze" is issued that is).
|
1504
1433
|
|
1505
1434
|
## MUMmer
|
1506
1435
|
|
@@ -2711,18 +2640,6 @@ This may look as follows:
|
|
2711
2640
|
|
2712
2641
|
<img src="https://i.imgur.com/gAZg8qG.png" style="margin: 1em; margin-left: 3em">
|
2713
2642
|
|
2714
|
-
## Obtaining a subsequence from a Bioroebe::Sequence object
|
2715
|
-
|
2716
|
-
Say that you have the DNA sequence **ATGCATGCAAAA**.
|
2717
|
-
|
2718
|
-
There are several ways how to obtain a subsequence from
|
2719
|
-
this. One variant will be shown next, by making use of
|
2720
|
-
the method called **.subseq()**.
|
2721
|
-
|
2722
|
-
Example:
|
2723
|
-
|
2724
|
-
seq = Bioroebe::Sequence.new("ATGCATGCAAAA"); seq.subseq(1,3) # => "ATG"
|
2725
|
-
|
2726
2643
|
## Bioroebe::Protein
|
2727
2644
|
|
2728
2645
|
This class is a subclass of class **Bioroebe::Sequence**. The
|
@@ -2737,15 +2654,26 @@ functionality is also available in another method.
|
|
2737
2654
|
For now keep this in mind; at some later point I may decide whether
|
2738
2655
|
this class is to be kept or not.
|
2739
2656
|
|
2740
|
-
|
2657
|
+
In July 2022 I noticed that the bio-gem has the following method:
|
2741
2658
|
|
2742
|
-
|
2743
|
-
any of the following:
|
2659
|
+
p Bio::AminoAcid['A'] # => "Ala"
|
2744
2660
|
|
2745
|
-
|
2746
|
-
|
2747
|
-
|
2748
|
-
|
2661
|
+
I liked this functionality, but class Bioroebe::Protein already
|
2662
|
+
has a [] method which is used to instantiate a new
|
2663
|
+
instance of class Bioroebe::Protein. So, a toplevel method
|
2664
|
+
was added instead.
|
2665
|
+
|
2666
|
+
Usage example:
|
2667
|
+
|
2668
|
+
Bioroebe::Aminoacids.one_to_three('A') # => Ala
|
2669
|
+
|
2670
|
+
So this is the equivalent to what the bio-gem does, more or
|
2671
|
+
less.
|
2672
|
+
|
2673
|
+
If you want to find out the name of a one-letter aminoacid
|
2674
|
+
you can also use this method:
|
2675
|
+
|
2676
|
+
Bioroebe::Protein.name('A') # => "alanine"
|
2749
2677
|
|
2750
2678
|
## Decoding aminoacids
|
2751
2679
|
|
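Note on the Bioroebe::Protein hunk above: the one-letter to three-letter conversion described there can be illustrated with a plain lookup table. This is only a sketch under the assumption that the 20 standard amino acids are meant; the constant `ONE_TO_THREE` and the method `one_to_three` below are hypothetical stand-ins and do not show how `Bioroebe::Aminoacids.one_to_three` is implemented.

```ruby
# Hypothetical lookup table for the 20 standard amino acids.
ONE_TO_THREE = {
  'A' => 'Ala', 'R' => 'Arg', 'N' => 'Asn', 'D' => 'Asp', 'C' => 'Cys',
  'Q' => 'Gln', 'E' => 'Glu', 'G' => 'Gly', 'H' => 'His', 'I' => 'Ile',
  'L' => 'Leu', 'K' => 'Lys', 'M' => 'Met', 'F' => 'Phe', 'P' => 'Pro',
  'S' => 'Ser', 'T' => 'Thr', 'W' => 'Trp', 'Y' => 'Tyr', 'V' => 'Val'
}.freeze

# Returns the three-letter code, or nil for an unknown letter.
def one_to_three(letter)
  ONE_TO_THREE[letter.to_s.upcase]
end

p one_to_three('A') # => "Ala"
```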
@@ -2931,27 +2859,6 @@ Note that presently (April 2020) not all of PROSITE may be supported
|
|
2931
2859
|
via this regex, but in the long run the plan is to support all
|
2932
2860
|
of PROSITE's regex expression.
|
2933
2861
|
|
2934
|
-
## Determining how many stop codons existing in a given sequence
|
2935
|
-
|
2936
|
-
You can use **bin/n_stop_codons_in_this_sequence** to determine
|
2937
|
-
how many stop codons exist in a given sequence at hand.
|
2938
|
-
|
2939
|
-
Usage example from the commandline:
|
2940
|
-
|
2941
|
-
n_stop_codons_in_this_sequence ATGACGTACGTCAGTCAGTGATAGTAA # => 4
|
2942
|
-
|
2943
|
-
You can also separate these via a ' ' spacer on the commandline of
|
2944
|
-
course:
|
2945
|
-
|
2946
|
-
n_stop_codons_in_this_sequence ATG ACG TAC GTC AGT CAG TGA TAG TAA # => 4
|
2947
|
-
|
2948
|
-
Internally this makes use of the method called
|
2949
|
-
<b>Bioroebe.n_stop_codons_in_this_sequence?</b> or one of its
|
2950
|
-
aliased names. Usage example for the method, just as in the
|
2951
|
-
first example shown above:
|
2952
|
-
|
2953
|
-
Bioroebe.n_stop_codons_in_this_sequence "ATGACGTACGTCAGTCAGTGATAGTAA" # => 4
|
2954
|
-
|
2955
2862
|
## AT and GC content
|
2956
2863
|
![alt text][cat1]
|
2957
2864
|
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
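Note on the removed stop-codon section above: read in frame, the example sequence contains three stop codons (TGA, TAG, TAA), while the removed README text reports 4. One reading that reproduces the value 4 is to count TAA, TAG and TGA at every offset, since an additional TGA starts at the second nucleotide. The sketch below shows both counts in plain Ruby; both method names are hypothetical, and whether Bioroebe counts this way is an assumption, not something the diff confirms.

```ruby
STOP_CODONS = %w[TAA TAG TGA].freeze

# Hypothetical: count stop codons starting at any position.
def count_stop_codons_any_offset(sequence)
  dna = sequence.delete(' ').upcase
  (0..dna.size - 3).count { |i| STOP_CODONS.include?(dna[i, 3]) }
end

# Hypothetical: count stop codons in reading frame 1 only.
def count_stop_codons_in_frame(sequence)
  dna = sequence.delete(' ').upcase
  dna.scan(/.{3}/).count { |codon| STOP_CODONS.include?(codon) }
end

sequence = 'ATG ACG TAC GTC AGT CAG TGA TAG TAA'
p count_stop_codons_any_offset(sequence) # => 4
p count_stop_codons_in_frame(sequence)   # => 3
```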
@@ -3173,47 +3080,45 @@ can try to use:
|
|
3173
3080
|
On class Bioroebe::Sequence. More customizability may be added
|
3174
3081
|
to that method in this regard, if users need this.
|
3175
3082
|
|
3176
|
-
|
3083
|
+
### Obtaining a subsequence from a Bioroebe::Sequence object
|
3177
3084
|
|
3178
|
-
|
3179
|
-
the **bioshell**.
|
3085
|
+
Say that you have the DNA sequence **ATGCATGCAAAA**.
|
3180
3086
|
|
3181
|
-
|
3087
|
+
There are several ways how to obtain a subsequence from
|
3088
|
+
this. One variant will be shown next, by making use of
|
3089
|
+
the method called **.subseq()**.
|
3182
3090
|
|
3183
|
-
|
3091
|
+
Example:
|
3184
3092
|
|
3185
|
-
|
3093
|
+
seq = Bioroebe::Sequence.new("ATGCATGCAAAA"); seq.subseq(1,3) # => "ATG"
|
3186
3094
|
|
3187
|
-
You can
|
3188
|
-
code:
|
3095
|
+
You can also randomize the sequence, via .randomize().
|
3189
3096
|
|
3190
|
-
|
3097
|
+
Example:
|
3191
3098
|
|
3192
|
-
|
3193
|
-
returned representing that nucleotide sequence.
|
3099
|
+
x = Bioroebe::Sequence.new; x.randomize
|
3194
3100
|
|
3195
|
-
|
3196
|
-
should be generated.
|
3101
|
+
This is similar to the method in Bioruby here:
|
3197
3102
|
|
3198
|
-
|
3103
|
+
https://github.com/bioruby/bioruby/blob/master/lib/bio/sequence/common.rb#L243
|
3199
3104
|
|
3200
|
-
|
3201
|
-
such as by issuing the following command:
|
3105
|
+
## The Hydropathy index
|
3202
3106
|
|
3203
|
-
|
3107
|
+
You can display the hydropathy index for aminoacids from within
|
3108
|
+
the **bioshell**.
|
3204
3109
|
|
3205
|
-
|
3110
|
+
Simply issue:
|
3206
3111
|
|
3207
|
-
|
3112
|
+
hydropathy?
|
3208
3113
|
|
3209
|
-
|
3114
|
+
## The GFF file format
|
3210
3115
|
|
3211
|
-
|
3116
|
+
From within the **bioshell** you can analyze .gff and .gff3 files,
|
3117
|
+
such as by issuing the following command:
|
3212
3118
|
|
3213
|
-
|
3119
|
+
gff3? foobar.gff3
|
3214
3120
|
|
3215
|
-
|
3216
|
-
compositions of the same nucleotide.
|
3121
|
+
Evidently for this to work the file at hand has to exist.
|
3217
3122
|
|
3218
3123
|
## The NCBI Taxonomy database (the Taxonomy submodule of the Bioroebe project)
|
3219
3124
|
|
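Note on the subsequence hunk above: the example `seq.subseq(1,3) # => "ATG"` implies 1-based, inclusive coordinates, as is common in biology. A minimal sketch of that convention on a plain String follows; the free-standing methods `subseq` and `randomize` here are hypothetical illustrations and are not the Bioroebe::Sequence implementation.

```ruby
# Hypothetical: 1-based, inclusive subsequence of a plain String.
def subseq(sequence, first, last)
  sequence[(first - 1)..(last - 1)]
end

# Hypothetical: shuffle the characters of a sequence. The composition
# stays the same, only the order changes.
def randomize(sequence, rng = Random.new)
  sequence.chars.shuffle(random: rng).join
end

p subseq('ATGCATGCAAAA', 1, 3) # => "ATG"
puts randomize('ATGCATGCAAAA')
```

Passing a seeded `Random` instance to `randomize` makes the shuffle reproducible, which can be useful in tests.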
@@ -3350,47 +3255,6 @@ nucleotides by issuing:
|
|
3350
3255
|
|
3351
3256
|
show_individual_weight_of_the_four_dna_nucleotides
|
3352
3257
|
|
3353
|
-
## Truncating output in the bioroebe-shell
|
3354
|
-
![alt text][cat1]
|
3355
|
-
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
3356
|
-
|
3357
|
-
**DNA/RNA sequences** can become very long and then become
|
3358
|
-
quite difficult to view, read and handle on the commandline.
|
3359
|
-
|
3360
|
-
Normally the bioroebe shell will truncate output of DNA sequences
|
3361
|
-
that are "too long". This is mostly done so that working with
|
3362
|
-
very long sequences becomes a bit more convenient.
|
3363
|
-
|
3364
|
-
Sometimes this can become an antifeature, though, so the user
|
3365
|
-
must be able to toggle this at his or her own discretion.
|
3366
|
-
|
3367
|
-
By default, the bioroebe-shell (bioshell) will always try
|
3368
|
-
to truncate output, but you can toggle this behaviour by
|
3369
|
-
issuing:
|
3370
|
-
|
3371
|
-
do not truncate
|
3372
|
-
|
3373
|
-
In theory, other "do not" actions are also supported, or will
|
3374
|
-
be supported in the future; right now (Oct 2019) this is a bit
|
3375
|
-
limited.
|
3376
|
-
|
3377
|
-
From the toplevel, you can use this method:
|
3378
|
-
|
3379
|
-
Bioroebe.do_not_truncate
|
3380
|
-
|
3381
|
-
The above instruction will toggle the truncate behaviour
|
3382
|
-
to not truncate, ever.
|
3383
|
-
|
3384
|
-
If you need to do so within the bioshell, this is the way:
|
3385
|
-
|
3386
|
-
no_truncate
|
3387
|
-
|
3388
|
-
Or simply
|
3389
|
-
|
3390
|
-
truncate
|
3391
|
-
|
3392
|
-
This will toggle, like a switch.
|
3393
|
-
|
3394
3258
|
## Rosalind Challenges
|
3395
3259
|
![alt text][cat1]
|
3396
3260
|
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
@@ -3527,31 +3391,6 @@ investing more time into Rosalind. Let's focus on solving
|
|
3527
3391
|
real, existing problems instead - at the least as far as
|
3528
3392
|
the Bioroebe project is concerned.
|
3529
3393
|
|
3530
|
-
## Numbers as input in the bioshell
|
3531
|
-
![alt text][cat1]
|
3532
|
-
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
3533
|
-
|
3534
|
-
You can input a number in the **BioShell** such as <b style="color: darkblue">3</b>.
|
3535
|
-
|
3536
|
-
This will attempt to <b>display the first 3 nucleotides</b> of
|
3537
|
-
the assigned **main sequence**. It will only work if you have
|
3538
|
-
assigned a sequence prior to that, though.
|
3539
|
-
|
3540
|
-
Examples:
|
3541
|
-
|
3542
|
-
3
|
3543
|
-
33
|
3544
|
-
15
|
3545
|
-
|
3546
|
-
## transeq
|
3547
|
-
![alt text][cat1]
|
3548
|
-
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
3549
|
-
|
3550
|
-
You can convert a DNA sequence into an aminoacid sequence by
|
3551
|
-
doing this:
|
3552
|
-
|
3553
|
-
transeq
|
3554
|
-
|
3555
3394
|
## Align two different sequences
|
3556
3395
|
![alt text][cat1]
|
3557
3396
|
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
@@ -3863,22 +3702,6 @@ does not (yet?) have support for comparing two genomes to
|
|
3863
3702
|
one another and generate a visual map indicating the findings
|
3864
3703
|
there.
|
3865
3704
|
|
3866
|
-
## Do not create directories on startup of the shell
|
3867
|
-
|
3868
|
-
By default the bioshell will try to create some directories
|
3869
|
-
on startup. This may not always be desired by the user
|
3870
|
-
though, so an option has to exist to disable this functionality.
|
3871
|
-
|
3872
|
-
Internally the variable @internal_hash[:create_directories_on_startup_of_the_shell]
|
3873
|
-
keeps track of whether directories on startup of the shell will
|
3874
|
-
be created.
|
3875
|
-
|
3876
|
-
To disable this behaviour on startup of the bioshell, try
|
3877
|
-
something like this:
|
3878
|
-
|
3879
|
-
bioshell --do-not-create-directories-on-startup
|
3880
|
-
bioshell --do-not-create-directories
|
3881
|
-
|
3882
3705
|
## class Bioroebe::MoveFileToItsCorrectLocation
|
3883
3706
|
|
3884
3707
|
This class will move a bio-file to its "correct" location, with respect
|
@@ -3921,15 +3744,6 @@ synonymous, aka aliases):
|
|
3921
3744
|
ruler2 25 # ← use 25 characters per line
|
3922
3745
|
ruler2 50 # ← use 50 characters per line
|
3923
3746
|
|
3924
|
-
## Generating a random nucleotide sequence based on frequencies
|
3925
|
-
|
3926
|
-
If you ever need to generate a nucleotide frequency then you can use
|
3927
|
-
the following method:
|
3928
|
-
|
3929
|
-
Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies
|
3930
|
-
Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies 100
|
3931
|
-
Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies 500
|
3932
|
-
|
3933
3747
|
## The Mouse
|
3934
3748
|
|
3935
3749
|
This subsection is about the **mouse**, in particular relevant
|
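Note on the removed frequency section above: generating a nucleotide sequence from given frequencies amounts to weighted random sampling. The sketch below shows one way to do this in plain Ruby. The method name `sequence_from_frequencies`, its parameters, and the example weights are hypothetical; the diff does not show how `Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies` works internally.

```ruby
# Hypothetical: build a sequence of +length+ nucleotides where each
# nucleotide is drawn with a probability proportional to its weight.
def sequence_from_frequencies(length, frequencies, rng = Random.new)
  total = frequencies.values.sum.to_f
  Array.new(length) {
    threshold = rng.rand * total
    running = 0.0
    picked = frequencies.keys.last # fallback for floating-point edge cases
    frequencies.each do |nucleotide, weight|
      running += weight
      if threshold < running
        picked = nucleotide
        break
      end
    end
    picked
  }.join
end

# Example weights (an assumption for illustration only).
weights = { 'A' => 0.1, 'T' => 0.2, 'C' => 0.3, 'G' => 0.4 }
puts sequence_from_frequencies(30, weights)
```

With equal weights for all four nucleotides this reduces to the uniform case, where each nucleotide has a 25% chance per position.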
@@ -4047,57 +3861,24 @@ has". Genes in itself are not that well-defined, so they are not necessarily
|
|
4047
3861
|
the primary means of complexity. Think of this more as an interactome,
|
4048
3862
|
where RNAs play a major dynamic role as well.
|
4049
3863
|
|
4050
|
-
## Bioroebe::
|
3864
|
+
## class Bioroebe::DisplayOpenReadingFrames
|
4051
3865
|
|
4052
|
-
|
4053
|
-
|
4054
|
-
|
3866
|
+
**class Bioroebe::DisplayOpenReadingFrames**, created in **May 2020**,
|
3867
|
+
will eventually replace the older **class Bioroebe::ShowOrf**. Thus,
|
3868
|
+
**class Bioroebe::DisplayOpenReadingFrames** will have to remain quite
|
3869
|
+
flexible. It shall also support **sixpack** and **showorf** from the
|
3870
|
+
**Emboss online tools**. (In fact, supporting these two use cases
|
3871
|
+
was the original reason as to why this class has been created.)
|
4055
3872
|
|
4056
|
-
|
4057
|
-
HMMs (Hidden Markov Models) one day.
|
3873
|
+
Where does the code to this class reside?
|
4058
3874
|
|
4059
|
-
|
3875
|
+
It can be found here:
|
4060
3876
|
|
4061
|
-
|
4062
|
-
|
3877
|
+
bioroebe/utility_scripts/display_open_reading_frames/
|
3878
|
+
require 'bioroebe/utility_scripts/display_open_reading_frames/display_open_reading_frames.rb'
|
4063
3879
|
|
4064
|
-
|
4065
|
-
|
4066
|
-
the Hash into the method generate_sequence_based_on_this_profile() -
|
4067
|
-
or you use the default Hash, which is stored in the constant
|
4068
|
-
called **PER_POSITION_HASH**.
|
4069
|
-
|
4070
|
-
That profile should be a Hash, with keys pointing to A, T, C, G
|
4071
|
-
and the values being an Array of likelihood chance there,
|
4072
|
-
as a number, such as 140. These values are also called
|
4073
|
-
**scores**. Each score contains a number for each position
|
4074
|
-
that indicates how likely it is to find the given
|
4075
|
-
nucleotide at that location.
|
4076
|
-
|
4077
|
-
You can also use this class to generate a random DNA string,
|
4078
|
-
similar to the method called
|
4079
|
-
**Bioroebe.generate_random_dna_sequence()**. The difference
|
4080
|
-
is that class ProfilePattern allows for a bit more fine-tuned
|
4081
|
-
control. The class will likely be extended in the future too.
|
4082
|
-
|
4083
|
-
## class Bioroebe::DisplayOpenReadingFrames
|
4084
|
-
|
4085
|
-
**class Bioroebe::DisplayOpenReadingFrames**, created in **May 2020**,
|
4086
|
-
will eventually replace the older **class Bioroebe::ShowOrf**. Thus,
|
4087
|
-
**class Bioroebe::DisplayOpenReadingFrames** will have to remain quite
|
4088
|
-
flexible. It shall also support **sixpack** and **showorf** from the
|
4089
|
-
**Emboss online tools**. (In fact, supporting these two use cases
|
4090
|
-
was the original reason as to why this class has been created.)
|
4091
|
-
|
4092
|
-
Where does the code to this class reside?
|
4093
|
-
|
4094
|
-
It can be found here:
|
4095
|
-
|
4096
|
-
bioroebe/utility_scripts/display_open_reading_frames/
|
4097
|
-
require 'bioroebe/utility_scripts/display_open_reading_frames/display_open_reading_frames.rb'
|
4098
|
-
|
4099
|
-
The display of this class is typically aimed for the commandline,
|
4100
|
-
but it is planned to use the class on the www too (via sinatra).
|
3880
|
+
The output of this class is typically aimed at the commandline,
|
3881
|
+
but it is planned to use the class on the www too (via sinatra).
|
4101
3882
|
|
4102
3883
|
Take note that this class also reports how many ORFs (open reading
|
4103
3884
|
frames) have been found. The number displayed here differs from
|
@@ -4459,28 +4240,6 @@ the BioRoebe-Shell, then you can use either of the following:
|
|
4459
4240
|
|
4460
4241
|
seq?
|
4461
4242
|
seq_with_tab?
|
4462
|
-
|
4463
|
-
## Prompt (the shell prompt)
|
4464
|
-
|
4465
|
-
You can set a <b>custom prompt</b>, via the keywords
|
4466
|
-
"prompt" or "set_prompt".
|
4467
|
-
|
4468
|
-
To display the <b>current working directory</b>, do:
|
4469
|
-
|
4470
|
-
prompt pwd
|
4471
|
-
|
4472
|
-
To revert to the old default again, do this:
|
4473
|
-
|
4474
|
-
prompt REVERT
|
4475
|
-
prompt revert
|
4476
|
-
prompt DEFAULT
|
4477
|
-
prompt default
|
4478
|
-
|
4479
|
-
If you do not want to set any prompt, do:
|
4480
|
-
|
4481
|
-
prompt none
|
4482
|
-
|
4483
|
-
|
4484
4243
|
|
4485
4244
|
## Leader and Trailer
|
4486
4245
|
|
@@ -4968,17 +4727,17 @@ For now, here is the list:
|
|
4968
4727
|
|
4969
4728
|
## The T-Bacteriophages
|
4970
4729
|
|
4971
|
-
The following table only shows a short summary for the
|
4730
|
+
The following table only shows a short summary for the <b>T-phages</b>.
|
4972
4731
|
|
4973
|
-
name of the phage | Plaque size | phage-head diameter (nm) | tail diameter | latent period (in minutes) | Burst size
|
4974
|
-
|
4975
|
-
T1 | medium | 50 | 150 x 15 | 13 | 180
|
4976
|
-
T2 | small | 65 x 80 | 120 x 20 | 21 | 120
|
4977
|
-
T3 | large | 45 | invisible | 13 | 300
|
4978
|
-
T4 | small | 65 x 80 | 120 x 20 | 23.5 | 300
|
4979
|
-
T5 | small | 100 | tiny | 40 | 300
|
4980
|
-
T6 | small | 65 x 80 | 120 x 20 | 25.5 | 200-300
|
4981
|
-
T7 | large | 45 | invisible | 13 | 300
|
4732
|
+
name of the phage | Plaque size | phage-head diameter (nm) | tail diameter | latent period (in minutes) | Burst size | n genes
|
4733
|
+
-------------------|--------------|---------------------------|----------------|----------------------------|-------------|------------
|
4734
|
+
T1 | medium | 50 | 150 x 15 | 13 | 180 |
|
4735
|
+
T2 | small | 65 x 80 | 120 x 20 | 21 | 120 |
|
4736
|
+
T3 | large | 45 | invisible | 13 | 300 |
|
4737
|
+
T4 | small | 65 x 80 | 120 x 20 | 23.5 | 300 | 300
|
4738
|
+
T5 | small | 100 | tiny | 40 | 300 |
|
4739
|
+
T6 | small | 65 x 80 | 120 x 20 | 25.5 | 200-300 |
|
4740
|
+
T7 | large | 45 | invisible | 13 | 300 |
|
4982
4741
|
|
4983
4742
|
The next table will show some phage genomes.
|
4984
4743
|
|
@@ -5389,215 +5148,6 @@ that format.
|
|
5389
5148
|
Presently (**May 2020**) there is no support for the mmCIF format
|
5390
5149
|
in the Bioroebe project, but this will eventually change.
|
5391
5150
|
|
5392
|
-
## Working with PDB files (.pdb)
|
5393
|
-
![alt text][cat1]
|
5394
|
-
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
5395
|
-
|
5396
|
-
The **PDB**, founded in the year **1971**, holds lots of **atomic
|
5397
|
-
structures of proteins**.
|
5398
|
-
|
5399
|
-
In **July 2016** it contained **121000 structures**.
|
5400
|
-
|
5401
|
-
In **February 2018** it contained **~124000 structures**
|
5402
|
-
(from X-ray crystallography), and about **~12000 NMR
|
5403
|
-
structures**. <b>NMR</b> is limited to about <b>350 amino
|
5404
|
-
acids maximum length</b>, give or take.
|
5405
|
-
|
5406
|
-
In **April 2020** the PDB contained **163141 structures**.
|
5407
|
-
|
5408
|
-
We can see that more and more structures are available
|
5409
|
-
nowadays - a trend that will most likely continue or
|
5410
|
-
even accelerate. (Let's hope the quality also remains
|
5411
|
-
high.)
|
5412
|
-
|
5413
|
-
A typical .pdb file contains entries such as this:
|
5414
|
-
|
5415
|
-
RTyp Num Atm Res Ch ResN X Y Z Occ Temp PDB Line
|
5416
|
-
ATOM 1 N ASP L 1 4.060 7.307 5.186 1.00 51.58 1FDL 93
|
5417
|
-
ATOM 2 CA ASP L 1 4.042 7.776 6.553 1.00 48.05 1FDL 94
|
5418
|
-
ATOM 3 N VAL A 25 32.433 16.336 57.540 1.00 11.92 A1 N
|
5419
|
-
ATOM 4 CA VAL A 25 31.132 16.439 58.160 1.00 11.85 A1 C
|
5420
|
-
ATOM 5 C VAL A 25 30.447 15.105 58.363 1.00 12.34 A1 C
|
5421
|
-
|
5422
|
-
(Note the first line; **RTyp** is just an explanation for the ATOM
|
5423
|
-
entries below that line).
|
5424
|
-
|
5425
|
-
The sequence starts from the N-terminal residue for proteins; see
|
5426
|
-
the <b>Atm</b> entry at <b>Num 1</b>.
|
5427
|
-
|
5428
|
-
The **meaning of these entries** is as follows:
|
5429
|
-
|
5430
|
-
1) RTyp: Record Type
|
5431
|
-
2) Num: Serial number of the atom. Each atom has a unique serial number.
|
5432
|
-
3) Atm: Atom name (in IUPAC format).
|
5433
|
-
4) Res: Residue name (IUPAC format).
|
5434
|
-
5) Ch: Chain to which the atom belongs (in this case, L for light chain of an antibody).
|
5435
|
-
6) ResN: Residue sequence number. This will be incremental, e.g. 1, 2, 3, 4 and so forth.
|
5436
|
-
7,8,9) X, Y, Z: Cartesian coordinates specifying atomic position in space.
|
5437
|
-
10) Occ: Occupancy factor
|
5438
|
-
11) Temp: Temperature factor (atoms disordered in the crystal have high
|
5439
|
-
temperature factors; they are "wobbly" with a high factor.
|
5440
|
-
This is also called the B-factor).
|
5441
|
-
12) PDB: The PDB data file unique identifier.
|
5442
|
-
13) Line: Line (record) number in the data file.
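To make the field legend above more concrete, here is a small sketch that maps the whitespace-separated example line onto those thirteen names. The method name and the field symbols are hypothetical, and this is not how Bioroebe::ParsePdbFile works; note also that real .pdb files use fixed column positions, so splitting on whitespace is only adequate for the simplified example lines shown here.

```ruby
# Field names follow the legend above (RTyp, Num, Atm, ...).
# Assumption: fields are separated by whitespace, as in the example lines.
ATOM_FIELDS = %i[rtyp num atm res ch resn x y z occ temp pdb line].freeze

def parse_example_atom_line(line)
  record = ATOM_FIELDS.zip(line.split).to_h
  # Convert the numeric fields; the remaining ones stay Strings.
  %i[num resn].each { |key| record[key] = Integer(record[key]) }
  %i[x y z occ temp].each { |key| record[key] = Float(record[key]) }
  record
end

atom = parse_example_atom_line(
  'ATOM 1 N ASP L 1 4.060 7.307 5.186 1.00 51.58 1FDL 93'
)
puts atom[:res]  # residue name
puts atom[:temp] # temperature factor (B-factor)
```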
|
5443
|
-
|
5444
|
-
Typically the rightmost entry, the last one, specifies
|
5445
|
-
which atom it is. A **H** stands for a hydrogen atom; the other atoms
|
5446
|
-
are "heavy" atoms (heavier than hydrogen most definitely).
|
5447
|
-
|
5448
|
-
Most .pdb files will contain **SEQRES** entries. These entries will list
|
5449
|
-
the primary sequence of the polymeric molecules present in the entry.
|
5450
|
-
You can notice this by looking at the standard 3-character code
|
5451
|
-
used by SEQRES here, for the canonical amino acids. So, for instance,
|
5452
|
-
the amino acids that will be mentioned in a SEQRES entry are
|
5453
|
-
ALA, CYS, ASP, GLU, PHE, GLY, HIS, ILE, LYS, LEU, MET, ASN,
|
5454
|
-
PRO, GLN, ARG, SER, THR, VAL, TRP and TYR. You can use the
|
5455
|
-
method **Bioroebe.three_to_one()** to convert back to the
|
5456
|
-
one-letter chain such as follows:
|
5457
|
-
|
5458
|
-
Bioroebe.three_to_one('PHE') # => "F"
|
5459
|
-
|
5460
|
-
The data in a .pdb file need not necessarily only be a protein, with
|
5461
|
-
a specific aminoacid sequence. It may also include DNA. An example
|
5462
|
-
for such a molecule is
|
5463
|
-
<b><a href="http://rcsb.org/pdb/explore/explore.do?structureId=2DGC">2dgc</a></b>,
|
5464
|
-
which includes a protein chain and a DNA chain.
|
5465
|
-
|
5466
|
-
As far as the **bioroebe project** is concerned, you can parse .pdb files
|
5467
|
-
via the following class:
|
5468
|
-
|
5469
|
-
Bioroebe::ParsePdbFile.new
|
5470
|
-
Bioroebe::ParsePdbFile.new(path_to_the_pdb_file_here)
|
5471
|
-
Bioroebe::ParsePdbFile.new('/foo/bar/ack.pdb')
|
5472
|
-
|
5473
|
-
This class also allows some shortcuts for integrated .pdb files,
|
5474
|
-
that is files that are bundled with the bioroebe project:
|
5475
|
-
|
5476
|
-
Bioroebe::ParsePdbFile.new ':1fat'
|
5477
|
-
|
5478
|
-
This requires a String because ruby symbols may not start with
|
5479
|
-
a number. Note that this also works through the commandline,
|
5480
|
-
such as:
|
5481
|
-
|
5482
|
-
parse_pdb_file :1fat
|
5483
|
-
|
5484
|
-
A shell such as bash does not understand ruby symbols, so instead
|
5485
|
-
a string will be passed in, being :1fat. The ParsePdbFile will
|
5486
|
-
handle this correctly internally.
|
5487
|
-
|
5488
|
-
Note that a small bug was fixed in the file parse_pdb_file.rb;
|
5489
|
-
some entries were skipped due to an erroneous loop in the ruby
|
5490
|
-
file. This was corrected in **May 2020**.
|
5491
|
-
|
5492
|
-
In **March 2021** the ability to use entries such as ':1fat'
|
5493
|
-
was removed again; the code remains though. The reason why
|
5494
|
-
this was removed was that the .pdb files are quite large,
|
5495
|
-
so distributing them via the bioroebe project makes no real
|
5496
|
-
sense. Consider simply downloading the .pdb files; you
|
5497
|
-
can use this from the bioshell or via something
|
5498
|
-
like:
|
5499
|
-
|
5500
|
-
pdb 5TIM
|
5501
|
-
|
5502
|
-
Note that you can also return the aminoacid-sequence from a
|
5503
|
-
.pdb file directly, as of **May 2020**.
|
5504
|
-
|
5505
|
-
Example for this:
|
5506
|
-
|
5507
|
-
Bioroebe.return_aminoacid_sequence_from_this_pdb_file "1VII.pdb" # => "MLSDEDFKAVFGMTRSAFANLPLWKQQNLKKEKGLF"
|
5508
|
-
|
5509
|
-
The first argument should be **the path to the (local)
|
5510
|
-
.pdb file at hand**. (In theory support for remote .pdb
|
5511
|
-
files could also be added easily, but right now this
|
5512
|
-
is not possible, so you have to download it first.)
|
5513
|
-
|
5514
|
-
The **specification for .pdb files** can be read at the following
|
5515
|
-
two remote resources:
|
5516
|
-
|
5517
|
-
http://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html
|
5518
|
-
http://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#ATOM
|
5519
|
-
|
5520
|
-
Note that the parse_pdb_file.rb can also do some additional
|
5521
|
-
things, such as calculating the maximum distance between
|
5522
|
-
atoms in that file, via the method
|
5523
|
-
**.try_to_determine_the_max_distance_between_the_atoms_in_this_protein()**.
|
5524
|
-
|
5525
|
-
If you wish to report the secondary structures from a given .pdb file
|
5526
|
-
then you can use the following class:
|
5527
|
-
|
5528
|
-
require 'bioroebe/pdb/report_secondary_structures_from_this_pdb_file.rb'
|
5529
|
-
|
5530
|
-
Bioroebe::ReportSecondaryStructuresFromThisPdbFile.new
|
5531
|
-
Bioroebe::ReportSecondaryStructuresFromThisPdbFile.new('foobar.pdb')
|
5532
|
-
|
5533
|
-
If you wish to obtain the FASTA sequence of a particular remote
|
5534
|
-
.pdb file then you can use this API:
|
5535
|
-
|
5536
|
-
x = Bioroebe.return_fasta_sequence_from_this_pdb_file "2bts" # => "MLSDEDFKAVFGMTRSAFANLPLWKQQNLKKEKGLF"
|
5537
|
-
|
5538
|
-
Keep in mind that this is the FASTA sequence; the .pdb file itself
|
5539
|
-
has another format, and contains a lot more information, such as
|
5540
|
-
the various ATOM entries.
|
5541
|
-
|
5542
|
-
As of **June 2020** the command **fetch** also works from
|
5543
|
-
within the Bioshell, similar to how pymol **works**. This allows
|
5544
|
-
us to quickly download a remote .pdb file.
|
5545
|
-
|
5546
|
-
fetch 2BTS
|
5547
|
-
|
5548
|
-
You can also use the following toplevel-API to download a remote
|
5549
|
-
.pdb file:
|
5550
|
-
|
5551
|
-
Bioroebe.download_this_pdb
|
5552
|
-
Bioroebe.download_this_pdb '355D'
|
5553
|
-
Bioroebe.download_this_pdb '1K4R' # This is the Dengue Virus
|
5554
|
-
|
5555
|
-
Note that this will be automatically moved to the "correct" default
|
5556
|
-
position in the bioroebe-project, under the **pdb/** subdirectory.
|
5557
|
-
|
5558
|
-
You can also invoke this script from the commandline via
|
5559
|
-
**bin/download_this_pdb**, like in this way:
|
5560
|
-
|
5561
|
-
download_this_pdb 355D
|
5562
|
-
|
5563
|
-
This works with several .pdb files in one go as well:
|
5564
|
-
|
5565
|
-
download_this_pdb 1NR6 2F9Q 3TDA 2HI4 2V0M
|
5566
|
-
|
5567
|
-
They would all be downloaded one after the other. Be aware that
|
5568
|
-
this will overwrite the old .pdb files on that position, so
|
5569
|
-
if you don't want this, I recommend making a backup of the
|
5570
|
-
**pdb/** subdirectory before invoking the above call.
|
5571
|
-
|
5572
|
-
You can also turn the FASTA sequence stored in a .pdb file into
|
5573
|
-
a .fasta file, via **--create-fasta-file**.
|
5574
|
-
|
5575
|
-
Usage examples:
|
5576
|
-
|
5577
|
-
parsedb 1NR6 --create-fasta-file
|
5578
|
-
parsedb 2F9Q --create-fasta-file
|
5579
|
-
parsedb 3TDA --create-fasta-file
|
5580
|
-
parsedb 2HI4 --create-fasta-file
|
5581
|
-
parsedb 2V0M --create-fasta-file
|
5582
|
-
|
5583
|
-
So if you have a file called <b>1NR6.pdb</b> and you use
|
5584
|
-
the first input, a .fasta file will be created. If such
|
5585
|
-
a .pdb file does not exist then this will not work, so
|
5586
|
-
make sure to download the .pdb file before invoking
|
5587
|
-
this commandline-flag.
|
5588
|
-
|
5589
|
-
Last but not least, the following table shall document the
|
5590
|
-
PDB format - it is not yet complete, but it is intended
|
5591
|
-
to add the remaining datasets eventually:
|
5592
|
-
|
5593
|
-
Record Name Describes
|
5594
|
-
MODRES Modifications to standard residues
|
5595
|
-
HET Nonstandard residues (as well as ligands, ions and water)
|
5596
|
-
HETNAM Full chemical name of the residue
|
5597
|
-
HETSYM Synonyms for the residue
|
5598
|
-
FORMUL Chemical formula of the residue
|
5599
|
-
KEYWDS specifies keywords, such as "FK506 BINDING PROTEIN, FKBP12, CIS-TRANS PROLYL-ISOMERASE, ROTAMASE"
|
5600
|
-
|
5601
5151
|
## Sugars and glyco-patterns
|
5602
5152
|
|
5603
5153
|
I am currently having to do an assignment related to glyco-patterns
|
@@ -5761,6 +5311,9 @@ like this:
|
|
5761
5311
|
|
5762
5312
|
<img src="https://i.imgur.com/vr2kEBz.png" style="margin: 1em; margin-left: 3em">
|
5763
5313
|
|
5314
|
+
As of <b>July 2022</b> invalid amino acids will be automatically
|
5315
|
+
filtered away before being assigned to the input.
|
5316
|
+
|
5764
5317
|
## Colourizing hydrophilic and hydrophobic aminoacids on the commandline
|
5765
5318
|
|
5766
5319
|
Via class **Bioroebe::ColourizeHydrophilicAndHydrophobicAminoacids** you
|
@@ -5774,35 +5327,36 @@ Example output for this:
|
|
5774
5327
|
|
5775
5328
|
This subsection contains some information about proteases.
|
5776
5329
|
|
5777
|
-
|
5330
|
+
Trypsin:
|
5778
5331
|
https://en.wikipedia.org/wiki/Trypsin
|
5779
|
-
cuts at
|
5332
|
+
<b>cuts at</b>: Trypsin cuts peptide chains mainly at the carboxyl
|
5780
5333
|
side of the amino acids lysine or arginine.
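The trypsin rule stated above (cleavage at the carboxyl side of lysine or arginine) can be sketched as a simple in-silico digest. The method name is hypothetical and this is not part of the Bioroebe API as far as this document shows; the sketch deliberately ignores any further exceptions or missed cleavages.

```ruby
# Splits a one-letter aminoacid sequence after every lysine (K) or
# arginine (R). A trailing piece without K/R is kept as the last fragment.
def trypsin_fragments(sequence)
  sequence.upcase.scan(/[^KR]*[KR]|[^KR]+/)
end

p trypsin_fragments('MAGKTLRSSKW')
```

For the example input this yields the four fragments MAGK, TLR, SSK and W.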
|
5781
5334
|
|
5782
|
-
|
5335
|
+
Chymotrypsin:
|
5783
5336
|
https://en.wikipedia.org/wiki/Chymotrypsin
|
5784
|
-
cuts at
|
5337
|
+
<b>cuts at</b>: Chymotrypsin preferentially cleaves peptide amide
|
5785
5338
|
bonds where the side chain of the amino acid N-terminal
|
5786
|
-
to the scissile amide bond is a large hydrophobic amino
|
5787
|
-
acid (tyrosine, tryptophan, and phenylalanine).
|
5339
|
+
to the scissile amide bond is <b>a large hydrophobic amino</b>
|
5340
|
+
acid (specifically: tyrosine, tryptophan, and phenylalanine).
|
5341
|
+
Chymotrypsin will cleave proteins on the <b>carboxyl side</b>
|
5342
|
+
of aromatic or large hydrophobic amino acids.
|
5788
5343
|
|
5789
|
-
|
5344
|
+
Thrombin:
|
5790
5345
|
https://en.wikipedia.org/wiki/Thrombin
|
5791
|
-
cuts at
|
5346
|
+
<b>cuts at</b>: Thrombin acts as a serine protease that converts
|
5792
5347
|
soluble fibrinogen into insoluble strands of fibrin. It
|
5793
5348
|
catalyzes the hydrolysis of <b>Arg-Gly</b> bonds in
|
5794
5349
|
particular peptide sequences only.
|
5795
5350
|
|
5796
|
-
|
5351
|
+
Plasmin:
|
5797
5352
|
https://en.wikipedia.org/wiki/Plasmin
|
5798
|
-
cuts at
|
5353
|
+
<b>cuts at</b>: Plasmin is a serine protease.
|
5799
5354
|
|
5800
|
-
|
5355
|
+
Papain:
|
5801
5356
|
https://en.wikipedia.org/wiki/Papain
|
5802
|
-
cuts at
|
5803
|
-
|
5804
|
-
|
5805
|
-
not followed by a valine.
|
5357
|
+
<b>cuts at</b>: Papain prefers to cleave after an arginine or
|
5358
|
+
lysine preceded by a hydrophobic unit (Ala, Val, Leu, Ile,
|
5359
|
+
Phe, Trp, Tyr) and not followed by a valine.
|
5806
5360
|
|
5807
5361
|
factor Xa:
|
5808
5362
|
|
@@ -5814,8 +5368,8 @@ Some proteins may permanently reside in the lumen of the
|
|
5814
5368
|
Often such proteins will have a special signal sequence attached
|
5815
5369
|
to their **C-terminal part**, such as **KDEL** (Lys-Asp-Glu-Leu).
|
5816
5370
|
|
5817
|
-
KDEL is not the only signal that may be used, though. Some
|
5818
|
-
may use different signals, such as:
|
5371
|
+
<b>KDEL</b> is not the only signal that may be used, though. Some
|
5372
|
+
species may use different signals, such as:
|
5819
5373
|
|
5820
5374
|
aminoacids | species
|
5821
5375
|
-------------|------------------------------------------------------------
|
@@ -5825,8 +5379,9 @@ may use different signals, such as:
|
|
5825
5379
|
ADEL | Schizosaccharomyces pombe (fission yeast)
|
5826
5380
|
SDEL | Plasmodium falciparum
|
5827
5381
|
|
5828
|
-
If you work with the bioshell then you can simply use this
|
5829
|
-
to query whether the given aminoacid sequence has a KDEL
|
5382
|
+
If you work with the <b>bioshell</b> then you can simply use this
|
5383
|
+
method to query whether the given aminoacid sequence has a KDEL
|
5384
|
+
sequence:
|
5830
5385
|
|
5831
5386
|
KDEL?
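Outside of the bioshell, the same kind of check can be approximated in a few lines of plain ruby. This is only a sketch under stated assumptions: the constant and the method name are hypothetical, the list holds just the signals named in this subsection, and it is not claimed to be how the `KDEL?` command is implemented.

```ruby
# C-terminal retention signals mentioned in the text and table above;
# illustrative, not exhaustive.
ER_RETENTION_SIGNALS = %w[KDEL ADEL SDEL].freeze

# Returns the matching signal if the sequence ends with one, else nil.
def er_retention_signal(sequence)
  ER_RETENTION_SIGNALS.find { |signal| sequence.upcase.end_with?(signal) }
end

p er_retention_signal('MAGTLLSKDEL')
p er_retention_signal('MAGTLLS')
```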
|
5832
5387
|
|
@@ -6237,8 +5792,6 @@ Next, do something such as this:
|
|
6237
5792
|
This will show the distribution of the oligos.
|
6238
5793
|
|
6239
5794
|
## Number of chromosomes in different species
|
6240
|
-
![alt text][cat1]
|
6241
|
-
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
6242
5795
|
|
6243
5796
|
Name of the organism | Latin name | Number of chromosomes
|
6244
5797
|
---------------------|--------------|-----------------------
|
@@ -6316,112 +5869,6 @@ So this is what would be returned:
|
|
6316
5869
|
|
6317
5870
|
Bioroebe::DetectMinimalCodon[["TTT", "TTC"]] # => ["TTY"]
|
6318
5871
|
|
6319
|
-
## Codon Usage
|
6320
|
-
|
6321
|
-
This **paragraph** deals with some aspects of **codon usage** in different
|
6322
|
-
organisms.
|
6323
|
-
|
6324
|
-
Let us first define the term <b>codon usage</b>. In order to do so,
|
6325
|
-
we also have to define what a <b>codon</b> is, so let's start with that.
|
6326
|
-
|
6327
|
-
A <span style="color: darkgreen; font-weight: bold">codon</span> is
|
6328
|
-
essentially the basic code used in DNA to denote which particular
|
6329
|
-
**aminoacid** corresponds to these (three) nucleotide base pairs.
|
6330
|
-
A codon is thus **a series of three nucleotides**, also called
|
6331
|
-
a <b>triplet</b>.
|
6332
|
-
|
6333
|
-
When we use the term <b>base pairs</b>, we refer to **double-stranded DNA**,
|
6334
|
-
abbreviated as <b>dsDNA</b>. The codon is, however, only found
|
6335
|
-
in a single stranded molecule, even within dsDNA. Since some parts of
|
6336
|
-
a **dsDNA** in any given genome gives rise to a, more or less, complementary
|
6337
|
-
copy into **mRNA**, the codons that are actually used, are found in the
|
6338
|
-
corresponding mRNA. (Remember that mRNA differs from DNA in that there
|
6339
|
-
will be Uracil rather than Thymine; otherwise it is the same, sequence-wise.
|
6340
|
-
Of course it uses another sugar (Ribose), but remember we are here mostly
|
6341
|
-
interested in the **information-containing part**, not the full chemical
|
6342
|
-
structure.)
|
6343
|
-
|
6344
|
-
The codon is thus found on the mRNA and since mRNA is mostly
|
6345
|
-
single-stranded, the codon is a component of the mRNA. It is
|
6346
|
-
where the two subunits of the ribosome are assembled (or more
|
6347
|
-
accurately, the smaller subunit scans along the mRNA until it
|
6348
|
-
detects a start codon). Mind you, this subsection will not go into
|
6349
|
-
all relevant details, so just keep in mind that the codon is the
|
6350
|
-
part that will eventually be "translated" at the ribosome into
|
6351
|
-
a corresponding aminoacid, excluding stop codons at the end.
|
6352
|
-
|
6353
|
-
Now - different organisms use **different frequencies of codons**.
|
6354
|
-
**Codon usage** thus describes the fact that many proteins in
|
6355
|
-
these different organisms make use of certain codons with a
|
6356
|
-
**substantially higher frequency than other codons**. We can
|
6357
|
-
use statistics to infer this on a global (proteome) level
|
6358
|
-
too.
|
6359
|
-
|
6360
|
-
Remember that the genetic code is **degenerate**, meaning that
|
6361
|
-
you have a few aminoacids that are encoded only by one codon
|
6362
|
-
(<b>Tryptophan</b> and <b>Methionine</b>), whereas the other
|
6363
|
-
aminoacids are encoded by more than one codon - thus, at the
|
6364
|
-
very least two codons. Note that the latter codons, if they
|
6365
|
-
code for the **same** aminoacid, are also called <b>synonymous
|
6366
|
-
codons</b>.
|
6367
|
-
|
6368
|
-
This means that if you have any given aminoacid chain, you can have
|
6369
|
-
several different sequences (and codons in these sequences, which
|
6370
|
-
ultimately means that you can have different DNA sequences code for
|
6371
|
-
the very same aminoacid chain).
|
6372
|
-
|
6373
|
-
Usually the third base of a codon has the least influence on
|
6374
|
-
codon meaning. This is also called <b>wobbling</b> - since
|
6375
|
-
the anticodon loop on the tRNA is in the reverse direction,
|
6376
|
-
and the wobble position refers to the tRNA, this means that
|
6377
|
-
the wobble-position is at the 5'-end of the tRNA anticodon.
|
6378
|
-
|
6379
|
-
Now a few words about functionality related to codons and codon
|
6380
|
-
usage in the Bioroebe project.
|
6381
|
-
|
6382
|
-
Say that you have a long DNA sequence; let's pick a sample
|
6383
|
-
for now, such as:
|
6384
|
-
|
6385
|
-
ATGGGCGGGGTGATGGCAATGATGCCCCCGATGATG
|
6386
|
-
|
6387
|
-
You can analyze the codons used via class **ShowCodonUsage**:
|
6388
|
-
|
6389
|
-
show_codon_usage ATGGGCGGGGTGATGGCAATGATGCCCCCGATGATG
|
6390
|
-
|
6391
|
-
This class can be found at <b>bioroebe/codons/show_codon_usage.rb</b>.
|
6392
|
-
It will report the top 5 codons in use and also output the
|
6393
|
-
frequency hash on the commandline.
|
6394
|
-
|
6395
|
-
You can use this from ruby too, via this toplevel method:
|
6396
|
-
|
6397
|
-
Bioroebe.codon_frequencies_of_this_sequence(ARGV)
|
6398
|
-
|
6399
|
-
If you want to look at the actual codon frequencies used
|
6400
|
-
by different organisms, have a look here:
|
6401
|
-
|
6402
|
-
http://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=11076&aa=9&style=N
|
6403
|
-
|
6404
|
-
This is an excellent resource.
|
6405
|
-
|
6406
|
-
## Determining the frequencies of aminoacids in a given aminoacid (protein) sequence
|
6407
|
-
|
6408
|
-
If you quickly wish to determine the aminoacid composition, as a
|
6409
|
-
Hash, you can use **bin/aminoacid_frequencies**.
|
6410
|
-
|
6411
|
-
Example from the commandline for this:
|
6412
|
-
|
6413
|
-
aminoacid_frequencies MVTDEGAIYFTKDAARNWKAAVEETVSATLNRTVSSGITGASYYTGTFST
|
6414
|
-
|
6415
|
-
Example from within bioroebe itself (and thus ruby):
|
6416
|
-
|
6417
|
-
require 'bioroebe/frequencies.rb'
|
6418
|
-
|
6419
|
-
Bioroebe.aminoacid_frequencies('MVTDEGAIYFTKDAARNWKAAVEETVSATLNRTVSSGITGASYYTGTFST')
|
6420
|
-
|
6421
|
-
The latter will return a Hash that you can then make further use of, such as:
|
6422
|
-
|
6423
|
-
{"M"=>1, "V"=>4, "T"=>9, "D"=>2, "E"=>3, "G"=>4, "A"=>7, "I"=>2, "Y"=>3, "F"=>2, "K"=>2, "R"=>2, "N"=>2, "W"=>1, "S"=>5, "L"=>1}
|
6424
|
-
|
6425
5872
|
## The Levenshtein distance
|
6426
5873
|
|
6427
5874
|
The <b>Levenshtein distance</b> - also called a '**string metric**' - was formulated
|
@@ -6839,6 +6286,34 @@ change A: teal or C: slateblue to some other colour; these are HTML
|
|
6839
6286
|
colours, so it is recommended to use the names of these HTML
|
6840
6287
|
colours).
|
6841
6288
|
|
6289
|
+
In <b>July 2022</b> the method <b>Bioroebe.colourize_this_fasta_sequence</b>
|
6290
|
+
was extended slightly. You can now attach a "ruler" to the output, that
|
6291
|
+
is a numbered series that shows the nucleotide position, on the commandline.
|
6292
|
+
|
6293
|
+
Example for this:
|
6294
|
+
|
6295
|
+
puts Bioroebe.colourize_this_fasta_sequence('ATGAAATCGCGCGTGCCGCGCGCGC'\
|
6296
|
+
'GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCTGCGCGCGCGCGCGCGCGCGCG'\
|
6297
|
+
'TGCCGCGCGCAGGCGGCGGCGGCGGCGGCGGCG'
|
6298
|
+
) { :with_ruler }
|
6299
|
+
|
6300
|
+
By default this will use a white colour on black background. If you want to
|
6301
|
+
modify the foreground colour you can pass the colour name to the method,
|
6302
|
+
such as via:
|
6303
|
+
|
6304
|
+
puts Bioroebe.colourize_this_fasta_sequence('ATGAAATCGCGCGTGCCGCGCGCGC'\
|
6305
|
+
'GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCTGCGCGCGCGCGCGCGCGCGCG'\
|
6306
|
+
'TGCCGCGCGCAGGCGGCGGCGGCGGCGGCGGCG'
|
6307
|
+
) { :with_ruler_steelblue_colour }
|
6308
|
+
|
6309
|
+
The following image shows how this can be used on the commandline:
|
6310
|
+
|
6311
|
+
<img src="https://i.imgur.com/ucVEVnK.png" style="margin: 1em; border: 3px solid black">
|
6312
|
+
|
6313
|
+
At a later time this may be extended to allow for use in a webpage,
|
6314
|
+
that is to embed these strings directly into HTML or .php or
|
6315
|
+
.cgi.
|
6316
|
+
|
6842
6317
|
If you wish to show a **chunked display** of the dataset (nucleotides
|
6843
6318
|
normally) then you can use the following API:
|
6844
6319
|
|
@@ -7362,16 +6837,6 @@ This would notify the bioshell that only nucleotides from position
|
|
7362
6837
|
51 to (including) position 3251 will be colourized, when doing another
|
7363
6838
|
"ORF?" invocation.
|
7364
6839
|
|
7365
|
-
## Longest substring
|
7366
|
-
|
7367
|
-
Within the Bioroebe::Shell you can determine the longest substring,
|
7368
|
-
including gaps, like so:
|
7369
|
-
|
7370
|
-
longest_substring? ATTATTGTT | ATTATTCTT
|
7371
|
-
|
7372
|
-
Note that this will make use of the diff-lcs gem, which uses
|
7373
|
-
the McIlroy-Hunt algorithm.
|
7374
|
-
|
7375
6840
|
## Restriction Enzymes
|
7376
6841
|
|
7377
6842
|
This **subsection** will eventually be expanded to explain various things about
|
@@ -8730,6 +8195,22 @@ The images that can be generated via this may look as follows:
|
|
8730
8195
|
|
8731
8196
|
<img src="https://i.imgur.com/fWwD1fj.png" style="margin: 1em; margin-left: 2em">
|
8732
8197
|
|
8198
|
+
Let's look at another example.
|
8199
|
+
|
8200
|
+
Say you input the following sequences there:
|
8201
|
+
|
8202
|
+
AGVV
|
8203
|
+
AGVV
|
8204
|
+
AGVV
|
8205
|
+
AGVV
|
8206
|
+
AGGV
|
8207
|
+
AGGV
|
8208
|
+
AGGV
|
8209
|
+
|
8210
|
+
The resulting image that is generated is:
|
8211
|
+
|
8212
|
+
<img src="https://i.imgur.com/3wWApIQ.png" style="margin: 1em; margin-left: 2em">
|
8213
|
+
|
8733
8214
|
## The Kozak Sequence
|
8734
8215
|
|
8735
8216
|
The ribosome usually scans for a **AUG** codon. But there are
|
@@ -8869,85 +8350,6 @@ Usage Example:
|
|
8869
8350
|
|
8870
8351
|
pfasta insulin_mRNA.fasta --toprotein
|
8871
8352
|
|
8872
|
-
## Determining the codon frequencies from the commandline
|
8873
|
-
|
8874
|
-
In April 2022 I noticed that one use case is to show the codon
|
8875
|
-
frequencies of a given sequence - typically a nucleotide sequence.
|
8876
|
-
|
8877
|
-
For aminoacids there already was an executable, at **bin/aminoacid_frequencies**.
|
8878
|
-
So, following that logic, a new executable was added at
|
8879
|
-
**bin/codon_frequency**. This will show the Hash of the codon
|
8880
|
-
frequencies, as a String, on the commandline.
|
8881
|
-
|
8882
|
-
Usage example:
|
8883
|
-
|
8884
|
-
codon_frequency ATTCGTACGATCGACTGACTGACAGTCATTCGT
|
8885
|
-
|
8886
|
-
The output of this would be the following:
|
8887
|
-
|
8888
|
-
AUU: 2
|
8889
|
-
CGU: 2
|
8890
|
-
ACG: 1
|
8891
|
-
AUC: 1
|
8892
|
-
GAC: 1
|
8893
|
-
UGA: 1
|
8894
|
-
CUG: 1
|
8895
|
-
ACA: 1
|
8896
|
-
GUC: 1
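The counting step behind such an output can be sketched in plain ruby. The method name is hypothetical and this is not the code of bin/codon_frequency; it assumes the sequence is read in frame from the first nucleotide, converts T to U as in the output above, and drops an incomplete trailing codon.

```ruby
# Splits the sequence into triplets (reading frame 1) and counts them.
def codon_counts(dna)
  rna = dna.upcase.tr('T', 'U')
  rna.scan(/.{3}/).tally # requires ruby 2.7+ for tally
end

codon_counts('ATTCGTACGATCGACTGACTGACAGTCATTCGT').each_pair do |codon, count|
  puts "#{codon}: #{count}"
end
```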
|
8897
|
-
|
8898
|
-
## Showing the codon frequency via countcodon
|
8899
|
-
|
8900
|
-
https://www.kazusa.or.jp/codon/countcodon.html offers a rather useful
|
8901
|
-
functionality via a simple web-interface, in that you can pass in a mRNA
|
8902
|
-
sequence, and it will then show the codon frequency/likelihood of that
|
8903
|
-
sequence - all codons in that sequence, that is. This can be extended
|
8904
|
-
to all protein-coding genes in a given genome, and will thus be useful
|
8905
|
-
for a researcher who may be interested in determining the codon frequency
|
8906
|
-
in general, across all genes in that given genome.
|
8907
|
-
|
8908
|
-
You can test it with an input sequence. For instance, the following
|
8909
|
-
sequence:
|
8910
|
-
|
8911
|
-
ATTCGTACGATCGACTGACTGACAGTCATTCGTAGTACGATCGACTGACTGACAGTCATTCGTACGATCGACTGACTGACAAGTCATTCGTACGATCGACTGACTTGACAGTCATAA
|
8912
|
-
|
8913
|
-
Would yield this result:
|
8914
|
-
|
8915
|
-
fields: [triplet] [frequency: per thousand] ([number])
|
8916
|
-
|
8917
|
-
UUU 0.0( 0) UCU 0.0( 0) UAU 0.0( 0) UGU 0.0( 0)
|
8918
|
-
UUC 0.0( 0) UCC 0.0( 0) UAC 25.6( 1) UGC 0.0( 0)
|
8919
|
-
UUA 0.0( 0) UCA 25.6( 1) UAA 25.6( 1) UGA102.6( 4)
|
8920
|
-
UUG 0.0( 0) UCG 25.6( 1) UAG 0.0( 0) UGG 0.0( 0)
|
8921
|
-
|
8922
|
-
CUU 0.0( 0) CCU 0.0( 0) CAU 25.6( 1) CGU 76.9( 3)
|
8923
|
-
CUC 0.0( 0) CCC 0.0( 0) CAC 0.0( 0) CGC 0.0( 0)
|
8924
|
-
CUA 0.0( 0) CCA 0.0( 0) CAA 0.0( 0) CGA 25.6( 1)
|
8925
|
-
CUG102.6( 4) CCG 0.0( 0) CAG 25.6( 1) CGG 0.0( 0)
|
8926
|
-
|
8927
|
-
AUU 76.9( 3) ACU 25.6( 1) AAU 0.0( 0) AGU 51.3( 2)
|
8928
|
-
AUC 76.9( 3) ACC 0.0( 0) AAC 0.0( 0) AGC 0.0( 0)
|
8929
|
-
AUA 0.0( 0) ACA 76.9( 3) AAA 0.0( 0) AGA 0.0( 0)
|
8930
|
-
AUG 0.0( 0) ACG 76.9( 3) AAG 0.0( 0) AGG 0.0( 0)
|
8931
|
-
|
8932
|
-
GUU 0.0( 0) GCU 0.0( 0) GAU 25.6( 1) GGU 0.0( 0)
|
8933
|
-
GUC 51.3( 2) GCC 0.0( 0) GAC 76.9( 3) GGC 0.0( 0)
|
8934
|
-
GUA 0.0( 0) GCA 0.0( 0) GAA 0.0( 0) GGA 0.0( 0)
|
8935
|
-
GUG 0.0( 0) GCG 0.0( 0) GAG 0.0( 0) GGG 0.0( 0)
|
8936
|
-
|
8937
|
-
At any rate, the individual functionality for that is also available
|
8938
|
-
within the Bioroebe project since **April 2022**.
|
8939
|
-
|
8940
|
-
The method that does so is:
|
8941
|
-
|
8942
|
-
Bioroebe.frequency_per_thousand
|
8943
|
-
Bioroebe.frequency_per_thousand('ATTCGTACGATCGACTGACTGACAGTCATTCGTAGTACGATCGACTGACTGACAGTCATTCGTACGATCGACTGACTGACAAGTCATTCGTACGATCGACTGACTTGACAGTCATAA') # Usage example here.
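The per-thousand value in the table above is the codon count divided by the total number of codons, times 1000; a single occurrence among 39 codons gives 25.6. A minimal standalone sketch of that arithmetic follows. The method name is hypothetical, and it is not claimed to match the internals of Bioroebe.frequency_per_thousand.

```ruby
# Returns a Hash of codon => frequency per thousand, rounded to one decimal.
# Assumes reading frame 1; an incomplete trailing codon is ignored.
def per_thousand_sketch(dna)
  codons = dna.upcase.tr('T', 'U').scan(/.{3}/)
  codons.tally.transform_values { |count|
    (count * 1000.0 / codons.size).round(1)
  }
end

p per_thousand_sketch('ATGATGTAA')
```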
|
8944
|
-
|
8945
|
-
At a later time sinatra-bindings as well as ruby-gtk3 bindings will
|
8946
|
-
be added, and possibly ruby-libui bindings as well, for windows
|
8947
|
-
support. What is missing is support for different codon tables in
|
8948
|
-
different species, but that may be added at a later time as well
|
8949
|
-
- for now it seemed more important to offer the functionality.
|
8950
|
-
|
8951
8353
|
## class Bioroebe::Protein
|
8952
8354
|
|
8953
8355
|
**class Bioroebe::Protein** can be used to store a protein sequence.
|
@@ -9180,6 +8582,1036 @@ time being it is what it is. At a later point in time test cases
|
|
9180
8582
|
may be added to check whether it performs correctly or whether it
|
9181
8583
|
does not.
|
9182
8584
|
|
8585
|
+
The other rules, also published in 2004, are the Reynolds rules. Code
|
8586
|
+
support was added to the Bioroebe project in <b>June 2022</b>, but
|
8587
|
+
it was not tested yet, so the implementation may be incorrect.
|
8588
|
+
|
8589
|
+
## The Bioroebe::Shell interface
|
8590
|
+
|
8591
|
+
The following subsection specifically handles information
|
8592
|
+
pertaining to the <b>Bioroebe::Shell</b> interface of the
|
8593
|
+
<b>bioroebe project</b>. It is also called <b>bioshell</b>,
|
8594
|
+
to simplify spelling it.
|
8595
|
+
|
8596
|
+
### Numbers as input in the bioshell
|
8597
|
+
![alt text][cat1]
|
8598
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8599
|
+
|
8600
|
+
You can input a number in the **BioShell** such as <b style="color: darkblue">3</b>.
|
8601
|
+
|
8602
|
+
This will attempt to <b>display the first 3 nucleotides</b> of
|
8603
|
+
the assigned **main sequence**. It will only work if you have
|
8604
|
+
assigned a sequence prior to that, though.
|
8605
|
+
|
8606
|
+
Examples:
|
8607
|
+
|
8608
|
+
3
|
8609
|
+
33
|
8610
|
+
15
|
8611
|
+
|
8612
|
+
### transeq
|
8613
|
+
![alt text][cat1]
|
8614
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8615
|
+
|
8616
|
+
You can convert a DNA sequence into an aminoacid sequence by
|
8617
|
+
doing this:
|
8618
|
+
|
8619
|
+
transeq
|
8620
|
+
|
8621
|
+
### Shuffling the DNA/RNA string in the bioshell
|
8622
|
+
![alt text][cat1]
|
8623
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8624
|
+
|
8625
|
+
Via
|
8626
|
+
|
8627
|
+
shuffle
|
8628
|
+
|
8629
|
+
you can <b>randomly rearrange the main DNA/RNA string</b>
|
8630
|
+
that is used by the <b>Bioroebe::Shell</b>.
|
8631
|
+
|
8632
|
+
This can be useful if you just wish to quickly "test"
|
8633
|
+
new arrangements of the same nucleotides.
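
As a rough illustration of what such a shuffle does, here is a plain-Ruby sketch. It is not the actual bioshell code; `shuffle_sequence` is a hypothetical helper name used only for this example.

```ruby
# Hypothetical helper: randomly rearrange the nucleotides of a sequence.
# The composition (how many A, T, C, G) stays the same; only the order changes.
def shuffle_sequence(sequence)
  sequence.chars.shuffle.join
end

original = 'ATGCCGTA'
shuffled = shuffle_sequence(original)
puts shuffled
```

The result differs from run to run, but it always has the same length and the same nucleotide counts as the input.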
|
8634
|
+
|
8635
|
+
### Permanently disabling showing the startup-introduction of the Bioshell
|
8636
|
+
![alt text][cat1]
|
8637
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8638
|
+
|
8639
|
+
If you do not want to see the start-up intro, you can try
|
8640
|
+
any of the following:
|
8641
|
+
|
8642
|
+
bioshell --permanently-disable-startup-intro
|
8643
|
+
bioshell --permanently-disable-startup-notice
|
8644
|
+
bioshell --permanently-no-startup-intro
|
8645
|
+
bioshell --permanently-no-startup-info
|
8646
|
+
|
8647
|
+
### Longest substring
|
8648
|
+
![alt text][cat1]
|
8649
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8650
|
+
|
8651
|
+
Within the Bioroebe::Shell you can determine the longest substring,
|
8652
|
+
including gaps, like so:
|
8653
|
+
|
8654
|
+
longest_substring? ATTATTGTT | ATTATTCTT
|
8655
|
+
|
8656
|
+
Note that this will make use of the diff-lcs gem, which uses
|
8657
|
+
the McIlroy-Hunt algorithm.
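
To give an idea of what "longest substring, including gaps" means (that is, a longest common subsequence), here is a small stand-alone sketch. Note that it uses the classic dynamic-programming table rather than the McIlroy-Hunt algorithm of diff-lcs, and `longest_common_subsequence` is a hypothetical name, not a Bioroebe method.

```ruby
# Hypothetical helper: longest common subsequence of two strings via
# classic dynamic programming (not the McIlroy-Hunt variant).
def longest_common_subsequence(a, b)
  # table[i][j] holds the LCS length of a[0, i] and b[0, j].
  table = Array.new(a.size + 1) { Array.new(b.size + 1, 0) }
  a.each_char.with_index(1) do |char_a, i|
    b.each_char.with_index(1) do |char_b, j|
      table[i][j] =
        if char_a == char_b
          table[i - 1][j - 1] + 1
        else
          [table[i - 1][j], table[i][j - 1]].max
        end
    end
  end
  # Walk back through the table to recover one such subsequence.
  result = +''
  i = a.size
  j = b.size
  while i > 0 && j > 0
    if a[i - 1] == b[j - 1]
      result.prepend(a[i - 1])
      i -= 1
      j -= 1
    elsif table[i - 1][j] >= table[i][j - 1]
      i -= 1
    else
      j -= 1
    end
  end
  result
end

puts longest_common_subsequence('ATTATTGTT', 'ATTATTCTT') # => ATTATTTT
```

The two inputs differ only in one position (G versus C), so the common subsequence skips that single position.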
|
8658
|
+
|
8659
|
+
### Do not create directories on startup of the shell
|
8660
|
+
![alt text][cat1]
|
8661
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8662
|
+
|
8663
|
+
By default the <b>bioshell</b> will try to create some directories
|
8664
|
+
on startup. This may not always be desired by the user, though,
|
8665
|
+
so an option has to exist to <b>disable</b> this functionality.
|
8666
|
+
|
8667
|
+
Internally the variable @internal_hash[:create_directories_on_startup_of_the_shell]
|
8668
|
+
keeps track of whether directories on startup of the shell will
|
8669
|
+
be created.
|
8670
|
+
|
8671
|
+
To disable this behaviour on startup of the bioshell, try
|
8672
|
+
something like this:
|
8673
|
+
|
8674
|
+
bioshell --do-not-create-directories-on-startup
|
8675
|
+
bioshell --do-not-create-directories
|
8676
|
+
|
8677
|
+
### Generating and assigning a random amount of nucleotides
|
8678
|
+
![alt text][cat1]
|
8679
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8680
|
+
|
8681
|
+
Via:
|
8682
|
+
|
8683
|
+
random 555
|
8684
|
+
|
8685
|
+
you can "generate" 555 random nucleotides (DNA that is) and
|
8686
|
+
assign it to the main sequence in use by the bioshell. This
|
8687
|
+
is mostly a convenience feature, if you want to debug something
|
8688
|
+
quickly.
|
8689
|
+
|
8690
|
+
### Determining the log directory for the Bioroebe::Shell component
|
8691
|
+
![alt text][cat1]
|
8692
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8693
|
+
|
8694
|
+
Via:
|
8695
|
+
|
8696
|
+
bioshell_log_dir?
|
8697
|
+
|
8698
|
+
you can determine the log-directory output for the bioshell
|
8699
|
+
component. On my home system this will default to
|
8700
|
+
<b>/home/Temp/bioroebe/bioshell/</b>.
|
8701
|
+
|
8702
|
+
### Prompt (the shell prompt of the bioshell)
|
8703
|
+
![alt text][cat1]
|
8704
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8705
|
+
|
8706
|
+
You can set a <b>custom prompt</b> in the bioshell, via
|
8707
|
+
the keywords "<b>prompt</b>" or "<b>set_prompt</b>".
|
8708
|
+
|
8709
|
+
To display the <b>current working directory</b>, do:
|
8710
|
+
|
8711
|
+
prompt pwd
|
8712
|
+
|
8713
|
+
To revert to the old default again, do this:
|
8714
|
+
|
8715
|
+
prompt REVERT
|
8716
|
+
prompt revert
|
8717
|
+
prompt DEFAULT
|
8718
|
+
prompt default
|
8719
|
+
|
8720
|
+
If you do not want to set any prompt, do:
|
8721
|
+
|
8722
|
+
prompt none
|
8723
|
+
|
8724
|
+
### Random stuff - generating random DNA sequences in the bioshell
|
8725
|
+
![alt text][cat1]
|
8726
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8727
|
+
|
8728
|
+
You can <b>generate random DNA sequences</b> in the
|
8729
|
+
<b>bioshell</b> via:
|
8730
|
+
|
8731
|
+
random dna 20
|
8732
|
+
random dna 25
|
8733
|
+
random dna 30
|
8734
|
+
# or simpler
|
8735
|
+
random 20
|
8736
|
+
random 25
|
8737
|
+
random 30
|
8738
|
+
|
8739
|
+
This will generate random DNA sequences, with a length
|
8740
|
+
of 20, 25, 30, respectively. This may not be very useful
|
8741
|
+
but it was important that this functionality is made
|
8742
|
+
available somewhere. Sometimes you may not even care
|
8743
|
+
about the sequence and just use a "filler" sequence,
|
8744
|
+
so randomness has to be part of the Bioroebe project
|
8745
|
+
as well.
|
8746
|
+
|
8747
|
+
You can also use some toplevel-methods to generate, e. g.
|
8748
|
+
20 random aminoacids. Have a look at the following
|
8749
|
+
<b>toplevel API</b>:
|
8750
|
+
|
8751
|
+
Bioroebe.random_aminoacid? 20 # => "UAVHYQQESWUYAOVESEIY"
|
8752
|
+
|
8753
|
+
Note that there may exist other APIs within the Bioroebe project
|
8754
|
+
that do the same as well.
|
8755
|
+
|
8756
|
+
If you would like to use a ruby-gtk3 widget have a look
|
8757
|
+
at **RandomSequence**, under **bioroebe/gtk3/random_sequence/**.
|
8758
|
+
It works with aminoacids, DNA and RNA, and allows the user to
|
8759
|
+
create random sequences. (If you need weighted randomness then
|
8760
|
+
you currently have to use the commandline variant. Perhaps I may
|
8761
|
+
add support into the GUI directly for this one day.)
|
8762
|
+
|
8763
|
+
### Deprecations within the Bioroebe::Shell
|
8764
|
+
![alt text][cat1]
|
8765
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8766
|
+
|
8767
|
+
Over the years the Bioroebe::Shell changed quite a bit.
|
8768
|
+
|
8769
|
+
This subsection here will list a few of these changes
|
8770
|
+
or rather, the deprecations.
|
8771
|
+
|
8772
|
+
**raw_sequence**: removed in June 2022 completely. It is
|
8773
|
+
simpler to handle sequences via Bioroebe::Sequence
|
8774
|
+
instead.
|
8775
|
+
|
8776
|
+
<b>@internal_hash[:array_sequences]</b> was no longer in
|
8777
|
+
use, so it was removed in July 2022.
|
8778
|
+
|
8779
|
+
### Chop off nucleotides within the Bioroebe::Shell
|
8780
|
+
![alt text][cat1]
|
8781
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8782
|
+
|
8783
|
+
You can use the following syntax to chop away until you find
|
8784
|
+
a particular substring, in the bioshell:
|
8785
|
+
|
8786
|
+
chop_to ATG
|
8787
|
+
|
8788
|
+
This functionality was specifically added to find the first
|
8789
|
+
ATG codon.
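
The idea behind this can be sketched in a few lines of plain Ruby. This is only an illustration under the assumption that the matched substring itself is kept; the helper name `chop_to` below is a stand-alone function for this example, not the shell's internal code.

```ruby
# Hypothetical helper: drop everything before the first occurrence of
# +substring+; return the sequence unchanged if it is not found.
def chop_to(sequence, substring)
  position = sequence.index(substring)
  position ? sequence[position..-1] : sequence
end

puts chop_to('CCGTATGGCATAA', 'ATG') # => ATGGCATAA
```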
|
8790
|
+
|
8791
|
+
### Truncating output in the bioroebe-shell
|
8792
|
+
![alt text][cat1]
|
8793
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8794
|
+
|
8795
|
+
**DNA/RNA sequences** can become very long and then become
|
8796
|
+
quite difficult to view, read and handle on the commandline.
|
8797
|
+
|
8798
|
+
Normally the bioroebe shell will truncate output of DNA sequences
|
8799
|
+
that are "too long". This is mostly done so that working with
|
8800
|
+
very long sequences becomes a bit more convenient.
|
8801
|
+
|
8802
|
+
Sometimes this can become an antifeature, though, so the user
|
8803
|
+
must be able to toggle this at his or her own discretion.
|
8804
|
+
|
8805
|
+
By default, the bioroebe-shell (bioshell) will always try
|
8806
|
+
to truncate output, but you can toggle this behaviour by
|
8807
|
+
issuing:
|
8808
|
+
|
8809
|
+
do not truncate
|
8810
|
+
|
8811
|
+
In theory, other "do not" actions are also supported, or will
|
8812
|
+
be supported in the future; right now (Oct 2019) this is a bit
|
8813
|
+
limited.
|
8814
|
+
|
8815
|
+
From the toplevel, you can use this method:
|
8816
|
+
|
8817
|
+
Bioroebe.do_not_truncate
|
8818
|
+
|
8819
|
+
The above instruction will toggle the truncate behaviour
|
8820
|
+
to not truncate, ever.
|
8821
|
+
|
8822
|
+
If you need to do so within the bioshell, this is the way:
|
8823
|
+
|
8824
|
+
no_truncate
|
8825
|
+
|
8826
|
+
Or simply
|
8827
|
+
|
8828
|
+
truncate
|
8829
|
+
|
8830
|
+
This will toggle, like a switch.
|
8831
|
+
|
8832
|
+
### Working with .pdb files in the bioshell
|
8833
|
+
![alt text][cat1]
|
8834
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8835
|
+
|
8836
|
+
This subsection only very briefly mentions how to work with
|
8837
|
+
.pdb files in the bioshell. See other parts of this
|
8838
|
+
document for a more extensive overview how you can work
|
8839
|
+
with .pdb files via the Bioroebe project.
|
8840
|
+
|
8841
|
+
If you input something that ends with .pdb, like this:
|
8842
|
+
|
8843
|
+
1fat.pdb
|
8844
|
+
|
8845
|
+
And if no such file currently exists at
|
8846
|
+
/home/Temp/bioroebe/pdb/1fat.pdb then it will be
|
8847
|
+
downloaded and moved to
|
8848
|
+
**/home/Temp/bioroebe/pdb/**.
|
8849
|
+
|
8850
|
+
This feature exists just to simplify using the
|
8851
|
+
**bioshell**.
|
8852
|
+
|
8853
|
+
### Showing the stop codons in frame1, frame2 and frame3 in the bioshell
|
8854
|
+
![alt text][cat1]
|
8855
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8856
|
+
|
8857
|
+
When you have a given sequence assigned to the bioshell, such
|
8858
|
+
as via "random 99", you can then show all stop codons in
|
8859
|
+
frame1, frame2 and frame3.
|
8860
|
+
|
8861
|
+
The corresponding input for this will be:
|
8862
|
+
|
8863
|
+
stop_frame1?
|
8864
|
+
stop_frame2?
|
8865
|
+
stop_frame3?
|
8866
|
+
|
8867
|
+
The next image shows this, where we first entered "random 120",
|
8868
|
+
before issuing the above-mentioned instructions one after
|
8869
|
+
the other:
|
8870
|
+
|
8871
|
+
<img src="https://i.imgur.com/HpHF4jq.png" style="margin: 1em; border: 1px solid black">
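
If you are curious how such a per-frame search could look in plain Ruby, here is a minimal sketch. It assumes the three standard stop codons (TAA, TAG, TGA) and reports 0-based positions; `stop_codon_positions` is a hypothetical helper, not the method the bioshell uses internally.

```ruby
STOP_CODONS = %w[TAA TAG TGA].freeze

# Hypothetical helper: 0-based start positions of stop codons in the
# given reading frame (1, 2 or 3) of a DNA string.
def stop_codon_positions(sequence, frame)
  offset = frame - 1
  positions = []
  # Step through the sequence codon by codon, starting at the frame offset.
  offset.step(sequence.size - 3, 3) do |index|
    positions << index if STOP_CODONS.include?(sequence[index, 3])
  end
  positions
end

dna = 'ATGTAACTGATAG'
(1..3).each do |frame|
  puts "frame#{frame}: #{stop_codon_positions(dna, frame).inspect}"
end
```

For this sample input, frame 1 contains TAA at position 3, frame 2 contains TGA and TAG at positions 7 and 10, and frame 3 contains no stop codon.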
|
8872
|
+
|
8873
|
+
### Freezing the main sequence in the bioshell - and unfreezing it again
|
8874
|
+
![alt text][cat1]
|
8875
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8876
|
+
|
8877
|
+
You can **freeze** the BioShell, meaning that it will no longer
|
8878
|
+
allow for the main sequence to be modified, via the following
|
8879
|
+
command:
|
8880
|
+
|
8881
|
+
freeze
|
8882
|
+
|
8883
|
+
To <b>unfreeze</b> the sequence again, issue:
|
8884
|
+
|
8885
|
+
unfreeze
|
8886
|
+
|
8887
|
+
This functionality has been added because the shell may sometimes be
|
8888
|
+
quite eager to change the main sequence, so we needed a way to
|
8889
|
+
disable any further modifications (until "unfreeze" is issued
|
8890
|
+
that is).
|
8891
|
+
|
8892
|
+
## Support for other programming languages
|
8893
|
+
|
8894
|
+
The main programming language for the bioroebe project is **ruby**.
|
8895
|
+
Ruby, from a language design point of view, is a great programming
|
8896
|
+
language - not necessarily all of ruby, but the subset that I use.
|
8897
|
+
It is very easy to quickly prototype ideas via ruby.
|
8898
|
+
|
8899
|
+
However, ruby is known to **not** be among the fastest programming
|
8900
|
+
languages on this planet; so, it makes sense to use other
|
8901
|
+
languages too from this point of view. Additionally there are some
|
8902
|
+
software stacks in use in **other** programming languages, such as
|
8903
|
+
matplotlib and various more.
|
8904
|
+
|
8905
|
+
Thus, it is important to **support other programming languages** as
|
8906
|
+
well, if there are useful libraries. The bioroebe project, after
|
8907
|
+
all, tries to be **practical**: it focuses on getting things done,
|
8908
|
+
no matter the language.
|
8909
|
+
|
8910
|
+
This means that support for other programming languages can be
|
8911
|
+
found in this project as well, often using system() or similar
|
8912
|
+
functionality to tap into these other programming languages. Do
|
8913
|
+
not be surprised when that happens - the bioroebe project will
|
8914
|
+
also try to act as a **practical glue** towards functionality
|
8915
|
+
enabled via other projects. We want to get things done, no
|
8916
|
+
matter the programming language at hand!
|
8917
|
+
|
8918
|
+
Whenever possible, though, the bioroebe project will try to be
|
8919
|
+
flexible in this regard, so ideally the same solution should
|
8920
|
+
work for many different programming languages.
|
8921
|
+
|
8922
|
+
While Ruby is the primary language for this project, as
|
8923
|
+
of 2021 I will try to officially support **java**, **jruby**
|
8924
|
+
and the **GraalVM**. This is on my TODO list, though - stay
|
8925
|
+
tuned for more updates in this regard. See also the
|
8926
|
+
subsection <b>Support for Python</b>.
|
8927
|
+
|
8928
|
+
## Support for Python
|
8929
|
+
|
8930
|
+
In <b>June 2022</b> I decided to add support for Python to bioroebe.
|
8931
|
+
|
8932
|
+
While people can - and should - easily use <b>biopython</b> instead,
|
8933
|
+
I simply wanted to see how much python-support I can add to
|
8934
|
+
bioroebe. This may lag behind some years compared to biopython,
|
8935
|
+
but I wanted to extend python support as well, so there you go.
|
8936
|
+
It is simply an additional option for the bioroebe project.
|
8937
|
+
<b>Ruby</b> will remain the primary language for the project,
|
8938
|
+
though, at the least for now.
|
8939
|
+
|
8940
|
+
## Bioroebe::ProfilePattern
|
8941
|
+
|
8942
|
+
This class can be used to generate nucleotide sequences that
|
8943
|
+
are not quite "random". For example, to generate sequences
|
8944
|
+
that may "simulate" a TATA box.
|
8945
|
+
|
8946
|
+
The idea for this class is to be extended into allowing
|
8947
|
+
HMMs (Hidden Markov Models) one day.
|
8948
|
+
|
8949
|
+
Usage example:
|
8950
|
+
|
8951
|
+
_ = Bioroebe::ProfilePattern.new(ARGV, :do_not_run_yet)
|
8952
|
+
_.generate_sequence_based_on_this_profile
|
8953
|
+
|
8954
|
+
Such a profile specifies the preferred sequence
|
8955
|
+
letters for each position in a section of DNA. You have to provide
|
8956
|
+
the Hash into the method generate_sequence_based_on_this_profile() -
|
8957
|
+
or you use the default Hash, which is stored in the constant
|
8958
|
+
called **PER_POSITION_HASH**.
|
8959
|
+
|
8960
|
+
That profile should be a Hash, with keys pointing to A, T, C, G
|
8961
|
+
and the values being an Array of likelihoods, each given
|
8962
|
+
as a number, such as 140. These values are also called
|
8963
|
+
**scores**. Each score contains a number for each position
|
8964
|
+
that indicates how likely it is to find the given
|
8965
|
+
nucleotide at that location.
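
To make the idea of such a per-position profile more tangible, here is a small stand-alone sketch. The profile below is made up for illustration (it is not the content of PER_POSITION_HASH), and `sequence_from_profile` is a hypothetical helper rather than the class's real method.

```ruby
# Hypothetical profile: one score per position for each nucleotide.
# Position 0 strongly prefers T, position 1 strongly prefers A, and so on.
PROFILE = {
  'A' => [10, 140, 10, 140],
  'T' => [140, 10, 140, 10],
  'C' => [5, 5, 5, 5],
  'G' => [5, 5, 5, 5]
}.freeze

# Pick one nucleotide per position, with a probability proportional
# to its score at that position. Scores are assumed to be integers
# with a positive sum at every position.
def sequence_from_profile(profile, rng = Random.new)
  length = profile.values.first.size
  Array.new(length) do |position|
    total = profile.values.sum { |scores| scores[position] }
    target = rng.rand(total) # integer in 0...total
    profile.each do |nucleotide, scores|
      target -= scores[position]
      break nucleotide if target < 0
    end
  end.join
end

puts sequence_from_profile(PROFILE)
```

With the scores above most runs will print TATA, which is the kind of "not quite random" output that could simulate a TATA box.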
|
8966
|
+
|
8967
|
+
You can also use this class to generate a random DNA string,
|
8968
|
+
similar to the method called
|
8969
|
+
**Bioroebe.generate_random_dna_sequence()**. The difference
|
8970
|
+
is that class ProfilePattern allows for a bit more fine-tuned
|
8971
|
+
control. The class will likely be extended in the future too.
|
8972
|
+
|
8973
|
+
## Generate DNA via Bioroebe.random_dna
|
8974
|
+
|
8975
|
+
You can "generate" random DNA strings by making use of the
|
8976
|
+
following code:
|
8977
|
+
|
8978
|
+
x = Bioroebe.random_dna 50 # => "AGACATCCGGCTTGGATACCTCATAAGTCATATCAGCATCGTCGGACATT"
|
8979
|
+
|
8980
|
+
As can be seen in the example above, after the #, a String will be
|
8981
|
+
returned representing that nucleotide sequence. In the case above
|
8982
|
+
it'll be 50 nucleotides in length.
|
8983
|
+
|
8984
|
+
The number given to <b>.random_dna()</b> tells the method how many
|
8985
|
+
nucleotides should be generated.
|
8986
|
+
|
8987
|
+
The method accepts a second argument, which should be a Hash.
|
8988
|
+
If it is a hash then the generated DNA will be based on the
|
8989
|
+
**probabilities** given to that Hash.
|
8990
|
+
|
8991
|
+
Let's look at a specific example here:
|
8992
|
+
|
8993
|
+
Bioroebe.random_dna(50, { A: 10, T: 10, C: 10, G: 70}) # => "GGGGTGGGGAGGGTATGCGGAGGAAGGGCGGGAAGGGCGGGGGCTGGGCG"
|
8994
|
+
|
8995
|
+
As you can see, in the Hash defined above, the likelihood for
|
8996
|
+
incorporating a Guanine is much higher than for Adenine
|
8997
|
+
(70 : 10). This will be reflected in the generated DNA
|
8998
|
+
sequence which, as can be seen, contains many more
|
8999
|
+
Guanines than Adenines.
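
How such a weighted draw can work is sketched below in plain Ruby. This is only one possible approach (a "pool" in which every nucleotide appears as often as its weight says), assuming non-negative integer weights; `weighted_random_dna` is a hypothetical helper and not necessarily how Bioroebe.random_dna is implemented.

```ruby
# Hypothetical helper: build a DNA string of +length+ nucleotides where
# each nucleotide is drawn with a probability proportional to its weight.
def weighted_random_dna(length, weights)
  # { A: 10, G: 70 } becomes a pool with 10 times "A" and 70 times "G".
  pool = weights.flat_map { |nucleotide, weight| [nucleotide.to_s] * weight }
  Array.new(length) { pool.sample }.join
end

puts weighted_random_dna(50, { A: 10, T: 10, C: 10, G: 70 })
```

The same pool idea also covers the String variant: a String such as 'ATGGGGGGGG' can be seen as a pool from which characters are drawn at random.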
|
9000
|
+
|
9001
|
+
There is yet a third use case for the above. If you pass a **String**
|
9002
|
+
as the second argument rather than a Hash, then that String will be
|
9003
|
+
used as basis for generating the DNA string at hand.
|
9004
|
+
|
9005
|
+
Again, let's look at a specific example here:
|
9006
|
+
|
9007
|
+
Bioroebe.random_dna(10, 'ATCGATCGGG')
|
9008
|
+
|
9009
|
+
Here we add more G than A, T or C, so the new DNA sequence should
|
9010
|
+
contain proportionally more G as well.
|
9011
|
+
|
9012
|
+
More usage examples in this regard:
|
9013
|
+
|
9014
|
+
Bioroebe.random_dna(20, 'ATGGGGGGGG') # => "TGAGGGGGGGGGTGGGAGGG"
|
9015
|
+
Bioroebe.random_dna(20, 'ATGGGGGGGG') # => "GGTAGGGGGGGGTAGGGGGG"
|
9016
|
+
|
9017
|
+
Note that this is similar to the .randomize() method in the bioruby
|
9018
|
+
project:
|
9019
|
+
|
9020
|
+
hash = {'a'=>1,'c'=>2,'g'=>3,'t'=>4}
|
9021
|
+
puts Bio::Sequence::NA.randomize(hash) # => "ggcttgttac" (for example)
|
9022
|
+
|
9023
|
+
## Generating a random nucleotide sequence based on frequencies
|
9024
|
+
|
9025
|
+
If you ever need to generate a nucleotide sequence based on given frequencies then you can use
|
9026
|
+
the following method:
|
9027
|
+
|
9028
|
+
Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies
|
9029
|
+
Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies 100
|
9030
|
+
Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies 500
|
9031
|
+
|
9032
|
+
## Parsing genbank (.gbk) files
|
9033
|
+
|
9034
|
+
You could use class <b>Bioroebe::GenbankParser</b> to parse .gbk files, at
|
9035
|
+
the least if you want to obtain the raw sequence, in FASTA format.
|
9036
|
+
|
9037
|
+
Example for this:
|
9038
|
+
|
9039
|
+
require 'bioroebe/genbank/genbank_parser.rb'
|
9040
|
+
result = Bioroebe::GenbankParser.new('/home/Temp/bioroebe/ls_orchid.gbk')
|
9041
|
+
result.dataset? # This method call will return the FASTA sequence.
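
For readers who want to see the basic idea without the class, the following stand-alone sketch extracts the sequence of a single record from the ORIGIN block of GenBank-formatted text. The helper name `genbank_to_fasta` and the tiny record are made up for this example; the real parser may work differently.

```ruby
# Hypothetical helper: pull the nucleotide sequence out of the ORIGIN
# block of a single GenBank record and return it in FASTA format.
def genbank_to_fasta(genbank_text, header = 'sequence')
  in_origin = false
  sequence = +''
  genbank_text.each_line do |line|
    if line.start_with?('ORIGIN')
      in_origin = true
    elsif line.start_with?('//')
      break # end of the record
    elsif in_origin
      # Drop the position counter and the spaces between the blocks.
      sequence << line.gsub(/[^a-zA-Z]/, '')
    end
  end
  ">#{header}\n#{sequence.upcase}\n"
end

# A made-up, minimal record, just to have some input for the sketch.
record = <<~GENBANK
  LOCUS       EXAMPLE                   20 bp    DNA     linear   PLN 01-JAN-2000
  ORIGIN
          1 cgtaacaagg tttccgtagg
  //
GENBANK

puts genbank_to_fasta(record, 'EXAMPLE')
```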
|
9042
|
+
|
9043
|
+
Note that this currently (<b>July 2022</b>) only grabs one entry. In
|
9044
|
+
the upcoming rewrite the parser will be able to parse
|
9045
|
+
all entries, and then present them to the user. Stay tuned in this
|
9046
|
+
regard.
|
9047
|
+
|
9048
|
+
## Parsers in general
|
9049
|
+
|
9050
|
+
The bioroebe project will store most parsers in the parsers/ subdirectory
|
9051
|
+
as of <b>July 2022</b>.
|
9052
|
+
|
9053
|
+
Prior to that date different parsers were stored in different subdirectories,
|
9054
|
+
such as the parser for genbank-files being stored in the genbank/
|
9055
|
+
subdirectory. As I found this situation confusing, I settled for
|
9056
|
+
the parsers/ subdirectory as of <b>July 2022</b>.
|
9057
|
+
|
9058
|
+
## Coomassie staining of proteins
|
9059
|
+
|
9060
|
+
Coomassie staining is typically done on proteins, giving them a blue
|
9061
|
+
or blueish colour. <b>Coomassie staining</b> is <b>the most popular
|
9062
|
+
anionic protein dye</b>.
|
9063
|
+
|
9064
|
+
This may look like this:
|
9065
|
+
|
9066
|
+
<img src="https://i.imgur.com/6eUN7HR.png" style="margin: 1em; border: 1px solid black">
|
9067
|
+
|
9068
|
+
This picture shows five different bands. The molecular weight of the
|
9069
|
+
marker can be seen on the very left hand side, in <b>kDa</b>. The
|
9070
|
+
larger fragments can be seen on top, so the farther the band has
|
9071
|
+
moved, the smaller the fragment must be (in kDa). That means that
|
9072
|
+
the larger proteins can be found on top; the smaller proteins on
|
9073
|
+
the bottom.
|
9074
|
+
|
9075
|
+
Some bands are missing, and this gives information - that is
|
9076
|
+
that a particular protein is missing. Probably it was not
|
9077
|
+
synthesized in the given tissue at hand.
|
9078
|
+
|
9079
|
+
The staining for a Coomassie Blue stain is typically done
|
9080
|
+
via G-250, with a 0.5% concentration prepared in
|
9081
|
+
50% methanol and 10% acetic acid. The staining is
|
9082
|
+
usually done for 5 minutes.
|
9083
|
+
|
9084
|
+
Note that the G-250 stain is the dimethyl derivative of
|
9085
|
+
R-250 - the <b>R</b> stands for <b>red</b> or <b>reddish</b>.
|
9086
|
+
Both dyes will bind via electrostatic interaction with <b>protonated
|
9087
|
+
basic amino acids</b>: that is <b>lysine</b>, <b>arginine</b>,
|
9088
|
+
and <b>histidine</b>. They can also bind via hydrophobic
|
9089
|
+
associations to aromatic residues.
|
9090
|
+
|
9091
|
+
Coomassie stains are in principle reversible. They are not
|
9092
|
+
as sensitive as silver staining, but significantly cheaper,
|
9093
|
+
which is one reason why they have become so popular.
|
9094
|
+
|
9095
|
+
Not every protein has all aminoacids, so staining may be difficult.
|
9096
|
+
For instance, the <b>glycomacropeptide</b> is the only known
|
9097
|
+
naturally occurring protein that contains no Phe (Phenylalanine; F).
|
9098
|
+
|
9099
|
+
A protein that lacks lysine, arginine, histidine or aromatic
|
9100
|
+
amino acids may be undetectable via Coomassie staining. However,
|
9101
|
+
this does not seem to be a universal rule; some groups report
|
9102
|
+
that they even managed to stain "unstainable" proteins via
|
9103
|
+
Coomassie staining.
|
9104
|
+
|
9105
|
+
The paper at https://www.jbc.org/article/S0021-9258(17)39198-6/pdf,
|
9106
|
+
titled "Why Does Coomassie Brilliant Blue R Interact Differently
|
9107
|
+
with Different Proteins?" and published in the year 1985, tries
|
9108
|
+
to give some explanations to different groups yielding different
|
9109
|
+
results via Coomassie staining.
|
9110
|
+
|
9111
|
+
They specifically point out that "there is a striking correlation
|
9112
|
+
between intensity of response to Coomassie dyes and the basicity
|
9113
|
+
of a protein which depends on the number of lysine, histidine,
|
9114
|
+
and arginine residues, as well as the NH₂-terminal amino group"
|
9115
|
+
(aka the aminoterminus of the protein at hand). The concluding
|
9116
|
+
remark from that paper is that <b>"Coomassie R Interacts
|
9117
|
+
Differently with Different Proteins"</b>.
|
9118
|
+
|
9119
|
+
On class <b>Bioroebe::Protein</b> you can determine whether
|
9120
|
+
a given protein can be stained via coomassie through the
|
9121
|
+
following method:
|
9122
|
+
|
9123
|
+
.can_be_stained_via_coomassie?
|
9124
|
+
|
9125
|
+
This isn't an ideal check, so don't rely on it. It will simply
|
9126
|
+
check whether the sequence has at the least one lysine,
|
9127
|
+
or one histidine, or one arginine, or any of the aromatic
|
9128
|
+
amino acids.
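
The check described above can be approximated in a few lines of plain Ruby. This is a sketch only: `stainable_via_coomassie?` is a hypothetical stand-in, and treating F, W and Y as "the aromatic amino acids" is an assumption made for this example.

```ruby
# Residues the check looks for: basic (K, R, H) and, as an assumption
# for this sketch, the aromatic residues F, W and Y.
COOMASSIE_BINDING_RESIDUES = %w[K R H F W Y].freeze

# Hypothetical stand-in: true if at least one such residue is present.
def stainable_via_coomassie?(sequence)
  sequence.upcase.each_char.any? do |aminoacid|
    COOMASSIE_BINDING_RESIDUES.include?(aminoacid)
  end
end

puts stainable_via_coomassie?('MVTDEGAIYF') # contains Y and F
puts stainable_via_coomassie?('GGAGSSPA')   # contains none of the residues
```

As with the real method, a positive answer only says that potential binding partners exist; it does not predict how intense the stain will be.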
|
9129
|
+
|
9130
|
+
## Codon Usage
|
9131
|
+
|
9132
|
+
This **paragraph** deals with some aspects of **codon usage** in different
|
9133
|
+
organisms.
|
9134
|
+
|
9135
|
+
Let us first define the term <b>codon usage</b> so we can base any further
|
9136
|
+
analysis on this definition. In order to do so, we also have to define
|
9137
|
+
what a <b>codon</b> is, so let's start with that actually.
|
9138
|
+
|
9139
|
+
A <span style="color: darkgreen; font-weight: bold">codon</span> is
|
9140
|
+
essentially the basic code used in DNA to denote which particular
|
9141
|
+
**aminoacid** corresponds to these (three) nucleotide base pairs.
|
9142
|
+
A codon is thus <b>a series of three nucleotides</b>, also called
|
9143
|
+
a <b>triplet</b>, such as <b>ATG</b>.
|
9144
|
+
|
9145
|
+
When we use the term <b>base pairs</b>, we refer to **double-stranded DNA**,
|
9146
|
+
abbreviated as <b>dsDNA</b>. The codon is, however, only found
|
9147
|
+
in a single stranded molecule, even within dsDNA. Since some parts of
|
9148
|
+
a **dsDNA** in any given genome give rise to a, more or less, complementary
|
9149
|
+
copy into **mRNA**, the codons that are actually used, are found in the
|
9150
|
+
corresponding mRNA as well, excluding the codon that codes for a stop
|
9151
|
+
signal (a so-called <b>stop codon</b>). (Remember that mRNA differs from
|
9152
|
+
DNA in that there will be Uracil rather than Thymine; otherwise it is
|
9153
|
+
the same, sequence-wise. Of course it uses another sugar (Ribose), but
|
9154
|
+
remember we are here mostly interested in the **information-containing
|
9155
|
+
part**, not the full chemical structure.)
|
9156
|
+
|
9157
|
+
The <b>codon</b> is thus found on the mRNA and since mRNA is mostly
|
9158
|
+
single-stranded, the codon is a component of the mRNA. The two subunits
|
9159
|
+
of the ribosome are assembled on a mRNA, at the least in prokaryotes (or
|
9160
|
+
more accurately, the smaller subunit scans along the mRNA until it
|
9161
|
+
<b>detects</b> a start codon). Mind you, this subsection will not go
|
9162
|
+
into all relevant details, so just keep in mind that the codon is the
|
9163
|
+
part that will eventually be "<i>translated</i>" at the ribosome into
|
9164
|
+
a corresponding aminoacid, excluding stop codons at the end.
|
9165
|
+
|
9166
|
+
Now - different organisms use **different frequencies of codons**.
|
9167
|
+
<b style="color:darkblue">Codon usage</b> thus describes the fact
|
9168
|
+
that many proteins in these different organisms make use of certain
|
9169
|
+
codons with a **substantially higher frequency than other codons**.
|
9170
|
+
We can use statistics to infer this on a global (proteome)
|
9171
|
+
level too.
|
9172
|
+
|
9173
|
+
Remember that the genetic code is **degenerate**, meaning that
|
9174
|
+
you have a few aminoacids that are encoded only by one codon
|
9175
|
+
(<b>Tryptophan</b> and <b>Methionine</b>), whereas the other
|
9176
|
+
aminoacids are encoded by more than one codon - thus, at the
|
9177
|
+
very least two codons. Note that the latter codons, if they
|
9178
|
+
code for the **same** aminoacid, are also called
|
9179
|
+
<b style="font-style: italic">synonymous codons</b>.
|
9180
|
+
|
9181
|
+
This means that if you have any given aminoacid chain, you can have
|
9182
|
+
several different sequences that would yield the very same
|
9183
|
+
amino acid chain (and codons in these sequences, which
|
9184
|
+
ultimately means that you can have different DNA sequences
|
9185
|
+
code for the very same aminoacid chain).
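
A tiny stand-alone sketch may help to illustrate synonymous codons. The table below is only a small excerpt of the standard genetic code (DNA notation), and `translate` is a hypothetical helper for this example, not a Bioroebe method.

```ruby
# Small excerpt of the standard genetic code (DNA codons); not complete.
CODON_TABLE = {
  'ATG' => 'M',
  'GGC' => 'G', 'GGG' => 'G', 'GGA' => 'G', 'GGT' => 'G',
  'CCC' => 'P', 'CCG' => 'P'
}.freeze

# Hypothetical helper: translate a DNA string codon by codon.
# Raises KeyError for codons that are missing from the excerpt above.
def translate(dna)
  dna.scan(/.{3}/).map { |codon| CODON_TABLE.fetch(codon) }.join
end

puts translate('ATGGGCCCC') # => MGP
puts translate('ATGGGGCCG') # => MGP
```

Both DNA strings differ in the third base of two codons, yet they yield the same aminoacid chain.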
|
9186
|
+
|
9187
|
+
Usually the third base of a codon has the least influence on
|
9188
|
+
codon meaning. This is also called <b>wobbling</b> - since
|
9189
|
+
the anticodon loop on the tRNA is in the reverse direction,
|
9190
|
+
and the wobble position refers to the tRNA, this means that
|
9191
|
+
the wobble-position is at the 5'-end of the tRNA anticodon.
|
9192
|
+
|
9193
|
+
Now a few words about functionality related to codons and codon
|
9194
|
+
usage in the Bioroebe project.
|
9195
|
+
|
9196
|
+
Say that you have a long DNA sequence; let's pick a sample
|
9197
|
+
for now, such as:
|
9198
|
+
|
9199
|
+
ATGGGCGGGGTGATGGCAATGATGCCCCCGATGATG
|
9200
|
+
|
9201
|
+
You can analyze the codons used via class **ShowCodonUsage**
|
9202
|
+
and the corresponding entry at <b>bin/show_codon_usage</b>:
|
9203
|
+
|
9204
|
+
show_codon_usage ATGGGCGGGGTGATGGCAATGATGCCCCCGATGATG
|
9205
|
+
|
9206
|
+
This class can be found at <b>bioroebe/codons/show_codon_usage.rb</b>.
|
9207
|
+
It will report the top 5 codons in use and also output the
|
9208
|
+
frequency hash on the commandline.
|
9209
|
+
|
9210
|
+
On my computer at home the output it yields via the commandline,
|
9211
|
+
on a KDE konsole terminal, looks like this:
|
9212
|
+
|
9213
|
+
<img src="https://i.imgur.com/h55Thdu.png" style="margin: 1em; border: 3px solid black">
|
9214
|
+
|
9215
|
+
You can use this from within ruby code too, via the following
|
9216
|
+
toplevel method:
|
9217
|
+
|
9218
|
+
Bioroebe.codon_frequencies_of_this_sequence(ARGV)
|
9219
|
+
|
9220
|
+
To get the hash of the codon frequencies you can use the .hash? method:
|
9221
|
+
|
9222
|
+
hash = Bioroebe.codon_frequencies_of_this_sequence('ATGGGCGGGGTGATGGCAATGATGCCCCCGATGATG').hash?
|
9223
|
+
|
9224
|
+
If you want to look at the actual codon frequencies used
|
9225
|
+
by different organisms, have a look here:
|
9226
|
+
|
9227
|
+
http://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=11076&aa=9&style=N
|
9228
|
+
|
9229
|
+
This is an excellent resource.
|
9230
|
+
|
9231
|
+
For instance, the <i>E. coli</i> K-12 strain can be found here:
|
9232
|
+
|
9233
|
+
https://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=83333&aa=9&style=N
|
9234
|
+
|
9235
|
+
## Determining the frequencies of aminoacids in a given aminoacid (protein) sequence
|
9236
|
+
|
9237
|
+
If you quickly wish to determine the aminoacid composition, as a
|
9238
|
+
Hash, you can use **bin/aminoacid_frequencies**.
|
9239
|
+
|
9240
|
+
Example from the commandline for this:
|
9241
|
+
|
9242
|
+
aminoacid_frequencies MVTDEGAIYFTKDAARNWKAAVEETVSATLNRTVSSGITGASYYTGTFST
|
9243
|
+
|
9244
|
+
Example from within bioroebe itself (and thus ruby):
|
9245
|
+
|
9246
|
+
require 'bioroebe/frequencies.rb'
|
9247
|
+
|
9248
|
+
Bioroebe.aminoacid_frequencies('MVTDEGAIYFTKDAARNWKAAVEETVSATLNRTVSSGITGASYYTGTFST')
|
9249
|
+
|
9250
|
+
The latter will return a Hash that you can then make further use of, such as:
|
9251
|
+
|
9252
|
+
{"M"=>1, "V"=>4, "T"=>9, "D"=>2, "E"=>3, "G"=>4, "A"=>7, "I"=>2, "Y"=>3, "F"=>2, "K"=>2, "R"=>2, "N"=>2, "W"=>1, "S"=>5, "L"=>1}
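
If you just want to see how such a Hash can be obtained without the gem, the counting itself is a one-liner in plain Ruby. The helper name `aminoacid_counts` is hypothetical and only used for this sketch.

```ruby
# Hypothetical helper: count how often each one-letter aminoacid occurs.
def aminoacid_counts(sequence)
  sequence.each_char.with_object(Hash.new(0)) do |aminoacid, counts|
    counts[aminoacid] += 1
  end
end

p aminoacid_counts('MVTDEGAIYFTKDAARNWKAAVEETVSATLNRTVSSGITGASYYTGTFST')
```

For the sequence above this yields the same counts as shown in the Hash before, for example 9 for T and 7 for A.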
|
9253
|
+
|
9254
|
+
## Determining the codon frequencies from the commandline
|
9255
|
+
|
9256
|
+
In <b>April 2022</b> I noticed that one use case is to show the
|
9257
|
+
codon frequencies of a given sequence - typically a nucleotide sequence.
|
9258
|
+
|
9259
|
+
For aminoacids there already was an executable, at **bin/aminoacid_frequencies**.
|
9260
|
+
|
9261
|
+
So, following that logic, a new executable was added at
|
9262
|
+
**bin/codon_frequency**. This will show the Hash of the codon
|
9263
|
+
frequencies, as a String, on the commandline.
|
9264
|
+
|
9265
|
+
Usage example:
|
9266
|
+
|
9267
|
+
codon_frequency ATTCGTACGATCGACTGACTGACAGTCATTCGT
|
9268
|
+
|
9269
|
+
The output of this would be the following:
|
9270
|
+
|
9271
|
+
AUU: 2
|
9272
|
+
CGU: 2
|
9273
|
+
ACG: 1
|
9274
|
+
AUC: 1
|
9275
|
+
GAC: 1
|
9276
|
+
UGA: 1
|
9277
|
+
CUG: 1
|
9278
|
+
ACA: 1
|
9279
|
+
GUC: 1
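
The counting behind this output can be reproduced with a short plain-Ruby sketch. It assumes that the sequence is read in frame 1, that T is shown as U, and that an incomplete trailing codon is ignored; `codon_counts` is a hypothetical helper, not the code behind bin/codon_frequency.

```ruby
# Hypothetical helper: count the codons of a DNA string in frame 1,
# reported in RNA notation (T becomes U).
def codon_counts(dna)
  rna = dna.upcase.tr('T', 'U')
  rna.scan(/.{3}/).each_with_object(Hash.new(0)) do |codon, counts|
    counts[codon] += 1
  end
end

codon_counts('ATTCGTACGATCGACTGACTGACAGTCATTCGT')
  .sort_by { |_codon, count| -count }
  .each { |codon, count| puts "#{codon}: #{count}" }
```

The counts match the output above (AUU and CGU twice, all other codons once); the order of codons with equal counts may differ.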
|
9280
|
+
|
9281
|
+
## Showing the codon frequency via countcodon
|
9282
|
+
|
9283
|
+
The excellent website at https://www.kazusa.or.jp/codon/countcodon.html offers
|
9284
|
+
a rather useful functionality via a simple web-interface, in that you can pass
|
9285
|
+
in a mRNA sequence, and it will then show the codon frequency/likelihood of
|
9286
|
+
that sequence - all codons in that sequence, that is. This can be extended
|
9287
|
+
to <b>all protein-coding genes in a given genome</b>, and will thus be
|
9288
|
+
useful for a researcher who may be interested in determining the codon
|
9289
|
+
frequency in general, across all genes in that given genome.
|
9290
|
+
|
9291
|
+
You can test it with an input sequence.
|
9292
|
+
|
9293
|
+
For instance, the following sequence:
|
9294
|
+
|
9295
|
+
ATTCGTACGATCGACTGACTGACAGTCATTCGTAGTACGATCGACTGACTGACAGTCATTCGTACGATCGACTGACTGACAAGTCATTCGTACGATCGACTGACTTGACAGTCATAA
|
9296
|
+
|
9297
|
+
Would yield this result:
|
9298
|
+
|
9299
|
+
fields: [triplet] [frequency: per thousand] ([number])
|
9300
|
+
|
9301
|
+
UUU 0.0( 0) UCU 0.0( 0) UAU 0.0( 0) UGU 0.0( 0)
|
9302
|
+
UUC 0.0( 0) UCC 0.0( 0) UAC 25.6( 1) UGC 0.0( 0)
|
9303
|
+
UUA 0.0( 0) UCA 25.6( 1) UAA 25.6( 1) UGA102.6( 4)
|
9304
|
+
UUG 0.0( 0) UCG 25.6( 1) UAG 0.0( 0) UGG 0.0( 0)
|
9305
|
+
|
9306
|
+
CUU 0.0( 0) CCU 0.0( 0) CAU 25.6( 1) CGU 76.9( 3)
|
9307
|
+
CUC 0.0( 0) CCC 0.0( 0) CAC 0.0( 0) CGC 0.0( 0)
|
9308
|
+
CUA 0.0( 0) CCA 0.0( 0) CAA 0.0( 0) CGA 25.6( 1)
|
9309
|
+
CUG102.6( 4) CCG 0.0( 0) CAG 25.6( 1) CGG 0.0( 0)
|
9310
|
+
|
9311
|
+
AUU 76.9( 3) ACU 25.6( 1) AAU 0.0( 0) AGU 51.3( 2)
|
9312
|
+
AUC 76.9( 3) ACC 0.0( 0) AAC 0.0( 0) AGC 0.0( 0)
|
9313
|
+
AUA 0.0( 0) ACA 76.9( 3) AAA 0.0( 0) AGA 0.0( 0)
|
9314
|
+
AUG 0.0( 0) ACG 76.9( 3) AAG 0.0( 0) AGG 0.0( 0)
|
9315
|
+
|
9316
|
+
GUU 0.0( 0) GCU 0.0( 0) GAU 25.6( 1) GGU 0.0( 0)
|
9317
|
+
GUC 51.3( 2) GCC 0.0( 0) GAC 76.9( 3) GGC 0.0( 0)
|
9318
|
+
GUA 0.0( 0) GCA 0.0( 0) GAA 0.0( 0) GGA 0.0( 0)
|
9319
|
+
GUG 0.0( 0) GCG 0.0( 0) GAG 0.0( 0) GGG 0.0( 0)
|
9320
|
+
|
9321
|
+
At any rate, the individual functionality for that is also available
|
9322
|
+
within the Bioroebe project as of **April 2022**.
|
9323
|
+
|
9324
|
+
The method that does so is:
|
9325
|
+
|
9326
|
+
Bioroebe.frequency_per_thousand
|
9327
|
+
Bioroebe.frequency_per_thousand('ATTCGTACGATCGACTGACTGACAGTCATTCGTAGTACGATCGACTGACTGACAGTCATTCGTACGATCGACTGACTGACAAGTCATTCGTACGATCGACTGACTTGACAGTCATAA') # Usage example here.
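
The "per thousand" value is simply the count of a codon divided by the total number of codons, multiplied by 1000. In the table above there are 39 codons, so a codon that occurs 4 times is listed as 4 / 39 * 1000, which is about 102.6. A minimal sketch of that arithmetic follows; the method name `per_thousand` is hypothetical and this is not how Bioroebe.frequency_per_thousand is implemented internally.

```ruby
# Hypothetical helper, for illustration only (not part of Bioroebe).
# Returns a Hash of codon => frequency per thousand codons.
def per_thousand(dna)
  codons = dna.upcase.gsub(/\s+/, '').tr('T', 'U').scan(/.{3}/)
  total  = codons.size.to_f
  codons.tally.transform_values { |n| (n * 1000.0 / total).round(1) }
end

table = per_thousand('ATTCGTACGATCGACTGACTGACAGTCATTCGT')
table.each { |codon, freq| puts format('%s %6.1f', codon, freq) }
```

The short sequence used here has 11 codons, so a codon seen twice is reported as 181.8 and a codon seen once as 90.9.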
|
9328
|
+
|
9329
|
+
Sinatra bindings for this functionality have existed since July 2022,
|
9330
|
+
but they are not very well-polished. Ruby-gtk3 bindings may be
|
9331
|
+
added at a later time, and possibly ruby-libui bindings as well, for
|
9332
|
+
Windows support. What is missing is support for different codon tables in
|
9333
|
+
different species, but that may be added at a later time as well - for now
|
9334
|
+
it seemed more important to offer the functionality.
|
9335
|
+
|
9336
|
+
## Working with PDB files (.pdb)
|
9337
|
+
|
9338
|
+
The **PDB**, founded in the year **1971**, holds lots of **atomic
|
9339
|
+
structures of proteins**.
|
9340
|
+
|
9341
|
+
For instance, in **July 2016** it contained **121000 structures**.
|
9342
|
+
|
9343
|
+
In **February 2018** it contained **~124000 structures**
|
9344
|
+
(from X-ray crystallography), and **~12000 NMR
|
9345
|
+
structures**. <b>NMR</b> is limited to about <b>350 amino
|
9346
|
+
acids maximum length</b>, give or take.
|
9347
|
+
|
9348
|
+
In **April 2020** the PDB contained **163141 structures**.
|
9349
|
+
|
9350
|
+
We can see that more and more structures are available nowadays -
|
9351
|
+
a trend that will most likely continue or even accelerate.
|
9352
|
+
(Let's hope the quality also remains high.)
|
9353
|
+
|
9354
|
+
A typical .pdb file contains entries such as this:
|
9355
|
+
|
9356
|
+
RTyp Num Atm Res Ch ResN X Y Z Occ Temp PDB Line
|
9357
|
+
ATOM 1 N ASP L 1 4.060 7.307 5.186 1.00 51.58 1FDL 93
|
9358
|
+
ATOM 2 CA ASP L 1 4.042 7.776 6.553 1.00 48.05 1FDL 94
|
9359
|
+
ATOM 3 N VAL A 25 32.433 16.336 57.540 1.00 11.92 A1 N
|
9360
|
+
ATOM 4 CA VAL A 25 31.132 16.439 58.160 1.00 11.85 A1 C
|
9361
|
+
ATOM 5 C VAL A 25 30.447 15.105 58.363 1.00 12.34 A1 C
|
9362
|
+
|
9363
|
+
(Note the first line: the one starting with **RTyp** is just an explanation for the ATOM
|
9364
|
+
entries below that line).
|
9365
|
+
|
9366
|
+
The sequence starts from the N-terminal residue for proteins; see
|
9367
|
+
the <b>Atm</b> entry at <b>Num 1</b>.
|
9368
|
+
|
9369
|
+
The **meaning of these entries** is as follows:
|
9370
|
+
|
9371
|
+
1) RTyp: Record Type
|
9372
|
+
2) Num: Serial number of the atom. Each atom has a unique serial number.
|
9373
|
+
3) Atm: Atom name (in IUPAC format).
|
9374
|
+
4) Res: Residue name (IUPAC format).
|
9375
|
+
5) Ch: Chain to which the atom belongs (in this case, L for light chain of an antibody).
|
9376
|
+
6) ResN: Residue sequence number. This will be incremental, e.g. 1, 2, 3, 4 and so forth.
|
9377
|
+
7,8,9) X, Y, Z: Cartesian coordinates specifying atomic position in space.
|
9378
|
+
10) Occ: Occupancy factor
|
9379
|
+
11) Temp: Temperature factor (atoms disordered in the crystal have high
|
9380
|
+
temperature factors; they are "wobbly" with a high factor.
|
9381
|
+
This is also called the B-factor).
|
9382
|
+
12) PDB: The PDB data file unique identifier.
|
9383
|
+
13) Line: Line (record) number in the data file.
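
As a rough illustration of the fields listed above, the following sketch splits one of the simplified ATOM lines on whitespace. Keep in mind that real .pdb files use fixed column positions, where neighbouring fields can touch without a space, so splitting on whitespace is an assumption that only holds for the simplified listing shown here. The names `AtomRecord` and `parse_atom_line` are made up for this example and are not the API of Bioroebe::ParsePdbFile.

```ruby
# Hypothetical names, for illustration only (not part of Bioroebe).
AtomRecord = Struct.new(:record_type, :serial, :atom, :residue, :chain,
                        :residue_number, :x, :y, :z, :occupancy, :b_factor)

# Parses one whitespace-separated ATOM line as shown in the listing above.
# Returns nil for any other record type or for lines with too few fields.
def parse_atom_line(line)
  f = line.split
  return nil unless f.first == 'ATOM' && f.size >= 11
  AtomRecord.new(f[0], f[1].to_i, f[2], f[3], f[4], f[5].to_i,
                 f[6].to_f, f[7].to_f, f[8].to_f, f[9].to_f, f[10].to_f)
end

atom = parse_atom_line('ATOM 1 N ASP L 1 4.060 7.307 5.186 1.00 51.58 1FDL 93')
puts "#{atom.residue} #{atom.residue_number}, atom #{atom.atom}, chain #{atom.chain}"
puts "position: #{atom.x}, #{atom.y}, #{atom.z} (B-factor #{atom.b_factor})"
```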
|
9384
|
+
|
9385
|
+
Typically the rightmost entry, the last one, specifies
|
9386
|
+
which atom it is. An **H** stands for a hydrogen atom; the other atoms
|
9387
|
+
are "heavy" atoms (heavier than hydrogen most definitely).
|
9388
|
+
|
9389
|
+
Most .pdb files will contain **SEQRES** entries. These entries will list
|
9390
|
+
the primary sequence of the polymeric molecules present in the entry.
|
9391
|
+
You can notice this by looking at the standard 3-character code
|
9392
|
+
used by SEQRES here, for the canonical amino acids. So, for instance,
|
9393
|
+
the amino acids that will be mentioned in a SEQRES entry are
|
9394
|
+
ALA, CYS, ASP, GLU, PHE, GLY, HIS, ILE, LYS, LEU, MET, ASN,
|
9395
|
+
PRO, GLN, ARG, SER, THR, VAL, TRP and TYR. You can use the
|
9396
|
+
method **Bioroebe.three_to_one()** to convert back to the
|
9397
|
+
one-letter code, as follows:
|
9398
|
+
|
9399
|
+
Bioroebe.three_to_one('PHE') # => "F"
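
To show the idea behind such a conversion, here is a small lookup table covering the twenty canonical amino acids listed above. This is a sketch only; the constant `THREE_TO_ONE` and the method `residues_to_one_letter` are hypothetical names and not part of Bioroebe, and mapping unknown residues to 'X' is an assumption made for this example.

```ruby
# Hypothetical names, for illustration only (not part of Bioroebe).
THREE_TO_ONE = {
  'ALA' => 'A', 'CYS' => 'C', 'ASP' => 'D', 'GLU' => 'E', 'PHE' => 'F',
  'GLY' => 'G', 'HIS' => 'H', 'ILE' => 'I', 'LYS' => 'K', 'LEU' => 'L',
  'MET' => 'M', 'ASN' => 'N', 'PRO' => 'P', 'GLN' => 'Q', 'ARG' => 'R',
  'SER' => 'S', 'THR' => 'T', 'VAL' => 'V', 'TRP' => 'W', 'TYR' => 'Y'
}.freeze

# Converts an Array of three-letter residue names into a one-letter String.
# Residues that are not in the table are reported as 'X'.
def residues_to_one_letter(residues)
  residues.map { |name| THREE_TO_ONE.fetch(name.upcase, 'X') }.join
end

puts residues_to_one_letter(%w[MET LEU SER ASP GLU]) # prints "MLSDE"
```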
|
9400
|
+
|
9401
|
+
The data in a .pdb file need not necessarily only be a protein, with
|
9402
|
+
a specific aminoacid sequence. It may also include DNA. An example
|
9403
|
+
for such a molecule is
|
9404
|
+
<b><a href="http://rcsb.org/pdb/explore/explore.do?structureId=2DGC">2dgc</a></b>,
|
9405
|
+
which includes a protein chain and a DNA chain.
|
9406
|
+
|
9407
|
+
As far as the **bioroebe project** is concerned, you can parse .pdb files
|
9408
|
+
via the following class:
|
9409
|
+
|
9410
|
+
Bioroebe::ParsePdbFile.new
|
9411
|
+
Bioroebe::ParsePdbFile.new(path_to_the_pdb_file_here)
|
9412
|
+
Bioroebe::ParsePdbFile.new('/foo/bar/ack.pdb')
|
9413
|
+
|
9414
|
+
This class also allows some shortcuts for integrated .pdb files,
|
9415
|
+
that is files that are bundled with the bioroebe project:
|
9416
|
+
|
9417
|
+
Bioroebe::ParsePdbFile.new ':1fat'
|
9418
|
+
|
9419
|
+
This requires a String because ruby symbols may not start with
|
9420
|
+
a number. Note that this also works through the commandline,
|
9421
|
+
such as:
|
9422
|
+
|
9423
|
+
parse_pdb_file :1fat
|
9424
|
+
|
9425
|
+
A shell such as bash does not understand ruby symbols, so instead
|
9426
|
+
a string will be passed in, being :1fat. The ParsePdbFile will
|
9427
|
+
handle this correctly internally.
|
9428
|
+
|
9429
|
+
Note that a small bug was fixed in the file parse_pdb_file.rb;
|
9430
|
+
some entries were skipped due to an erroneous loop in the ruby
|
9431
|
+
file. This was corrected in **May 2020**.
|
9432
|
+
|
9433
|
+
In **March 2021** the ability to use entries such as ':1fat'
|
9434
|
+
was removed again; the code remains though. The reason why
|
9435
|
+
this was removed was that the .pdb files are quite large,
|
9436
|
+
so distributing them via the bioroebe project makes no real
|
9437
|
+
sense. Consider simply downloading the .pdb files; you
|
9438
|
+
can do this from the bioshell or via something
|
9439
|
+
like:
|
9440
|
+
|
9441
|
+
pdb 5TIM
|
9442
|
+
|
9443
|
+
Note that you can also return the aminoacid-sequence from a
|
9444
|
+
.pdb file directly, as of **May 2020**.
|
9445
|
+
|
9446
|
+
Example for this:
|
9447
|
+
|
9448
|
+
Bioroebe.return_aminoacid_sequence_from_this_pdb_file "1VII.pdb" # => "MLSDEDFKAVFGMTRSAFANLPLWKQQNLKKEKGLF"
|
9449
|
+
|
9450
|
+
The first argument should be **the path to the (local)
|
9451
|
+
.pdb file at hand**. (In theory support for remote .pdb
|
9452
|
+
files could also be added easily, but right now this
|
9453
|
+
is not possible, so you have to download it first.)
|
9454
|
+
|
9455
|
+
The **specification for .pdb files** can be read at the following
|
9456
|
+
two remote resources:
|
9457
|
+
|
9458
|
+
http://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html
|
9459
|
+
http://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#ATOM
|
9460
|
+
|
9461
|
+
Note that the parse_pdb_file.rb can also do some additional
|
9462
|
+
things, such as calculating the maximum distance between
|
9463
|
+
atoms in that file, via the method
|
9464
|
+
**.try_to_determine_the_max_distance_between_the_atoms_in_this_protein()**.
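
The idea of a maximum distance between atoms can be sketched as a brute-force comparison of all pairs of X, Y, Z coordinates. The helper `max_distance` below is hypothetical and only illustrates the geometry; how parse_pdb_file.rb computes this internally may well differ.

```ruby
# Hypothetical helper, for illustration only (not part of Bioroebe).
# coords is an Array of [x, y, z] triples; returns the largest Euclidean
# distance between any two of them, or nil if there are fewer than two.
def max_distance(coords)
  coords.combination(2).map { |a, b|
    Math.sqrt(a.zip(b).sum { |p, q| (p - q)**2 })
  }.max
end

puts max_distance([[0, 0, 0], [3, 4, 0], [1, 1, 1]]) # prints 5.0
```

Comparing all pairs scales quadratically with the number of atoms, which is fine for a sketch but may be slow for very large structures.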
|
9465
|
+
|
9466
|
+
If you wish to report the secondary structures from a given .pdb file
|
9467
|
+
then you can use the following class:
|
9468
|
+
|
9469
|
+
require 'bioroebe/pdb/report_secondary_structures_from_this_pdb_file.rb'
|
9470
|
+
|
9471
|
+
Bioroebe::ReportSecondaryStructuresFromThisPdbFile.new
|
9472
|
+
Bioroebe::ReportSecondaryStructuresFromThisPdbFile.new('foobar.pdb')
|
9473
|
+
|
9474
|
+
If you wish to obtain the FASTA sequence of a particular remote
|
9475
|
+
.pdb file then you can use this API:
|
9476
|
+
|
9477
|
+
x = Bioroebe.return_fasta_sequence_from_this_pdb_file "2bts" # => the FASTA sequence, as a String
|
9478
|
+
|
9479
|
+
Keep in mind that this is the FASTA sequence; the .pdb file itself
|
9480
|
+
has another format, and contains a lot more information, such as
|
9481
|
+
the various ATOM entries.
|
9482
|
+
|
9483
|
+
As of **June 2020** the command **fetch** also works from
|
9484
|
+
within the Bioshell, similar to how pymol **works**. This allows
|
9485
|
+
us to quickly download a remote .pdb file.
|
9486
|
+
|
9487
|
+
fetch 2BTS
|
9488
|
+
|
9489
|
+
You can also use the following toplevel-API to download a remote
|
9490
|
+
.pdb file:
|
9491
|
+
|
9492
|
+
Bioroebe.download_this_pdb
|
9493
|
+
Bioroebe.download_this_pdb '355D'
|
9494
|
+
Bioroebe.download_this_pdb '1K4R' # This is the Dengue Virus
|
9495
|
+
Bioroebe.download_this_pdb '1fat.pdb' # Lectin Phytohemagglutinin
|
9496
|
+
|
9497
|
+
This will refer to a remote URL such as
|
9498
|
+
https://files.rcsb.org/view/1FAT.pdb.
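
How an input such as '1fat.pdb' or '355D' could be turned into a URL of that form is sketched below. The helper `pdb_url` is hypothetical, and the check for a four-character id that starts with a digit is an assumption based on the ids used in this document; the sketch only builds the URL and does not download anything.

```ruby
# Hypothetical helper, for illustration only (not part of Bioroebe).
# Accepts '1fat', '1FAT' or '1fat.pdb' and returns the remote URL.
def pdb_url(id)
  code = id.to_s.strip.sub(/\.pdb\z/i, '').upcase
  # Assumption: a PDB id is four characters long and starts with a digit.
  unless code.match?(/\A[0-9][A-Z0-9]{3}\z/)
    raise ArgumentError, "not a 4-character PDB id: #{id.inspect}"
  end
  "https://files.rcsb.org/view/#{code}.pdb"
end

puts pdb_url('1fat.pdb') # prints https://files.rcsb.org/view/1FAT.pdb
puts pdb_url('355D')     # prints https://files.rcsb.org/view/355D.pdb
```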
|
9499
|
+
|
9500
|
+
Note that this will be automatically moved to the "correct" default
|
9501
|
+
position in the bioroebe-project, under the **pdb/** subdirectory.
|
9502
|
+
|
9503
|
+
You can also invoke this script from the commandline via
|
9504
|
+
**bin/download_this_pdb**, like this:
|
9505
|
+
|
9506
|
+
download_this_pdb 355D
|
9507
|
+
|
9508
|
+
This works with several .pdb files in one go as well:
|
9509
|
+
|
9510
|
+
download_this_pdb 1NR6 2F9Q 3TDA 2HI4 2V0M
|
9511
|
+
|
9512
|
+
They would all be downloaded one after the other. Be aware that
|
9513
|
+
this will overwrite the old .pdb files at that location, so
|
9514
|
+
if you don't want this, I recommend backing up the
|
9515
|
+
**pdb/** subdirectory before invoking the above call.
|
9516
|
+
|
9517
|
+
You can also turn the FASTA sequence stored in a .pdb file into
|
9518
|
+
a .fasta file, via **--create-fasta-file**.
|
9519
|
+
|
9520
|
+
Usage examples:
|
9521
|
+
|
9522
|
+
parsedb 1NR6 --create-fasta-file
|
9523
|
+
parsedb 2F9Q --create-fasta-file
|
9524
|
+
parsedb 3TDA --create-fasta-file
|
9525
|
+
parsedb 2HI4 --create-fasta-file
|
9526
|
+
parsedb 2V0M --create-fasta-file
|
9527
|
+
|
9528
|
+
So if you have a file called <b>1NR6.pdb</b> and you use
|
9529
|
+
the first input, a .fasta file will be created. If such
|
9530
|
+
a .pdb file does not exist then this will not work, so
|
9531
|
+
make sure to download the .pdb file before invoking
|
9532
|
+
this commandline-flag.
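
For reference, a .fasta file is just a header line starting with ">" followed by the sequence, usually wrapped to a fixed line width. The sketch below shows that layout; the helper `to_fasta` and the default width of 60 are assumptions made for this example, and the exact header and wrapping that --create-fasta-file produces may differ.

```ruby
# Hypothetical helper, for illustration only (not part of Bioroebe).
# Builds the text of a FASTA record, wrapping the sequence at +width+.
def to_fasta(header, sequence, width: 60)
  lines = sequence.scan(/.{1,#{width}}/)
  ([">#{header}"] + lines).join("\n") + "\n"
end

record = to_fasta('1VII', 'MLSDEDFKAVFGMTRSAFANLPLWKQQNLKKEKGLF', width: 20)
puts record
# File.write('1VII.fasta', record) would store it on disk.
```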
|
9533
|
+
|
9534
|
+
Last but not least, the following table shall document the
|
9535
|
+
PDB format - it is not yet complete, but it is intended
|
9536
|
+
to add the remaining datasets eventually:
|
9537
|
+
|
9538
|
+
Record Name Describes
|
9539
|
+
MODRES Modifications to standard residues
|
9540
|
+
HET Nonstandard residues (as well as ligands, ions and water)
|
9541
|
+
HETNAM Full chemical name of the residue
|
9542
|
+
HETSYN Synonyms for the residue
|
9543
|
+
FORMUL Chemical formula of the residue
|
9544
|
+
KEYWDS Keywords, such as "FK506 BINDING PROTEIN, FKBP12, CIS-TRANS PROLYL-ISOMERASE, ROTAMASE"
|
9545
|
+
|
9546
|
+
|
9547
|
+
## Determining how many stop codons exist in a given sequence
|
9548
|
+
|
9549
|
+
You can use **bin/n_stop_codons_in_this_sequence** to determine
|
9550
|
+
how many stop codons exist in a given sequence at hand.
|
9551
|
+
|
9552
|
+
Usage example from the commandline:
|
9553
|
+
|
9554
|
+
n_stop_codons_in_this_sequence ATGACGTACGTCAGTCAGTGATAGTAA # => 4
|
9555
|
+
|
9556
|
+
You can also separate these via a ' ' spacer on the commandline of
|
9557
|
+
course:
|
9558
|
+
|
9559
|
+
n_stop_codons_in_this_sequence ATG ACG TAC GTC AGT CAG TGA TAG TAA # => 4
|
9560
|
+
|
9561
|
+
Internally this makes use of the method called
|
9562
|
+
<b>Bioroebe.n_stop_codons_in_this_sequence?</b> or one of its
|
9563
|
+
aliased names. Usage example for the method, just as in the
|
9564
|
+
first example shown above:
|
9565
|
+
|
9566
|
+
Bioroebe.n_stop_codons_in_this_sequence "ATGACGTACGTCAGTCAGTGATAGTAA" # => 4
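
The result of 4 for this input is what you get when TAA, TAG and TGA are counted at any offset in the sequence; counting only in reading frame 1 would give 3, since the first TGA overlaps the ATG start codon. Which of the two the Bioroebe method does internally is an assumption here; the following sketch, with made-up method names, merely shows both ways of counting.

```ruby
# Hypothetical names, for illustration only (not part of Bioroebe).
STOP_CODONS = %w[TAA TAG TGA].freeze

# Counts stop codons at every offset (all overlapping triplets).
def stop_codons_any_offset(dna)
  dna.upcase.delete(' ').chars.each_cons(3).count { |t| STOP_CODONS.include?(t.join) }
end

# Counts stop codons in reading frame 1 only.
def stop_codons_in_frame(dna)
  dna.upcase.delete(' ').scan(/.{3}/).count { |codon| STOP_CODONS.include?(codon) }
end

sequence = 'ATGACGTACGTCAGTCAGTGATAGTAA'
puts stop_codons_any_offset(sequence) # prints 4
puts stop_codons_in_frame(sequence)   # prints 3
```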
|
9567
|
+
|
9568
|
+
## The Aliphatic Index of Globular Proteins
|
9569
|
+
|
9570
|
+
In a paper from 1980, Atsushi Ikai provided a formula with which one can
|
9571
|
+
calculate the aliphatic index of a globular protein; the paper is
|
9572
|
+
titled "Thermostability and aliphatic index of globular proteins"
|
9573
|
+
(<b>PMID: 7462208</b>,
|
9574
|
+
<a href="https://www.jstage.jst.go.jp/article/biochemistry1922/88/6/88_6_1895/_article">
|
9575
|
+
see here</a>).
|
9576
|
+
|
9577
|
+
Ikai provided a statistical analysis of proteins, and determined
|
9578
|
+
that the aliphatic index - which is defined as the relative volume
|
9579
|
+
of a protein occupied by <b>aliphatic side chains</b> (alanine, valine,
|
9580
|
+
isoleucine, and leucine) - of proteins of thermophilic bacteria
|
9581
|
+
is significantly higher than that of ordinary proteins.
|
9582
|
+
|
9583
|
+
Ikai reasoned that the index may be regarded as a positive
|
9584
|
+
factor for the <b>increase of thermostability of globular
|
9585
|
+
proteins</b>. The enzymes of some organisms are more stable
|
9586
|
+
at higher temperature than the enzymes of other organisms,
|
9587
|
+
in particular among <b>thermostable proteins</b>.
|
9588
|
+
|
9589
|
+
Thus, there is a good correlation between the "aliphatic
|
9590
|
+
index" on the one hand, and the thermostability of proteins
|
9591
|
+
on the other hand.
|
9592
|
+
|
9593
|
+
Ikai gave the following formula for calculating this:
|
9594
|
+
|
9595
|
+
Aliphatic Index = XA + a * XV + b * (XI + XL)
|
9596
|
+
|
9597
|
+
XA, XV, XI and XL are the mole percentages of the four aminoacids
|
9598
|
+
Alanine, Valine, Isoleucine and Leucine. The two coefficients
|
9599
|
+
a and b are the volumes of the valine and of the leucine/isoleucine side chains, relative to the side chain of
|
9600
|
+
Alanine. The coefficient a has a value range of 2.8-3.0 and
|
9601
|
+
b has a value range of 3.8-4.0.
|
9602
|
+
|
9603
|
+
The method called <b>.aliphatic_index()</b> makes use of that
|
9604
|
+
formula. As values for a and b the two values <b>2.9</b> and
|
9605
|
+
<b>3.9</b> have been taken. The code in the bioroebe project
|
9606
|
+
for this has been inspired by: https://github.com/wwood/bioruby-aliphatic_index
|
9607
|
+
|
9608
|
+
That project gives the following usage example for bioruby:
|
9609
|
+
|
9610
|
+
Bio::Sequence::AA.new('MVKSYDRYEYEDCLGIVNSKSSNCVFLNNA').aliphatic_index # => 71.33333
|
9611
|
+
|
9612
|
+
In bioroebe, the equivalent would be:
|
9613
|
+
|
9614
|
+
Bioroebe::Protein.new('MVKSYDRYEYEDCLGIVNSKSSNCVFLNNA').aliphatic_index # => 71.33333
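
To make the formula concrete, here is a standalone sketch that applies it with a = 2.9 and b = 3.9, where each X value is the mole percentage (100 * count / length) of that aminoacid. The free-standing method below is written only for this example and is not the code behind Bioroebe::Protein#aliphatic_index. The example sequence has 30 residues with 1 alanine, 3 valines, 1 isoleucine and 2 leucines, so the result is 3.33 + 2.9 * 10 + 3.9 * 10 = 71.33.

```ruby
# Standalone sketch, for illustration only (not the Bioroebe implementation).
# Aliphatic Index = XA + a * XV + b * (XI + XL), with X in mole percent.
def aliphatic_index(sequence, a: 2.9, b: 3.9)
  seq = sequence.upcase
  return 0.0 if seq.empty?
  percent = ->(aminoacid) { 100.0 * seq.count(aminoacid) / seq.length }
  percent.call('A') + a * percent.call('V') +
    b * (percent.call('I') + percent.call('L'))
end

puts aliphatic_index('MVKSYDRYEYEDCLGIVNSKSSNCVFLNNA').round(5) # prints 71.33333
```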
|
9183
9615
|
|
9184
9616
|
## Possibly useful links in regards to molecular biology and science in general
|
9185
9617
|
|