bioroebe 0.10.80 → 0.11.12
Potentially problematic release.
This version of bioroebe might be problematic.
- checksums.yaml +4 -4
- data/README.md +507 -310
- data/bioroebe.gemspec +3 -3
- data/doc/README.gen +506 -309
- data/doc/todo/bioroebe_todo.md +29 -40
- data/lib/bioroebe/aminoacids/display_aminoacid_table.rb +1 -0
- data/lib/bioroebe/base/colours_for_base/colours_for_base.rb +18 -8
- data/lib/bioroebe/base/commandline_application/commandline_arguments.rb +13 -11
- data/lib/bioroebe/base/commandline_application/misc.rb +18 -8
- data/lib/bioroebe/base/prototype/misc.rb +1 -1
- data/lib/bioroebe/codons/show_codon_tables.rb +6 -2
- data/lib/bioroebe/constants/aminoacids_and_proteins.rb +1 -0
- data/lib/bioroebe/constants/files_and_directories.rb +8 -1
- data/lib/bioroebe/count/count_amount_of_nucleotides.rb +3 -0
- data/lib/bioroebe/gui/gtk3/protein_to_DNA/protein_to_DNA.rb +18 -18
- data/lib/bioroebe/gui/shared_code/protein_to_DNA/protein_to_DNA_module.rb +14 -14
- data/lib/bioroebe/parsers/genbank_parser.rb +353 -24
- data/lib/bioroebe/python/README.md +1 -0
- data/lib/bioroebe/python/__pycache__/mymodule.cpython-39.pyc +0 -0
- data/lib/bioroebe/python/gui/gtk3/widget1.py +22 -0
- data/lib/bioroebe/python/mymodule.py +8 -0
- data/lib/bioroebe/python/protein_to_dna.py +30 -0
- data/lib/bioroebe/python/shell/shell.py +19 -0
- data/lib/bioroebe/python/to_rna.py +14 -0
- data/lib/bioroebe/python/toplevel_methods/to_camelcase.py +11 -0
- data/lib/bioroebe/sequence/nucleotide_module/nucleotide_module.rb +28 -25
- data/lib/bioroebe/sequence/sequence.rb +54 -2
- data/lib/bioroebe/shell/menu.rb +3336 -3304
- data/lib/bioroebe/shell/readline/readline.rb +1 -1
- data/lib/bioroebe/shell/shell.rb +11233 -28
- data/lib/bioroebe/siRNA/siRNA.rb +81 -1
- data/lib/bioroebe/string_matching/find_longest_substring.rb +3 -2
- data/lib/bioroebe/toplevel_methods/aminoacids_and_proteins.rb +31 -24
- data/lib/bioroebe/toplevel_methods/nucleotides.rb +22 -5
- data/lib/bioroebe/toplevel_methods/open_in_browser.rb +2 -0
- data/lib/bioroebe/toplevel_methods/to_camelcase.rb +5 -0
- data/lib/bioroebe/version/version.rb +2 -2
- data/lib/bioroebe/yaml/configuration/browser.yml +1 -1
- data/lib/bioroebe/yaml/restriction_enzymes/restriction_enzymes.yml +3 -3
- metadata +17 -36
- data/doc/setup.rb +0 -1655
- data/lib/bioroebe/genbank/genbank_parser.rb +0 -291
- data/lib/bioroebe/shell/add.rb +0 -108
- data/lib/bioroebe/shell/assign.rb +0 -360
- data/lib/bioroebe/shell/chop_and_cut.rb +0 -281
- data/lib/bioroebe/shell/constants.rb +0 -166
- data/lib/bioroebe/shell/download.rb +0 -335
- data/lib/bioroebe/shell/enable_and_disable.rb +0 -158
- data/lib/bioroebe/shell/enzymes.rb +0 -310
- data/lib/bioroebe/shell/fasta.rb +0 -345
- data/lib/bioroebe/shell/gtk.rb +0 -76
- data/lib/bioroebe/shell/history.rb +0 -132
- data/lib/bioroebe/shell/initialize.rb +0 -217
- data/lib/bioroebe/shell/loop.rb +0 -74
- data/lib/bioroebe/shell/misc.rb +0 -4341
- data/lib/bioroebe/shell/prompt.rb +0 -107
- data/lib/bioroebe/shell/random.rb +0 -289
- data/lib/bioroebe/shell/reset.rb +0 -335
- data/lib/bioroebe/shell/scan_and_parse.rb +0 -135
- data/lib/bioroebe/shell/search.rb +0 -337
- data/lib/bioroebe/shell/sequences.rb +0 -200
- data/lib/bioroebe/shell/show_report_and_display.rb +0 -2901
- data/lib/bioroebe/shell/startup.rb +0 -127
- data/lib/bioroebe/shell/taxonomy.rb +0 -14
- data/lib/bioroebe/shell/tk.rb +0 -23
- data/lib/bioroebe/shell/user_input.rb +0 -88
- data/lib/bioroebe/shell/xorg.rb +0 -45
data/README.md
CHANGED
@@ -2,13 +2,13 @@
|
|
2
2
|
[![forthebadge](https://forthebadge.com/images/badges/made-with-ruby.svg)](https://www.ruby-lang.org/en/)
|
3
3
|
[![Gem Version](https://badge.fury.io/rb/bioroebe.svg)](https://badge.fury.io/rb/bioroebe)
|
4
4
|
|
5
|
-
This gem was <b>last updated</b> on the <span style="color: darkblue; font-weight: bold">
|
5
|
+
This gem was <b>last updated</b> on the <span style="color: darkblue; font-weight: bold">05.07.2022</span> (dd.mm.yyyy notation), at <span style="color: steelblue; font-weight: bold">16:47:23</span> o'clock.
|
6
6
|
|
7
7
|
# The Bioroebe Project
|
8
8
|
|
9
9
|
## Bioroebe
|
10
10
|
|
11
|
-
<img src="
|
11
|
+
<img src="https://i.imgur.com/mAoP7AP.png">
|
12
12
|
<img src="https://i.imgur.com/YqYxRBZ.png" style="margin: 4px; margin-left: 12px;"/>
|
13
13
|
<img src="https://i.imgur.com/k7mMlg2.png" style="margin: 4px; margin-left: 12px;"/>
|
14
14
|
|
@@ -335,41 +335,6 @@ so I opted to go the yaml route. But if people want to use a hash
|
|
335
335
|
instead, they can do so, too - see the <b>API</b> for codon tables
|
336
336
|
lateron. Simply define your own constants and pass them to the
|
337
337
|
appropriate methods.
|
338
|
-
|
339
|
-
## Support for other programming languages
|
340
|
-
|
341
|
-
The main programming language for the bioroebe project is **ruby**.
|
342
|
-
Ruby, from a language design point of view, is a great programming
|
343
|
-
language - not necessarily all of ruby, but the subset that I use.
|
344
|
-
It is very easy to quickly prototype ideas via ruby.
|
345
|
-
|
346
|
-
However had, ruby is known to **not** be among the fastest programming
|
347
|
-
languages about on this planet; so, it makes sense to use other
|
348
|
-
languages too from this point of view. Additionally there are some
|
349
|
-
software stacks in use in **other** programming languages, such as
|
350
|
-
matplotlib and various more.
|
351
|
-
|
352
|
-
Thus, it is important to **support other programming languages** as
|
353
|
-
well, if there are useful libraries. The bioroebe project, after
|
354
|
-
all, tries to be **practical**: it focuses on getting things done,
|
355
|
-
no matter the language.
|
356
|
-
|
357
|
-
This means that support for other programming languages can be
|
358
|
-
found in this project as well, often using system() or similar
|
359
|
-
functionality to tap into these other programming languages. Do
|
360
|
-
not be surprised when that happens - the bioroebe project will
|
361
|
-
also try to act as a **practical glue** towards functionality
|
362
|
-
enabled via other projects. We want to get things done, no
|
363
|
-
matter the programming language at hand!
|
364
|
-
|
365
|
-
Whenever possible, though, the bioroebe project will try to be
|
366
|
-
flexible in this regard, so ideally the same solution should
|
367
|
-
work for many different programming languages.
|
368
|
-
|
369
|
-
While Ruby is the primary language for this project, since as
|
370
|
-
of 2021 I will try to officially support **java**, **jruby**
|
371
|
-
and the **GraalVM**. This is on my TODO list, though - stay
|
372
|
-
tuned for more updates in this regard.
|
373
338
|
|
374
339
|
## Readline support in the BioRoebe project
|
375
340
|
|
@@ -553,16 +518,16 @@ the DNA-to-Protein translation is somewhat simply kept as a
|
|
553
518
|
Once you are inside a **running Bioshell**, you can do other **commands**
|
554
519
|
such as this one here:
|
555
520
|
|
556
|
-
random # ← This will generate a random DNA sequence.
|
521
|
+
random # ← This will generate a random DNA sequence. Each nucleotide has the same chance to be added.
|
557
522
|
|
558
523
|
To **assign** a DNA sequence, do:
|
559
524
|
|
560
525
|
assign ATAGGGCTTTT
|
561
526
|
|
562
|
-
Note that since the year 2016
|
563
|
-
the one above, without any other commands/words, then we will assume
|
527
|
+
Note that as of the year <b>2016</b>, if you input a nucleotide sequence
|
528
|
+
like the one above, without any other commands/words, then we will assume
|
564
529
|
that you did mean to do an assignment as-is anyway. The "assign" part
|
565
|
-
then becomes superfluous.
|
530
|
+
then becomes superfluous and can be omitted.
|
566
531
|
|
567
532
|
This is how this is simply done, by omitting the "assign" part of the
|
568
533
|
above instruction altogether:
|
@@ -1073,18 +1038,18 @@ The text **banana** thus has the following suffixes:
|
|
1073
1038
|
|
1074
1039
|
This subsection deals with some aspects of **HMMs**.
|
1075
1040
|
|
1076
|
-
Why are HMMs useful in biology? They can be used to represent protein
|
1077
|
-
families
|
1041
|
+
Why are HMMs useful in biology? They can be used to <b>represent protein
|
1042
|
+
families</b>, for example (via <b>pHMMs</b> - profile hidden Markov models).
|
1078
1043
|
|
1079
1044
|
Furthermore, they can show some bias in the mutation rate that can be
|
1080
1045
|
observed. Different genomes are known to have different hotspots where
|
1081
|
-
mutations are more likely to happen. These are
|
1082
|
-
may be useful.
|
1046
|
+
mutations are more likely to happen, for various reasons. These are
|
1047
|
+
examples where an HMM may be useful.
|
1083
1048
|
|
1084
|
-
HMMs are usually based on the Shannon model where you assign different
|
1049
|
+
HMMs are usually based on the <b>Shannon model</b> where you assign different
|
1085
1050
|
probabilities to "change" events. An example that was mentioned back
|
1086
|
-
in 1948 was the english alphabet - some letters, and combinations
|
1087
|
-
letters, are more commonly seen. Shannon gave the example of "E"
|
1051
|
+
in <b>1948</b> was the English alphabet - some letters, and combinations
|
1052
|
+
of letters, are more commonly seen. Shannon gave the example of "E"
|
1088
1053
|
versus "W", as shown in the following graph (a **finite state
|
1089
1054
|
graph**):
|
1090
1055
|
|
@@ -1098,40 +1063,47 @@ DNA sequence, a 10-mer would be equivalent to **10 base pairs**.
|
|
1098
1063
|
The individual transition states are based on an assumption of
|
1099
1064
|
"randomness", but ensuring that these are truly random is not
|
1100
1065
|
necessarily trivial. Computers do not really 'generate' true
|
1101
|
-
randomness, at the least not when they are working solo
|
1102
|
-
can even 'predict' some randomness here or there
|
1103
|
-
|
1104
|
-
|
1105
|
-
|
1106
|
-
|
1107
|
-
of
|
1108
|
-
|
1109
|
-
given position, but this is not
|
1110
|
-
|
1111
|
-
|
1112
|
-
|
1113
|
-
|
1114
|
-
|
1115
|
-
|
1066
|
+
randomness, at the least not when they are working solo, "on
|
1067
|
+
their own". You can even 'predict' some randomness here or there
|
1068
|
+
via various techniques - see vulnerabilities such as <b>Spectre</b>
|
1069
|
+
or similar variants where software can read from areas of the
|
1070
|
+
memory that should be inaccessible to them. Some of this is based
|
1071
|
+
on co-predictions. For distributed computers, you may often use
|
1072
|
+
random noise or decay of atoms as 'a source of randomness'. For
|
1073
|
+
any DNA nucleotide sequence, we would assume that each base pair
|
1074
|
+
has a 25% chance to exist at any given position, but this is not
|
1075
|
+
necessarily true, again for various reasons.
|
1076
|
+
|
1077
|
+
An interesting thought is ... why is <b>ATP</b> so important?
|
1078
|
+
Yes, of course due to it being 'the energy currency in a cell' but ..
|
1079
|
+
why is this ATP, aka adenine? Why not GTP, aka guanine or any of
|
1080
|
+
the other two nucleotides? (GTP is used too, but why? Why not
|
1081
|
+
CTP and TTP?) I cannot answer this question; there may
|
1082
|
+
be many reasons, including differential chemical storage power
|
1083
|
+
as well as a mere random chance event in evolution, but for whatever
|
1116
1084
|
the reason, you will not find a complete 25% percentage value
|
1117
1085
|
for every given "slot" in DNA, depending on the organism.
|
1118
1086
|
|
1119
1087
|
From a practical point of view, how can we approach Hidden Markov
|
1120
|
-
Models?
|
1088
|
+
Models and use them?
|
1121
1089
|
|
1122
|
-
Let's take the following sequence:
|
1090
|
+
Let's take the following simple sequence:
|
1123
1091
|
|
1124
1092
|
ACGTACGC
|
1125
1093
|
|
1126
1094
|
From this sequence we can see that the <b>3-mer</b> "ACG"
|
1127
1095
|
is followed by either a T, or a C. Have a look at the sequence
|
1128
|
-
to see if you can identify the two ACG subsequences
|
1096
|
+
again to see if you can identify the two ACG subsequences
|
1097
|
+
there. You can see one at the start, and the other one
|
1098
|
+
following a bit later, hence why we come to the conclusion
|
1099
|
+
that either a T or a C will follow this <b>3-mer</b>.
|
1129
1100
|
|
1130
|
-
The probability of either T or C
|
1131
|
-
for A and G to follow there
|
1132
|
-
be ignored.
|
1101
|
+
The probability of T or of C occurring at <b>that</b>
|
1102
|
+
position, thus, is 0.5 (50%) each; for A and G to follow there
|
1103
|
+
is 0% so the latter two can be ignored.
|
1133
1104
|
|
1134
|
-
Thus, we could use a ruby Hash as follows
|
1105
|
+
Thus, we could use a ruby Hash as follows that should
|
1106
|
+
describe these probabilities:
|
1135
1107
|
|
1136
1108
|
probabilities = {'T': 0.5, 'C': 0.5} # ignoring A and G here, but we could denote them via 0 as well
|
1137
1109
|
|
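The 0.5 / 0.5 values for the 3-mer "ACG" in ACGTACGC can be checked by simple counting. The following is a minimal sketch in plain Ruby; `followers_of` is a hypothetical helper written for this illustration and is not claimed to be part of the Bioroebe API. It uses String keys, whereas the README hash above uses Symbol keys.

```ruby
# Count which nucleotide follows each occurrence of a k-mer and
# turn the counts into probabilities.
# NOTE: followers_of is a hypothetical helper, for illustration only.
def followers_of(sequence, kmer)
  counts = Hash.new(0)
  # Visit every start position that still leaves room for one follower.
  (0..sequence.size - kmer.size - 1).each { |index|
    next unless sequence[index, kmer.size] == kmer
    counts[sequence[index + kmer.size]] += 1
  }
  total = counts.values.sum.to_f
  # An empty Hash is returned when the k-mer has no follower at all.
  counts.transform_values { |count| count / total }
end

probabilities = followers_of('ACGTACGC', 'ACG')
p probabilities # T and C each map to 0.5; A and G are absent
```

Nucleotides that never follow the k-mer are simply missing from the result, which corresponds to the "ignoring A and G" remark above; they could be added with a value of 0 if a complete table is preferred.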
@@ -1217,34 +1189,6 @@ each edge.
|
|
1217
1189
|
Parsimony assumes that substitutions are rare and that back-mutations
|
1218
1190
|
do not occur.
|
1219
1191
|
|
1220
|
-
## Random stuff
|
1221
|
-
|
1222
|
-
You can generate random DNA sequences in the shell:
|
1223
|
-
|
1224
|
-
random dna 20
|
1225
|
-
random dna 25
|
1226
|
-
random dna 30
|
1227
|
-
|
1228
|
-
This will generate random DNA sequences, with a length
|
1229
|
-
of 20, 25, 30, respectively. This may not be very useful
|
1230
|
-
but it was important that this functionality is made
|
1231
|
-
available somewhere.
|
1232
|
-
|
1233
|
-
You can also use some toplevel-methods to generate, e. g.
|
1234
|
-
20 random aminoacids:
|
1235
|
-
|
1236
|
-
Bioroebe.random_aminoacid? 20 # => "UAVHYQQESWUYAOVESEIY"
|
1237
|
-
|
1238
|
-
Note that there may exist other APIs within the Bioroebe project
|
1239
|
-
that do the same as well.
|
1240
|
-
|
1241
|
-
If you would like to use a ruby-gtk3 widget have a look
|
1242
|
-
at **RandomSequence**, under **bioroebe/gtk3/random_sequence/**.
|
1243
|
-
It works with aminoacids, DNA and RNA, and allows the user to
|
1244
|
-
create random sequences. (If you need weighted randomness then
|
1245
|
-
you currently have to use the commandline variant. Perhaps I may
|
1246
|
-
add support into the GUI directly for this one day.)
|
1247
|
-
|
1248
1192
|
## Displaying the main sequence with delimiter characters
|
1249
1193
|
|
1250
1194
|
From within the <b>bioshell</b>, you can use some alternative ways to
|
@@ -2714,18 +2658,6 @@ This may look as follows:
|
|
2714
2658
|
|
2715
2659
|
<img src="https://i.imgur.com/gAZg8qG.png" style="margin: 1em; margin-left: 3em">
|
2716
2660
|
|
2717
|
-
## Obtaining a subsequence from a Bioroebe::Sequence object
|
2718
|
-
|
2719
|
-
Say that you have the DNA sequence **ATGCATGCAAAA**.
|
2720
|
-
|
2721
|
-
There are several ways how to obtain a subsequence from
|
2722
|
-
this. One variant will be shown next, by making use of
|
2723
|
-
the method called **.subseq()**.
|
2724
|
-
|
2725
|
-
Example:
|
2726
|
-
|
2727
|
-
seq = Bioroebe::Sequence.new("ATGCATGCAAAA"); seq.subseq(1,3) # => "ATG"
|
2728
|
-
|
2729
2661
|
## Bioroebe::Protein
|
2730
2662
|
|
2731
2663
|
This class is a subclass of class **Bioroebe::Sequence**. The
|
@@ -2740,16 +2672,6 @@ functionality is also available in another method.
|
|
2740
2672
|
For now keep this in mind; at some later point I may decide whether
|
2741
2673
|
this class is to be kept or not.
|
2742
2674
|
|
2743
|
-
## Permanently disabling showing the startup-introduction of the Bioshell
|
2744
|
-
|
2745
|
-
If you do not want to see the start-up intro, you can try
|
2746
|
-
any of the following:
|
2747
|
-
|
2748
|
-
bioshell --permanently-disable-startup-intro
|
2749
|
-
bioshell --permanently-disable-startup-notice
|
2750
|
-
bioshell --permanently-no-startup-intro
|
2751
|
-
bioshell --permanently-no-startup-info
|
2752
|
-
|
2753
2675
|
## Decoding aminoacids
|
2754
2676
|
|
2755
2677
|
Decoding aminoacids means to take the aminoacid at hand, ideally
|
@@ -3176,47 +3098,45 @@ can try to use:
|
|
3176
3098
|
On class Bioroebe::Sequence. More customizability may be added
|
3177
3099
|
to that method in this regard, if users need this.
|
3178
3100
|
|
3179
|
-
|
3101
|
+
### Obtaining a subsequence from a Bioroebe::Sequence object
|
3180
3102
|
|
3181
|
-
|
3182
|
-
the **bioshell**.
|
3103
|
+
Say that you have the DNA sequence **ATGCATGCAAAA**.
|
3183
3104
|
|
3184
|
-
|
3105
|
+
There are several ways how to obtain a subsequence from
|
3106
|
+
this. One variant will be shown next, by making use of
|
3107
|
+
the method called **.subseq()**.
|
3185
3108
|
|
3186
|
-
|
3109
|
+
Example:
|
3187
3110
|
|
3188
|
-
|
3111
|
+
seq = Bioroebe::Sequence.new("ATGCATGCAAAA"); seq.subseq(1,3) # => "ATG"
|
3189
3112
|
|
3190
|
-
You can
|
3191
|
-
code:
|
3113
|
+
You can also randomize the sequence, via .randomize().
|
3192
3114
|
|
3193
|
-
|
3115
|
+
Example:
|
3194
3116
|
|
3195
|
-
|
3196
|
-
returned representing that nucleotide sequence.
|
3117
|
+
x = Bioroebe::Sequence.new; x.randomize
|
3197
3118
|
|
3198
|
-
|
3199
|
-
should be generated.
|
3119
|
+
This is similar to the method in Bioruby here:
|
3200
3120
|
|
3201
|
-
|
3121
|
+
https://github.com/bioruby/bioruby/blob/master/lib/bio/sequence/common.rb#L243
|
3202
3122
|
|
3203
|
-
|
3204
|
-
such as by issuing the following command:
|
3123
|
+
## The Hydropathy index
|
3205
3124
|
|
3206
|
-
|
3125
|
+
You can display the hydropathy index for aminoacids from within
|
3126
|
+
the **bioshell**.
|
3207
3127
|
|
3208
|
-
|
3128
|
+
Simply issue:
|
3209
3129
|
|
3210
|
-
|
3130
|
+
hydropathy?
|
3211
3131
|
|
3212
|
-
|
3132
|
+
## The GFF file format
|
3213
3133
|
|
3214
|
-
|
3134
|
+
From within the **bioshell** you can analyze .gff and .gff3 files,
|
3135
|
+
such as by issuing the following command:
|
3215
3136
|
|
3216
|
-
|
3137
|
+
gff3? foobar.gff3
|
3217
3138
|
|
3218
|
-
|
3219
|
-
compositions of the same nucleotide.
|
3139
|
+
Evidently for this to work the file at hand has to exist.
|
3220
3140
|
|
3221
3141
|
## The NCBI Taxonomy database (the Taxonomy submodule of the Bioroebe project)
|
3222
3142
|
|
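The `.subseq(1,3) # => "ATG"` example in the hunk above suggests 1-based, inclusive positions. A minimal sketch of that convention on a plain Ruby String follows; both helpers are hypothetical stand-ins and not the actual Bioroebe::Sequence implementation. The assumption that randomizing means "shuffle the existing residues, keeping the composition" is taken from the comparison to BioRuby made in the README.

```ruby
# Hypothetical stand-alone helpers, for illustration only.

# Return the subsequence between two 1-based, inclusive positions.
def subseq(sequence, start_position, end_position)
  sequence[(start_position - 1)..(end_position - 1)]
end

# Shuffle the residues; the composition of the sequence stays the same.
def randomize(sequence)
  sequence.chars.shuffle.join
end

dna = 'ATGCATGCAAAA'
puts subseq(dna, 1, 3) # the first three nucleotides
shuffled = randomize(dna)
puts shuffled          # same nucleotides, random order
```

Biological coordinates usually start at 1, while Ruby strings are indexed from 0, hence the `- 1` in `subseq`.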
@@ -3353,47 +3273,6 @@ nucleotides by issuing:
|
|
3353
3273
|
|
3354
3274
|
show_individual_weight_of_the_four_dna_nucleotides
|
3355
3275
|
|
3356
|
-
## Truncating output in the bioroebe-shell
|
3357
|
-
![alt text][cat1]
|
3358
|
-
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
3359
|
-
|
3360
|
-
**DNA/RNA sequences** can become very long and then become
|
3361
|
-
quite difficult to view, read and handle on the commandline.
|
3362
|
-
|
3363
|
-
Normally the bioroebe shell will truncate output of DNA sequences
|
3364
|
-
that are "too long". This is mostly done so that working with
|
3365
|
-
very long sequences becomes a bit more convenient.
|
3366
|
-
|
3367
|
-
Sometimes this can become an antifeature, though, so the user
|
3368
|
-
must be able to toggle this at his or her own discretion.
|
3369
|
-
|
3370
|
-
By default, the bioroebe-shell (bioshell) will always try
|
3371
|
-
to truncate output, but you can toggle this behaviour by
|
3372
|
-
issuing:
|
3373
|
-
|
3374
|
-
do not truncate
|
3375
|
-
|
3376
|
-
In theory, other "do not" actions are also supported, or will
|
3377
|
-
be supported in the future; right now (Oct 2019) this is a bit
|
3378
|
-
limited.
|
3379
|
-
|
3380
|
-
From the toplevel, you can use this method:
|
3381
|
-
|
3382
|
-
Bioroebe.do_not_truncate
|
3383
|
-
|
3384
|
-
The above instruction will toggle the truncate behaviour
|
3385
|
-
to not truncate, ever.
|
3386
|
-
|
3387
|
-
If you need to do so within the bioshell, this is the way:
|
3388
|
-
|
3389
|
-
no_truncate
|
3390
|
-
|
3391
|
-
Or simply
|
3392
|
-
|
3393
|
-
truncate
|
3394
|
-
|
3395
|
-
This will toggle, like a switch.
|
3396
|
-
|
3397
3276
|
## Rosalind Challenges
|
3398
3277
|
![alt text][cat1]
|
3399
3278
|
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
@@ -3530,31 +3409,6 @@ investing more time into Rosalind. Let's focus on solving
|
|
3530
3409
|
real, existing problems instead - at the least as far as
|
3531
3410
|
the Bioroebe project is concerned.
|
3532
3411
|
|
3533
|
-
## Numbers as input in the bioshell
|
3534
|
-
![alt text][cat1]
|
3535
|
-
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
3536
|
-
|
3537
|
-
You can input a number in the **BioShell** such as <b style="color: darkblue">3</b>.
|
3538
|
-
|
3539
|
-
This will attempt to <b>display the first 3 nucleotides</b> of
|
3540
|
-
the assigned **main sequence**. It will only work if you have
|
3541
|
-
assigned a sequence prior to that, though.
|
3542
|
-
|
3543
|
-
Examples:
|
3544
|
-
|
3545
|
-
3
|
3546
|
-
33
|
3547
|
-
15
|
3548
|
-
|
3549
|
-
## transeq
|
3550
|
-
![alt text][cat1]
|
3551
|
-
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
3552
|
-
|
3553
|
-
You can convert a DNA sequence into an aminoacid sequence by
|
3554
|
-
doing this:
|
3555
|
-
|
3556
|
-
transeq
|
3557
|
-
|
3558
3412
|
## Align two different sequences
|
3559
3413
|
![alt text][cat1]
|
3560
3414
|
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
@@ -3866,22 +3720,6 @@ does not (yet?) have support for comparing two genomes to
|
|
3866
3720
|
one another and generate a visual map indicating the findings
|
3867
3721
|
there.
|
3868
3722
|
|
3869
|
-
## Do not create directories on startup of the shell
|
3870
|
-
|
3871
|
-
By default the bioshell will try to create some directories
|
3872
|
-
on startup. This may not always be desired by the user
|
3873
|
-
though, so an option has to exist to disable this functionality.
|
3874
|
-
|
3875
|
-
Internally the variable @internal_hash[:create_directories_on_startup_of_the_shell]
|
3876
|
-
keeps track of whether directories on startup of the shell will
|
3877
|
-
be created.
|
3878
|
-
|
3879
|
-
To disable this behaviour on startup of the bioshell, try
|
3880
|
-
something like this:
|
3881
|
-
|
3882
|
-
bioshell --do-not-create-directories-on-startup
|
3883
|
-
bioshell --do-not-create-directories
|
3884
|
-
|
3885
3723
|
## class Bioroebe::MoveFileToItsCorrectLocation
|
3886
3724
|
|
3887
3725
|
This class will move a bio-file to its "correct" location, with respect
|
@@ -4050,39 +3888,6 @@ has". Genes in itself are not that well-defined, so they are not necessarily
|
|
4050
3888
|
the primary means of complexity. Think of this more as an interactome,
|
4051
3889
|
where RNAs play a major dynamic role as well.
|
4052
3890
|
|
4053
|
-
## Bioroebe::ProfilePattern
|
4054
|
-
|
4055
|
-
This class can be used to generate nucleotide sequences that
|
4056
|
-
are not quite "random". For example, to generate sequences
|
4057
|
-
that may "simulate" a TATA box.
|
4058
|
-
|
4059
|
-
The idea for this class is to be extended into allowing
|
4060
|
-
HMMs (Hidden Markov Models) one day.
|
4061
|
-
|
4062
|
-
Usage example:
|
4063
|
-
|
4064
|
-
_ = Bioroebe::ProfilePattern.new(ARGV, :do_not_run_yet)
|
4065
|
-
_.generate_sequence_based_on_this_profile
|
4066
|
-
|
4067
|
-
Such a profile will encode the profile specifying the preferred sequence
|
4068
|
-
letters for each position in a section of DNA. You have to provide
|
4069
|
-
the Hash into the method generate_sequence_based_on_this_profile() -
|
4070
|
-
or you use the default Hash, which is stored in the constant
|
4071
|
-
called **PER_POSITION_HASH**.
|
4072
|
-
|
4073
|
-
That profile should be a Hash, with keys pointing to A, T, C, G
|
4074
|
-
and the values being an Array of likelihood chance there,
|
4075
|
-
as a number, such as 140. These values are also called
|
4076
|
-
**scores**. Each score contains a number for each position
|
4077
|
-
that indicates how likely it is to find the given
|
4078
|
-
nucleotide at that location.
|
4079
|
-
|
4080
|
-
You can also use this class to generate a random DNA string,
|
4081
|
-
similar to the method called
|
4082
|
-
**Bioroebe.generate_random_dna_sequence()**. The difference
|
4083
|
-
is that class ProfilePattern allows for a bit more fine-tuned
|
4084
|
-
control. The class will likely be extended in the future too.
|
4085
|
-
|
4086
3891
|
## class Bioroebe::DisplayOpenReadingFrames
|
4087
3892
|
|
4088
3893
|
**class Bioroebe::DisplayOpenReadingFrames**, created in **May 2020**,
|
@@ -4462,28 +4267,6 @@ the BioRoebe-Shell, then you can use either of the following:
|
|
4462
4267
|
|
4463
4268
|
seq?
|
4464
4269
|
seq_with_tab?
|
4465
|
-
|
4466
|
-
## Prompt (the shell prompt9
|
4467
|
-
|
4468
|
-
You can set a <b>custom prompt</b>, via the keywords
|
4469
|
-
"prompt" or "set_prompt".
|
4470
|
-
|
4471
|
-
To display the <b>current working directory</b>, do:
|
4472
|
-
|
4473
|
-
prompt pwd
|
4474
|
-
|
4475
|
-
To revert to the old default again, do this:
|
4476
|
-
|
4477
|
-
prompt REVERT
|
4478
|
-
prompt revert
|
4479
|
-
prompt DEFAULT
|
4480
|
-
prompt default
|
4481
|
-
|
4482
|
-
If you do not want to set any prompt, do:
|
4483
|
-
|
4484
|
-
prompt none
|
4485
|
-
|
4486
|
-
|
4487
4270
|
|
4488
4271
|
## Leader and Trailer
|
4489
4272
|
|
@@ -5764,6 +5547,9 @@ like this:
|
|
5764
5547
|
|
5765
5548
|
<img src="https://i.imgur.com/vr2kEBz.png" style="margin: 1em; margin-left: 3em">
|
5766
5549
|
|
5550
|
+
As of <b>July 2022</b>, invalid amino acids will be automatically
|
5551
|
+
filtered away before being assigned to the input.
|
5552
|
+
|
5767
5553
|
## Colourizing hydrophilic and hydrophobic aminoacids on the commandline
|
5768
5554
|
|
5769
5555
|
Via class **Bioroebe::ColourizeHydrophilicAndHydrophobicAminoacids** you
|
@@ -5777,35 +5563,36 @@ Example output for this:
|
|
5777
5563
|
|
5778
5564
|
This subsection contains some information about proteases.
|
5779
5565
|
|
5780
|
-
|
5566
|
+
Trypsin:
|
5781
5567
|
https://en.wikipedia.org/wiki/Trypsin
|
5782
|
-
cuts at
|
5568
|
+
<b>cuts at</b>: Trypsin cuts peptide chains mainly at the carboxyl
|
5783
5569
|
side of the amino acids lysine or arginine.
|
5784
5570
|
|
5785
|
-
|
5571
|
+
Chymotrypsin:
|
5786
5572
|
https://en.wikipedia.org/wiki/Chymotrypsin
|
5787
|
-
cuts at
|
5573
|
+
<b>cuts at</b>: Chymotrypsin preferentially cleaves peptide amide
|
5788
5574
|
bonds where the side chain of the amino acid N-terminal
|
5789
|
-
to the scissile amide bond is a large hydrophobic amino
|
5790
|
-
acid (tyrosine, tryptophan, and phenylalanine).
|
5575
|
+
to the scissile amide bond is <b>a large hydrophobic amino</b>
|
5576
|
+
acid (specifically: tyrosine, tryptophan, and phenylalanine).
|
5577
|
+
Chymotrypsin will cleave proteins on the <b>carboxyl side</b>
|
5578
|
+
of aromatic or large hydrophobic amino acids.
|
5791
5579
|
|
5792
|
-
|
5580
|
+
Thrombin:
|
5793
5581
|
https://en.wikipedia.org/wiki/Thrombin
|
5794
|
-
cuts at
|
5582
|
+
<b>cuts at</b>: Thrombin acts as a serine protease that converts
|
5795
5583
|
soluble fibrinogen into insoluble strands of fibrin. It
|
5796
5584
|
catalyzes the hydrolysis of <b>Arg-Gly</b> bonds in
|
5797
5585
|
particular peptide sequences only.
|
5798
5586
|
|
5799
|
-
|
5587
|
+
Plasmin:
|
5800
5588
|
https://en.wikipedia.org/wiki/Plasmin
|
5801
|
-
cuts at
|
5589
|
+
<b>cuts at</b>: Plasmin is a serine protease.
|
5802
5590
|
|
5803
|
-
|
5591
|
+
Papain:
|
5804
5592
|
https://en.wikipedia.org/wiki/Papain
|
5805
|
-
cuts at
|
5806
|
-
|
5807
|
-
|
5808
|
-
not followed by a valine.
|
5593
|
+
<b>cuts at</b>: Papain prefers to cleave after an arginine or
|
5594
|
+
lysine preceded by a hydrophobic unit (Ala, Val, Leu, Ile,
|
5595
|
+
Phe, Trp, Tyr) and not followed by a valine.
|
5809
5596
|
|
5810
5597
|
factor Xa:
|
5811
5598
|
|
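The trypsin rule quoted in the hunk above (cuts mainly on the carboxyl side of lysine or arginine) can be illustrated with a regular expression. This is a deliberately simplified sketch: `trypsin_fragments` is a hypothetical helper, not a Bioroebe method, and it ignores every exception hidden behind the word "mainly".

```ruby
# Simplified in-silico trypsin digest: cut after every lysine (K)
# or arginine (R), one-letter amino acid code.
# NOTE: trypsin_fragments is a hypothetical helper, for illustration only.
def trypsin_fragments(protein)
  # Either "anything up to and including K/R", or a trailing
  # fragment that contains no K/R at all.
  protein.scan(/[^KR]*[KR]|[^KR]+/)
end

fragments = trypsin_fragments('MAKLLRGGSK')
p fragments # three fragments: MAK, LLR, GGSK
```

The other proteases listed above would need different patterns; thrombin, for instance, is described as specific for Arg-Gly bonds within particular peptide sequences only, which a single character class cannot express.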
@@ -5817,8 +5604,8 @@ Some proteins may permanently reside in the lumen of the
|
|
5817
5604
|
Often such proteins will have a special signal sequence attached
|
5818
5605
|
to their **C-terminal part**, such as **KDEL** (Lys-Asp-Glu-Leu).
|
5819
5606
|
|
5820
|
-
KDEL is not the only signal that may be used, though. Some
|
5821
|
-
may use different signals, such as:
|
5607
|
+
<b>KDEL</b> is not the only signal that may be used, though. Some
|
5608
|
+
species may use different signals, such as:
|
5822
5609
|
|
5823
5610
|
aminoacids | species
|
5824
5611
|
-------------|------------------------------------------------------------
|
@@ -5828,8 +5615,9 @@ may use different signals, such as:
|
|
5828
5615
|
ADEL | Schizosaccharomyces pombe (fission yeast)
|
5829
5616
|
SDEL | Plasmodium falciparum
|
5830
5617
|
|
5831
|
-
If you work with the bioshell then you can simply use this
|
5832
|
-
to query whether the given aminoacid sequence has a KDEL
|
5618
|
+
If you work with the <b>bioshell</b> then you can simply use this
|
5619
|
+
method to query whether the given aminoacid sequence has a KDEL
|
5620
|
+
sequence:
|
5833
5621
|
|
5834
5622
|
KDEL?
|
5835
5623
|
|
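What a query such as `KDEL?` has to decide can be sketched in a few lines of plain Ruby. The helper name and the constant are hypothetical, and the list contains only KDEL plus the two variants visible in this excerpt of the table (ADEL and SDEL); how the bioshell implements the check internally is not shown here.

```ruby
# C-terminal retention signals visible in this excerpt.
# NOTE: constant and method names are hypothetical, for illustration only.
RETENTION_SIGNALS = %w(KDEL ADEL SDEL).freeze

# Return the signal found at the C-terminal end, or nil if there is none.
def retention_signal_of(protein)
  upcased = protein.upcase
  RETENTION_SIGNALS.find { |signal| upcased.end_with?(signal) }
end

p retention_signal_of('MKTLLAKDEL') # ends with KDEL
p retention_signal_of('MKTLLA')     # no known signal
```

Only the end of the sequence is inspected, since the README describes the signal as attached to the C-terminal part of the protein.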
@@ -7365,16 +7153,6 @@ This would notify the bioshell that only nucleotides from position
|
|
7365
7153
|
51 to (including) position 3251 will be colourized, when doing another
|
7366
7154
|
"ORF?" invocation.
|
7367
7155
|
|
7368
|
-
## Longest substring
|
7369
|
-
|
7370
|
-
Within the Bioroebe::Shell you can determine the longest substring,
|
7371
|
-
including gaps, like s:'
|
7372
|
-
|
7373
|
-
longest_substring? ATTATTGTT | ATTATTCTT'
|
7374
|
-
|
7375
|
-
Note that this will make use of the diff-lcs gem, which uses
|
7376
|
-
the McIlroy-Hunt algorithm.
|
7377
|
-
|
7378
7156
|
## Restriction Enzymes
|
7379
7157
|
|
7380
7158
|
This **subsection** will eventually be expanded to explain various things about
|
@@ -8733,6 +8511,22 @@ The images that can be generated via this may look as follows:
|
|
8733
8511
|
|
8734
8512
|
<img src="https://i.imgur.com/fWwD1fj.png" style="margin: 1em; margin-left: 2em">
|
8735
8513
|
|
8514
|
+
Let's look at another example.
|
8515
|
+
|
8516
|
+
Say you input the following sequences there:
|
8517
|
+
|
8518
|
+
AGVV
|
8519
|
+
AGVV
|
8520
|
+
AGVV
|
8521
|
+
AGVV
|
8522
|
+
AGGV
|
8523
|
+
AGGV
|
8524
|
+
AGGV
|
8525
|
+
|
8526
|
+
The resulting image that is generated is:
|
8527
|
+
|
8528
|
+
<img src="https://i.imgur.com/3wWApIQ.png" style="margin: 1em; margin-left: 2em">
|
8529
|
+
|
8736
8530
|
## The Kozak Sequence
|
8737
8531
|
|
8738
8532
|
The ribosome usually scans for a **AUG** codon. But there are
|
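For the seven example sequences listed earlier in this hunk (AGVV four times, AGGV three times), the per-column residue counts can be computed as follows. This is a sketch in plain Ruby (2.7 or newer, for `Array#tally`); that the generated image is derived from exactly such counts is an assumption, as the README excerpt does not say how the picture is computed.

```ruby
# The seven input sequences from the example above.
sequences = %w(AGVV AGVV AGVV AGVV AGGV AGGV AGGV)

# Turn rows into columns, then count the residues in each column.
# transpose requires all sequences to have the same length.
column_counts = sequences.map(&:chars).transpose.map(&:tally)

# Convert the counts into relative frequencies per column.
column_frequencies = column_counts.map { |counts|
  counts.transform_values { |count| count.fdiv(sequences.size) }
}

column_counts.each_with_index { |counts, index|
  puts "Position #{index + 1}: #{counts}"
}
```

Only the third position is variable here: V occurs four times and G three times, while the other three positions are fully conserved.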
@@ -9183,6 +8977,409 @@ time being it is what it is. At a later point in time test cases
|
|
9183
8977
|
may be added to check whether it performs correctly or whether it
|
9184
8978
|
does not.
|
9185
8979
|
|
8980
|
+
The other rules, also published in 2004, are the Reynolds rules. Code
|
8981
|
+
support was added to the Bioroebe project in <b>June 2022</b>, but
|
8982
|
+
it was not tested yet, so the implementation may be incorrect.
|
8983
|
+
|
8984
|
+
## The Bioroebe::Shell interface
|
8985
|
+
|
8986
|
+
The following subsection specifically handles information
|
8987
|
+
pertaining to the <b>Bioroebe::Shell</b> interface of the
|
8988
|
+
<b>bioroebe project</b>. It is also called <b>bioshell</b>,
|
8989
|
+
to simplify spelling it.
|
8990
|
+
|
8991
|
+
### Numbers as input in the bioshell
|
8992
|
+
![alt text][cat1]
|
8993
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8994
|
+
|
8995
|
+
You can input a number in the **BioShell** such as <b style="color: darkblue">3</b>.
|
8996
|
+
|
8997
|
+
This will attempt to <b>display the first 3 nucleotides</b> of
|
8998
|
+
the assigned **main sequence**. It will only work if you have
|
8999
|
+
assigned a sequence prior to that, though.
|
9000
|
+
|
9001
|
+
Examples:
|
9002
|
+
|
9003
|
+
3
|
9004
|
+
33
|
9005
|
+
15
|
9006
|
+
|
9007
|
+
### transeq
|
9008
|
+
![alt text][cat1]
|
9009
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
9010
|
+
|
9011
|
+
You can convert a DNA sequence into an aminoacid sequence by
|
9012
|
+
doing this:
|
9013
|
+
|
9014
|
+
transeq
|
9015
|
+
|
9016
|
+
### Shuffling the DNA/RNA string in the bioshell
|
9017
|
+
![alt text][cat1]
|
9018
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
9019
|
+
|
9020
|
+
Via
|
9021
|
+
|
9022
|
+
shuffle
|
9023
|
+
|
9024
|
+
you can <b>randomly rearrange the main DNA/RNA string</b>
|
9025
|
+
that is used by the <b>Bioroebe::Shell</b>.
|
9026
|
+
|
9027
|
+
This can be useful if you just wish to quickly "test"
|
9028
|
+
new arrangements of the same nucleotides.
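
A minimal sketch of the idea in plain ruby, assuming the sequence is a simple String. The method name shuffle_sequence is hypothetical and not the code used by the shell.

```ruby
# Randomly rearrange the nucleotides of a sequence.
# The composition stays the same, only the order changes.
def shuffle_sequence(sequence, rng = Random.new)
  sequence.chars.shuffle(random: rng).join
end

puts shuffle_sequence('ATGCCGTA')
```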
|
9029
|
+
|
9030
|
+
### Permanently disabling showing the startup-introduction of the Bioshell
|
9031
|
+
![alt text][cat1]
|
9032
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
9033
|
+
|
9034
|
+
If you do not want to see the start-up intro, you can try
|
9035
|
+
any of the following:
|
9036
|
+
|
9037
|
+
bioshell --permanently-disable-startup-intro
|
9038
|
+
bioshell --permanently-disable-startup-notice
|
9039
|
+
bioshell --permanently-no-startup-intro
|
9040
|
+
bioshell --permanently-no-startup-info
|
9041
|
+
|
9042
|
+
### Longest substring
|
9043
|
+
![alt text][cat1]
|
9044
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
9045
|
+
|
9046
|
+
Within the Bioroebe::Shell you can determine the longest substring,
|
9047
|
+
including gaps, like so:
|
9048
|
+
|
9049
|
+
longest_substring? ATTATTGTT | ATTATTCTT
|
9050
|
+
|
9051
|
+
Note that this will make use of the diff-lcs gem, which uses
|
9052
|
+
the McIlroy-Hunt algorithm.
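
To show what "longest substring, including gaps" means, here is a small standalone sketch. Note that it uses a plain dynamic-programming table rather than the algorithm inside diff-lcs, and that longest_common_subsequence is a name chosen for this example only.

```ruby
# Longest common subsequence of two strings via a dynamic-programming table.
# This is NOT the diff-lcs implementation, just an illustration of the result.
def longest_common_subsequence(a, b)
  table = Array.new(a.size + 1) { Array.new(b.size + 1, 0) }
  a.each_char.with_index(1) do |char_a, i|
    b.each_char.with_index(1) do |char_b, j|
      table[i][j] =
        if char_a == char_b
          table[i - 1][j - 1] + 1
        else
          [table[i - 1][j], table[i][j - 1]].max
        end
    end
  end
  # Walk back through the table to rebuild the subsequence itself.
  result = +''
  i = a.size
  j = b.size
  while i > 0 && j > 0
    if a[i - 1] == b[j - 1]
      result.prepend(a[i - 1])
      i -= 1
      j -= 1
    elsif table[i - 1][j] >= table[i][j - 1]
      i -= 1
    else
      j -= 1
    end
  end
  result
end

puts longest_common_subsequence('ATTATTGTT', 'ATTATTCTT') # => ATTATTTT
```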
|
9053
|
+
|
9054
|
+
### Do not create directories on startup of the shell
|
9055
|
+
![alt text][cat1]
|
9056
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
9057
|
+
|
9058
|
+
By default the <b>bioshell</b> will try to create some directories
|
9059
|
+
on startup. This may not always be desired by the user, though,
|
9060
|
+
so an option has to exist to <b>disable</b> this functionality.
|
9061
|
+
|
9062
|
+
Internally the variable @internal_hash[:create_directories_on_startup_of_the_shell]
|
9063
|
+
keeps track of whether directories on startup of the shell will
|
9064
|
+
be created.
|
9065
|
+
|
9066
|
+
To disable this behaviour on startup of the bioshell, try
|
9067
|
+
something like this:
|
9068
|
+
|
9069
|
+
bioshell --do-not-create-directories-on-startup
|
9070
|
+
bioshell --do-not-create-directories
|
9071
|
+
|
9072
|
+
### Generating and assigning a random amount of nucleotides
|
9073
|
+
![alt text][cat1]
|
9074
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
9075
|
+
|
9076
|
+
Via:
|
9077
|
+
|
9078
|
+
random 555
|
9079
|
+
|
9080
|
+
you can "generate" 555 random nucleotides (DNA that is) and
|
9081
|
+
assign it to the main sequence in use by the bioshell. This
|
9082
|
+
is mostly a convenience feature, if you want to debug something
|
9083
|
+
quickly.
|
9084
|
+
|
9085
|
+
### Determining the log directory for the Bioroebe::Shell component
|
9086
|
+
![alt text][cat1]
|
9087
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
9088
|
+
|
9089
|
+
Via:
|
9090
|
+
|
9091
|
+
bioshell_log_dir?
|
9092
|
+
|
9093
|
+
you can determine the log-directory output for the bioshell
|
9094
|
+
component. On my home system this will default to
|
9095
|
+
<b>/home/Temp/bioroebe/bioshell/</b>.
|
9096
|
+
|
9097
|
+
### Prompt (the shell prompt of the bioshell)
|
9098
|
+
![alt text][cat1]
|
9099
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
9100
|
+
|
9101
|
+
You can set a <b>custom prompt</b> in the bioshell, via
|
9102
|
+
the keywords "<b>prompt</b>" or "<b>set_prompt</b>".
|
9103
|
+
|
9104
|
+
To display the <b>current working directory</b>, do:
|
9105
|
+
|
9106
|
+
prompt pwd
|
9107
|
+
|
9108
|
+
To revert to the old default again, do this:
|
9109
|
+
|
9110
|
+
prompt REVERT
|
9111
|
+
prompt revert
|
9112
|
+
prompt DEFAULT
|
9113
|
+
prompt default
|
9114
|
+
|
9115
|
+
If you do not want to set any prompt, do:
|
9116
|
+
|
9117
|
+
prompt none
|
9118
|
+
|
9119
|
+
### Random stuff - generating random DNA sequences in the bioshell
|
9120
|
+
![alt text][cat1]
|
9121
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
9122
|
+
|
9123
|
+
You can <b>generate random DNA sequences</b> in the
|
9124
|
+
<b>bioshell</b> via:
|
9125
|
+
|
9126
|
+
random dna 20
|
9127
|
+
random dna 25
|
9128
|
+
random dna 30
|
9129
|
+
# or simpler
|
9130
|
+
random 20
|
9131
|
+
random 25
|
9132
|
+
random 30
|
9133
|
+
|
9134
|
+
This will generate random DNA sequences, with a length
|
9135
|
+
of 20, 25, 30, respectively. This may not be very useful
|
9136
|
+
but it was important that this functionality is made
|
9137
|
+
available somewhere. Sometimes you may not even care
|
9138
|
+
about the sequence and just use a "filler" sequence,
|
9139
|
+
so randomness has to be part of the Bioroebe project
|
9140
|
+
as well.
|
9141
|
+
|
9142
|
+
You can also use some toplevel-methods to generate, e. g.
|
9143
|
+
20 random aminoacids. Have a look at the following
|
9144
|
+
<b>toplevel API</b>:
|
9145
|
+
|
9146
|
+
Bioroebe.random_aminoacid? 20 # => "UAVHYQQESWUYAOVESEIY"
|
9147
|
+
|
9148
|
+
Note that there may exist other APIs within the Bioroebe project
|
9149
|
+
that do the same as well.
|
9150
|
+
|
9151
|
+
If you would like to use a ruby-gtk3 widget have a look
|
9152
|
+
at **RandomSequence**, under **bioroebe/gtk3/random_sequence/**.
|
9153
|
+
It works with aminoacids, DNA and RNA, and allows the user to
|
9154
|
+
create random sequences. (If you need weighted randomness then
|
9155
|
+
you currently have to use the commandline variant. Perhaps I may
|
9156
|
+
add support into the GUI directly for this one day.)
|
9157
|
+
|
9158
|
+
### Deprecations within the Bioroebe::Shell
|
9159
|
+
![alt text][cat1]
|
9160
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
9161
|
+
|
9162
|
+
Over the years the Bioroebe::Shell changed quite a bit.
|
9163
|
+
|
9164
|
+
This subsection here will list a few of these changes
|
9165
|
+
or rather, the deprecations.
|
9166
|
+
|
9167
|
+
**raw_sequence**: removed in June 2022 completely. It is
|
9168
|
+
simpler to handle sequences via Bioroebe::Sequence
|
9169
|
+
instead.
|
9170
|
+
|
9171
|
+
<b>@internal_hash[:array_sequences]</b> was no longer in
|
9172
|
+
use, so it was removed in July 2022.
|
9173
|
+
|
9174
|
+
### Chop off nucleotides within the Bioroebe::Shell
|
9175
|
+
![alt text][cat1]
|
9176
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
9177
|
+
|
9178
|
+
You can use the following syntax to chop away until you find
|
9179
|
+
a particular substring, in the bioshell:
|
9180
|
+
|
9181
|
+
chop_to ATG
|
9182
|
+
|
9183
|
+
This functionality was specifically added to find the first
|
9184
|
+
ATG codon.
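
The behaviour can be sketched in a few lines of ruby. This assumes that the matched substring itself is kept and that the sequence is left alone when there is no match; chop_to here is a standalone method written for this example, not the shell code.

```ruby
# Drop everything in front of the first occurrence of substring.
# Assumption: the match is kept, and a sequence without a match is unchanged.
def chop_to(sequence, substring)
  position = sequence.index(substring)
  position ? sequence[position..-1] : sequence
end

puts chop_to('CCGTATGGCATGA', 'ATG') # => ATGGCATGA
```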
|
9185
|
+
|
9186
|
+
### Truncating output in the bioroebe-shell
|
9187
|
+
![alt text][cat1]
|
9188
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
9189
|
+
|
9190
|
+
**DNA/RNA sequences** can become very long and then become
|
9191
|
+
quite difficult to view, read and handle on the commandline.
|
9192
|
+
|
9193
|
+
Normally the bioroebe shell will truncate output of DNA sequences
|
9194
|
+
that are "too long". This is mostly done so that working with
|
9195
|
+
very long sequences becomes a bit more convenient.
|
9196
|
+
|
9197
|
+
Sometimes this can become an antifeature, though, so the user
|
9198
|
+
must be able to toggle this at his or her own discretion.
|
9199
|
+
|
9200
|
+
By default, the bioroebe-shell (bioshell) will always try
|
9201
|
+
to truncate output, but you can toggle this behaviour by
|
9202
|
+
issuing:
|
9203
|
+
|
9204
|
+
do not truncate
|
9205
|
+
|
9206
|
+
In theory, other "do not" actions are also supported, or will
|
9207
|
+
be supported in the future; right now (Oct 2019) this is a bit
|
9208
|
+
limited.
|
9209
|
+
|
9210
|
+
From the toplevel, you can use this method:
|
9211
|
+
|
9212
|
+
Bioroebe.do_not_truncate
|
9213
|
+
|
9214
|
+
The above instruction will toggle the truncate behaviour
|
9215
|
+
to not truncate, ever.
|
9216
|
+
|
9217
|
+
If you need to do so within the bioshell, this is the way:
|
9218
|
+
|
9219
|
+
no_truncate
|
9220
|
+
|
9221
|
+
Or simply
|
9222
|
+
|
9223
|
+
truncate
|
9224
|
+
|
9225
|
+
This will toggle, like a switch.
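
As an illustration of such a switch, consider the following sketch. The class name TruncatingDisplay, the cutoff and the " [...]" marker are assumptions made for this example; the real shell may truncate differently.

```ruby
# A tiny display helper with a truncate switch that can be toggled.
class TruncatingDisplay
  def initialize(max_length = 20)
    @max_length = max_length
    @truncate   = true # truncate by default, like the bioshell
  end

  # Flip the switch: on becomes off, off becomes on.
  def toggle_truncate
    @truncate = !@truncate
  end

  # Return the sequence as it would be shown to the user.
  def render(sequence)
    return sequence unless @truncate && sequence.size > @max_length
    "#{sequence[0, @max_length]} [...]"
  end
end

display = TruncatingDisplay.new(10)
puts display.render('ATGC' * 5) # shortened to 10 nucleotides
display.toggle_truncate
puts display.render('ATGC' * 5) # full sequence
```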
|
9226
|
+
|
9227
|
+
## Support for other programming languages
|
9228
|
+
|
9229
|
+
The main programming language for the bioroebe project is **ruby**.
|
9230
|
+
Ruby, from a language design point of view, is a great programming
|
9231
|
+
language - not necessarily all of ruby, but the subset that I use.
|
9232
|
+
It is very easy to quickly prototype ideas via ruby.
|
9233
|
+
|
9234
|
+
However, ruby is known to **not** be among the fastest programming
|
9235
|
+
languages on this planet; so, it makes sense to use other
|
9236
|
+
languages too from this point of view. Additionally there are some
|
9237
|
+
software stacks in use in **other** programming languages, such as
|
9238
|
+
matplotlib and various more.
|
9239
|
+
|
9240
|
+
Thus, it is important to **support other programming languages** as
|
9241
|
+
well, if there are useful libraries. The bioroebe project, after
|
9242
|
+
all, tries to be **practical**: it focuses on getting things done,
|
9243
|
+
no matter the language.
|
9244
|
+
|
9245
|
+
This means that support for other programming languages can be
|
9246
|
+
found in this project as well, often using system() or similar
|
9247
|
+
functionality to tap into these other programming languages. Do
|
9248
|
+
not be surprised when that happens - the bioroebe project will
|
9249
|
+
also try to act as a **practical glue** towards functionality
|
9250
|
+
enabled via other projects. We want to get things done, no
|
9251
|
+
matter the programming language at hand!
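
A rough sketch of this kind of glue is shown next. It uses Open3 from the ruby standard library instead of system(), so that the output can be captured; run_external_tool is a made-up helper name and not part of bioroebe. The demo call uses the running ruby interpreter so that the snippet works without extra installs; a Python script would be called the same way, with 'python3' and the path to the script.

```ruby
require 'open3'
require 'rbconfig'

# Run an external program and return what it printed on stdout.
# Hypothetical helper for this example only.
def run_external_tool(executable, *arguments)
  output, status = Open3.capture2(executable, *arguments)
  raise "#{executable} failed with status #{status.exitstatus}" unless status.success?
  output.strip
end

# Convert RNA to DNA in a separate process and read the result back.
puts run_external_tool(RbConfig.ruby, '-e', 'print "AUGC".tr("U", "T")') # => ATGC
```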
|
9252
|
+
|
9253
|
+
Whenever possible, though, the bioroebe project will try to be
|
9254
|
+
flexible in this regard, so ideally the same solution should
|
9255
|
+
work for many different programming languages.
|
9256
|
+
|
9257
|
+
While Ruby is the primary language for this project, as
|
9258
|
+
of 2021 I will try to officially support **java**, **jruby**
|
9259
|
+
and the **GraalVM**. This is on my TODO list, though - stay
|
9260
|
+
tuned for more updates in this regard. See also the
|
9261
|
+
subsection <b>Support for Python</b>.
|
9262
|
+
|
9263
|
+
## Support for Python
|
9264
|
+
|
9265
|
+
In <b>June 2022</b> I decided to add support for Python to bioroebe.
|
9266
|
+
|
9267
|
+
While people can - and should - easily use <b>biopython</b> instead,
|
9268
|
+
I simply wanted to see how much python-support I can add to
|
9269
|
+
bioroebe. This may lag behind some years compared to biopython,
|
9270
|
+
but I wanted to extend python support as well, so there you go.
|
9271
|
+
It is simply an additional option for the bioroebe project.
|
9272
|
+
<b>Ruby</b> will remain the primary language for the project,
|
9273
|
+
though, at least for now.
|
9274
|
+
|
9275
|
+
## Bioroebe::ProfilePattern
|
9276
|
+
|
9277
|
+
This class can be used to generate nucleotide sequences that
|
9278
|
+
are not quite "random". For example, to generate sequences
|
9279
|
+
that may "simulate" a TATA box.
|
9280
|
+
|
9281
|
+
The idea for this class is to be extended into allowing
|
9282
|
+
HMMs (Hidden Markov Models) one day.
|
9283
|
+
|
9284
|
+
Usage example:
|
9285
|
+
|
9286
|
+
_ = Bioroebe::ProfilePattern.new(ARGV, :do_not_run_yet)
|
9287
|
+
_.generate_sequence_based_on_this_profile
|
9288
|
+
|
9289
|
+
Such a profile specifies the preferred sequence
|
9290
|
+
letters for each position in a section of DNA. You have to provide
|
9291
|
+
the Hash to the method generate_sequence_based_on_this_profile() -
|
9292
|
+
or you use the default Hash, which is stored in the constant
|
9293
|
+
called **PER_POSITION_HASH**.
|
9294
|
+
|
9295
|
+
That profile should be a Hash, with keys pointing to A, T, C, G
|
9296
|
+
and the values being an Array of likelihoods, each given
|
9297
|
+
as a number, such as 140. These values are also called
|
9298
|
+
**scores**. Each score contains a number for each position
|
9299
|
+
that indicates how likely it is to find the given
|
9300
|
+
nucleotide at that location.
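
To make the layout of such a Hash more concrete, here is a standalone sketch. The scores in EXAMPLE_PROFILE are invented for illustration and are not the values of PER_POSITION_HASH; sequence_from_profile and weighted_pick are hypothetical helper names as well.

```ruby
# One Array per nucleotide, one score per position (here: 4 positions).
# The numbers are made up; position 1 strongly prefers T, position 2 A, etc.
EXAMPLE_PROFILE = {
  'A' => [ 10, 140,  10, 140],
  'T' => [140,  10, 140,  10],
  'C' => [  5,   5,   5,   5],
  'G' => [  5,   5,   5,   5]
}.freeze

# Pick one letter; the chance of a letter is proportional to its score.
# Assumes integer scores with a sum greater than zero.
def weighted_pick(scores, rng)
  target = rng.rand(scores.values.sum)
  scores.each do |letter, score|
    return letter if target < score
    target -= score
  end
end

# Build a sequence by drawing one nucleotide for each position of the profile.
def sequence_from_profile(profile, rng = Random.new)
  length = profile.values.first.size
  (0...length).map { |position|
    scores = profile.transform_values { |per_position| per_position[position] }
    weighted_pick(scores, rng)
  }.join
end

puts sequence_from_profile(EXAMPLE_PROFILE) # most of the time: TATA
```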
|
9301
|
+
|
9302
|
+
You can also use this class to generate a random DNA string,
|
9303
|
+
similar to the method called
|
9304
|
+
**Bioroebe.generate_random_dna_sequence()**. The difference
|
9305
|
+
is that class ProfilePattern allows for a bit more fine-tuned
|
9306
|
+
control. The class will likely be extended in the future too.
|
9307
|
+
|
9308
|
+
## Generate DNA via Bioroebe.random_dna
|
9309
|
+
|
9310
|
+
You can "generate" random DNA strings by making use of the
|
9311
|
+
following code:
|
9312
|
+
|
9313
|
+
x = Bioroebe.random_dna 50 # => "AGACATCCGGCTTGGATACCTCATAAGTCATATCAGCATCGTCGGACATT"
|
9314
|
+
|
9315
|
+
As can be seen in the example above, after the #, a String will be
|
9316
|
+
returned representing that nucleotide sequence. In the case above
|
9317
|
+
it'll be 50 nucleotides in length.
|
9318
|
+
|
9319
|
+
The number given to <b>.random_dna()</b> tells the method how many
|
9320
|
+
nucleotides should be generated.
|
9321
|
+
|
9322
|
+
The method accepts a second argument, which should be a Hash.
|
9323
|
+
If it is a hash then the generated DNA will be based on the
|
9324
|
+
**probabilities** given to that Hash.
|
9325
|
+
|
9326
|
+
Let's look at a specific example here:
|
9327
|
+
|
9328
|
+
Bioroebe.random_dna(50, { A: 10, T: 10, C: 10, G: 70}) # => "GGGGTGGGGAGGGTATGCGGAGGAAGGGCGGGAAGGGCGGGGGCTGGGCG"
|
9329
|
+
|
9330
|
+
As you can see, in the Hash defined above, the likelihood for
|
9331
|
+
incorporating a Guanine is much higher than for Adenine
|
9332
|
+
(70 : 10). This will be reflected in the generated DNA
|
9333
|
+
sequence which, as can be seen, contains many more
|
9334
|
+
Guanines than Adenines.
|
9335
|
+
|
9336
|
+
There is yet a third use case for the above. If you pass a **String**
|
9337
|
+
as the second argument rather than a Hash, then that String will be
|
9338
|
+
used as basis for generating the DNA string at hand.
|
9339
|
+
|
9340
|
+
Again, let's look at a specific example here:
|
9341
|
+
|
9342
|
+
Bioroebe.random_dna(10, 'ATCGATCGGG')
|
9343
|
+
|
9344
|
+
Here the String contains more G than A, T or C, so the new DNA sequence should
|
9345
|
+
contain proportionally more G as well.
|
9346
|
+
|
9347
|
+
More usage examples in this regard:
|
9348
|
+
|
9349
|
+
Bioroebe.random_dna(20, 'ATGGGGGGGG') # => "TGAGGGGGGGGGTGGGAGGG"
|
9350
|
+
Bioroebe.random_dna(20, 'ATGGGGGGGG') # => "GGTAGGGGGGGGTAGGGGGG"
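
How a Hash or a String can both act as the basis for the random choice may be easier to see in a small standalone sketch. This is not the code behind Bioroebe.random_dna; weighted_random_dna is a hypothetical method, and it assumes that the weights in the Hash are non-negative integers.

```ruby
# Draw nucleotides from a "pool". A Hash such as { A: 10, G: 70 } is expanded
# into a pool with 10 A and 70 G; a String is used as the pool directly.
def weighted_random_dna(length, source = 'ATCG', rng = Random.new)
  pool =
    if source.is_a?(Hash)
      source.flat_map { |nucleotide, weight| [nucleotide.to_s] * weight }
    else
      source.chars
    end
  Array.new(length) { pool.sample(random: rng) }.join
end

puts weighted_random_dna(50, { A: 10, T: 10, C: 10, G: 70 }) # mostly G
puts weighted_random_dna(20, 'ATGGGGGGGG')                   # mostly G as well
```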
|
9351
|
+
|
9352
|
+
Note that this is similar to the .randomize() method in the bioruby
|
9353
|
+
project:
|
9354
|
+
|
9355
|
+
hash = {'a'=>1,'c'=>2,'g'=>3,'t'=>4}
|
9356
|
+
puts Bio::Sequence::NA.randomize(hash) # => "ggcttgttac" (for example)
|
9357
|
+
|
9358
|
+
## Parsing genbank (.gbk) files
|
9359
|
+
|
9360
|
+
You could use Bioroebe::GenbankParser to parse .gbk files, at the
|
9361
|
+
least if you want to obtain the raw sequence, in FASTA format.
|
9362
|
+
|
9363
|
+
Example for this:
|
9364
|
+
|
9365
|
+
require 'bioroebe/parsers/genbank_parser.rb'
|
9366
|
+
result = Bioroebe::GenbankParser.new('/home/Temp/bioroebe/ls_orchid.gbk')
|
9367
|
+
result.dataset? # This method call will return the FASTA sequence.
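
What "obtaining the raw sequence in FASTA format" amounts to can be sketched without the gem. The following is not the GenbankParser code: genbank_to_fasta is a hypothetical method, the record is shortened for the example, and only the LOCUS line and the ORIGIN block of the first entry are looked at.

```ruby
# Extract the first sequence of a GenBank record and return it as FASTA.
def genbank_to_fasta(genbank_text)
  locus    = genbank_text[/^LOCUS\s+(\S+)/, 1] || 'unknown'
  # Everything between the ORIGIN line and the closing // line.
  origin   = genbank_text[/^ORIGIN.*?\n(.*?)^\/\//m, 1].to_s
  # Remove position numbers and whitespace, keep only the letters.
  sequence = origin.gsub(/[^a-zA-Z]/, '').upcase
  ">#{locus}\n#{sequence}"
end

record = <<~GENBANK
  LOCUS       Z78533   20 bp    DNA
  ORIGIN
          1 cgtaacaagg tttccgtagg
  //
GENBANK

puts genbank_to_fasta(record)
# >Z78533
# CGTAACAAGGTTTCCGTAGG
```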
|
9368
|
+
|
9369
|
+
Note that this currently (<b>July 2022</b>) only grabs one entry. In
|
9370
|
+
an upcoming rewrite the parser will be able to parse
|
9371
|
+
all entries, and then present them to the user. Stay tuned in this
|
9372
|
+
regard.
|
9373
|
+
|
9374
|
+
## Parsers in general
|
9375
|
+
|
9376
|
+
The bioroebe project will store most parsers in the parsers/ subdirectory
|
9377
|
+
as of <b>July 2022</b>.
|
9378
|
+
|
9379
|
+
Prior to that date different parsers were stored in different subdirectories,
|
9380
|
+
such as the parser for genbank-files being stored in the genbank/
|
9381
|
+
subdirectory. As I found this situation confusing, I settled for
|
9382
|
+
the parsers/ subdirectory as of <b>July 2022</b>.
|
9186
9383
|
|
9187
9384
|
## Possibly useful links in regards to molecular biology and science in general
|
9188
9385
|
|