bioroebe 0.10.80 → 0.11.12
Potentially problematic release.
This version of bioroebe might be problematic.
- checksums.yaml +4 -4
- data/README.md +507 -310
- data/bioroebe.gemspec +3 -3
- data/doc/README.gen +506 -309
- data/doc/todo/bioroebe_todo.md +29 -40
- data/lib/bioroebe/aminoacids/display_aminoacid_table.rb +1 -0
- data/lib/bioroebe/base/colours_for_base/colours_for_base.rb +18 -8
- data/lib/bioroebe/base/commandline_application/commandline_arguments.rb +13 -11
- data/lib/bioroebe/base/commandline_application/misc.rb +18 -8
- data/lib/bioroebe/base/prototype/misc.rb +1 -1
- data/lib/bioroebe/codons/show_codon_tables.rb +6 -2
- data/lib/bioroebe/constants/aminoacids_and_proteins.rb +1 -0
- data/lib/bioroebe/constants/files_and_directories.rb +8 -1
- data/lib/bioroebe/count/count_amount_of_nucleotides.rb +3 -0
- data/lib/bioroebe/gui/gtk3/protein_to_DNA/protein_to_DNA.rb +18 -18
- data/lib/bioroebe/gui/shared_code/protein_to_DNA/protein_to_DNA_module.rb +14 -14
- data/lib/bioroebe/parsers/genbank_parser.rb +353 -24
- data/lib/bioroebe/python/README.md +1 -0
- data/lib/bioroebe/python/__pycache__/mymodule.cpython-39.pyc +0 -0
- data/lib/bioroebe/python/gui/gtk3/widget1.py +22 -0
- data/lib/bioroebe/python/mymodule.py +8 -0
- data/lib/bioroebe/python/protein_to_dna.py +30 -0
- data/lib/bioroebe/python/shell/shell.py +19 -0
- data/lib/bioroebe/python/to_rna.py +14 -0
- data/lib/bioroebe/python/toplevel_methods/to_camelcase.py +11 -0
- data/lib/bioroebe/sequence/nucleotide_module/nucleotide_module.rb +28 -25
- data/lib/bioroebe/sequence/sequence.rb +54 -2
- data/lib/bioroebe/shell/menu.rb +3336 -3304
- data/lib/bioroebe/shell/readline/readline.rb +1 -1
- data/lib/bioroebe/shell/shell.rb +11233 -28
- data/lib/bioroebe/siRNA/siRNA.rb +81 -1
- data/lib/bioroebe/string_matching/find_longest_substring.rb +3 -2
- data/lib/bioroebe/toplevel_methods/aminoacids_and_proteins.rb +31 -24
- data/lib/bioroebe/toplevel_methods/nucleotides.rb +22 -5
- data/lib/bioroebe/toplevel_methods/open_in_browser.rb +2 -0
- data/lib/bioroebe/toplevel_methods/to_camelcase.rb +5 -0
- data/lib/bioroebe/version/version.rb +2 -2
- data/lib/bioroebe/yaml/configuration/browser.yml +1 -1
- data/lib/bioroebe/yaml/restriction_enzymes/restriction_enzymes.yml +3 -3
- metadata +17 -36
- data/doc/setup.rb +0 -1655
- data/lib/bioroebe/genbank/genbank_parser.rb +0 -291
- data/lib/bioroebe/shell/add.rb +0 -108
- data/lib/bioroebe/shell/assign.rb +0 -360
- data/lib/bioroebe/shell/chop_and_cut.rb +0 -281
- data/lib/bioroebe/shell/constants.rb +0 -166
- data/lib/bioroebe/shell/download.rb +0 -335
- data/lib/bioroebe/shell/enable_and_disable.rb +0 -158
- data/lib/bioroebe/shell/enzymes.rb +0 -310
- data/lib/bioroebe/shell/fasta.rb +0 -345
- data/lib/bioroebe/shell/gtk.rb +0 -76
- data/lib/bioroebe/shell/history.rb +0 -132
- data/lib/bioroebe/shell/initialize.rb +0 -217
- data/lib/bioroebe/shell/loop.rb +0 -74
- data/lib/bioroebe/shell/misc.rb +0 -4341
- data/lib/bioroebe/shell/prompt.rb +0 -107
- data/lib/bioroebe/shell/random.rb +0 -289
- data/lib/bioroebe/shell/reset.rb +0 -335
- data/lib/bioroebe/shell/scan_and_parse.rb +0 -135
- data/lib/bioroebe/shell/search.rb +0 -337
- data/lib/bioroebe/shell/sequences.rb +0 -200
- data/lib/bioroebe/shell/show_report_and_display.rb +0 -2901
- data/lib/bioroebe/shell/startup.rb +0 -127
- data/lib/bioroebe/shell/taxonomy.rb +0 -14
- data/lib/bioroebe/shell/tk.rb +0 -23
- data/lib/bioroebe/shell/user_input.rb +0 -88
- data/lib/bioroebe/shell/xorg.rb +0 -45
data/doc/README.gen
CHANGED
@@ -5,7 +5,7 @@ ADD_TIME_STAMP
|
|
5
5
|
|
6
6
|
## Bioroebe
|
7
7
|
|
8
|
-
<img src="
|
8
|
+
<img src="https://i.imgur.com/mAoP7AP.png">
|
9
9
|
<img src="https://i.imgur.com/YqYxRBZ.png" style="margin: 4px; margin-left: 12px;"/>
|
10
10
|
<img src="https://i.imgur.com/k7mMlg2.png" style="margin: 4px; margin-left: 12px;"/>
|
11
11
|
|
@@ -332,41 +332,6 @@ so I opted to go the yaml route. But if people want to use a hash
|
|
332
332
|
instead, they can do so, too - see the <b>API</b> for codon tables
|
333
333
|
lateron. Simply define your own constants and pass them to the
|
334
334
|
appropriate methods.
|
335
|
-
|
336
|
-
## Support for other programming languages
|
337
|
-
|
338
|
-
The main programming language for the bioroebe project is **ruby**.
|
339
|
-
Ruby, from a language design point of view, is a great programming
|
340
|
-
language - not necessarily all of ruby, but the subset that I use.
|
341
|
-
It is very easy to quickly prototype ideas via ruby.
|
342
|
-
|
343
|
-
However had, ruby is known to **not** be among the fastest programming
|
344
|
-
languages about on this planet; so, it makes sense to use other
|
345
|
-
languages too from this point of view. Additionally there are some
|
346
|
-
software stacks in use in **other** programming languages, such as
|
347
|
-
matplotlib and various more.
|
348
|
-
|
349
|
-
Thus, it is important to **support other programming languages** as
|
350
|
-
well, if there are useful libraries. The bioroebe project, after
|
351
|
-
all, tries to be **practical**: it focuses on getting things done,
|
352
|
-
no matter the language.
|
353
|
-
|
354
|
-
This means that support for other programming languages can be
|
355
|
-
found in this project as well, often using system() or similar
|
356
|
-
functionality to tap into these other programming languages. Do
|
357
|
-
not be surprised when that happens - the bioroebe project will
|
358
|
-
also try to act as a **practical glue** towards functionality
|
359
|
-
enabled via other projects. We want to get things done, no
|
360
|
-
matter the programming language at hand!
|
361
|
-
|
362
|
-
Whenever possible, though, the bioroebe project will try to be
|
363
|
-
flexible in this regard, so ideally the same solution should
|
364
|
-
work for many different programming languages.
|
365
|
-
|
366
|
-
While Ruby is the primary language for this project, since as
|
367
|
-
of 2021 I will try to officially support **java**, **jruby**
|
368
|
-
and the **GraalVM**. This is on my TODO list, though - stay
|
369
|
-
tuned for more updates in this regard.
|
370
335
|
|
371
336
|
## Readline support in the BioRoebe project
|
372
337
|
|
@@ -550,16 +515,16 @@ the DNA-to-Protein translation is somewhat simply kept as a
|
|
550
515
|
Once you are inside a **running Bioshell**, you can do other **commands**
|
551
516
|
such as this one here:
|
552
517
|
|
553
|
-
random # ← This will generate a random DNA sequence.
|
518
|
+
random # ← This will generate a random DNA sequence. Each nucleotide has the same chance to be added.
|
554
519
|
|
555
520
|
To **assign** a DNA sequence, do:
|
556
521
|
|
557
522
|
assign ATAGGGCTTTT
|
558
523
|
|
559
|
-
Note that since the year 2016
|
560
|
-
the one above, without any other commands/words, then we will assume
|
524
|
+
Note that since as of the year <b>2016</b>, if you input a nucleotide sequence
|
525
|
+
like the one above, without any other commands/words, then we will assume
|
561
526
|
that you did mean to do an assignment as-is anyway. The "assign" part
|
562
|
-
then becomes superfluous.
|
527
|
+
then becomes superfluous and can be omitted.
|
563
528
|
|
564
529
|
This is how this is simply done, by omitting the "assign" part of the
|
565
530
|
above instruction altogether:
|
@@ -1070,18 +1035,18 @@ The text **banana** thus has the following suffixes:
|
|
1070
1035
|
|
1071
1036
|
This subsection deals with some aspects of **HMMs**.
|
1072
1037
|
|
1073
|
-
Why are HMMs useful in biology? They can be used to represent protein
|
1074
|
-
families
|
1038
|
+
Why are HMMs useful in biology? They can be used to <b>represent protein
|
1039
|
+
families</b>, for example (via <b>pHMMs</b> - profile hidden markov models).
|
1075
1040
|
|
1076
1041
|
Furthermore, they can show some bias in the mutation rate that can be
|
1077
1042
|
observed. Different genomes are known to have different hotspots where
|
1078
|
-
mutations are more likely to happen. These are
|
1079
|
-
may be useful.
|
1043
|
+
mutations are more likely to happen, for various reasons. These are
|
1044
|
+
examples where a HMM may be useful.
|
1080
1045
|
|
1081
|
-
HMMs are usually based on the Shannon model where you assign different
|
1046
|
+
HMMs are usually based on the <b>Shannon model</b> where you assign different
|
1082
1047
|
probabilities to "change" events. An example that was mentioned back
|
1083
|
-
in 1948 was the english alphabet - some letters, and combinations
|
1084
|
-
letters, are more commonly seen. Shannon gave the example of "E"
|
1048
|
+
in <b>1948</b> was the english alphabet - some letters, and combinations
|
1049
|
+
of letters, are more commonly seen. Shannon gave the example of "E"
|
1085
1050
|
versus "W", as shown in the following graph (a **finite state
|
1086
1051
|
graph**):
|
1087
1052
|
|
@@ -1095,40 +1060,47 @@ DNA sequence, a 10-mer would be equivalent to **10 base pairs**.
|
|
1095
1060
|
The individual transition states are based on an assumption of
|
1096
1061
|
"randomness", but ensuring that these are truly random is not
|
1097
1062
|
necessarily trivial. Computers do not really 'generate' true
|
1098
|
-
randomness, at the least not when they are working solo
|
1099
|
-
can even 'predict' some randomness here or there
|
1100
|
-
|
1101
|
-
|
1102
|
-
|
1103
|
-
|
1104
|
-
of
|
1105
|
-
|
1106
|
-
given position, but this is not
|
1107
|
-
|
1108
|
-
|
1109
|
-
|
1110
|
-
|
1111
|
-
|
1112
|
-
|
1063
|
+
randomness, at the least not when they are working solo, "on
|
1064
|
+
their own". You can even 'predict' some randomness here or there
|
1065
|
+
via various techniques - see vulnerabilities such as <b>Specter</b>
|
1066
|
+
or similar variants where software can read from areas of the
|
1067
|
+
memory that should be inaccessible to them. Some of this is based
|
1068
|
+
on co-predictions. For distributed computers, you may often use
|
1069
|
+
random noise or decay of atoms as 'a source of randomness'. For
|
1070
|
+
any DNA nucleotide sequence, we would assume that each base pair
|
1071
|
+
has a 25% chance to exist at any given position, but this is not
|
1072
|
+
necessarily true, again for various reasons.
|
1073
|
+
|
1074
|
+
An interesting thought is ... why is <b>ATP</b> so important?
|
1075
|
+
Yes, of course due to it being 'the energy currency in a cell' but ..
|
1076
|
+
why is this ATP, aka adenine? Why not GTP, aka guanine or any of
|
1077
|
+
the other two nucleotides? (GTP is used too, but why? Why not
|
1078
|
+
CTP and TTP?) I can not answer this question; there may
|
1079
|
+
be many reasons, including differential chemical storage power
|
1080
|
+
as well as mere random chance event in evolution, but for whatever
|
1113
1081
|
the reason, you will not find a complete 25% percentage value
|
1114
1082
|
for every given "slot" in DNA, depending on the organism.
|
1115
1083
|
|
1116
1084
|
From a practical point of view, how can we approach Hidden Markov
|
1117
|
-
Models?
|
1085
|
+
Models and use them?
|
1118
1086
|
|
1119
|
-
Let's take the following sequence:
|
1087
|
+
Let's take the following simple sequence:
|
1120
1088
|
|
1121
1089
|
ACGTACGC
|
1122
1090
|
|
1123
1091
|
From this sequence we can see that the <b>3-mer</b> "ACG"
|
1124
1092
|
is followed by either a T, or a C. Have a look at the sequence
|
1125
|
-
to see if you can identify the two ACG subsequences
|
1093
|
+
again to see if you can identify the two ACG subsequences
|
1094
|
+
there. You can see one at the start, and the other one
|
1095
|
+
following a bit later, hence why we come to the conclusion
|
1096
|
+
that either a T or a C will follow this <b>3-mer</b>.
|
1126
1097
|
|
1127
|
-
The probability of either T or C
|
1128
|
-
for A and G to follow there
|
1129
|
-
be ignored.
|
1098
|
+
The probability of either T or C to occur on <b>that</b>
|
1099
|
+
position, thus, is 0.5 (50%); for A and G to follow there
|
1100
|
+
is 0% so the latter two can be ignored.
|
1130
1101
|
|
1131
|
-
Thus, we could use a ruby Hash as follows
|
1102
|
+
Thus, we could use a ruby Hash as follows that should
|
1103
|
+
describe these probabilities:
|
1132
1104
|
|
1133
1105
|
probabilities = {'T': 0.5, 'C': 0.5} # ignoring A and G here, but we could denote them via 0 as well
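Such a Hash does not have to be written by hand. What follows is a minimal sketch, in plain ruby, of how the probabilities could be counted from the sequence itself. The method name `follower_probabilities` is made up for this illustration and is not part of the Bioroebe API; it also uses String keys rather than the Symbol keys shown above.

```ruby
# Hypothetical helper (not part of Bioroebe): count which nucleotide
# follows each occurrence of a k-mer and turn the counts into
# probabilities.
def follower_probabilities(sequence, kmer)
  counts = Hash.new(0)
  k = kmer.size
  # Visit every start position that still leaves room for one follower.
  (0..(sequence.size - k - 1)).each { |index|
    next unless sequence[index, k] == kmer
    counts[sequence[index + k]] += 1
  }
  total = counts.values.sum
  return {} if total.zero? # the k-mer never occurs with a follower
  counts.transform_values { |count| count.to_f / total }
end

probabilities = follower_probabilities('ACGTACGC', 'ACG')
p probabilities # T and C each end up with 0.5
```

Nucleotides that never follow the k-mer (here A and G) are simply absent from the result, which matches the "can be ignored" reasoning above.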
|
1134
1106
|
|
@@ -1214,34 +1186,6 @@ each edge.
|
|
1214
1186
|
Parsimony assumes that substitutions are rare and that back-mutations
|
1215
1187
|
do not occur.
|
1216
1188
|
|
1217
|
-
## Random stuff
|
1218
|
-
|
1219
|
-
You can generate random DNA sequences in the shell:
|
1220
|
-
|
1221
|
-
random dna 20
|
1222
|
-
random dna 25
|
1223
|
-
random dna 30
|
1224
|
-
|
1225
|
-
This will generate random DNA sequences, with a length
|
1226
|
-
of 20, 25, 30, respectively. This may not be very useful
|
1227
|
-
but it was important that this functionality is made
|
1228
|
-
available somewhere.
|
1229
|
-
|
1230
|
-
You can also use some toplevel-methods to generate, e. g.
|
1231
|
-
20 random aminoacids:
|
1232
|
-
|
1233
|
-
Bioroebe.random_aminoacid? 20 # => "UAVHYQQESWUYAOVESEIY"
|
1234
|
-
|
1235
|
-
Note that there may exist other APIs within the Bioroebe project
|
1236
|
-
that do the same as well.
|
1237
|
-
|
1238
|
-
If you would like to use a ruby-gtk3 widget have a look
|
1239
|
-
at **RandomSequence**, under **bioroebe/gtk3/random_sequence/**.
|
1240
|
-
It works with aminoacids, DNA and RNA, and allows the user to
|
1241
|
-
create random sequences. (If you need weighted randomness then
|
1242
|
-
you currently have to use the commandline variant. Perhaps I may
|
1243
|
-
add support into the GUI directly for this one day.)
|
1244
|
-
|
1245
1189
|
## Displaying the main sequence with delimiter characters
|
1246
1190
|
|
1247
1191
|
From within the <b>bioshell</b>, you can use some alternative ways to
|
@@ -2711,18 +2655,6 @@ This may look as follows:
|
|
2711
2655
|
|
2712
2656
|
<img src="https://i.imgur.com/gAZg8qG.png" style="margin: 1em; margin-left: 3em">
|
2713
2657
|
|
2714
|
-
## Obtaining a subsequence from a Bioroebe::Sequence object
|
2715
|
-
|
2716
|
-
Say that you have the DNA sequence **ATGCATGCAAAA**.
|
2717
|
-
|
2718
|
-
There are several ways how to obtain a subsequence from
|
2719
|
-
this. One variant will be shown next, by making use of
|
2720
|
-
the method called **.subseq()**.
|
2721
|
-
|
2722
|
-
Example:
|
2723
|
-
|
2724
|
-
seq = Bioroebe::Sequence.new("ATGCATGCAAAA"); seq.subseq(1,3) # => "ATG"
|
2725
|
-
|
2726
2658
|
## Bioroebe::Protein
|
2727
2659
|
|
2728
2660
|
This class is a subclass of class **Bioroebe::Sequence**. The
|
@@ -2737,16 +2669,6 @@ functionality is also available in another method.
|
|
2737
2669
|
For now keep this in mind; at some later point I may decide whether
|
2738
2670
|
this class is to be kept or not.
|
2739
2671
|
|
2740
|
-
## Permanently disabling showing the startup-introduction of the Bioshell
|
2741
|
-
|
2742
|
-
If you do not want to see the start-up intro, you can try
|
2743
|
-
any of the following:
|
2744
|
-
|
2745
|
-
bioshell --permanently-disable-startup-intro
|
2746
|
-
bioshell --permanently-disable-startup-notice
|
2747
|
-
bioshell --permanently-no-startup-intro
|
2748
|
-
bioshell --permanently-no-startup-info
|
2749
|
-
|
2750
2672
|
## Decoding aminoacids
|
2751
2673
|
|
2752
2674
|
Decoding aminoacids means to take the aminoacid at hand, ideally
|
@@ -3173,47 +3095,45 @@ can try to use:
|
|
3173
3095
|
On class Bioroebe::Sequence. More customizability may be added
|
3174
3096
|
to that method in this regard, if users need this.
|
3175
3097
|
|
3176
|
-
|
3098
|
+
### Obtaining a subsequence from a Bioroebe::Sequence object
|
3177
3099
|
|
3178
|
-
|
3179
|
-
the **bioshell**.
|
3100
|
+
Say that you have the DNA sequence **ATGCATGCAAAA**.
|
3180
3101
|
|
3181
|
-
|
3102
|
+
There are several ways how to obtain a subsequence from
|
3103
|
+
this. One variant will be shown next, by making use of
|
3104
|
+
the method called **.subseq()**.
|
3182
3105
|
|
3183
|
-
|
3106
|
+
Example:
|
3184
3107
|
|
3185
|
-
|
3108
|
+
seq = Bioroebe::Sequence.new("ATGCATGCAAAA"); seq.subseq(1,3) # => "ATG"
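The example above suggests that positions are counted from 1 (as biologists do) and that the end position is included. A minimal sketch of that convention on a plain ruby String is shown next. The method `subsequence` is hypothetical and only illustrates the idea; it is not how Bioroebe::Sequence implements .subseq().

```ruby
# Hypothetical helper (not part of Bioroebe): 1-based, inclusive
# subsequence on a plain String.
def subsequence(sequence, start_position, end_position)
  if start_position < 1 || end_position < start_position
    raise ArgumentError, 'positions must satisfy 1 <= start <= end'
  end
  # Ruby strings are 0-based, so both positions are shifted by one.
  sequence[(start_position - 1)..(end_position - 1)]
end

subsequence('ATGCATGCAAAA', 1, 3) # => "ATG"
```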
|
3186
3109
|
|
3187
|
-
You can
|
3188
|
-
code:
|
3110
|
+
You can also randomize the sequence, via .randomize().
|
3189
3111
|
|
3190
|
-
|
3112
|
+
Example:
|
3191
3113
|
|
3192
|
-
|
3193
|
-
returned representing that nucleotide sequence.
|
3114
|
+
x = Bioroebe::Sequence.new; x.randomize
|
3194
3115
|
|
3195
|
-
|
3196
|
-
should be generated.
|
3116
|
+
This is similar to the method in Bioruby here:
|
3197
3117
|
|
3198
|
-
|
3118
|
+
https://github.com/bioruby/bioruby/blob/master/lib/bio/sequence/common.rb#L243
|
3199
3119
|
|
3200
|
-
|
3201
|
-
such as by issuing the following command:
|
3120
|
+
## The Hydropathy index
|
3202
3121
|
|
3203
|
-
|
3122
|
+
You can display the hydropathy index for aminoacids from within
|
3123
|
+
the **bioshell**.
|
3204
3124
|
|
3205
|
-
|
3125
|
+
Simply issue:
|
3206
3126
|
|
3207
|
-
|
3127
|
+
hydropathy?
|
3208
3128
|
|
3209
|
-
|
3129
|
+
## The GFF file format
|
3210
3130
|
|
3211
|
-
|
3131
|
+
From within the **bioshell** you can analyze .gff and .gff3 files,
|
3132
|
+
such as by issuing the following command:
|
3212
3133
|
|
3213
|
-
|
3134
|
+
gff3? foobar.gff3
|
3214
3135
|
|
3215
|
-
|
3216
|
-
compositions of the same nucleotide.
|
3136
|
+
Evidently for this to work the file at hand has to exist.
|
3217
3137
|
|
3218
3138
|
## The NCBI Taxonomy database (the Taxonomy submodule of the Bioroebe project)
|
3219
3139
|
|
@@ -3350,47 +3270,6 @@ nucleotides by issuing:
|
|
3350
3270
|
|
3351
3271
|
show_individual_weight_of_the_four_dna_nucleotides
|
3352
3272
|
|
3353
|
-
## Truncating output in the bioroebe-shell
|
3354
|
-
![alt text][cat1]
|
3355
|
-
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
3356
|
-
|
3357
|
-
**DNA/RNA sequences** can become very long and then become
|
3358
|
-
quite difficult to view, read and handle on the commandline.
|
3359
|
-
|
3360
|
-
Normally the bioroebe shell will truncate output of DNA sequences
|
3361
|
-
that are "too long". This is mostly done so that working with
|
3362
|
-
very long sequences becomes a bit more convenient.
|
3363
|
-
|
3364
|
-
Sometimes this can become an antifeature, though, so the user
|
3365
|
-
must be able to toggle this at his or her own discretion.
|
3366
|
-
|
3367
|
-
By default, the bioroebe-shell (bioshell) will always try
|
3368
|
-
to truncate output, but you can toggle this behaviour by
|
3369
|
-
issuing:
|
3370
|
-
|
3371
|
-
do not truncate
|
3372
|
-
|
3373
|
-
In theory, other "do not" actions are also supported, or will
|
3374
|
-
be supported in the future; right now (Oct 2019) this is a bit
|
3375
|
-
limited.
|
3376
|
-
|
3377
|
-
From the toplevel, you can use this method:
|
3378
|
-
|
3379
|
-
Bioroebe.do_not_truncate
|
3380
|
-
|
3381
|
-
The above instruction will toggle the truncate behaviour
|
3382
|
-
to not truncate, ever.
|
3383
|
-
|
3384
|
-
If you need to do so within the bioshell, this is the way:
|
3385
|
-
|
3386
|
-
no_truncate
|
3387
|
-
|
3388
|
-
Or simply
|
3389
|
-
|
3390
|
-
truncate
|
3391
|
-
|
3392
|
-
This will toggle, like a switch.
|
3393
|
-
|
3394
3273
|
## Rosalind Challenges
|
3395
3274
|
![alt text][cat1]
|
3396
3275
|
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
@@ -3527,31 +3406,6 @@ investing more time into Rosalind. Let's focus on solving
|
|
3527
3406
|
real, existing problems instead - at the least as far as
|
3528
3407
|
the Bioroebe project is concerned.
|
3529
3408
|
|
3530
|
-
## Numbers as input in the bioshell
|
3531
|
-
![alt text][cat1]
|
3532
|
-
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
3533
|
-
|
3534
|
-
You can input a number in the **BioShell** such as <b style="color: darkblue">3</b>.
|
3535
|
-
|
3536
|
-
This will attempt to <b>display the first 3 nucleotides</b> of
|
3537
|
-
the assigned **main sequence**. It will only work if you have
|
3538
|
-
assigned a sequence prior to that, though.
|
3539
|
-
|
3540
|
-
Examples:
|
3541
|
-
|
3542
|
-
3
|
3543
|
-
33
|
3544
|
-
15
|
3545
|
-
|
3546
|
-
## transeq
|
3547
|
-
![alt text][cat1]
|
3548
|
-
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
3549
|
-
|
3550
|
-
You can convert a DNA sequence into an aminoacid sequence by
|
3551
|
-
doing this:
|
3552
|
-
|
3553
|
-
transeq
|
3554
|
-
|
3555
3409
|
## Align two different sequences
|
3556
3410
|
![alt text][cat1]
|
3557
3411
|
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
@@ -3863,22 +3717,6 @@ does not (yet?) have support for comparing two genomes to
|
|
3863
3717
|
one another and generate a visual map indicating the findings
|
3864
3718
|
there.
|
3865
3719
|
|
3866
|
-
## Do not create directories on startup of the shell
|
3867
|
-
|
3868
|
-
By default the bioshell will try to create some directories
|
3869
|
-
on startup. This may not always be desired by the user
|
3870
|
-
though, so an option has to exist to disable this functionality.
|
3871
|
-
|
3872
|
-
Internally the variable @internal_hash[:create_directories_on_startup_of_the_shell]
|
3873
|
-
keeps track of whether directories on startup of the shell will
|
3874
|
-
be created.
|
3875
|
-
|
3876
|
-
To disable this behaviour on startup of the bioshell, try
|
3877
|
-
something like this:
|
3878
|
-
|
3879
|
-
bioshell --do-not-create-directories-on-startup
|
3880
|
-
bioshell --do-not-create-directories
|
3881
|
-
|
3882
3720
|
## class Bioroebe::MoveFileToItsCorrectLocation
|
3883
3721
|
|
3884
3722
|
This class will move a bio-file to its "correct" location, with respect
|
@@ -4047,39 +3885,6 @@ has". Genes in itself are not that well-defined, so they are not necessarily
|
|
4047
3885
|
the primary means of complexity. Think of this more as an interactome,
|
4048
3886
|
where RNAs play a major dynamic role as well.
|
4049
3887
|
|
4050
|
-
## Bioroebe::ProfilePattern
|
4051
|
-
|
4052
|
-
This class can be used to generate nucleotide sequences that
|
4053
|
-
are not quite "random". For example, to generate sequences
|
4054
|
-
that may "simulate" a TATA box.
|
4055
|
-
|
4056
|
-
The idea for this class is to be extended into allowing
|
4057
|
-
HMMs (Hidden Markov Models) one day.
|
4058
|
-
|
4059
|
-
Usage example:
|
4060
|
-
|
4061
|
-
_ = Bioroebe::ProfilePattern.new(ARGV, :do_not_run_yet)
|
4062
|
-
_.generate_sequence_based_on_this_profile
|
4063
|
-
|
4064
|
-
Such a profile will encode the profile specifying the preferred sequence
|
4065
|
-
letters for each position in a section of DNA. You have to provide
|
4066
|
-
the Hash into the method generate_sequence_based_on_this_profile() -
|
4067
|
-
or you use the default Hash, which is stored in the constant
|
4068
|
-
called **PER_POSITION_HASH**.
|
4069
|
-
|
4070
|
-
That profile should be a Hash, with keys pointing to A, T, C, G
|
4071
|
-
and the values being an Array of likelihood chance there,
|
4072
|
-
as a number, such as 140. These values are also called
|
4073
|
-
**scores**. Each score contains a number for each position
|
4074
|
-
that indicates how likely it is to find the given
|
4075
|
-
nucleotide at that location.
|
4076
|
-
|
4077
|
-
You can also use this class to generate a random DNA string,
|
4078
|
-
similar to the method called
|
4079
|
-
**Bioroebe.generate_random_dna_sequence()**. The difference
|
4080
|
-
is that class ProfilePattern allows for a bit more fine-tuned
|
4081
|
-
control. The class will likely be extended in the future too.
|
4082
|
-
|
4083
3888
|
## class Bioroebe::DisplayOpenReadingFrames
|
4084
3889
|
|
4085
3890
|
**class Bioroebe::DisplayOpenReadingFrames**, created in **May 2020**,
|
@@ -4459,28 +4264,6 @@ the BioRoebe-Shell, then you can use either of the following:
|
|
4459
4264
|
|
4460
4265
|
seq?
|
4461
4266
|
seq_with_tab?
|
4462
|
-
|
4463
|
-
## Prompt (the shell prompt9
|
4464
|
-
|
4465
|
-
You can set a <b>custom prompt</b>, via the keywords
|
4466
|
-
"prompt" or "set_prompt".
|
4467
|
-
|
4468
|
-
To display the <b>current working directory</b>, do:
|
4469
|
-
|
4470
|
-
prompt pwd
|
4471
|
-
|
4472
|
-
To revert to the old default again, do this:
|
4473
|
-
|
4474
|
-
prompt REVERT
|
4475
|
-
prompt revert
|
4476
|
-
prompt DEFAULT
|
4477
|
-
prompt default
|
4478
|
-
|
4479
|
-
If you do not want to set any prompt, do:
|
4480
|
-
|
4481
|
-
prompt none
|
4482
|
-
|
4483
|
-
|
4484
4267
|
|
4485
4268
|
## Leader and Trailer
|
4486
4269
|
|
@@ -5761,6 +5544,9 @@ like this:
|
|
5761
5544
|
|
5762
5545
|
<img src="https://i.imgur.com/vr2kEBz.png" style="margin: 1em; margin-left: 3em">
|
5763
5546
|
|
5547
|
+
Since as of <b>July 2022</b> invalid amino acids will be automatically
|
5548
|
+
filtered away before being assigned to the input.
|
5549
|
+
|
5764
5550
|
## Colourizing hydrophilic and hydrophobic aminoacids on the commandline
|
5765
5551
|
|
5766
5552
|
Via class **Bioroebe::ColourizeHydrophilicAndHydrophobicAminoacids** you
|
@@ -5774,35 +5560,36 @@ Example output for this:
|
|
5774
5560
|
|
5775
5561
|
This subsection contains some information about proteases.
|
5776
5562
|
|
5777
|
-
|
5563
|
+
Trypsin:
|
5778
5564
|
https://en.wikipedia.org/wiki/Trypsin
|
5779
|
-
cuts at
|
5565
|
+
<b>cuts at</b>: Trypsin cuts peptide chains mainly at the carboxyl
|
5780
5566
|
side of the amino acids lysine or arginine.
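To make the "carboxyl side" rule a bit more tangible, here is a minimal sketch in plain ruby. The method `simple_trypsin_digest` is a made-up name and not part of the Bioroebe API. It is a deliberate simplification: it cuts after every K (lysine) and R (arginine), and ignores known exceptions such as a proline following the cut site.

```ruby
# Hypothetical helper (not part of Bioroebe): split an aminoacid
# sequence after every lysine (K) or arginine (R).
def simple_trypsin_digest(aminoacid_sequence)
  # The lookbehind keeps K/R at the end of each fragment, that is,
  # the cut happens on the carboxyl side of these residues.
  aminoacid_sequence.upcase.split(/(?<=[KR])/)
end

simple_trypsin_digest('AKGRL') # => ["AK", "GR", "L"]
```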
|
5781
5567
|
|
5782
|
-
|
5568
|
+
Chymotrypsin:
|
5783
5569
|
https://en.wikipedia.org/wiki/Chymotrypsin
|
5784
|
-
cuts at
|
5570
|
+
<b>cuts at</b>: Chymotrypsin preferentially cleaves peptide amide
|
5785
5571
|
bonds where the side chain of the amino acid N-terminal
|
5786
|
-
to the scissile amide bond is a large hydrophobic amino
|
5787
|
-
acid (tyrosine, tryptophan, and phenylalanine).
|
5572
|
+
to the scissile amide bond is <b>a large hydrophobic amino</b>
|
5573
|
+
acid (specifically: tyrosine, tryptophan, and phenylalanine).
|
5574
|
+
Chymotrypsin will cleave proteins on the <b>carboxyl side</b>
|
5575
|
+
of aromatic or large hydrophobic amino acids.
|
5788
5576
|
|
5789
|
-
|
5577
|
+
Thrombin:
|
5790
5578
|
https://en.wikipedia.org/wiki/Thrombin
|
5791
|
-
cuts at
|
5579
|
+
<b>cuts at</b>: Thrombin acts as a serine protease that converts
|
5792
5580
|
soluble fibrinogen into insoluble strands of fibrin. It
|
5793
5581
|
catalyzes the hydrolysis of <b>Arg-Gly</b> bonds in
|
5794
5582
|
particular peptide sequences only.
|
5795
5583
|
|
5796
|
-
|
5584
|
+
Plasmin:
|
5797
5585
|
https://en.wikipedia.org/wiki/Plasmin
|
5798
|
-
cuts at
|
5586
|
+
<b>cuts at</b>: Plasmin is a serine protease.
|
5799
5587
|
|
5800
|
-
|
5588
|
+
Papain:
|
5801
5589
|
https://en.wikipedia.org/wiki/Papain
|
5802
|
-
cuts at
|
5803
|
-
|
5804
|
-
|
5805
|
-
not followed by a valine.
|
5590
|
+
<b>cuts at</b>: Papain prefers to cleave after an arginine or
|
5591
|
+
lysine preceded by a hydrophobic unit (Ala, Val, Leu, Ile,
|
5592
|
+
Phe, Trp, Tyr) and not followed by a valine.
|
5806
5593
|
|
5807
5594
|
factor Xa:
|
5808
5595
|
|
@@ -5814,8 +5601,8 @@ Some proteins may permanently reside in the lumen of the
|
|
5814
5601
|
Often such proteins will have a special signal sequence attached
|
5815
5602
|
to their **C-terminal part**, such as **KDEL** (Lys-Asp-Glu-Leu).
|
5816
5603
|
|
5817
|
-
KDEL is not the only signal that may be used, though. Some
|
5818
|
-
may use different signals, such as:
|
5604
|
+
<b>KDEL</b> is not the only signal that may be used, though. Some
|
5605
|
+
species may use different signals, such as:
|
5819
5606
|
|
5820
5607
|
aminoacids | species
|
5821
5608
|
-------------|------------------------------------------------------------
|
@@ -5825,8 +5612,9 @@ may use different signals, such as:
|
|
5825
5612
|
ADEL | Schizosaccharomyces pombe (fission yeast)
|
5826
5613
|
SDEL | Plasmodium falciparum
|
5827
5614
|
|
5828
|
-
If you work with the bioshell then you can simply use this
|
5829
|
-
to query whether the given aminoacid sequence has a KDEL
|
5615
|
+
If you work with the <b>bioshell</b> then you can simply use this
|
5616
|
+
method to query whether the given aminoacid sequence has a KDEL
|
5617
|
+
sequence:
|
5830
5618
|
|
5831
5619
|
KDEL?
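Outside of the bioshell the same check can be sketched in a few lines of plain ruby. The method name `has_retention_signal?` is hypothetical and not part of the Bioroebe API. It only tests whether the C-terminal end of the sequence matches one of the given signals; by default only KDEL is checked, and other signals from the table above can be passed in.

```ruby
# Hypothetical helper (not part of Bioroebe): does the aminoacid
# sequence end (C-terminal part) with one of the given signals?
def has_retention_signal?(aminoacid_sequence, signals = ['KDEL'])
  aminoacid_sequence.to_s.strip.upcase.end_with?(*signals)
end

has_retention_signal?('MAAAKDEL')                  # => true
has_retention_signal?('MKDELAAA')                  # => false
has_retention_signal?('MAAASDEL', %w( KDEL SDEL )) # => true
```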
|
5832
5620
|
|
@@ -7362,16 +7150,6 @@ This would notify the bioshell that only nucleotides from position
|
|
7362
7150
|
51 to (including) position 3251 will be colourized, when doing another
|
7363
7151
|
"ORF?" invocation.
|
7364
7152
|
|
7365
|
-
## Longest substring
|
7366
|
-
|
7367
|
-
Within the Bioroebe::Shell you can determine the longest substring,
|
7368
|
-
including gaps, like s:'
|
7369
|
-
|
7370
|
-
longest_substring? ATTATTGTT | ATTATTCTT'
|
7371
|
-
|
7372
|
-
Note that this will make use of the diff-lcs gem, which uses
|
7373
|
-
the McIlroy-Hunt algorithm.
|
7374
|
-
|
7375
7153
|
## Restriction Enzymes
|
7376
7154
|
|
7377
7155
|
This **subsection** will eventually be expanded to explain various things about
|
@@ -8730,6 +8508,22 @@ The images that can be generated via this may look as follows:
|
|
8730
8508
|
|
8731
8509
|
<img src="https://i.imgur.com/fWwD1fj.png" style="margin: 1em; margin-left: 2em">
|
8732
8510
|
|
8511
|
+
Let's look at another example.
|
8512
|
+
|
8513
|
+
Say you input the following sequences there:
|
8514
|
+
|
8515
|
+
AGVV
|
8516
|
+
AGVV
|
8517
|
+
AGVV
|
8518
|
+
AGVV
|
8519
|
+
AGGV
|
8520
|
+
AGGV
|
8521
|
+
AGGV
|
8522
|
+
|
8523
|
+
The resulting image that is generated is:
|
8524
|
+
|
8525
|
+
<img src="https://i.imgur.com/3wWApIQ.png" style="margin: 1em; margin-left: 2em">
|
8526
|
+
|
8733
8527
|
## The Kozak Sequence
|
8734
8528
|
|
8735
8529
|
The ribosome usually scans for a **AUG** codon. But there are
|
@@ -9180,6 +8974,409 @@ time being it is what it is. At a later point in time test cases
|
|
9180
8974
|
may be added to check whether it performs correctly or whether it
|
9181
8975
|
does not.
|
9182
8976
|
|
8977
|
+
The other rules, also published in 2004, are the Reynolds rules. Code
|
8978
|
+
support was added to the Bioroebe project in <b>June 2022</b>, but
|
8979
|
+
it was not tested yet, so the implementation may be incorrect.
|
8980
|
+
|
8981
|
+
## The Bioroebe::Shell interface
|
8982
|
+
|
8983
|
+
The following subsection specifically handles information
|
8984
|
+
pertaining to the <b>Bioroebe::Shell</b> interface of the
|
8985
|
+
<b>bioroebe project</b>. It is also called <b>bioshell</b>,
|
8986
|
+
to simplify spelling it.
|
8987
|
+
|
8988
|
+
### Numbers as input in the bioshell
|
8989
|
+
![alt text][cat1]
|
8990
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
8991
|
+
|
8992
|
+
You can input a number in the **BioShell** such as <b style="color: darkblue">3</b>.
|
8993
|
+
|
8994
|
+
This will attempt to <b>display the first 3 nucleotides</b> of
|
8995
|
+
the assigned **main sequence**. It will only work if you have
|
8996
|
+
assigned a sequence prior to that, though.
|
8997
|
+
|
8998
|
+
Examples:
|
8999
|
+
|
9000
|
+
3
|
9001
|
+
33
|
9002
|
+
15
|
9003
|
+
|
9004
|
+
### transeq
|
9005
|
+
![alt text][cat1]
|
9006
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
9007
|
+
|
9008
|
+
You can convert a DNA sequence into an aminoacid sequence by
|
9009
|
+
doing this:
|
9010
|
+
|
9011
|
+
transeq
|
9012
|
+
|
9013
|
+
### Shuffling the DNA/RNA string in the bioshell
|
9014
|
+
![alt text][cat1]
|
9015
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
9016
|
+
|
9017
|
+
Via
|
9018
|
+
|
9019
|
+
shuffle
|
9020
|
+
|
9021
|
+
you can <b>randomly rearrange the main DNA/RNA string</b>
|
9022
|
+
that is used by the <b>Bioroebe::Shell</b>.
|
9023
|
+
|
9024
|
+
This can be useful if you just wish to quickly "test"
|
9025
|
+
new compositions of the same nucleotide.
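If you want the same effect in plain ruby, a minimal sketch could look as follows. The method `shuffle_sequence` is a hypothetical name and not how the bioshell implements "shuffle"; it only shows that the composition stays the same while the order changes.

```ruby
# Hypothetical helper (not part of Bioroebe): randomly rearrange
# the characters of a DNA/RNA string.
def shuffle_sequence(sequence)
  sequence.chars.shuffle.join
end

shuffled = shuffle_sequence('ATGCATGCAAAA')
# Same nucleotides, same length, (most likely) a different order:
shuffled.chars.sort == 'ATGCATGCAAAA'.chars.sort # => true
```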
|
9026
|
+
|
9027
|
+
### Permanently disabling showing the startup-introduction of the Bioshell
|
9028
|
+
![alt text][cat1]
|
9029
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
9030
|
+
|
9031
|
+
If you do not want to see the start-up intro, you can try
|
9032
|
+
any of the following:
|
9033
|
+
|
9034
|
+
bioshell --permanently-disable-startup-intro
|
9035
|
+
bioshell --permanently-disable-startup-notice
|
9036
|
+
bioshell --permanently-no-startup-intro
|
9037
|
+
bioshell --permanently-no-startup-info
|
9038
|
+
|
9039
|
+
### Longest substring
|
9040
|
+
![alt text][cat1]
|
9041
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
9042
|
+
|
9043
|
+
Within the Bioroebe::Shell you can determine the longest substring,
|
9044
|
+
including gaps, like so:
|
9045
|
+
|
9046
|
+
longest_substring? ATTATTGTT | ATTATTCTT
|
9047
|
+
|
9048
|
+
Note that this will make use of the diff-lcs gem, which uses
|
9049
|
+
the McIlroy-Hunt algorithm.
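For readers who want to see what "longest substring, including gaps" means, the following sketch computes a longest common subsequence with a textbook dynamic-programming table. It deliberately does not use the diff-lcs gem and is not the McIlroy-Hunt algorithm; `longest_common_subsequence` is a hypothetical name:

```ruby
# Textbook dynamic-programming LCS (an illustration only, not diff-lcs).
def longest_common_subsequence(a, b)
  # lengths[i][j] holds the LCS length of a[0, i] and b[0, j].
  lengths = Array.new(a.size + 1) { Array.new(b.size + 1, 0) }
  a.each_char.with_index(1) do |char_a, i|
    b.each_char.with_index(1) do |char_b, j|
      lengths[i][j] =
        if char_a == char_b
          lengths[i - 1][j - 1] + 1
        else
          [lengths[i - 1][j], lengths[i][j - 1]].max
        end
    end
  end
  # Walk back through the table to recover one subsequence.
  result = []
  i = a.size
  j = b.size
  while i > 0 && j > 0
    if a[i - 1] == b[j - 1]
      result << a[i - 1]
      i -= 1
      j -= 1
    elsif lengths[i - 1][j] >= lengths[i][j - 1]
      i -= 1
    else
      j -= 1
    end
  end
  result.reverse.join
end

puts longest_common_subsequence('ATTATTGTT', 'ATTATTCTT') # => ATTATTTT
```

The two inputs differ only at the seventh position (G versus C), so the common subsequence skips that one position.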
|
9050
|
+
|
9051
|
+
### Do not create directories on startup of the shell
|
9052
|
+
![alt text][cat1]
|
9053
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
9054
|
+
|
9055
|
+
By default the <b>bioshell</b> will try to create some directories
|
9056
|
+
on startup. This may not always be desired by the user, though,
|
9057
|
+
so an option has to exist to <b>disable</b> this functionality.
|
9058
|
+
|
9059
|
+
Internally the variable @internal_hash[:create_directories_on_startup_of_the_shell]
|
9060
|
+
keeps track of whether directories on startup of the shell will
|
9061
|
+
be created.
|
9062
|
+
|
9063
|
+
To disable this behaviour on startup of the bioshell, try
|
9064
|
+
something like this:
|
9065
|
+
|
9066
|
+
bioshell --do-not-create-directories-on-startup
|
9067
|
+
bioshell --do-not-create-directories
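How such a flag could end up in the internal Hash is sketched below. This is an assumption about the general mechanism, not the actual Bioroebe::Shell code; `parse_startup_flags` is a hypothetical helper, and only the two flags and the Hash key are taken from the text above:

```ruby
# Hypothetical helper: derive the startup setting from commandline flags.
def parse_startup_flags(argv)
  internal_hash = { create_directories_on_startup_of_the_shell: true }
  argv.each do |argument|
    case argument
    when /\A--do-not-create-directories(-on-startup)?\z/
      internal_hash[:create_directories_on_startup_of_the_shell] = false
    end
  end
  internal_hash
end

p parse_startup_flags(['--do-not-create-directories'])
```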
|
9068
|
+
|
9069
|
+
### Generating and assigning a random amount of nucleotides
|
9070
|
+
![alt text][cat1]
|
9071
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
9072
|
+
|
9073
|
+
Via:
|
9074
|
+
|
9075
|
+
random 555
|
9076
|
+
|
9077
|
+
you can "generate" 555 random nucleotides (DNA that is) and
|
9078
|
+
assign it to the main sequence in use by the bioshell. This
|
9079
|
+
is mostly a convenience feature, if you want to debug something
|
9080
|
+
quickly.
|
9081
|
+
|
9082
|
+
### Determining the log directory for the Bioroebe::Shell component
|
9083
|
+
![alt text][cat1]
|
9084
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
9085
|
+
|
9086
|
+
Via:
|
9087
|
+
|
9088
|
+
bioshell_log_dir?
|
9089
|
+
|
9090
|
+
you can determine the log-directory output for the bioshell
|
9091
|
+
component. On my home system this will default to
|
9092
|
+
<b>/home/Temp/bioroebe/bioshell/</b>.
|
9093
|
+
|
9094
|
+
### Prompt (the shell prompt of the bioshell)
|
9095
|
+
![alt text][cat1]
|
9096
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
9097
|
+
|
9098
|
+
You can set a <b>custom prompt</b> in the bioshell, via
|
9099
|
+
the keywords "<b>prompt</b>" or "<b>set_prompt</b>".
|
9100
|
+
|
9101
|
+
To display the <b>current working directory</b>, do:
|
9102
|
+
|
9103
|
+
prompt pwd
|
9104
|
+
|
9105
|
+
To revert to the old default again, do this:
|
9106
|
+
|
9107
|
+
prompt REVERT
|
9108
|
+
prompt revert
|
9109
|
+
prompt DEFAULT
|
9110
|
+
prompt default
|
9111
|
+
|
9112
|
+
If you do not want to set any prompt, do:
|
9113
|
+
|
9114
|
+
prompt none
|
9115
|
+
|
9116
|
+
### Random stuff - generating random DNA sequences in the bioshell
|
9117
|
+
![alt text][cat1]
|
9118
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
9119
|
+
|
9120
|
+
You can <b>generate random DNA sequences</b> in the
|
9121
|
+
<b>bioshell</b> via:
|
9122
|
+
|
9123
|
+
random dna 20
|
9124
|
+
random dna 25
|
9125
|
+
random dna 30
|
9126
|
+
# or simpler
|
9127
|
+
random 20
|
9128
|
+
random 25
|
9129
|
+
random 30
|
9130
|
+
|
9131
|
+
This will generate random DNA sequences, with a length
|
9132
|
+
of 20, 25, 30, respectively. This may not be very useful
|
9133
|
+
but it was important that this functionality is made
|
9134
|
+
available somewhere. Sometimes you may not even care
|
9135
|
+
about the sequence and just use a "filler" sequence,
|
9136
|
+
so randomness has to be part of the Bioroebe project
|
9137
|
+
as well.
|
9138
|
+
|
9139
|
+
You can also use some toplevel-methods to generate, e. g.
|
9140
|
+
20 random aminoacids. Have a look at the following
|
9141
|
+
<b>toplevel API</b>:
|
9142
|
+
|
9143
|
+
Bioroebe.random_aminoacid? 20 # => "UAVHYQQESWUYAOVESEIY"
|
9144
|
+
|
9145
|
+
Note that there may exist other APIs within the Bioroebe project
|
9146
|
+
that do the same as well.
|
9147
|
+
|
9148
|
+
If you would like to use a ruby-gtk3 widget have a look
|
9149
|
+
at **RandomSequence**, under **bioroebe/gtk3/random_sequence/**.
|
9150
|
+
It works with aminoacids, DNA and RNA, and allows the user to
|
9151
|
+
create random sequences. (If you need weighted randomness then
|
9152
|
+
you currently have to use the commandline variant. Perhaps I may
|
9153
|
+
add support into the GUI directly for this one day.)
|
9154
|
+
|
9155
|
+
### Deprecations within the Bioroebe::Shell
|
9156
|
+
![alt text][cat1]
|
9157
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
9158
|
+
|
9159
|
+
Over the years the Bioroebe::Shell changed quite a bit.
|
9160
|
+
|
9161
|
+
This subsection here will list a few of these changes
|
9162
|
+
or rather, the deprecations.
|
9163
|
+
|
9164
|
+
**raw_sequence**: removed in June 2022 completely. It is
|
9165
|
+
simpler to handle sequences via Bioroebe::Sequence
|
9166
|
+
instead.
|
9167
|
+
|
9168
|
+
<b>@internal_hash[:array_sequences]</b> was no longer in
|
9169
|
+
use, so it was removed in July 2022.
|
9170
|
+
|
9171
|
+
### Chop off nucleotides within the Bioroebe::Shell
|
9172
|
+
![alt text][cat1]
|
9173
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
9174
|
+
|
9175
|
+
You can use the following syntax to chop away until you find
|
9176
|
+
a particular substring, in the bioshell:
|
9177
|
+
|
9178
|
+
chop_to ATG
|
9179
|
+
|
9180
|
+
This functionality was specifically added to find the first
|
9181
|
+
ATG codon.
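A rough sketch of the idea in plain Ruby follows. It assumes that the target substring itself is kept and that a sequence without the target is left unchanged; both are assumptions about the behaviour, and this `chop_to` is a stand-alone illustration rather than the bioshell implementation:

```ruby
# Illustration only: drop everything in front of the first occurrence
# of target. Assumes the target itself is kept in the result.
def chop_to(sequence, target = 'ATG')
  position = sequence.index(target)
  return sequence if position.nil? # target absent: leave the sequence as is
  sequence[position..-1]
end

puts chop_to('CCGATGAAATAG') # => ATGAAATAG
```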
|
9182
|
+
|
9183
|
+
### Truncating output in the bioroebe-shell
|
9184
|
+
![alt text][cat1]
|
9185
|
+
[cat1]: https://i.imgur.com/Qmd7R0p.png
|
9186
|
+
|
9187
|
+
**DNA/RNA sequences** can become very long and then become
|
9188
|
+
quite difficult to view, read and handle on the commandline.
|
9189
|
+
|
9190
|
+
Normally the bioroebe shell will truncate output of DNA sequences
|
9191
|
+
that are "too long". This is mostly done so that working with
|
9192
|
+
very long sequences becomes a bit more convenient.
|
9193
|
+
|
9194
|
+
Sometimes this can become an antifeature, though, so the user
|
9195
|
+
must be able to toggle this at his or her own discretion.
|
9196
|
+
|
9197
|
+
By default, the bioroebe-shell (bioshell) will always try
|
9198
|
+
to truncate output, but you can toggle this behaviour by
|
9199
|
+
issuing:
|
9200
|
+
|
9201
|
+
do not truncate
|
9202
|
+
|
9203
|
+
In theory, other "do not" actions are also supported, or will
|
9204
|
+
be supported in the future; right now (Oct 2019) this is a bit
|
9205
|
+
limited.
|
9206
|
+
|
9207
|
+
From the toplevel, you can use this method:
|
9208
|
+
|
9209
|
+
Bioroebe.do_not_truncate
|
9210
|
+
|
9211
|
+
The above instruction will toggle the truncate behaviour
|
9212
|
+
to not truncate, ever.
|
9213
|
+
|
9214
|
+
If you need to do so within the bioshell, this is the way:
|
9215
|
+
|
9216
|
+
no_truncate
|
9217
|
+
|
9218
|
+
Or simply
|
9219
|
+
|
9220
|
+
truncate
|
9221
|
+
|
9222
|
+
This will toggle, like a switch.
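The toggle can be pictured with the following sketch. The cut-off of 20 nucleotides, the trailing "..." and the method name `display_sequence` are all assumptions made for illustration; the bioshell may use a different limit and marker:

```ruby
# Hypothetical helper: shorten long sequences for display unless
# truncation was switched off.
def display_sequence(sequence, truncate: true, max_length: 20)
  return sequence unless truncate && sequence.size > max_length
  "#{sequence[0, max_length]}..."
end

truncate = true
long_sequence = 'ATGC' * 10
puts display_sequence(long_sequence, truncate: truncate) # shortened output
truncate = !truncate # flip the switch, as the truncate command does
puts display_sequence(long_sequence, truncate: truncate) # full output
```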
|
9223
|
+
|
9224
|
+
## Support for other programming languages
|
9225
|
+
|
9226
|
+
The main programming language for the bioroebe project is **ruby**.
|
9227
|
+
Ruby, from a language design point of view, is a great programming
|
9228
|
+
language - not necessarily all of ruby, but the subset that I use.
|
9229
|
+
It is very easy to quickly prototype ideas via ruby.
|
9230
|
+
|
9231
|
+
However, ruby is known to **not** be among the fastest programming
|
9232
|
+
languages on this planet; so, it makes sense to use other
|
9233
|
+
languages too from this point of view. Additionally there are some
|
9234
|
+
software stacks in use in **other** programming languages, such as
|
9235
|
+
matplotlib and various more.
|
9236
|
+
|
9237
|
+
Thus, it is important to **support other programming languages** as
|
9238
|
+
well, if there are useful libraries. The bioroebe project, after
|
9239
|
+
all, tries to be **practical**: it focuses on getting things done,
|
9240
|
+
no matter the language.
|
9241
|
+
|
9242
|
+
This means that support for other programming languages can be
|
9243
|
+
found in this project as well, often using system() or similar
|
9244
|
+
functionality to tap into these other programming languages. Do
|
9245
|
+
not be surprised when that happens - the bioroebe project will
|
9246
|
+
also try to act as a **practical glue** towards functionality
|
9247
|
+
enabled via other projects. We want to get things done, no
|
9248
|
+
matter the programming language at hand!
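A minimal sketch of this glue approach, using Open3 from the Ruby standard library rather than a bare system() call so that the output can be captured. The helper name `run_external` is hypothetical, and the Ruby interpreter itself stands in for an external program here only so that the example runs everywhere:

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical helper: run an external program and return its output.
def run_external(command, *arguments)
  output, status = Open3.capture2(command, *arguments)
  raise "#{command} exited with status #{status.exitstatus}" unless status.success?
  output.strip
end

# The Ruby interpreter is used as the "other" program for illustration;
# in practice this could be python3, Rscript or a similar tool.
puts run_external(RbConfig.ruby, '-e', 'puts "ATGC".reverse')
```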
|
9249
|
+
|
9250
|
+
Whenever possible, though, the bioroebe project will try to be
|
9251
|
+
flexible in this regard, so ideally the same solution should
|
9252
|
+
work for many different programming languages.
|
9253
|
+
|
9254
|
+
While Ruby is the primary language for this project, as
|
9255
|
+
of 2021 I will try to officially support **java**, **jruby**
|
9256
|
+
and the **GraalVM**. This is on my TODO list, though - stay
|
9257
|
+
tuned for more updates in this regard. See also the
|
9258
|
+
subsection <b>Support for Python</b>.
|
9259
|
+
|
9260
|
+
## Support for Python
|
9261
|
+
|
9262
|
+
In <b>June 2022</b> I decided to add support for Python to bioroebe.
|
9263
|
+
|
9264
|
+
While people can - and should - easily use <b>biopython</b> instead,
|
9265
|
+
I simply wanted to see how much python-support I can add to
|
9266
|
+
bioroebe. This may lag behind some years compared to biopython,
|
9267
|
+
but I wanted to extend python support as well, so there you go.
|
9268
|
+
It is simply an additional option for the bioroebe project.
|
9269
|
+
<b>Ruby</b> will remain the primary language for the project,
|
9270
|
+
though, at the least for now.
|
9271
|
+
|
9272
|
+
## Bioroebe::ProfilePattern
|
9273
|
+
|
9274
|
+
This class can be used to generate nucleotide sequences that
|
9275
|
+
are not quite "random". For example, to generate sequences
|
9276
|
+
that may "simulate" a TATA box.
|
9277
|
+
|
9278
|
+
The idea for this class is to be extended into allowing
|
9279
|
+
HMMs (Hidden Markov Models) one day.
|
9280
|
+
|
9281
|
+
Usage example:
|
9282
|
+
|
9283
|
+
_ = Bioroebe::ProfilePattern.new(ARGV, :do_not_run_yet)
|
9284
|
+
_.generate_sequence_based_on_this_profile
|
9285
|
+
|
9286
|
+
Such a profile specifies the preferred sequence
|
9287
|
+
letters for each position in a section of DNA. You have to provide
|
9288
|
+
the Hash into the method generate_sequence_based_on_this_profile() -
|
9289
|
+
or you use the default Hash, which is stored in the constant
|
9290
|
+
called **PER_POSITION_HASH**.
|
9291
|
+
|
9292
|
+
That profile should be a Hash, with keys pointing to A, T, C, G
|
9293
|
+
and the values being an Array of likelihoods, each given
|
9294
|
+
as a number, such as 140. These values are also called
|
9295
|
+
**scores**. Each score contains a number for each position
|
9296
|
+
that indicates how likely it is to find the given
|
9297
|
+
nucleotide at that location.
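To make the per-position scores more concrete, here is a small sketch in plain Ruby. It does not use Bioroebe::ProfilePattern; the names `weighted_pick` and `generate_from_profile` as well as the example scores are hypothetical, and String keys are assumed:

```ruby
# Pick one key from a Hash of weights, proportionally to its weight.
def weighted_pick(weights)
  threshold = rand * weights.values.sum
  weights.each do |item, weight|
    threshold -= weight
    return item if threshold < 0
  end
  weights.keys.last # fallback, e.g. if all weights are zero
end

# Build a sequence by sampling one nucleotide per position.
def generate_from_profile(profile)
  positions = profile.values.first.size
  Array.new(positions) do |index|
    scores_here = profile.transform_values { |scores| scores[index] }
    weighted_pick(scores_here)
  end.join
end

# Made-up scores that favour a TATA-like pattern.
profile = {
  'A' => [10, 140, 10, 140],
  'T' => [140, 10, 140, 10],
  'C' => [5, 5, 5, 5],
  'G' => [5, 5, 5, 5]
}
puts generate_from_profile(profile) # most often TATA, but not always
```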
|
9298
|
+
|
9299
|
+
You can also use this class to generate a random DNA string,
|
9300
|
+
similar to the method called
|
9301
|
+
**Bioroebe.generate_random_dna_sequence()**. The difference
|
9302
|
+
is that class ProfilePattern allows for a bit more fine-tuned
|
9303
|
+
control. The class will likely be extended in the future too.
|
9304
|
+
|
9305
|
+
## Generate DNA via Bioroebe.random_dna
|
9306
|
+
|
9307
|
+
You can "generate" random DNA strings by making use of the
|
9308
|
+
following code:
|
9309
|
+
|
9310
|
+
x = Bioroebe.random_dna 50 # => "AGACATCCGGCTTGGATACCTCATAAGTCATATCAGCATCGTCGGACATT"
|
9311
|
+
|
9312
|
+
As can be seen in the example above, after the #, a String will be
|
9313
|
+
returned representing that nucleotide sequence. In the case above
|
9314
|
+
it'll be 50 nucleotides in length.
|
9315
|
+
|
9316
|
+
The number given to <b>.random_dna()</b> tells the method how many
|
9317
|
+
nucleotides should be generated.
|
9318
|
+
|
9319
|
+
The method accepts a second argument, which should be a Hash.
|
9320
|
+
If it is a hash then the generated DNA will be based on the
|
9321
|
+
**probabilities** given to that Hash.
|
9322
|
+
|
9323
|
+
Let's look at a specific example here:
|
9324
|
+
|
9325
|
+
Bioroebe.random_dna(50, { A: 10, T: 10, C: 10, G: 70}) # => "GGGGTGGGGAGGGTATGCGGAGGAAGGGCGGGAAGGGCGGGGGCTGGGCG"
|
9326
|
+
|
9327
|
+
As you can see, in the Hash defined above, the likelihood for
|
9328
|
+
incorporating a Guanine is much higher than for Adenine
|
9329
|
+
(70 : 10). This will be reflected in the generated DNA
|
9330
|
+
sequence which, as can be seen, contains many more
|
9331
|
+
Guanines than Adenines.
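How such a weighted choice can work is sketched below in plain Ruby. This is an assumption about the general technique, not the code behind Bioroebe.random_dna; `random_dna_weighted` is a hypothetical name:

```ruby
# Illustration only: draw each nucleotide proportionally to its weight.
def random_dna_weighted(length, weights = { A: 25, T: 25, C: 25, G: 25 })
  total = weights.values.sum
  Array.new(length) do
    threshold = rand * total
    chosen = weights.keys.last # fallback if nothing was picked
    weights.each do |nucleotide, weight|
      threshold -= weight
      if threshold < 0
        chosen = nucleotide
        break
      end
    end
    chosen.to_s
  end.join
end

puts random_dna_weighted(50, { A: 10, T: 10, C: 10, G: 70 })
```

With the weights above, roughly 70% of the generated nucleotides should be G on average.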
|
9332
|
+
|
9333
|
+
There is yet a third use case for the above. If you pass a **String**
|
9334
|
+
as the second argument rather than a Hash, then that String will be
|
9335
|
+
used as basis for generating the DNA string at hand.
|
9336
|
+
|
9337
|
+
Again, let's look at a specific example here:
|
9338
|
+
|
9339
|
+
Bioroebe.random_dna(10, 'ATCGATCGGG')
|
9340
|
+
|
9341
|
+
Here we add more G than A, T or C, so the new DNA sequence should
|
9342
|
+
contain more G than A, T or C as well.
|
9343
|
+
|
9344
|
+
More usage examples in this regard:
|
9345
|
+
|
9346
|
+
Bioroebe.random_dna(20, 'ATGGGGGGGG') # => "TGAGGGGGGGGGTGGGAGGG"
|
9347
|
+
Bioroebe.random_dna(20, 'ATGGGGGGGG') # => "GGTAGGGGGGGGTAGGGGGG"
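One way to picture the String variant is to treat the String as a pool from which letters are drawn with replacement. This is an assumption about the behaviour, consistent with 20 nucleotides being generated from a 10-letter String above; `random_dna_from_pool` is a hypothetical name:

```ruby
# Illustration only: sample letters from the pool with replacement, so
# letters that occur more often in the pool are drawn more often.
def random_dna_from_pool(length, pool)
  letters = pool.chars
  Array.new(length) { letters.sample }.join
end

puts random_dna_from_pool(20, 'ATGGGGGGGG') # mostly G, no C at all
```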
|
9348
|
+
|
9349
|
+
Note that this is similar to the .randomize() method in the bioruby
|
9350
|
+
project:
|
9351
|
+
|
9352
|
+
hash = {'a'=>1,'c'=>2,'g'=>3,'t'=>4}
|
9353
|
+
puts Bio::Sequence::NA.randomize(hash) # => "ggcttgttac" (for example)
|
9354
|
+
|
9355
|
+
## Parsing genbank (.gbk) files
|
9356
|
+
|
9357
|
+
You could use Bioroebe::GenbankParser to parse .gbk files, at the
|
9358
|
+
least if you want to obtain the raw sequence, in FASTA format.
|
9359
|
+
|
9360
|
+
Example for this:
|
9361
|
+
|
9362
|
+
require 'bioroebe/parsers/genbank_parser.rb'
|
9363
|
+
result = Bioroebe::GenbankParser.new('/home/Temp/bioroebe/ls_orchid.gbk')
|
9364
|
+
result.dataset? # This method call will return the FASTA sequence.
|
9365
|
+
|
9366
|
+
Note that this currently (<b>July 2022</b>) only grabs one entry. In
|
9367
|
+
the upcoming rewrite the parser will be able to parse
|
9368
|
+
all entries, and then present them to the user. Stay tuned in this
|
9369
|
+
regard.
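For illustration, the basic idea of extracting sequences from GenBank text can be sketched in plain Ruby. This is not Bioroebe::GenbankParser; `genbank_to_fasta` is a hypothetical helper that only looks at the LOCUS line and the ORIGIN block, and it assumes records are terminated by a `//` line:

```ruby
# Illustration only: convert GenBank text into FASTA, one entry per record.
def genbank_to_fasta(text)
  records = text.split(%r{^//[ \t]*$}) # records end with a "//" line
  entries = records.map do |record|
    locus  = record[/^LOCUS\s+(\S+)/, 1]
    origin = record[/^ORIGIN.*?\n(.*)/m, 1]
    next if locus.nil? || origin.nil? # skip fragments without a sequence
    # Remove position numbers and whitespace, keep only the letters.
    sequence = origin.gsub(/[^a-zA-Z]/, '').upcase
    ">#{locus}\n#{sequence}"
  end
  entries.compact.join("\n")
end

example = "LOCUS       TEST1   12 bp    DNA\n" \
          "ORIGIN\n" \
          "        1 atgccgttag gc\n" \
          "//\n"
puts genbank_to_fasta(example)
```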
|
9370
|
+
|
9371
|
+
## Parsers in general
|
9372
|
+
|
9373
|
+
The bioroebe project will store most parsers in the parsers/ subdirectory
|
9374
|
+
as of <b>July 2022</b>.
|
9375
|
+
|
9376
|
+
Prior to that date different parsers were stored in different subdirectories,
|
9377
|
+
such as the parser for genbank-files being stored in the genbank/
|
9378
|
+
subdirectory. As I found this situation confusing, I settled for
|
9379
|
+
the parsers/ subdirectory as of <b>July 2022</b>.
|
9183
9380
|
|
9184
9381
|
## Possibly useful links in regards to molecular biology and science in general
|
9185
9382
|
|