RubyGems - bioroebe - Versions diffs - 0.10.80 → 0.11.24 - Mend

bioroebe 0.10.80 → 0.11.24

Potentially problematic release.

This version of bioroebe might be problematic.

Files changed (129)

checksums.yaml +4 -4
data/README.md +1204 -772
data/bioroebe.gemspec +3 -3
data/doc/README.gen +1203 -771
data/doc/todo/bioroebe_todo.md +391 -365
data/lib/bioroebe/aminoacids/aminoacid_substitution.rb +1 -9
data/lib/bioroebe/aminoacids/codon_percentage.rb +1 -9
data/lib/bioroebe/aminoacids/deduce_aminoacid_sequence.rb +1 -9
data/lib/bioroebe/aminoacids/display_aminoacid_table.rb +1 -0
data/lib/bioroebe/aminoacids/show_hydrophobicity.rb +1 -6
data/lib/bioroebe/base/colours_for_base/colours_for_base.rb +18 -8
data/lib/bioroebe/base/commandline_application/commandline_arguments.rb +13 -11
data/lib/bioroebe/base/commandline_application/misc.rb +18 -8
data/lib/bioroebe/base/misc.rb +16 -0
data/lib/bioroebe/base/prototype/misc.rb +1 -1
data/lib/bioroebe/codons/show_codon_tables.rb +6 -2
data/lib/bioroebe/codons/show_codon_usage.rb +2 -1
data/lib/bioroebe/constants/aminoacids_and_proteins.rb +1 -0
data/lib/bioroebe/constants/database_constants.rb +1 -1
data/lib/bioroebe/constants/files_and_directories.rb +20 -1
data/lib/bioroebe/constants/misc.rb +20 -0
data/lib/bioroebe/count/count_amount_of_nucleotides.rb +3 -0
data/lib/bioroebe/crystal/README.md +2 -0
data/lib/bioroebe/crystal/to_rna.cr +19 -0
data/lib/bioroebe/data/README.md +11 -8
data/lib/bioroebe/data/electron_microscopy/pos_example.pos +396 -0
data/lib/bioroebe/data/electron_microscopy/test_particles.star +36 -0
data/lib/bioroebe/{shell/tk.rb → electron_microscopy/electron_microscopy_module.rb} +15 -10
data/lib/bioroebe/electron_microscopy/simple_star_file_generator.rb +4 -9
data/lib/bioroebe/fasta_and_fastq/show_fasta_headers.rb +27 -12
data/lib/bioroebe/genome/README.md +4 -0
data/lib/bioroebe/genome/genome.rb +67 -0
data/lib/bioroebe/gui/gtk3/protein_to_DNA/protein_to_DNA.rb +18 -18
data/lib/bioroebe/gui/gtk3/random_sequence/random_sequence.rb +19 -11
data/lib/bioroebe/gui/shared_code/protein_to_DNA/protein_to_DNA_module.rb +14 -14
data/lib/bioroebe/misc/ruler.rb +1 -0
data/lib/bioroebe/parsers/genbank_parser.rb +353 -24
data/lib/bioroebe/parsers/gff.rb +1 -9
data/lib/bioroebe/pdb/parse_pdb_file.rb +1 -9
data/lib/bioroebe/project/project.rb +1 -1
data/lib/bioroebe/python/README.md +1 -0
data/lib/bioroebe/python/__pycache__/mymodule.cpython-39.pyc +0 -0
data/lib/bioroebe/python/gui/gtk3/all_in_one.css +4 -0
data/lib/bioroebe/python/gui/gtk3/all_in_one.py +59 -0
data/lib/bioroebe/python/gui/gtk3/widget1.py +20 -0
data/lib/bioroebe/python/gui/tkinter/all_in_one.py +91 -0
data/lib/bioroebe/python/mymodule.py +8 -0
data/lib/bioroebe/python/protein_to_dna.py +33 -0
data/lib/bioroebe/python/shell/shell.py +19 -0
data/lib/bioroebe/python/to_rna.py +14 -0
data/lib/bioroebe/python/toplevel_methods/open_in_browser.py +20 -0
data/lib/bioroebe/python/toplevel_methods/palindromes.py +42 -0
data/lib/bioroebe/python/toplevel_methods/rds.py +13 -0
data/lib/bioroebe/python/toplevel_methods/three_delimiter.py +34 -0
data/lib/bioroebe/python/toplevel_methods/time_and_date.py +43 -0
data/lib/bioroebe/python/toplevel_methods/to_camelcase.py +11 -0
data/lib/bioroebe/requires/require_the_bioroebe_project.rb +3 -1
data/lib/bioroebe/sequence/nucleotide_module/nucleotide_module.rb +28 -25
data/lib/bioroebe/sequence/protein.rb +105 -3
data/lib/bioroebe/sequence/sequence.rb +61 -2
data/lib/bioroebe/shell/menu.rb +3451 -3366
data/lib/bioroebe/shell/misc.rb +51 -4311
data/lib/bioroebe/shell/readline/readline.rb +1 -1
data/lib/bioroebe/shell/shell.rb +11192 -28
data/lib/bioroebe/siRNA/siRNA.rb +81 -1
data/lib/bioroebe/string_matching/find_longest_substring.rb +3 -2
data/lib/bioroebe/taxonomy/class_methods.rb +3 -8
data/lib/bioroebe/taxonomy/constants.rb +4 -3
data/lib/bioroebe/taxonomy/edit.rb +2 -1
data/lib/bioroebe/taxonomy/help/help.rb +10 -10
data/lib/bioroebe/taxonomy/info/check_available.rb +15 -9
data/lib/bioroebe/taxonomy/info/info.rb +17 -2
data/lib/bioroebe/taxonomy/info/is_dna.rb +46 -36
data/lib/bioroebe/taxonomy/interactive.rb +139 -95
data/lib/bioroebe/taxonomy/menu.rb +27 -18
data/lib/bioroebe/taxonomy/parse_fasta.rb +3 -1
data/lib/bioroebe/taxonomy/shared.rb +1 -0
data/lib/bioroebe/taxonomy/taxonomy.rb +1 -0
data/lib/bioroebe/toplevel_methods/aminoacids_and_proteins.rb +31 -24
data/lib/bioroebe/toplevel_methods/databases.rb +1 -1
data/lib/bioroebe/toplevel_methods/fasta_and_fastq.rb +101 -63
data/lib/bioroebe/toplevel_methods/misc.rb +17 -16
data/lib/bioroebe/toplevel_methods/nucleotides.rb +22 -5
data/lib/bioroebe/toplevel_methods/open_in_browser.rb +2 -0
data/lib/bioroebe/toplevel_methods/palindromes.rb +1 -2
data/lib/bioroebe/toplevel_methods/taxonomy.rb +2 -2
data/lib/bioroebe/toplevel_methods/to_camelcase.rb +5 -0
data/lib/bioroebe/utility_scripts/align_open_reading_frames.rb +1 -9
data/lib/bioroebe/utility_scripts/check_for_mismatches/check_for_mismatches.rb +1 -9
data/lib/bioroebe/utility_scripts/compacter.rb +1 -9
data/lib/bioroebe/utility_scripts/compseq/compseq.rb +1 -9
data/lib/bioroebe/utility_scripts/create_batch_entrez_file.rb +1 -9
data/lib/bioroebe/utility_scripts/dot_alignment.rb +1 -9
data/lib/bioroebe/utility_scripts/move_file_to_its_correct_location.rb +1 -4
data/lib/bioroebe/utility_scripts/showorf/constants.rb +0 -5
data/lib/bioroebe/utility_scripts/showorf/reset.rb +1 -4
data/lib/bioroebe/version/version.rb +2 -2
data/lib/bioroebe/www/embeddable_interface.rb +101 -52
data/lib/bioroebe/www/sinatra/sinatra.rb +186 -70
data/lib/bioroebe/yaml/aminoacids/amino_acids_long_name_to_one_letter.yml +2 -2
data/lib/bioroebe/yaml/configuration/browser.yml +1 -1
data/lib/bioroebe/yaml/genomes/README.md +3 -4
data/lib/bioroebe/yaml/restriction_enzymes/restriction_enzymes.yml +3 -3
metadata +32 -35
data/doc/setup.rb +0 -1655
data/lib/bioroebe/genbank/genbank_parser.rb +0 -291
data/lib/bioroebe/shell/add.rb +0 -108
data/lib/bioroebe/shell/assign.rb +0 -360
data/lib/bioroebe/shell/chop_and_cut.rb +0 -281
data/lib/bioroebe/shell/constants.rb +0 -166
data/lib/bioroebe/shell/download.rb +0 -335
data/lib/bioroebe/shell/enable_and_disable.rb +0 -158
data/lib/bioroebe/shell/enzymes.rb +0 -310
data/lib/bioroebe/shell/fasta.rb +0 -345
data/lib/bioroebe/shell/gtk.rb +0 -76
data/lib/bioroebe/shell/history.rb +0 -132
data/lib/bioroebe/shell/initialize.rb +0 -217
data/lib/bioroebe/shell/loop.rb +0 -74
data/lib/bioroebe/shell/prompt.rb +0 -107
data/lib/bioroebe/shell/random.rb +0 -289
data/lib/bioroebe/shell/reset.rb +0 -335
data/lib/bioroebe/shell/scan_and_parse.rb +0 -135
data/lib/bioroebe/shell/search.rb +0 -337
data/lib/bioroebe/shell/sequences.rb +0 -200
data/lib/bioroebe/shell/show_report_and_display.rb +0 -2901
data/lib/bioroebe/shell/startup.rb +0 -127
data/lib/bioroebe/shell/taxonomy.rb +0 -14
data/lib/bioroebe/shell/user_input.rb +0 -88
data/lib/bioroebe/shell/xorg.rb +0 -45

data/README.md CHANGED

@@ -2,13 +2,13 @@
 [![forthebadge](https://forthebadge.com/images/badges/made-with-ruby.svg)](https://www.ruby-lang.org/en/)
 [![Gem Version](https://badge.fury.io/rb/bioroebe.svg)](https://badge.fury.io/rb/bioroebe)
-This gem was <b>last updated</b> on the <span style="color: darkblue; font-weight: bold">24.06.2022</span> (dd.mm.yyyy notation), at <span style="color: steelblue; font-weight: bold">22:13:29</span> o'clock.
+This gem was <b>last updated</b> on the <span style="color: darkblue; font-weight: bold">03.08.2022</span> (dd.mm.yyyy notation), at <span style="color: steelblue; font-weight: bold">23:23:28</span> o'clock.
 # The Bioroebe Project
 ## Bioroebe
-<img src="http://shevy.bplaced.net/BIOROEBE.png">
+<img src="https://i.imgur.com/mAoP7AP.png">
 <img src="https://i.imgur.com/YqYxRBZ.png" style="margin: 4px; margin-left: 12px;"/>
 <img src="https://i.imgur.com/k7mMlg2.png" style="margin: 4px; margin-left: 12px;"/>
@@ -335,41 +335,6 @@ so I opted to go the yaml route. But if people want to use a hash
 instead, they can do so, too - see the <b>API</b> for codon tables
 lateron. Simply define your own constants and pass them to the
 appropriate methods.
-## Support for other programming languages
-The main programming language for the bioroebe project is **ruby**.
-Ruby, from a language design point of view, is a great programming
-language - not necessarily all of ruby, but the subset that I use.
-It is very easy to quickly prototype ideas via ruby.
-However had, ruby is known to **not** be among the fastest programming
-languages about on this planet; so, it makes sense to use other
-languages too from this point of view. Additionally there are some
-software stacks in use in **other** programming languages, such as
-matplotlib and various more.
-Thus, it is important to **support other programming languages** as
-well, if there are useful libraries. The bioroebe project, after
-all, tries to be **practical**: it focuses on getting things done,
-no matter the language.
-This means that support for other programming languages can be
-found in this project as well, often using system() or similar
-functionality to tap into these other programming languages. Do
-not be surprised when that happens - the bioroebe project will
-also try to act as a **practical glue** towards functionality
-enabled via other projects. We want to get things done, no
-matter the programming language at hand!
-Whenever possible, though, the bioroebe project will try to be
-flexible in this regard, so ideally the same solution should
-work for many different programming languages.
-While Ruby is the primary language for this project, since as
-of 2021 I will try to officially support **java**, **jruby**
-and the **GraalVM**. This is on my TODO list, though - stay
-tuned for more updates in this regard.
 ## Readline support in the BioRoebe project
@@ -553,16 +518,16 @@ the DNA-to-Protein translation is somewhat simply kept as a
 Once you are inside a **running Bioshell**, you can do other **commands**
 such as this one here:
-    random # ← This will generate a random DNA sequence.
+    random # ← This will generate a random DNA sequence. Each nucleotide has the same chance to be added.
 To **assign** a DNA sequence, do:
     assign ATAGGGCTTTT
-Note that since the year 2016, if you input a nucleotide sequence like
-the one above, without any other commands/words, then we will assume
+Note that since as of the year <b>2016</b>, if you input a nucleotide sequence
+like the one above, without any other commands/words, then we will assume
 that you did mean to do an assignment as-is anyway. The "assign" part
-then becomes superfluous.
+then becomes superfluous and can be omitted.
 This is how this is simply done, by omitting the "assign" part of the
 above instruction altogether:
@@ -1073,18 +1038,18 @@ The text **banana** thus has the following suffixes:
 This subsection deals with some aspects of **HMMs**.
-Why are HMMs useful in biology? They can be used to represent protein
-families, for example (via pHMMs - profile hidden markov models).
+Why are HMMs useful in biology? They can be used to <b>represent protein
+families</b>, for example (via <b>pHMMs</b> - profile hidden markov models).
 Furthermore, they can show some bias in the mutation rate that can be
 observed. Different genomes are known to have different hotspots where
-mutations are more likely to happen. These are examples where a HMM
-may be useful.
+mutations are more likely to happen, for various reasons. These are
+examples where a HMM may be useful.
-HMMs are usually based on the Shannon model where you assign different
+HMMs are usually based on the <b>Shannon model</b> where you assign different
 probabilities to "change" events. An example that was mentioned back
-in 1948 was the english alphabet - some letters, and combinations of
-letters, are more commonly seen. Shannon gave the example of "E"
+in <b>1948</b> was the english alphabet - some letters, and combinations
+of letters, are more commonly seen. Shannon gave the example of "E"
 versus "W", as shown in the following graph (a **finite state
 graph**):
@@ -1098,40 +1063,47 @@ DNA sequence, a 10-mer would be equivalent to **10 base pairs**.
 The individual transition states are based on an assumption of
 "randomness", but ensuring that these are truly random is not
 necessarily trivial. Computers do not really 'generate' true
-randomness, at the least not when they are working solo. You
-can even 'predict' some randomness here or there - see vulnerabilities
-such as Specter or similar variants where software can read from
-areas of the memory that should be inaccessible to them. Some
-of this is based on co-predictions. For distributed computers,
-you may often use random noise or decay of atoms as 'a source
-of randomness''. For any DNA nucleotide sequence, we would
-assume that each base pair has a 25% chance to exist at any
-given position, but this is not necessarily true, for various
-reasons. An interesting thought is ... why is ATP so important?
-Yes, due to it being 'the energy currency in a cell' but .. why
-is this ATP aka adenine? Why not GTP, aka guanine or any of
-the other two nucleotides? I can not answer the question; there may
-be many reasons, including differential chemical storage power as
-well as mere random chance event in evolution, but for whatever
+randomness, at the least not when they are working solo, "on
+their own". You can even 'predict' some randomness here or there
+via various techniques - see vulnerabilities such as <b>Specter</b>
+or similar variants where software can read from areas of the
+memory that should be inaccessible to them. Some of this is based
+on co-predictions. For distributed computers, you may often use
+random noise or decay of atoms as 'a source of randomness'. For
+any DNA nucleotide sequence, we would assume that each base pair
+has a 25% chance to exist at any given position, but this is not
+necessarily true, again for various reasons.
+An interesting thought is ... why is <b>ATP</b> so important?
+Yes, of course due to it being 'the energy currency in a cell' but ..
+why is this ATP, aka adenine? Why not GTP, aka guanine or any of
+the other two nucleotides? (GTP is used too, but why? Why not
+CTP and TTP?) I can not answer this question; there may
+be many reasons, including differential chemical storage power
+as well as mere random chance event in evolution, but for whatever
 the reason, you will not find a complete 25% percentage value
 for every given "slot" in DNA, depending on the organism.
 From a practical point of view, how can we approach Hidden Markov
-Models?
+Models and use them?
-Let's take the following sequence:
+Let's take the following simple sequence:
     ACGTACGC
 From this sequence we can see that the <b>3-mer</b> "ACG"
 is followed by either a T, or a C. Have a look at the sequence
-to see if you can identify the two ACG subsequences there.
+again to see if you can identify the two ACG subsequences
+there. You can see one at the start, and the other one
+following a bit later, hence why we come to the conclusion
+that either a T or a C will follow this <b>3-mer</b>.
-The probability of either T or C, thus, is 0.5 (50%);
-for A and G to follow there is 0% so the latter two can
-be ignored.
+The probability of either T or C to occur on <b>that</b>
+position, thus, is 0.5 (50%); for A and G to follow there
+is 0% so the latter two can be ignored.
-Thus, we could use a ruby Hash as follows:
+Thus, we could use a ruby Hash as follows that should
+describe these probabilities:
     probabilities = {'T': 0.5, 'C': 0.5} # ignoring A and G here, but we could denote them via 0 as well
@@ -1217,34 +1189,6 @@ each edge.
 Parsimony assumes that substitutions are rare and that back-mutations
 do not occur.
-## Random stuff
-You can generate random DNA sequences in the shell:
-    random dna 20
-    random dna 25
-    random dna 30
-This will generate random DNA sequences, with a length
-of 20, 25, 30, respectively. This may not be very useful
-but it was important that this functionality is made
-available somewhere.
-You can also use some toplevel-methods to generate, e. g.
-20 random aminoacids:
-    Bioroebe.random_aminoacid? 20 # => "UAVHYQQESWUYAOVESEIY"
-Note that there may exist other APIs within the Bioroebe project
-that do the same as well.
-If you would like to use a ruby-gtk3 widget have a look
-at **RandomSequence**, under **bioroebe/gtk3/random_sequence/**.
-It works with aminoacids, DNA and RNA, and allows the user to
-create random sequences. (If you need weighted randomness then
-you currently have to use the commandline variant. Perhaps I may
-add support into the GUI directly for this one day.)
 ## Displaying the main sequence with delimiter characters
 From within the <b>bioshell</b>, you can use some alternative ways to
@@ -1486,24 +1430,9 @@ You can simulate this via the following API:
     Bioroebe.cleave_with_trypsin(sequence_goes_in_here)
     Bioroebe.cleave :with_trypsin, sequence_goes_in_here
-Currently (July 2021) only support for Trypsin is included, but
+Currently (<b>July 2021</b>) only support for Trypsin is included, but
 in the long run the goal is to add as many digestive (peptide-bond
 cleaving) enzymes here as possible.
-## Freezing the main sequence - and unfreezing it again
-You can **freeze** the BioShell, meaning that it will no longer allow
-for the main sequence to be modified, via:
-    freeze
-To unfreeze again, issue:
-    unfreeze
-This functionality has been added because the shell may sometimes be
-quite eager to change the main sequence, so we needed a way to disable
-any further modifications (until "unfreeze" is issued that is).
 ## MUMmer
@@ -2714,18 +2643,6 @@ This may look as follows:
 <img src="https://i.imgur.com/gAZg8qG.png" style="margin: 1em; margin-left: 3em">
-## Obtaining a subsequence from a Bioroebe::Sequence object
-Say that you have the DNA sequence **ATGCATGCAAAA**.
-There are several ways how to obtain a subsequence from
-this. One variant will be shown next, by making use of
-the method called **.subseq()**.
-Example:
-    seq = Bioroebe::Sequence.new("ATGCATGCAAAA"); seq.subseq(1,3) # => "ATG"
 ## Bioroebe::Protein
 This class is a subclass of class **Bioroebe::Sequence**. The
@@ -2740,15 +2657,26 @@ functionality is also available in another method.
 For now keep this in mind; at some later point I may decide whether
 this class is to be kept or not.
-## Permanently disabling showing the startup-introduction of the Bioshell
+In July 2022 I noticed that the bio-gem has the following method:
-If you do not want to see the start-up intro, you can try
-any of the following:
+    p Bio::AminoAcid['A'] # => "Ala"
-    bioshell --permanently-disable-startup-intro
-    bioshell --permanently-disable-startup-notice
-    bioshell --permanently-no-startup-intro
-    bioshell --permanently-no-startup-info
+I liked this functionality, but class Bioroebe::Protein already
+has a [] method which is used to instantiate a new
+instance of class Bioroebe::Protein. So, a toplevel method
+was added instead.
+Usage example:
+    Bioroebe::Aminoacids.one_to_three('A') # => Ala
+So this is the equivalent to what the bio-gem does, more or
+less.
+If you want to find out the name of a one-letter aminoacid
+you can also use this method:
+    Bioroebe::Protein.name('A') # => "alanine"
 ## Decoding aminoacids
@@ -2934,27 +2862,6 @@ Note that presently (April 2020) not all of PROSITE may be supported
 via this regex, but in the long run the plan is to support all
 of PROSITE's regex expression.
-## Determining how many stop codons existing in a given sequence
-You can use **bin/n_stop_codons_in_this_sequence** to determine
-how many stop codons exist in a given sequence at hand.
-Usage example from the commandline:
-    n_stop_codons_in_this_sequence ATGACGTACGTCAGTCAGTGATAGTAA # => 4
-You can also separate these via a ' ' spacer on the commandline of
-course:
-    n_stop_codons_in_this_sequence ATG ACG TAC GTC AGT CAG TGA TAG TAA # => 4
-Internally this makes use of the method called
-<b>Bioroebe.n_stop_codons_in_this_sequence?</b> or one of its
-aliased names. Usage example for the method, just as in the
-first example shown above:
-    Bioroebe.n_stop_codons_in_this_sequence "ATGACGTACGTCAGTCAGTGATAGTAA" # => 4
 ## AT and GC content
 ![alt text][cat1]
 [cat1]: https://i.imgur.com/Qmd7R0p.png
@@ -3176,47 +3083,45 @@ can try to use:
 On class Bioroebe::Sequence. More customizability may be added
 to that method in this regard, if users need this.
-## The Hydropathy index
+### Obtaining a subsequence from a Bioroebe::Sequence object
-You can display the hydropathy index for aminoacids from within
-the **bioshell**.
+Say that you have the DNA sequence **ATGCATGCAAAA**.
-Simply issue:
+There are several ways how to obtain a subsequence from
+this. One variant will be shown next, by making use of
+the method called **.subseq()**.
-    hydropathy?
+Example:
-## Generate DNA
+    seq = Bioroebe::Sequence.new("ATGCATGCAAAA"); seq.subseq(1,3) # => "ATG"
-You can generate random DNA strings by issuing the following
-code:
+You can also randomize the sequence, via .randomize().
-    x = Bioroebe.random_dna 50 # => "AGACATCCGGCTTGGATACCTCATAAGTCATATCAGCATCGTCGGACATT"
+Example:
-As can be seen in the example above, after the #, a String will be
-returned representing that nucleotide sequence.
+    x = Bioroebe::Sequence.new; x.randomize
-The number given to .random_dna() tells the method how many nucleotides
-should be generated.
+This is similar to the method in Bioruby here:
-## The GFF file format
+https://github.com/bioruby/bioruby/blob/master/lib/bio/sequence/common.rb#L243
-From within the **bioshell** you can analyze .gff and .gff3 files,
-such as by issuing the following command:
+## The Hydropathy index
-    gff3? foobar.gff3
+You can display the hydropathy index for aminoacids from within
+the **bioshell**.
-Evidently for this to work the file at hand has to exist.
+Simply issue:
-## Shuffling the DNA/RNA string in the bioshell
+    hydropathy?
-Via
+## The GFF file format
-    shuffle
+From within the **bioshell** you can analyze .gff and .gff3 files,
+such as by issuing the following command:
-you can randomly rearrange the main DNA/RNA string.
+    gff3? foobar.gff3
-This can be useful if you just wish to quickly "test" new
-compositions of the same nucleotide.
+Evidently for this to work the file at hand has to exist.
 ## The NCBI Taxonomy database (the Taxonomy submodule of the Bioroebe project)
@@ -3353,47 +3258,6 @@ nucleotides by issuing:
     show_individual_weight_of_the_four_dna_nucleotides
-## Truncating output in the bioroebe-shell
-![alt text][cat1]
-[cat1]: https://i.imgur.com/Qmd7R0p.png
-**DNA/RNA sequences** can become very long and then become
-quite difficult to view, read and handle on the commandline.
-Normally the bioroebe shell will truncate output of DNA sequences
-that are "too long". This is mostly done so that working with
-very long sequences becomes a bit more convenient.
-Sometimes this can become an antifeature, though, so the user
-must be able to toggle this at his or her own discretion.
-By default, the bioroebe-shell (bioshell) will always try
-to truncate output, but you can toggle this behaviour by
-issuing:
-    do not truncate
-In theory, other "do not" actions are also supported, or will
-be supported in the future; right now (Oct 2019) this is a bit
-limited.
-From the toplevel, you can use this method:
-    Bioroebe.do_not_truncate
-The above instruction will toggle the truncate behaviour
-to not truncate, ever.
-If you need to do so within the bioshell, this is the way:
-    no_truncate
-Or simply
-    truncate
-This will toggle, like a switch.
 ## Rosalind Challenges
 ![alt text][cat1]
 [cat1]: https://i.imgur.com/Qmd7R0p.png
@@ -3530,31 +3394,6 @@ investing more time into Rosalind. Let's focus on solving
 real, existing problems instead - at the least as far as
 the Bioroebe project is concerned.
-## Numbers as input in the bioshell
-![alt text][cat1]
-[cat1]: https://i.imgur.com/Qmd7R0p.png
-You can input a number in the **BioShell** such as <b style="color: darkblue">3</b>.
-This will attempt to <b>display the first 3 nucleotides</b> of
-the assigned **main sequence**. It will only work if you have
-assigned a sequence prior to that, though.
-Examples:
-    3
-    33
-    15
-## transeq
-![alt text][cat1]
-[cat1]: https://i.imgur.com/Qmd7R0p.png
-You can convert a DNA sequence into an aminoacid sequence by
-doing this:
-    transeq
 ## Align two different sequences
 ![alt text][cat1]
 [cat1]: https://i.imgur.com/Qmd7R0p.png
@@ -3866,22 +3705,6 @@ does not (yet?) have support for comparing two genomes to
 one another and generate a visual map indicating the findings
 there.
-## Do not create directories on startup of the shell
-By default the bioshell will try to create some directories
-on startup. This may not always be desired by the user
-though, so an option has to exist to disable this functionality.
-Internally the variable @internal_hash[:create_directories_on_startup_of_the_shell]
-keeps track of whether directories on startup of the shell will
-be created.
-To disable this behaviour on startup of the bioshell, try
-something like this:
-    bioshell --do-not-create-directories-on-startup
-    bioshell --do-not-create-directories
 ## class Bioroebe::MoveFileToItsCorrectLocation
 This class will move a bio-file to its "correct" location, with respect
@@ -3924,15 +3747,6 @@ synonymous, aka aliases):
     ruler2 25 # ← use 25 characters per line
     ruler2 50 # ← use 50 characters per line
-## Generating a random nucleotide sequence based on frequencies
-If you ever need to generate a nucleotide frequency then you can use
-the following method:
-    Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies
-    Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies 100
-    Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies 500
 ## The Mouse
 This subsection is about the **mouse**, in particular relevant
@@ -4050,57 +3864,24 @@ has". Genes in itself are not that well-defined, so they are not necessarily
 the primary means of complexity. Think of this more as an interactome,
 where RNAs play a major dynamic role as well.
-## Bioroebe::ProfilePattern
+## class Bioroebe::DisplayOpenReadingFrames
-This class can be used to generate nucleotide sequences that
-are not quite "random". For example, to generate sequences
-that may "simulate" a TATA box.
+**class Bioroebe::DisplayOpenReadingFrames**, created in **May 2020**,
+will eventually replace the older **class Bioroebe::ShowOrf**. Thus,
+**class Bioroebe::DisplayOpenReadingFrames** will have to remain quite
+flexible. It shall also support **sixpack** and **showorf** from the
+**Emboss online tools**. (In fact, supporting these two use cases
+was the original reason as to why this class has been created.)
-The idea for this class is to be extended into allowing
-HMMs (Hidden Markov Models) one day.
+Where does the code to this class reside?
-Usage example:
+It can be found here:
-    _ = Bioroebe::ProfilePattern.new(ARGV, :do_not_run_yet)
-    _.generate_sequence_based_on_this_profile
+    bioroebe/utility_scripts/display_open_reading_frames/
+    require 'bioroebe/utility_scripts/display_open_reading_frames/display_open_reading_frames.rb'
-Such a profile will encode the profile specifying the preferred sequence
-letters for each position in a section of DNA. You have to provide
-the Hash into the method generate_sequence_based_on_this_profile() -
-or you use the default Hash, which is stored in the constant
-called **PER_POSITION_HASH**.
-That profile should be a Hash, with keys pointing to A, T, C, G
-and the values being an Array of likelihood chance there,
-as a number, such as 140. These values are also called
-**scores**. Each score contains a number for each position
-that indicates how likely it is to find the given
-nucleotide at that location.
-You can also use this class to generate a random DNA string,
-similar to the method called
-**Bioroebe.generate_random_dna_sequence()**. The difference
-is that class ProfilePattern allows for a bit more fine-tuned
-control. The class will likely be extended in the future too.
-## class Bioroebe::DisplayOpenReadingFrames
-**class Bioroebe::DisplayOpenReadingFrames**, created in **May 2020**,
-will eventually replace the older **class Bioroebe::ShowOrf**. Thus,
-**class Bioroebe::DisplayOpenReadingFrames** will have to remain quite
-flexible. It shall also support **sixpack** and **showorf** from the
-**Emboss online tools**. (In fact, supporting these two use cases
-was the original reason as to why this class has been created.)
-Where does the code to this class reside?
-It can be found here:
-    bioroebe/utility_scripts/display_open_reading_frames/
-    require 'bioroebe/utility_scripts/display_open_reading_frames/display_open_reading_frames.rb'
-The display of this class is typically aimed for the commandline,
-but it is planned to use the class on the www too (via sinatra).
+The display of this class is typically aimed for the commandline,
+but it is planned to use the class on the www too (via sinatra).
 Take note that this class also reports how many ORFs (open reading
 frames) have been found. The number displayed here differs from
@@ -4462,28 +4243,6 @@ the BioRoebe-Shell, then you can use either of the following:
     seq?
     seq_with_tab?
-## Prompt (the shell prompt9
-You can set a <b>custom prompt</b>, via the keywords
-"prompt" or "set_prompt".
-To display the <b>current working directory</b>, do:
-    prompt pwd
-To revert to the old default again, do this:
-    prompt REVERT
-    prompt revert
-    prompt DEFAULT
-    prompt default
-If you do not want to set any prompt, do:
-    prompt none
 ## Leader and Trailer
@@ -4971,17 +4730,17 @@ For now, here is the list:
 ## The T-Bacteriophages
-The following table only shows a short summary for the **T-phages**.
+The following table only shows a short summary for the <b>T-phages</b>.
- name of the phage | Plaque size  |  phage-head diameter (nm) | tail diameter  | latent period (in minutes) | Burst size
--------------------|--------------|---------------------------|----------------|----------------------------|-------------
-        T1         |   medium     |           50              |    150 x 15    |             13             |   180
-        T2         |   small      |         65 x 80           |    120 x 20    |             21             |   120
-        T3         |   large      |           45              |    invisible   |             13             |   300
-        T4         |   small      |         65 x 80           |    120 x 20    |             23.5           |   300
-        T5         |   small      |          100              |      tiny      |             40             |   300
-        T6         |   small      |         65 x 80           |    120 x 20    |             25.5           |   200-300
-        T7         |   large      |           45              |    invisible   |             13             |   300
+ name of the phage | Plaque size  |  phage-head diameter (nm) | tail diameter  | latent period (in minutes) | Burst size  |   n genes
+-------------------|--------------|---------------------------|----------------|----------------------------|-------------|------------
+        T1         |   medium     |           50              |    150 x 15    |             13             |   180       |
+        T2         |   small      |         65 x 80           |    120 x 20    |             21             |   120       |
+        T3         |   large      |           45              |    invisible   |             13             |   300       |
+        T4         |   small      |         65 x 80           |    120 x 20    |             23.5           |   300       |    300
+        T5         |   small      |          100              |      tiny      |             40             |   300       |
+        T6         |   small      |         65 x 80           |    120 x 20    |             25.5           |   200-300   |
+        T7         |   large      |           45              |    invisible   |             13             |   300       |
 The next table will show some phage genomes.
@@ -5392,215 +5151,6 @@ that format.
 Presently (**May 2020**) there is no support for the mmCIF format
 in the Bioroebe project, but this will eventually change.
-## Working with PDB files (.pdb)
-![alt text][cat1]
-[cat1]: https://i.imgur.com/Qmd7R0p.png
-The **PDB**, founded in the year **1971**, holds lots of **atomic
-structures of proteins**.
-In **July 2016** it contained **121000 structures**.
-In **February 2018** it contained **~124000 structures**
-(from X-ray crystallography), and about **~12000 NMR
-structures**. <b>NMR</b> is limited to about <b>350 amino
-acids maximum length</b>, give or take.
-In **April 2020** the PDB contained **163141 structures**.
-We can see that more and more structures are available
-nowadays - a trend that will most likely continue or
-even accelerate. (Let's hope the quality also remains
-high.)
-A typical .pdb file contains entries such as this:
-    RTyp  Num  Atm Res Ch  ResN  X       Y       Z      Occ  Temp   PDB   Line
-    ATOM    1  N   ASP L   1     4.060   7.307   5.186  1.00 51.58  1FDL  93
-    ATOM    2  CA  ASP L   1     4.042   7.776   6.553  1.00 48.05  1FDL  94
-    ATOM    3  N   VAL A  25    32.433  16.336  57.540  1.00 11.92   A1    N
-    ATOM    4  CA  VAL A  25    31.132  16.439  58.160  1.00 11.85   A1    C
-    ATOM    5  C   VAL A  25    30.447  15.105  58.363  1.00 12.34   A1    C
-(Note the first line; **RTyp** and the other labels there are just a header
-explaining the ATOM entries below that line).
-The sequence starts from the N-terminal residue for proteins; see
-the <b>Atm</b> entry at <b>Num 1</b>.
-The **meaning of these entries** is as follows:
-    1) RTyp: Record Type
-    2) Num:  Serial number of the atom.  Each atom has a unique serial number.
-    3) Atm:  Atom name (in IUPAC format).
-    4) Res:  Residue name (IUPAC format).
-    5) Ch:   Chain to which the atom belongs (in this case, L for light chain of an antibody).
-    6) ResN: Residue sequence number. This will be incremental e. g. 1, 2, 3, 4 and so forth.
-    7,8,9) X, Y, Z: Cartesian coordinates specifying atomic position in space.
-    10) Occ: Occupancy factor
-    11) Temp: Temperature factor (atoms disordered in the crystal have high
-              temperature factors; they are "wobbly" with a high factor.
-              This is also called the B-factor).
-    12) PDB: The PDB data file unique identifier.
-    13) Line: Line (record) number in the data file.
-Typically the rightmost entry, the last one, specifies
-which atom it is. An **H** stands for a hydrogen atom; the other atoms
-are "heavy" atoms (heavier than hydrogen most definitely).
-Most .pdb files will contain **SEQRES** entries. These entries will list
-the primary sequence of the polymeric molecules present in the entry.
-You can notice this by looking at the standard 3-character code
-used by SEQRES here, for the canonical amino acids. So, for instance,
-the amino acids that will be mentioned in a SEQRES entry are
-ALA, CYS, ASP, GLU, PHE, GLY, HIS, ILE, LYS, LEU, MET, ASN,
-PRO, GLN, ARG, SER, THR, VAL, TRP and TYR. You can use the
-method **Bioroebe.three_to_one()** to convert back to the
-one-letter chain such as follows:
-    Bioroebe.three_to_one('PHE') # => "F"
-The data in a .pdb file need not necessarily only be a protein, with
-a specific aminoacid sequence. It may also include DNA. An example
-for such a molecule is
-<b><a href="http://rcsb.org/pdb/explore/explore.do?structureId=2DGC">2dgc</a></b>,
-which includes a protein chain and a DNA chain.
-As far as the **bioroebe project** is concerned, you can parse .pdb files
-via the following class:
-    Bioroebe::ParsePdbFile.new
-    Bioroebe::ParsePdbFile.new(path_to_the_pdb_file_here)
-    Bioroebe::ParsePdbFile.new('/foo/bar/ack.pdb')
-This class also allows some shortcuts for integrated .pdb files,
-that is files that are bundled with the bioroebe project:
-    Bioroebe::ParsePdbFile.new ':1fat'
-This requires a String because ruby symbols may not start with
-a number. Note that this also works through the commandline,
-such as:
-    parse_pdb_file :1fat
-A shell such as bash does not understand ruby symbols, so instead
-a string will be passed in, being :1fat. The ParsePdbFile will
-handle this correctly internally.
-Note that a small bug was fixed in the file parse_pdb_file.rb;
-some entries were skipped due to an erroneous loop in the ruby
-file. This was corrected in **May 2020**.
-In **March 2021** the ability to use entries such as ':1fat'
-was removed again; the code remains though. The reason why
-this was removed was that the .pdb files are quite large,
-so distributing them via the bioroebe project makes no real
-sense. Consider simply downloading the .pdb files; you
-can use this from the bioshell or via something
-like:
-    pdb 5TIM
-Note that you can also return the aminoacid-sequence from a
-.pdb file directly, since as of **May 2020**.
-Example for this:
-    Bioroebe.return_aminoacid_sequence_from_this_pdb_file "1VII.pdb" # => "MLSDEDFKAVFGMTRSAFANLPLWKQQNLKKEKGLF"
-The first argument should be **the path to the (local)
-.pdb file at hand**. (In theory support for remote .pdb
-files could also be added easily, but right now this
-is not possible, so you have to download it first.)
-The **specification for .pdb files** can be read at the following
-two remote resources:
-http://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html
-http://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#ATOM
-Note that the parse_pdb_file.rb can also do some additional
-things, such as calculating the maximum distance between
-atoms in that file, via the method
-**.try_to_determine_the_max_distance_between_the_atoms_in_this_protein()**.
-If you wish to report the secondary structures from a given .pdb file
-then you can use the following class:
-    require 'bioroebe/pdb/report_secondary_structures_from_this_pdb_file.rb'
-    Bioroebe::ReportSecondaryStructuresFromThisPdbFile.new
-    Bioroebe::ReportSecondaryStructuresFromThisPdbFile.new('foobar.pdb')
-If you wish to obtain the FASTA sequence of a particular remote
-.pdb file then you can use this API:
-    x = Bioroebe.return_fasta_sequence_from_this_pdb_file "2bts" # => "MLSDEDFKAVFGMTRSAFANLPLWKQQNLKKEKGLF"
-Keep in mind that this is the FASTA sequence; the .pdb file itself
-has another format, and contains a lot more information, such as
-the various ATOM entries.
-Since as of **June 2020** the command **fetch** also works from
-within the Bioshell, similar to how pymol **works**. This allows
-us to quickly download a remote .pdb file.
-    fetch 2BTS
-You can also use the following toplevel-API to download a remote
-.pdb file:
-    Bioroebe.download_this_pdb
-    Bioroebe.download_this_pdb '355D'
-    Bioroebe.download_this_pdb '1K4R' # This is the Dengue Virus
-Note that this will be automatically moved to the "correct" default
-position in the bioroebe-project, under the **pdb/** subdirectory.
-You can also invoke this script from the commandline via
-**bin/download_this_pdb**, like in this way:
-    download_this_pdb 355D
-This works with several .pdb files in one go as well:
-    download_this_pdb 1NR6 2F9Q 3TDA 2HI4 2V0M
-They would all be downloaded one after the other. Be aware that
-this will overwrite the old .pdb files on that position, so
-if you don't want this, I recommend making a backup of the
-**pdb/** subdirectory before invoking the above call.
-You can also turn the FASTA sequence stored in a .pdb file into
-a .fasta file, via **--create-fasta-file**.
-Usage examples:
-    parsedb 1NR6 --create-fasta-file
-    parsedb 2F9Q --create-fasta-file
-    parsedb 3TDA --create-fasta-file
-    parsedb 2HI4 --create-fasta-file
-    parsedb 2V0M --create-fasta-file
-So if you have a file called <b>1NR6.pdb</b> and you use
-the first input, a .fasta file will be created. If such
-a .pdb file does not exist then this will not work, so
-make sure to download the .pdb file before invoking
-this commandline-flag.
-Last but not least, the following table shall document the
-PDB format - it is not yet complete, but it is intended
-to add the remaining datasets eventually:
-    Record Name  Describes
-    MODRES       Modifications to standard residues
-    HET          Nonstandard residues (as well as ligands, ions and water)
-    HETNAM       Full chemical name of the residue
-    HETSYM       Synonyms for the residue
-    FORMUL       Chemical formula of the residue
-    KEYWDS       specifies keywords, such as "FK506 BINDING PROTEIN, FKBP12, CIS-TRANS PROLYL-ISOMERASE, ROTAMASE"
 ## Sugars and glyco-patterns
 I am currently having to do an assignment related to glyco-patterns
@@ -5764,6 +5314,9 @@ like this:
 <img src="https://i.imgur.com/vr2kEBz.png" style="margin: 1em; margin-left: 3em">
+As of <b>July 2022</b> invalid amino acids will be automatically
+filtered away before being assigned to the input.
 ## Colourizing hydrophilic and hydrophobic aminoacids on the commandline
 Via class **Bioroebe::ColourizeHydrophilicAndHydrophobicAminoacids** you
@@ -5777,35 +5330,36 @@ Example output for this:
 This subsection contains some information about proteases.
-trypsin:
+Trypsin:
 https://en.wikipedia.org/wiki/Trypsin
-cuts at: Trypsin cuts peptide chains mainly at the carboxyl
+<b>cuts at</b>: Trypsin cuts peptide chains mainly at the carboxyl
 side of the amino acids lysine or arginine.
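The trypsin rule above can be sketched in plain ruby. This is only a minimal illustration and not part of the Bioroebe API: the method name `trypsin_digest_sketch` is hypothetical, and the sketch deliberately ignores every exception to the rule (the text above says "mainly" for a reason).

```ruby
# Hypothetical helper, not part of the Bioroebe API.
# Splits a one-letter aminoacid sequence after every K (lysine) or
# R (arginine), that is on the carboxyl side of these residues.
def trypsin_digest_sketch(sequence)
  # Each fragment is either "anything up to and including K/R" or the
  # trailing rest of the sequence that contains no K/R at all.
  sequence.upcase.scan(/[^KR]*[KR]|[^KR]+\z/)
end

p trypsin_digest_sketch('MKTAYIAKQRQISFVK')
# => ["MK", "TAYIAK", "QR", "QISFVK"]
```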
-chymotrypsin:
+Chymotrypsin:
 https://en.wikipedia.org/wiki/Chymotrypsin
-cuts at: Chymotrypsin preferentially cleaves peptide amide
+<b>cuts at</b>: Chymotrypsin preferentially cleaves peptide amide
 bonds where the side chain of the amino acid N-terminal
-to the scissile amide bond is a large hydrophobic amino
-acid (tyrosine, tryptophan, and phenylalanine).
+to the scissile amide bond is <b>a large hydrophobic amino</b>
+acid (specifically: tyrosine, tryptophan, and phenylalanine).
+Chymotrypsin will cleave proteins on the <b>carboxyl side</b>
+of aromatic or large hydrophobic amino acids.
-thrombin:
+Thrombin:
 https://en.wikipedia.org/wiki/Thrombin
-cuts at: Thrombin acts as a serine protease that converts
+<b>cuts at</b>: Thrombin acts as a serine protease that converts
 soluble fibrinogen into insoluble strands of fibrin. It
 catalyzes the hydrolysis of <b>Arg-Gly</b> bonds in
 particular peptide sequences only.
-plasmin:
+Plasmin:
 https://en.wikipedia.org/wiki/Plasmin
-cuts at: Plasmin is a serine protease.
+<b>cuts at</b>: Plasmin is a serine protease.
-papain:
+Papain:
 https://en.wikipedia.org/wiki/Papain
-cuts at: Papain prefers to cleave after an
-arginine or lysine preceded by a hydrophobic
-unit (Ala, Val, Leu, Ile, Phe, Trp, Tyr) and
-not followed by a valine.
+<b>cuts at</b>: Papain prefers to cleave after an arginine or
+lysine preceded by a hydrophobic unit (Ala, Val, Leu, Ile,
+Phe, Trp, Tyr) and not followed by a valine.
 factor Xa:
@@ -5817,8 +5371,8 @@ Some proteins may permanently reside in the lumen of the
 Often such proteins will have a special signal sequence attached
 to their **C-terminal part**, such as **KDEL** (Lys-Asp-Glu-Leu).
-KDEL is not the only signal that may be used, though. Some species
-may use different signals, such as:
+<b>KDEL</b> is not the only signal that may be used, though. Some
+species may use different signals, such as:
  aminoacids  | species
 -------------|------------------------------------------------------------
@@ -5828,8 +5382,9 @@ may use different signals, such as:
   ADEL       | Schizosaccharomyces pombe (fission yeast)
   SDEL       | Plasmodium falciparum
-If you work with the bioshell then you can simply use this method
-to query whether the given aminoacid sequence has a KDEL sequence:
+If you work with the <b>bioshell</b> then you can simply use this
+method to query whether the given aminoacid sequence has a KDEL
+sequence:
     KDEL?
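If you want to do a similar check outside of the bioshell, a minimal sketch in plain ruby could look like the following. The constant and the method name are hypothetical and are not the implementation behind "KDEL?"; only the signals visible in the table above (KDEL, ADEL, SDEL) are assumed here.

```ruby
# Hypothetical names; the bioshell's own "KDEL?" may work differently.
ER_RETENTION_SIGNALS_SKETCH = %w[KDEL ADEL SDEL].freeze

# Returns the matching C-terminal signal as a String, or nil if the
# aminoacid sequence does not end with any of the known signals.
def er_retention_signal_sketch(aminoacid_sequence)
  sequence = aminoacid_sequence.upcase
  ER_RETENTION_SIGNALS_SKETCH.find { |signal| sequence.end_with?(signal) }
end

p er_retention_signal_sketch('MAAVLTKDEL') # => "KDEL"
p er_retention_signal_sketch('KDELMAAVLT') # => nil
```

Note that the signal only counts at the C-terminal end, which is why the second call returns nil.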
@@ -6240,8 +5795,6 @@ Next, do something such as this:
 This will show the distribution of the oligos.
 ## Number of chromosomes in different species
-![alt text][cat1]
-[cat1]: https://i.imgur.com/Qmd7R0p.png
 Name of the organism | Latin name   | Number of chromosomes
 ---------------------|--------------|-----------------------
@@ -6319,112 +5872,6 @@ So this is what would be returned:
     Bioroebe::DetectMinimalCodon[["TTT", "TTC"]] # => ["TTY"]
-## Codon Usage
-This **paragraph** deals with some aspects of **codon usage** in different
-organisms.
-Let us first define the term <b>codon usage</b>. In order to do so,
-we also have to define what a <b>codon</b> is, so let's start with that.
-A <span style="color: darkgreen; font-weight: bold">codon</span> is
-essentially the basic code used in DNA to denote which particular
-**aminoacid** corresponds to these (three) nucleotide base pairs.
-A codon is thus **a series of three nucleotides**, also called
-a <b>triplet</b>.
-When we use the term <b>base pairs</b>, we refer to **double-stranded DNA**,
-abbreviated as <b>dsDNA</b>. The codon is, however, only found
-in a single-stranded molecule, even within dsDNA. Since some parts of
-a **dsDNA** in any given genome give rise to a, more or less, complementary
-copy in the form of **mRNA**, the codons that are actually used are found in the
-corresponding mRNA. (Remember that mRNA differs from DNA in that there
-will be Uracil rather than Thymine; otherwise it is the same, sequence-wise.
-Of course it uses another sugar (Ribose), but remember we are here mostly
-interested in the **information-containing part**, not the full chemical
-structure.)
-The codon is thus found on the mRNA and since mRNA is mostly
-single-stranded, the codon is a component of the mRNA. It is
-where the two subunits of the ribosome are assembled (or more
-accurately, the smaller subunit scans along the mRNA until it
-detects a start codon). Mind you, this subsection will not go into
-all relevant details, so just keep in mind that the codon is the
-part that will eventually be "translated" at the ribosome into
-a corresponding aminoacid, excluding stop codons at the end.
-Now - different organisms use **different frequencies of codons**.
-**Codon usage** thus describes the fact that many proteins in
-these different organisms make use of certain codons with a
-**substantially higher frequency than other codons**. We can
-use statistics to infer this on a global (proteome) level
-too.
-Remember that the genetic code is **degenerate**, meaning that
-you have a few aminoacids that are encoded only by one codon
-(<b>Tryptophan</b> and <b>Methionine</b>), whereas the other
-aminoacids are encoded by more than one codon - thus, at the
-very least two codons. Note that the latter codons, if they
-code for the **same** aminoacid, are also called <b>synonymous
-codons</b>.
-This means that if you have any given aminoacid chain, you can have
-several different sequences (and codons in these sequences), which
-ultimately means that you can have different DNA sequences code for
-the very same aminoacid chain.
-Usually the third base of a codon has the least influence on
-codon meaning. This is also called <b>wobbling</b> - since
-the anticodon loop on the tRNA is in the reverse direction,
-and the wobble position refers to the tRNA, this means that
-the wobble-position is at the 5'-end of the tRNA anticodon.
-Now a few words about functionality related to codons and codon
-usage in the Bioroebe project.
-Say that you have a long DNA sequence; let's pick a sample
-for now, such as:
-    ATGGGCGGGGTGATGGCAATGATGCCCCCGATGATG
-You can analyze the codons used via class **ShowCodonUsage**:
-    show_codon_usage ATGGGCGGGGTGATGGCAATGATGCCCCCGATGATG
-This class can be found at <b>bioroebe/codons/show_codon_usage.rb</b>.
-It will report the top 5 codons in use and also output the
-frequency hash on the commandline.
-You can use this from ruby too, via this toplevel method:
-    Bioroebe.codon_frequencies_of_this_sequence(ARGV)
-If you want to look at the actual codon frequencies used
-by different organisms, have a look here:
-http://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=11076&aa=9&style=N
-This is an excellent resource.
-## Determining the frequencies of aminoacids in a given aminoacid (protein) sequence
-If you quickly wish to determine the aminoacid composition, as a
-Hash, you can use **bin/aminoacid_frequencies**.
-Example from the commandline for this:
-    aminoacid_frequencies MVTDEGAIYFTKDAARNWKAAVEETVSATLNRTVSSGITGASYYTGTFST
-Example from within bioroebe itself (and thus ruby):
-    require 'bioroebe/frequencies.rb'
-    Bioroebe.aminoacid_frequencies('MVTDEGAIYFTKDAARNWKAAVEETVSATLNRTVSSGITGASYYTGTFST')
-The latter will return a Hash that you can then make further use of, such as:
-    {"M"=>1, "V"=>4, "T"=>9, "D"=>2, "E"=>3, "G"=>4, "A"=>7, "I"=>2, "Y"=>3, "F"=>2, "K"=>2, "R"=>2, "N"=>2, "W"=>1, "S"=>5, "L"=>1}
 ## The Levenshtein distance
 The <b>Levenshtein distance</b> - also called a '**string metric**' - was formulated
@@ -6842,6 +6289,34 @@ change A: teal or C: slateblue to some other colour; these are HTML
 colours, so it is recommended to use the names of these HTML
 colours).
+In <b>July 2022</b> the method <b>Bioroebe.colourize_this_fasta_sequence</b>
+was extended slightly. You can now attach a "ruler" to the output, that
+is a numbered series that shows the nucleotide position, on the commandline.
+Example for this:
+    puts Bioroebe.colourize_this_fasta_sequence('ATGAAATCGCGCGTGCCGCGCGCGC'\
+     'GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCTGCGCGCGCGCGCGCGCGCGCG'\
+     'TGCCGCGCGCAGGCGGCGGCGGCGGCGGCGGCG'
+    ) { :with_ruler }
+By default this will use a white colour on black background. If you want to
+modify the foreground colour you can pass the colour name to the method,
+such as via:
+    puts Bioroebe.colourize_this_fasta_sequence('ATGAAATCGCGCGTGCCGCGCGCGC'\
+     'GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCTGCGCGCGCGCGCGCGCGCGCG'\
+     'TGCCGCGCGCAGGCGGCGGCGGCGGCGGCGGCG'
+    ) { :with_ruler_steelblue_colour }
+The following image shows how this can be used on the commandline:
+<img src="https://i.imgur.com/ucVEVnK.png" style="margin: 1em; border: 3px solid black">
+At a later time this may be extended to allow for use in a webpage,
+that is to embed these strings directly into HTML or .php or
+.cgi.
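To illustrate the idea of such a ruler, here is a minimal sketch in plain ruby. It is not the code used by <b>Bioroebe.colourize_this_fasta_sequence</b>; the method name `ruler_sketch`, the step size of 10 and the exact layout are assumptions made for this example only, and no colours are applied.

```ruby
# Hypothetical helper, not the Bioroebe implementation.
# Builds a line of the same length as the sequence, in which every
# step-th position is labelled with its (1-based) number.
def ruler_sketch(sequence, step = 10)
  ruler = ' ' * sequence.size
  (step..sequence.size).step(step) do |position|
    label = position.to_s
    # Right-align the label, so that its last digit sits directly
    # above the nucleotide at that position.
    ruler[position - label.size, label.size] = label
  end
  ruler
end

sequence = 'ATGAAATCGCGCGTGCCGCGCGCGC'
puts ruler_sketch(sequence)
puts sequence
```

The last digit of each label marks the position, so the "0" of "10" is printed above the tenth nucleotide.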
 If you wish to show a **chunked display** of the dataset (nucleotides
 normally) then you can use the following API:
@@ -7365,16 +6840,6 @@ This would notify the bioshell that only nucleotides from position
 51 to (including) position 3251 will be colourized, when doing another
 "ORF?" invocation.
-## Longest substring
-Within the Bioroebe::Shell you can determine the longest substring,
-including gaps, like so:
-    longest_substring? ATTATTGTT | ATTATTCTT
-Note that this will make use of the diff-lcs gem, which uses
-the McIlroy-Hunt algorithm.
 ## Restriction Enzymes
 This **subsection** will eventually be expanded to explain various things about
@@ -8733,6 +8198,22 @@ The images that can be generated via this may look as follows:
 <img src="https://i.imgur.com/fWwD1fj.png" style="margin: 1em; margin-left: 2em">
+Let's look at another example.
+Say you input the following sequences there:
+    AGVV
+    AGVV
+    AGVV
+    AGVV
+    AGGV
+    AGGV
+    AGGV
+The resulting image that is generated is:
+<img src="https://i.imgur.com/3wWApIQ.png" style="margin: 1em; margin-left: 2em">
 ## The Kozak Sequence
 The ribosome usually scans for a **AUG** codon. But there are
@@ -8872,85 +8353,6 @@ Usage Example:
     pfasta insulin_mRNA.fasta  --toprotein
-## Determining the codon frequencies from the commandline
-In April 2022 I noticed that one use case is to show the codon
-frequencies of a given sequence - typically a nucleotide sequence.
-For aminoacids there already was an executable, at **bin/aminoacid_frequencies**.
-So, following that logic, a new executable was added at
-**bin/codon_frequency**. This will show the Hash of the codon
-frequencies, as a String, on the commandline.
-Usage example:
-     codon_frequency ATTCGTACGATCGACTGACTGACAGTCATTCGT
-The output of this would be the following:
-    AUU: 2
-    CGU: 2
-    ACG: 1
-    AUC: 1
-    GAC: 1
-    UGA: 1
-    CUG: 1
-    ACA: 1
-    GUC: 1
-## Showing the codon frequency via countcodon
-https://www.kazusa.or.jp/codon/countcodon.html offers a rather useful
-functionality via a simple web-interface, in that you can pass in a mRNA
-sequence, and it will then show the codon frequency/likelihood of that
-sequence - all codons in that sequence, that is. This can be extended
-to all protein-coding genes in a given genome, and will thus be useful
-for a researcher who may be interested in determining the codon frequency
-in general, across all genes in that given genome.
-You can test it with an input sequence. For instance, the following
-sequence:
-    ATTCGTACGATCGACTGACTGACAGTCATTCGTAGTACGATCGACTGACTGACAGTCATTCGTACGATCGACTGACTGACAAGTCATTCGTACGATCGACTGACTTGACAGTCATAA
-Would yield this result:
-    fields: [triplet] [frequency: per thousand] ([number])
-    UUU  0.0(     0)  UCU  0.0(     0)  UAU  0.0(     0)  UGU  0.0(     0)
-    UUC  0.0(     0)  UCC  0.0(     0)  UAC 25.6(     1)  UGC  0.0(     0)
-    UUA  0.0(     0)  UCA 25.6(     1)  UAA 25.6(     1)  UGA102.6(     4)
-    UUG  0.0(     0)  UCG 25.6(     1)  UAG  0.0(     0)  UGG  0.0(     0)
-    CUU  0.0(     0)  CCU  0.0(     0)  CAU 25.6(     1)  CGU 76.9(     3)
-    CUC  0.0(     0)  CCC  0.0(     0)  CAC  0.0(     0)  CGC  0.0(     0)
-    CUA  0.0(     0)  CCA  0.0(     0)  CAA  0.0(     0)  CGA 25.6(     1)
-    CUG102.6(     4)  CCG  0.0(     0)  CAG 25.6(     1)  CGG  0.0(     0)
-    AUU 76.9(     3)  ACU 25.6(     1)  AAU  0.0(     0)  AGU 51.3(     2)
-    AUC 76.9(     3)  ACC  0.0(     0)  AAC  0.0(     0)  AGC  0.0(     0)
-    AUA  0.0(     0)  ACA 76.9(     3)  AAA  0.0(     0)  AGA  0.0(     0)
-    AUG  0.0(     0)  ACG 76.9(     3)  AAG  0.0(     0)  AGG  0.0(     0)
-    GUU  0.0(     0)  GCU  0.0(     0)  GAU 25.6(     1)  GGU  0.0(     0)
-    GUC 51.3(     2)  GCC  0.0(     0)  GAC 76.9(     3)  GGC  0.0(     0)
-    GUA  0.0(     0)  GCA  0.0(     0)  GAA  0.0(     0)  GGA  0.0(     0)
-    GUG  0.0(     0)  GCG  0.0(     0)  GAG  0.0(     0)  GGG  0.0(     0)
-At any rate, the individual functionality for that is also available
-within the Bioroebe project since as of **April 2022**.
-The method that does so is:
-    Bioroebe.frequency_per_thousand
-    Bioroebe.frequency_per_thousand('ATTCGTACGATCGACTGACTGACAGTCATTCGTAGTACGATCGACTGACTGACAGTCATTCGTACGATCGACTGACTGACAAGTCATTCGTACGATCGACTGACTTGACAGTCATAA') # Usage example here.
-At a later time sinatra-bindings as well as ruby-gtk3 bindings will
-be added, and possibly ruby-libui bindings as well, for windows
-support. What is missing is support for different codon tables in
-different species, but that may be added at a later time as well
-- for now it seemed more important to offer the functionality.
 ## class Bioroebe::Protein
 **class Bioroebe::Protein** can be used to store a protein sequence.
@@ -9183,6 +8585,1036 @@ time being it is what it is. At a later point in time test cases
 may be added to check whether it performs correctly or whether it
 does not.
+The other rules, also published in 2004, are the Reynolds rules. Code
+support was added to the Bioroebe project in <b>June 2022</b>, but
+it was not tested yet, so the implementation may be incorrect.
+## The Bioroebe::Shell interface
+The following subsection specifically handles information
+pertaining to the <b>Bioroebe::Shell</b> interface of the
+<b>bioroebe project</b>. It is also called <b>bioshell</b>,
+to simplify spelling it.
+### Numbers as input in the bioshell
+![alt text][cat1]
+[cat1]: https://i.imgur.com/Qmd7R0p.png
+You can input a number in the **BioShell** such as <b style="color: darkblue">3</b>.
+This will attempt to <b>display the first 3 nucleotides</b> of
+the assigned **main sequence**. It will only work if you have
+assigned a sequence prior to that, though.
+Examples:
+    3
+    33
+    15
+### transeq
+![alt text][cat1]
+[cat1]: https://i.imgur.com/Qmd7R0p.png
+You can convert a DNA sequence into an aminoacid sequence by
+doing this:
+    transeq
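What such a translation does can be sketched in a few lines of plain ruby. This is not how "transeq" is implemented in the bioshell; all names below are hypothetical. The sketch assumes the standard genetic code, reads only the first frame, marks stop codons as `*` and unknown codons as `X`, and silently drops an incomplete trailing codon.

```ruby
# Hypothetical names; this is not the Bioroebe implementation.
BASES_SKETCH = %w[T C A G].freeze
# Standard genetic code, ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINOACIDS_SKETCH =
  'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'

# Hash such as { "TTT" => "F", "TTC" => "F", ... }
CODON_TABLE_SKETCH = BASES_SKETCH.product(BASES_SKETCH, BASES_SKETCH)
                                 .map(&:join)
                                 .zip(AMINOACIDS_SKETCH.chars)
                                 .to_h
                                 .freeze

# Translates frame 1 of a DNA (or RNA) String into one-letter aminoacids.
def translate_sketch(sequence)
  sequence.upcase.tr('U', 'T').scan(/.{3}/).map { |codon|
    CODON_TABLE_SKETCH.fetch(codon, 'X')
  }.join
end

puts translate_sketch('ATGGGCTAA') # => MG*
```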
+### Shuffling the DNA/RNA string in the bioshell
+![alt text][cat1]
+[cat1]: https://i.imgur.com/Qmd7R0p.png
+Via
+    shuffle
+you can <b>randomly rearrange the main DNA/RNA string</b>
+that is used by the <b>Bioroebe::Shell</b>.
+This can be useful if you just wish to quickly "test"
+new arrangements of the same nucleotides.
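Outside of the bioshell the same effect can be had with plain ruby. The following is only a sketch with a hypothetical method name, not the code behind "shuffle"; the optional `random:` argument exists merely so that the result can be made reproducible.

```ruby
# Hypothetical helper, not part of the Bioroebe API.
# Returns a new String with the same nucleotides in random order.
def shuffle_sketch(sequence, random: Random.new)
  sequence.chars.shuffle(random: random).join
end

puts shuffle_sketch('ATGCCGTTA')
```

The composition stays the same; only the order of the nucleotides changes.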
+### Permanently disabling showing the startup-introduction of the Bioshell
+![alt text][cat1]
+[cat1]: https://i.imgur.com/Qmd7R0p.png
+If you do not want to see the start-up intro, you can try
+any of the following:
+    bioshell --permanently-disable-startup-intro
+    bioshell --permanently-disable-startup-notice
+    bioshell --permanently-no-startup-intro
+    bioshell --permanently-no-startup-info
+### Longest substring
+![alt text][cat1]
+[cat1]: https://i.imgur.com/Qmd7R0p.png
+Within the Bioroebe::Shell you can determine the longest substring,
+including gaps, like so:
+    longest_substring? ATTATTGTT | ATTATTCTT
+Note that this will make use of the diff-lcs gem, which uses
+the McIlroy-Hunt algorithm.
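To give an idea of what is computed here, the following sketch determines a longest common subsequence in plain ruby. It assumes that "including gaps" refers to a common subsequence. It uses a simple dynamic-programming table rather than the McIlroy-Hunt algorithm, and the method name is hypothetical; it is neither the diff-lcs code nor the bioshell implementation.

```ruby
# Hypothetical helper; plain dynamic programming, not McIlroy-Hunt.
def longest_common_subsequence_sketch(a, b)
  # table[i][j] holds the LCS length of a[0, i] and b[0, j].
  table = Array.new(a.size + 1) { Array.new(b.size + 1, 0) }
  a.each_char.with_index(1) do |char_a, i|
    b.each_char.with_index(1) do |char_b, j|
      table[i][j] = if char_a == char_b
                      table[i - 1][j - 1] + 1
                    else
                      [table[i - 1][j], table[i][j - 1]].max
                    end
    end
  end
  # Walk back through the table to recover one such subsequence.
  result = +''
  i = a.size
  j = b.size
  while i > 0 && j > 0
    if a[i - 1] == b[j - 1]
      result.prepend(a[i - 1])
      i -= 1
      j -= 1
    elsif table[i - 1][j] >= table[i][j - 1]
      i -= 1
    else
      j -= 1
    end
  end
  result
end

puts longest_common_subsequence_sketch('ATTATTGTT', 'ATTATTCTT') # => ATTATTTT
```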
+### Do not create directories on startup of the shell
+![alt text][cat1]
+[cat1]: https://i.imgur.com/Qmd7R0p.png
+By default the <b>bioshell</b> will try to create some directories
+on startup. This may not always be desired by the user, though,
+so an option has to exist to <b>disable</b> this functionality.
+Internally the variable @internal_hash[:create_directories_on_startup_of_the_shell]
+keeps track of whether directories on startup of the shell will
+be created.
+To disable this behaviour on startup of the bioshell, try
+something like this:
+    bioshell --do-not-create-directories-on-startup
+    bioshell --do-not-create-directories
+### Generating and assigning a random amount of nucleotides
+![alt text][cat1]
+[cat1]: https://i.imgur.com/Qmd7R0p.png
+Via:
+    random 555
+you can "generate" 555 random nucleotides (DNA that is) and
+assign it to the main sequence in use by the bioshell. This
+is mostly a convenience feature, if you want to debug something
+quickly.
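In plain ruby, generating such a random DNA string takes only a few lines. This is a sketch with a hypothetical method name and assumes that all four nucleotides are equally likely; the bioshell may do this differently.

```ruby
# Hypothetical helper, not part of the Bioroebe API.
# Returns a random DNA String of the given length, with A, T, G and C
# being equally likely at every position.
def random_dna_sketch(length, random: Random.new)
  Array.new(length) { %w[A T G C].sample(random: random) }.join
end

puts random_dna_sketch(555)
```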
+### Determining the log directory for the Bioroebe::Shell component
+![alt text][cat1]
+[cat1]: https://i.imgur.com/Qmd7R0p.png
+Via:
+    bioshell_log_dir?
+you can determine the log-directory output for the bioshell
+component. On my home system this will default to
+<b>/home/Temp/bioroebe/bioshell/</b>.
+### Prompt (the shell prompt of the bioshell)
+![alt text][cat1]
+[cat1]: https://i.imgur.com/Qmd7R0p.png
+You can set a <b>custom prompt</b> in the bioshell, via
+the keywords "<b>prompt</b>" or "<b>set_prompt</b>".
+To display the <b>current working directory</b>, do:
+    prompt pwd
+To revert to the old default again, do this:
+    prompt REVERT
+    prompt revert
+    prompt DEFAULT
+    prompt default
+If you do not want to set any prompt, do:
+    prompt none
+### Random stuff - generating random DNA sequences in the bioshell
+![alt text][cat1]
+[cat1]: https://i.imgur.com/Qmd7R0p.png
+You can <b>generate random DNA sequences</b> in the
+<b>bioshell</b> via:
+    random dna 20
+    random dna 25
+    random dna 30
+    # or simpler
+    random 20
+    random 25
+    random 30
+This will generate random DNA sequences, with a length
+of 20, 25, 30, respectively. This may not be very useful
+but it was important that this functionality is made
+available somewhere. Sometimes you may not even care
+about the sequence and just need a "filler" sequence,
+so randomness has to be part of the Bioroebe project
+as well.
+You can also use some toplevel-methods to generate, e. g.
+20 random aminoacids. Have a look at the following
+<b>toplevel API</b>:
+    Bioroebe.random_aminoacid? 20 # => "UAVHYQQESWUYAOVESEIY"
+Note that there may exist other APIs within the Bioroebe project
+that do the same as well.
+If you would like to use a ruby-gtk3 widget have a look
+at **RandomSequence**, under **bioroebe/gtk3/random_sequence/**.
+It works with aminoacids, DNA and RNA, and allows the user to
+create random sequences. (If you need weighted randomness then
+you currently have to use the commandline variant. Perhaps I may
+add support into the GUI directly for this one day.)
+### Deprecations within the Bioroebe::Shell
+![alt text][cat1]
+[cat1]: https://i.imgur.com/Qmd7R0p.png
+Over the years the Bioroebe::Shell changed quite a bit.
+This subsection here will list a few of these changes
+or rather, the deprecations.
+**raw_sequence**: removed in June 2022 completely. It is
+simpler to handle sequences via Bioroebe::Sequence
+instead.
+<b>@internal_hash[:array_sequences]</b> was no longer in
+use, so it was removed in July 2022.
+### Chop off nucleotides within the Bioroebe::Shell
+![alt text][cat1]
+[cat1]: https://i.imgur.com/Qmd7R0p.png
+You can use the following syntax to chop away until you find
+a particular substring, in the bioshell:
+    chop_to ATG
+This functionality was specifically added to find the first
+ATG codon.
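The idea can be sketched in plain ruby as follows. The method name is hypothetical, and it is an assumption of this sketch that the result still begins with the substring that was searched for, and that the sequence is left unchanged if the substring does not occur at all; the bioshell may behave differently in both cases.

```ruby
# Hypothetical helper, not the bioshell implementation.
# Removes everything in front of the first occurrence of target.
def chop_to_sketch(sequence, target = 'ATG')
  index = sequence.index(target)
  # Leave the sequence untouched if the target was not found.
  index ? sequence[index..-1] : sequence
end

puts chop_to_sketch('CCGATGAAATAG') # => ATGAAATAG
```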
+### Truncating output in the bioroebe-shell
+![alt text][cat1]
+[cat1]: https://i.imgur.com/Qmd7R0p.png
+**DNA/RNA sequences** can become very long and then become
+quite difficult to view, read and handle on the commandline.
+Normally the bioroebe shell will truncate output of DNA sequences
+that are "too long". This is mostly done so that working with
+very long sequences becomes a bit more convenient.
+Sometimes this can become an antifeature, though, so the user
+must be able to toggle this at his or her own discretion.
+By default, the bioroebe-shell (bioshell) will always try
+to truncate output, but you can toggle this behaviour by
+issuing:
+    do not truncate
+In theory, other "do not" actions are also supported, or will
+be supported in the future; right now (Oct 2019) this is a bit
+limited.
+From the toplevel, you can use this method:
+    Bioroebe.do_not_truncate
+The above instruction will toggle the truncate behaviour
+to not truncate, ever.
+If you need to do so within the bioshell, this is the way:
+    no_truncate
+Or simply
+    truncate
+This will toggle, like a switch.
+### Working with .pdb files in the bioshell
+![alt text][cat1]
+[cat1]: https://i.imgur.com/Qmd7R0p.png
+This subsection only very briefly mentions how to work with
+.pdb files in the bioshell. See other parts of this
+document for a more extensive overview how you can work
+with .pdb files via the Bioroebe project.
+If you input something that ends with .pdb, like this:
+    1fat.pdb
+and no such file currently exists at
+/home/Temp/bioroebe/pdb/1fat.pdb, then it will be
+downloaded and moved to
+**/home/Temp/bioroebe/pdb/**.
+This feature exists just to simplify using the
+**bioshell**.
+### Showing the stop codons in frame1, frame2 and frame3 in the bioshell
+![alt text][cat1]
+[cat1]: https://i.imgur.com/Qmd7R0p.png
+When you have a given sequence assigned to the bioshell, such
+as via "random 99", you can then show all stop codons in
+frame1, frame2 and frame3.
+The corresponding input for this will be:
+    stop_frame1?
+    stop_frame2?
+    stop_frame3?
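What these three commands look for can be sketched in plain ruby. The names below are hypothetical and this is not the bioshell code; the sketch assumes the three stop codons of the standard genetic code and reports the 1-based position at which each stop codon starts.

```ruby
# Hypothetical names; not the bioshell implementation.
STOP_CODONS_SKETCH = %w[TAA TAG TGA].freeze

# frame must be 1, 2 or 3. Returns an Array with the 1-based start
# positions of all stop codons that lie in the given frame.
def stop_codon_positions_sketch(sequence, frame)
  dna = sequence.upcase.tr('U', 'T')
  (frame - 1).step(dna.size - 3, 3)
             .select { |start| STOP_CODONS_SKETCH.include?(dna[start, 3]) }
             .map { |start| start + 1 }
end

sequence = 'ATGTAACTGA'
(1..3).each do |frame|
  puts "frame#{frame}: #{stop_codon_positions_sketch(sequence, frame).inspect}"
end
# frame1: [4]
# frame2: [8]
# frame3: []
```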
+An image shows this next, where we first did input "random 120",
+before issuing the above-mentioned instructions one after
+the other:
+<img src="https://i.imgur.com/HpHF4jq.png" style="margin: 1em; border: 1px solid black">
+### Freezing the main sequence in the bioshell - and unfreezing it again
+![alt text][cat1]
+[cat1]: https://i.imgur.com/Qmd7R0p.png
+You can **freeze** the BioShell, meaning that it will no longer
+allow for the main sequence to be modified, via the following
+command:
+    freeze
+To <b>unfreeze</b> the sequence again, issue:
+    unfreeze
+This functionality has been added because the shell may sometimes be
+quite eager to change the main sequence, so we needed a way to
+disable any further modifications (until "unfreeze" is issued
+that is).
+## Support for other programming languages
+The main programming language for the bioroebe project is **ruby**.
+Ruby, from a language design point of view, is a great programming
+language - not necessarily all of ruby, but the subset that I use.
+It is very easy to quickly prototype ideas via ruby.
+However, ruby is known to **not** be among the fastest programming
+languages; so, from this point of view, it makes sense to use other
+languages too. Additionally, some useful software stacks are written
+in **other** programming languages, such as matplotlib (Python) and
+various more.
+Thus, it is important to **support other programming languages** as
+well, if there are useful libraries. The bioroebe project, after
+all, tries to be **practical**: it focuses on getting things done,
+no matter the language.
+This means that support for other programming languages can be
+found in this project as well, often using system() or similar
+functionality to tap into these other programming languages. Do
+not be surprised when that happens - the bioroebe project will
+also try to act as a **practical glue** towards functionality
+enabled via other projects. We want to get things done, no
+matter the programming language at hand!
+Whenever possible, though, the bioroebe project will try to be
+flexible in this regard, so ideally the same solution should
+work for many different programming languages.
+While Ruby is the primary language for this project, as
+of 2021 I will try to officially support **java**, **jruby**
+and the **GraalVM**. This is on my TODO list, though - stay
+tuned for more updates in this regard. See also the
+subsection <b>Support for Python</b>.
+## Support for Python
+In <b>June 2022</b> I decided to add support for Python to bioroebe.
+While people can - and should - easily use <b>biopython</b> instead,
+I simply wanted to see how much python-support I can add to
+bioroebe. This may lag behind some years compared to biopython,
+but I wanted to extend python support as well, so there you go.
+It is simply an additional option for the bioroebe project.
+<b>Ruby</b> will remain the primary language for the project,
+though, at the least for now.
+## Bioroebe::ProfilePattern
+This class can be used to generate nucleotide sequences that
+are not quite "random". For example, to generate sequences
+that may "simulate" a TATA box.
+The idea for this class is to be extended into allowing
+HMMs (Hidden Markov Models) one day.
+Usage example:
+    _ = Bioroebe::ProfilePattern.new(ARGV, :do_not_run_yet)
+    _.generate_sequence_based_on_this_profile
+Such a profile specifies the preferred sequence letters for each
+position in a section of DNA. You have to provide
+the Hash to the method generate_sequence_based_on_this_profile() -
+or you use the default Hash, which is stored in the constant
+called **PER_POSITION_HASH**.
+That profile should be a Hash, with the keys A, T, C and G,
+and each value being an Array of numbers, such as 140, with
+one number per position. These values are also called
+**scores**. Each score indicates how likely it is to find
+the given nucleotide at that position.
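+To make the idea of a per-position profile more concrete, here is a
+minimal sketch in plain Ruby. It does not use class ProfilePattern;
+the Hash below and the method names example_profile and
+sequence_from_profile are made up for illustration, and the real
+PER_POSITION_HASH may look different.
```ruby
# Hypothetical profile (not the real PER_POSITION_HASH): one score per
# position for each nucleotide. Higher scores make a nucleotide more likely.
def example_profile
  {
    'A' => [ 10, 140,  10, 140],
    'T' => [140,  10, 140,  10],
    'C' => [ 10,  10,  10,  10],
    'G' => [ 10,  10,  10,  10]
  }
end

# Picks one nucleotide per position, weighted by the scores at that position.
def sequence_from_profile(profile)
  n_positions = profile.values.first.size
  (0...n_positions).map { |position|
    total = profile.values.sum { |scores| scores[position] }
    threshold = rand * total
    running = 0
    picked = profile.keys.last
    profile.each_pair { |nucleotide, scores|
      running += scores[position]
      if threshold < running
        picked = nucleotide
        break
      end
    }
    picked
  }.join
end

puts sequence_from_profile(example_profile) # most often "TATA"
```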
+You can also use this class to generate a random DNA string,
+similar to the method called
+**Bioroebe.generate_random_dna_sequence()**. The difference
+is that class ProfilePattern allows for a bit more fine-tuned
+control. The class will likely be extended in the future too.
+## Generate DNA via Bioroebe.random_dna
+You can "generate" random DNA strings by making use of the
+following code:
+    x = Bioroebe.random_dna 50 # => "AGACATCCGGCTTGGATACCTCATAAGTCATATCAGCATCGTCGGACATT"
+As can be seen in the example above, after the #, a String is
+returned representing the nucleotide sequence. In the case above
+it is 50 nucleotides in length.
+The number given to <b>.random_dna()</b> tells the method how many
+nucleotides should be generated.
+The method accepts a second argument, which should be a Hash.
+If it is a hash then the generated DNA will be based on the
+**probabilities** given to that Hash.
+Let's look at a specific example here:
+    Bioroebe.random_dna(50, { A: 10, T: 10, C: 10, G: 70}) # => "GGGGTGGGGAGGGTATGCGGAGGAAGGGCGGGAAGGGCGGGGGCTGGGCG"
+As you can see, in the Hash defined above, the likelihood for
+incorporating a Guanine is much higher than for Adenine
+(70 : 10). This will be reflected in the generated DNA
+sequence which, as can be seen, contains many more
+Guanines than Adenines.
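+How such a weighted choice can work is sketched below in plain Ruby.
+This is not the implementation used by Bioroebe.random_dna; the
+method name weighted_random_dna is hypothetical, and it only
+assumes that the Hash values are relative weights.
```ruby
# Hypothetical helper (not part of bioroebe): builds a random DNA string
# where each nucleotide is drawn according to its relative weight.
def weighted_random_dna(length, weights = { A: 25, T: 25, C: 25, G: 25 })
  total = weights.values.sum.to_f
  Array.new(length) {
    threshold = rand * total # a random point within the summed weights
    running = 0.0
    chosen = weights.keys.last
    weights.each_pair { |nucleotide, weight|
      running += weight
      if threshold < running
        chosen = nucleotide
        break
      end
    }
    chosen.to_s
  }.join
end

puts weighted_random_dna(50, { A: 10, T: 10, C: 10, G: 70 })
```
+With the weights shown, roughly 70% of the generated nucleotides
+should be Guanine, although every run yields a different String.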
+There is yet a third use case for the above. If you pass a **String**
+as the second argument rather than a Hash, then that String will be
+used as basis for generating the DNA string at hand.
+Again, let's look at a specific example here:
+    Bioroebe.random_dna(10, 'ATCGATCGGG')
+Here the String contains more G than A, T or C, so the new DNA
+sequence should be enriched in G as well.
+More usage examples in this regard:
+    Bioroebe.random_dna(20, 'ATGGGGGGGG') # => "TGAGGGGGGGGGTGGGAGGG"
+    Bioroebe.random_dna(20, 'ATGGGGGGGG') # => "GGTAGGGGGGGGTAGGGGGG"
+Note that this is similar to the .randomize() method in the bioruby
+project:
+    hash = {'a'=>1,'c'=>2,'g'=>3,'t'=>4}
+    puts Bio::Sequence::NA.randomize(hash) # => "ggcttgttac" (for example)
+## Generating a random nucleotide sequence based on frequencies
+If you ever need to generate a nucleotide sequence based on given
+frequencies then you can use the following method:
+    Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies
+    Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies 100
+    Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies 500
+## Parsing genbank (.gbk) files
+You could use class <b>Bioroebe::GenbankParser</b> to parse .gbk files, at
+the least if you want to obtain the raw sequence, in FASTA format.
+Example for this:
+    require 'bioroebe/genbank/genbank_parser.rb'
+    result = Bioroebe::GenbankParser.new('/home/Temp/bioroebe/ls_orchid.gbk')
+    result.dataset? # This method call will return the FASTA sequence.
+Note that this currently (<b>July 2022</b>) only grabs one entry. After
+the upcoming rewrite the parser will be able to parse
+all entries, and then present them to the user. Stay tuned in this
+regard.
+## Parsers in general
+The bioroebe project stores most parsers in the parsers/ subdirectory
+as of <b>July 2022</b>.
+Prior to that date different parsers were stored in different subdirectories,
+such as the parser for genbank-files being stored in the genbank/
+subdirectory. As I found this situation confusing, I settled on
+the parsers/ subdirectory.
+## Coomassie staining of proteins
+Coomassie staining is typically done on proteins, giving them a blue
+or blueish colour. <b>Coomassie Brilliant Blue</b> is <b>the most popular
+anionic protein dye</b>.
+This may look like this:
+<img src="https://i.imgur.com/6eUN7HR.png" style="margin: 1em; border: 1px solid black">
+This picture shows five different bands. The molecular weight of the
+marker can be seen on the very left hand side, in <b>kDa</b>. The
+larger fragments can be seen on top, so the farther the band has
+moved, the smaller the fragment must be (in kDa). That means that
+the larger proteins can be found on top; the smaller proteins on
+the bottom.
+Some bands are missing, and this gives information - that is
+that a particular protein is missing. Probably it was not
+synthesized in the given tissue at hand.
+The staining for a Coomassie Blue stain is typically done
+via G-250, as a 0.5% solution prepared in
+50% methanol and 10% acetic acid. The staining
+usually takes about 5 minutes.
+Note that the G-250 stain is the dimethyl derivative of
+R-250 - the <b>R</b> stands for <b>red</b> or <b>reddish</b>.
+Both dyes will bind via electrostatic interaction with <b>protonated
+basic amino acids</b>: that is <b>lysine</b>, <b>arginine</b>,
+and <b>histidine</b>. They can also bind via hydrophobic
+associations to aromatic residues.
+Coomassie stains are in principle reversible. They are not
+as sensitive as silver staining, but significantly cheaper,
+which is one reason why they have become so popular.
+Not every protein contains all aminoacids, so staining may be difficult.
+For instance, the <b>glycomacropeptide</b> is the only known
+naturally occurring protein that contains no Phe (Phenylalanine; F).
+A protein that lacks lysine, arginine, histidine and aromatic
+amino acids may be undetectable via Coomassie staining. However,
+this does not seem to be a universal rule; some groups report
+that they even managed to stain "unstainable" proteins via
+Coomassie staining.
+The paper at https://www.jbc.org/article/S0021-9258(17)39198-6/pdf,
+titled "Why Does Coomassie Brilliant Blue R Interact Differently
+with Different Proteins?" and published in the year 1985, tries
+to explain why different groups obtain different
+results via Coomassie staining.
+They specifically point out that "there is a striking correlation
+between intensity of response to Coomassie dyes and the basicity
+of a protein which depends on the number of lysine, histidine,
+and arginine residues, as well as the NH₂-terminal amino group"
+(aka the aminoterminus of the protein at hand). The concluding
+remark from that paper is that <b>"Coomassie R Interacts
+Differently with Different Proteins"</b>.
+On class <b>Bioroebe::Protein</b> you can determine whether
+a given protein can be stained via coomassie through the
+following method:
+    .can_be_stained_via_coomassie?
+This isn't an ideal check, so don't rely on it. It will simply
+check whether the sequence has at least one lysine,
+or one histidine, or one arginine, or any of the aromatic
+amino acids.
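+The check described above can be approximated in a few lines of
+plain Ruby. This is a sketch of the described rule, not the actual
+code of class Bioroebe::Protein; the method name
+stainable_via_coomassie? is hypothetical, and F, W and Y are assumed
+to be the aromatic amino acids in question.
```ruby
# Hypothetical helper (not part of bioroebe): true if the one-letter
# protein sequence contains at least one basic residue (K, R, H) or at
# least one aromatic residue (assumed here to be F, W, Y).
def stainable_via_coomassie?(sequence)
  sequence.upcase.match?(/[KRHFWY]/)
end

puts stainable_via_coomassie?('MVKSYDRYEY') # contains K, R and Y
puts stainable_via_coomassie?('GGAAGGSS')   # contains none of them
```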
+## Codon Usage
+This **section** deals with some aspects of **codon usage** in different
+organisms.
+Let us first define the term <b>codon usage</b> so we can base any further
+analysis on this definition. In order to do so, we also have to define
+what a <b>codon</b> is, so let's start with that actually.
+A <span style="color: darkgreen; font-weight: bold">codon</span> is
+essentially the basic code used in DNA to denote which particular
+**aminoacid** corresponds to these (three) nucleotide base pairs.
+A codon is thus <b>a series of three nucleotides</b>, also called
+a <b>triplet</b>, such as <b>ATG</b>.
+When we use the term <b>base pairs</b>, we refer to **double-stranded DNA**,
+abbreviated as <b>dsDNA</b>. The codon is, however, only found
+on a single strand, even within dsDNA. Since some parts of
+a **dsDNA** in any given genome give rise to a, more or less, complementary
+copy into **mRNA**, the codons that are actually used, are found in the
+corresponding mRNA as well, excluding the codon that codes for a stop
+signal (a so-called <b>stop codon</b>). (Remember that mRNA differs from
+DNA in that there will be Uracil rather than Thymine; otherwise it is
+the same, sequence-wise. Of course it uses another sugar (Ribose), but
+remember we are here mostly interested in the **information-containing
+part**, not the full chemical structure.)
+The <b>codon</b> is thus found on the mRNA and since mRNA is mostly
+single-stranded, the codon is a component of the mRNA. The two subunits
+of the ribosome are assembled on an mRNA, at least in prokaryotes (or
+more accurately, the smaller subunit scans along the mRNA until it
+<b>detects</b> a start codon). Mind you, this subsection will not go
+into all relevant details, so just keep in mind that the codon is the
+part that will eventually be "<i>translated</i>" at the ribosome into
+a corresponding aminoacid, excluding stop codons at the end.
+Now - different organisms use **different frequencies of codons**.
+<b style="color:darkblue">Codon usage</b> thus describes the fact
+that many proteins in these different organisms make use of certain
+codons with a **substantially higher frequency than other codons**.
+We can use statistics to infer this on a global (proteome)
+level too.
+Remember that the genetic code is **degenerate**, meaning that
+you have a few aminoacids that are encoded only by one codon
+(<b>Tryptophan</b> and <b>Methionine</b>), whereas the other
+aminoacids are encoded by more than one codon - thus, at the
+very least two codons. Note that the latter codons, if they
+code for the **same** aminoacid, are also called
+<b style="font-style: italic">synonymous codons</b>.
+This means that for any given aminoacid chain there can be
+several different codon sequences that yield the very same
+aminoacid chain, which ultimately means that different DNA
+sequences can code for the very same aminoacid chain.
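+To get a feeling for how many different DNA sequences can encode
+the same aminoacid chain, the following plain-Ruby sketch multiplies
+the number of synonymous codons per aminoacid. It assumes the
+standard genetic code, ignores the stop codon, and the method names
+are hypothetical (they are not part of bioroebe).
```ruby
# Number of codons per aminoacid in the standard genetic code
# (61 sense codons in total; stop codons are not included).
def codons_per_aminoacid
  {
    'A' => 4, 'R' => 6, 'N' => 2, 'D' => 2, 'C' => 2,
    'Q' => 2, 'E' => 2, 'G' => 4, 'H' => 2, 'I' => 3,
    'L' => 6, 'K' => 2, 'M' => 1, 'F' => 2, 'P' => 4,
    'S' => 6, 'T' => 4, 'W' => 1, 'Y' => 2, 'V' => 4
  }
end

# Hypothetical helper: how many distinct coding sequences yield this peptide?
def n_possible_coding_sequences(peptide)
  table = codons_per_aminoacid
  peptide.upcase.each_char.reduce(1) { |product, aminoacid|
    product * table.fetch(aminoacid)
  }
end

puts n_possible_coding_sequences('MW')  # => 1
puts n_possible_coding_sequences('MAL') # => 24
```
+Methionine and Tryptophan contribute a factor of 1 each, which is why
+'MW' can only be encoded in one way, whereas 'MAL' has 1 * 4 * 6 = 24
+possible coding sequences.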
+Usually the third base of a codon has the least influence on
+codon meaning. This is also called <b>wobbling</b> - since
+the anticodon loop on the tRNA is in the reverse direction,
+and the wobble position refers to the tRNA, this means that
+the wobble-position is at the 5'-end of the tRNA anticodon.
+Now a few words about functionality related to codons and codon
+usage in the Bioroebe project.
+Say that you have a long DNA sequence; let's pick a sample
+for now, such as:
+    ATGGGCGGGGTGATGGCAATGATGCCCCCGATGATG
+You can analyze the codons used via class **ShowCodonUsage**
+and the corresponding entry at <b>bin/show_codon_usage</b>:
+    show_codon_usage ATGGGCGGGGTGATGGCAATGATGCCCCCGATGATG
+This class can be found at <b>bioroebe/codons/show_codon_usage.rb</b>.
+It will report the top 5 codons in use and also output the
+frequency hash on the commandline.
+On my computer at home the output it yields via the commandline,
+on a KDE konsole terminal, looks like this:
+<img src="https://i.imgur.com/h55Thdu.png" style="margin: 1em; border: 3px solid black">
+You can use this from within ruby code too, via the following
+toplevel method:
+    Bioroebe.codon_frequencies_of_this_sequence(ARGV)
+To get the hash of the codon frequencies you can use the .hash? method:
+    hash = Bioroebe.codon_frequencies_of_this_sequence('ATGGGCGGGGTGATGGCAATGATGCCCCCGATGATG').hash?
+If you want to look at the actual codon frequencies used
+by different organisms, have a look here:
+http://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=11076&aa=9&style=N
+This is an excellent resource.
+For instance, the <i>E. coli</i> K strain can be found here:
+https://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=83333&aa=9&style=N
+## Determining the frequencies of aminoacids in a given aminoacid (protein) sequence
+If you quickly wish to determine the aminoacid composition, as a
+Hash, you can use **bin/aminoacid_frequencies**.
+Example from the commandline for this:
+    aminoacid_frequencies MVTDEGAIYFTKDAARNWKAAVEETVSATLNRTVSSGITGASYYTGTFST
+Example from within bioroebe itself (and thus ruby):
+    require 'bioroebe/frequencies.rb'
+    Bioroebe.aminoacid_frequencies('MVTDEGAIYFTKDAARNWKAAVEETVSATLNRTVSSGITGASYYTGTFST')
+The latter will return a Hash that you can then make further use of, such as:
+    {"M"=>1, "V"=>4, "T"=>9, "D"=>2, "E"=>3, "G"=>4, "A"=>7, "I"=>2, "Y"=>3, "F"=>2, "K"=>2, "R"=>2, "N"=>2, "W"=>1, "S"=>5, "L"=>1}
+## Determining the codon frequencies from the commandline
+In <b>April 2022</b> I noticed that one use case is to show the
+codon frequencies of a given sequence - typically a nucleotide sequence.
+For aminoacids there already was an executable, at **bin/aminoacid_frequencies**.
+So, following that logic, a new executable was added at
+**bin/codon_frequency**. This will show the Hash of the codon
+frequencies, as a String, on the commandline.
+Usage example:
+    codon_frequency ATTCGTACGATCGACTGACTGACAGTCATTCGT
+The output of this would be the following:
+    AUU: 2
+    CGU: 2
+    ACG: 1
+    AUC: 1
+    GAC: 1
+    UGA: 1
+    CUG: 1
+    ACA: 1
+    GUC: 1
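+The counting itself is simple and can be sketched in plain Ruby as
+follows. This is not the code behind bin/codon_frequency; the method
+name codon_counts is hypothetical. The sketch reads the sequence in
+frame 1 and converts T to U, as in the output shown above.
```ruby
# Hypothetical helper (not part of bioroebe): counts the codons of a
# nucleotide sequence in reading frame 1 and reports them as RNA codons.
def codon_counts(dna)
  rna = dna.upcase.delete('^ACGTU').tr('T', 'U')
  # Trailing nucleotides that do not form a full codon are ignored.
  rna.scan(/.{3}/).each_with_object(Hash.new(0)) { |codon, counts|
    counts[codon] += 1
  }
end

codon_counts('ATTCGTACGATCGACTGACTGACAGTCATTCGT').each_pair { |codon, n|
  puts "#{codon}: #{n}"
}
```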
+## Showing the codon frequency via countcodon
+The excellent website at https://www.kazusa.or.jp/codon/countcodon.html offers
+a rather useful functionality via a simple web-interface, in that you can pass
+in an mRNA sequence, and it will then show the codon frequency/likelihood of
+that sequence - all codons in that sequence, that is. This can be extended
+to <b>all protein-coding genes in a given genome</b>, and will thus be
+useful for a researcher who may be interested in determining the codon
+frequency in general, across all genes in that given genome.
+You can test it with an input sequence.
+For instance, the following sequence:
+    ATTCGTACGATCGACTGACTGACAGTCATTCGTAGTACGATCGACTGACTGACAGTCATTCGTACGATCGACTGACTGACAAGTCATTCGTACGATCGACTGACTTGACAGTCATAA
+Would yield this result:
+    fields: [triplet] [frequency: per thousand] ([number])
+    UUU  0.0(     0)  UCU  0.0(     0)  UAU  0.0(     0)  UGU  0.0(     0)
+    UUC  0.0(     0)  UCC  0.0(     0)  UAC 25.6(     1)  UGC  0.0(     0)
+    UUA  0.0(     0)  UCA 25.6(     1)  UAA 25.6(     1)  UGA102.6(     4)
+    UUG  0.0(     0)  UCG 25.6(     1)  UAG  0.0(     0)  UGG  0.0(     0)
+    CUU  0.0(     0)  CCU  0.0(     0)  CAU 25.6(     1)  CGU 76.9(     3)
+    CUC  0.0(     0)  CCC  0.0(     0)  CAC  0.0(     0)  CGC  0.0(     0)
+    CUA  0.0(     0)  CCA  0.0(     0)  CAA  0.0(     0)  CGA 25.6(     1)
+    CUG102.6(     4)  CCG  0.0(     0)  CAG 25.6(     1)  CGG  0.0(     0)
+    AUU 76.9(     3)  ACU 25.6(     1)  AAU  0.0(     0)  AGU 51.3(     2)
+    AUC 76.9(     3)  ACC  0.0(     0)  AAC  0.0(     0)  AGC  0.0(     0)
+    AUA  0.0(     0)  ACA 76.9(     3)  AAA  0.0(     0)  AGA  0.0(     0)
+    AUG  0.0(     0)  ACG 76.9(     3)  AAG  0.0(     0)  AGG  0.0(     0)
+    GUU  0.0(     0)  GCU  0.0(     0)  GAU 25.6(     1)  GGU  0.0(     0)
+    GUC 51.3(     2)  GCC  0.0(     0)  GAC 76.9(     3)  GGC  0.0(     0)
+    GUA  0.0(     0)  GCA  0.0(     0)  GAA  0.0(     0)  GGA  0.0(     0)
+    GUG  0.0(     0)  GCG  0.0(     0)  GAG  0.0(     0)  GGG  0.0(     0)
+At any rate, this functionality is also available
+within the Bioroebe project as of **April 2022**.
+The method that does so is:
+    Bioroebe.frequency_per_thousand
+    Bioroebe.frequency_per_thousand('ATTCGTACGATCGACTGACTGACAGTCATTCGTAGTACGATCGACTGACTGACAGTCATTCGTACGATCGACTGACTGACAAGTCATTCGTACGATCGACTGACTTGACAGTCATAA') # Usage example here.
+Sinatra-bindings for this functionality exist as of July 2022,
+but they are not very well-polished. Ruby-gtk3 bindings may be
+added at a later time, and possibly ruby-libui bindings as well, for
+Windows support. What is missing is support for different codon tables in
+different species, but that may be added at a later time as well - for now
+it seemed more important to offer the functionality.
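+The "per thousand" value is simply the count of a codon divided by
+the total number of codons, multiplied by 1000. In the table above
+there are 39 codons in total, so a codon that occurs once has a
+frequency of 1 / 39 * 1000 = 25.6. A minimal plain-Ruby sketch of this
+arithmetic follows; it is not the code of
+Bioroebe.frequency_per_thousand, and the method name
+per_thousand_of is hypothetical.
```ruby
# Hypothetical helper (not part of bioroebe): codon frequency per
# thousand codons, read in frame 1, rounded to one decimal place.
def per_thousand_of(sequence)
  codons = sequence.upcase.tr('T', 'U').scan(/[ACGU]{3}/)
  counts = Hash.new(0)
  codons.each { |codon| counts[codon] += 1 }
  counts.transform_values { |n| (n * 1000.0 / codons.size).round(1) }
end

result = per_thousand_of('ATTCGTACGATCGACTGACTGACAGTCATTCGT')
puts result['AUU'] # 2 of 11 codons
puts result['ACG'] # 1 of 11 codons
```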
+## Working with PDB files (.pdb)
+The **PDB**, founded in the year **1971**, holds lots of **atomic
+structures of proteins**.
+For instance, in **July 2016** it contained **121000 structures**.
+In **February 2018** it contained **~124000 structures**
+(from X-ray crystallography), and about **~12000 NMR
+structures**. <b>NMR</b> is limited to about <b>350 amino
+acids maximum length</b>, give or take.
+In **April 2020** the PDB contained **163141 structures**.
+We can see that more and more structures are available nowadays -
+a trend that will most likely continue or even accelerate.
+(Let's hope the quality also remains high.)
+A typical .pdb file contains entries such as this:
+    RTyp  Num  Atm Res Ch  ResN  X       Y       Z      Occ  Temp   PDB   Line
+    ATOM    1  N   ASP L   1     4.060   7.307   5.186  1.00 51.58  1FDL  93
+    ATOM    2  CA  ASP L   1     4.042   7.776   6.553  1.00 48.05  1FDL  94
+    ATOM    3  N   VAL A  25    32.433  16.336  57.540  1.00 11.92   A1    N
+    ATOM    4  CA  VAL A  25    31.132  16.439  58.160  1.00 11.85   A1    C
+    ATOM    5  C   VAL A  25    30.447  15.105  58.363  1.00 12.34   A1    C
+(Note the first line: it is just a header that explains the columns of the ATOM
+entries below it.)
+The sequence starts from the N-terminal residue for proteins; see
+the <b>Atm</b> entry at <b>Num 1</b>.
+The **meaning of these entries** is as follows:
+    1) RTyp: Record Type
+    2) Num:  Serial number of the atom.  Each atom has a unique serial number.
+    3) Atm:  Atom name (in IUPAC format).
+    4) Res:  Residue name (IUPAC format).
+    5) Ch:   Chain to which the atom belongs (in this case, L for light chain of an antibody).
+    6) ResN: Residue sequence number. This will be incremental, e.g. 1, 2, 3, 4 and so forth.
+    7,8,9) X, Y, Z: Cartesian coordinates specifying atomic position in space.
+    10) Occ: Occupancy factor
+    11) Temp: Temperature factor (atoms disordered in the crystal have high
+              temperature factors; they are "wobbly" with a high factor.
+              This is also called the B-factor).
+    12) PDB: The PDB data file unique identifier.
+    13) Line: Line (record) number in the data file.
+Typically the rightmost entry, the last one, specifies
+which element the atom is. An **H** stands for a hydrogen atom; the other atoms
+are "heavy" atoms (heavier than hydrogen).
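+Reading such an ATOM entry by hand can be sketched in plain Ruby.
+Note that this is not the code of class ParsePdbFile; the method name
+parse_atom_record is hypothetical, and the sketch assumes the
+fixed-column layout of ATOM records in the wwPDB format version 3.3
+(where the last field is the element symbol), which differs slightly
+from the simplified table shown above.
```ruby
# Hypothetical helper (not part of bioroebe): splits one fixed-column
# ATOM record (wwPDB format 3.3 layout assumed) into a Hash.
def parse_atom_record(line)
  {
    record_type:        line[0, 6].strip,
    serial:             line[6, 5].to_i,
    atom_name:          line[12, 4].strip,
    residue:            line[17, 3].strip,
    chain:              line[21],
    residue_number:     line[22, 4].to_i,
    x:                  line[30, 8].to_f,
    y:                  line[38, 8].to_f,
    z:                  line[46, 8].to_f,
    occupancy:          line[54, 6].to_f,
    temperature_factor: line[60, 6].to_f,
    element:            line[76, 2].to_s.strip
  }
end

# Build an example record with exact column widths instead of typing it by hand.
example = format(
  '%-6s%5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f%6.2f%6.2f          %2s',
  'ATOM', 1, 'N', '', 'ASP', 'L', 1, '', 4.060, 7.307, 5.186, 1.00, 51.58, 'N'
)
p parse_atom_record(example)
```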
+Most .pdb files will contain **SEQRES** entries. These entries will list
+the primary sequence of the polymeric molecules present in the entry.
+You can notice this by looking at the standard 3-character code
+used by SEQRES here, for the canonical amino acids. So, for instance,
+the amino acids that will be mentioned in a SEQRES entry are
+ALA, CYS, ASP, GLU, PHE, GLY, HIS, ILE, LYS, LEU, MET, ASN,
+PRO, GLN, ARG, SER, THR, VAL, TRP and TYR. You can use the
+method **Bioroebe.three_to_one()** to convert these back to the
+one-letter code, as follows:
+    Bioroebe.three_to_one('PHE') # => "F"
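+Such a conversion boils down to a lookup table. The following
+plain-Ruby sketch shows the idea for a list of SEQRES residues; it is
+not the implementation of Bioroebe.three_to_one(), the method names
+are hypothetical, and unknown residues are mapped to 'X' here as an
+arbitrary choice.
```ruby
# Lookup table for the 20 canonical amino acids (three-letter => one-letter).
def three_to_one_table
  {
    'ALA' => 'A', 'CYS' => 'C', 'ASP' => 'D', 'GLU' => 'E', 'PHE' => 'F',
    'GLY' => 'G', 'HIS' => 'H', 'ILE' => 'I', 'LYS' => 'K', 'LEU' => 'L',
    'MET' => 'M', 'ASN' => 'N', 'PRO' => 'P', 'GLN' => 'Q', 'ARG' => 'R',
    'SER' => 'S', 'THR' => 'T', 'VAL' => 'V', 'TRP' => 'W', 'TYR' => 'Y'
  }
end

# Hypothetical helper (not part of bioroebe): converts an Array of
# three-letter residue names into a one-letter sequence String.
def seqres_to_one_letter(residues)
  table = three_to_one_table
  residues.map { |residue| table.fetch(residue.upcase, 'X') }.join
end

puts seqres_to_one_letter(%w[MET LEU SER ASP]) # => "MLSD"
```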
+The data in a .pdb file need not necessarily only be a protein, with
+a specific aminoacid sequence. It may also include DNA. An example
+for such a molecule is
+<b><a href="http://rcsb.org/pdb/explore/explore.do?structureId=2DGC">2dgc</a></b>,
+which includes a protein chain and a DNA chain.
+As far as the **bioroebe project** is concerned, you can parse .pdb files
+via the following class:
+    Bioroebe::ParsePdbFile.new
+    Bioroebe::ParsePdbFile.new(path_to_the_pdb_file_here)
+    Bioroebe::ParsePdbFile.new('/foo/bar/ack.pdb')
+This class also allows some shortcuts for integrated .pdb files,
+that is files that are bundled with the bioroebe project:
+    Bioroebe::ParsePdbFile.new ':1fat'
+This requires a String because ruby symbols may not start with
+a number. Note that this also works through the commandline,
+such as:
+    parse_pdb_file :1fat
+A shell such as bash does not understand ruby symbols, so instead
+a string will be passed in, being :1fat. The ParsePdbFile will
+handle this correctly internally.
+Note that a small bug was fixed in the file parse_pdb_file.rb;
+some entries were skipped due to an erroneous loop in the ruby
+file. This was corrected in **May 2020**.
+In **March 2021** the ability to use entries such as ':1fat'
+was removed again; the code remains though. The reason why
+this was removed was that the .pdb files are quite large,
+so distributing them via the bioroebe project makes no real
+sense. Consider simply downloading the .pdb files; you
+can do this from the bioshell or via something
+like:
+    pdb 5TIM
+Note that you can also return the aminoacid-sequence from a
+.pdb file directly, as of **May 2020**.
+Example for this:
+    Bioroebe.return_aminoacid_sequence_from_this_pdb_file "1VII.pdb" # => "MLSDEDFKAVFGMTRSAFANLPLWKQQNLKKEKGLF"
+The first argument should be **the path to the (local)
+.pdb file at hand**. (In theory support for remote .pdb
+files could also be added easily, but right now this
+is not possible, so you have to download it first.)
+The **specification for .pdb files** can be read at the following
+two remote resources:
+http://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html
+http://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#ATOM
+Note that the parse_pdb_file.rb can also do some additional
+things, such as calculating the maximum distance between
+atoms in that file, via the method
+**.try_to_determine_the_max_distance_between_the_atoms_in_this_protein()**.
+If you wish to report the secondary structures from a given .pdb file
+then you can use the following class:
+    require 'bioroebe/pdb/report_secondary_structures_from_this_pdb_file.rb'
+    Bioroebe::ReportSecondaryStructuresFromThisPdbFile.new
+    Bioroebe::ReportSecondaryStructuresFromThisPdbFile.new('foobar.pdb')
+If you wish to obtain the FASTA sequence of a particular remote
+.pdb file then you can use this API:
+    x = Bioroebe.return_fasta_sequence_from_this_pdb_file "2bts" # => "MLSDEDFKAVFGMTRSAFANLPLWKQQNLKKEKGLF"
+Keep in mind that this is the FASTA sequence; the .pdb file itself
+has another format, and contains a lot more information, such as
+the various ATOM entries.
+As of **June 2020** the command **fetch** also works from
+within the Bioshell, similar to how pymol **works**. This allows
+us to quickly download a remote .pdb file:
+    fetch 2BTS
+You can also use the following toplevel-API to download a remote
+.pdb file:
+    Bioroebe.download_this_pdb
+    Bioroebe.download_this_pdb '355D'
+    Bioroebe.download_this_pdb '1K4R' # This is the Dengue Virus
+    Bioroebe.download_this_pdb '1fat.pdb' # Lectin Phytohemagglutinin
+This will refer to a remote URL such as
+https://files.rcsb.org/view/1FAT.pdb.
+Note that this will be automatically moved to the "correct" default
+position in the bioroebe-project, under the **pdb/** subdirectory.
+You can also invoke this script from the commandline via
+**bin/download_this_pdb**, like in this way:
+    download_this_pdb 355D
+This works with several .pdb files in one go as well:
+    download_this_pdb 1NR6 2F9Q 3TDA 2HI4 2V0M
+They would all be downloaded one after the other. Be aware that
+this will overwrite any old .pdb files at that location, so
+if you don't want this, I recommend backing up the
+**pdb/** subdirectory before invoking the above call.
+You can also turn the FASTA sequence stored in a .pdb file into
+a .fasta file, via **--create-fasta-file**.
+Usage examples:
+    parsedb 1NR6 --create-fasta-file
+    parsedb 2F9Q --create-fasta-file
+    parsedb 3TDA --create-fasta-file
+    parsedb 2HI4 --create-fasta-file
+    parsedb 2V0M --create-fasta-file
+So if you have a file called <b>1NR6.pdb</b> and you use
+the first input, a .fasta file will be created. If such
+a .pdb file does not exist then this will not work, so
+make sure to download the .pdb file before invoking
+this commandline-flag.
+Last but not least, the following table shall document the
+PDB format - it is not yet complete, but it is intended
+to add the remaining datasets eventually:
+    Record Name  Describes
+    MODRES       Modifications to standard residues
+    HET          Nonstandard residues (as well as ligands, ions and water)
+    HETNAM       Full chemical name of the residue
+    HETSYM       Synonyms for the residue
+    FORMUL       Chemical formula of the residue
+    KEYWDS       specifies keywords, such as "FK506 BINDING PROTEIN, FKBP12, CIS-TRANS PROLYL-ISOMERASE, ROTAMASE"
+## Determining how many stop codons exist in a given sequence
+You can use **bin/n_stop_codons_in_this_sequence** to determine
+how many stop codons exist in a given sequence at hand.
+Usage example from the commandline:
+    n_stop_codons_in_this_sequence ATGACGTACGTCAGTCAGTGATAGTAA # => 4
+You can also separate these via a ' ' spacer on the commandline of
+course:
+    n_stop_codons_in_this_sequence ATG ACG TAC GTC AGT CAG TGA TAG TAA # => 4
+Internally this makes use of the method called
+<b>Bioroebe.n_stop_codons_in_this_sequence?</b> or one of its
+aliased names. Usage example for the method, just as in the
+first example shown above:
+    Bioroebe.n_stop_codons_in_this_sequence "ATGACGTACGTCAGTCAGTGATAGTAA" # => 4
+## The Aliphatic Index of Globular Proteins
+In a paper from 1980, Atsushi Ikai provided a formula with which one can
+calculate the aliphatic index of a globular protein, in a short paper
+titled "Thermostability and aliphatic index of globular proteins"
+(<b>PMID: 7462208</b>,
+<a href="https://www.jstage.jst.go.jp/article/biochemistry1922/88/6/88_6_1895/_article">
+see here</a>).
+Atsushi provided a statistical analysis of proteins, and determined
+that the aliphatic index - which is defined as the relative volume
+of a protein occupied by <b>aliphatic side chains</b> (alanine, valine,
+isoleucine, and leucine) - of proteins of thermophilic bacteria
+is significantly higher than that of ordinary proteins.
+Atsushi reasoned that the index may be regarded as a positive
+factor for the <b>increase of thermostability of globular
+proteins</b>. The enzymes of some organisms, in particular
+thermophilic ones, are more stable at higher temperature
+than the enzymes of other organisms.
+Thus, there is a good correlation between the "aliphatic
+index" on the one hand, and the thermostability of proteins
+on the other hand.
+Atsushi gave the following formula for calculating this:
+    Aliphatic Index = XA + a * XV + b * (XI + XL)
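+The four letters A, V, I and L refer to the four aminoacids
+Alanine, Valine, Isoleucine and Leucine, and XA, XV, XI and XL
+are their mole percentages in the protein. The two coefficients
+a and b are the volumes of the side chains of Valine (a) and of
+Isoleucine/Leucine (b) relative to the side chain of
+Alanine. a has a value range of 2.8-3.0 and
+b has a value range of 3.8-4.0.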
+The method called <b>.aliphatic_index()</b> is making use of that
+formula. As values for a and b the two values <b>2.9</b> and
+<b>3.9</b> have been taken. The code in the bioroebe project
+for this has been inspired by: https://github.com/wwood/bioruby-aliphatic_index
+It yields the following usage example for bioruby:
+    Bio::Sequence::AA.new('MVKSYDRYEYEDCLGIVNSKSSNCVFLNNA').aliphatic_index # => 71.33333
+In bioroebe, the equivalent would be:
+    Bioroebe::Protein.new('MVKSYDRYEYEDCLGIVNSKSSNCVFLNNA').aliphatic_index # => 71.33333
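+For illustration, the formula can also be written out in plain Ruby
+without any library. This is a sketch, not the code used by bioroebe
+or bioruby; the method name aliphatic_index_sketch is hypothetical,
+and a = 2.9 and b = 3.9 are used as mentioned above.
```ruby
# Hypothetical helper (not part of bioroebe): aliphatic index of a
# one-letter protein sequence, following
#   XA + a * XV + b * (XI + XL)
# where each X is the mole percentage of that aminoacid.
def aliphatic_index_sketch(sequence, a = 2.9, b = 3.9)
  residues = sequence.upcase
  mole_percent = ->(letter) { 100.0 * residues.count(letter) / residues.length }
  mole_percent.call('A') +
    a * mole_percent.call('V') +
    b * (mole_percent.call('I') + mole_percent.call('L'))
end

puts aliphatic_index_sketch('MVKSYDRYEYEDCLGIVNSKSSNCVFLNNA').round(5) # => 71.33333
```
+The example sequence has 30 residues with 1 Alanine, 3 Valines,
+1 Isoleucine and 2 Leucines, so the index is
+3.333 + 2.9 * 10 + 3.9 * 10 = 71.333.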
 ## Possibly useful links in regards to molecular biology and science in general