bioroebe 0.10.80 → 0.11.24

Files changed (129)
  1. checksums.yaml +4 -4
  2. data/README.md +1204 -772
  3. data/bioroebe.gemspec +3 -3
  4. data/doc/README.gen +1203 -771
  5. data/doc/todo/bioroebe_todo.md +391 -365
  6. data/lib/bioroebe/aminoacids/aminoacid_substitution.rb +1 -9
  7. data/lib/bioroebe/aminoacids/codon_percentage.rb +1 -9
  8. data/lib/bioroebe/aminoacids/deduce_aminoacid_sequence.rb +1 -9
  9. data/lib/bioroebe/aminoacids/display_aminoacid_table.rb +1 -0
  10. data/lib/bioroebe/aminoacids/show_hydrophobicity.rb +1 -6
  11. data/lib/bioroebe/base/colours_for_base/colours_for_base.rb +18 -8
  12. data/lib/bioroebe/base/commandline_application/commandline_arguments.rb +13 -11
  13. data/lib/bioroebe/base/commandline_application/misc.rb +18 -8
  14. data/lib/bioroebe/base/misc.rb +16 -0
  15. data/lib/bioroebe/base/prototype/misc.rb +1 -1
  16. data/lib/bioroebe/codons/show_codon_tables.rb +6 -2
  17. data/lib/bioroebe/codons/show_codon_usage.rb +2 -1
  18. data/lib/bioroebe/constants/aminoacids_and_proteins.rb +1 -0
  19. data/lib/bioroebe/constants/database_constants.rb +1 -1
  20. data/lib/bioroebe/constants/files_and_directories.rb +20 -1
  21. data/lib/bioroebe/constants/misc.rb +20 -0
  22. data/lib/bioroebe/count/count_amount_of_nucleotides.rb +3 -0
  23. data/lib/bioroebe/crystal/README.md +2 -0
  24. data/lib/bioroebe/crystal/to_rna.cr +19 -0
  25. data/lib/bioroebe/data/README.md +11 -8
  26. data/lib/bioroebe/data/electron_microscopy/pos_example.pos +396 -0
  27. data/lib/bioroebe/data/electron_microscopy/test_particles.star +36 -0
  28. data/lib/bioroebe/{shell/tk.rb → electron_microscopy/electron_microscopy_module.rb} +15 -10
  29. data/lib/bioroebe/electron_microscopy/simple_star_file_generator.rb +4 -9
  30. data/lib/bioroebe/fasta_and_fastq/show_fasta_headers.rb +27 -12
  31. data/lib/bioroebe/genome/README.md +4 -0
  32. data/lib/bioroebe/genome/genome.rb +67 -0
  33. data/lib/bioroebe/gui/gtk3/protein_to_DNA/protein_to_DNA.rb +18 -18
  34. data/lib/bioroebe/gui/gtk3/random_sequence/random_sequence.rb +19 -11
  35. data/lib/bioroebe/gui/shared_code/protein_to_DNA/protein_to_DNA_module.rb +14 -14
  36. data/lib/bioroebe/misc/ruler.rb +1 -0
  37. data/lib/bioroebe/parsers/genbank_parser.rb +353 -24
  38. data/lib/bioroebe/parsers/gff.rb +1 -9
  39. data/lib/bioroebe/pdb/parse_pdb_file.rb +1 -9
  40. data/lib/bioroebe/project/project.rb +1 -1
  41. data/lib/bioroebe/python/README.md +1 -0
  42. data/lib/bioroebe/python/__pycache__/mymodule.cpython-39.pyc +0 -0
  43. data/lib/bioroebe/python/gui/gtk3/all_in_one.css +4 -0
  44. data/lib/bioroebe/python/gui/gtk3/all_in_one.py +59 -0
  45. data/lib/bioroebe/python/gui/gtk3/widget1.py +20 -0
  46. data/lib/bioroebe/python/gui/tkinter/all_in_one.py +91 -0
  47. data/lib/bioroebe/python/mymodule.py +8 -0
  48. data/lib/bioroebe/python/protein_to_dna.py +33 -0
  49. data/lib/bioroebe/python/shell/shell.py +19 -0
  50. data/lib/bioroebe/python/to_rna.py +14 -0
  51. data/lib/bioroebe/python/toplevel_methods/open_in_browser.py +20 -0
  52. data/lib/bioroebe/python/toplevel_methods/palindromes.py +42 -0
  53. data/lib/bioroebe/python/toplevel_methods/rds.py +13 -0
  54. data/lib/bioroebe/python/toplevel_methods/three_delimiter.py +34 -0
  55. data/lib/bioroebe/python/toplevel_methods/time_and_date.py +43 -0
  56. data/lib/bioroebe/python/toplevel_methods/to_camelcase.py +11 -0
  57. data/lib/bioroebe/requires/require_the_bioroebe_project.rb +3 -1
  58. data/lib/bioroebe/sequence/nucleotide_module/nucleotide_module.rb +28 -25
  59. data/lib/bioroebe/sequence/protein.rb +105 -3
  60. data/lib/bioroebe/sequence/sequence.rb +61 -2
  61. data/lib/bioroebe/shell/menu.rb +3451 -3366
  62. data/lib/bioroebe/shell/misc.rb +51 -4311
  63. data/lib/bioroebe/shell/readline/readline.rb +1 -1
  64. data/lib/bioroebe/shell/shell.rb +11192 -28
  65. data/lib/bioroebe/siRNA/siRNA.rb +81 -1
  66. data/lib/bioroebe/string_matching/find_longest_substring.rb +3 -2
  67. data/lib/bioroebe/taxonomy/class_methods.rb +3 -8
  68. data/lib/bioroebe/taxonomy/constants.rb +4 -3
  69. data/lib/bioroebe/taxonomy/edit.rb +2 -1
  70. data/lib/bioroebe/taxonomy/help/help.rb +10 -10
  71. data/lib/bioroebe/taxonomy/info/check_available.rb +15 -9
  72. data/lib/bioroebe/taxonomy/info/info.rb +17 -2
  73. data/lib/bioroebe/taxonomy/info/is_dna.rb +46 -36
  74. data/lib/bioroebe/taxonomy/interactive.rb +139 -95
  75. data/lib/bioroebe/taxonomy/menu.rb +27 -18
  76. data/lib/bioroebe/taxonomy/parse_fasta.rb +3 -1
  77. data/lib/bioroebe/taxonomy/shared.rb +1 -0
  78. data/lib/bioroebe/taxonomy/taxonomy.rb +1 -0
  79. data/lib/bioroebe/toplevel_methods/aminoacids_and_proteins.rb +31 -24
  80. data/lib/bioroebe/toplevel_methods/databases.rb +1 -1
  81. data/lib/bioroebe/toplevel_methods/fasta_and_fastq.rb +101 -63
  82. data/lib/bioroebe/toplevel_methods/misc.rb +17 -16
  83. data/lib/bioroebe/toplevel_methods/nucleotides.rb +22 -5
  84. data/lib/bioroebe/toplevel_methods/open_in_browser.rb +2 -0
  85. data/lib/bioroebe/toplevel_methods/palindromes.rb +1 -2
  86. data/lib/bioroebe/toplevel_methods/taxonomy.rb +2 -2
  87. data/lib/bioroebe/toplevel_methods/to_camelcase.rb +5 -0
  88. data/lib/bioroebe/utility_scripts/align_open_reading_frames.rb +1 -9
  89. data/lib/bioroebe/utility_scripts/check_for_mismatches/check_for_mismatches.rb +1 -9
  90. data/lib/bioroebe/utility_scripts/compacter.rb +1 -9
  91. data/lib/bioroebe/utility_scripts/compseq/compseq.rb +1 -9
  92. data/lib/bioroebe/utility_scripts/create_batch_entrez_file.rb +1 -9
  93. data/lib/bioroebe/utility_scripts/dot_alignment.rb +1 -9
  94. data/lib/bioroebe/utility_scripts/move_file_to_its_correct_location.rb +1 -4
  95. data/lib/bioroebe/utility_scripts/showorf/constants.rb +0 -5
  96. data/lib/bioroebe/utility_scripts/showorf/reset.rb +1 -4
  97. data/lib/bioroebe/version/version.rb +2 -2
  98. data/lib/bioroebe/www/embeddable_interface.rb +101 -52
  99. data/lib/bioroebe/www/sinatra/sinatra.rb +186 -70
  100. data/lib/bioroebe/yaml/aminoacids/amino_acids_long_name_to_one_letter.yml +2 -2
  101. data/lib/bioroebe/yaml/configuration/browser.yml +1 -1
  102. data/lib/bioroebe/yaml/genomes/README.md +3 -4
  103. data/lib/bioroebe/yaml/restriction_enzymes/restriction_enzymes.yml +3 -3
  104. metadata +32 -35
  105. data/doc/setup.rb +0 -1655
  106. data/lib/bioroebe/genbank/genbank_parser.rb +0 -291
  107. data/lib/bioroebe/shell/add.rb +0 -108
  108. data/lib/bioroebe/shell/assign.rb +0 -360
  109. data/lib/bioroebe/shell/chop_and_cut.rb +0 -281
  110. data/lib/bioroebe/shell/constants.rb +0 -166
  111. data/lib/bioroebe/shell/download.rb +0 -335
  112. data/lib/bioroebe/shell/enable_and_disable.rb +0 -158
  113. data/lib/bioroebe/shell/enzymes.rb +0 -310
  114. data/lib/bioroebe/shell/fasta.rb +0 -345
  115. data/lib/bioroebe/shell/gtk.rb +0 -76
  116. data/lib/bioroebe/shell/history.rb +0 -132
  117. data/lib/bioroebe/shell/initialize.rb +0 -217
  118. data/lib/bioroebe/shell/loop.rb +0 -74
  119. data/lib/bioroebe/shell/prompt.rb +0 -107
  120. data/lib/bioroebe/shell/random.rb +0 -289
  121. data/lib/bioroebe/shell/reset.rb +0 -335
  122. data/lib/bioroebe/shell/scan_and_parse.rb +0 -135
  123. data/lib/bioroebe/shell/search.rb +0 -337
  124. data/lib/bioroebe/shell/sequences.rb +0 -200
  125. data/lib/bioroebe/shell/show_report_and_display.rb +0 -2901
  126. data/lib/bioroebe/shell/startup.rb +0 -127
  127. data/lib/bioroebe/shell/taxonomy.rb +0 -14
  128. data/lib/bioroebe/shell/user_input.rb +0 -88
  129. data/lib/bioroebe/shell/xorg.rb +0 -45
data/README.md CHANGED
@@ -2,13 +2,13 @@
2
2
  [![forthebadge](https://forthebadge.com/images/badges/made-with-ruby.svg)](https://www.ruby-lang.org/en/)
3
3
  [![Gem Version](https://badge.fury.io/rb/bioroebe.svg)](https://badge.fury.io/rb/bioroebe)
4
4
 
5
- This gem was <b>last updated</b> on the <span style="color: darkblue; font-weight: bold">24.06.2022</span> (dd.mm.yyyy notation), at <span style="color: steelblue; font-weight: bold">22:13:29</span> o'clock.
5
+ This gem was <b>last updated</b> on the <span style="color: darkblue; font-weight: bold">03.08.2022</span> (dd.mm.yyyy notation), at <span style="color: steelblue; font-weight: bold">23:23:28</span> o'clock.
6
6
 
7
7
  # The Bioroebe Project
8
8
 
9
9
  ## Bioroebe
10
10
 
11
- <img src="http://shevy.bplaced.net/BIOROEBE.png">
11
+ <img src="https://i.imgur.com/mAoP7AP.png">
12
12
  <img src="https://i.imgur.com/YqYxRBZ.png" style="margin: 4px; margin-left: 12px;"/>
13
13
  <img src="https://i.imgur.com/k7mMlg2.png" style="margin: 4px; margin-left: 12px;"/>
14
14
 
@@ -335,41 +335,6 @@ so I opted to go the yaml route. But if people want to use a hash
335
335
  instead, they can do so, too - see the <b>API</b> for codon tables
336
336
  lateron. Simply define your own constants and pass them to the
337
337
  appropriate methods.
338
-
339
- ## Support for other programming languages
340
-
341
- The main programming language for the bioroebe project is **ruby**.
342
- Ruby, from a language design point of view, is a great programming
343
- language - not necessarily all of ruby, but the subset that I use.
344
- It is very easy to quickly prototype ideas via ruby.
345
-
346
- However had, ruby is known to **not** be among the fastest programming
347
- languages about on this planet; so, it makes sense to use other
348
- languages too from this point of view. Additionally there are some
349
- software stacks in use in **other** programming languages, such as
350
- matplotlib and various more.
351
-
352
- Thus, it is important to **support other programming languages** as
353
- well, if there are useful libraries. The bioroebe project, after
354
- all, tries to be **practical**: it focuses on getting things done,
355
- no matter the language.
356
-
357
- This means that support for other programming languages can be
358
- found in this project as well, often using system() or similar
359
- functionality to tap into these other programming languages. Do
360
- not be surprised when that happens - the bioroebe project will
361
- also try to act as a **practical glue** towards functionality
362
- enabled via other projects. We want to get things done, no
363
- matter the programming language at hand!
364
-
365
- Whenever possible, though, the bioroebe project will try to be
366
- flexible in this regard, so ideally the same solution should
367
- work for many different programming languages.
368
-
369
- While Ruby is the primary language for this project, since as
370
- of 2021 I will try to officially support **java**, **jruby**
371
- and the **GraalVM**. This is on my TODO list, though - stay
372
- tuned for more updates in this regard.
373
338
 
374
339
  ## Readline support in the BioRoebe project
375
340
 
@@ -553,16 +518,16 @@ the DNA-to-Protein translation is somewhat simply kept as a
553
518
  Once you are inside a **running Bioshell**, you can do other **commands**
554
519
  such as this one here:
555
520
 
556
- random # ← This will generate a random DNA sequence.
521
+ random # ← This will generate a random DNA sequence. Each nucleotide has the same chance to be added.
557
522
 
558
523
  To **assign** a DNA sequence, do:
559
524
 
560
525
  assign ATAGGGCTTTT
561
526
 
562
- Note that since the year 2016, if you input a nucleotide sequence like
563
- the one above, without any other commands/words, then we will assume
527
+ Note that since as of the year <b>2016</b>, if you input a nucleotide sequence
528
+ like the one above, without any other commands/words, then we will assume
564
529
  that you did mean to do an assignment as-is anyway. The "assign" part
565
- then becomes superfluous.
530
+ then becomes superfluous and can be omitted.
566
531
 
567
532
  This is how this is simply done, by omitting the "assign" part of the
568
533
  above instruction altogether:
@@ -1073,18 +1038,18 @@ The text **banana** thus has the following suffixes:
1073
1038
 
1074
1039
  This subsection deals with some aspects of **HMMs**.
1075
1040
 
1076
- Why are HMMs useful in biology? They can be used to represent protein
1077
- families, for example (via pHMMs - profile hidden markov models).
1041
+ Why are HMMs useful in biology? They can be used to <b>represent protein
1042
+ families</b>, for example (via <b>pHMMs</b> - profile hidden markov models).
1078
1043
 
1079
1044
  Furthermore, they can show some bias in the mutation rate that can be
1080
1045
  observed. Different genomes are known to have different hotspots where
1081
- mutations are more likely to happen. These are examples where a HMM
1082
- may be useful.
1046
+ mutations are more likely to happen, for various reasons. These are
1047
+ examples where a HMM may be useful.
1083
1048
 
1084
- HMMs are usually based on the Shannon model where you assign different
1049
+ HMMs are usually based on the <b>Shannon model</b> where you assign different
1085
1050
  probabilities to "change" events. An example that was mentioned back
1086
- in 1948 was the english alphabet - some letters, and combinations of
1087
- letters, are more commonly seen. Shannon gave the example of "E"
1051
+ in <b>1948</b> was the english alphabet - some letters, and combinations
1052
+ of letters, are more commonly seen. Shannon gave the example of "E"
1088
1053
  versus "W", as shown in the following graph (a **finite state
1089
1054
  graph**):
1090
1055
 
@@ -1098,40 +1063,47 @@ DNA sequence, a 10-mer would be equivalent to **10 base pairs**.
1098
1063
  The individual transition states are based on an assumption of
1099
1064
  "randomness", but ensuring that these are truly random is not
1100
1065
  necessarily trivial. Computers do not really 'generate' true
1101
- randomness, at the least not when they are working solo. You
1102
- can even 'predict' some randomness here or there - see vulnerabilities
1103
- such as Specter or similar variants where software can read from
1104
- areas of the memory that should be inaccessible to them. Some
1105
- of this is based on co-predictions. For distributed computers,
1106
- you may often use random noise or decay of atoms as 'a source
1107
- of randomness''. For any DNA nucleotide sequence, we would
1108
- assume that each base pair has a 25% chance to exist at any
1109
- given position, but this is not necessarily true, for various
1110
- reasons. An interesting thought is ... why is ATP so important?
1111
- Yes, due to it being 'the energy currency in a cell' but .. why
1112
- is this ATP aka adenine? Why not GTP, aka guanine or any of
1113
- the other two nucleotides? I can not answer the question; there may
1114
- be many reasons, including differential chemical storage power as
1115
- well as mere random chance event in evolution, but for whatever
1066
+ randomness, at the least not when they are working solo, "on
1067
+ their own". You can even 'predict' some randomness here or there
1068
+ via various techniques - see vulnerabilities such as <b>Specter</b>
1069
+ or similar variants where software can read from areas of the
1070
+ memory that should be inaccessible to them. Some of this is based
1071
+ on co-predictions. For distributed computers, you may often use
1072
+ random noise or decay of atoms as 'a source of randomness'. For
1073
+ any DNA nucleotide sequence, we would assume that each base pair
1074
+ has a 25% chance to exist at any given position, but this is not
1075
+ necessarily true, again for various reasons.
1076
+
1077
+ An interesting thought is ... why is <b>ATP</b> so important?
1078
+ Yes, of course due to it being 'the energy currency in a cell' but ..
1079
+ why is this ATP, aka adenine? Why not GTP, aka guanine or any of
1080
+ the other two nucleotides? (GTP is used too, but why? Why not
1081
+ CTP and TTP?) I can not answer this question; there may
1082
+ be many reasons, including differential chemical storage power
1083
+ as well as mere random chance event in evolution, but for whatever
1116
1084
  the reason, you will not find a complete 25% percentage value
1117
1085
  for every given "slot" in DNA, depending on the organism.
1118
1086
 
1119
1087
  From a practical point of view, how can we approach Hidden Markov
1120
- Models?
1088
+ Models and use them?
1121
1089
 
1122
- Let's take the following sequence:
1090
+ Let's take the following simple sequence:
1123
1091
 
1124
1092
  ACGTACGC
1125
1093
 
1126
1094
  From this sequence we can see that the <b>3-mer</b> "ACG"
1127
1095
  is followed by either a T, or a C. Have a look at the sequence
1128
- to see if you can identify the two ACG subsequences there.
1096
+ again to see if you can identify the two ACG subsequences
1097
+ there. You can see one at the start, and the other one
1098
+ following a bit later, hence why we come to the conclusion
1099
+ that either a T or a C will follow this <b>3-mer</b>.
1129
1100
 
1130
- The probability of either T or C, thus, is 0.5 (50%);
1131
- for A and G to follow there is 0% so the latter two can
1132
- be ignored.
1101
+ The probability of either T or C to occur on <b>that</b>
1102
+ position, thus, is 0.5 (50%); for A and G to follow there
1103
+ is 0% so the latter two can be ignored.
1133
1104
 
1134
- Thus, we could use a ruby Hash as follows:
1105
+ Thus, we could use a ruby Hash as follows that should
1106
+ describe these probabilities:
1135
1107
 
1136
1108
  probabilities = {'T': 0.5, 'C': 0.5} # ignoring A and G here, but we could denote them via 0 as well
1137
1109
 
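Note on the hunk above: the README derives the 0.5 / 0.5 split for the successors of the 3-mer "ACG" by eye. A minimal sketch of how that count could be automated follows. The method name `successor_probabilities` is hypothetical and is not part of the Bioroebe API; the sketch uses plain Ruby only and returns String keys rather than the Symbol keys used in the README's Hash.

```ruby
# Estimate which nucleotide follows a given k-mer within a sequence.
# Hypothetical helper for illustration; not a Bioroebe method.
def successor_probabilities(sequence, kmer)
  counts = Hash.new(0)
  # Visit every start position where the k-mer plus one successor still fits.
  (0..sequence.size - kmer.size - 1).each do |i|
    next unless sequence[i, kmer.size] == kmer
    counts[sequence[i + kmer.size]] += 1
  end
  total = counts.values.sum
  return {} if total.zero? # the k-mer never occurs with a successor
  # Turn the raw counts into relative frequencies.
  counts.transform_values { |n| n.to_f / total }
end

probabilities = successor_probabilities('ACGTACGC', 'ACG')
p probabilities
```

For the README's sequence ACGTACGC this gives 0.5 for both T and C; A and G do not appear in the result because they never follow "ACG" there.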
@@ -1217,34 +1189,6 @@ each edge.
1217
1189
  Parsimony assumes that substitutions are rare and that back-mutations
1218
1190
  do not occur.
1219
1191
 
1220
- ## Random stuff
1221
-
1222
- You can generate random DNA sequences in the shell:
1223
-
1224
- random dna 20
1225
- random dna 25
1226
- random dna 30
1227
-
1228
- This will generate random DNA sequences, with a length
1229
- of 20, 25, 30, respectively. This may not be very useful
1230
- but it was important that this functionality is made
1231
- available somewhere.
1232
-
1233
- You can also use some toplevel-methods to generate, e. g.
1234
- 20 random aminoacids:
1235
-
1236
- Bioroebe.random_aminoacid? 20 # => "UAVHYQQESWUYAOVESEIY"
1237
-
1238
- Note that there may exist other APIs within the Bioroebe project
1239
- that do the same as well.
1240
-
1241
- If you would like to use a ruby-gtk3 widget have a look
1242
- at **RandomSequence**, under **bioroebe/gtk3/random_sequence/**.
1243
- It works with aminoacids, DNA and RNA, and allows the user to
1244
- create random sequences. (If you need weighted randomness then
1245
- you currently have to use the commandline variant. Perhaps I may
1246
- add support into the GUI directly for this one day.)
1247
-
1248
1192
  ## Displaying the main sequence with delimiter characters
1249
1193
 
1250
1194
  From within the <b>bioshell</b>, you can use some alternative ways to
@@ -1486,24 +1430,9 @@ You can simulate this via the following API:
1486
1430
  Bioroebe.cleave_with_trypsin(sequence_goes_in_here)
1487
1431
  Bioroebe.cleave :with_trypsin, sequence_goes_in_here
1488
1432
 
1489
- Currently (July 2021) only support for Trypsin is included, but
1433
+ Currently (<b>July 2021</b>) only support for Trypsin is included, but
1490
1434
  in the long run the goal is to add as many digestive (peptide-bond
1491
1435
  cleaving) enzymes here as possible.
1492
-
1493
- ## Freezing the main sequence - and unfreezing it again
1494
-
1495
- You can **freeze** the BioShell, meaning that it will no longer allow
1496
- for the main sequence to be modified, via:
1497
-
1498
- freeze
1499
-
1500
- To unfreeze again, issue:
1501
-
1502
- unfreeze
1503
-
1504
- This functionality has been added because the shell may sometimes be
1505
- quite eager to change the main sequence, so we needed a way to disable
1506
- any further modifications (until "unfreeze" is issued that is).
1507
1436
 
1508
1437
  ## MUMmer
1509
1438
 
@@ -2714,18 +2643,6 @@ This may look as follows:
2714
2643
 
2715
2644
  <img src="https://i.imgur.com/gAZg8qG.png" style="margin: 1em; margin-left: 3em">
2716
2645
 
2717
- ## Obtaining a subsequence from a Bioroebe::Sequence object
2718
-
2719
- Say that you have the DNA sequence **ATGCATGCAAAA**.
2720
-
2721
- There are several ways how to obtain a subsequence from
2722
- this. One variant will be shown next, by making use of
2723
- the method called **.subseq()**.
2724
-
2725
- Example:
2726
-
2727
- seq = Bioroebe::Sequence.new("ATGCATGCAAAA"); seq.subseq(1,3) # => "ATG"
2728
-
2729
2646
  ## Bioroebe::Protein
2730
2647
 
2731
2648
  This class is a subclass of class **Bioroebe::Sequence**. The
@@ -2740,15 +2657,26 @@ functionality is also available in another method.
2740
2657
  For now keep this in mind; at some later point I may decide whether
2741
2658
  this class is to be kept or not.
2742
2659
 
2743
- ## Permanently disabling showing the startup-introduction of the Bioshell
2660
+ In July 2022 I noticed that the bio-gem has the following method:
2744
2661
 
2745
- If you do not want to see the start-up intro, you can try
2746
- any of the following:
2662
+ p Bio::AminoAcid['A'] # => "Ala"
2747
2663
 
2748
- bioshell --permanently-disable-startup-intro
2749
- bioshell --permanently-disable-startup-notice
2750
- bioshell --permanently-no-startup-intro
2751
- bioshell --permanently-no-startup-info
2664
+ I liked this functionality, but class Bioroebe::Protein already
2665
+ has a [] method which is used to instantiate a new
2666
+ instance of class Bioroebe::Protein. So, a toplevel method
2667
+ was added instead.
2668
+
2669
+ Usage example:
2670
+
2671
+ Bioroebe::Aminoacids.one_to_three('A') # => Ala
2672
+
2673
+ So this is the equivalent to what the bio-gem does, more or
2674
+ less.
2675
+
2676
+ If you want to find out the name of a one-letter aminoacid
2677
+ you can also use this method:
2678
+
2679
+ Bioroebe::Protein.name('A') # => "alanine"
2752
2680
 
2753
2681
  ## Decoding aminoacids
2754
2682
 
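Note on the hunk above: the added README text describes a one-letter to three-letter aminoacid conversion (`Bioroebe::Aminoacids.one_to_three('A')`). The diff does not show how that method is implemented, so the following is only a sketch of the idea, assuming a plain lookup table over the 20 standard aminoacids. The constant `ONE_TO_THREE` and the standalone method `one_to_three` are illustrative names, not Bioroebe internals.

```ruby
# Illustrative lookup table for the 20 standard aminoacids.
# This is NOT Bioroebe's internal data structure, just a sketch.
ONE_TO_THREE = {
  'A' => 'Ala', 'R' => 'Arg', 'N' => 'Asn', 'D' => 'Asp', 'C' => 'Cys',
  'Q' => 'Gln', 'E' => 'Glu', 'G' => 'Gly', 'H' => 'His', 'I' => 'Ile',
  'L' => 'Leu', 'K' => 'Lys', 'M' => 'Met', 'F' => 'Phe', 'P' => 'Pro',
  'S' => 'Ser', 'T' => 'Thr', 'W' => 'Trp', 'Y' => 'Tyr', 'V' => 'Val'
}.freeze

# Return the three-letter code for a one-letter code, or nil if unknown.
# Input is upcased so that 'a' and 'A' behave the same.
def one_to_three(letter)
  ONE_TO_THREE[letter.to_s.upcase]
end

p one_to_three('A')
```

Returning nil for unknown letters is a design choice of this sketch; the real method may behave differently for ambiguous or non-standard codes.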
@@ -2934,27 +2862,6 @@ Note that presently (April 2020) not all of PROSITE may be supported
2934
2862
  via this regex, but in the long run the plan is to support all
2935
2863
  of PROSITE's regex expression.
2936
2864
 
2937
- ## Determining how many stop codons existing in a given sequence
2938
-
2939
- You can use **bin/n_stop_codons_in_this_sequence** to determine
2940
- how many stop codons exist in a given sequence at hand.
2941
-
2942
- Usage example from the commandline:
2943
-
2944
- n_stop_codons_in_this_sequence ATGACGTACGTCAGTCAGTGATAGTAA # => 4
2945
-
2946
- You can also separate these via a ' ' spacer on the commandline of
2947
- course:
2948
-
2949
- n_stop_codons_in_this_sequence ATG ACG TAC GTC AGT CAG TGA TAG TAA # => 4
2950
-
2951
- Internally this makes use of the method called
2952
- <b>Bioroebe.n_stop_codons_in_this_sequence?</b> or one of its
2953
- aliased names. Usage example for the method, just as in the
2954
- first example shown above:
2955
-
2956
- Bioroebe.n_stop_codons_in_this_sequence "ATGACGTACGTCAGTCAGTGATAGTAA" # => 4
2957
-
2958
2865
  ## AT and GC content
2959
2866
  ![alt text][cat1]
2960
2867
  [cat1]: https://i.imgur.com/Qmd7R0p.png
@@ -3176,47 +3083,45 @@ can try to use:
3176
3083
  On class Bioroebe::Sequence. More customizability may be added
3177
3084
  to that method in this regard, if users need this.
3178
3085
 
3179
- ## The Hydropathy index
3086
+ ### Obtaining a subsequence from a Bioroebe::Sequence object
3180
3087
 
3181
- You can display the hydropathy index for aminoacids from within
3182
- the **bioshell**.
3088
+ Say that you have the DNA sequence **ATGCATGCAAAA**.
3183
3089
 
3184
- Simply issue:
3090
+ There are several ways how to obtain a subsequence from
3091
+ this. One variant will be shown next, by making use of
3092
+ the method called **.subseq()**.
3185
3093
 
3186
- hydropathy?
3094
+ Example:
3187
3095
 
3188
- ## Generate DNA
3096
+ seq = Bioroebe::Sequence.new("ATGCATGCAAAA"); seq.subseq(1,3) # => "ATG"
3189
3097
 
3190
- You can generate random DNA strings by issuing the following
3191
- code:
3098
+ You can also randomize the sequence, via .randomize().
3192
3099
 
3193
- x = Bioroebe.random_dna 50 # => "AGACATCCGGCTTGGATACCTCATAAGTCATATCAGCATCGTCGGACATT"
3100
+ Example:
3194
3101
 
3195
- As can be seen in the example above, after the #, a String will be
3196
- returned representing that nucleotide sequence.
3102
+ x = Bioroebe::Sequence.new; x.randomize
3197
3103
 
3198
- The number given to .random_dna() tells the method how many nucleotides
3199
- should be generated.
3104
+ This is similar to the method in Bioruby here:
3200
3105
 
3201
- ## The GFF file format
3106
+ https://github.com/bioruby/bioruby/blob/master/lib/bio/sequence/common.rb#L243
3202
3107
 
3203
- From within the **bioshell** you can analyze .gff and .gff3 files,
3204
- such as by issuing the following command:
3108
+ ## The Hydropathy index
3205
3109
 
3206
- gff3? foobar.gff3
3110
+ You can display the hydropathy index for aminoacids from within
3111
+ the **bioshell**.
3207
3112
 
3208
- Evidently for this to work the file at hand has to exist.
3113
+ Simply issue:
3209
3114
 
3210
- ## Shuffling the DNA/RNA string in the bioshell
3115
+ hydropathy?
3211
3116
 
3212
- Via
3117
+ ## The GFF file format
3213
3118
 
3214
- shuffle
3119
+ From within the **bioshell** you can analyze .gff and .gff3 files,
3120
+ such as by issuing the following command:
3215
3121
 
3216
- you can randomly rearrange the main DNA/RNA string.
3122
+ gff3? foobar.gff3
3217
3123
 
3218
- This can be useful if you just wish to quickly "test" new
3219
- compositions of the same nucleotide.
3124
+ Evidently for this to work the file at hand has to exist.
3220
3125
 
3221
3126
  ## The NCBI Taxonomy database (the Taxonomy submodule of the Bioroebe project)
3222
3127
 
@@ -3353,47 +3258,6 @@ nucleotides by issuing:
3353
3258
 
3354
3259
  show_individual_weight_of_the_four_dna_nucleotides
3355
3260
 
3356
- ## Truncating output in the bioroebe-shell
3357
- ![alt text][cat1]
3358
- [cat1]: https://i.imgur.com/Qmd7R0p.png
3359
-
3360
- **DNA/RNA sequences** can become very long and then become
3361
- quite difficult to view, read and handle on the commandline.
3362
-
3363
- Normally the bioroebe shell will truncate output of DNA sequences
3364
- that are "too long". This is mostly done so that working with
3365
- very long sequences becomes a bit more convenient.
3366
-
3367
- Sometimes this can become an antifeature, though, so the user
3368
- must be able to toggle this at his or her own discretion.
3369
-
3370
- By default, the bioroebe-shell (bioshell) will always try
3371
- to truncate output, but you can toggle this behaviour by
3372
- issuing:
3373
-
3374
- do not truncate
3375
-
3376
- In theory, other "do not" actions are also supported, or will
3377
- be supported in the future; right now (Oct 2019) this is a bit
3378
- limited.
3379
-
3380
- From the toplevel, you can use this method:
3381
-
3382
- Bioroebe.do_not_truncate
3383
-
3384
- The above instruction will toggle the truncate behaviour
3385
- to not truncate, ever.
3386
-
3387
- If you need to do so within the bioshell, this is the way:
3388
-
3389
- no_truncate
3390
-
3391
- Or simply
3392
-
3393
- truncate
3394
-
3395
- This will toggle, like a switch.
3396
-
3397
3261
  ## Rosalind Challenges
3398
3262
  ![alt text][cat1]
3399
3263
  [cat1]: https://i.imgur.com/Qmd7R0p.png
@@ -3530,31 +3394,6 @@ investing more time into Rosalind. Let's focus on solving
3530
3394
  real, existing problems instead - at the least as far as
3531
3395
  the Bioroebe project is concerned.
3532
3396
 
3533
- ## Numbers as input in the bioshell
3534
- ![alt text][cat1]
3535
- [cat1]: https://i.imgur.com/Qmd7R0p.png
3536
-
3537
- You can input a number in the **BioShell** such as <b style="color: darkblue">3</b>.
3538
-
3539
- This will attempt to <b>display the first 3 nucleotides</b> of
3540
- the assigned **main sequence**. It will only work if you have
3541
- assigned a sequence prior to that, though.
3542
-
3543
- Examples:
3544
-
3545
- 3
3546
- 33
3547
- 15
3548
-
3549
- ## transeq
3550
- ![alt text][cat1]
3551
- [cat1]: https://i.imgur.com/Qmd7R0p.png
3552
-
3553
- You can convert a DNA sequence into an aminoacid sequence by
3554
- doing this:
3555
-
3556
- transeq
3557
-
3558
3397
  ## Align two different sequences
3559
3398
  ![alt text][cat1]
3560
3399
  [cat1]: https://i.imgur.com/Qmd7R0p.png
@@ -3866,22 +3705,6 @@ does not (yet?) have support for comparing two genomes to
3866
3705
  one another and generate a visual map indicating the findings
3867
3706
  there.
3868
3707
 
3869
- ## Do not create directories on startup of the shell
3870
-
3871
- By default the bioshell will try to create some directories
3872
- on startup. This may not always be desired by the user
3873
- though, so an option has to exist to disable this functionality.
3874
-
3875
- Internally the variable @internal_hash[:create_directories_on_startup_of_the_shell]
3876
- keeps track of whether directories on startup of the shell will
3877
- be created.
3878
-
3879
- To disable this behaviour on startup of the bioshell, try
3880
- something like this:
3881
-
3882
- bioshell --do-not-create-directories-on-startup
3883
- bioshell --do-not-create-directories
3884
-
3885
3708
  ## class Bioroebe::MoveFileToItsCorrectLocation
3886
3709
 
3887
3710
  This class will move a bio-file to its "correct" location, with respect
@@ -3924,15 +3747,6 @@ synonymous, aka aliases):
3924
3747
  ruler2 25 # ← use 25 characters per line
3925
3748
  ruler2 50 # ← use 50 characters per line
3926
3749
 
3927
- ## Generating a random nucleotide sequence based on frequencies
3928
-
3929
- If you ever need to generate a nucleotide frequency then you can use
3930
- the following method:
3931
-
3932
- Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies
3933
- Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies 100
3934
- Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies 500
3935
-
3936
3750
  ## The Mouse
3937
3751
 
3938
3752
  This subsection is about the **mouse**, in particular relevant
@@ -4050,57 +3864,24 @@ has". Genes in itself are not that well-defined, so they are not necessarily
4050
3864
  the primary means of complexity. Think of this more as an interactome,
4051
3865
  where RNAs play a major dynamic role as well.
4052
3866
 
4053
- ## Bioroebe::ProfilePattern
3867
+ ## class Bioroebe::DisplayOpenReadingFrames
4054
3868
 
4055
- This class can be used to generate nucleotide sequences that
4056
- are not quite "random". For example, to generate sequences
4057
- that may "simulate" a TATA box.
3869
+ **class Bioroebe::DisplayOpenReadingFrames**, created in **May 2020**,
3870
+ will eventually replace the older **class Bioroebe::ShowOrf**. Thus,
3871
+ **class Bioroebe::DisplayOpenReadingFrames** will have to remain quite
3872
+ flexible. It shall also support **sixpack** and **showorf** from the
3873
+ **Emboss online tools**. (In fact, supporting these two use cases
3874
+ was the original reason as to why this class has been created.)
4058
3875
 
4059
- The idea for this class is to be extended into allowing
4060
- HMMs (Hidden Markov Models) one day.
3876
+ Where does the code to this class reside?
4061
3877
 
4062
- Usage example:
3878
+ It can be found here:
4063
3879
 
4064
- _ = Bioroebe::ProfilePattern.new(ARGV, :do_not_run_yet)
4065
- _.generate_sequence_based_on_this_profile
3880
+ bioroebe/utility_scripts/display_open_reading_frames/
3881
+ require 'bioroebe/utility_scripts/display_open_reading_frames/display_open_reading_frames.rb'
4066
3882
 
4067
- Such a profile will encode the profile specifying the preferred sequence
4068
- letters for each position in a section of DNA. You have to provide
4069
- the Hash into the method generate_sequence_based_on_this_profile() -
4070
- or you use the default Hash, which is stored in the constant
4071
- called **PER_POSITION_HASH**.
4072
-
4073
- That profile should be a Hash, with keys pointing to A, T, C, G
4074
- and the values being an Array of likelihood chance there,
4075
- as a number, such as 140. These values are also called
4076
- **scores**. Each score contains a number for each position
4077
- that indicates how likely it is to find the given
4078
- nucleotide at that location.
4079
-
4080
- You can also use this class to generate a random DNA string,
4081
- similar to the method called
4082
- **Bioroebe.generate_random_dna_sequence()**. The difference
4083
- is that class ProfilePattern allows for a bit more fine-tuned
4084
- control. The class will likely be extended in the future too.
4085
-
4086
- ## class Bioroebe::DisplayOpenReadingFrames
4087
-
4088
- **class Bioroebe::DisplayOpenReadingFrames**, created in **May 2020**,
4089
- will eventually replace the older **class Bioroebe::ShowOrf**. Thus,
4090
- **class Bioroebe::DisplayOpenReadingFrames** will have to remain quite
4091
- flexible. It shall also support **sixpack** and **showorf** from the
4092
- **Emboss online tools**. (In fact, supporting these two use cases
4093
- was the original reason as to why this class has been created.)
4094
-
4095
- Where does the code to this class reside?
4096
-
4097
- It can be found here:
4098
-
4099
- bioroebe/utility_scripts/display_open_reading_frames/
4100
- require 'bioroebe/utility_scripts/display_open_reading_frames/display_open_reading_frames.rb'
4101
-
4102
- The display of this class is typically aimed for the commandline,
4103
- but it is planned to use the class on the www too (via sinatra).
3883
+ The display of this class is typically aimed for the commandline,
3884
+ but it is planned to use the class on the www too (via sinatra).
4104
3885
 
4105
3886
  Take note that this class also reports how many ORFs (open reading
4106
3887
  frames) have been found. The number displayed here differs from
@@ -4462,28 +4243,6 @@ the BioRoebe-Shell, then you can use either of the following:
4462
4243
 
4463
4244
  seq?
4464
4245
  seq_with_tab?
4465
-
4466
- ## Prompt (the shell prompt)
4467
-
4468
- You can set a <b>custom prompt</b>, via the keywords
4469
- "prompt" or "set_prompt".
4470
-
4471
- To display the <b>current working directory</b>, do:
4472
-
4473
- prompt pwd
4474
-
4475
- To revert to the old default again, do this:
4476
-
4477
- prompt REVERT
4478
- prompt revert
4479
- prompt DEFAULT
4480
- prompt default
4481
-
4482
- If you do not want to set any prompt, do:
4483
-
4484
- prompt none
4485
-
4486
-
4487
4246
 
4488
4247
  ## Leader and Trailer
4489
4248
 
@@ -4971,17 +4730,17 @@ For now, here is the list:
4971
4730
 
4972
4731
  ## The T-Bacteriophages
4973
4732
 
4974
- The following table only shows a short summary for the **T-phages**.
4733
+ The following table only shows a short summary for the <b>T-phages</b>.
4975
4734
 
4976
- name of the phage | Plaque size | phage-head diameter (nm) | tail diameter | latent period (in minutes) | Burst size
4977
- -------------------|--------------|---------------------------|----------------|----------------------------|-------------
4978
- T1 | medium | 50 | 150 x 15 | 13 | 180
4979
- T2 | small | 65 x 80 | 120 x 20 | 21 | 120
4980
- T3 | large | 45 | invisible | 13 | 300
4981
- T4 | small | 65 x 80 | 120 x 20 | 23.5 | 300
4982
- T5 | small | 100 | tiny | 40 | 300
4983
- T6 | small | 65 x 80 | 120 x 20 | 25.5 | 200-300
4984
- T7 | large | 45 | invisible | 13 | 300
4735
+ name of the phage | Plaque size | phage-head diameter (nm) | tail diameter | latent period (in minutes) | Burst size | n genes
4736
+ -------------------|--------------|---------------------------|----------------|----------------------------|-------------|------------
4737
+ T1 | medium | 50 | 150 x 15 | 13 | 180 |
4738
+ T2 | small | 65 x 80 | 120 x 20 | 21 | 120 |
4739
+ T3 | large | 45 | invisible | 13 | 300 |
4740
+ T4 | small | 65 x 80 | 120 x 20 | 23.5 | 300 | 300
4741
+ T5 | small | 100 | tiny | 40 | 300 |
4742
+ T6 | small | 65 x 80 | 120 x 20 | 25.5 | 200-300 |
4743
+ T7 | large | 45 | invisible | 13 | 300 |
4985
4744
 
4986
4745
  The next table will show some phage genomes.
4987
4746
 
@@ -5392,215 +5151,6 @@ that format.
5392
5151
  Presently (**May 2020**) there is no support for the mmCIF format
5393
5152
  in the Bioroebe project, but this will eventually change.
5394
5153
 
5395
- ## Working with PDB files (.pdb)
5396
- ![alt text][cat1]
5397
- [cat1]: https://i.imgur.com/Qmd7R0p.png
5398
-
5399
- The **PDB**, founded in the year **1971**, holds lots of **atomic
5400
- structures of proteins**.
5401
-
5402
- In **July 2016** it contained **121000 structures**.
5403
-
5404
- In **February 2018** it contained **~124000 structures**
5405
- (from X-ray crystallography), and about **~12000 NMR
5406
- structures**. <b>NMR</b> is limited to about <b>350 amino
5407
- acids maximum length</b>, give or take.
5408
-
5409
- In **April 2020** the PDB contained **163141 structures**.
5410
-
5411
- We can see that more and more structures are available
5412
- nowadays - a trend that will most likely continue or
5413
- even accelerate. (Let's hope the quality also remains
5414
- high.)
5415
-
5416
- A typical .pdb file contains entries such as this:
5417
-
5418
- RTyp Num Atm Res Ch ResN X Y Z Occ Temp PDB Line
5419
- ATOM 1 N ASP L 1 4.060 7.307 5.186 1.00 51.58 1FDL 93
5420
- ATOM 2 CA ASP L 1 4.042 7.776 6.553 1.00 48.05 1FDL 94
5421
- ATOM 3 N VAL A 25 32.433 16.336 57.540 1.00 11.92 A1 N
5422
- ATOM 4 CA VAL A 25 31.132 16.439 58.160 1.00 11.85 A1 C
5423
- ATOM 5 C VAL A 25 30.447 15.105 58.363 1.00 12.34 A1 C
5424
-
5425
- (Note the first line; **RTyp** is just an explanation for the ATOM
5426
- entries below that line).
5427
-
5428
- The sequence starts from the N-terminal residue for proteins; see
5429
- the <b>Atm</b> entry at <b>Num 1</b>.
5430
-
5431
- The **meaning of these entries** is as follows:
5432
-
5433
- 1) RTyp: Record Type
5434
- 2) Num: Serial number of the atom. Each atom has a unique serial number.
5435
- 3) Atm: Atom name (in IUPAC format).
5436
- 4) Res: Residue name (IUPAC format).
5437
- 5) Ch: Chain to which the atom belongs (in this case, L for light chain of an antibody).
5438
- 6) ResN: Residue sequence number. This will be incremental, e.g. 1, 2, 3, 4 and so forth.
5439
- 7,8,9) X, Y, Z: Cartesian coordinates specifying atomic position in space.
5440
- 10) Occ: Occupancy factor
5441
- 11) Temp: Temperature factor (atoms disordered in the crystal have high
5442
- temperature factors; they are "wobbly" with a high factor.
5443
- This is also called the B-factor).
5444
- 12) PDB: The PDB data file unique identifier.
5445
- 13) Line: Line (record) number in the data file.
5446
-
5447
- Typically the entry on the most right area, the last one, specifies
5448
- which atom it is. A **H** stands for a hydrogen atom; the other atoms
5449
- are "heavy" atoms (heavier than hydrogen most definitely).
5450
-
5451
- Most .pdb files will contain **SEQRES** entries. These entries will list
5452
- the primary sequence of the polymeric molecules present in the entry.
5453
- You can notice this by looking at the standard 3-character code
5454
- used by SEQRES here, for the canonical amino acids. So, for instance,
5455
- the amino acids that will be mentioned in a SEQRES entry are
5456
- ALA, CYS, ASP, GLU, PHE, GLY, HIS, ILE, LYS, LEU, MET, ASN,
5457
- PRO, GLN, ARG, SER, THR, VAL, TRP and TYR. You can use the
5458
- method **Bioroebe.three_to_one()** to convert back to the
5459
- one-letter chain such as follows:
5460
-
5461
- Bioroebe.three_to_one('PHE') # => "F"
5462
-
5463
- The data in a .pdb file need not necessarily only be a protein, with
5464
- a specific aminoacid sequence. It may also include DNA. An example
5465
- for such a molecule is
5466
- <b><a href="http://rcsb.org/pdb/explore/explore.do?structureId=2DGC">2dgc</a></b>,
5467
- which includes a protein chain and a DNA chain.
5468
-
5469
- As far as the **bioroebe project** is concerned, you can parse .pdb files
5470
- via the following class:
5471
-
5472
- Bioroebe::ParsePdbFile.new
5473
- Bioroebe::ParsePdbFile.new(path_to_the_pdb_file_here)
5474
- Bioroebe::ParsePdbFile.new('/foo/bar/ack.pdb')
5475
-
5476
- This class also allows some shortcuts for integrated .pdb files,
5477
- that is files that are bundled with the bioroebe project:
5478
-
5479
- Bioroebe::ParsePdbFile.new ':1fat'
5480
-
5481
- This requires a String because ruby symbols may not start with
5482
- a number. Note that this also works through the commandline,
5483
- such as:
5484
-
5485
- parse_pdb_file :1fat
5486
-
5487
- A shell such as bash does not understand ruby symbols, so instead
5488
- a string will be passed in, being :1fat. The ParsePdbFile will
5489
- handle this correctly internally.
5490
-
5491
- Note that a small bug was fixed in the file parse_pdb_file.rb;
5492
- some entries were skipped due to an erroneous loop in the ruby
5493
- file. This was corrected in **May 2020**.
5494
-
5495
- In **March 2021** the ability to use entries such as ':1fat'
5496
- was removed again; the code remains though. The reason why
5497
- this was removed was that the .pdb files are quite large,
5498
- so distributing them via the bioroebe project makes no real
5499
- sense. Consider simply downloading the .pdb files; you
5500
- can use this from the bioshell or via something
5501
- like:
5502
-
5503
- pdb 5TIM
5504
-
5505
- Note that you can also return the aminoacid-sequence from a
5506
- .pdb file directly, since as of **May 2020**.
5507
-
5508
- Example for this:
5509
-
5510
- Bioroebe.return_aminoacid_sequence_from_this_pdb_file "1VII.pdb" # => "MLSDEDFKAVFGMTRSAFANLPLWKQQNLKKEKGLF"
5511
-
5512
- The first argument should be **the path to the (local)
5513
- .pdb file at hand**. (In theory support for remote .pdb
5514
- files could also be added easily, but right now this
5515
- is not possible, so you have to download it first.)
5516
-
5517
- The **specification for .pdb files** can be read at the following
5518
- two remote resources:
5519
-
5520
- http://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html
5521
- http://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#ATOM
5522
-
5523
- Note that the parse_pdb_file.rb can also do some additional
5524
- things, such as calculating the maximum distance between
5525
- atoms in that file, via the method
5526
- **.try_to_determine_the_max_distance_between_the_atoms_in_this_protein()**.
5527
-
5528
- If you wish to report the secondary structures from a given .pdb file
5529
- then you can use the following class:
5530
-
5531
- require 'bioroebe/pdb/report_secondary_structures_from_this_pdb_file.rb'
5532
-
5533
- Bioroebe::ReportSecondaryStructuresFromThisPdbFile.new
5534
- Bioroebe::ReportSecondaryStructuresFromThisPdbFile.new('foobar.pdb')
5535
-
5536
- If you wish to obtain the FASTA sequence of a particular remote
5537
- .pdb file then you can use this API:
5538
-
5539
- x = Bioroebe.return_fasta_sequence_from_this_pdb_file "2bts" # => "MLSDEDFKAVFGMTRSAFANLPLWKQQNLKKEKGLF"
5540
-
5541
- Keep in mind that this is the FASTA sequence; the .pdb file itself
5542
- has another format, and contains a lot more information, such as
5543
- the various ATOM entries.
5544
-
5545
- Since as of **June 2020** the command **fetch** also works from
5546
- within the Bioshell, similar to how pymol **works**. This allows
5547
- us to quickly download a remote .pdb file.
5548
-
5549
- fetch 2BTS
5550
-
5551
- You can also use the following toplevel-API to download a remote
5552
- .pdb file:
5553
-
5554
- Bioroebe.download_this_pdb
5555
- Bioroebe.download_this_pdb '355D'
5556
- Bioroebe.download_this_pdb '1K4R' # This is the Dengue Virus
5557
-
5558
- Note that this will be automatically moved to the "correct" default
5559
- position in the bioroebe-project, under the **pdb/** subdirectory.
5560
-
5561
- You can also invoke this script from the commandline via
5562
- **bin/download_this_pdb**, like in this way:
5563
-
5564
- download_this_pdb 355D
5565
-
5566
- This works with several .pdb files in one go as well:
5567
-
5568
- download_this_pdb 1NR6 2F9Q 3TDA 2HI4 2V0M
5569
-
5570
- They would all be downloaded one after the other. Be aware that
5571
- this will overwrite the old .pdb files on that position, so
5572
- if you don't want this, I recommend making a backup of the
5573
- **pdb/** subdirectory before invoking the above call.
5574
-
5575
- You can also turn the FASTA sequence stored in a .pdb file into
5576
- a .fasta file, via **--create-fasta-file**.
5577
-
5578
- Usage examples:
5579
-
5580
- parsedb 1NR6 --create-fasta-file
5581
- parsedb 2F9Q --create-fasta-file
5582
- parsedb 3TDA --create-fasta-file
5583
- parsedb 2HI4 --create-fasta-file
5584
- parsedb 2V0M --create-fasta-file
5585
-
5586
- So if you have a file called <b>1NR6.pdb</b> and you use
5587
- the first input, a .fasta file will be created. If such
5588
- a .pdb file does not exist then this will not work, so
5589
- make sure to download the .pdb file before invoking
5590
- this commandline-flag.
5591
-
5592
- Last but not least, the following table shall document the
5593
- PDB format - it is not yet complete, but it is intended
5594
- to add the remaining datasets eventually:
5595
-
5596
- Record Name Describes
5597
- MODRES Modifications to standard residues
5598
- HET Nonstandard residues (as well as ligands, ions and water)
5599
- HETNAM Full chemical name of the residue
5600
- HETSYM Synonyms for the residue
5601
- FORMUL Chemical formula of the residue
5602
- KEYWDS specifies keywords, such as "FK506 BINDING PROTEIN, FKBP12, CIS-TRANS PROLYL-ISOMERASE, ROTAMASE"
5603
-
5604
5154
  ## Sugars and glyco-patterns
5605
5155
 
5606
5156
  I am currently having to do an assignment related to glyco-patterns
@@ -5764,6 +5314,9 @@ like this:
5764
5314
 
5765
5315
  <img src="https://i.imgur.com/vr2kEBz.png" style="margin: 1em; margin-left: 3em">
5766
5316
 
5317
+ As of <b>July 2022</b> invalid amino acids will be automatically
5318
+ filtered away before being assigned to the input.
5319
+
5767
5320
  ## Colourizing hydrophilic and hydrophobic aminoacids on the commandline
5768
5321
 
5769
5322
  Via class **Bioroebe::ColourizeHydrophilicAndHydrophobicAminoacids** you
@@ -5777,35 +5330,36 @@ Example output for this:
5777
5330
 
5778
5331
  This subsection contains some information about proteases.
5779
5332
 
5780
- trypsin:
5333
+ Trypsin:
5781
5334
  https://en.wikipedia.org/wiki/Trypsin
5782
- cuts at: Trypsin cuts peptide chains mainly at the carboxyl
5335
+ <b>cuts at</b>: Trypsin cuts peptide chains mainly at the carboxyl
5783
5336
  side of the amino acids lysine or arginine.
5784
5337
 
5785
- chymotrypsin:
5338
+ Chymotrypsin:
5786
5339
  https://en.wikipedia.org/wiki/Chymotrypsin
5787
- cuts at: Chymotrypsin preferentially cleaves peptide amide
5340
+ <b>cuts at</b>: Chymotrypsin preferentially cleaves peptide amide
5788
5341
  bonds where the side chain of the amino acid N-terminal
5789
- to the scissile amide bond is a large hydrophobic amino
5790
- acid (tyrosine, tryptophan, and phenylalanine).
5342
+ to the scissile amide bond is <b>a large hydrophobic amino</b>
5343
+ acid (specifically: tyrosine, tryptophan, and phenylalanine).
5344
+ Chymotrypsin will cleave proteins on the <b>carboxyl side</b>
5345
+ of aromatic or large hydrophobic amino acids.
5791
5346
 
5792
- thrombin:
5347
+ Thrombin:
5793
5348
  https://en.wikipedia.org/wiki/Thrombin
5794
- cuts at: Thrombin acts as a serine protease that converts
5349
+ <b>cuts at</b>: Thrombin acts as a serine protease that converts
5795
5350
  soluble fibrinogen into insoluble strands of fibrin. It
5796
5351
  catalyzes the hydrolysis of <b>Arg-Gly</b> bonds in
5797
5352
  particular peptide sequences only.
5798
5353
 
5799
- plasmin:
5354
+ Plasmin:
5800
5355
  https://en.wikipedia.org/wiki/Plasmin
5801
- cuts at: Plasmin is a serine protease.
5356
+ <b>cuts at</b>: Plasmin is a serine protease.
5802
5357
 
5803
- papain:
5358
+ Papain:
5804
5359
  https://en.wikipedia.org/wiki/Papain
5805
- cuts at: Papain prefers to cleave after an
5806
- arginine or lysine preceded by a hydrophobic
5807
- unit (Ala, Val, Leu, Ile, Phe, Trp, Tyr) and
5808
- not followed by a valine.
5360
+ <b>cuts at</b>: Papain prefers to cleave after an arginine or
5361
+ lysine preceded by a hydrophobic unit (Ala, Val, Leu, Ile,
5362
+ Phe, Trp, Tyr) and not followed by a valine.
5809
5363
 
5810
5364
  factor Xa:
5811
5365
 
@@ -5817,8 +5371,8 @@ Some proteins may permanently reside in the lumen of the
5817
5371
  Often such proteins will have a special signal sequence attached
5818
5372
  to their **C-terminal part**, such as **KDEL** (Lys-Asp-Glu-Leu).
5819
5373
 
5820
- KDEL is not the only signal that may be used, though. Some species
5821
- may use different signals, such as:
5374
+ <b>KDEL</b> is not the only signal that may be used, though. Some
5375
+ species may use different signals, such as:
5822
5376
 
5823
5377
  aminoacids | species
5824
5378
  -------------|------------------------------------------------------------
@@ -5828,8 +5382,9 @@ may use different signals, such as:
5828
5382
  ADEL | Schizosaccharomyces pombe (fission yeast)
5829
5383
  SDEL | Plasmodium falciparum
5830
5384
 
5831
- If you work with the bioshell then you can simply use this method
5832
- to query whether the given aminoacid sequence has a KDEL sequence:
5385
+ If you work with the <b>bioshell</b> then you can simply use this
5386
+ method to query whether the given aminoacid sequence has a KDEL
5387
+ sequence:
5833
5388
 
5834
5389
  KDEL?
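The check behind such a query can be sketched in plain Ruby. This is only an illustration: the helper name `er_retention_signal` and the constant are hypothetical and not part of the bioroebe API, and the signal list is limited to KDEL plus the two variants visible in the table rows above.

```ruby
# Hypothetical helper, NOT the bioroebe implementation.
# Signals: KDEL plus the ADEL and SDEL rows shown in the table above.
ER_RETENTION_SIGNALS = %w[KDEL ADEL SDEL].freeze

# Returns the matching C-terminal signal as a String, or nil if none matches.
def er_retention_signal(sequence)
  tail = sequence.to_s.strip.upcase
  ER_RETENTION_SIGNALS.find { |signal| tail.end_with?(signal) }
end

p er_retention_signal('MAGLLKDEL')  # signal sits at the C-terminus
p er_retention_signal('MAGLLKDELA') # signal is internal, so no match
```

Only the last four residues matter here, since the text describes the signal as being attached to the C-terminal part.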
5835
5390
 
@@ -6240,8 +5795,6 @@ Next, do something such as this:
6240
5795
  This will show the distribution of the oligos.
6241
5796
 
6242
5797
  ## Number of chromosomes in different species
6243
- ![alt text][cat1]
6244
- [cat1]: https://i.imgur.com/Qmd7R0p.png
6245
5798
 
6246
5799
  Name of the organism | Latin name | Number of chromosomes
6247
5800
  ---------------------|--------------|-----------------------
@@ -6319,112 +5872,6 @@ So this is what would be returned:
6319
5872
 
6320
5873
  Bioroebe::DetectMinimalCodon[["TTT", "TTC"]] # => ["TTY"]
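How such a collapse can work is sketched below in plain Ruby, assuming a position-wise merge into IUPAC ambiguity codes. The method name `collapse_codons` is hypothetical, it returns a single String rather than an Array, and it is not claimed to be how Bioroebe::DetectMinimalCodon is implemented.

```ruby
# IUPAC nucleotide ambiguity codes, keyed by the sorted set of bases.
IUPAC_CODES = {
  'A' => 'A', 'C' => 'C', 'G' => 'G', 'T' => 'T',
  'AG' => 'R', 'CT' => 'Y', 'CG' => 'S', 'AT' => 'W',
  'GT' => 'K', 'AC' => 'M',
  'CGT' => 'B', 'AGT' => 'D', 'ACT' => 'H', 'ACG' => 'V',
  'ACGT' => 'N'
}.freeze

# Hypothetical helper: merges equally long codons position by position.
def collapse_codons(codons)
  columns = codons.map { |codon| codon.upcase.chars }.transpose
  columns.map { |bases| IUPAC_CODES.fetch(bases.uniq.sort.join) }.join
end

puts collapse_codons(%w[TTT TTC])
puts collapse_codons(%w[GGA GGC GGG GGT])
```

A position-wise merge is only exact when the input covers every combination of the merged positions; for a set such as the six leucine codons it would describe more codons than were passed in.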
6321
5874
 
6322
- ## Codon Usage
6323
-
6324
- This **paragraph** deals with some aspects of **codon usage** in different
6325
- organisms.
6326
-
6327
- Let us first define the term <b>codon usage</b>. In order to do so,
6328
- we also have to define what a <b>codon</b> is, so let's start with that.
6329
-
6330
- A <span style="color: darkgreen; font-weight: bold">codon</span> is
6331
- essentially the basic code used in DNA to denote which particular
6332
- **aminoacid** corresponds to these (three) nucleotide base pairs.
6333
- A codon is thus **a series of three nucleotides**, also called
6334
- a <b>triplet</b>.
6335
-
6336
- When we use the term <b>base pairs</b>, we refer to **double-stranded DNA**,
6337
- abbreviated as <b>dsDNA</b>. The codon is, however, only found
6338
- in a single stranded molecule, even within dsDNA. Since some parts of
6339
- a **dsDNA** in any given genome gives rise to a, more or less, complementary
6340
- copy into **mRNA**, the codons that are actually used, are found in the
6341
- corresponding mRNA. (Remember that mRNA differs from DNA in that there
6342
- will be Uracil rather than Thymine; otherwise it is the same, sequence-wise.
6343
- Of course it uses another sugar (Ribose), but remember we are here mostly
6344
- interested in the **information-containing part**, not the full chemical
6345
- structure.)
6346
-
6347
- The codon is thus found on the mRNA and since mRNA is mostly
6348
- single-stranded, the codon is a component of the mRNA. It is
6349
- where the two subunits of the ribosome are assembled (or more
6350
- accurately, the smaller subunit scans along the mRNA until it
6351
- detects a start codon). Mind you, this subsection will not go into
6352
- all relevant details, so just keep in mind that the codon is the
6353
- part that will eventually be "translated" at the ribosome into
6354
- a corresponding aminoacid, excluding stop codons at the end.
6355
-
6356
- Now - different organisms use **different frequencies of codons**.
6357
- **Codon usage** thus describes the fact that many proteins in
6358
- these different organisms make use of certain codons with a
6359
- **substantially higher frequency than other codons**. We can
6360
- use statistics to infer this on a global (proteome) level
6361
- too.
6362
-
6363
- Remember that the genetic code is **degenerate**, meaning that
6364
- you have a few aminoacids that are encoded only by one codon
6365
- (<b>Tryptophan</b> and <b>Methionine</b>), whereas the other
6366
- aminoacids are encoded by more than one codon - thus, at the
6367
- very least two codons. Note that the latter codons, if they
6368
- code for the **same** aminoacid, are also called <b>synonymous
6369
- codons</b>.
6370
-
6371
- This means that if you have any given aminoacid chain, you can have
6372
- several different sequences (and codons in these sequences, which
6373
- ultimately means that you can have different DNA sequences code for
6374
- the very same aminoacid chain).
6375
-
6376
- Usually the third base of a codon has the least influence on
6377
- codon meaning. This is also called <b>wobbling</b> - since
6378
- the anticodon loop on the tRNA is in the reverse direction,
6379
- and the wobble position refers to the tRNA, this means that
6380
- the wobble-position is at the 5'-end of the tRNA anticodon.
6381
-
6382
- Now a few words about functionality related to codons and codon
6383
- usage in the Bioroebe project.
6384
-
6385
- Say that you have a long DNA sequence; let's pick a sample
6386
- for now, such as:
6387
-
6388
- ATGGGCGGGGTGATGGCAATGATGCCCCCGATGATG
6389
-
6390
- You can analyze the codons used via class **ShowCodonUsage**:
6391
-
6392
- show_codon_usage ATGGGCGGGGTGATGGCAATGATGCCCCCGATGATG
6393
-
6394
- This class can be found at <b>bioroebe/codons/show_codon_usage.rb</b>.
6395
- It will report the top 5 codons in use and also output the
6396
- frequency hash on the commandline.
6397
-
6398
- You can use this from ruby too, via this toplevel method:
6399
-
6400
- Bioroebe.codon_frequencies_of_this_sequence(ARGV)
6401
-
6402
- If you want to look at the actual codon frequencies used
6403
- by different organisms, have a look here:
6404
-
6405
- http://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=11076&aa=9&style=N
6406
-
6407
- This is an excellent resource.
6408
-
6409
- ## Determining the frequencies of aminoacids in a given aminoacid (protein) sequence
6410
-
6411
- If you quickly wish to determine the aminoacid composition, as a
6412
- Hash, you can use **bin/aminoacid_frequencies**.
6413
-
6414
- Example from the commandline for this:
6415
-
6416
- aminoacid_frequencies MVTDEGAIYFTKDAARNWKAAVEETVSATLNRTVSSGITGASYYTGTFST
6417
-
6418
- Example from within bioroebe itself (and thus ruby):
6419
-
6420
- require 'bioroebe/frequencies.rb'
6421
-
6422
- Bioroebe.aminoacid_frequencies('MVTDEGAIYFTKDAARNWKAAVEETVSATLNRTVSSGITGASYYTGTFST')
6423
-
6424
- The latter will return a Hash that you can then make further use of, such as:
6425
-
6426
- {"M"=>1, "V"=>4, "T"=>9, "D"=>2, "E"=>3, "G"=>4, "A"=>7, "I"=>2, "Y"=>3, "F"=>2, "K"=>2, "R"=>2, "N"=>2, "W"=>1, "S"=>5, "L"=>1}
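A Hash of this shape can also be produced with nothing but the Ruby standard library (Ruby 2.7 or newer for `tally`). The helper name `aminoacid_tally` is hypothetical; this is a sketch of the counting step, not the code behind `Bioroebe.aminoacid_frequencies`.

```ruby
# Hypothetical helper: counts each one-letter aminoacid code.
# Keys appear in order of first occurrence, as in the Hash shown above.
def aminoacid_tally(sequence)
  sequence.upcase.delete('^A-Z').chars.tally
end

sequence = 'MVTDEGAIYFTKDAARNWKAAVEETVSATLNRTVSSGITGASYYTGTFST'
p aminoacid_tally(sequence)
```

Dividing each count by the sequence length would turn the raw counts into relative frequencies, if that is what is needed downstream.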
6427
-
6428
5875
  ## The Levenshtein distance
6429
5876
 
6430
5877
  The <b>Levenshtein distance</b> - also called a '**string metric**' - was formulated
@@ -6842,6 +6289,34 @@ change A: teal or C: slateblue to some other colour; these are HTML
6842
6289
  colours, so it is recommended to use the names of these HTML
6843
6290
  colours).
6844
6291
 
6292
+ In <b>July 2022</b> the method <b>Bioroebe.colourize_this_fasta_sequence</b>
6293
+ was extended slightly. You can now attach a "ruler" to the output, that
6294
+ is a numbered series that shows the nucleotide position, on the commandline.
6295
+
6296
+ Example for this:
6297
+
6298
+ puts Bioroebe.colourize_this_fasta_sequence('ATGAAATCGCGCGTGCCGCGCGCGC'\
6299
+ 'GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCTGCGCGCGCGCGCGCGCGCGCG'\
6300
+ 'TGCCGCGCGCAGGCGGCGGCGGCGGCGGCGGCG'
6301
+ ) { :with_ruler }
6302
+
6303
+ By default this will use a white colour on black background. If you want to
6304
+ modify the foreground colour you can pass the colour name to the method,
6305
+ such as via:
6306
+
6307
+ puts Bioroebe.colourize_this_fasta_sequence('ATGAAATCGCGCGTGCCGCGCGCGC'\
6308
+ 'GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCTGCGCGCGCGCGCGCGCGCGCG'\
6309
+ 'TGCCGCGCGCAGGCGGCGGCGGCGGCGGCGGCG'
6310
+ ) { :with_ruler_steelblue_colour }
6311
+
6312
+ The following image shows how this can be used on the commandline:
6313
+
6314
+ <img src="https://i.imgur.com/ucVEVnK.png" style="margin: 1em; border: 3px solid black">
6315
+
6316
+ At a later time this may be extended to allow for use in a webpage,
6317
+ that is to embed these strings directly into HTML or .php or
6318
+ .cgi.
6319
+
6845
6320
  If you wish to show a **chunked display** of the dataset (nucleotides
6846
6321
  normally) then you can use the following API:
6847
6322
 
@@ -7365,16 +6840,6 @@ This would notify the bioshell that only nucleotides from position
7365
6840
  51 to (including) position 3251 will be colourized, when doing another
7366
6841
  "ORF?" invocation.
7367
6842
 
7368
- ## Longest substring
7369
-
7370
- Within the Bioroebe::Shell you can determine the longest substring,
7371
- including gaps, like so:
7372
-
7373
- longest_substring? ATTATTGTT | ATTATTCTT
7374
-
7375
- Note that this will make use of the diff-lcs gem, which uses
7376
- the McIlroy-Hunt algorithm.
7377
-
7378
6843
  ## Restriction Enzymes
7379
6844
 
7380
6845
  This **subsection** will eventually be expanded to explain various things about
@@ -8733,6 +8198,22 @@ The images that can be generated via this may look as follows:
8733
8198
 
8734
8199
  <img src="https://i.imgur.com/fWwD1fj.png" style="margin: 1em; margin-left: 2em">
8735
8200
 
8201
+ Let's look at another example.
8202
+
8203
+ Say you input the following sequences there:
8204
+
8205
+ AGVV
8206
+ AGVV
8207
+ AGVV
8208
+ AGVV
8209
+ AGGV
8210
+ AGGV
8211
+ AGGV
8212
+
8213
+ The resulting image that is generated is:
8214
+
8215
+ <img src="https://i.imgur.com/3wWApIQ.png" style="margin: 1em; margin-left: 2em">
8216
+
8736
8217
  ## The Kozak Sequence
8737
8218
 
8738
8219
  The ribosome usually scans for a **AUG** codon. But there are
@@ -8872,85 +8353,6 @@ Usage Example:
8872
8353
 
8873
8354
  pfasta insulin_mRNA.fasta --toprotein
8874
8355
 
8875
- ## Determining the codon frequencies from the commandline
8876
-
8877
- In April 2022 I noticed that one use case is to show the codon
8878
- frequencies of a given sequence - typically a nucleotide sequence.
8879
-
8880
- For aminoacids there already was an executable, at **bin/aminoacid_frequencies**.
8881
- So, following that logic, a new executable was added at
8882
- **bin/codon_frequency**. This will show the Hash of the codon
8883
- frequencies, as a String, on the commandline.
8884
-
8885
- Usage example:
8886
-
8887
- codon_frequency ATTCGTACGATCGACTGACTGACAGTCATTCGT
8888
-
8889
- The output of this would be the following:
8890
-
8891
- AUU: 2
8892
- CGU: 2
8893
- ACG: 1
8894
- AUC: 1
8895
- GAC: 1
8896
- UGA: 1
8897
- CUG: 1
8898
- ACA: 1
8899
- GUC: 1
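The counting itself can be reproduced in a few lines of plain Ruby. The sketch below assumes the sequence is read in frame 1, that T is shown as U, and that an incomplete trailing codon is ignored; `codon_counts` is a hypothetical name, not the method used by bin/codon_frequency.

```ruby
# Hypothetical helper: counts codons in reading frame 1.
def codon_counts(dna)
  rna = dna.upcase.tr('T', 'U')
  rna.scan(/.{3}/).tally # scan drops an incomplete trailing codon
end

codon_counts('ATTCGTACGATCGACTGACTGACAGTCATTCGT').each do |codon, count|
  puts "#{codon}: #{count}"
end
```

For the 33-nucleotide example this yields eleven codons, with AUU and CGU counted twice and the remaining seven codons once each.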
8900
-
8901
- ## Showing the codon frequency via countcodon
8902
-
8903
- https://www.kazusa.or.jp/codon/countcodon.html offers a rather useful
8904
- functionality via a simple web-interface, in that you can pass in a mRNA
8905
- sequence, and it will then show the codon frequency/likelihood of that
8906
- sequence - all codons in that sequence, that is. This can be extended
8907
- to all protein-coding genes in a given genome, and will thus be useful
8908
- for a researcher who may be interested in determining the codon frequency
8909
- in general, across all genes in that given genome.
8910
-
8911
- You can test it with an input sequence. For instance, the following
8912
- sequence:
8913
-
8914
- ATTCGTACGATCGACTGACTGACAGTCATTCGTAGTACGATCGACTGACTGACAGTCATTCGTACGATCGACTGACTGACAAGTCATTCGTACGATCGACTGACTTGACAGTCATAA
8915
-
8916
- Would yield this result:
8917
-
8918
- fields: [triplet] [frequency: per thousand] ([number])
8919
-
8920
- UUU 0.0( 0) UCU 0.0( 0) UAU 0.0( 0) UGU 0.0( 0)
8921
- UUC 0.0( 0) UCC 0.0( 0) UAC 25.6( 1) UGC 0.0( 0)
8922
- UUA 0.0( 0) UCA 25.6( 1) UAA 25.6( 1) UGA102.6( 4)
8923
- UUG 0.0( 0) UCG 25.6( 1) UAG 0.0( 0) UGG 0.0( 0)
8924
-
8925
- CUU 0.0( 0) CCU 0.0( 0) CAU 25.6( 1) CGU 76.9( 3)
8926
- CUC 0.0( 0) CCC 0.0( 0) CAC 0.0( 0) CGC 0.0( 0)
8927
- CUA 0.0( 0) CCA 0.0( 0) CAA 0.0( 0) CGA 25.6( 1)
8928
- CUG102.6( 4) CCG 0.0( 0) CAG 25.6( 1) CGG 0.0( 0)
8929
-
8930
- AUU 76.9( 3) ACU 25.6( 1) AAU 0.0( 0) AGU 51.3( 2)
8931
- AUC 76.9( 3) ACC 0.0( 0) AAC 0.0( 0) AGC 0.0( 0)
8932
- AUA 0.0( 0) ACA 76.9( 3) AAA 0.0( 0) AGA 0.0( 0)
8933
- AUG 0.0( 0) ACG 76.9( 3) AAG 0.0( 0) AGG 0.0( 0)
8934
-
8935
- GUU 0.0( 0) GCU 0.0( 0) GAU 25.6( 1) GGU 0.0( 0)
8936
- GUC 51.3( 2) GCC 0.0( 0) GAC 76.9( 3) GGC 0.0( 0)
8937
- GUA 0.0( 0) GCA 0.0( 0) GAA 0.0( 0) GGA 0.0( 0)
8938
- GUG 0.0( 0) GCG 0.0( 0) GAG 0.0( 0) GGG 0.0( 0)
8939
-
8940
- At any rate, the individual functionality for that is also available
8941
- within the Bioroebe project since as of **April 2022**.
8942
-
8943
- The method that does so is:
8944
-
8945
- Bioroebe.frequency_per_thousand
8946
- Bioroebe.frequency_per_thousand('ATTCGTACGATCGACTGACTGACAGTCATTCGTAGTACGATCGACTGACTGACAGTCATTCGTACGATCGACTGACTGACAAGTCATTCGTACGATCGACTGACTTGACAGTCATAA') # Usage example here.
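The "per thousand" value is simply the codon count scaled by 1000 and divided by the total number of codons. With the 39 codons of the sequence above, a codon seen once gives 1000 / 39 ≈ 25.6, and one seen four times gives ≈ 102.6, which agrees with the table. A minimal sketch, using the hypothetical name `per_thousand_sketch` and not the actual body of `Bioroebe.frequency_per_thousand`:

```ruby
# Hypothetical helper: codon frequency per thousand codons, frame 1 only.
# Codons that do not occur are simply absent from the returned Hash.
def per_thousand_sketch(dna)
  codons = dna.upcase.tr('T', 'U').scan(/.{3}/)
  total  = codons.size.to_f
  codons.tally.transform_values { |count| (count * 1000 / total).round(1) }
end

per_thousand_sketch('ATTCGTACGATCGACTGACTGACAGTCATTCGT').each do |codon, freq|
  puts format('%s %6.1f', codon, freq)
end
```

Unlike the countcodon layout, this does not list the codons with a count of zero; filling those in would require iterating over all 64 triplets.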
8947
-
8948
- At a later time sinatra-bindings as well as ruby-gtk3 bindings will
8949
- be added, and possibly ruby-libui bindings as well, for windows
8950
- support. What is missing is support for different codon tables in
8951
- different species, but that may be added at a later time as well
8952
- - for now it seemed more important to offer the functionality.
8953
-
8954
8356
  ## class Bioroebe::Protein
8955
8357
 
8956
8358
  **class Bioroebe::Protein** can be used to store a protein sequence.
@@ -9183,6 +8585,1036 @@ time being it is what it is. At a later point in time test cases
9183
8585
  may be added to check whether it performs correctly or whether it
9184
8586
  does not.
9185
8587
 
8588
+ The other rules, also published in 2004, are the Reynolds rules. Code
8589
+ support was added to the Bioroebe project in <b>June 2022</b>, but
8590
+ it was not tested yet, so the implementation may be incorrect.
8591
+
8592
+ ## The Bioroebe::Shell interface
8593
+
8594
+ The following subsection specifically handles information
8595
+ pertaining to the <b>Bioroebe::Shell</b> interface of the
8596
+ <b>bioroebe project</b>. It is also called <b>bioshell</b>,
8597
+ to simplify spelling it.
8598
+
8599
+ ### Numbers as input in the bioshell
8600
+ ![alt text][cat1]
8601
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8602
+
8603
+ You can input a number in the **BioShell** such as <b style="color: darkblue">3</b>.
8604
+
8605
+ This will attempt to <b>display the first 3 nucleotides</b> of
8606
+ the assigned **main sequence**. It will only work if you have
8607
+ assigned a sequence prior to that, though.
8608
+
8609
+ Examples:
8610
+
8611
+ 3
8612
+ 33
8613
+ 15
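In plain Ruby the same idea is a simple slice of the sequence string. The helper name `first_nucleotides` is hypothetical and only illustrates what a numeric input is described as doing; it is not taken from the shell's source.

```ruby
# Hypothetical helper: returns the first n nucleotides of a sequence.
# If n exceeds the sequence length, the whole sequence is returned.
def first_nucleotides(sequence, n)
  sequence[0, n].to_s
end

main_sequence = 'ATGAAATCGCGCGTGCCG'
puts first_nucleotides(main_sequence, 3)
puts first_nucleotides(main_sequence, 33)
```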
8614
+
8615
+ ### transeq
8616
+ ![alt text][cat1]
8617
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8618
+
8619
+ You can convert a DNA sequence into an aminoacid sequence by
8620
+ doing this:
8621
+
8622
+ transeq
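What such a translation does can be sketched in plain Ruby. The sketch assumes the standard genetic code and reading frame 1 only; whether the bioshell's transeq uses exactly these defaults is an assumption, and `translate_sketch` is a hypothetical name.

```ruby
# Standard genetic code, built from the conventional TCAG ordering.
BASES       = 'TCAG'.chars
AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'.chars
CODON_TABLE = BASES.product(BASES, BASES).map(&:join).zip(AMINO_ACIDS).to_h.freeze

# Hypothetical helper: translates frame 1; '*' marks a stop codon and
# 'X' marks a codon that contains a non-ACGT symbol.
def translate_sketch(dna)
  dna.upcase.tr('U', 'T').scan(/.{3}/)
     .map { |codon| CODON_TABLE.fetch(codon, 'X') }
     .join
end

puts translate_sketch('ATGAAATCG')
puts translate_sketch('ATGGGCTGA')
```

Trailing nucleotides that do not form a complete codon are ignored by the `scan` call.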
8623
+
8624
+ ### Shuffling the DNA/RNA string in the bioshell
8625
+ ![alt text][cat1]
8626
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8627
+
8628
+ Via
8629
+
8630
+ shuffle
8631
+
8632
+ you can <b>randomly rearrange the main DNA/RNA string</b>
8633
+ that is used by the <b>Bioroebe::Shell</b>.
8634
+
8635
+ This can be useful if you just wish to quickly "test"
8636
+ new arrangements of the same nucleotides.
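The idea behind shuffling can be illustrated in a few lines of plain Ruby. This is a sketch under the assumption that shuffling simply permutes the letters; `shuffle_sequence` is a made-up name and not part of the bioshell.

```ruby
# Hypothetical helper: return a randomly rearranged copy of a sequence.
def shuffle_sequence(sequence, random: Random.new)
  sequence.chars.shuffle(random: random).join
end

original = 'ATGCCGTTAGGA'
shuffled = shuffle_sequence(original)
puts shuffled
# The order changes, but the nucleotide composition stays the same:
puts original.chars.sort == shuffled.chars.sort # true
```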
8637
+
8638
+ ### Permanently disabling showing the startup-introduction of the Bioshell
8639
+ ![alt text][cat1]
8640
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8641
+
8642
+ If you do not want to see the start-up intro, you can try
8643
+ any of the following:
8644
+
8645
+ bioshell --permanently-disable-startup-intro
8646
+ bioshell --permanently-disable-startup-notice
8647
+ bioshell --permanently-no-startup-intro
8648
+ bioshell --permanently-no-startup-info
8649
+
8650
+ ### Longest substring
8651
+ ![alt text][cat1]
8652
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8653
+
8654
+ Within the Bioroebe::Shell you can determine the longest substring,
8655
+ including gaps, like so:
8656
+
8657
+ longest_substring? ATTATTGTT | ATTATTCTT
8658
+
8659
+ Note that this will make use of the diff-lcs gem, which uses
8660
+ the McIlroy-Hunt algorithm.
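To make "longest substring, including gaps" more concrete: this is the longest common subsequence of the two inputs. The following sketch computes it with a simple dynamic-programming table. It is not the McIlroy-Hunt algorithm and not the diff-lcs code, and the method name `longest_common_subsequence` is chosen only for this illustration.

```ruby
# Hypothetical helper: longest common subsequence of two strings,
# computed with a plain dynamic-programming table (quadratic time).
def longest_common_subsequence(a, b)
  # table[i][j] holds the LCS length of a[0, i] and b[0, j].
  table = Array.new(a.size + 1) { Array.new(b.size + 1, 0) }
  a.each_char.with_index(1) do |char_a, i|
    b.each_char.with_index(1) do |char_b, j|
      table[i][j] =
        if char_a == char_b
          table[i - 1][j - 1] + 1
        else
          [table[i - 1][j], table[i][j - 1]].max
        end
    end
  end

  # Walk back through the table to recover one longest subsequence.
  result = []
  i = a.size
  j = b.size
  while i > 0 && j > 0
    if a[i - 1] == b[j - 1]
      result << a[i - 1]
      i -= 1
      j -= 1
    elsif table[i - 1][j] >= table[i][j - 1]
      i -= 1
    else
      j -= 1
    end
  end
  result.reverse.join
end

puts longest_common_subsequence('ATTATTGTT', 'ATTATTCTT') # ATTATTTT
```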
8661
+
8662
+ ### Do not create directories on startup of the shell
8663
+ ![alt text][cat1]
8664
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8665
+
8666
+ By default the <b>bioshell</b> will try to create some directories
8667
+ on startup. This may not always be desired by the user, though,
8668
+ so an option has to exist to <b>disable</b> this functionality.
8669
+
8670
+ Internally the variable @internal_hash[:create_directories_on_startup_of_the_shell]
8671
+ keeps track of whether directories on startup of the shell will
8672
+ be created.
8673
+
8674
+ To disable this behaviour on startup of the bioshell, try
8675
+ something like this:
8676
+
8677
+ bioshell --do-not-create-directories-on-startup
8678
+ bioshell --do-not-create-directories
8679
+
8680
+ ### Generating and assigning a random amount of nucleotides
8681
+ ![alt text][cat1]
8682
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8683
+
8684
+ Via:
8685
+
8686
+ random 555
8687
+
8688
+ you can "generate" 555 random nucleotides (DNA that is) and
8689
+ assign them to the main sequence in use by the bioshell. This
8690
+ is mostly a convenience feature, if you want to debug something
8691
+ quickly.
8692
+
8693
+ ### Determining the log directory for the Bioroebe::Shell component
8694
+ ![alt text][cat1]
8695
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8696
+
8697
+ Via:
8698
+
8699
+ bioshell_log_dir?
8700
+
8701
+ you can determine the log-directory output for the bioshell
8702
+ component. On my home system this will default to
8703
+ <b>/home/Temp/bioroebe/bioshell/</b>.
8704
+
8705
+ ### Prompt (the shell prompt of the bioshell)
8706
+ ![alt text][cat1]
8707
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8708
+
8709
+ You can set a <b>custom prompt</b> in the bioshell, via
8710
+ the keywords "<b>prompt</b>" or "<b>set_prompt</b>".
8711
+
8712
+ To display the <b>current working directory</b>, do:
8713
+
8714
+ prompt pwd
8715
+
8716
+ To revert to the old default again, do this:
8717
+
8718
+ prompt REVERT
8719
+ prompt revert
8720
+ prompt DEFAULT
8721
+ prompt default
8722
+
8723
+ If you do not want to set any prompt, do:
8724
+
8725
+ prompt none
8726
+
8727
+ ### Random stuff - generating random DNA sequences in the bioshell
8728
+ ![alt text][cat1]
8729
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8730
+
8731
+ You can <b>generate random DNA sequences</b> in the
8732
+ <b>bioshell</b> via:
8733
+
8734
+ random dna 20
8735
+ random dna 25
8736
+ random dna 30
8737
+ # or simpler
8738
+ random 20
8739
+ random 25
8740
+ random 30
8741
+
8742
+ This will generate random DNA sequences, with a length
8743
+ of 20, 25, 30, respectively. This may not be very useful
8744
+ but it was important that this functionality is made
8745
+ available somewhere. Sometimes you may not even care
8746
+ about the sequence and just need a "filler" sequence,
8747
+ so randomness has to be part of the Bioroebe project
8748
+ as well.
8749
+
8750
+ You can also use some toplevel-methods to generate, e. g.
8751
+ 20 random aminoacids. Have a look at the following
8752
+ <b>toplevel API</b>:
8753
+
8754
+ Bioroebe.random_aminoacid? 20 # => "UAVHYQQESWUYAOVESEIY"
8755
+
8756
+ Note that there may exist other APIs within the Bioroebe project
8757
+ that do the same as well.
8758
+
8759
+ If you would like to use a ruby-gtk3 widget have a look
8760
+ at **RandomSequence**, under **bioroebe/gtk3/random_sequence/**.
8761
+ It works with aminoacids, DNA and RNA, and allows the user to
8762
+ create random sequences. (If you need weighted randomness then
8763
+ you currently have to use the commandline variant. Perhaps I may
8764
+ add support into the GUI directly for this one day.)
8765
+
8766
+ ### Deprecations within the Bioroebe::Shell
8767
+ ![alt text][cat1]
8768
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8769
+
8770
+ Over the years the Bioroebe::Shell changed quite a bit.
8771
+
8772
+ This subsection here will list a few of these changes
8773
+ or rather, the deprecations.
8774
+
8775
+ **raw_sequence**: removed in June 2022 completely. It is
8776
+ simpler to handle sequences via Bioroebe::Sequence
8777
+ instead.
8778
+
8779
+ <b>@internal_hash[:array_sequences]</b> was no longer in
8780
+ use, so it was removed in July 2022.
8781
+
8782
+ ### Chop off nucleotides within the Bioroebe::Shell
8783
+ ![alt text][cat1]
8784
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8785
+
8786
+ You can use the following syntax to chop away until you find
8787
+ a particular substring, in the bioshell:
8788
+
8789
+ chop_to ATG
8790
+
8791
+ This functionality was specifically added to find the first
8792
+ ATG codon.
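The chopping logic can be sketched in plain Ruby as follows. This is an illustration only: the helper below merely shares its name with the bioshell command, and returning the sequence unchanged when the substring is absent is an assumption, not documented behaviour.

```ruby
# Hypothetical helper: drop everything before the first occurrence of
# `target`; return the sequence unchanged if `target` does not occur.
def chop_to(sequence, target = 'ATG')
  position = sequence.index(target)
  position ? sequence[position..-1] : sequence
end

puts chop_to('CCGTAATGGCATGA') # ATGGCATGA
```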
8793
+
8794
+ ### Truncating output in the bioroebe-shell
8795
+ ![alt text][cat1]
8796
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8797
+
8798
+ **DNA/RNA sequences** can become very long and then become
8799
+ quite difficult to view, read and handle on the commandline.
8800
+
8801
+ Normally the bioroebe shell will truncate output of DNA sequences
8802
+ that are "too long". This is mostly done so that working with
8803
+ very long sequences becomes a bit more convenient.
8804
+
8805
+ Sometimes this can become an antifeature, though, so the user
8806
+ must be able to toggle this at his or her own discretion.
8807
+
8808
+ By default, the bioroebe-shell (bioshell) will always try
8809
+ to truncate output, but you can toggle this behaviour by
8810
+ issuing:
8811
+
8812
+ do not truncate
8813
+
8814
+ In theory, other "do not" actions are also supported, or will
8815
+ be supported in the future; right now (Oct 2019) this is a bit
8816
+ limited.
8817
+
8818
+ From the toplevel, you can use this method:
8819
+
8820
+ Bioroebe.do_not_truncate
8821
+
8822
+ The above instruction will set the truncate behaviour
8823
+ to not truncate, ever.
8824
+
8825
+ If you need to do so within the bioshell, this is the way:
8826
+
8827
+ no_truncate
8828
+
8829
+ Or simply
8830
+
8831
+ truncate
8832
+
8833
+ This will toggle, like a switch.
8834
+
8835
+ ### Working with .pdb files in the bioshell
8836
+ ![alt text][cat1]
8837
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8838
+
8839
+ This subsection only very briefly mentions how to work with
8840
+ .pdb files in the bioshell. See other parts of this
8841
+ document for a more extensive overview how you can work
8842
+ with .pdb files via the Bioroebe project.
8843
+
8844
+ If your input ends with .pdb, such as:
8845
+
8846
+ 1fat.pdb
8847
+
8848
+ And if no such file currently exists at
8849
+ /home/Temp/bioroebe/pdb/1fat.pdb then it will be
8850
+ downloaded and moved to
8851
+ **/home/Temp/bioroebe/pdb/**.
8852
+
8853
+ This feature exists just to simplify using the
8854
+ **bioshell**.
8855
+
8856
+ ### Showing the stop codons in frame1, frame2 and frame3 in the bioshell
8857
+ ![alt text][cat1]
8858
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8859
+
8860
+ When you have a given sequence assigned to the bioshell, such
8861
+ as via "random 99", you can then show all stop codons in
8862
+ frame1, frame2 and frame3.
8863
+
8864
+ The corresponding input for this will be:
8865
+
8866
+ stop_frame1?
8867
+ stop_frame2?
8868
+ stop_frame3?
8869
+
8870
+ An image shows this next, where we first entered "random 120",
8871
+ before issuing the above-mentioned instructions one after
8872
+ the other:
8873
+
8874
+ <img src="https://i.imgur.com/HpHF4jq.png" style="margin: 1em; border: 1px solid black">
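How stop codons can be located per reading frame is sketched below in plain Ruby. This is a minimal sketch, assuming the three standard stop codons (TAA, TAG, TGA) and the forward strand only; `stop_codon_positions` is a made-up name and not what `stop_frame1?` calls internally.

```ruby
# Hypothetical helper: return the 0-based start positions of all stop codons
# (TAA, TAG, TGA) in the given forward reading frame (1, 2 or 3).
def stop_codon_positions(dna, frame)
  stop_codons = %w[TAA TAG TGA]
  offset = frame - 1
  (offset..dna.size - 3).step(3).select do |position|
    stop_codons.include?(dna[position, 3])
  end
end

dna = 'ATGTAACTGATAGCC'
(1..3).each do |frame|
  puts "frame #{frame}: #{stop_codon_positions(dna, frame).inspect}"
end
# frame 1: [3]
# frame 2: [7, 10]
# frame 3: []
```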
8875
+
8876
+ ### Freezing the main sequence in the bioshell - and unfreezing it again
8877
+ ![alt text][cat1]
8878
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8879
+
8880
+ You can **freeze** the BioShell, meaning that it will no longer
8881
+ allow for the main sequence to be modified, via the following
8882
+ command:
8883
+
8884
+ freeze
8885
+
8886
+ To <b>unfreeze</b> the sequence again, issue:
8887
+
8888
+ unfreeze
8889
+
8890
+ This functionality has been added because the shell may sometimes be
8891
+ quite eager to change the main sequence, so we needed a way to
8892
+ disable any further modifications (until "unfreeze" is issued
8893
+ that is).
8894
+
8895
+ ## Support for other programming languages
8896
+
8897
+ The main programming language for the bioroebe project is **ruby**.
8898
+ Ruby, from a language design point of view, is a great programming
8899
+ language - not necessarily all of ruby, but the subset that I use.
8900
+ It is very easy to quickly prototype ideas via ruby.
8901
+
8902
+ However, ruby is known to **not** be among the fastest programming
8903
+ languages on this planet; so, it makes sense to use other
8904
+ languages too from this point of view. Additionally there are some
8905
+ software stacks in use in **other** programming languages, such as
8906
+ matplotlib and various more.
8907
+
8908
+ Thus, it is important to **support other programming languages** as
8909
+ well, if there are useful libraries. The bioroebe project, after
8910
+ all, tries to be **practical**: it focuses on getting things done,
8911
+ no matter the language.
8912
+
8913
+ This means that support for other programming languages can be
8914
+ found in this project as well, often using system() or similar
8915
+ functionality to tap into these other programming languages. Do
8916
+ not be surprised when that happens - the bioroebe project will
8917
+ also try to act as a **practical glue** towards functionality
8918
+ enabled via other projects. We want to get things done, no
8919
+ matter the programming language at hand!
8920
+
8921
+ Whenever possible, though, the bioroebe project will try to be
8922
+ flexible in this regard, so ideally the same solution should
8923
+ work for many different programming languages.
8924
+
8925
+ While Ruby is the primary language for this project, as
8926
+ of 2021 I will try to officially support **java**, **jruby**
8927
+ and the **GraalVM**. This is on my TODO list, though - stay
8928
+ tuned for more updates in this regard. See also the
8929
+ subsection <b>Support for Python</b>.
8930
+
8931
+ ## Support for Python
8932
+
8933
+ In <b>June 2022</b> I decided to add support for Python to bioroebe.
8934
+
8935
+ While people can - and should - easily use <b>biopython</b> instead,
8936
+ I simply wanted to see how much python-support I can add to
8937
+ bioroebe. This may lag behind some years compared to biopython,
8938
+ but I wanted to extend python support as well, so there you go.
8939
+ It is simply an additional option for the bioroebe project.
8940
+ <b>Ruby</b> will remain the primary language for the project,
8941
+ though, at the least for now.
8942
+
8943
+ ## Bioroebe::ProfilePattern
8944
+
8945
+ This class can be used to generate nucleotide sequences that
8946
+ are not quite "random". For example, to generate sequences
8947
+ that may "simulate" a TATA box.
8948
+
8949
+ The idea for this class is to be extended into allowing
8950
+ HMMs (Hidden Markov Models) one day.
8951
+
8952
+ Usage example:
8953
+
8954
+ _ = Bioroebe::ProfilePattern.new(ARGV, :do_not_run_yet)
8955
+ _.generate_sequence_based_on_this_profile
8956
+
8957
+ Such a profile specifies the preferred sequence
8958
+ letters for each position in a section of DNA. You have to provide
8959
+ the Hash into the method generate_sequence_based_on_this_profile() -
8960
+ or you use the default Hash, which is stored in the constant
8961
+ called **PER_POSITION_HASH**.
8962
+
8963
+ That profile should be a Hash, with keys pointing to A, T, C, G
8964
+ and each value being an Array with one likelihood per position, given
8965
+ as a number, such as 140. These values are also called
8966
+ **scores**. Each Array contains one score for each position
8967
+ that indicates how likely it is to find the given
8968
+ nucleotide at that location.
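The profile idea can be sketched in plain Ruby. This is not the implementation of class ProfilePattern, and the scores below are made up (they are not the values in PER_POSITION_HASH). The helper name `sequence_from_profile` is hypothetical as well. It draws one nucleotide per position, with a probability proportional to the score at that position.

```ruby
# Hypothetical helper: draw one nucleotide per position from a profile.
# `profile` maps each nucleotide to an Array of scores, one per position.
def sequence_from_profile(profile, random: Random.new)
  length = profile.values.first.size
  Array.new(length) do |position|
    scores = profile.transform_values { |column| column[position] }
    pick = random.rand * scores.values.sum
    # Walk through the cumulative scores until the random pick is covered.
    cumulative = 0
    nucleotide, = scores.find { |_, score| pick < (cumulative += score) }
    nucleotide || scores.keys.last
  end.join
end

# Made-up scores that favour the motif TATA.
tata_like = {
  'A' => [5, 90, 5, 90],
  'T' => [90, 5, 90, 5],
  'C' => [3, 3, 3, 3],
  'G' => [2, 2, 2, 2]
}
puts sequence_from_profile(tata_like) # usually TATA, occasionally a variant
```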
8969
+
8970
+ You can also use this class to generate a random DNA string,
8971
+ similar to the method called
8972
+ **Bioroebe.generate_random_dna_sequence()**. The difference
8973
+ is that class ProfilePattern allows for a bit more fine-tuned
8974
+ control. The class will likely be extended in the future too.
8975
+
8976
+ ## Generate DNA via Bioroebe.random_dna
8977
+
8978
+ You can "generate" random DNA strings by making use of the
8979
+ following code:
8980
+
8981
+ x = Bioroebe.random_dna 50 # => "AGACATCCGGCTTGGATACCTCATAAGTCATATCAGCATCGTCGGACATT"
8982
+
8983
+ As can be seen in the example above, after the #, a String will be
8984
+ returned representing that nucleotide sequence. In the case above
8985
+ it'll be 50 nucleotides in length.
8986
+
8987
+ The number given to <b>.random_dna()</b> tells the method how many
8988
+ nucleotides should be generated.
8989
+
8990
+ The method accepts a second argument, which should be a Hash.
8991
+ If it is a hash then the generated DNA will be based on the
8992
+ **probabilities** given to that Hash.
8993
+
8994
+ Let's look at a specific example here:
8995
+
8996
+ Bioroebe.random_dna(50, { A: 10, T: 10, C: 10, G: 70}) # => "GGGGTGGGGAGGGTATGCGGAGGAAGGGCGGGAAGGGCGGGGGCTGGGCG"
8997
+
8998
+ As you can see, in the Hash defined above, the likelihood for
8999
+ incorporating a Guanine is much higher than for Adenine
9000
+ (70 : 10). This will be reflected in the generated DNA
9001
+ sequence which, as can be seen, contains many more
9002
+ Guanines than Adenines.
9003
+
9004
+ There is yet a third use case for the above. If you pass a **String**
9005
+ as the second argument rather than a Hash, then that String will be
9006
+ used as basis for generating the DNA string at hand.
9007
+
9008
+ Again, let's look at a specific example here:
9009
+
9010
+ Bioroebe.random_dna(10, 'ATCGATCGGG')
9011
+
9012
+ Here we add more G than A, T or C, so the new DNA sequence should
9013
+ contain more G than the other nucleotides as well.
9014
+
9015
+ More usage examples in this regard:
9016
+
9017
+ Bioroebe.random_dna(20, 'ATGGGGGGGG') # => "TGAGGGGGGGGGTGGGAGGG"
9018
+ Bioroebe.random_dna(20, 'ATGGGGGGGG') # => "GGTAGGGGGGGGTAGGGGGG"
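Both variants, the Hash of weights and the String used as a pool, boil down to weighted sampling. The following plain-Ruby sketch shows one simple way to do that. It is not the code of Bioroebe.random_dna, the name `weighted_random_dna` is made up, and it assumes non-negative integer weights.

```ruby
# Hypothetical helper: generate `length` random nucleotides. `weights` may be
# a Hash of relative weights or a String whose letters act as the pool.
def weighted_random_dna(length, weights = 'ATCG', random: Random.new)
  pool =
    if weights.is_a?(Hash)
      # Expand { A: 1, G: 3 } into %w[A G G G]; assumes integer weights.
      weights.flat_map { |nucleotide, weight| [nucleotide.to_s] * weight }
    else
      weights.chars
    end
  Array.new(length) { pool.sample(random: random) }.join
end

puts weighted_random_dna(50, { A: 10, T: 10, C: 10, G: 70 })
puts weighted_random_dna(20, 'ATGGGGGGGG')
```

Expanding the weights into a pool is easy to read but wasteful for large weights. Sampling against cumulative weights would avoid building the pool.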
9019
+
9020
+ Note that this is similar to the .randomize() method in the bioruby
9021
+ project:
9022
+
9023
+ hash = {'a'=>1,'c'=>2,'g'=>3,'t'=>4}
9024
+ puts Bio::Sequence::NA.randomize(hash) # => "ggcttgttac" (for example)
9025
+
9026
+ ## Generating a random nucleotide sequence based on frequencies
9027
+
9028
+ If you ever need to generate a nucleotide sequence from given frequencies then you can use
9029
+ the following method:
9030
+
9031
+ Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies
9032
+ Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies 100
9033
+ Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies 500
9034
+
9035
+ ## Parsing genbank (.gbk) files
9036
+
9037
+ You could use class <b>Bioroebe::GenbankParser</b> to parse .gbk files, at
9038
+ the least if you want to obtain the raw sequence, in FASTA format.
9039
+
9040
+ Example for this:
9041
+
9042
+ require 'bioroebe/genbank/genbank_parser.rb'
9043
+ result = Bioroebe::GenbankParser.new('/home/Temp/bioroebe/ls_orchid.gbk')
9044
+ result.dataset? # This method call will return the FASTA sequence.
9045
+
9046
+ Note that this currently (<b>July 2022</b>) only grabs one entry. In
9047
+ a future rewrite the parser will be able to parse
9048
+ all entries, and then present them to the user. Stay tuned in this
9049
+ regard.
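What "obtain the raw sequence, in FASTA format" means for a single record can be sketched in plain Ruby. This is not the code of Bioroebe::GenbankParser. The helper name `genbank_to_fasta` and the tiny record below are made up for illustration, and only the first LOCUS name and the first ORIGIN block are read.

```ruby
# Hypothetical helper: pull the sequence of the first record out of GenBank
# text and return it as a FASTA-formatted String.
def genbank_to_fasta(genbank_text)
  locus = genbank_text[/^LOCUS\s+(\S+)/, 1] || 'unknown'
  # The sequence sits between the ORIGIN line and the terminating "//".
  origin = genbank_text[/^ORIGIN.*?\n(.*?)^\/\//m, 1].to_s
  # Strip the position numbers and the whitespace; keep only the letters.
  sequence = origin.gsub(/[^A-Za-z]/, '').upcase
  ">#{locus}\n#{sequence}\n"
end

# A made-up, heavily shortened record.
record = <<~GENBANK
  LOCUS       EXAMPLE1     24 bp    DNA     linear   PLN 01-JAN-2000
  ORIGIN
          1 atggcgtacg ttagcctgaa tcga
  //
GENBANK
puts genbank_to_fasta(record)
# >EXAMPLE1
# ATGGCGTACGTTAGCCTGAATCGA
```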
9050
+
9051
+ ## Parsers in general
9052
+
9053
+ The bioroebe project will store most parsers in the parsers/ subdirectory
9054
+ since <b>July 2022</b>.
9055
+
9056
+ Prior to that date different parsers were stored in different subdirectories,
9057
+ such as the parser for genbank-files being stored in the genbank/
9058
+ subdirectory. As I found this situation confusing, I settled for
9059
+ the parsers/ subdirectory as of <b>July 2022</b>.
9060
+
9061
+ ## Coomassie staining of proteins
9062
+
9063
+ Coomassie staining is typically done on proteins, giving them a blue
9064
+ or bluish colour. <b>Coomassie Brilliant Blue</b> is <b>the most popular
9065
+ anionic protein dye</b>.
9066
+
9067
+ This may look like this:
9068
+
9069
+ <img src="https://i.imgur.com/6eUN7HR.png" style="margin: 1em; border: 1px solid black">
9070
+
9071
+ This picture shows five different bands. The molecular weight of the
9072
+ marker can be seen on the very left hand side, in <b>kDa</b>. The
9073
+ larger fragments can be seen on top, so the farther the band has
9074
+ moved, the smaller the fragment must be (in kDa). That means that
9075
+ the larger proteins can be found on top; the smaller proteins on
9076
+ the bottom.
9077
+
9078
+ Some bands are missing, and this gives information - that is
9079
+ that a particular protein is missing. Probably it was not
9080
+ synthesized in the given tissue at hand.
9081
+
9082
+ The staining for a Coomassie Blue stain is typically done
9083
+ via G-250, at a 0.5% concentration prepared in
9084
+ 50% methanol and 10% acetic acid. The staining duration is
9085
+ usually 5 minutes.
9086
+
9087
+ Note that the G-250 stain is the dimethyl derivative of
9088
+ R-250 - the <b>R</b> stands for <b>red</b> or <b>reddish</b>.
9089
+ Both dyes will bind via electrostatic interaction with <b>protonated
9090
+ basic amino acids</b>: that is <b>lysine</b>, <b>arginine</b>,
9091
+ and <b>histidine</b>. They can also bind via hydrophobic
9092
+ associations to aromatic residues.
9093
+
9094
+ Coomassie stains are in principle reversible. They are not
9095
+ as sensitive as silver staining, but significantly cheaper,
9096
+ which is one reason why they have become so popular.
9097
+
9098
+ Not every protein has all aminoacids, so staining may be difficult.
9099
+ For instance, the <b>glycomacropeptide</b> is the only known
9100
+ naturally occurring protein that contains no Phe (Phenylalanine; F).
9101
+
9102
+ A protein that lacks lysine, arginine, histidine or aromatic
9103
+ amino acids may be undetectable via Coomassie staining. However,
9104
+ this does not seem to be a universal rule; some groups report
9105
+ that they even managed to stain "unstainable" proteins via
9106
+ Coomassie staining.
9107
+
9108
+ The paper at https://www.jbc.org/article/S0021-9258(17)39198-6/pdf,
9109
+ titled "Why Does Coomassie Brilliant Blue R Interact Differently
9110
+ with Different Proteins?" and published in the year 1985, tries
9111
+ to explain why different groups obtain different
9112
+ results via Coomassie staining.
9113
+
9114
+ They specifically point out that "there is a striking correlation
9115
+ between intensity of response to Coomassie dyes and the basicity
9116
+ of a protein which depends on the number of lysine, histidine,
9117
+ and arginine residues, as well as the NH₂-terminal amino group"
9118
+ (aka the aminoterminus of the protein at hand). The concluding
9119
+ remark from that paper is that <b>"Coomassie R Interacts
9120
+ Differently with Different Proteins"</b>.
9121
+
9122
+ On class <b>Bioroebe::Protein</b> you can determine whether
9123
+ a given protein can be stained via coomassie through the
9124
+ following method:
9125
+
9126
+ .can_be_stained_via_coomassie?
9127
+
9128
+ This isn't an ideal check, so don't rely on it. It will simply
9129
+ check whether the sequence has at least one lysine,
9130
+ or one histidine, or one arginine, or any of the aromatic
9131
+ amino acids.
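The check described above can be written down in a few lines of plain Ruby. This is a sketch of the described rule only, not the method on Bioroebe::Protein. The name `stainable_via_coomassie?` is made up, and the sketch assumes one-letter codes with F, W and Y counted as the aromatic amino acids.

```ruby
# Hypothetical helper mirroring the description above: true if the one-letter
# protein sequence contains K, R, H or an aromatic residue (F, W, Y).
def stainable_via_coomassie?(protein)
  protein.upcase.match?(/[KRHFWY]/)
end

puts stainable_via_coomassie?('MVTDEGAIYF') # true  (contains Y and F)
puts stainable_via_coomassie?('GASPGASP')   # false (no basic or aromatic residue)
```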
9132
+
9133
+ ## Codon Usage
9134
+
9135
+ This **section** deals with some aspects of **codon usage** in different
9136
+ organisms.
9137
+
9138
+ Let us first define the term <b>codon usage</b> so we can base any further
9139
+ analysis on this definition. In order to do so, we also have to define
9140
+ what a <b>codon</b> is, so let's start with that actually.
9141
+
9142
+ A <span style="color: darkgreen; font-weight: bold">codon</span> is
9143
+ essentially the basic code used in DNA to denote which particular
9144
+ **aminoacid** corresponds to these (three) nucleotide base pairs.
9145
+ A codon is thus <b>a series of three nucleotides</b>, also called
9146
+ a <b>triplet</b>, such as <b>ATG</b>.
9147
+
9148
+ When we use the term <b>base pairs</b>, we refer to **double-stranded DNA**,
9149
+ abbreviated as <b>dsDNA</b>. The codon is, however, only found
9150
+ in a single stranded molecule, even within dsDNA. Since some parts of
9151
+ a **dsDNA** in any given genome give rise to a, more or less, complementary
9152
+ copy into **mRNA**, the codons that are actually used, are found in the
9153
+ corresponding mRNA as well, including the codon that codes for a stop
9154
+ signal (a so-called <b>stop codon</b>). (Remember that mRNA differs from
9155
+ DNA in that there will be Uracil rather than Thymine; otherwise it is
9156
+ the same, sequence-wise. Of course it uses another sugar (Ribose), but
9157
+ remember we are here mostly interested in the **information-containing
9158
+ part**, not the full chemical structure.)
9159
+
9160
+ The <b>codon</b> is thus found on the mRNA and since mRNA is mostly
9161
+ single-stranded, the codon is a component of the mRNA. The two subunits
9162
+ of the ribosome are assembled on a mRNA, at the least in prokaryotes (or
9163
+ more accurately, the smaller subunit scans along the mRNA until it
9164
+ <b>detects</b> a start codon). Mind you, this subsection will not go
9165
+ into all relevant details, so just keep in mind that the codon is the
9166
+ part that will eventually be "<i>translated</i>" at the ribosome into
9167
+ a corresponding aminoacid, excluding stop codons at the end.
9168
+
9169
+ Now - different organisms use **different frequencies of codons**.
9170
+ <b style="color:darkblue">Codon usage</b> thus describes the fact
9171
+ that many proteins in these different organisms make use of certain
9172
+ codons with a **substantially higher frequency than other codons**.
9173
+ We can use statistics to infer this on a global (proteome)
9174
+ level too.
9175
+
9176
+ Remember that the genetic code is **degenerate**, meaning that
9177
+ you have a few aminoacids that are encoded only by one codon
9178
+ (<b>Tryptophan</b> and <b>Methionine</b>), whereas the other
9179
+ aminoacids are encoded by more than one codon - thus, at the
9180
+ very least two codons. Note that the latter codons, if they
9181
+ code for the **same** aminoacid, are also called
9182
+ <b style="font-style: italic">synonymous codons</b>.
9183
+
9184
+ This means that if you have any given aminoacid chain, you can have
9185
+ several different sequences that would yield the very same
9186
+ amino acid chain (and codons in these sequences, which
9187
+ ultimately means that you can have different DNA sequences
9188
+ code for the very same aminoacid chain).
9189
+
9190
+ Usually the third base of a codon has the least influence on
9191
+ codon meaning. This is also called <b>wobbling</b> - since
9192
+ the anticodon loop on the tRNA is in the reverse direction,
9193
+ and the wobble position refers to the tRNA, this means that
9194
+ the wobble-position is at the 5'-end of the tRNA anticodon.
9195
+
9196
+ Now a few words about functionality related to codons and codon
9197
+ usage in the Bioroebe project.
9198
+
9199
+ Say that you have a long DNA sequence; let's pick a sample
9200
+ for now, such as:
9201
+
9202
+ ATGGGCGGGGTGATGGCAATGATGCCCCCGATGATG
9203
+
9204
+ You can analyze the codons used via class **ShowCodonUsage**
9205
+ and the corresponding entry at <b>bin/show_codon_usage</b>:
9206
+
9207
+ show_codon_usage ATGGGCGGGGTGATGGCAATGATGCCCCCGATGATG
9208
+
9209
+ This class can be found at <b>bioroebe/codons/show_codon_usage.rb</b>.
9210
+ It will report the top 5 codons in use and also output the
9211
+ frequency hash on the commandline.
9212
+
9213
+ On my computer at home the output it yields via the commandline,
9214
+ on a KDE konsole terminal, looks like this:
9215
+
9216
+ <img src="https://i.imgur.com/h55Thdu.png" style="margin: 1em; border: 3px solid black">
9217
+
9218
+ You can use this from within ruby code too, via the following
9219
+ toplevel method:
9220
+
9221
+ Bioroebe.codon_frequencies_of_this_sequence(ARGV)
9222
+
9223
+ To get the hash of the codon frequencies you can use the .hash? method:
9224
+
9225
+ hash = Bioroebe.codon_frequencies_of_this_sequence('ATGGGCGGGGTGATGGCAATGATGCCCCCGATGATG').hash?
9226
+
9227
+ If you want to look at the actual codon frequencies used
9228
+ by different organisms, have a look here:
9229
+
9230
+ http://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=11076&aa=9&style=N
9231
+
9232
+ This is an excellent resource.
9233
+
9234
+ For instance, the <i>E. coli</i> K strain can be found here:
9235
+
9236
+ https://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=83333&aa=9&style=N
9237
+
9238
+ ## Determining the frequencies of aminoacids in a given aminoacid (protein) sequence
9239
+
9240
+ If you quickly wish to determine the aminoacid composition, as a
9241
+ Hash, you can use **bin/aminoacid_frequencies**.
9242
+
9243
+ Example from the commandline for this:
9244
+
9245
+ aminoacid_frequencies MVTDEGAIYFTKDAARNWKAAVEETVSATLNRTVSSGITGASYYTGTFST
9246
+
9247
+ Example from within bioroebe itself (and thus ruby):
9248
+
9249
+ require 'bioroebe/frequencies.rb'
9250
+
9251
+ Bioroebe.aminoacid_frequencies('MVTDEGAIYFTKDAARNWKAAVEETVSATLNRTVSSGITGASYYTGTFST')
9252
+
9253
+ The latter will return a Hash that you can then make further use of, such as:
9254
+
9255
+ {"M"=>1, "V"=>4, "T"=>9, "D"=>2, "E"=>3, "G"=>4, "A"=>7, "I"=>2, "Y"=>3, "F"=>2, "K"=>2, "R"=>2, "N"=>2, "W"=>1, "S"=>5, "L"=>1}
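Counting the aminoacids of a sequence needs very little code in plain Ruby. The following sketch is not the code behind Bioroebe.aminoacid_frequencies. The helper name `count_aminoacids` is made up, and the sketch assumes one-letter codes and Ruby 2.7 or newer (for `tally`).

```ruby
# Hypothetical helper: count how often each one-letter aminoacid occurs.
def count_aminoacids(protein)
  protein.upcase.chars.tally
end

frequencies = count_aminoacids('MVTDEGAIYFTKDAARNWKAAVEETVSATLNRTVSSGITGASYYTGTFST')
puts frequencies['T'] # 9
puts frequencies['M'] # 1
```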
9256
+
9257
+ ## Determining the codon frequencies from the commandline
9258
+
9259
+ In <b>April 2022</b> I noticed that one use case is to show the
9260
+ codon frequencies of a given sequence - typically a nucleotide sequence.
9261
+
9262
+ For aminoacids there already was an executable, at **bin/aminoacid_frequencies**.
9263
+
9264
+ So, following that logic, a new executable was added at
9265
+ **bin/codon_frequency**. This will show the Hash of the codon
9266
+ frequencies, as a String, on the commandline.
9267
+
9268
+ Usage example:
9269
+
9270
+ codon_frequency ATTCGTACGATCGACTGACTGACAGTCATTCGT
9271
+
9272
+ The output of this would be the following:
9273
+
9274
+ AUU: 2
9275
+ CGU: 2
9276
+ ACG: 1
9277
+ AUC: 1
9278
+ GAC: 1
9279
+ UGA: 1
9280
+ CUG: 1
9281
+ ACA: 1
9282
+ GUC: 1
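A listing like this can be reproduced with a short plain-Ruby sketch. It is not the code of bin/codon_frequency. The helper name `count_codons` is made up, and the sketch assumes reading frame 1 and Ruby 2.7 or newer (for `tally`).

```ruby
# Hypothetical helper: count the codons of frame 1 and report them as RNA.
# Trailing nucleotides that do not fill a triplet are ignored.
def count_codons(dna)
  dna.upcase.tr('T', 'U').scan(/.{3}/).tally
end

count_codons('ATTCGTACGATCGACTGACTGACAGTCATTCGT').each do |codon, count|
  puts "#{codon}: #{count}"
end
```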
9283
+
9284
+ ## Showing the codon frequency via countcodon
9285
+
9286
+ The excellent website at https://www.kazusa.or.jp/codon/countcodon.html offers
9287
+ a rather useful functionality via a simple web-interface, in that you can pass
9288
+ in a mRNA sequence, and it will then show the codon frequency/likelihood of
9289
+ that sequence - all codons in that sequence, that is. This can be extended
9290
+ to <b>all protein-coding genes in a given genome</b>, and will thus be
9291
+ useful for a researcher who may be interested in determining the codon
9292
+ frequency in general, across all genes in that given genome.
9293
+
9294
+ You can test it with an input sequence.
9295
+
9296
+ For instance, the following sequence:
9297
+
9298
+ ATTCGTACGATCGACTGACTGACAGTCATTCGTAGTACGATCGACTGACTGACAGTCATTCGTACGATCGACTGACTGACAAGTCATTCGTACGATCGACTGACTTGACAGTCATAA
9299
+
9300
+ Would yield this result:
9301
+
9302
+ fields: [triplet] [frequency: per thousand] ([number])
9303
+
9304
+ UUU 0.0( 0) UCU 0.0( 0) UAU 0.0( 0) UGU 0.0( 0)
9305
+ UUC 0.0( 0) UCC 0.0( 0) UAC 25.6( 1) UGC 0.0( 0)
9306
+ UUA 0.0( 0) UCA 25.6( 1) UAA 25.6( 1) UGA102.6( 4)
9307
+ UUG 0.0( 0) UCG 25.6( 1) UAG 0.0( 0) UGG 0.0( 0)
9308
+
9309
+ CUU 0.0( 0) CCU 0.0( 0) CAU 25.6( 1) CGU 76.9( 3)
9310
+ CUC 0.0( 0) CCC 0.0( 0) CAC 0.0( 0) CGC 0.0( 0)
9311
+ CUA 0.0( 0) CCA 0.0( 0) CAA 0.0( 0) CGA 25.6( 1)
9312
+ CUG102.6( 4) CCG 0.0( 0) CAG 25.6( 1) CGG 0.0( 0)
9313
+
9314
+ AUU 76.9( 3) ACU 25.6( 1) AAU 0.0( 0) AGU 51.3( 2)
9315
+ AUC 76.9( 3) ACC 0.0( 0) AAC 0.0( 0) AGC 0.0( 0)
9316
+ AUA 0.0( 0) ACA 76.9( 3) AAA 0.0( 0) AGA 0.0( 0)
9317
+ AUG 0.0( 0) ACG 76.9( 3) AAG 0.0( 0) AGG 0.0( 0)
9318
+
9319
+ GUU 0.0( 0) GCU 0.0( 0) GAU 25.6( 1) GGU 0.0( 0)
9320
+ GUC 51.3( 2) GCC 0.0( 0) GAC 76.9( 3) GGC 0.0( 0)
9321
+ GUA 0.0( 0) GCA 0.0( 0) GAA 0.0( 0) GGA 0.0( 0)
9322
+ GUG 0.0( 0) GCG 0.0( 0) GAG 0.0( 0) GGG 0.0( 0)
9323
+
9324
+ At any rate, the individual functionality for that is also available
9325
+ within the Bioroebe project since **April 2022**.
9326
+
9327
+ The method that does so is:
9328
+
9329
+ Bioroebe.frequency_per_thousand
9330
+ Bioroebe.frequency_per_thousand('ATTCGTACGATCGACTGACTGACAGTCATTCGTAGTACGATCGACTGACTGACAGTCATTCGTACGATCGACTGACTGACAAGTCATTCGTACGATCGACTGACTTGACAGTCATAA') # Usage example here.
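The "per thousand" value is simply the count of a codon divided by the total number of codons, multiplied by 1000. A plain-Ruby sketch follows. It is not the code of Bioroebe.frequency_per_thousand. The helper name `codons_per_thousand` is made up, and the sketch assumes reading frame 1 and Ruby 2.7 or newer.

```ruby
# Hypothetical helper: codon frequency per thousand codons, rounded to one
# decimal place, for frame 1 of the given DNA or mRNA sequence.
def codons_per_thousand(sequence)
  codons = sequence.upcase.tr('T', 'U').scan(/.{3}/)
  codons.tally.transform_values do |count|
    (count * 1000.0 / codons.size).round(1)
  end
end

result = codons_per_thousand('ATTCGTACGATCGACTGACTGACAGTCATTCGT')
puts result['AUU'] # 181.8 (2 of 11 codons)
puts result['UGA'] # 90.9  (1 of 11 codons)
```

Codons that do not occur in the input are absent from the returned Hash rather than listed with 0.0.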
9331
+
9332
+ Sinatra-bindings for this functionality have existed since July 2022,
9333
+ but they are not very well-polished. Ruby-gtk3 bindings may be
9334
+ added at a later time, and possibly ruby-libui bindings as well, for
9335
+ windows support. What is missing is support for different codon tables in
9336
+ different species, but that may be added at a later time as well - for now
9337
+ it seemed more important to offer the functionality.
9338
+
9339
+ ## Working with PDB files (.pdb)
9340
+
9341
+ The **PDB**, founded in the year **1971**, holds lots of **atomic
9342
+ structures of proteins**.
9343
+
9344
+ For instance, in **July 2016** it contained **121000 structures**.
9345
+
9346
+ In **February 2018** it contained **~124000 structures**
9347
+ (from X-ray crystallography), and about **~12000 NMR
9348
+ structures**. <b>NMR</b> is limited to about <b>350 amino
9349
+ acids maximum length</b>, give or take.
9350
+
9351
+ In **April 2020** the PDB contained **163141 structures**.
9352
+
9353
+ We can see that more and more structures are available nowadays -
9354
+ a trend that will most likely continue or even accelerate.
9355
+ (Let's hope the quality also remains high.)
9356
+
9357
+ A typical .pdb file contains entries such as this:
9358
+
9359
+ RTyp Num Atm Res Ch ResN X Y Z Occ Temp PDB Line
9360
+ ATOM 1 N ASP L 1 4.060 7.307 5.186 1.00 51.58 1FDL 93
9361
+ ATOM 2 CA ASP L 1 4.042 7.776 6.553 1.00 48.05 1FDL 94
9362
+ ATOM 3 N VAL A 25 32.433 16.336 57.540 1.00 11.92 A1 N
9363
+ ATOM 4 CA VAL A 25 31.132 16.439 58.160 1.00 11.85 A1 C
9364
+ ATOM 5 C VAL A 25 30.447 15.105 58.363 1.00 12.34 A1 C
9365
+
9366
+ (The first line is not part of the file; **RTyp** and the other labels just explain the ATOM
9367
+ entries below that line).
9368
+
9369
+ The sequence starts from the N-terminal residue for proteins; see
9370
+ the <b>Atm</b> entry at <b>Num 1</b>.
9371
+
9372
+ The **meaning of these entries** is as follows:
9373
+
9374
+ 1) RTyp: Record Type
9375
+ 2) Num: Serial number of the atom. Each atom has a unique serial number.
9376
+ 3) Atm: Atom name (in IUPAC format).
9377
+ 4) Res: Residue name (IUPAC format).
9378
+ 5) Ch: Chain to which the atom belongs (in this case, L for light chain of an antibody).
9379
+ 6) ResN: Residue sequence number. This will be incremental, e. g. 1, 2, 3, 4 and so forth.
9380
+ 7,8,9) X, Y, Z: Cartesian coordinates specifying atomic position in space.
9381
+ 10) Occ: Occupancy factor
9382
+ 11) Temp: Temperature factor (atoms disordered in the crystal have high
9383
+ temperature factors; they are "wobbly" with a high factor.
9384
+ This is also called the B-factor).
9385
+ 12) PDB: The PDB data file unique identifier.
9386
+ 13) Line: Line (record) number in the data file.
9387
+
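The field list above can be turned into a small parser. This is a minimal sketch, not bioroebe's own parser: `ATOM_FIELDS` and `parse_simplified_atom_row` are hypothetical names, and the code assumes the simplified, whitespace-separated row shown in the listing. Real .pdb files use fixed column positions, so splitting on whitespace is not reliable there.

```ruby
# Field names taken from the header row of the simplified listing.
ATOM_FIELDS = %i[rtyp num atm res ch resn x y z occ temp pdb line].freeze

# Hypothetical helper (not part of bioroebe): turns one whitespace-separated
# row of the listing into a Hash keyed by the field names.
def parse_simplified_atom_row(row)
  values = row.split
  unless values.size == ATOM_FIELDS.size
    raise ArgumentError, "expected #{ATOM_FIELDS.size} fields, got #{values.size}"
  end
  record = ATOM_FIELDS.zip(values).to_h
  # Serial number and residue number are integers.
  %i[num resn].each { |key| record[key] = Integer(record[key]) }
  # Coordinates, occupancy and temperature factor are floats.
  %i[x y z occ temp].each { |key| record[key] = Float(record[key]) }
  record
end

row  = 'ATOM 1 N ASP L 1 4.060 7.307 5.186 1.00 51.58 1FDL 93'
atom = parse_simplified_atom_row(row)
puts atom[:res] # prints ASP
puts atom[:x]   # prints 4.06
```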
9388
+ Typically the rightmost entry, the last one, specifies
9389
+ which element the atom is. An **H** stands for a hydrogen atom; the other atoms
9390
+ are "heavy" atoms (heavier than hydrogen most definitely).
9391
+
9392
+ Most .pdb files will contain **SEQRES** entries. These entries will list
9393
+ the primary sequence of the polymeric molecules present in the entry.
9394
+ You can notice this by looking at the standard 3-character code
9395
+ used by SEQRES here, for the canonical amino acids. So, for instance,
9396
+ the amino acids that will be mentioned in a SEQRES entry are
9397
+ ALA, CYS, ASP, GLU, PHE, GLY, HIS, ILE, LYS, LEU, MET, ASN,
9398
+ PRO, GLN, ARG, SER, THR, VAL, TRP and TYR. You can use the
9399
+ method **Bioroebe.three_to_one()** to convert back to the
9400
+ one-letter code, as follows:
9401
+
9402
+ Bioroebe.three_to_one('PHE') # => "F"
9403
+
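To illustrate how SEQRES records relate to the one-letter sequence, here is a minimal sketch. It is not the bioroebe implementation: `THREE_TO_ONE` and `sequence_from_seqres` are hypothetical names, the two SEQRES lines are hand-written samples in the fixed-column layout (residue names start at column 20), unknown residues are mapped to 'X' by assumption, and all chains are simply concatenated.

```ruby
# The 20 canonical amino acids, three-letter code => one-letter code.
THREE_TO_ONE = {
  'ALA' => 'A', 'CYS' => 'C', 'ASP' => 'D', 'GLU' => 'E', 'PHE' => 'F',
  'GLY' => 'G', 'HIS' => 'H', 'ILE' => 'I', 'LYS' => 'K', 'LEU' => 'L',
  'MET' => 'M', 'ASN' => 'N', 'PRO' => 'P', 'GLN' => 'Q', 'ARG' => 'R',
  'SER' => 'S', 'THR' => 'T', 'VAL' => 'V', 'TRP' => 'W', 'TYR' => 'Y'
}.freeze

# Hypothetical helper: collects the residue names of all SEQRES lines
# and joins their one-letter codes into a single String.
def sequence_from_seqres(lines)
  lines.select { |line| line.start_with?('SEQRES') }
       .flat_map { |line| line[19..-1].to_s.split } # residues start at column 20
       .map { |code| THREE_TO_ONE.fetch(code, 'X') } # 'X' for anything unknown
       .join
end

# Hand-written sample records, not copied from a real PDB entry.
sample = [
  'SEQRES   1 A   15  MET LEU SER ASP GLU ASP PHE LYS ALA VAL PHE GLY MET',
  'SEQRES   2 A   15  THR ARG'
]
puts sequence_from_seqres(sample) # prints MLSDEDFKAVFGMTR
```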
9404
+ The data in a .pdb file need not necessarily only be a protein, with
9405
+ a specific aminoacid sequence. It may also include DNA. An example
9406
+ for such a molecule is
9407
+ <b><a href="http://rcsb.org/pdb/explore/explore.do?structureId=2DGC">2dgc</a></b>,
9408
+ which includes a protein chain and a DNA chain.
9409
+
9410
+ As far as the **bioroebe project** is concerned, you can parse .pdb files
9411
+ via the following class:
9412
+
9413
+ Bioroebe::ParsePdbFile.new
9414
+ Bioroebe::ParsePdbFile.new(path_to_the_pdb_file_here)
9415
+ Bioroebe::ParsePdbFile.new('/foo/bar/ack.pdb')
9416
+
9417
+ This class also allows some shortcuts for integrated .pdb files,
9418
+ that is files that are bundled with the bioroebe project:
9419
+
9420
+ Bioroebe::ParsePdbFile.new ':1fat'
9421
+
9422
+ This requires a String because ruby symbols may not start with
9423
+ a number. Note that this also works through the commandline,
9424
+ such as:
9425
+
9426
+ parse_pdb_file :1fat
9427
+
9428
+ A shell such as bash does not understand ruby symbols, so instead
9429
+ a string will be passed in, being :1fat. The ParsePdbFile class will
9430
+ handle this correctly internally.
9431
+
9432
+ Note that a small bug was fixed in the file parse_pdb_file.rb;
9433
+ some entries were skipped due to an erroneous loop in the ruby
9434
+ file. This was corrected in **May 2020**.
9435
+
9436
+ In **March 2021** the ability to use entries such as ':1fat'
9437
+ was removed again; the code remains though. The reason why
9438
+ this was removed was that the .pdb files are quite large,
9439
+ so distributing them via the bioroebe project makes no real
9440
+ sense. Consider simply downloading the .pdb files; you
9441
+ can do this from the bioshell or via something
9442
+ like:
9443
+
9444
+ pdb 5TIM
9445
+
9446
+ Note that you can also return the aminoacid-sequence from a
9447
+ .pdb file directly, as of **May 2020**.
9448
+
9449
+ Example for this:
9450
+
9451
+ Bioroebe.return_aminoacid_sequence_from_this_pdb_file "1VII.pdb" # => "MLSDEDFKAVFGMTRSAFANLPLWKQQNLKKEKGLF"
9452
+
9453
+ The first argument should be **the path to the (local)
9454
+ .pdb file at hand**. (In theory support for remote .pdb
9455
+ files could also be added easily, but right now this
9456
+ is not possible, so you have to download it first.)
9457
+
9458
+ The **specification for .pdb files** can be read at the following
9459
+ two remote resources:
9460
+
9461
+ http://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html
9462
+ http://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#ATOM
9463
+
9464
+ Note that the parse_pdb_file.rb can also do some additional
9465
+ things, such as calculating the maximum distance between
9466
+ atoms in that file, via the method
9467
+ **.try_to_determine_the_max_distance_between_the_atoms_in_this_protein()**.
9468
+
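How such a maximum distance can be computed is sketched below. This is an assumption about the general technique (brute-force comparison of all atom pairs, which is quadratic in the number of atoms), not a copy of what parse_pdb_file.rb does; `max_pairwise_distance` is a hypothetical name.

```ruby
# Hypothetical helper: largest Euclidean distance between any two
# points in an Array of [x, y, z] coordinates. Returns nil when fewer
# than two coordinates are given.
def max_pairwise_distance(coords)
  coords.combination(2).map { |a, b|
    Math.sqrt(a.zip(b).sum { |p, q| (p - q)**2 })
  }.max
end

# Coordinates of the first three ATOM rows of the listing shown earlier.
coords = [
  [4.060, 7.307, 5.186],
  [4.042, 7.776, 6.553],
  [32.433, 16.336, 57.540]
]
puts max_pairwise_distance(coords).round(3)
```

For large structures a smarter approach (for example, restricting the search to atoms on the convex hull) would be faster, but the brute-force version is easy to verify.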
9469
+ If you wish to report the secondary structures from a given .pdb file
9470
+ then you can use the following class:
9471
+
9472
+ require 'bioroebe/pdb/report_secondary_structures_from_this_pdb_file.rb'
9473
+
9474
+ Bioroebe::ReportSecondaryStructuresFromThisPdbFile.new
9475
+ Bioroebe::ReportSecondaryStructuresFromThisPdbFile.new('foobar.pdb')
9476
+
9477
+ If you wish to obtain the FASTA sequence of a particular remote
9478
+ .pdb file then you can use this API:
9479
+
9480
+ x = Bioroebe.return_fasta_sequence_from_this_pdb_file "2bts" # => "MLSDEDFKAVFGMTRSAFANLPLWKQQNLKKEKGLF"
9481
+
9482
+ Keep in mind that this is the FASTA sequence; the .pdb file itself
9483
+ has another format, and contains a lot more information, such as
9484
+ the various ATOM entries.
9485
+
9486
+ As of **June 2020** the command **fetch** also works from
9487
+ within the Bioshell, similar to how pymol **works**. This allows
9488
+ us to quickly download a remote .pdb file.
9489
+
9490
+ fetch 2BTS
9491
+
9492
+ You can also use the following toplevel-API to download a remote
9493
+ .pdb file:
9494
+
9495
+ Bioroebe.download_this_pdb
9496
+ Bioroebe.download_this_pdb '355D'
9497
+ Bioroebe.download_this_pdb '1K4R' # This is the Dengue Virus
9498
+ Bioroebe.download_this_pdb '1fat.pdb' # Lectin Phytohemagglutinin
9499
+
9500
+ This will refer to a remote URL such as
9501
+ https://files.rcsb.org/view/1FAT.pdb.
9502
+
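The mapping from an input such as '1fat.pdb' to that remote URL can be sketched as follows. `rcsb_view_url` is a hypothetical helper, not the bioroebe API, and it assumes classic four-character PDB identifiers that start with a digit. It only builds the URL and does not download anything.

```ruby
# Hypothetical helper: normalises a PDB identifier (with or without a
# file extension, in any letter case) and returns the matching URL.
def rcsb_view_url(id)
  pdb_id = File.basename(id.to_s.strip, '.*').upcase
  unless pdb_id.match?(/\A[0-9][A-Z0-9]{3}\z/)
    raise ArgumentError, "not a 4-character PDB id: #{id.inspect}"
  end
  "https://files.rcsb.org/view/#{pdb_id}.pdb"
end

puts rcsb_view_url('1fat.pdb') # prints https://files.rcsb.org/view/1FAT.pdb
puts rcsb_view_url('355D')     # prints https://files.rcsb.org/view/355D.pdb
```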
9503
+ Note that the file will be automatically moved to the "correct" default
9504
+ position in the bioroebe-project, under the **pdb/** subdirectory.
9505
+
9506
+ You can also invoke this script from the commandline via
9507
+ **bin/download_this_pdb**, like this:
9508
+
9509
+ download_this_pdb 355D
9510
+
9511
+ This works with several .pdb files in one go as well:
9512
+
9513
+ download_this_pdb 1NR6 2F9Q 3TDA 2HI4 2V0M
9514
+
9515
+ They would all be downloaded one after the other. Be aware that
9516
+ this will overwrite any existing .pdb files at that location, so
9516
+ if you don't want this, I recommend making a backup of the
9518
+ **pdb/** subdirectory before invoking the above call.
9519
+
9520
+ You can also turn the FASTA sequence stored in a .pdb file into
9521
+ a .fasta file, via **--create-fasta-file**.
9522
+
9523
+ Usage examples:
9524
+
9525
+ parsedb 1NR6 --create-fasta-file
9526
+ parsedb 2F9Q --create-fasta-file
9527
+ parsedb 3TDA --create-fasta-file
9528
+ parsedb 2HI4 --create-fasta-file
9529
+ parsedb 2V0M --create-fasta-file
9530
+
9531
+ So if you have a file called <b>1NR6.pdb</b> and you use
9532
+ the first input, a .fasta file will be created. If such
9533
+ a .pdb file does not exist then this will not work, so
9534
+ make sure to download the .pdb file before invoking
9535
+ this commandline-flag.
9536
+
9537
+ Last but not least, the following table shall document the
9538
+ PDB format - it is not yet complete, but the remaining record
9539
+ types are intended to be added eventually:
9540
+
9541
+ Record Name Describes
9542
+ MODRES Modifications to standard residues
9543
+ HET Nonstandard residues (as well as ligands, ions and water)
9544
+ HETNAM Full chemical name of the residue
9545
+ HETSYM Synonyms for the residue
9546
+ FORMUL Chemical formula of the residue
9547
+ KEYWDS Keywords, such as "FK506 BINDING PROTEIN, FKBP12, CIS-TRANS PROLYL-ISOMERASE, ROTAMASE"
9548
+
9549
+
9550
+ ## Determining how many stop codons exist in a given sequence
9551
+
9552
+ You can use **bin/n_stop_codons_in_this_sequence** to determine
9553
+ how many stop codons exist in a given sequence at hand.
9554
+
9555
+ Usage example from the commandline:
9556
+
9557
+ n_stop_codons_in_this_sequence ATGACGTACGTCAGTCAGTGATAGTAA # => 4
9558
+
9559
+ You can also separate these via a ' ' spacer on the commandline of
9560
+ course:
9561
+
9562
+ n_stop_codons_in_this_sequence ATG ACG TAC GTC AGT CAG TGA TAG TAA # => 4
9563
+
9564
+ Internally this makes use of the method called
9565
+ <b>Bioroebe.n_stop_codons_in_this_sequence?</b> or one of its
9566
+ aliased names. Usage example for the method, just as in the
9567
+ first example shown above:
9568
+
9569
+ Bioroebe.n_stop_codons_in_this_sequence "ATGACGTACGTCAGTCAGTGATAGTAA" # => 4
9570
+
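The documented result of 4 for this sequence matches a count of every stop-codon triplet regardless of reading frame (TGA twice, TAG and TAA once each); counting only in the reading frame that starts at the first base gives 3. The sketch below shows both ways of counting. Whether bioroebe counts this way internally is an assumption based solely on the documented output; the method names are hypothetical and only DNA input is handled.

```ruby
STOP_CODONS = %w[TAA TAG TGA].freeze

# Upcases the input and drops everything that is not A, C, G or T
# (this also removes the ' ' spacers).
def clean_dna(sequence)
  sequence.upcase.delete('^ACGT')
end

# Counts every occurrence of a stop-codon triplet, in any frame.
def stop_codons_anywhere(sequence)
  clean_dna(sequence).scan(/TAA|TAG|TGA/).size
end

# Counts stop codons only in the frame starting at the first base.
def stop_codons_in_frame(sequence)
  clean_dna(sequence).scan(/.{3}/).count { |codon| STOP_CODONS.include?(codon) }
end

sequence = 'ATG ACG TAC GTC AGT CAG TGA TAG TAA'
puts stop_codons_anywhere(sequence) # prints 4
puts stop_codons_in_frame(sequence) # prints 3
```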
9571
+ ## The Aliphatic Index of Globular Proteins
9572
+
9573
+ In 1980, Atsushi Ikai provided a formula with which one can
9574
+ calculate the aliphatic index of a globular protein, in a short paper
9575
+ titled "Thermostability and aliphatic index of globular proteins"
9576
+ (<b>PMID: 7462208</b>,
9577
+ <a href="https://www.jstage.jst.go.jp/article/biochemistry1922/88/6/88_6_1895/_article">
9578
+ see here</a>).
9579
+
9580
+ Ikai provided a statistical analysis of proteins, and determined
9581
+ that the aliphatic index - which is defined as the relative volume
9582
+ of a protein occupied by <b>aliphatic side chains</b> (alanine, valine,
9583
+ isoleucine, and leucine) - of proteins of thermophilic bacteria
9584
+ is significantly higher than that of ordinary proteins.
9585
+
9586
+ Ikai reasoned that the index may be regarded as a positive
9587
+ factor for the <b>increase of thermostability of globular
9588
+ proteins</b>. The enzymes of some organisms are more stable
9589
+ at higher temperature than the enzymes of other organisms,
9590
+ in particular among <b>thermostable proteins</b>.
9591
+
9592
+ Thus, there is a good correlation between the "aliphatic
9593
+ index" on the one hand, and the thermostability of proteins
9594
+ on the other hand.
9595
+
9596
+ Ikai gave the following formula for calculating this:
9597
+
9598
+ Aliphatic Index = XA + a * XV + b * (XI + XL)
9599
+
9600
+ The four letters A, V, I and L refer to the four aminoacids
9600
+ Alanine, Valine, Isoleucine and Leucine; XA, XV, XI and XL are their mole
9601
+ percentages. The coefficients a and b are the volumes of the Valine side
9602
+ chain (a) and of the Isoleucine/Leucine side chains (b) relative to that of
9603
+ Alanine. a has a value range of 2.8-3.0 and b has a value range of 3.8-4.0.
9605
+
9606
+ The method called <b>.aliphatic_index()</b> makes use of that
9607
+ formula. As values for a and b the two values <b>2.9</b> and
9608
+ <b>3.9</b> have been taken. The code in the bioroebe project
9609
+ for this has been inspired by: https://github.com/wwood/bioruby-aliphatic_index
9610
+
9611
+ That project gives the following usage example for bioruby:
9612
+
9613
+ Bio::Sequence::AA.new('MVKSYDRYEYEDCLGIVNSKSSNCVFLNNA').aliphatic_index # => 71.33333
9614
+
9615
+ In bioroebe, the equivalent would be:
9616
+
9617
+ Bioroebe::Protein.new('MVKSYDRYEYEDCLGIVNSKSSNCVFLNNA').aliphatic_index # => 71.33333
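The arithmetic behind that value can be checked by hand with a minimal sketch of the formula. It assumes a = 2.9 and b = 3.9 and mole percentages computed over the whole sequence; `aliphatic_index_sketch` is a hypothetical name and not the bioroebe method itself.

```ruby
# Hypothetical helper: aliphatic index of a one-letter protein sequence,
# XA + a * XV + b * (XI + XL), with X given as mole percent.
def aliphatic_index_sketch(sequence, a: 2.9, b: 3.9)
  residues = sequence.upcase.delete('^A-Z')
  raise ArgumentError, 'empty sequence' if residues.empty?
  # Mole percent of one amino acid: 100 * count / sequence length.
  mole_percent = ->(letter) { 100.0 * residues.count(letter) / residues.length }
  mole_percent.call('A') +
    a * mole_percent.call('V') +
    b * (mole_percent.call('I') + mole_percent.call('L'))
end

# 30 residues with 1 A, 3 V, 1 I and 2 L:
# 3.33333 + 2.9 * 10 + 3.9 * 10 = 71.33333
puts aliphatic_index_sketch('MVKSYDRYEYEDCLGIVNSKSSNCVFLNNA').round(5) # prints 71.33333
```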
9186
9618
 
9187
9619
  ## Possibly useful links in regards to molecular biology and science in general
9188
9620