bioroebe 0.10.80 → 0.11.24

This version of bioroebe might be problematic.

Files changed (129)
  1. checksums.yaml +4 -4
  2. data/README.md +1204 -772
  3. data/bioroebe.gemspec +3 -3
  4. data/doc/README.gen +1203 -771
  5. data/doc/todo/bioroebe_todo.md +391 -365
  6. data/lib/bioroebe/aminoacids/aminoacid_substitution.rb +1 -9
  7. data/lib/bioroebe/aminoacids/codon_percentage.rb +1 -9
  8. data/lib/bioroebe/aminoacids/deduce_aminoacid_sequence.rb +1 -9
  9. data/lib/bioroebe/aminoacids/display_aminoacid_table.rb +1 -0
  10. data/lib/bioroebe/aminoacids/show_hydrophobicity.rb +1 -6
  11. data/lib/bioroebe/base/colours_for_base/colours_for_base.rb +18 -8
  12. data/lib/bioroebe/base/commandline_application/commandline_arguments.rb +13 -11
  13. data/lib/bioroebe/base/commandline_application/misc.rb +18 -8
  14. data/lib/bioroebe/base/misc.rb +16 -0
  15. data/lib/bioroebe/base/prototype/misc.rb +1 -1
  16. data/lib/bioroebe/codons/show_codon_tables.rb +6 -2
  17. data/lib/bioroebe/codons/show_codon_usage.rb +2 -1
  18. data/lib/bioroebe/constants/aminoacids_and_proteins.rb +1 -0
  19. data/lib/bioroebe/constants/database_constants.rb +1 -1
  20. data/lib/bioroebe/constants/files_and_directories.rb +20 -1
  21. data/lib/bioroebe/constants/misc.rb +20 -0
  22. data/lib/bioroebe/count/count_amount_of_nucleotides.rb +3 -0
  23. data/lib/bioroebe/crystal/README.md +2 -0
  24. data/lib/bioroebe/crystal/to_rna.cr +19 -0
  25. data/lib/bioroebe/data/README.md +11 -8
  26. data/lib/bioroebe/data/electron_microscopy/pos_example.pos +396 -0
  27. data/lib/bioroebe/data/electron_microscopy/test_particles.star +36 -0
  28. data/lib/bioroebe/{shell/tk.rb → electron_microscopy/electron_microscopy_module.rb} +15 -10
  29. data/lib/bioroebe/electron_microscopy/simple_star_file_generator.rb +4 -9
  30. data/lib/bioroebe/fasta_and_fastq/show_fasta_headers.rb +27 -12
  31. data/lib/bioroebe/genome/README.md +4 -0
  32. data/lib/bioroebe/genome/genome.rb +67 -0
  33. data/lib/bioroebe/gui/gtk3/protein_to_DNA/protein_to_DNA.rb +18 -18
  34. data/lib/bioroebe/gui/gtk3/random_sequence/random_sequence.rb +19 -11
  35. data/lib/bioroebe/gui/shared_code/protein_to_DNA/protein_to_DNA_module.rb +14 -14
  36. data/lib/bioroebe/misc/ruler.rb +1 -0
  37. data/lib/bioroebe/parsers/genbank_parser.rb +353 -24
  38. data/lib/bioroebe/parsers/gff.rb +1 -9
  39. data/lib/bioroebe/pdb/parse_pdb_file.rb +1 -9
  40. data/lib/bioroebe/project/project.rb +1 -1
  41. data/lib/bioroebe/python/README.md +1 -0
  42. data/lib/bioroebe/python/__pycache__/mymodule.cpython-39.pyc +0 -0
  43. data/lib/bioroebe/python/gui/gtk3/all_in_one.css +4 -0
  44. data/lib/bioroebe/python/gui/gtk3/all_in_one.py +59 -0
  45. data/lib/bioroebe/python/gui/gtk3/widget1.py +20 -0
  46. data/lib/bioroebe/python/gui/tkinter/all_in_one.py +91 -0
  47. data/lib/bioroebe/python/mymodule.py +8 -0
  48. data/lib/bioroebe/python/protein_to_dna.py +33 -0
  49. data/lib/bioroebe/python/shell/shell.py +19 -0
  50. data/lib/bioroebe/python/to_rna.py +14 -0
  51. data/lib/bioroebe/python/toplevel_methods/open_in_browser.py +20 -0
  52. data/lib/bioroebe/python/toplevel_methods/palindromes.py +42 -0
  53. data/lib/bioroebe/python/toplevel_methods/rds.py +13 -0
  54. data/lib/bioroebe/python/toplevel_methods/three_delimiter.py +34 -0
  55. data/lib/bioroebe/python/toplevel_methods/time_and_date.py +43 -0
  56. data/lib/bioroebe/python/toplevel_methods/to_camelcase.py +11 -0
  57. data/lib/bioroebe/requires/require_the_bioroebe_project.rb +3 -1
  58. data/lib/bioroebe/sequence/nucleotide_module/nucleotide_module.rb +28 -25
  59. data/lib/bioroebe/sequence/protein.rb +105 -3
  60. data/lib/bioroebe/sequence/sequence.rb +61 -2
  61. data/lib/bioroebe/shell/menu.rb +3451 -3366
  62. data/lib/bioroebe/shell/misc.rb +51 -4311
  63. data/lib/bioroebe/shell/readline/readline.rb +1 -1
  64. data/lib/bioroebe/shell/shell.rb +11192 -28
  65. data/lib/bioroebe/siRNA/siRNA.rb +81 -1
  66. data/lib/bioroebe/string_matching/find_longest_substring.rb +3 -2
  67. data/lib/bioroebe/taxonomy/class_methods.rb +3 -8
  68. data/lib/bioroebe/taxonomy/constants.rb +4 -3
  69. data/lib/bioroebe/taxonomy/edit.rb +2 -1
  70. data/lib/bioroebe/taxonomy/help/help.rb +10 -10
  71. data/lib/bioroebe/taxonomy/info/check_available.rb +15 -9
  72. data/lib/bioroebe/taxonomy/info/info.rb +17 -2
  73. data/lib/bioroebe/taxonomy/info/is_dna.rb +46 -36
  74. data/lib/bioroebe/taxonomy/interactive.rb +139 -95
  75. data/lib/bioroebe/taxonomy/menu.rb +27 -18
  76. data/lib/bioroebe/taxonomy/parse_fasta.rb +3 -1
  77. data/lib/bioroebe/taxonomy/shared.rb +1 -0
  78. data/lib/bioroebe/taxonomy/taxonomy.rb +1 -0
  79. data/lib/bioroebe/toplevel_methods/aminoacids_and_proteins.rb +31 -24
  80. data/lib/bioroebe/toplevel_methods/databases.rb +1 -1
  81. data/lib/bioroebe/toplevel_methods/fasta_and_fastq.rb +101 -63
  82. data/lib/bioroebe/toplevel_methods/misc.rb +17 -16
  83. data/lib/bioroebe/toplevel_methods/nucleotides.rb +22 -5
  84. data/lib/bioroebe/toplevel_methods/open_in_browser.rb +2 -0
  85. data/lib/bioroebe/toplevel_methods/palindromes.rb +1 -2
  86. data/lib/bioroebe/toplevel_methods/taxonomy.rb +2 -2
  87. data/lib/bioroebe/toplevel_methods/to_camelcase.rb +5 -0
  88. data/lib/bioroebe/utility_scripts/align_open_reading_frames.rb +1 -9
  89. data/lib/bioroebe/utility_scripts/check_for_mismatches/check_for_mismatches.rb +1 -9
  90. data/lib/bioroebe/utility_scripts/compacter.rb +1 -9
  91. data/lib/bioroebe/utility_scripts/compseq/compseq.rb +1 -9
  92. data/lib/bioroebe/utility_scripts/create_batch_entrez_file.rb +1 -9
  93. data/lib/bioroebe/utility_scripts/dot_alignment.rb +1 -9
  94. data/lib/bioroebe/utility_scripts/move_file_to_its_correct_location.rb +1 -4
  95. data/lib/bioroebe/utility_scripts/showorf/constants.rb +0 -5
  96. data/lib/bioroebe/utility_scripts/showorf/reset.rb +1 -4
  97. data/lib/bioroebe/version/version.rb +2 -2
  98. data/lib/bioroebe/www/embeddable_interface.rb +101 -52
  99. data/lib/bioroebe/www/sinatra/sinatra.rb +186 -70
  100. data/lib/bioroebe/yaml/aminoacids/amino_acids_long_name_to_one_letter.yml +2 -2
  101. data/lib/bioroebe/yaml/configuration/browser.yml +1 -1
  102. data/lib/bioroebe/yaml/genomes/README.md +3 -4
  103. data/lib/bioroebe/yaml/restriction_enzymes/restriction_enzymes.yml +3 -3
  104. metadata +32 -35
  105. data/doc/setup.rb +0 -1655
  106. data/lib/bioroebe/genbank/genbank_parser.rb +0 -291
  107. data/lib/bioroebe/shell/add.rb +0 -108
  108. data/lib/bioroebe/shell/assign.rb +0 -360
  109. data/lib/bioroebe/shell/chop_and_cut.rb +0 -281
  110. data/lib/bioroebe/shell/constants.rb +0 -166
  111. data/lib/bioroebe/shell/download.rb +0 -335
  112. data/lib/bioroebe/shell/enable_and_disable.rb +0 -158
  113. data/lib/bioroebe/shell/enzymes.rb +0 -310
  114. data/lib/bioroebe/shell/fasta.rb +0 -345
  115. data/lib/bioroebe/shell/gtk.rb +0 -76
  116. data/lib/bioroebe/shell/history.rb +0 -132
  117. data/lib/bioroebe/shell/initialize.rb +0 -217
  118. data/lib/bioroebe/shell/loop.rb +0 -74
  119. data/lib/bioroebe/shell/prompt.rb +0 -107
  120. data/lib/bioroebe/shell/random.rb +0 -289
  121. data/lib/bioroebe/shell/reset.rb +0 -335
  122. data/lib/bioroebe/shell/scan_and_parse.rb +0 -135
  123. data/lib/bioroebe/shell/search.rb +0 -337
  124. data/lib/bioroebe/shell/sequences.rb +0 -200
  125. data/lib/bioroebe/shell/show_report_and_display.rb +0 -2901
  126. data/lib/bioroebe/shell/startup.rb +0 -127
  127. data/lib/bioroebe/shell/taxonomy.rb +0 -14
  128. data/lib/bioroebe/shell/user_input.rb +0 -88
  129. data/lib/bioroebe/shell/xorg.rb +0 -45
data/doc/README.gen CHANGED
@@ -5,7 +5,7 @@ ADD_TIME_STAMP
5
5
 
6
6
  ## Bioroebe
7
7
 
8
- <img src="http://shevy.bplaced.net/BIOROEBE.png">
8
+ <img src="https://i.imgur.com/mAoP7AP.png">
9
9
  <img src="https://i.imgur.com/YqYxRBZ.png" style="margin: 4px; margin-left: 12px;"/>
10
10
  <img src="https://i.imgur.com/k7mMlg2.png" style="margin: 4px; margin-left: 12px;"/>
11
11
 
@@ -332,41 +332,6 @@ so I opted to go the yaml route. But if people want to use a hash
332
332
  instead, they can do so, too - see the <b>API</b> for codon tables
333
333
  lateron. Simply define your own constants and pass them to the
334
334
  appropriate methods.
335
-
336
- ## Support for other programming languages
337
-
338
- The main programming language for the bioroebe project is **ruby**.
339
- Ruby, from a language design point of view, is a great programming
340
- language - not necessarily all of ruby, but the subset that I use.
341
- It is very easy to quickly prototype ideas via ruby.
342
-
343
- However had, ruby is known to **not** be among the fastest programming
344
- languages about on this planet; so, it makes sense to use other
345
- languages too from this point of view. Additionally there are some
346
- software stacks in use in **other** programming languages, such as
347
- matplotlib and various more.
348
-
349
- Thus, it is important to **support other programming languages** as
350
- well, if there are useful libraries. The bioroebe project, after
351
- all, tries to be **practical**: it focuses on getting things done,
352
- no matter the language.
353
-
354
- This means that support for other programming languages can be
355
- found in this project as well, often using system() or similar
356
- functionality to tap into these other programming languages. Do
357
- not be surprised when that happens - the bioroebe project will
358
- also try to act as a **practical glue** towards functionality
359
- enabled via other projects. We want to get things done, no
360
- matter the programming language at hand!
361
-
362
- Whenever possible, though, the bioroebe project will try to be
363
- flexible in this regard, so ideally the same solution should
364
- work for many different programming languages.
365
-
366
- While Ruby is the primary language for this project, since as
367
- of 2021 I will try to officially support **java**, **jruby**
368
- and the **GraalVM**. This is on my TODO list, though - stay
369
- tuned for more updates in this regard.
370
335
 
371
336
  ## Readline support in the BioRoebe project
372
337
 
@@ -550,16 +515,16 @@ the DNA-to-Protein translation is somewhat simply kept as a
550
515
  Once you are inside a **running Bioshell**, you can do other **commands**
551
516
  such as this one here:
552
517
 
553
- random # ← This will generate a random DNA sequence.
518
+ random # ← This will generate a random DNA sequence. Each nucleotide has the same chance to be added.
554
519
 
555
520
  To **assign** a DNA sequence, do:
556
521
 
557
522
  assign ATAGGGCTTTT
558
523
 
559
- Note that since the year 2016, if you input a nucleotide sequence like
560
- the one above, without any other commands/words, then we will assume
524
+ Note that since as of the year <b>2016</b>, if you input a nucleotide sequence
525
+ like the one above, without any other commands/words, then we will assume
561
526
  that you did mean to do an assignment as-is anyway. The "assign" part
562
- then becomes superfluous.
527
+ then becomes superfluous and can be omitted.
563
528
 
564
529
  This is how this is simply done, by omitting the "assign" part of the
565
530
  above instruction altogether:
@@ -1070,18 +1035,18 @@ The text **banana** thus has the following suffixes:
1070
1035
 
1071
1036
  This subsection deals with some aspects of **HMMs**.
1072
1037
 
1073
- Why are HMMs useful in biology? They can be used to represent protein
1074
- families, for example (via pHMMs - profile hidden markov models).
1038
+ Why are HMMs useful in biology? They can be used to <b>represent protein
1039
+ families</b>, for example (via <b>pHMMs</b> - profile hidden markov models).
1075
1040
 
1076
1041
  Furthermore, they can show some bias in the mutation rate that can be
1077
1042
  observed. Different genomes are known to have different hotspots where
1078
- mutations are more likely to happen. These are examples where a HMM
1079
- may be useful.
1043
+ mutations are more likely to happen, for various reasons. These are
1044
+ examples where a HMM may be useful.
1080
1045
 
1081
- HMMs are usually based on the Shannon model where you assign different
1046
+ HMMs are usually based on the <b>Shannon model</b> where you assign different
1082
1047
  probabilities to "change" events. An example that was mentioned back
1083
- in 1948 was the english alphabet - some letters, and combinations of
1084
- letters, are more commonly seen. Shannon gave the example of "E"
1048
+ in <b>1948</b> was the english alphabet - some letters, and combinations
1049
+ of letters, are more commonly seen. Shannon gave the example of "E"
1085
1050
  versus "W", as shown in the following graph (a **finite state
1086
1051
  graph**):
1087
1052
 
@@ -1095,40 +1060,47 @@ DNA sequence, a 10-mer would be equivalent to **10 base pairs**.
1095
1060
  The individual transition states are based on an assumption of
1096
1061
  "randomness", but ensuring that these are truly random is not
1097
1062
  necessarily trivial. Computers do not really 'generate' true
1098
- randomness, at the least not when they are working solo. You
1099
- can even 'predict' some randomness here or there - see vulnerabilities
1100
- such as Specter or similar variants where software can read from
1101
- areas of the memory that should be inaccessible to them. Some
1102
- of this is based on co-predictions. For distributed computers,
1103
- you may often use random noise or decay of atoms as 'a source
1104
- of randomness''. For any DNA nucleotide sequence, we would
1105
- assume that each base pair has a 25% chance to exist at any
1106
- given position, but this is not necessarily true, for various
1107
- reasons. An interesting thought is ... why is ATP so important?
1108
- Yes, due to it being 'the energy currency in a cell' but .. why
1109
- is this ATP aka adenine? Why not GTP, aka guanine or any of
1110
- the other two nucleotides? I can not answer the question; there may
1111
- be many reasons, including differential chemical storage power as
1112
- well as mere random chance event in evolution, but for whatever
1063
+ randomness, at the least not when they are working solo, "on
1064
+ their own". You can even 'predict' some randomness here or there
1065
+ via various techniques - see vulnerabilities such as <b>Specter</b>
1066
+ or similar variants where software can read from areas of the
1067
+ memory that should be inaccessible to them. Some of this is based
1068
+ on co-predictions. For distributed computers, you may often use
1069
+ random noise or decay of atoms as 'a source of randomness'. For
1070
+ any DNA nucleotide sequence, we would assume that each base pair
1071
+ has a 25% chance to exist at any given position, but this is not
1072
+ necessarily true, again for various reasons.
1073
+
1074
+ An interesting thought is ... why is <b>ATP</b> so important?
1075
+ Yes, of course due to it being 'the energy currency in a cell' but ..
1076
+ why is this ATP, aka adenine? Why not GTP, aka guanine or any of
1077
+ the other two nucleotides? (GTP is used too, but why? Why not
1078
+ CTP and TTP?) I can not answer this question; there may
1079
+ be many reasons, including differential chemical storage power
1080
+ as well as mere random chance event in evolution, but for whatever
1113
1081
  the reason, you will not find a complete 25% percentage value
1114
1082
  for every given "slot" in DNA, depending on the organism.
1115
1083
 
1116
1084
  From a practical point of view, how can we approach Hidden Markov
1117
- Models?
1085
+ Models and use them?
1118
1086
 
1119
- Let's take the following sequence:
1087
+ Let's take the following simple sequence:
1120
1088
 
1121
1089
  ACGTACGC
1122
1090
 
1123
1091
  From this sequence we can see that the <b>3-mer</b> "ACG"
1124
1092
  is followed by either a T, or a C. Have a look at the sequence
1125
- to see if you can identify the two ACG subsequences there.
1093
+ again to see if you can identify the two ACG subsequences
1094
+ there. You can see one at the start, and the other one
1095
+ following a bit later, hence why we come to the conclusion
1096
+ that either a T or a C will follow this <b>3-mer</b>.
1126
1097
 
1127
- The probability of either T or C, thus, is 0.5 (50%);
1128
- for A and G to follow there is 0% so the latter two can
1129
- be ignored.
1098
+ The probability of either T or C to occur on <b>that</b>
1099
+ position, thus, is 0.5 (50%); for A and G to follow there
1100
+ is 0% so the latter two can be ignored.
1130
1101
 
1131
- Thus, we could use a ruby Hash as follows:
1102
+ Thus, we could use a ruby Hash as follows that should
1103
+ describe these probabilities:
1132
1104
 
1133
1105
  probabilities = {'T': 0.5, 'C': 0.5} # ignoring A and G here, but we could denote them via 0 as well
1134
1106
 
@@ -1214,34 +1186,6 @@ each edge.
1214
1186
  Parsimony assumes that substitutions are rare and that back-mutations
1215
1187
  do not occur.
1216
1188
 
1217
- ## Random stuff
1218
-
1219
- You can generate random DNA sequences in the shell:
1220
-
1221
- random dna 20
1222
- random dna 25
1223
- random dna 30
1224
-
1225
- This will generate random DNA sequences, with a length
1226
- of 20, 25, 30, respectively. This may not be very useful
1227
- but it was important that this functionality is made
1228
- available somewhere.
1229
-
1230
- You can also use some toplevel-methods to generate, e. g.
1231
- 20 random aminoacids:
1232
-
1233
- Bioroebe.random_aminoacid? 20 # => "UAVHYQQESWUYAOVESEIY"
1234
-
1235
- Note that there may exist other APIs within the Bioroebe project
1236
- that do the same as well.
1237
-
1238
- If you would like to use a ruby-gtk3 widget have a look
1239
- at **RandomSequence**, under **bioroebe/gtk3/random_sequence/**.
1240
- It works with aminoacids, DNA and RNA, and allows the user to
1241
- create random sequences. (If you need weighted randomness then
1242
- you currently have to use the commandline variant. Perhaps I may
1243
- add support into the GUI directly for this one day.)
1244
-
1245
1189
  ## Displaying the main sequence with delimiter characters
1246
1190
 
1247
1191
  From within the <b>bioshell</b>, you can use some alternative ways to
@@ -1483,24 +1427,9 @@ You can simulate this via the following API:
1483
1427
  Bioroebe.cleave_with_trypsin(sequence_goes_in_here)
1484
1428
  Bioroebe.cleave :with_trypsin, sequence_goes_in_here
1485
1429
 
1486
- Currently (July 2021) only support for Trypsin is included, but
1430
+ Currently (<b>July 2021</b>) only support for Trypsin is included, but
1487
1431
  in the long run the goal is to add as many digestive (peptide-bond
1488
1432
  cleaving) enzymes here as possible.
1489
-
1490
- ## Freezing the main sequence - and unfreezing it again
1491
-
1492
- You can **freeze** the BioShell, meaning that it will no longer allow
1493
- for the main sequence to be modified, via:
1494
-
1495
- freeze
1496
-
1497
- To unfreeze again, issue:
1498
-
1499
- unfreeze
1500
-
1501
- This functionality has been added because the shell may sometimes be
1502
- quite eager to change the main sequence, so we needed a way to disable
1503
- any further modifications (until "unfreeze" is issued that is).
1504
1433
 
1505
1434
  ## MUMmer
1506
1435
 
@@ -2711,18 +2640,6 @@ This may look as follows:
2711
2640
 
2712
2641
  <img src="https://i.imgur.com/gAZg8qG.png" style="margin: 1em; margin-left: 3em">
2713
2642
 
2714
- ## Obtaining a subsequence from a Bioroebe::Sequence object
2715
-
2716
- Say that you have the DNA sequence **ATGCATGCAAAA**.
2717
-
2718
- There are several ways how to obtain a subsequence from
2719
- this. One variant will be shown next, by making use of
2720
- the method called **.subseq()**.
2721
-
2722
- Example:
2723
-
2724
- seq = Bioroebe::Sequence.new("ATGCATGCAAAA"); seq.subseq(1,3) # => "ATG"
2725
-
2726
2643
  ## Bioroebe::Protein
2727
2644
 
2728
2645
  This class is a subclass of class **Bioroebe::Sequence**. The
@@ -2737,15 +2654,26 @@ functionality is also available in another method.
2737
2654
  For now keep this in mind; at some later point I may decide whether
2738
2655
  this class is to be kept or not.
2739
2656
 
2740
- ## Permanently disabling showing the startup-introduction of the Bioshell
2657
+ In July 2022 I noticed that the bio-gem has the following method:
2741
2658
 
2742
- If you do not want to see the start-up intro, you can try
2743
- any of the following:
2659
+ p Bio::AminoAcid['A'] # => "Ala"
2744
2660
 
2745
- bioshell --permanently-disable-startup-intro
2746
- bioshell --permanently-disable-startup-notice
2747
- bioshell --permanently-no-startup-intro
2748
- bioshell --permanently-no-startup-info
2661
+ I liked this functionality, but class Bioroebe::Protein already
2662
+ has a [] method which is used to instantiate a new
2663
+ instance of class Bioroebe::Protein. So, a toplevel method
2664
+ was added instead.
2665
+
2666
+ Usage example:
2667
+
2668
+ Bioroebe::Aminoacids.one_to_three('A') # => Ala
2669
+
2670
+ So this is the equivalent to what the bio-gem does, more or
2671
+ less.
2672
+
2673
+ If you want to find out the name of a one-letter aminoacid
2674
+ you can also use this method:
2675
+
2676
+ Bioroebe::Protein.name('A') # => "alanine"
2749
2677
 
2750
2678
  ## Decoding aminoacids
2751
2679
 
@@ -2931,27 +2859,6 @@ Note that presently (April 2020) not all of PROSITE may be supported
2931
2859
  via this regex, but in the long run the plan is to support all
2932
2860
  of PROSITE's regex expression.
2933
2861
 
2934
- ## Determining how many stop codons existing in a given sequence
2935
-
2936
- You can use **bin/n_stop_codons_in_this_sequence** to determine
2937
- how many stop codons exist in a given sequence at hand.
2938
-
2939
- Usage example from the commandline:
2940
-
2941
- n_stop_codons_in_this_sequence ATGACGTACGTCAGTCAGTGATAGTAA # => 4
2942
-
2943
- You can also separate these via a ' ' spacer on the commandline of
2944
- course:
2945
-
2946
- n_stop_codons_in_this_sequence ATG ACG TAC GTC AGT CAG TGA TAG TAA # => 4
2947
-
2948
- Internally this makes use of the method called
2949
- <b>Bioroebe.n_stop_codons_in_this_sequence?</b> or one of its
2950
- aliased names. Usage example for the method, just as in the
2951
- first example shown above:
2952
-
2953
- Bioroebe.n_stop_codons_in_this_sequence "ATGACGTACGTCAGTCAGTGATAGTAA" # => 4
2954
-
2955
2862
  ## AT and GC content
2956
2863
  ![alt text][cat1]
2957
2864
  [cat1]: https://i.imgur.com/Qmd7R0p.png
@@ -3173,47 +3080,45 @@ can try to use:
3173
3080
  On class Bioroebe::Sequence. More customizability may be added
3174
3081
  to that method in this regard, if users need this.
3175
3082
 
3176
- ## The Hydropathy index
3083
+ ### Obtaining a subsequence from a Bioroebe::Sequence object
3177
3084
 
3178
- You can display the hydropathy index for aminoacids from within
3179
- the **bioshell**.
3085
+ Say that you have the DNA sequence **ATGCATGCAAAA**.
3180
3086
 
3181
- Simply issue:
3087
+ There are several ways how to obtain a subsequence from
3088
+ this. One variant will be shown next, by making use of
3089
+ the method called **.subseq()**.
3182
3090
 
3183
- hydropathy?
3091
+ Example:
3184
3092
 
3185
- ## Generate DNA
3093
+ seq = Bioroebe::Sequence.new("ATGCATGCAAAA"); seq.subseq(1,3) # => "ATG"
3186
3094
 
3187
- You can generate random DNA strings by issuing the following
3188
- code:
3095
+ You can also randomize the sequence, via .randomize().
3189
3096
 
3190
- x = Bioroebe.random_dna 50 # => "AGACATCCGGCTTGGATACCTCATAAGTCATATCAGCATCGTCGGACATT"
3097
+ Example:
3191
3098
 
3192
- As can be seen in the example above, after the #, a String will be
3193
- returned representing that nucleotide sequence.
3099
+ x = Bioroebe::Sequence.new; x.randomize
3194
3100
 
3195
- The number given to .random_dna() tells the method how many nucleotides
3196
- should be generated.
3101
+ This is similar to the method in Bioruby here:
3197
3102
 
3198
- ## The GFF file format
3103
+ https://github.com/bioruby/bioruby/blob/master/lib/bio/sequence/common.rb#L243
3199
3104
 
3200
- From within the **bioshell** you can analyze .gff and .gff3 files,
3201
- such as by issuing the following command:
3105
+ ## The Hydropathy index
3202
3106
 
3203
- gff3? foobar.gff3
3107
+ You can display the hydropathy index for aminoacids from within
3108
+ the **bioshell**.
3204
3109
 
3205
- Evidently for this to work the file at hand has to exist.
3110
+ Simply issue:
3206
3111
 
3207
- ## Shuffling the DNA/RNA string in the bioshell
3112
+ hydropathy?
3208
3113
 
3209
- Via
3114
+ ## The GFF file format
3210
3115
 
3211
- shuffle
3116
+ From within the **bioshell** you can analyze .gff and .gff3 files,
3117
+ such as by issuing the following command:
3212
3118
 
3213
- you can randomly rearrange the main DNA/RNA string.
3119
+ gff3? foobar.gff3
3214
3120
 
3215
- This can be useful if you just wish to quickly "test" new
3216
- compositions of the same nucleotide.
3121
+ Evidently for this to work the file at hand has to exist.
3217
3122
 
3218
3123
  ## The NCBI Taxonomy database (the Taxonomy submodule of the Bioroebe project)
3219
3124
 
@@ -3350,47 +3255,6 @@ nucleotides by issuing:
3350
3255
 
3351
3256
  show_individual_weight_of_the_four_dna_nucleotides
3352
3257
 
3353
- ## Truncating output in the bioroebe-shell
3354
- ![alt text][cat1]
3355
- [cat1]: https://i.imgur.com/Qmd7R0p.png
3356
-
3357
- **DNA/RNA sequences** can become very long and then become
3358
- quite difficult to view, read and handle on the commandline.
3359
-
3360
- Normally the bioroebe shell will truncate output of DNA sequences
3361
- that are "too long". This is mostly done so that working with
3362
- very long sequences becomes a bit more convenient.
3363
-
3364
- Sometimes this can become an antifeature, though, so the user
3365
- must be able to toggle this at his or her own discretion.
3366
-
3367
- By default, the bioroebe-shell (bioshell) will always try
3368
- to truncate output, but you can toggle this behaviour by
3369
- issuing:
3370
-
3371
- do not truncate
3372
-
3373
- In theory, other "do not" actions are also supported, or will
3374
- be supported in the future; right now (Oct 2019) this is a bit
3375
- limited.
3376
-
3377
- From the toplevel, you can use this method:
3378
-
3379
- Bioroebe.do_not_truncate
3380
-
3381
- The above instruction will toggle the truncate behaviour
3382
- to not truncate, ever.
3383
-
3384
- If you need to do so within the bioshell, this is the way:
3385
-
3386
- no_truncate
3387
-
3388
- Or simply
3389
-
3390
- truncate
3391
-
3392
- This will toggle, like a switch.
3393
-
3394
3258
  ## Rosalind Challenges
3395
3259
  ![alt text][cat1]
3396
3260
  [cat1]: https://i.imgur.com/Qmd7R0p.png
@@ -3527,31 +3391,6 @@ investing more time into Rosalind. Let's focus on solving
3527
3391
  real, existing problems instead - at the least as far as
3528
3392
  the Bioroebe project is concerned.
3529
3393
 
3530
- ## Numbers as input in the bioshell
3531
- ![alt text][cat1]
3532
- [cat1]: https://i.imgur.com/Qmd7R0p.png
3533
-
3534
- You can input a number in the **BioShell** such as <b style="color: darkblue">3</b>.
3535
-
3536
- This will attempt to <b>display the first 3 nucleotides</b> of
3537
- the assigned **main sequence**. It will only work if you have
3538
- assigned a sequence prior to that, though.
3539
-
3540
- Examples:
3541
-
3542
- 3
3543
- 33
3544
- 15
3545
-
3546
- ## transeq
3547
- ![alt text][cat1]
3548
- [cat1]: https://i.imgur.com/Qmd7R0p.png
3549
-
3550
- You can convert a DNA sequence into an aminoacid sequence by
3551
- doing this:
3552
-
3553
- transeq
3554
-
3555
3394
  ## Align two different sequences
3556
3395
  ![alt text][cat1]
3557
3396
  [cat1]: https://i.imgur.com/Qmd7R0p.png
@@ -3863,22 +3702,6 @@ does not (yet?) have support for comparing two genomes to
3863
3702
  one another and generate a visual map indicating the findings
3864
3703
  there.
3865
3704
 
3866
- ## Do not create directories on startup of the shell
3867
-
3868
- By default the bioshell will try to create some directories
3869
- on startup. This may not always be desired by the user
3870
- though, so an option has to exist to disable this functionality.
3871
-
3872
- Internally the variable @internal_hash[:create_directories_on_startup_of_the_shell]
3873
- keeps track of whether directories on startup of the shell will
3874
- be created.
3875
-
3876
- To disable this behaviour on startup of the bioshell, try
3877
- something like this:
3878
-
3879
- bioshell --do-not-create-directories-on-startup
3880
- bioshell --do-not-create-directories
3881
-
3882
3705
  ## class Bioroebe::MoveFileToItsCorrectLocation
3883
3706
 
3884
3707
  This class will move a bio-file to its "correct" location, with respect
@@ -3921,15 +3744,6 @@ synonymous, aka aliases):
3921
3744
  ruler2 25 # ← use 25 characters per line
3922
3745
  ruler2 50 # ← use 50 characters per line
3923
3746
 
3924
- ## Generating a random nucleotide sequence based on frequencies
3925
-
3926
- If you ever need to generate a nucleotide frequency then you can use
3927
- the following method:
3928
-
3929
- Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies
3930
- Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies 100
3931
- Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies 500
3932
-
3933
3747
  ## The Mouse
3934
3748
 
3935
3749
  This subsection is about the **mouse**, in particular relevant
@@ -4047,57 +3861,24 @@ has". Genes in itself are not that well-defined, so they are not necessarily
4047
3861
  the primary means of complexity. Think of this more as an interactome,
4048
3862
  where RNAs play a major dynamic role as well.
4049
3863
 
4050
- ## Bioroebe::ProfilePattern
3864
+ ## class Bioroebe::DisplayOpenReadingFrames
4051
3865
 
4052
- This class can be used to generate nucleotide sequences that
4053
- are not quite "random". For example, to generate sequences
4054
- that may "simulate" a TATA box.
3866
+ **class Bioroebe::DisplayOpenReadingFrames**, created in **May 2020**,
3867
+ will eventually replace the older **class Bioroebe::ShowOrf**. Thus,
3868
+ **class Bioroebe::DisplayOpenReadingFrames** will have to remain quite
3869
+ flexible. It shall also support **sixpack** and **showorf** from the
3870
+ **Emboss online tools**. (In fact, supporting these two use cases
3871
+ was the original reason as to why this class has been created.)
4055
3872
 
4056
- The idea for this class is to be extended into allowing
4057
- HMMs (Hidden Markov Models) one day.
3873
+ Where does the code to this class reside?
4058
3874
 
4059
- Usage example:
3875
+ It can be found here:
4060
3876
 
4061
- _ = Bioroebe::ProfilePattern.new(ARGV, :do_not_run_yet)
4062
- _.generate_sequence_based_on_this_profile
3877
+ bioroebe/utility_scripts/display_open_reading_frames/
3878
+ require 'bioroebe/utility_scripts/display_open_reading_frames/display_open_reading_frames.rb'
4063
3879
 
4064
- Such a profile will encode the profile specifying the preferred sequence
4065
- letters for each position in a section of DNA. You have to provide
4066
- the Hash into the method generate_sequence_based_on_this_profile() -
4067
- or you use the default Hash, which is stored in the constant
4068
- called **PER_POSITION_HASH**.
4069
-
4070
- That profile should be a Hash, with keys pointing to A, T, C, G
4071
- and the values being an Array of likelihood chance there,
4072
- as a number, such as 140. These values are also called
4073
- **scores**. Each score contains a number for each position
4074
- that indicates how likely it is to find the given
4075
- nucleotide at that location.
4076
-
4077
- You can also use this class to generate a random DNA string,
4078
- similar to the method called
4079
- **Bioroebe.generate_random_dna_sequence()**. The difference
4080
- is that class ProfilePattern allows for a bit more fine-tuned
4081
- control. The class will likely be extended in the future too.
4082
-
4083
- ## class Bioroebe::DisplayOpenReadingFrames
4084
-
4085
- **class Bioroebe::DisplayOpenReadingFrames**, created in **May 2020**,
4086
- will eventually replace the older **class Bioroebe::ShowOrf**. Thus,
4087
- **class Bioroebe::DisplayOpenReadingFrames** will have to remain quite
4088
- flexible. It shall also support **sixpack** and **showorf** from the
4089
- **Emboss online tools**. (In fact, supporting these two use cases
4090
- was the original reason as to why this class has been created.)
4091
-
4092
- Where does the code to this class reside?
4093
-
4094
- It can be found here:
4095
-
4096
- bioroebe/utility_scripts/display_open_reading_frames/
4097
- require 'bioroebe/utility_scripts/display_open_reading_frames/display_open_reading_frames.rb'
4098
-
4099
- The display of this class is typically aimed for the commandline,
4100
- but it is planned to use the class on the www too (via sinatra).
3880
+ The display of this class is typically aimed for the commandline,
3881
+ but it is planned to use the class on the www too (via sinatra).
4101
3882
 
4102
3883
  Take note that this class also reports how many ORFs (open reading
4103
3884
  frames) have been found. The number displayed here differs from
@@ -4459,28 +4240,6 @@ the BioRoebe-Shell, then you can use either of the following:
4459
4240
 
4460
4241
  seq?
4461
4242
  seq_with_tab?
4462
-
4463
- ## Prompt (the shell prompt)
4464
-
4465
- You can set a <b>custom prompt</b>, via the keywords
4466
- "prompt" or "set_prompt".
4467
-
4468
- To display the <b>current working directory</b>, do:
4469
-
4470
- prompt pwd
4471
-
4472
- To revert to the old default again, do this:
4473
-
4474
- prompt REVERT
4475
- prompt revert
4476
- prompt DEFAULT
4477
- prompt default
4478
-
4479
- If you do not want to set any prompt, do:
4480
-
4481
- prompt none
4482
-
4483
-
4484
4243
 
4485
4244
  ## Leader and Trailer
4486
4245
 
@@ -4968,17 +4727,17 @@ For now, here is the list:
4968
4727
 
4969
4728
  ## The T-Bacteriophages
4970
4729
 
4971
- The following table only shows a short summary for the **T-phages**.
4730
+ The following table only shows a short summary for the <b>T-phages</b>.
4972
4731
 
4973
- name of the phage | Plaque size | phage-head diameter (nm) | tail diameter | latent period (in minutes) | Burst size
4974
- -------------------|--------------|---------------------------|----------------|----------------------------|-------------
4975
- T1 | medium | 50 | 150 x 15 | 13 | 180
4976
- T2 | small | 65 x 80 | 120 x 20 | 21 | 120
4977
- T3 | large | 45 | invisible | 13 | 300
4978
- T4 | small | 65 x 80 | 120 x 20 | 23.5 | 300
4979
- T5 | small | 100 | tiny | 40 | 300
4980
- T6 | small | 65 x 80 | 120 x 20 | 25.5 | 200-300
4981
- T7 | large | 45 | invisible | 13 | 300
4732
+ name of the phage | Plaque size | phage-head diameter (nm) | tail diameter | latent period (in minutes) | Burst size | n genes
4733
+ -------------------|--------------|---------------------------|----------------|----------------------------|-------------|------------
4734
+ T1 | medium | 50 | 150 x 15 | 13 | 180 |
4735
+ T2 | small | 65 x 80 | 120 x 20 | 21 | 120 |
4736
+ T3 | large | 45 | invisible | 13 | 300 |
4737
+ T4 | small | 65 x 80 | 120 x 20 | 23.5 | 300 | 300
4738
+ T5 | small | 100 | tiny | 40 | 300 |
4739
+ T6 | small | 65 x 80 | 120 x 20 | 25.5 | 200-300 |
4740
+ T7 | large | 45 | invisible | 13 | 300 |
4982
4741
 
4983
4742
  The next table will show some phage genomes.
4984
4743
 
@@ -5389,215 +5148,6 @@ that format.
5389
5148
  Presently (**May 2020**) there is no support for the mmCIF format
5390
5149
  in the Bioroebe project, but this will eventually change.
5391
5150
 
5392
- ## Working with PDB files (.pdb)
5393
- ![alt text][cat1]
5394
- [cat1]: https://i.imgur.com/Qmd7R0p.png
5395
-
5396
- The **PDB**, founded in the year **1971**, holds lots of **atomic
5397
- structures of proteins**.
5398
-
5399
- In **July 2016** it contained **121000 structures**.
5400
-
5401
- In **February 2018** it contained **~124000 structures**
5402
- (from X-ray crystallography), and about **~12000 NMR
5403
- structures**. <b>NMR</b> is limited to about <b>350 amino
5404
- acids maximum length</b>, give or take.
5405
-
5406
- In **April 2020** the PDB contained **163141 structures**.
5407
-
5408
- We can see that more and more structures are available
5409
- nowadays - a trend that will most likely continue or
5410
- even accelerate. (Let's hope the quality also remains
5411
- high.)
5412
-
5413
- A typical .pdb file contains entries such as this:
5414
-
5415
- RTyp Num Atm Res Ch ResN X Y Z Occ Temp PDB Line
5416
- ATOM 1 N ASP L 1 4.060 7.307 5.186 1.00 51.58 1FDL 93
5417
- ATOM 2 CA ASP L 1 4.042 7.776 6.553 1.00 48.05 1FDL 94
5418
- ATOM 3 N VAL A 25 32.433 16.336 57.540 1.00 11.92 A1 N
5419
- ATOM 4 CA VAL A 25 31.132 16.439 58.160 1.00 11.85 A1 C
5420
- ATOM 5 C VAL A 25 30.447 15.105 58.363 1.00 12.34 A1 C
5421
-
5422
- (Note the first line; **RTyp** is just an explanation for the ATOM
5423
- entries below that line).
5424
-
5425
- The sequence starts from the N-terminal residue for proteins; see
5426
- the <b>Atm</b> entry at <b>Num 1</b>.
5427
-
5428
- The **meaning of these entries** is as follows:
5429
-
5430
- 1) RTyp: Record Type
5431
- 2) Num: Serial number of the atom. Each atom has a unique serial number.
5432
- 3) Atm: Atom name (in IUPAC format).
5433
- 4) Res: Residue name (IUPAC format).
5434
- 5) Ch: Chain to which the atom belongs (in this case, L for light chain of an antibody).
5435
- 6) ResN: Residue sequence number. This will be incremental, e.g. 1, 2, 3, 4 and so forth.
5436
- 7,8,9) X, Y, Z: Cartesian coordinates specifying atomic position in space.
5437
- 10) Occ: Occupancy factor
5438
- 11) Temp: Temperature factor (atoms disordered in the crystal have high
5439
- temperature factors; they are "wobbly" with a high factor.
5440
- This is also called the B-factor).
5441
- 12) PDB: The PDB data file unique identifier.
5442
- 13) Line: Line (record) number in the data file.
5443
-
5444
- Typically the entry on the most right area, the last one, specifies
5445
- which atom it is. A **H** stands for a hydrogen atom; the other atoms
5446
- are "heavy" atoms (heavier than hydrogen most definitely).
5447
-
5448
- Most .pdb files will contain **SEQRES** entries. These entries will list
5449
- the primary sequence of the polymeric molecules present in the entry.
5450
- You can notice this by looking at the standard 3-character code
5451
- used by SEQRES here, for the canonical amino acids. So, for instance,
5452
- the amino acids that will be mentioned in a SEQRES entry are
5453
- ALA, CYS, ASP, GLU, PHE, GLY, HIS, ILE, LYS, LEU, MET, ASN,
5454
- PRO, GLN, ARG, SER, THR, VAL, TRP and TYR. You can use the
5455
- method **Bioroebe.three_to_one()** to convert back to the
5456
- one-letter chain such as follows:
5457
-
5458
- Bioroebe.three_to_one('PHE') # => "F"
5459
-
5460
- The data in a .pdb file need not necessarily only be a protein, with
5461
- a specific aminoacid sequence. It may also include DNA. An example
5462
- for such a molecule is
5463
- <b><a href="http://rcsb.org/pdb/explore/explore.do?structureId=2DGC">2dgc</a></b>,
5464
- which includes a protein chain and a DNA chain.
5465
-
5466
- As far as the **bioroebe project** is concerned, you can parse .pdb files
5467
- via the following class:
5468
-
5469
- Bioroebe::ParsePdbFile.new
5470
- Bioroebe::ParsePdbFile.new(path_to_the_pdb_file_here)
5471
- Bioroebe::ParsePdbFile.new('/foo/bar/ack.pdb')
5472
-
5473
- This class also allows some shortcuts for integrated .pdb files,
5474
- that is files that are bundled with the bioroebe project:
5475
-
5476
- Bioroebe::ParsePdbFile.new ':1fat'
5477
-
5478
- This requires a String because ruby symbols may not start with
5479
- a number. Note that this also works through the commandline,
5480
- such as:
5481
-
5482
- parse_pdb_file :1fat
5483
-
5484
- A shell such as bash does not understand ruby symbols, so instead
5485
- a string will be passed in, being :1fat. The ParsePdbFile will
5486
- handle this correctly internally.
5487
-
5488
- Note that a small bug was fixed in the file parse_pdb_file.rb;
5489
- some entries were skipped due to an erroneous loop in the ruby
5490
- file. This was corrected in **May 2020**.
5491
-
5492
- In **March 2021** the ability to use entries such as ':1fat'
5493
- was removed again; the code remains though. The reason why
5494
- this was removed was that the .pdb files are quite large,
5495
- so distributing them via the bioroebe project makes no real
5496
- sense. Consider simply downloading the .pdb files; you
5497
- can use this from the bioshell or via something
5498
- like:
5499
-
5500
- pdb 5TIM
5501
-
5502
- Note that you can also return the aminoacid-sequence from a
5503
- .pdb file directly, as of **May 2020**.
5504
-
5505
- Example for this:
5506
-
5507
- Bioroebe.return_aminoacid_sequence_from_this_pdb_file "1VII.pdb" # => "MLSDEDFKAVFGMTRSAFANLPLWKQQNLKKEKGLF"
5508
-
5509
- The first argument should be **the path to the (local)
5510
- .pdb file at hand**. (In theory support for remote .pdb
5511
- files could also be added easily, but right now this
5512
- is not possible, so you have to download it first.)
5513
-
5514
- The **specification for .pdb files** can be read at the following
5515
- two remote resources:
5516
-
5517
- http://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html
5518
- http://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#ATOM
5519
-
5520
- Note that the parse_pdb_file.rb can also do some additional
5521
- things, such as calculating the maximum distance between
5522
- atoms in that file, via the method
5523
- **.try_to_determine_the_max_distance_between_the_atoms_in_this_protein()**.
5524
-
5525
- If you wish to report the secondary structures from a given .pdb file
5526
- then you can use the following class:
5527
-
5528
- require 'bioroebe/pdb/report_secondary_structures_from_this_pdb_file.rb'
5529
-
5530
- Bioroebe::ReportSecondaryStructuresFromThisPdbFile.new
5531
- Bioroebe::ReportSecondaryStructuresFromThisPdbFile.new('foobar.pdb')
5532
-
5533
- If you wish to obtain the FASTA sequence of a particular remote
5534
- .pdb file then you can use this API:
5535
-
5536
- x = Bioroebe.return_fasta_sequence_from_this_pdb_file "2bts" # => "MLSDEDFKAVFGMTRSAFANLPLWKQQNLKKEKGLF"
5537
-
5538
- Keep in mind that this is the FASTA sequence; the .pdb file itself
5539
- has another format, and contains a lot more information, such as
5540
- the various ATOM entries.
5541
-
5542
- As of **June 2020** the command **fetch** also works from
5543
- within the Bioshell, similar to how pymol **works**. This allows
5544
- us to quickly download a remote .pdb file.
5545
-
5546
- fetch 2BTS
5547
-
5548
- You can also use the following toplevel-API to download a remote
5549
- .pdb file:
5550
-
5551
- Bioroebe.download_this_pdb
5552
- Bioroebe.download_this_pdb '355D'
5553
- Bioroebe.download_this_pdb '1K4R' # This is the Dengue Virus
5554
-
5555
- Note that this will be automatically moved to the "correct" default
5556
- position in the bioroebe-project, under the **pdb/** subdirectory.
5557
-
5558
- You can also invoke this script from the commandline via
5559
- **bin/download_this_pdb**, like in this way:
5560
-
5561
- download_this_pdb 355D
5562
-
5563
- This works with several .pdb files in one go as well:
5564
-
5565
- download_this_pdb 1NR6 2F9Q 3TDA 2HI4 2V0M
5566
-
5567
- They would all be downloaded one after the other. Be aware that
5568
- this will overwrite the old .pdb files on that position, so
5569
- if you don't want this, I recommend to do a backup on the
5570
- **pdb/** subdirectory before invoking the above call.
5571
-
5572
- You can also turn the FASTA sequence stored in a .pdb file into
5573
- a .fasta file, via **--create-fasta-file**.
5574
-
5575
- Usage examples:
5576
-
5577
- parsedb 1NR6 --create-fasta-file
5578
- parsedb 2F9Q --create-fasta-file
5579
- parsedb 3TDA --create-fasta-file
5580
- parsedb 2HI4 --create-fasta-file
5581
- parsedb 2V0M --create-fasta-file
5582
-
5583
- So if you have a file called <b>1NR6.pdb</b> and you use
5584
- the first input, a .fasta file will be created. If such
5585
- a .pdb file does not exist then this will not work, so
5586
- make sure to download the .pdb file before invoking
5587
- this commandline-flag.
5588
-
5589
- Last but not least, the following table shall document the
5590
- PDB format - it is not yet complete, but it is intended
5591
- to add the remaining datasets eventually:
5592
-
5593
- Record Name Describes
5594
- MODRES Modifications to standard residues
5595
- HET Nonstandard residues (as well as ligands, ions and water)
5596
- HETNAM Full chemical name of the residue
5597
- HETSYM Synonyms for the residue
5598
- FORMUL Chemical formula of the residue
5599
- KEYWDS specifies keywords, such as "FK506 BINDING PROTEIN, FKBP12, CIS-TRANS PROLYL-ISOMERASE, ROTAMASE"
5600
-
5601
5151
  ## Sugars and glyco-patterns
5602
5152
 
5603
5153
  I am currently having to do an assignment related to glyco-patterns
@@ -5761,6 +5311,9 @@ like this:
5761
5311
 
5762
5312
  <img src="https://i.imgur.com/vr2kEBz.png" style="margin: 1em; margin-left: 3em">
5763
5313
 
5314
+ As of <b>July 2022</b> invalid amino acids will be automatically
5315
+ filtered away before being assigned to the input.
5316
+
5764
5317
  ## Colourizing hydrophilic and hydrophobic aminoacids on the commandline
5765
5318
 
5766
5319
  Via class **Bioroebe::ColourizeHydrophilicAndHydrophobicAminoacids** you
@@ -5774,35 +5327,36 @@ Example output for this:
5774
5327
 
5775
5328
  This subsection contains some information about proteases.
5776
5329
 
5777
- trypsin:
5330
+ Trypsin:
5778
5331
  https://en.wikipedia.org/wiki/Trypsin
5779
- cuts at: Trypsin cuts peptide chains mainly at the carboxyl
5332
+ <b>cuts at</b>: Trypsin cuts peptide chains mainly at the carboxyl
5780
5333
  side of the amino acids lysine or arginine.
5781
5334
 
5782
- chymotrypsin:
5335
+ Chymotrypsin:
5783
5336
  https://en.wikipedia.org/wiki/Chymotrypsin
5784
- cuts at: Chymotrypsin preferentially cleaves peptide amide
5337
+ <b>cuts at</b>: Chymotrypsin preferentially cleaves peptide amide
5785
5338
  bonds where the side chain of the amino acid N-terminal
5786
- to the scissile amide bond is a large hydrophobic amino
5787
- acid (tyrosine, tryptophan, and phenylalanine).
5339
+ to the scissile amide bond is <b>a large hydrophobic amino</b>
5340
+ acid (specifically: tyrosine, tryptophan, and phenylalanine).
5341
+ Chymotrypsin will cleave proteins on the <b>carboxyl side</b>
5342
+ of aromatic or large hydrophobic amino acids.
5788
5343
 
5789
- thrombin:
5344
+ Thrombin:
5790
5345
  https://en.wikipedia.org/wiki/Thrombin
5791
- cuts at: Thrombin acts as a serine protease that converts
5346
+ <b>cuts at</b>: Thrombin acts as a serine protease that converts
5792
5347
  soluble fibrinogen into insoluble strands of fibrin. It
5793
5348
  catalyzes the hydrolysis of <b>Arg-Gly</b> bonds in
5794
5349
  particular peptide sequences only.
5795
5350
 
5796
- plasmin:
5351
+ Plasmin:
5797
5352
  https://en.wikipedia.org/wiki/Plasmin
5798
- cuts at: Plasmin is a serine protease.
5353
+ <b>cuts at</b>: Plasmin is a serine protease.
5799
5354
 
5800
- papain:
5355
+ Papain:
5801
5356
  https://en.wikipedia.org/wiki/Papain
5802
- cuts at: Papain prefers to cleave after an
5803
- arginine or lysine preceded by a hydrophobic
5804
- unit (Ala, Val, Leu, Ile, Phe, Trp, Tyr) and
5805
- not followed by a valine.
5357
+ <b>cuts at</b>: Papain prefers to cleave after an arginine or
5358
+ lysine preceded by a hydrophobic unit (Ala, Val, Leu, Ile,
5359
+ Phe, Trp, Tyr) and not followed by a valine.
5806
5360
 
5807
5361
  factor Xa:
5808
5362
 
@@ -5814,8 +5368,8 @@ Some proteins may permanently reside in the lumen of the
5814
5368
  Often such proteins will have a special signal sequence attached
5815
5369
  to their **C-terminal part**, such as **KDEL** (Lys-Asp-Glu-Leu).
5816
5370
 
5817
- KDEL is not the only signal that may be used, though. Some species
5818
- may use different signals, such as:
5371
+ <b>KDEL</b> is not the only signal that may be used, though. Some
5372
+ species may use different signals, such as:
5819
5373
 
5820
5374
  aminoacids | species
5821
5375
  -------------|------------------------------------------------------------
@@ -5825,8 +5379,9 @@ may use different signals, such as:
5825
5379
  ADEL | Schizosaccharomyces pombe (fission yeast)
5826
5380
  SDEL | Plasmodium falciparum
5827
5381
 
5828
- If you work with the bioshell then you can simply use this method
5829
- to query whether the given aminoacid sequence has a KDEL sequence:
5382
+ If you work with the <b>bioshell</b> then you can simply use this
5383
+ method to query whether the given aminoacid sequence has a KDEL
5384
+ sequence:
5830
5385
 
5831
5386
  KDEL?
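The check behind such a query can be sketched in plain Ruby. This is a minimal sketch only: `retention_signal?` is a hypothetical helper name and not part of the Bioroebe API, and it assumes the sequence is given in one-letter code with the signal at the very end (the C-terminal part).

```ruby
# Hypothetical helper (not the Bioroebe implementation): returns true if
# the aminoacid sequence ends with one of the given retention signals.
def retention_signal?(sequence, signals = ['KDEL'])
  cleaned = sequence.to_s.strip.upcase
  signals.any? { |signal| cleaned.end_with?(signal) }
end

puts retention_signal?('MKWVTFISLLKDEL')                     # true
puts retention_signal?('MKWVTFISLLSDEL')                     # false
puts retention_signal?('MKWVTFISLLSDEL', %w[KDEL ADEL SDEL]) # true
```

Passing the list of signals as an argument keeps the species-specific variants from the table above out of the method body.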
5832
5387
 
@@ -6237,8 +5792,6 @@ Next, do something such as this:
6237
5792
  This will show the distribution of the oligos.
6238
5793
 
6239
5794
  ## Number of chromosomes in different species
6240
- ![alt text][cat1]
6241
- [cat1]: https://i.imgur.com/Qmd7R0p.png
6242
5795
 
6243
5796
  Name of the organism | Latin name | Number of chromosomes
6244
5797
  ---------------------|--------------|-----------------------
@@ -6316,112 +5869,6 @@ So this is what would be returned:
6316
5869
 
6317
5870
  Bioroebe::DetectMinimalCodon[["TTT", "TTC"]] # => ["TTY"]
6318
5871
 
6319
- ## Codon Usage
6320
-
6321
- This **paragraph** deals with some aspects of **codon usage** in different
6322
- organisms.
6323
-
6324
- Let us first define the term <b>codon usage</b>. In order to do so,
6325
- we also have to define what a <b>codon</b> is, so let's start with that.
6326
-
6327
- A <span style="color: darkgreen; font-weight: bold">codon</span> is
6328
- essentially the basic code used in DNA to denote which particular
6329
- **aminoacid** corresponds to these (three) nucleotide base pairs.
6330
- A codon is thus **a series of three nucleotides**, also called
6331
- a <b>triplet</b>.
6332
-
6333
- When we use the term <b>base pairs</b>, we refer to **double-stranded DNA**,
6334
- abbreviated as <b>dsDNA</b>. The codon is, however, only found
6335
- in a single stranded molecule, even within dsDNA. Since some parts of
6336
- a **dsDNA** in any given genome gives rise to a, more or less, complementary
6337
- copy into **mRNA**, the codons that are actually used, are found in the
6338
- corresponding mRNA. (Remember that mRNA differs from DNA in that there
6339
- will be Uracil rather than Thymine; otherwise it is the same, sequence-wise.
6340
- Of course it uses another sugar (Ribose), but remember we are here mostly
6341
- interested in the **information-containing part**, not the full chemical
6342
- structure.)
6343
-
6344
- The codon is thus found on the mRNA and since mRNA is mostly
6345
- single-stranded, the codon is a component of the mRNA. It is
6346
- where the two subunits of the ribosome are assembled (or more
6347
- accurately, the smaller subunit scans along the mRNA until it
6348
- detects a start codon). Mind you, this subsection will not go into
6349
- all relevant details, so just keep in mind that the codon is the
6350
- part that will eventually be "translated" at the ribosome into
6351
- a corresponding aminoacid, excluding stop codons at the end.
6352
-
6353
- Now - different organisms use **different frequencies of codons**.
6354
- **Codon usage** thus describes the fact that many proteins in
6355
- these different organisms make use of certain codons with a
6356
- **substantially higher frequency than other codons**. We can
6357
- use statistics to infer this on a global (proteome) level
6358
- too.
6359
-
6360
- Remember that the genetic code is **degenerate**, meaning that
6361
- you have a few aminoacids that are encoded only by one codon
6362
- (<b>Tryptophan</b> and <b>Methionine</b>), whereas the other
6363
- aminoacids are encoded by more than one codon - thus, at the
6364
- very least two codons. Note that the latter codons, if they
6365
- code for the **same** aminoacid, are also called <b>synonymous
6366
- codons</b>.
6367
-
6368
- This means that if you have any given aminoacid chain, you can have
6369
- several different sequences (and codons in these sequences, which
6370
- ultimately means that you can have different DNA sequences code for
6371
- the very same aminoacid chain).
6372
-
6373
- Usually the third base of a codon has the least influence on
6374
- codon meaning. This is also called <b>wobbling</b> - since
6375
- the anticodon loop on the tRNA is in the reverse direction,
6376
- and the wobble position refers to the tRNA, this means that
6377
- the wobble-position is at the 5'-end of the tRNA anticodon.
6378
-
6379
- Now a few words about functionality related to codons and codon
6380
- usage in the Bioroebe project.
6381
-
6382
- Say that you have a long DNA sequence; let's pick a sample
6383
- for now, such as:
6384
-
6385
- ATGGGCGGGGTGATGGCAATGATGCCCCCGATGATG
6386
-
6387
- You can analyze the codons used via class **ShowCodonUsage**:
6388
-
6389
- show_codon_usage ATGGGCGGGGTGATGGCAATGATGCCCCCGATGATG
6390
-
6391
- This class can be found at <b>bioroebe/codons/show_codon_usage.rb</b>.
6392
- It will report the top 5 codons in use and also output the
6393
- frequency hash on the commandline.
6394
-
6395
- You can use this from ruby too, via this toplevel method:
6396
-
6397
- Bioroebe.codon_frequencies_of_this_sequence(ARGV)
6398
-
6399
- If you want to look at the actual codon frequencies used
6400
- by different organisms, have a look here:
6401
-
6402
- http://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=11076&aa=9&style=N
6403
-
6404
- This is an excellent resource.
6405
-
6406
- ## Determining the frequencies of aminoacids in a given aminoacid (protein) sequence
6407
-
6408
- If you quickly wish to determine the aminoacid composition, as a
6409
- Hash, you can use **bin/aminoacid_frequencies**.
6410
-
6411
- Example from the commandline for this:
6412
-
6413
- aminoacid_frequencies MVTDEGAIYFTKDAARNWKAAVEETVSATLNRTVSSGITGASYYTGTFST
6414
-
6415
- Example from within bioroebe itself (and thus ruby):
6416
-
6417
- require 'bioroebe/frequencies.rb'
6418
-
6419
- Bioroebe.aminoacid_frequencies('MVTDEGAIYFTKDAARNWKAAVEETVSATLNRTVSSGITGASYYTGTFST')
6420
-
6421
- The latter will return a Hash that you can then further make use for, such as:
6422
-
6423
- {"M"=>1, "V"=>4, "T"=>9, "D"=>2, "E"=>3, "G"=>4, "A"=>7, "I"=>2, "Y"=>3, "F"=>2, "K"=>2, "R"=>2, "N"=>2, "W"=>1, "S"=>5, "L"=>1}
6424
-
6425
5872
  ## The Levenshtein distance
6426
5873
 
6427
5874
  The <b>Levenshtein distance</b> - also called a '**string metric**' - was formulated
@@ -6839,6 +6286,34 @@ change A: teal or C: slateblue to some other colour; these are HTML
6839
6286
  colours, so it is recommended to use the names of these HTML
6840
6287
  colours).
6841
6288
 
6289
+ In <b>July 2022</b> the method <b>Bioroebe.colourize_this_fasta_sequence</b>
6290
+ was extended slightly. You can now attach a "ruler" to the output, that
6291
+ is a numbered series that shows the nucleotide position, on the commandline.
6292
+
6293
+ Example for this:
6294
+
6295
+ puts Bioroebe.colourize_this_fasta_sequence('ATGAAATCGCGCGTGCCGCGCGCGC'\
6296
+ 'GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCTGCGCGCGCGCGCGCGCGCGCG'\
6297
+ 'TGCCGCGCGCAGGCGGCGGCGGCGGCGGCGGCG'
6298
+ ) { :with_ruler }
6299
+
6300
+ By default this will use a white colour on black background. If you want to
6301
+ modify the foreground colour you can pass the colour name to the method,
6302
+ such as via:
6303
+
6304
+ puts Bioroebe.colourize_this_fasta_sequence('ATGAAATCGCGCGTGCCGCGCGCGC'\
6305
+ 'GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCTGCGCGCGCGCGCGCGCGCGCG'\
6306
+ 'TGCCGCGCGCAGGCGGCGGCGGCGGCGGCGGCG'
6307
+ ) { :with_ruler_steelblue_colour }
6308
+
6309
+ The following image shows how this can be used on the commandline:
6310
+
6311
+ <img src="https://i.imgur.com/ucVEVnK.png" style="margin: 1em; border: 3px solid black">
6312
+
6313
+ At a later time this may be extended to allow for use in a webpage,
6314
+ that is to embed these strings directly into HTML or .php or
6315
+ .cgi.
6316
+
6842
6317
  If you wish to show a **chunked display** of the dataset (nucleotides
6843
6318
  normally) then you can use the following API:
6844
6319
 
@@ -7362,16 +6837,6 @@ This would notify the bioshell that only nucleotides from position
7362
6837
  51 to (including) position 3251 will be colourized, when doing another
7363
6838
  "ORF?" invocation.
7364
6839
 
7365
- ## Longest substring
7366
-
7367
- Within the Bioroebe::Shell you can determine the longest substring,
7368
- including gaps, like so:
7369
-
7370
- longest_substring? ATTATTGTT | ATTATTCTT
7371
-
7372
- Note that this will make use of the diff-lcs gem, which uses
7373
- the McIlroy-Hunt algorithm.
7374
-
7375
6840
  ## Restriction Enzymes
7376
6841
 
7377
6842
  This **subsection** will eventually be expanded to explain various things about
@@ -8730,6 +8195,22 @@ The images that can be generated via this may look as follows:
8730
8195
 
8731
8196
  <img src="https://i.imgur.com/fWwD1fj.png" style="margin: 1em; margin-left: 2em">
8732
8197
 
8198
+ Let's look at another example.
8199
+
8200
+ Say you input the following sequences there:
8201
+
8202
+ AGVV
8203
+ AGVV
8204
+ AGVV
8205
+ AGVV
8206
+ AGGV
8207
+ AGGV
8208
+ AGGV
8209
+
8210
+ The resulting image that is generated is:
8211
+
8212
+ <img src="https://i.imgur.com/3wWApIQ.png" style="margin: 1em; margin-left: 2em">
8213
+
8733
8214
  ## The Kozak Sequence
8734
8215
 
8735
8216
  The ribosome usually scans for an **AUG** codon. But there are
@@ -8869,85 +8350,6 @@ Usage Example:
8869
8350
 
8870
8351
  pfasta insulin_mRNA.fasta --toprotein
8871
8352
 
8872
- ## Determining the codon frequencies from the commandline
8873
-
8874
- In April 2022 I noticed that one use case is to show the codon
8875
- frequencies of a given sequence - typically a nucleotide sequence.
8876
-
8877
- For aminoacids there already was an executable, at **bin/aminoacid_frequencies**.
8878
- So, following that logic, a new executable was added at
8879
- **bin/codon_frequency**. This will show the Hash of the codon
8880
- frequencies, as a String, on the commandline.
8881
-
8882
- Usage example:
8883
-
8884
- codon_frequency ATTCGTACGATCGACTGACTGACAGTCATTCGT
8885
-
8886
- The output of this would be the following:
8887
-
8888
- AUU: 2
8889
- CGU: 2
8890
- ACG: 1
8891
- AUC: 1
8892
- GAC: 1
8893
- UGA: 1
8894
- CUG: 1
8895
- ACA: 1
8896
- GUC: 1
8897
-
8898
- ## Showing the codon frequency via countcodon
8899
-
8900
- https://www.kazusa.or.jp/codon/countcodon.html offers a rather useful
8901
- functionality via a simple web-interface, in that you can pass in a mRNA
8902
- sequence, and it will then show the codon frequency/likelihood of that
8903
- sequence - all codons in that sequence, that is. This can be extended
8904
- to all protein-coding genes in a given genome, and will thus be useful
8905
- for a researcher who may be interested in determining the codon frequency
8906
- in general, across all genes in that given genome.
8907
-
8908
- You can test it with an input sequence. For instance, the following
8909
- sequence:
8910
-
8911
- ATTCGTACGATCGACTGACTGACAGTCATTCGTAGTACGATCGACTGACTGACAGTCATTCGTACGATCGACTGACTGACAAGTCATTCGTACGATCGACTGACTTGACAGTCATAA
8912
-
8913
- Would yield this result:
8914
-
8915
- fields: [triplet] [frequency: per thousand] ([number])
8916
-
8917
- UUU 0.0( 0) UCU 0.0( 0) UAU 0.0( 0) UGU 0.0( 0)
8918
- UUC 0.0( 0) UCC 0.0( 0) UAC 25.6( 1) UGC 0.0( 0)
8919
- UUA 0.0( 0) UCA 25.6( 1) UAA 25.6( 1) UGA102.6( 4)
8920
- UUG 0.0( 0) UCG 25.6( 1) UAG 0.0( 0) UGG 0.0( 0)
8921
-
8922
- CUU 0.0( 0) CCU 0.0( 0) CAU 25.6( 1) CGU 76.9( 3)
8923
- CUC 0.0( 0) CCC 0.0( 0) CAC 0.0( 0) CGC 0.0( 0)
8924
- CUA 0.0( 0) CCA 0.0( 0) CAA 0.0( 0) CGA 25.6( 1)
8925
- CUG102.6( 4) CCG 0.0( 0) CAG 25.6( 1) CGG 0.0( 0)
8926
-
8927
- AUU 76.9( 3) ACU 25.6( 1) AAU 0.0( 0) AGU 51.3( 2)
8928
- AUC 76.9( 3) ACC 0.0( 0) AAC 0.0( 0) AGC 0.0( 0)
8929
- AUA 0.0( 0) ACA 76.9( 3) AAA 0.0( 0) AGA 0.0( 0)
8930
- AUG 0.0( 0) ACG 76.9( 3) AAG 0.0( 0) AGG 0.0( 0)
8931
-
8932
- GUU 0.0( 0) GCU 0.0( 0) GAU 25.6( 1) GGU 0.0( 0)
8933
- GUC 51.3( 2) GCC 0.0( 0) GAC 76.9( 3) GGC 0.0( 0)
8934
- GUA 0.0( 0) GCA 0.0( 0) GAA 0.0( 0) GGA 0.0( 0)
8935
- GUG 0.0( 0) GCG 0.0( 0) GAG 0.0( 0) GGG 0.0( 0)
8936
-
8937
- At any rate, the individual functionality for that is also available
8938
- within the Bioroebe project as of **April 2022**.
8939
-
8940
- The method that does so is:
8941
-
8942
- Bioroebe.frequency_per_thousand
8943
- Bioroebe.frequency_per_thousand('ATTCGTACGATCGACTGACTGACAGTCATTCGTAGTACGATCGACTGACTGACAGTCATTCGTACGATCGACTGACTGACAAGTCATTCGTACGATCGACTGACTTGACAGTCATAA') # Usage example here.
8944
-
8945
- At a later time sinatra-bindings as well as ruby-gtk3 bindings will
8946
- be added, and possibly ruby-libui bindings as well, for windows
8947
- support. What is missing is support for different codon tables in
8948
- different species, but that may be added at a later time as well
8949
- - for now it seemed more important to offer the functionality.
8950
-
8951
8353
  ## class Bioroebe::Protein
8952
8354
 
8953
8355
  **class Bioroebe::Protein** can be used to store a protein sequence.
@@ -9180,6 +8582,1036 @@ time being it is what it is. At a later point in time test cases
9180
8582
  may be added to check whether it performs correctly or whether it
9181
8583
  does not.
9182
8584
 
8585
+ The other rules, also published in 2004, are the Reynolds rules. Code
8586
+ support was added to the Bioroebe project in <b>June 2022</b>, but
8587
+ it was not tested yet, so the implementation may be incorrect.
8588
+
8589
+ ## The Bioroebe::Shell interface
8590
+
8591
+ The following subsection specifically handles information
8592
+ pertaining to the <b>Bioroebe::Shell</b> interface of the
8593
+ <b>bioroebe project</b>. It is also called <b>bioshell</b>,
8594
+ to simplify spelling it.
8595
+
8596
+ ### Numbers as input in the bioshell
8597
+ ![alt text][cat1]
8598
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8599
+
8600
+ You can input a number in the **BioShell** such as <b style="color: darkblue">3</b>.
8601
+
8602
+ This will attempt to <b>display the first 3 nucleotides</b> of
8603
+ the assigned **main sequence**. It will only work if you have
8604
+ assigned a sequence prior to that, though.
8605
+
8606
+ Examples:
8607
+
8608
+ 3
8609
+ 33
8610
+ 15
8611
+
8612
+ ### transeq
8613
+ ![alt text][cat1]
8614
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8615
+
8616
+ You can convert a DNA sequence into an aminoacid sequence by
8617
+ doing this:
8618
+
8619
+ transeq
8620
+
8621
+ ### Shuffling the DNA/RNA string in the bioshell
8622
+ ![alt text][cat1]
8623
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8624
+
8625
+ Via
8626
+
8627
+ shuffle
8628
+
8629
+ you can <b>randomly rearrange the main DNA/RNA string</b>
8630
+ that is used by the <b>Bioroebe::Shell</b>.
8631
+
8632
+ This can be useful if you just wish to quickly "test"
8633
+ new arrangements of the same nucleotides.
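What such a shuffle does to a sequence can be illustrated with a short Ruby sketch. `shuffle_sequence` is a hypothetical name chosen for this example, not the method the bioshell uses internally; the optional `random:` argument is only there to make runs reproducible.

```ruby
# Hypothetical helper (not the bioshell implementation): randomly
# rearranges the characters of a DNA/RNA string. The nucleotide
# composition stays the same; only the order changes.
def shuffle_sequence(sequence, random: Random.new)
  sequence.chars.shuffle(random: random).join
end

original = 'ATGCGTTAACCG'
shuffled = shuffle_sequence(original)
puts shuffled
puts shuffled.chars.sort == original.chars.sort # true: same composition
```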
8634
+
8635
+ ### Permanently disabling showing the startup-introduction of the Bioshell
8636
+ ![alt text][cat1]
8637
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8638
+
8639
+ If you do not want to see the start-up intro, you can try
8640
+ any of the following:
8641
+
8642
+ bioshell --permanently-disable-startup-intro
8643
+ bioshell --permanently-disable-startup-notice
8644
+ bioshell --permanently-no-startup-intro
8645
+ bioshell --permanently-no-startup-info
8646
+
8647
+ ### Longest substring
8648
+ ![alt text][cat1]
8649
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8650
+
8651
+ Within the Bioroebe::Shell you can determine the longest substring,
8652
+ including gaps, like this:
8653
+
8654
+ longest_substring? ATTATTGTT | ATTATTCTT
8655
+
8656
+ Note that this will make use of the diff-lcs gem, which uses
8657
+ the McIlroy-Hunt algorithm.
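To illustrate what "longest substring, including gaps" means, here is a minimal pure-ruby sketch of a longest common subsequence. It uses a plain dynamic-programming table, which is not the algorithm the diff-lcs gem uses, and `longest_common_subsequence` is a hypothetical name chosen for this example.

```ruby
# Hypothetical helper: longest common subsequence via a simple
# dynamic-programming table (not the diff-lcs implementation).
def longest_common_subsequence(a, b)
  # table[i][j] holds the LCS length of a[0, i] and b[0, j].
  table = Array.new(a.size + 1) { Array.new(b.size + 1, 0) }
  a.each_char.with_index(1) do |char_a, i|
    b.each_char.with_index(1) do |char_b, j|
      table[i][j] =
        if char_a == char_b
          table[i - 1][j - 1] + 1
        else
          [table[i - 1][j], table[i][j - 1]].max
        end
    end
  end
  # Walk back through the table to recover one LCS.
  result = +''
  i = a.size
  j = b.size
  while i > 0 && j > 0
    if a[i - 1] == b[j - 1]
      result << a[i - 1]
      i -= 1
      j -= 1
    elsif table[i - 1][j] >= table[i][j - 1]
      i -= 1
    else
      j -= 1
    end
  end
  result.reverse
end

puts longest_common_subsequence('ATTATTGTT', 'ATTATTCTT') # => "ATTATTTT"
```

The two inputs differ only at the seventh position (G versus C), so the common subsequence skips that one nucleotide.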
8658
+
8659
+ ### Do not create directories on startup of the shell
8660
+ ![alt text][cat1]
8661
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8662
+
8663
+ By default the <b>bioshell</b> will try to create some directories
8664
+ on startup. This may not always be desired by the user, though,
8665
+ so an option has to exist to <b>disable</b> this functionality.
8666
+
8667
+ Internally the variable @internal_hash[:create_directories_on_startup_of_the_shell]
8668
+ keeps track of whether directories on startup of the shell will
8669
+ be created.
8670
+
8671
+ To disable this behaviour on startup of the bioshell, try
8672
+ something like this:
8673
+
8674
+ bioshell --do-not-create-directories-on-startup
8675
+ bioshell --do-not-create-directories
8676
+
8677
+ ### Generating and assigning a random amount of nucleotides
8678
+ ![alt text][cat1]
8679
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8680
+
8681
+ Via:
8682
+
8683
+ random 555
8684
+
8685
+ you can "generate" 555 random nucleotides (DNA that is) and
8686
+ assign it to the main sequence in use by the bioshell. This
8687
+ is mostly a convenience feature, if you want to debug something
8688
+ quickly.
8689
+
8690
+ ### Determining the log directory for the Bioroebe::Shell component
8691
+ ![alt text][cat1]
8692
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8693
+
8694
+ Via:
8695
+
8696
+ bioshell_log_dir?
8697
+
8698
+ you can determine the log-directory output for the bioshell
8699
+ component. On my home system this will default to
8700
+ <b>/home/Temp/bioroebe/bioshell/</b>.
8701
+
8702
+ ### Prompt (the shell prompt of the bioshell)
8703
+ ![alt text][cat1]
8704
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8705
+
8706
+ You can set a <b>custom prompt</b> in the bioshell, via
8707
+ the keywords "<b>prompt</b>" or "<b>set_prompt</b>".
8708
+
8709
+ To display the <b>current working directory</b>, do:
8710
+
8711
+ prompt pwd
8712
+
8713
+ To revert to the old default again, do this:
8714
+
8715
+ prompt REVERT
8716
+ prompt revert
8717
+ prompt DEFAULT
8718
+ prompt default
8719
+
8720
+ If you do not want to set any prompt, do:
8721
+
8722
+ prompt none
8723
+
8724
+ ### Random stuff - generating random DNA sequences in the bioshell
8725
+ ![alt text][cat1]
8726
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8727
+
8728
+ You can <b>generate random DNA sequences</b> in the
8729
+ <b>bioshell</b> via:
8730
+
8731
+ random dna 20
8732
+ random dna 25
8733
+ random dna 30
8734
+ # or simpler
8735
+ random 20
8736
+ random 25
8737
+ random 30
8738
+
8739
+ This will generate random DNA sequences, with a length
8740
+ of 20, 25, 30, respectively. This may not be very useful
8741
+ but it was important that this functionality is made
8742
+ available somewhere. Sometimes you may not even care
8743
+ about the sequence and just use a "filler" sequence,
8744
+ so randomness has to be part of the Bioroebe project
8745
+ as well.
8746
+
8747
+ You can also use some toplevel-methods to generate, e. g.
8748
+ 20 random aminoacids. Have a look at the following
8749
+ <b>toplevel API</b>:
8750
+
8751
+ Bioroebe.random_aminoacid? 20 # => "UAVHYQQESWUYAOVESEIY"
8752
+
8753
+ Note that there may exist other APIs within the Bioroebe project
8754
+ that do the same as well.
8755
+
8756
+ If you would like to use a ruby-gtk3 widget have a look
8757
+ at **RandomSequence**, under **bioroebe/gtk3/random_sequence/**.
8758
+ It works with aminoacids, DNA and RNA, and allows the user to
8759
+ create random sequences. (If you need weighted randomness then
8760
+ you currently have to use the commandline variant. Perhaps I may
8761
+ add support into the GUI directly for this one day.)
8762
+
8763
+ ### Deprecations within the Bioroebe::Shell
8764
+ ![alt text][cat1]
8765
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8766
+
8767
+ Over the years the Bioroebe::Shell changed quite a bit.
8768
+
8769
+ This subsection here will list a few of these changes
8770
+ or rather, the deprecations.
8771
+
8772
+ **raw_sequence**: removed in June 2022 completely. It is
8773
+ simpler to handle sequences via Bioroebe::Sequence
8774
+ instead.
8775
+
8776
+ <b>@internal_hash[:array_sequences]</b> was no longer in
8777
+ use, so it was removed in July 2022.
8778
+
8779
+ ### Chop off nucleotides within the Bioroebe::Shell
8780
+ ![alt text][cat1]
8781
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8782
+
8783
+ You can use the following syntax to chop away until you find
8784
+ a particular substring, in the bioshell:
8785
+
8786
+ chop_to ATG
8787
+
8788
+ This functionality was specifically added to find the first
8789
+ ATG codon.
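The idea behind `chop_to` can be sketched as follows. This is an assumption about the behaviour, not the shell's actual code: the hypothetical `chop_to` method below keeps the matched substring and everything after it, and leaves the sequence untouched if the substring is absent.

```ruby
# Hypothetical sketch: drop everything in front of the first occurrence
# of the target substring. Assumes the target itself is kept.
def chop_to(sequence, target)
  index = sequence.index(target)
  return sequence if index.nil? # target not found: leave the sequence as-is
  sequence[index..]
end

puts chop_to('CCGTAATGGCATAG', 'ATG') # => "ATGGCATAG"
```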
8790
+
8791
+ ### Truncating output in the bioroebe-shell
8792
+ ![alt text][cat1]
8793
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8794
+
8795
+ **DNA/RNA sequences** can become very long and then become
8796
+ quite difficult to view, read and handle on the commandline.
8797
+
8798
+ Normally the bioroebe shell will truncate output of DNA sequences
8799
+ that are "too long". This is mostly done so that working with
8800
+ very long sequences becomes a bit more convenient.
8801
+
8802
+ Sometimes this can become an antifeature, though, so the user
8803
+ must be able to toggle this at his or her own discretion.
8804
+
8805
+ By default, the bioroebe-shell (bioshell) will always try
8806
+ to truncate output, but you can toggle this behaviour by
8807
+ issuing:
8808
+
8809
+ do not truncate
8810
+
8811
+ In theory, other "do not" actions are also supported, or will
8812
+ be supported in the future; right now (Oct 2019) this is a bit
8813
+ limited.
8814
+
8815
+ From the toplevel, you can use this method:
8816
+
8817
+ Bioroebe.do_not_truncate
8818
+
8819
+ The above instruction will toggle the truncate behaviour
8820
+ to not truncate, ever.
8821
+
8822
+ If you need to do so within the bioshell, this is the way:
8823
+
8824
+ no_truncate
8825
+
8826
+ Or simply
8827
+
8828
+ truncate
8829
+
8830
+ This will toggle, like a switch.
8831
+
8832
+ ### Working with .pdb files in the bioshell
8833
+ ![alt text][cat1]
8834
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8835
+
8836
+ This subsection only very briefly mentions how to work with
8837
+ .pdb files in the bioshell. See other parts of this
8838
+ document for a more extensive overview how you can work
8839
+ with .pdb files via the Bioroebe project.
8840
+
8841
+ If you input a file name that ends with .pdb, such as:
8842
+
8843
+ 1fat.pdb
8844
+
8845
+ and no such file currently exists at
8846
+ /home/Temp/bioroebe/pdb/1fat.pdb, then it will be
8847
+ downloaded and moved to
8848
+ **/home/Temp/bioroebe/pdb/**.
8849
+
8850
+ This feature exists just to simplify using the
8851
+ **bioshell**.
8852
+
8853
+ ### Showing the stop codons in frame1, frame2 and frame3 in the bioshell
8854
+ ![alt text][cat1]
8855
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8856
+
8857
+ When you have a given sequence assigned to the bioshell, such
8858
+ as via "random 99", you can then show all stop codons in
8859
+ frame1, frame2 and frame3.
8860
+
8861
+ The corresponding input for this will be:
8862
+
8863
+ stop_frame1?
8864
+ stop_frame2?
8865
+ stop_frame3?
8866
+
8867
+ The following image shows this; we first entered "random 120",
8868
+ before issuing the above-mentioned instructions one after
8869
+ the other:
8870
+
8871
+ <img src="https://i.imgur.com/HpHF4jq.png" style="margin: 1em; border: 1px solid black">
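How such a per-frame search could work is sketched below. The method name `stop_codon_positions` and the choice to report 1-based nucleotide positions are assumptions made for this example; the bioshell may present its results differently.

```ruby
STOP_CODONS = %w[TAA TAG TGA].freeze

# Hypothetical helper: returns the 1-based start positions of all stop
# codons found in the given reading frame (1, 2 or 3) of a DNA sequence.
def stop_codon_positions(sequence, frame)
  sequence = sequence.upcase
  offset = frame - 1
  positions = []
  # Step through the sequence codon by codon, starting at the frame offset.
  (offset..sequence.size - 3).step(3) do |i|
    positions << (i + 1) if STOP_CODONS.include?(sequence[i, 3])
  end
  positions
end

sequence = 'ATGTAACCTGA'
puts stop_codon_positions(sequence, 1).inspect # => [4]
puts stop_codon_positions(sequence, 2).inspect # => []
puts stop_codon_positions(sequence, 3).inspect # => [9]
```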
8872
+
8873
+ ### Freezing the main sequence in the bioshell - and unfreezing it again
8874
+ ![alt text][cat1]
8875
+ [cat1]: https://i.imgur.com/Qmd7R0p.png
8876
+
8877
+ You can **freeze** the BioShell, meaning that it will no longer
8878
+ allow for the main sequence to be modified, via the following
8879
+ command:
8880
+
8881
+ freeze
8882
+
8883
+ To <b>unfreeze</b> the sequence again, issue:
8884
+
8885
+ unfreeze
8886
+
8887
+ This functionality has been added because the shell may sometimes be
8888
+ quite eager to change the main sequence, so we needed a way to
8889
+ disable any further modifications (until "unfreeze" is issued
8890
+ that is).
8891
+
8892
+ ## Support for other programming languages
8893
+
8894
+ The main programming language for the bioroebe project is **ruby**.
8895
+ Ruby, from a language design point of view, is a great programming
8896
+ language - not necessarily all of ruby, but the subset that I use.
8897
+ It is very easy to quickly prototype ideas via ruby.
8898
+
8899
+ However, ruby is known to **not** be among the fastest programming
8900
+ languages on this planet; so, it makes sense to use other
8901
+ languages too from this point of view. Additionally there are some
8902
+ software stacks in use in **other** programming languages, such as
8903
+ matplotlib and various more.
8904
+
8905
+ Thus, it is important to **support other programming languages** as
8906
+ well, if there are useful libraries. The bioroebe project, after
8907
+ all, tries to be **practical**: it focuses on getting things done,
8908
+ no matter the language.
8909
+
8910
+ This means that support for other programming languages can be
8911
+ found in this project as well, often using system() or similar
8912
+ functionality to tap into these other programming languages. Do
8913
+ not be surprised when that happens - the bioroebe project will
8914
+ also try to act as a **practical glue** towards functionality
8915
+ enabled via other projects. We want to get things done, no
8916
+ matter the programming language at hand!
8917
+
8918
+ Whenever possible, though, the bioroebe project will try to be
8919
+ flexible in this regard, so ideally the same solution should
8920
+ work for many different programming languages.
8921
+
8922
+ While Ruby is the primary language for this project, as
8923
+ of 2021 I will try to officially support **java**, **jruby**
8924
+ and the **GraalVM**. This is on my TODO list, though - stay
8925
+ tuned for more updates in this regard. See also the
8926
+ subsection <b>Support for Python</b>.
8927
+
8928
+ ## Support for Python
8929
+
8930
+ In <b>June 2022</b> I decided to add support for Python to bioroebe.
8931
+
8932
+ While people can - and should - easily use <b>biopython</b> instead,
8933
+ I simply wanted to see how much python-support I can add to
8934
+ bioroebe. This may lag behind some years compared to biopython,
8935
+ but I wanted to extend python support as well, so there you go.
8936
+ It is simply an additional option for the bioroebe project.
8937
+ <b>Ruby</b> will remain the primary language for the project,
8938
+ though, at the least for now.
8939
+
8940
+ ## Bioroebe::ProfilePattern
8941
+
8942
+ This class can be used to generate nucleotide sequences that
8943
+ are not quite "random". For example, to generate sequences
8944
+ that may "simulate" a TATA box.
8945
+
8946
+ The idea for this class is to be extended into allowing
8947
+ HMMs (Hidden Markov Models) one day.
8948
+
8949
+ Usage example:
8950
+
8951
+ _ = Bioroebe::ProfilePattern.new(ARGV, :do_not_run_yet)
8952
+ _.generate_sequence_based_on_this_profile
8953
+
8954
+ Such a profile specifies the preferred sequence
8955
+ letters for each position in a section of DNA. You have to provide
8956
+ the Hash into the method generate_sequence_based_on_this_profile() -
8957
+ or you use the default Hash, which is stored in the constant
8958
+ called **PER_POSITION_HASH**.
8959
+
8960
+ That profile should be a Hash whose keys are A, T, C and G
8961
+ and whose values are Arrays holding one number per position,
8962
+ such as 140. These numbers are also called
8963
+ **scores**. Each score indicates, for its position,
8964
+ how likely it is to find the given
8965
+ nucleotide at that location.
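The general technique of sampling from such a per-position profile can be sketched like this. It is not the code of Bioroebe::ProfilePattern; the helper names `weighted_pick` and `sequence_from_profile` and the example profile are made up for illustration.

```ruby
# Hypothetical helper: picks one key of the weights Hash, with a
# probability proportional to its (non-negative) weight.
def weighted_pick(weights, rng)
  threshold = rng.rand * weights.values.sum
  running = 0
  weights.each do |nucleotide, weight|
    running += weight
    return nucleotide if threshold < running
  end
  weights.keys.last # fallback, e. g. if all weights are zero
end

# Hypothetical helper: builds one sequence, position by position,
# from a profile of the shape { 'A' => [scores], 'T' => [scores], ... }.
def sequence_from_profile(profile, rng = Random.new)
  length = profile.values.first.size
  (0...length).map { |position|
    weights = profile.transform_values { |scores| scores[position] }
    weighted_pick(weights, rng)
  }.join
end

# A deliberately extreme profile: only one nucleotide is possible per position.
profile = {
  'A' => [0, 100, 0, 100],
  'T' => [100, 0, 100, 0],
  'C' => [0, 0, 0, 0],
  'G' => [0, 0, 0, 0]
}
puts sequence_from_profile(profile) # => "TATA"
```

With less extreme scores, repeated calls would return different sequences that merely tend towards the preferred nucleotides.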
8966
+
8967
+ You can also use this class to generate a random DNA string,
8968
+ similar to the method called
8969
+ **Bioroebe.generate_random_dna_sequence()**. The difference
8970
+ is that class ProfilePattern allows for a bit more fine-tuned
8971
+ control. The class will likely be extended in the future too.
8972
+
8973
+ ## Generate DNA via Bioroebe.random_dna
8974
+
8975
+ You can "generate" random DNA strings by making use of the
8976
+ following code:
8977
+
8978
+ x = Bioroebe.random_dna 50 # => "AGACATCCGGCTTGGATACCTCATAAGTCATATCAGCATCGTCGGACATT"
8979
+
8980
+ As can be seen in the example above, after the #, a String will be
8981
+ returned representing that nucleotide sequence. In the case above
8982
+ it'll be 50 nucleotides in length.
8983
+
8984
+ The number given to <b>.random_dna()</b> tells the method how many
8985
+ nucleotides should be generated.
8986
+
8987
+ The method accepts a second argument, which should be a Hash.
8988
+ If it is a hash then the generated DNA will be based on the
8989
+ **probabilities** given to that Hash.
8990
+
8991
+ Let's look at a specific example here:
8992
+
8993
+ Bioroebe.random_dna(50, { A: 10, T: 10, C: 10, G: 70}) # => "GGGGTGGGGAGGGTATGCGGAGGAAGGGCGGGAAGGGCGGGGGCTGGGCG"
8994
+
8995
+ As you can see, in the Hash defined above, the likelihood for
8996
+ incorporating a Guanine is much higher than for Adenine
8997
+ (70 : 10). This will be reflected in the generated DNA
8998
+ sequence which, as can be seen, contains many more
8999
+ Guanines than Adenines.
9000
+
9001
+ There is yet a third use case for the above. If you pass a **String**
9002
+ as the second argument rather than a Hash, then that String will be
9003
+ used as basis for generating the DNA string at hand.
9004
+
9005
+ Again, let's look at a specific example here:
9006
+
9007
+ Bioroebe.random_dna(10, 'ATCGATCGGG')
9008
+
9009
+ Here the String contains more G than A, T or C, so the new DNA sequence should
9010
+ tend to contain more G as well.
9011
+
9012
+ More usage examples in this regard:
9013
+
9014
+ Bioroebe.random_dna(20, 'ATGGGGGGGG') # => "TGAGGGGGGGGGTGGGAGGG"
9015
+ Bioroebe.random_dna(20, 'ATGGGGGGGG') # => "GGTAGGGGGGGGTAGGGGGG"
9016
+
9017
+ Note that this is similar to the .randomize() method in the bioruby
9018
+ project:
9019
+
9020
+ hash = {'a'=>1,'c'=>2,'g'=>3,'t'=>4}
9021
+ puts Bio::Sequence::NA.randomize(hash) # => "ggcttgttac" (for example)
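The weighted generation described in this section, with either a Hash of weights or a template String, can be approximated with a simple sampling pool. The sketch below is an assumption about the general technique only; `random_dna_sketch` is a hypothetical name and not how Bioroebe.random_dna() is implemented. It expects integer weights.

```ruby
# Hypothetical helper: draws nucleotides from a pool in which each
# nucleotide appears as often as its weight (Hash) or its count (String).
def random_dna_sketch(length, composition = 'ATCG', rng = Random.new)
  pool =
    if composition.is_a?(Hash)
      # { A: 10, G: 70 } becomes 10 copies of "A" and 70 copies of "G".
      composition.flat_map { |nucleotide, weight| [nucleotide.to_s] * weight }
    else
      composition.chars
    end
  Array.new(length) { pool.sample(random: rng) }.join
end

puts random_dna_sketch(50, { A: 10, T: 10, C: 10, G: 70 }) # mostly G
puts random_dna_sketch(20, 'ATGGGGGGGG')                   # mostly G as well
```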
9022
+
9023
+ ## Generating a random nucleotide sequence based on frequencies
9024
+
9025
+ If you ever need to generate a nucleotide sequence based on given frequencies, you can use
9026
+ the following method:
9027
+
9028
+ Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies
9029
+ Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies 100
9030
+ Bioroebe.generate_nucleotide_sequence_based_on_these_frequencies 500
9031
+
9032
+ ## Parsing genbank (.gbk) files
9033
+
9034
+ You could use class <b>Bioroebe::GenbankParser</b> to parse .gbk files, at
9035
+ the least if you want to obtain the raw sequence, in FASTA format.
9036
+
9037
+ Example for this:
9038
+
9039
+ require 'bioroebe/genbank/genbank_parser.rb'
9040
+ result = Bioroebe::GenbankParser.new('/home/Temp/bioroebe/ls_orchid.gbk')
9041
+ result.dataset? # This method call will return the FASTA sequence.
9042
+
9043
+ Note that this currently (<b>July 2022</b>) only grabs one entry. In
9044
+ the upcoming rewrite in the future the parser will be able to parse
9045
+ all entries, and then present them to the user. Stay tuned in this
9046
+ regard.
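For readers unfamiliar with the format: the raw sequence of a GenBank record sits between the `ORIGIN` line and the closing `//`. The sketch below shows one way to extract it. It is independent of Bioroebe::GenbankParser, the method name `genbank_origin_sequence` is hypothetical, and the tiny record is invented for this example.

```ruby
# Hypothetical helper: collects the sequence letters found between the
# ORIGIN line and the terminating "//" of a GenBank record.
def genbank_origin_sequence(text)
  inside_origin = false
  sequence = +''
  text.each_line do |line|
    if line.start_with?('ORIGIN')
      inside_origin = true
    elsif line.start_with?('//')
      inside_origin = false
    elsif inside_origin
      # Strip the position numbers and the spaces between the blocks.
      sequence << line.gsub(/[^A-Za-z]/, '')
    end
  end
  sequence.upcase
end

# A made-up, heavily shortened record (not a real GenBank entry).
record = <<~GENBANK
  LOCUS       EXAMPLE  12 bp  DNA
  ORIGIN
          1 atgcgtacgt ta
  //
GENBANK

puts genbank_origin_sequence(record) # => "ATGCGTACGTTA"
```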
9047
+
9048
+ ## Parsers in general
9049
+
9050
+ The bioroebe project will store most parsers in the parsers/ subdirectory
9051
+ as of <b>July 2022</b>.
9052
+
9053
+ Prior to that date different parsers were stored in different subdirectories,
9054
+ such as the parser for genbank-files being stored in the genbank/
9055
+ subdirectory. As I found this situation confusing, I settled for
9056
+ the parsers/ subdirectory in <b>July 2022</b>.
9057
+
9058
+ ## Coomassie staining of proteins
9059
+
9060
+ Coomassie staining is typically done on proteins, giving them a blue
9061
+ or blueish colour. <b>Coomassie staining</b> is <b>the most popular
9062
+ anionic protein dye</b>.
9063
+
9064
+ This may look like this:
9065
+
9066
+ <img src="https://i.imgur.com/6eUN7HR.png" style="margin: 1em; border: 1px solid black">
9067
+
9068
+ This picture shows five different bands. The molecular weight of the
9069
+ marker can be seen on the very left hand side, in <b>kDa</b>. The
9070
+ larger fragments can be seen on top, so the farther the band has
9071
+ moved, the smaller the fragment must be (in kDa). That means that
9072
+ the larger proteins can be found on top; the smaller proteins on
9073
+ the bottom.
9074
+
9075
+ Some bands are missing, and this gives information - that is
9076
+ that a particular protein is missing. Probably it was not
9077
+ synthesized in the given tissue at hand.
9078
+
9079
+ Coomassie Blue staining is typically done
9080
+ with G-250, at a 0.5% concentration prepared in
9081
+ 50% methanol and 10% acetic acid. Staining
9082
+ usually takes about 5 minutes.
9083
+
9084
+ Note that the G-250 stain is the dimethyl derivative from
9085
+ R-250 - the <b>R</b> stands for <b>red</b> or <b>reddish</b>.
9086
+ Both dyes will bind via electrostatic interaction with <b>protonated
9087
+ basic amino acids</b>: that is <b>lysine</b>, <b>arginine</b>,
9088
+ and <b>histidine</b>. They can also bind via hydrophobic
9089
+ associations to aromatic residues.
9090
+
9091
+ Coomassie stains are in principle reversible. They are not
9092
+ as sensitive as silver staining, but significantly cheaper,
9093
+ which is one reason why they have become so popular.
9094
+
9095
+ Not every protein has all aminoacids, so staining may be difficult.
9096
+ For instance, the <b>glycomacropeptide</b> is the only known
9097
+ naturally occurring protein that contains no Phe (Phenylalanine; F).
9098
+
9099
+ A protein that lacks lysine, arginine, histidine and aromatic
9100
+ amino acids may be undetectable via Coomassie staining. However,
9101
+ this does not seem to be a universal rule; some groups report
9102
+ that they even managed to stain "unstainable" proteins via
9103
+ Coomassie staining.
9104
+
9105
+ The paper at https://www.jbc.org/article/S0021-9258(17)39198-6/pdf,
9106
+ titled "Why Does Coomassie Brilliant Blue R Interact Differently
9107
+ with Different Proteins?" and published in the year 1985, tries
9108
+ to give some explanations to different groups yielding different
9109
+ results via Coomassie staining.
9110
+
9111
+ They specifically point out that "there is a striking correlation
9112
+ between intensity of response to Coomassie dyes and the basicity
9113
+ of a protein which depends on the number of lysine, histidine,
9114
+ and arginine residues, as well as the NH₂-terminal amino group"
9115
+ (aka the aminoterminus of the protein at hand). The concluding
9116
+ remark from that paper is that <b>"Coomassie R Interacts
9117
+ Differently with Different Proteins"</b>.
9118
+
9119
+ On class <b>Bioroebe::Protein</b> you can determine whether
9120
+ a given protein can be stained via coomassie through the
9121
+ following method:
9122
+
9123
+ .can_be_stained_via_coomassie?
9124
+
9125
+ This isn't an ideal check, so don't rely on it. It will simply
9126
+ check whether the sequence has at the least one lysine,
9127
+ or one histidine, or one arginine, or any of the aromatic
9128
+ amino acids.
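The check described above can be sketched in a few lines. This is an approximation written for illustration, not the method body from Bioroebe::Protein; the name `stainable_via_coomassie?` is hypothetical, and the aromatic residues are taken to be F, W and Y.

```ruby
# Residues the dye is expected to interact with: the basic amino acids
# (K, R, H) and the aromatic ones (F, W, Y).
STAINABLE_RESIDUES = %w[K R H F W Y].freeze

# Hypothetical helper: true if the one-letter sequence contains at least
# one of the residues listed above.
def stainable_via_coomassie?(aminoacid_sequence)
  aminoacid_sequence.upcase.each_char.any? { |residue|
    STAINABLE_RESIDUES.include?(residue)
  }
end

puts stainable_via_coomassie?('MKT')  # => true  (contains a lysine)
puts stainable_via_coomassie?('GASP') # => false (no basic or aromatic residue)
```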
9129
+
9130
+ ## Codon Usage
9131
+
9132
+ This **paragraph** deals with some aspects of **codon usage** in different
9133
+ organisms.
9134
+
9135
+ Let us first define the term <b>codon usage</b> so we can base any further
9136
+ analysis on this definition. In order to do so, we also have to define
9137
+ what a <b>codon</b> is, so let's start with that actually.
9138
+
9139
+ A <span style="color: darkgreen; font-weight: bold">codon</span> is
9140
+ essentially the basic code used in DNA to denote which particular
9141
+ **aminoacid** corresponds to these (three) nucleotide base pairs.
9142
+ A codon is thus <b>a series of three nucleotides</b>, also called
9143
+ a <b>triplet</b>, such as <b>ATG</b>.
9144
+
9145
+ When we use the term <b>base pairs</b>, we refer to **double-stranded DNA**,
9146
+ abbreviated as <b>dsDNA</b>. The codon is, however, only found
9147
+ in a single stranded molecule, even within dsDNA. Since some parts of
9148
+ a **dsDNA** in any given genome give rise to a, more or less, complementary
9149
+ copy into **mRNA**, the codons that are actually used, are found in the
9150
+ corresponding mRNA as well, excluding the codon that codes for a stop
9151
+ signal (a so-called <b>stop codon</b>). (Remember that mRNA differs from
9152
+ DNA in that there will be Uracil rather than Thymine; otherwise it is
9153
+ the same, sequence-wise. Of course it uses another sugar (Ribose), but
9154
+ remember we are here mostly interested in the **information-containing
9155
+ part**, not the full chemical structure.)
9156
+
9157
+ The <b>codon</b> is thus found on the mRNA and since mRNA is mostly
9158
+ single-stranded, the codon is a component of the mRNA. The two subunits
9159
+ of the ribosome are assembled on a mRNA, at the least in prokaryotes (or
9160
+ more accurately, the smaller subunit scans along the mRNA until it
9161
+ <b>detects</b> a start codon). Mind you, this subsection will not go
9162
+ into all relevant details, so just keep in mind that the codon is the
9163
+ part that will eventually be "<i>translated</i>" at the ribosome into
9164
+ a corresponding aminoacid, excluding stop codons at the end.
9165
+
9166
+ Now - different organisms use **different frequencies of codons**.
9167
+ <b style="color:darkblue">Codon usage</b> thus describes the fact
9168
+ that many proteins in these different organisms make use of certain
9169
+ codons with a **substantially higher frequency than other codons**.
9170
+ We can use statistics to infer this on a global (proteome)
9171
+ level too.
9172
+
9173
+ Remember that the genetic code is **degenerate**, meaning that
9174
+ you have a few aminoacids that are encoded only by one codon
9175
+ (<b>Tryptophan</b> and <b>Methionine</b>), whereas the other
9176
+ aminoacids are encoded by more than one codon - thus, at the
9177
+ very least two codons. Note that the latter codons, if they
9178
+ code for the **same** aminoacid, are also called
9179
+ <b style="font-style: italic">synonymous codons</b>.
9180
+
9181
+ This means that if you have any given aminoacid chain, you can have
9182
+ several different codon sequences that would yield the very same
9183
+ amino acid chain (which
9184
+ ultimately means that you can have different DNA sequences
9185
+ code for the very same aminoacid chain).
9186
+
9187
+ Usually the third base of a codon has the least influence on
9188
+ codon meaning. This is also called <b>wobbling</b> - since
9189
+ the anticodon loop on the tRNA is in the reverse direction,
9190
+ and the wobble position refers to the tRNA, this means that
9191
+ the wobble-position is at the 5'-end of the tRNA anticodon.
9192
+
9193
+ Now a few words about functionality related to codons and codon
9194
+ usage in the Bioroebe project.
9195
+
9196
+ Say that you have a long DNA sequence; let's pick a sample
9197
+ for now, such as:
9198
+
9199
+ ATGGGCGGGGTGATGGCAATGATGCCCCCGATGATG
9200
+
9201
+ You can analyze the codons used via class **ShowCodonUsage**
9202
+ and the corresponding entry at <b>bin/show_codon_usage</b>:
9203
+
9204
+ show_codon_usage ATGGGCGGGGTGATGGCAATGATGCCCCCGATGATG
9205
+
9206
+ This class can be found at <b>bioroebe/codons/show_codon_usage.rb</b>.
9207
+ It will report the top 5 codons in use and also output the
9208
+ frequency hash on the commandline.
9209
+
9210
+ On my computer at home the output it yields via the commandline,
9211
+ on a KDE konsole terminal, looks like this:
9212
+
9213
+ <img src="https://i.imgur.com/h55Thdu.png" style="margin: 1em; border: 3px solid black">
9214
+
9215
+ You can use this from within ruby code too, via the following
9216
+ toplevel method:
9217
+
9218
+ Bioroebe.codon_frequencies_of_this_sequence(ARGV)
9219
+
9220
+ To get the hash of the codon frequencies you can use the .hash? method:
9221
+
9222
+ hash = Bioroebe.codon_frequencies_of_this_sequence('ATGGGCGGGGTGATGGCAATGATGCCCCCGATGATG').hash?
9223
+
9224
+ If you want to look at the actual codon frequencies used
9225
+ by different organisms, have a look here:
9226
+
9227
+ http://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=11076&aa=9&style=N
9228
+
9229
+ This is an excellent resource.
9230
+
9231
+ For instance, the <i>E. coli</i> K strain can be found here:
9232
+
9233
+ https://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=83333&aa=9&style=N
9234
+
9235
+ ## Determining the frequencies of aminoacids in a given aminoacid (protein) sequence
9236
+
9237
+ If you quickly wish to determine the aminoacid composition, as a
9238
+ Hash, you can use **bin/aminoacid_frequencies**.
9239
+
9240
+ Example from the commandline for this:
9241
+
9242
+ aminoacid_frequencies MVTDEGAIYFTKDAARNWKAAVEETVSATLNRTVSSGITGASYYTGTFST
9243
+
9244
+ Example from within bioroebe itself (and thus ruby):
9245
+
9246
+ require 'bioroebe/frequencies.rb'
9247
+
9248
+ Bioroebe.aminoacid_frequencies('MVTDEGAIYFTKDAARNWKAAVEETVSATLNRTVSSGITGASYYTGTFST')
9249
+
9250
+ The latter will return a Hash that you can then make further use of, such as:
9251
+
9252
+ {"M"=>1, "V"=>4, "T"=>9, "D"=>2, "E"=>3, "G"=>4, "A"=>7, "I"=>2, "Y"=>3, "F"=>2, "K"=>2, "R"=>2, "N"=>2, "W"=>1, "S"=>5, "L"=>1}
9253
+
9254
+ ## Determining the codon frequencies from the commandline
9255
+
9256
+ In <b>April 2022</b> I noticed that one use case is to show the
9257
+ codon frequencies of a given sequence - typically a nucleotide sequence.
9258
+
9259
+ For aminoacids there already was an executable, at **bin/aminoacid_frequencies**.
9260
+
9261
+ So, following that logic, a new executable was added at
9262
+ **bin/codon_frequency**. This will show the Hash of the codon
9263
+ frequencies, as a String, on the commandline.
9264
+
9265
+ Usage example:
9266
+
9267
+ codon_frequency ATTCGTACGATCGACTGACTGACAGTCATTCGT
9268
+
9269
+ The output of this would be the following:
9270
+
9271
+ AUU: 2
9272
+ CGU: 2
9273
+ ACG: 1
9274
+ AUC: 1
9275
+ GAC: 1
9276
+ UGA: 1
9277
+ CUG: 1
9278
+ ACA: 1
9279
+ GUC: 1
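The counting itself is simple enough to sketch in plain ruby. The method below is a hypothetical stand-in, not the code behind bin/codon_frequency: it converts the DNA input to RNA, splits it into triplets and tallies them. It requires ruby 2.7 or newer for `tally`.

```ruby
# Hypothetical helper: counts how often each codon occurs, reading the
# sequence in frame 1. A trailing incomplete codon is ignored.
def codon_frequency_sketch(dna)
  rna = dna.upcase.tr('T', 'U')
  rna.scan(/.{3}/).tally
end

codon_frequency_sketch('ATTCGTACGATCGACTGACTGACAGTCATTCGT').each do |codon, count|
  puts "#{codon}: #{count}"
end
```

For the sequence used in the usage example this prints the same nine entries as listed above.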
9280
+
9281
+ ## Showing the codon frequency via countcodon
9282
+
9283
+ The excellent website at https://www.kazusa.or.jp/codon/countcodon.html offers
9284
+ a rather useful functionality via a simple web-interface, in that you can pass
9285
+ in a mRNA sequence, and it will then show the codon frequency/likelihood of
9286
+ that sequence - all codons in that sequence, that is. This can be extended
9287
+ to <b>all protein-coding genes in a given genome</b>, and will thus be
9288
+ useful for a researcher who may be interested in determining the codon
9289
+ frequency in general, across all genes in that given genome.
9290
+
9291
+ You can test it with an input sequence.
9292
+
9293
+ For instance, the following sequence:
9294
+
9295
+ ATTCGTACGATCGACTGACTGACAGTCATTCGTAGTACGATCGACTGACTGACAGTCATTCGTACGATCGACTGACTGACAAGTCATTCGTACGATCGACTGACTTGACAGTCATAA
9296
+
9297
+ Would yield this result:
9298
+
9299
+ fields: [triplet] [frequency: per thousand] ([number])
9300
+
9301
+ UUU 0.0( 0) UCU 0.0( 0) UAU 0.0( 0) UGU 0.0( 0)
9302
+ UUC 0.0( 0) UCC 0.0( 0) UAC 25.6( 1) UGC 0.0( 0)
9303
+ UUA 0.0( 0) UCA 25.6( 1) UAA 25.6( 1) UGA102.6( 4)
9304
+ UUG 0.0( 0) UCG 25.6( 1) UAG 0.0( 0) UGG 0.0( 0)
9305
+
9306
+ CUU 0.0( 0) CCU 0.0( 0) CAU 25.6( 1) CGU 76.9( 3)
9307
+ CUC 0.0( 0) CCC 0.0( 0) CAC 0.0( 0) CGC 0.0( 0)
9308
+ CUA 0.0( 0) CCA 0.0( 0) CAA 0.0( 0) CGA 25.6( 1)
9309
+ CUG102.6( 4) CCG 0.0( 0) CAG 25.6( 1) CGG 0.0( 0)
9310
+
9311
+ AUU 76.9( 3) ACU 25.6( 1) AAU 0.0( 0) AGU 51.3( 2)
9312
+ AUC 76.9( 3) ACC 0.0( 0) AAC 0.0( 0) AGC 0.0( 0)
9313
+ AUA 0.0( 0) ACA 76.9( 3) AAA 0.0( 0) AGA 0.0( 0)
9314
+ AUG 0.0( 0) ACG 76.9( 3) AAG 0.0( 0) AGG 0.0( 0)
9315
+
9316
+ GUU 0.0( 0) GCU 0.0( 0) GAU 25.6( 1) GGU 0.0( 0)
9317
+ GUC 51.3( 2) GCC 0.0( 0) GAC 76.9( 3) GGC 0.0( 0)
9318
+ GUA 0.0( 0) GCA 0.0( 0) GAA 0.0( 0) GGA 0.0( 0)
9319
+ GUG 0.0( 0) GCG 0.0( 0) GAG 0.0( 0) GGG 0.0( 0)
9320
+
9321
+ At any rate, the individual functionality for that is also available
9322
+ within the Bioroebe project as of **April 2022**.
9323
+
9324
+ The method that does so is:
9325
+
9326
+ Bioroebe.frequency_per_thousand
9327
+ Bioroebe.frequency_per_thousand('ATTCGTACGATCGACTGACTGACAGTCATTCGTAGTACGATCGACTGACTGACAGTCATTCGTACGATCGACTGACTGACAAGTCATTCGTACGATCGACTGACTTGACAGTCATAA') # Usage example here.
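How the per-thousand values in the table come about can be sketched as follows: each codon count is divided by the total number of codons and multiplied by 1000. The helper name `frequency_per_thousand_sketch` is hypothetical and the code is an illustration of the arithmetic, not the Bioroebe implementation.

```ruby
# Hypothetical helper: codon frequency per thousand codons, rounded to
# one decimal place, for a sequence read in frame 1.
def frequency_per_thousand_sketch(sequence)
  codons = sequence.upcase.tr('T', 'U').scan(/.{3}/)
  codons.tally.transform_values { |count|
    (count * 1000.0 / codons.size).round(1)
  }
end

sequence = 'ATTCGTACGATCGACTGACTGACAGTCATTCGTAGTACGATCGACTGACTGACAGTCATTCGT' \
           'ACGATCGACTGACTGACAAGTCATTCGTACGATCGACTGACTTGACAGTCATAA'
result = frequency_per_thousand_sketch(sequence)
puts result['UGA'] # => 102.6 (4 of 39 codons)
puts result['CGU'] # => 76.9  (3 of 39 codons)
puts result['UAA'] # => 25.6  (1 of 39 codons)
```

Codons that do not occur in the input are simply absent from the returned Hash, whereas the countcodon table lists them with 0.0.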
9328
+
9329
+ Sinatra bindings for this functionality have existed since July 2022,
9330
+ but they are not very well-polished. Ruby-gtk3 bindings may be
9331
+ added at a later time, and possibly ruby-libui bindings as well, for
9332
+ windows support. What is missing is support for different codon tables in
9333
+ different species, but that may be added at a later time as well - for now
9334
+ it seemed more important to offer the functionality.
9335
+
9336
+ ## Working with PDB files (.pdb)
9337
+
9338
+ The **PDB**, founded in the year **1971**, holds lots of **atomic
9339
+ structures of proteins**.
9340
+
9341
+ For instance, in **July 2016** it contained **121000 structures**.
9342
+
9343
+ In **February 2018** it contained **~124000 structures**
9344
+ (from X-ray crystallography), and about **~12000 NMR
9345
+ structures**. <b>NMR</b> is limited to about <b>350 amino
9346
+ acids maximum length</b>, give or take.
9347
+
9348
+ In **April 2020** the PDB contained **163141 structures**.
9349
+
9350
+ We can see that more and more structures are available nowadays -
9351
+ a trend that will most likely continue or even accelerate.
9352
+ (Let's hope the quality also remains high.)
9353
+
9354
+ A typical .pdb file contains entries such as this:
9355
+
9356
+ RTyp Num Atm Res Ch ResN X Y Z Occ Temp PDB Line
9357
+ ATOM 1 N ASP L 1 4.060 7.307 5.186 1.00 51.58 1FDL 93
9358
+ ATOM 2 CA ASP L 1 4.042 7.776 6.553 1.00 48.05 1FDL 94
9359
+ ATOM 3 N VAL A 25 32.433 16.336 57.540 1.00 11.92 A1 N
9360
+ ATOM 4 CA VAL A 25 31.132 16.439 58.160 1.00 11.85 A1 C
9361
+ ATOM 5 C VAL A 25 30.447 15.105 58.363 1.00 12.34 A1 C
9362
+
9363
+ (Note the first line: the row starting with **RTyp** is just a header explaining the ATOM
9364
+ entries below that line.)
9365
+
9366
+ The sequence starts from the N-terminal residue for proteins; see
9367
+ the <b>Atm</b> entry at <b>Num 1</b>.
9368
+
9369
+ The **meaning of these entries** is as follows:
9370
+
9371
+ 1) RTyp: Record Type
9372
+ 2) Num: Serial number of the atom. Each atom has a unique serial number.
9373
+ 3) Atm: Atom name (in IUPAC format).
9374
+ 4) Res: Residue name (IUPAC format).
9375
+ 5) Ch: Chain to which the atom belongs (in this case, L for light chain of an antibody).
9376
+ 6) ResN: Residue sequence number. This will be incremental, e. g. 1, 2, 3, 4 and so forth.
9377
+ 7,8,9) X, Y, Z: Cartesian coordinates specifying atomic position in space.
9378
+ 10) Occ: Occupancy factor
9379
+ 11) Temp: Temperature factor (atoms disordered in the crystal have high
9380
+ temperature factors; they are "wobbly" with a high factor.
9381
+ This is also called the B-factor).
9382
+ 12) PDB: The PDB data file unique identifier.
9383
+ 13) Line: Line (record) number in the data file.
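The field list above can be turned into a small parser for the simplified, whitespace-separated rows shown in the example table. Note the assumption: real .pdb files use fixed column positions, so splitting on whitespace is only adequate for this simplified layout. The names `ATOM_FIELDS` and `parse_atom_line` are hypothetical and unrelated to Bioroebe::ParsePdbFile.

```ruby
# Field names in the order of the table header shown above.
ATOM_FIELDS = %i[
  record_type serial atom residue chain residue_number
  x y z occupancy temperature pdb_id line
].freeze

# Hypothetical helper: turns one whitespace-separated ATOM row into a Hash.
def parse_atom_line(line)
  entry = ATOM_FIELDS.zip(line.split).to_h
  %i[serial residue_number line].each { |key| entry[key] = Integer(entry[key]) }
  %i[x y z occupancy temperature].each { |key| entry[key] = Float(entry[key]) }
  entry
end

atom = parse_atom_line('ATOM 1 N ASP L 1 4.060 7.307 5.186 1.00 51.58 1FDL 93')
puts atom[:residue]     # => ASP
puts atom[:chain]       # => L
puts atom[:temperature] # => 51.58
```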
9384
+
9385
+ Typically the rightmost entry, the last one, specifies
9386
+ which atom it is. A **H** stands for a hydrogen atom; the other atoms
9387
+ are "heavy" atoms (heavier than hydrogen most definitely).
9388
+
9389
+ Most .pdb files will contain **SEQRES** entries. These entries will list
9390
+ the primary sequence of the polymeric molecules present in the entry.
9391
+ You can notice this by looking at the standard 3-character code
9392
+ used by SEQRES here, for the canonical amino acids. So, for instance,
9393
+ the amino acids that will be mentioned in a SEQRES entry are
9394
+ ALA, CYS, ASP, GLU, PHE, GLY, HIS, ILE, LYS, LEU, MET, ASN,
9395
+ PRO, GLN, ARG, SER, THR, VAL, TRP and TYR. You can use the
9396
+ method **Bioroebe.three_to_one()** to convert back to the
9397
+ one-letter code, as follows:
9398
+
9399
+ Bioroebe.three_to_one('PHE') # => "F"
9400
+
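To illustrate what such a conversion does for a whole SEQRES-style list of residues, here is a small stand-alone sketch. It does not call bioroebe; the table `THREE_TO_ONE` and the method `seqres_to_one_letter` are hypothetical names, and only the 20 canonical amino acids listed above are covered.

```ruby
# Three-letter to one-letter codes for the 20 canonical amino acids.
THREE_TO_ONE = {
  'ALA' => 'A', 'CYS' => 'C', 'ASP' => 'D', 'GLU' => 'E', 'PHE' => 'F',
  'GLY' => 'G', 'HIS' => 'H', 'ILE' => 'I', 'LYS' => 'K', 'LEU' => 'L',
  'MET' => 'M', 'ASN' => 'N', 'PRO' => 'P', 'GLN' => 'Q', 'ARG' => 'R',
  'SER' => 'S', 'THR' => 'T', 'VAL' => 'V', 'TRP' => 'W', 'TYR' => 'Y'
}.freeze

# Converts an Array of three-letter codes into a one-letter String.
# Unknown codes raise a KeyError, so they are not silently skipped.
def seqres_to_one_letter(residues)
  residues.map { |residue| THREE_TO_ONE.fetch(residue.upcase) }.join
end

puts seqres_to_one_letter(%w[MET LEU SER ASP GLU]) # prints MLSDE
```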
9401
+ A .pdb file need not describe only a protein with
9402
+ a specific aminoacid sequence. It may also include DNA. An example
9403
+ for such a molecule is
9404
+ <b><a href="http://rcsb.org/pdb/explore/explore.do?structureId=2DGC">2dgc</a></b>,
9405
+ which includes a protein chain and a DNA chain.
9406
+
9407
+ As far as the **bioroebe project** is concerned, you can parse .pdb files
9408
+ via the following class:
9409
+
9410
+ Bioroebe::ParsePdbFile.new
9411
+ Bioroebe::ParsePdbFile.new(path_to_the_pdb_file_here)
9412
+ Bioroebe::ParsePdbFile.new('/foo/bar/ack.pdb')
9413
+
9414
+ This class also allows some shortcuts for integrated .pdb files,
9415
+ that is files that are bundled with the bioroebe project:
9416
+
9417
+ Bioroebe::ParsePdbFile.new ':1fat'
9418
+
9419
+ This requires a String because ruby symbols may not start with
9420
+ a number. Note that this also works through the commandline,
9421
+ such as:
9422
+
9423
+ parse_pdb_file :1fat
9424
+
9425
+ A shell such as bash does not understand ruby symbols, so instead
9426
+ a string will be passed in, namely :1fat. The ParsePdbFile class will
9427
+ handle this correctly internally.
9428
+
9429
+ Note that a small bug was fixed in the file parse_pdb_file.rb;
9430
+ some entries were skipped due to an erroneous loop in the ruby
9431
+ file. This was corrected in **May 2020**.
9432
+
9433
+ In **March 2021** the ability to use entries such as ':1fat'
9434
+ was removed again; the code remains though. The reason why
9435
+ this was removed was that the .pdb files are quite large,
9436
+ so distributing them via the bioroebe project makes no real
9437
+ sense. Consider simply downloading the .pdb files; you
9438
+ can do this from the bioshell or via something
9439
+ like:
9440
+
9441
+ pdb 5TIM
9442
+
9443
+ Note that you can also return the aminoacid-sequence from a
9444
+ .pdb file directly, as of **May 2020**.
9445
+
9446
+ Example for this:
9447
+
9448
+ Bioroebe.return_aminoacid_sequence_from_this_pdb_file "1VII.pdb" # => "MLSDEDFKAVFGMTRSAFANLPLWKQQNLKKEKGLF"
9449
+
9450
+ The first argument should be **the path to the (local)
9451
+ .pdb file at hand**. (In theory support for remote .pdb
9452
+ files could also be added easily, but right now this
9453
+ is not possible, so you have to download it first.)
9454
+
9455
+ The **specification for .pdb files** can be read at the following
9456
+ two remote resources:
9457
+
9458
+ http://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html
9459
+ http://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#ATOM
9460
+
9461
+ Note that the parse_pdb_file.rb can also do some additional
9462
+ things, such as calculating the maximum distance between
9463
+ atoms in that file, via the method
9464
+ **.try_to_determine_the_max_distance_between_the_atoms_in_this_protein()**.
9465
+
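What "maximum distance between atoms" means can be shown with a short sketch. This is not the bioroebe implementation; the method name `max_distance_between` is hypothetical, and the input is assumed to be an Array of [x, y, z] coordinates such as those found in the ATOM entries. It simply compares every pair of atoms, which is fine for small inputs but grows quadratically with the number of atoms.

```ruby
# Returns the largest Euclidean distance between any two points,
# or nil if fewer than two points are given.
def max_distance_between(points)
  points.combination(2).map { |a, b|
    Math.sqrt(a.zip(b).sum { |p, q| (p - q)**2 })
  }.max
end

# Coordinates of the three VAL atoms from the sample ATOM entries.
val_atoms = [
  [32.433, 16.336, 57.540],
  [31.132, 16.439, 58.160],
  [30.447, 15.105, 58.363]
]
puts max_distance_between(val_atoms).round(3)
```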
9466
+ If you wish to report the secondary structures from a given .pdb file
9467
+ then you can use the following class:
9468
+
9469
+ require 'bioroebe/pdb/report_secondary_structures_from_this_pdb_file.rb'
9470
+
9471
+ Bioroebe::ReportSecondaryStructuresFromThisPdbFile.new
9472
+ Bioroebe::ReportSecondaryStructuresFromThisPdbFile.new('foobar.pdb')
9473
+
9474
+ If you wish to obtain the FASTA sequence of a particular remote
9475
+ .pdb file then you can use this API:
9476
+
9477
+ x = Bioroebe.return_fasta_sequence_from_this_pdb_file "2bts" # => "MLSDEDFKAVFGMTRSAFANLPLWKQQNLKKEKGLF"
9478
+
9479
+ Keep in mind that this is the FASTA sequence; the .pdb file itself
9480
+ has another format, and contains a lot more information, such as
9481
+ the various ATOM entries.
9482
+
9483
+ As of **June 2020** the command **fetch** also works from
9484
+ within the Bioshell, similar to how pymol **works**. This allows
9485
+ us to quickly download a remote .pdb file.
9486
+
9487
+ fetch 2BTS
9488
+
9489
+ You can also use the following toplevel-API to download a remote
9490
+ .pdb file:
9491
+
9492
+ Bioroebe.download_this_pdb
9493
+ Bioroebe.download_this_pdb '355D'
9494
+ Bioroebe.download_this_pdb '1K4R' # This is the Dengue Virus
9495
+ Bioroebe.download_this_pdb '1fat.pdb' # Lectin Phytohemagglutinin
9496
+
9497
+ This will refer to a remote URL such as
9498
+ https://files.rcsb.org/view/1FAT.pdb.
9499
+
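The mapping from an input such as '1fat.pdb' to the remote URL can be sketched as follows. This is an assumption about the general shape of that step, not the actual bioroebe code; `pdb_view_url` is a hypothetical helper that only builds the URL string and does not download anything.

```ruby
# Builds the remote URL for a PDB id; a trailing ".pdb" is optional
# and the id is upcased, as in the example URL above.
def pdb_view_url(id)
  code = File.basename(id.to_s, '.pdb').upcase
  "https://files.rcsb.org/view/#{code}.pdb"
end

puts pdb_view_url('1fat.pdb') # prints https://files.rcsb.org/view/1FAT.pdb
puts pdb_view_url('355D')     # prints https://files.rcsb.org/view/355D.pdb
```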
9500
+ Note that this will be automatically moved to the "correct" default
9501
+ position in the bioroebe-project, under the **pdb/** subdirectory.
9502
+
9503
+ You can also invoke this script from the commandline via
9504
+ **bin/download_this_pdb**, like in this way:
9505
+
9506
+ download_this_pdb 355D
9507
+
9508
+ This works with several .pdb files in one go as well:
9509
+
9510
+ download_this_pdb 1NR6 2F9Q 3TDA 2HI4 2V0M
9511
+
9512
+ They would all be downloaded one after the other. Be aware that
9513
+ this will overwrite any existing .pdb files of the same name there, so
9514
+ if you don't want this, I recommend backing up the
9515
+ **pdb/** subdirectory before invoking the above call.
9516
+
9517
+ You can also turn the FASTA sequence stored in a .pdb file into
9518
+ a .fasta file, via **--create-fasta-file**.
9519
+
9520
+ Usage examples:
9521
+
9522
+ parsedb 1NR6 --create-fasta-file
9523
+ parsedb 2F9Q --create-fasta-file
9524
+ parsedb 3TDA --create-fasta-file
9525
+ parsedb 2HI4 --create-fasta-file
9526
+ parsedb 2V0M --create-fasta-file
9527
+
9528
+ So if you have a file called <b>1NR6.pdb</b> and you use
9529
+ the first input, a .fasta file will be created. If such
9530
+ a .pdb file does not exist then this will not work, so
9531
+ make sure to download the .pdb file before invoking
9532
+ this commandline-flag.
9533
+
9534
+ Last but not least, the following table shall document the
9535
+ PDB format - it is not yet complete, but the remaining
9536
+ record types are intended to be added eventually:
9537
+
9538
+ Record Name   Describes
9539
+ MODRES        Modifications to standard residues
9540
+ HET           Nonstandard residues (as well as ligands, ions and water)
9541
+ HETNAM        Full chemical name of the residue
9542
+ HETSYM        Synonyms for the residue
9543
+ FORMUL        Chemical formula of the residue
9544
+ KEYWDS        Specifies keywords, such as "FK506 BINDING PROTEIN, FKBP12, CIS-TRANS PROLYL-ISOMERASE, ROTAMASE"
9545
+
9546
+
9547
+ ## Determining how many stop codons exist in a given sequence
9548
+
9549
+ You can use **bin/n_stop_codons_in_this_sequence** to determine
9550
+ how many stop codons exist in a given sequence at hand.
9551
+
9552
+ Usage example from the commandline:
9553
+
9554
+ n_stop_codons_in_this_sequence ATGACGTACGTCAGTCAGTGATAGTAA # => 4
9555
+
9556
+ You can of course also separate the codons with spaces on the
9557
+ commandline:
9558
+
9559
+ n_stop_codons_in_this_sequence ATG ACG TAC GTC AGT CAG TGA TAG TAA # => 4
9560
+
9561
+ Internally this makes use of the method called
9562
+ <b>Bioroebe.n_stop_codons_in_this_sequence?</b> or one of its
9563
+ aliased names. Usage example for the method, just as in the
9564
+ first example shown above:
9565
+
9566
+ Bioroebe.n_stop_codons_in_this_sequence "ATGACGTACGTCAGTCAGTGATAGTAA" # => 4
9567
+
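The result of 4 for this sequence deserves a short illustration. Read in-frame from the first nucleotide, the sequence contains three stop codons (TGA, TAG, TAA); scanning every position regardless of reading frame finds a fourth one, the TGA inside ATGA at the very start. The sketch below shows both ways of counting in plain Ruby. How bioroebe counts internally is not stated here, so treat the method names `stop_codons_any_frame` and `stop_codons_in_frame` as hypothetical.

```ruby
STOP_CODONS = %w[TAA TAG TGA].freeze

# Counts stop codons at every position, ignoring the reading frame.
# The lookahead lets the scan advance one nucleotide at a time.
def stop_codons_any_frame(sequence)
  sequence.delete(' ').upcase.scan(/(?=(TAA|TAG|TGA))/).size
end

# Counts stop codons only in the reading frame starting at position 1.
def stop_codons_in_frame(sequence)
  sequence.delete(' ').upcase.scan(/.{3}/).count { |codon| STOP_CODONS.include?(codon) }
end

sequence = 'ATGACGTACGTCAGTCAGTGATAGTAA'
puts stop_codons_any_frame(sequence) # prints 4
puts stop_codons_in_frame(sequence)  # prints 3
```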
9568
+ ## The Aliphatic Index of Globular Proteins
9569
+
9570
+ In 1980, Atsushi IKAI provided a formula with which one can
9571
+ calculate the aliphatic index of a globular protein, in a short paper
9572
+ titled "Thermostability and aliphatic index of globular proteins"
9573
+ (<b>PMID: 7462208</b>,
9574
+ <a href="https://www.jstage.jst.go.jp/article/biochemistry1922/88/6/88_6_1895/_article">
9575
+ see here</a>).
9576
+
9577
+ Ikai provided a statistical analysis of proteins, and determined
9578
+ that the aliphatic index - which is defined as the relative volume
9579
+ of a protein occupied by <b>aliphatic side chains</b> (alanine, valine,
9580
+ isoleucine, and leucine) - of proteins of thermophilic bacteria
9581
+ is significantly higher than that of ordinary proteins.
9582
+
9583
+ Ikai reasoned that the index may be regarded as a positive
9584
+ factor for the <b>increase of thermostability of globular
9585
+ proteins</b>. The enzymes of some organisms are more stable
9586
+ at higher temperatures than the enzymes of other organisms;
9587
+ such enzymes are called <b>thermostable proteins</b>.
9588
+
9589
+ Thus, there is a good correlation between the "aliphatic
9590
+ index" on the one hand, and the thermostability of proteins
9591
+ on the other hand.
9592
+
9593
+ Ikai gave the following formula for calculating this:
9594
+
9595
+ Aliphatic Index = XA + a * XV + b * (XI + XL)
9596
+
9597
+ The four letters A, V, I and L refer to the four aminoacids
9598
+ Alanine, Valine, Isoleucine and Leucine; X is the mole percentage
9599
+ of each in the protein. The coefficients a and b are the volumes
9600
+ of the side chains of Valine (a) and of Leucine/Isoleucine (b) relative
9601
+ to that of Alanine. a has a range of 2.8-3.0 and b of 3.8-4.0.
9602
+
9603
+ The method called <b>.aliphatic_index()</b> makes use of that
9604
+ formula. As values for a and b the two values <b>2.9</b> and
9605
+ <b>3.9</b> have been taken. The code in the bioroebe project
9606
+ for this has been inspired by: https://github.com/wwood/bioruby-aliphatic_index
9607
+
9608
+ That project gives the following usage example for bioruby:
9609
+
9610
+ Bio::Sequence::AA.new('MVKSYDRYEYEDCLGIVNSKSSNCVFLNNA').aliphatic_index # => 71.33333
9611
+
9612
+ In bioroebe, the equivalent would be:
9613
+
9614
+ Bioroebe::Protein.new('MVKSYDRYEYEDCLGIVNSKSSNCVFLNNA').aliphatic_index # => 71.33333
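The value for the example sequence above can be reproduced by hand. The following is a minimal stand-alone sketch of the formula, not the bioroebe or bioruby code; the free-standing method `aliphatic_index` and its keyword arguments are hypothetical, the defaults a = 2.9 and b = 3.9 are the values mentioned above, and the X terms are taken as mole percentages (100 times the count of a residue divided by the sequence length).

```ruby
# Aliphatic index = X(A) + a * X(V) + b * (X(I) + X(L)),
# where X(..) is the mole percentage of that residue.
def aliphatic_index(sequence, a: 2.9, b: 3.9)
  seq = sequence.upcase
  percent = ->(letter) { 100.0 * seq.count(letter) / seq.length }
  percent.('A') + a * percent.('V') + b * (percent.('I') + percent.('L'))
end

# 30 residues with 1 A, 3 V, 1 I and 2 L:
# 3.333 + 2.9 * 10.0 + 3.9 * (3.333 + 6.667)
puts aliphatic_index('MVKSYDRYEYEDCLGIVNSKSSNCVFLNNA').round(5) # prints 71.33333
```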
9183
9615
 
9184
9616
  ## Possibly useful links in regards to molecular biology and science in general
9185
9617