tutorialgcskewenglish

Differences

This shows you the differences between two versions of the page.

tutorialgcskewenglish [2008/04/14 07:01]
gaou
tutorialgcskewenglish [2014/01/18 07:44]
Line 1: Line 1:
-====== Introduction ====== 
  
-GC skew is a parameter that measures the bias between G and C on a single strand of a DNA molecule; in this tutorial it is calculated as (C-G)/(C+G). Chargaff's second parity rule states that the amounts of G and C on a single strand are almost equal. This holds over the whole genome, but a bias can be seen in smaller regions. In some bacteria the direction of this bias switches sharply at particular positions, and these positions are known to coincide with the replication origin and terminus. There are two main hypotheses for how GC skew arises: different mutation probabilities on the leading and lagging strands, and mutation bias caused by codon usage.
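
As a minimal sketch of the formula above, the following plain Perl computes the skew of a single sequence string without the G-language System. The helper name gc_skew() is hypothetical. The sign follows this tutorial, (C-G)/(C+G); many papers use the opposite sign, (G-C)/(G+C), so check the convention before comparing values.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the G-language System.
# Returns (C-G)/(C+G) for one strand, or 0 if there is no G or C at all.
sub gc_skew {
    my ($seq) = @_;
    my $c = ($seq =~ tr/Cc//);   # tr with an empty replacement only counts
    my $g = ($seq =~ tr/Gg//);
    return 0 if $c + $g == 0;
    return ($c - $g) / ($c + $g);
}

printf "%.3f\n", gc_skew("ccccgaat");   # 4 C and 1 G: (4-1)/(4+1), prints 0.600
```

A positive value means more C than G on the given strand, and a negative value means more G than C.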
- 
- 
- 
- 
-==== Reference: ====
- 
-  * Lobry, J.R., 1996. Asymmetric substitution patterns in the two DNA strands of bacteria. Mol. Biol. Evol. 13, 660-665 
- 
- 
- 
- 
- 
- 
- 
- 
- 
- 
- 
- 
- 
-====== Step 0 - Starting up G-language System ====== 
- 
- 
-Now, let's start analyzing GC skew. First, start up the G-language System, which makes genome analysis very simple. For example, to analyze the //E. coli// genome in GenBank format bundled with the G-language System, just type the two lines below and the preparation is done.
-
-> filename: test.pl
-<code perl>
-  use G; 
-  $gb = new G("ecoli"); 
-</code>
-
-To test it, run the script. Output like the following will appear.
-
-  $ perl test.pl [ENTER]
-  
-  Length of Sequence :   4639221
-           A Content :   1142136 (24.62%)
-           T Content :   1140877 (24.59%)
-           G Content :   1176775 (25.37%)
-           C Content :   1179433 (25.42%)
-              Others :         0 (0.00%)
-          AT Content :    49.21%
-          GC Content :    50.79%
-
-By printing these base composition statistics, the G-language System indicates that it has loaded the file successfully.
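
To make the statistics above concrete, here is a rough sketch in plain Perl of how such a base composition summary could be computed. It is not the G-language System's own code; the helper name base_counts() and the toy sequence are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count A, T, G, C and everything else in a sequence.
sub base_counts {
    my ($seq) = @_;
    my %n = (
        A => ($seq =~ tr/Aa//),
        T => ($seq =~ tr/Tt//),
        G => ($seq =~ tr/Gg//),
        C => ($seq =~ tr/Cc//),
    );
    $n{Others} = length($seq) - $n{A} - $n{T} - $n{G} - $n{C};
    return %n;
}

my $seq = "atgcgcgcta";              # toy sequence, not real genome data
my %n   = base_counts($seq);
my $len = length($seq);

printf "Length of Sequence : %d\n", $len;
foreach my $base (qw(A T G C)) {
    printf "%s Content : %d (%.2f%%)\n", $base, $n{$base}, 100 * $n{$base} / $len;
}
printf "Others : %d\n", $n{Others};
printf "GC Content : %.2f%%\n", 100 * ($n{G} + $n{C}) / $len;
```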
- 
-To use a local genome flat file in a format other than GenBank (Fasta, EMBL, swiss, SCF, PIR, GCG, raw, ace, etc.), specify the data format as the second argument:
-
-<code perl>
-   use G; 
-   $gb = new G("ecoli.fasta", "Fasta"); 
-</code>
- 
- 
- 
- 
- 
-====== Step 1 - GC skew observation ======
-
-The G-language System comes with various standard functions for genome analysis, and GC skew is one of them. Let's add a line that calls the gcskew() function to the script made in Step 0 and run it.
-
-> filename: test.pl
-<code perl>
-  use G; 
-  $gb = new G("ecoli"); 
-  gcskew($gb); 
-</code>
-
-A graph like the one below will appear.
-
-{{http://www.g-language.org/data/gaou/gcskew/gcskew.gif?400}}
-
-This is the GC skew of the //E. coli// genome, plotted with a window size of 10 kbp. There is a very distinctive pattern, with shift points near 1,500,000 bp and 3,800,000 bp.
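
The windowed calculation behind such a graph can be sketched in plain Perl as follows. This is only an illustration: the helper name windowed_gc_skew() is hypothetical, and the use of consecutive, non-overlapping windows is an assumption rather than a description of how gcskew() is implemented.

```perl
use strict;
use warnings;

# Hypothetical helper: (C-G)/(C+G) in consecutive, non-overlapping windows.
# A trailing piece shorter than one window is ignored.
sub windowed_gc_skew {
    my ($seq, $window) = @_;
    my @skew;
    for (my $pos = 0; $pos + $window <= length($seq); $pos += $window) {
        my $chunk = substr($seq, $pos, $window);
        my $c = ($chunk =~ tr/Cc//);
        my $g = ($chunk =~ tr/Gg//);
        push @skew, ($c + $g) ? ($c - $g) / ($c + $g) : 0;
    }
    return @skew;
}

# Toy sequence with a window of 4 bases: cccc, gggg, ccgc
my @skew = windowed_gc_skew("ccccggggccgc", 4);
print join(", ", map { sprintf "%.2f", $_ } @skew), "\n";   # 1.00, -1.00, 0.50
```

Plotting one value per window against the window start position gives a curve of the kind shown above; a change of sign between neighboring windows corresponds to a shift point.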
- 
-All data loaded at G-language System startup is stored in $gb. For example, the entire base sequence is held in
-
-<code perl>
-  $gb->{SEQ} 
-</code>
-
-Most of the standard functions work by being given this $gb.
- 
- 
- 
- 
-==== Exercise 1: ====
-Change the genome loaded with $gb = new G("ecoli"); to //B. subtilis// ("bsub"), //Mycoplasma genitalium// ("mgen"), etc., and observe their characteristics. Compare and discuss the results.
- 
- 
- 
- 
- 
-====== Step 2 - GC skew analysis ======
-
-The functions of the G-language System are not only useful but also offer many options.
-
-For example, the gcskew() function has the following options.
-^option ^description^
-|-window |Window size used when calculating GC skew (default is 10,000 bp)|
-|-at |Calculates AT skew instead when set to 1 (default is 0)|
-|-output |f writes a data file, g writes a graph file, show displays the graph (default is show)|
-|-filename |Output file name (by default data/gcskew.csv when -output is f, and graph/gcskew.gif when -output is g)|
-
-Options are specified as follows:
-<code perl>
-  gcskew($gb, -window=>50000, -at=>1); 
-</code>
-A "-" is put at the head of the option name, and the value is connected with "=>".
- 
-For a more detailed analysis of //E. coli//, let's look at its GC skew and AT skew with a 50 kbp window. AT skew, like GC skew, measures the bias between the amounts of A and T. Its tendency differs from that of GC skew and is less pronounced.
-
-The script is as follows.
-
-> filename: test.pl
-<code perl>
-  use G; 
-  $gb = new G("ecoli"); 
-  gcskew($gb, -window=>50000, -filename=>"gcskew50k.gif"); 
-  gcskew($gb, -window=>50000, -at=>1, -filename=>"atskew50k.gif"); 
-</code>
-The resulting graphs are as follows.
-
-GC skew: 50 kbp window
-
-{{http://www.g-language.org/data/gaou/gcskew/gcskew50k.gif?400}}
-
-AT skew: 50 kbp window
-
-{{http://www.g-language.org/data/gaou/gcskew/atskew50k.gif?400}}
-
-With a 50 kbp window, the shift points of GC skew are much clearer. The AT skew result, however, is not as pronounced as that of GC skew.
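
For readers curious how options in the "-name => value" style can be received, here is a small sketch in plain Perl. The function window_skew() is hypothetical and is not the G-language gcskew(). The AT skew formula used here, (A-T)/(A+T), is an assumption made by analogy with the GC skew formula of this tutorial.

```perl
use strict;
use warnings;

# Hypothetical helper, not the G-language gcskew() itself.
# Options arrive as a hash whose keys start with "-".
sub window_skew {
    my ($seq, %opt) = @_;
    my $window = $opt{-window} // 10000;        # default taken from the option table
    # Assumption: AT skew is (A-T)/(A+T); GC skew is (C-G)/(C+G) as in this tutorial.
    my ($x, $y) = $opt{-at} ? ('a', 't') : ('c', 'g');
    my @skew;
    for (my $pos = 0; $pos + $window <= length($seq); $pos += $window) {
        my $chunk = substr($seq, $pos, $window);
        my $nx = () = $chunk =~ /$x/gi;         # occurrences of the first base
        my $ny = () = $chunk =~ /$y/gi;         # occurrences of the second base
        push @skew, ($nx + $ny) ? ($nx - $ny) / ($nx + $ny) : 0;
    }
    return @skew;
}

my $seq = "aaatccgg" . "ttttcccg";              # toy sequence, two windows of 8
my @gc  = window_skew($seq, -window => 8);
my @at  = window_skew($seq, -window => 8, -at => 1);
print "GC skew: @gc\n";                         # GC skew: 0 0.5
print "AT skew: @at\n";                         # AT skew: 0.5 -1
```

Perl turns -window into the string "-window" when it appears before "=>" or inside a hash subscript, which is why this calling style works without quotes.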
- 
- 
- 
-==== Exercise 2: ====
-Look at how the GC skew and AT skew tendencies differ between species. Also see whether anything different appears when the window size is changed.
- 
- 
- 
- 
- 
- 
- 
- 
- 
- 
- 
-====== Step 3 - GC skew seen from different angles ======
-
-As explained in the introduction, GC skew is thought to arise from different mutation rates on the leading and lagging strands, and from codon usage.
-
-If GC skew can be explained by different mutation on the leading and lagging strands, then conversely it should be possible to locate the replication origin and terminus from GC skew. Indeed, the GC skew shift points coincide with the replication origin and terminus. So let's predict the replication origin and terminus using GC skew.
-
-Cumulative GC skew is useful for identifying shift points. It accumulates the GC skew value of each window, and its characteristic feature is that the shift points stand out clearly. In the G-language System, cumulative GC skew can be calculated with the option -cumulative=>1:
-<code perl>
-   gcskew($gb, -window=>50000, -cumulative=>1); 
-</code>
-Running this should display a graph like the one below.
-
-{{http://www.g-language.org/data/gaou/gcskew/cum_gcskew.gif?400}}
-
-Now try changing the window size and displaying several cumulative GC skew graphs. Note that the shift points appear very clearly.
-
-Next, let's predict the replication origin and terminus using cumulative GC skew. The G-language System includes the standard function find_ori_ter(), which uses cumulative GC skew internally to predict the replication origin and terminus. It is used as follows:
-<code perl>
-   find_ori_ter($gb, -window=>500); 
-</code>
-
-For //E. coli//, for example, a prediction like the following is output.
-
-  find_ori_ter:
-     Predicted Origin:   3923622
-     Predicted Terminus: 1550412
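
The idea of reading shift points off a cumulative curve can be sketched in plain Perl as below. This is an illustration under stated assumptions, not the implementation of find_ori_ter(): the helper names are hypothetical, windows are non-overlapping, and the interpretation (with the (C-G)/(C+G) sign used in this tutorial, the maximum of the running sum marks the origin and the minimum marks the terminus) is assumed from the usual observation that the leading strand is richer in G than in C.

```perl
use strict;
use warnings;

# Hypothetical helper: running sum of per-window (C-G)/(C+G).
sub cumulative_gc_skew {
    my ($seq, $window) = @_;
    my $sum = 0;
    my @cum;
    for (my $pos = 0; $pos + $window <= length($seq); $pos += $window) {
        my $chunk = substr($seq, $pos, $window);
        my $c = ($chunk =~ tr/Cc//);
        my $g = ($chunk =~ tr/Gg//);
        $sum += ($c - $g) / ($c + $g) if $c + $g;
        push @cum, $sum;
    }
    return @cum;
}

# Hypothetical helper: window indices of the largest and smallest running sum.
sub extremes {
    my @cum = @_;
    my ($max_i, $min_i) = (0, 0);
    for my $i (1 .. $#cum) {
        $max_i = $i if $cum[$i] > $cum[$max_i];
        $min_i = $i if $cum[$i] < $cum[$min_i];
    }
    return ($max_i, $min_i);
}

# Toy sequence: G-rich, then C-rich, then G-rich again.
my $seq = ("g" x 8) . ("c" x 12) . ("g" x 4);
my @cum = cumulative_gc_skew($seq, 4);
my ($max_i, $min_i) = extremes(@cum);
print "cumulative: @cum\n";                                  # -1 -2 -1 0 1 0
print "max at window $max_i, min at window $min_i\n";        # max 4, min 1
```

Multiplying a window index by the window size gives an approximate position in bp, so a smaller window gives a finer position estimate at the cost of a noisier curve.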
- 
-Because GC skew is said to be related to codon usage, interesting results may be seen by comparing the GC skew of the whole genome, the coding regions, the non-coding regions, and the third codon position, where variation is most pronounced. In the G-language System, a function for this analysis is also provided as standard: genomicskew().
-
-The genomicskew() function has the following options.
-^option ^description^
-|-divide |Total number of windows used when calculating GC skew (default is 250)|
-|-at |Calculates AT skew instead when set to 1 (default is 0)|
-|-output |f writes a data file, g writes a graph file, show displays the graph (default is show)|
-|-filename |Output file name (by default data/genomicskew.csv when -output is f, and graph/genomicskew.gif when -output is g)|
-|-application |Application used to display the image (default is gimp)|
-
-   genomicskew($gb, -divide=>250); 
-
-Let's run it as above. Note that the first argument is $gb.
-
-The output is as follows. Consider it together with all the previous analyses.
-
-{{http://www.g-language.org/data/gaou/gcskew/genomicskew.gif?400}}
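
To illustrate what restricting the skew to the third codon position means, here is a small sketch in plain Perl. It does not reproduce genomicskew(); the helper names and the toy coding sequence are hypothetical, and the sequence is assumed to start in frame.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of the G-language System.
sub gc_skew {
    my ($seq) = @_;
    my $c = ($seq =~ tr/Cc//);
    my $g = ($seq =~ tr/Gg//);
    return ($c + $g) ? ($c - $g) / ($c + $g) : 0;
}

# Collect the third base of every codon, assuming the sequence starts in frame.
sub third_positions {
    my ($cds) = @_;
    my $third = '';
    for (my $i = 2; $i < length($cds); $i += 3) {
        $third .= substr($cds, $i, 1);
    }
    return $third;
}

my $cds = "atgaccgccttctaa";                        # toy CDS: atg acc gcc ttc taa
printf "all positions:   %.3f\n", gc_skew($cds);                    # 0.429
printf "third positions: %.3f\n", gc_skew(third_positions($cds));   # 0.500
```

In this toy sequence the skew at third positions is stronger than the skew over all positions, which is the kind of difference the comparison above is meant to reveal.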
- 
- 
- 
-==== Exercise 3: ====
-GC skew analysis has many more uses. Try various analyses with the G-language System's wide range of standard functions. Comparing the results with each other may turn up something interesting.
- 
- 
- 
- 
- 
-====== Step 4 - For advanced analysis ======
-
-
-The G-language System comes with various standard genome analysis functions, which makes a wide range of analyses possible. As a bioinformatician, however, you should treat these functions only as tools for investigating biological phenomena more deeply.
-
-The G-language System not only provides genome analysis functions but also a platform for handling genome databases with ease. For details, refer to the documentation in G.pm. That platform consists of the methods called on the G-language System instance $gb, which handle each gene, the surroundings of start/stop codons, introns/exons, and so on.
-
-GC skew analysis is often not the whole analysis but one necessary tool in the process.
-
-For example, when investigating the relation between each gene's expression and its second codon, one idea is to use GC skew to detect tendencies in the neighboring region. Here is one way to express this with the G-language System.
-<code perl>
-   use G; 
-   $gb = new G("ecoli"); 
-   cai($gb); 
-   $w_val = w_value($gb); 
-   
-   foreach $cds ($gb->cds()){ 
-      $secondcodon = $gb->after_startcodon($cds, 3); 
-      $w_second = $$w_val{$secondcodon}; 
-      $cai = $gb->{$cds}->{cai}; 
-      if ($w_second < 0.5 && $cai > 0.8){ 
-         $afterstart = $gb->after_startcodon($cds, 99); 
-         gcskew(\$afterstart, -window=>9, -filename=>"$cds-gcskew.gif"); 
-      } 
-    } 
-</code>
-First, start up the G-language System and load the genome database. The cai() function attaches the CAI value to each gene (Codon Adaptation Index: a measure of translation efficiency that can also be used as an indicator of gene expression level). The W value, which expresses codon usage bias, is obtained with w_value() and stored in $w_val.
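
As background on how the two quantities relate, the CAI of a gene is commonly defined as the geometric mean of the W values of its codons. The sketch below shows that calculation in plain Perl. It is not the G-language cai() function: the helper name cai_from_w(), the lowercase codon keys, the choice to skip codons without a W value, and the W values themselves are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: geometric mean of the W values of the codons in a CDS.
# Codons missing from the table are skipped (an assumption of this sketch).
sub cai_from_w {
    my ($cds, $w) = @_;
    my ($log_sum, $n) = (0, 0);
    for (my $i = 0; $i + 3 <= length($cds); $i += 3) {
        my $codon = lc substr($cds, $i, 3);
        next unless exists $w->{$codon};
        $log_sum += log($w->{$codon});      # W values must be greater than 0
        $n++;
    }
    return $n ? exp($log_sum / $n) : 0;
}

my %w = (ctg => 1.0, cta => 0.25);          # made-up W values, not real data
printf "%.2f\n", cai_from_w("ctgcta", \%w); # sqrt(1.0 * 0.25), prints 0.50
```

A gene built mostly from codons with W close to 1 gets a CAI close to 1, while a single codon with a low W value, such as a rare second codon, lowers the CAI only slightly.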
- 
-The most basic way to process every CDS with the G-language System is the following loop:
-
-<code perl>
-   foreach $cds ($gb->cds()){ 
-   
-   } 
-</code>
-
-$gb->cds() returns the names of all CDSs in the genome database held in $gb. In other words, looping with foreach makes it possible to analyze every CDS.
-
-$gb holds the FEATURE information in a structure like this:
-
->LOCUS, HEADER, FEATURE1 . FEATURE2 ... FEATURE4000 ... , CDS1 . CDS2 ... CDS4000 ..., SEQ 
-
-That is, the information for each CDS is stored in a structure named CDS plus a number, and each item is accessed hierarchically, as in $gb->{CDS534}->{start}.
-
-<code perl>
-      $secondcodon = $gb->after_startcodon($cds, 3); 
-      $w_second = $$w_val{$secondcodon}; 
-      $cai = $gb->{$cds}->{cai}; 
-</code>
-
-This part first takes the three bases after the start codon of $cds with the standard function after_startcodon() and stores them in $secondcodon. It also obtains the W value of that codon and the CAI value of the gene.
-
-<code perl>
-      if ($w_second < 0.5 && $cai > 0.8){ 
-         $afterstart = $gb->after_startcodon($cds, 99); 
-         gcskew(\$afterstart, -window=>9, -filename=>"$cds-gcskew.gif"); 
-      } 
-</code>
-
-This section looks at the GC skew of the 99 bp downstream of the start codon, in 9 bp windows (three codons), for genes whose CAI value is greater than 0.8 and whose second codon has a W value of less than 0.5; in other words, highly expressed genes with a rare codon placed just after the start codon.
-
-This script is somewhat complex, but it has only 13 lines. The G-language System makes complex analyses like this efficient and easy.
-
-For the rest, go on your own quest to discover the truth of life using the G-language System.
tutorialgcskewenglish.txt · Last modified: 2014/01/18 07:44 (external edit)