tutorialgcskewenglish

Differences

This shows you the differences between two versions of the page.

tutorialgcskewenglish [2007/08/20 07:17]
gaou
tutorialgcskewenglish [2014/01/18 07:44] (current)
Line 1: Line 1:
 ====== Introduction ====== ====== Introduction ======
  
-GC skew is a parameter that represents the bias between the amounts of G and C in a single strand of a DNA molecule; its formula is (C-G)/(C+G). Chargaff's second parity rule states that the quantities of G and C in a single strand are almost equal. This holds over the whole genome, but a bias can be seen in local regions. In fact, in some bacterium this bias can be seen to shift sharply at particular positions, and it is known that those positions match the replication/ending points. There is two theories about how GC skew arises: different mutation probabilities in the leading and lagging strands, and mutation bias from codon usage. +GC skew is a parameter that represents the bias between the amounts of G and C in a single strand of a DNA molecule; its formula is (C-G)/(C+G). Chargaff's second parity rule states that the quantities of G and C in a single strand are almost equal. This holds over the whole genome, but a bias can be seen in local regions. In fact, in some bacteria this bias can be seen to shift sharply at particular positions, and it is known that those positions match the replication origin/terminus. There are two theories about how GC skew arises: different mutation probabilities in the leading and lagging strands, and mutation bias from codon usage. 
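
The (C-G)/(C+G) formula can be tried without the G-language System. The following is a minimal sketch in plain Perl; the function name gc_skew and the toy sequence are hypothetical and are not part of the G-language System.

```perl
use strict;
use warnings;

# GC skew of one strand, using the tutorial's definition (C-G)/(C+G).
# gc_skew is a hypothetical helper, not a G-language System function.
sub gc_skew {
    my ($seq) = @_;
    my $g = () = $seq =~ /g/gi;    # count G (case-insensitive)
    my $c = () = $seq =~ /c/gi;    # count C (case-insensitive)
    return 0 if $g + $c == 0;      # avoid dividing by zero
    return ($c - $g) / ($c + $g);
}

# Toy sequence with 4 C and 1 G: (4-1)/(4+1)
printf "%.3f\n", gc_skew("ccccgatt");    # prints 0.600
```

A positive value means the strand carries more C than G, a negative value more G than C.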
  
  
-\\ 
  
 ==== Reference: ​ ==== ==== Reference: ​ ====
Line 20: Line 20:
  
  
-\\+
  
 ====== Step 0 - Starting up G-language System ====== ====== Step 0 - Starting up G-language System ======
  
  
-Now, let's start analyzing GC skew. Before starting, start up the G-language System. Genome analysis with the G-language System is very simple. For example, to analyze E.coli's genome in Genbank style, just type the two lines below and the preparation is done.+Now, let's start analyzing GC skew. Before starting, start up the G-language System. Genome analysis with the G-language System is very simple. For example, to analyze the //E.coli// genome in GenBank format bundled with G-language System, just type the two lines below and the preparation is done.
  
 >​filename:​ test.pl >​filename:​ test.pl
 +<code perl>
   use G;    use G; 
-  $gb = new G("​ecoli.gbk"​); ​+  $gb = new G("​ecoli"​); ​ 
 +</​code>​
  
 To test, run the script. Output like the following will appear. To test, run the script. Output like the following will appear.
Line 45: Line 46:
           GC Content :    50.79% ​           GC Content :    50.79% ​
  
-By printing these base composition statistics, the G-language System indicates that it has downloaded the file successfully.+By printing these base composition statistics, the G-language System indicates that it has loaded the file successfully.
  
-To use a genome database other than Genbank (Fasta, EMBL, swiss, SCF, PIR, GCG, raw, ace, etc),+To use a local genome flatfile in a format other than Genbank (Fasta, EMBL, swiss, SCF, PIR, GCG, raw, ace, etc),
  
 +<code perl>
    use G;     use G; 
    $gb = new G("​ecoli.fasta",​ "​Fasta"​); ​    $gb = new G("​ecoli.fasta",​ "​Fasta"​); ​
 +</​code>​
 +
 +specify the data format as the second argument.
  
-specify the data format as the second argument and it will be downloaded. 
  
  
  
-\\ 
  
 ====== Step 1-GC skew observation ====== ====== Step 1-GC skew observation ======
Line 63: Line 66:
  
 > filename: test.pl > filename: test.pl
 +<code perl>
   use G;    use G; 
-  $gb = new G("​ecoli.gbk");  +  $gb = new G("​ecoli"​);​  
-  gcskew(\$gb->{SEQ}); +  gcskew($gb); ​ 
 +</​code>​
  
 A graph like the one below will appear. A graph like the one below will appear.
Line 72: Line 76:
 {{http://​www.g-language.org/​data/​gaou/​gcskew/​gcskew.gif?​400}} {{http://​www.g-language.org/​data/​gaou/​gcskew/​gcskew.gif?​400}}
  
-This is the GC skew of E.coli's genome, plotted with a window shifted 10 kbp at a time. A very distinctive tendency can be seen, with shift points near 1,500,000 bp and 3,800,000 bp. +This is the GC skew of the //E.coli// genome, plotted with a window shifted 10 kbp at a time. A very distinctive tendency can be seen, with shift points near 1,500,000 bp and 3,800,000 bp. 
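
The windowed calculation behind such a plot can be sketched in plain Perl as follows. This is only an illustration under assumptions: windowed_gc_skew is a hypothetical name, the windows do not overlap, and a trailing partial window is dropped; how gcskew() handles these details internally is not stated in this tutorial.

```perl
use strict;
use warnings;

# One (C-G)/(C+G) value per non-overlapping window.
# windowed_gc_skew is a hypothetical helper, not a G-language System function.
sub windowed_gc_skew {
    my ($seq, $window) = @_;
    my @skews;
    for (my $pos = 0; $pos + $window <= length $seq; $pos += $window) {
        my $chunk = substr($seq, $pos, $window);
        my $g = () = $chunk =~ /g/gi;
        my $c = () = $chunk =~ /c/gi;
        # A window without G or C gets a skew of 0.
        push @skews, ($c + $g) ? ($c - $g) / ($c + $g) : 0;
    }
    return @skews;
}

# Toy sequence and a 4 bp window; a real genome would use e.g. 10000.
my @skews = windowed_gc_skew("cccggggcatat", 4);
print join(" ", map { sprintf "%.2f", $_ } @skews), "\n";    # prints 0.50 -0.50 0.00
```

Plotting these values against the window start position gives a curve of the kind shown above.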
  
 All data loaded when the G-language System starts up is stored in $gb. For example, the entire base sequence is All data loaded when the G-language System starts up is stored in $gb. For example, the entire base sequence is
  
 +<code perl>
   $gb->​{SEQ} ​   $gb->​{SEQ} ​
 +</​code>​
 +
 +stored in the above. Most standard functions work when given this $gb.
  
-stored in the above. Most standard functions work when given this $gb or a reference to $gb->{SEQ} (if you do not understand $gb->{SEQ}, just memorize it as a set phrase). In gcskew(), a reference to $gb->{SEQ} is the first argument. 
  
  
-\\ 
  
 ==== Exercise 1: ==== ==== Exercise 1: ====
- Change the file loaded with $gb = new G("ecoli.gbk"); to B.subtilis, H.pylori, H.influenzae, etc and observe their characteristics. Also compare and think about the results.+ Change the file loaded with $gb = new G("ecoli"); to //B.subtilis// ("bsub"), //Mycoplasma genitalium// ("mgen"), etc and observe their characteristics. Also compare and discuss the results. 
  
  
-\\ 
  
  
 ====== Step 2-GC skew analysis ​ ====== ====== Step 2-GC skew analysis ​ ======
  
-G-language System's functions are not only useful but also have kaleidoscopic options.+G-language System's functions are not only useful but also have many options.
  
 For example, the gcskew() function has the following options. For example, the gcskew() function has the following options.
Line 102: Line 108:
  
 Options are written as follows: Options are written as follows:
 +<code perl>
   gcskew($gb, -window=>​50000,​ -at=>​1); ​   gcskew($gb, -window=>​50000,​ -at=>​1); ​
 +</​code>​
 As shown above, "-" is put at the head of the option name, and the value is connected with "=>". As shown above, "-" is put at the head of the option name, and the value is connected with "=>".
  
-For a more detailed analysis of E.coli's GC skew, let's look at the AT skew and GC skew of E.coli with a 50 kbp window. AT skew, like GC skew, measures the bias between the amounts of A and T; its tendency differs from that of GC skew and is not as noticeable.+For a more detailed analysis of //E.coli// GC skew, let's look at the AT skew and GC skew of //E.coli// with a 50 kbp window. AT skew, like GC skew, measures the bias between the amounts of A and T; its tendency differs from that of GC skew and is not as noticeable.
    
 The script is as follows. The script is as follows.
  
 > filename: test.pl > filename: test.pl
 +<code perl>
   use G;    use G; 
-  $gb = new G("​ecoli.gbk"​); ​+  $gb = new G("​ecoli"​); ​
   gcskew($gb, -window=>​50000,​ -filename=>"​gcskew50k.gif"​); ​   gcskew($gb, -window=>​50000,​ -filename=>"​gcskew50k.gif"​); ​
   gcskew($gb, -window=>​50000,​ -at=>1, -filename=>"​atskew50k.gif"​); ​   gcskew($gb, -window=>​50000,​ -at=>1, -filename=>"​atskew50k.gif"​); ​
 +</​code>​
 The graphs are as follows. The graphs are as follows.
  
 GC skew: 50 k bp window GC skew: 50 k bp window
  
-{{http://​www.g-language.org/​data/​gaou/​gcskew/​gcskew50k.gif}}+{{http://​www.g-language.org/​data/​gaou/​gcskew/​gcskew50k.gif?400}}
  
 AT skew: 50 k bp window AT skew: 50 k bp window
  
-{{http://​www.g-language.org/​data/​gaou/​gcskew/​atskew50k.gif}}+{{http://​www.g-language.org/​data/​gaou/​gcskew/​atskew50k.gif?400}}
  
 With a 50 kbp window, the shift points of GC skew are much clearer. However, the AT skew result does not seem to be as noticeable as GC skew. With a 50 kbp window, the shift points of GC skew are much clearer. However, the AT skew result does not seem to be as noticeable as GC skew.
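
To compare GC skew and AT skew outside the G-language System, the windowed calculation can be written for an arbitrary pair of bases. The sketch below is hypothetical: windowed_skew is not a G-language System function, and the (A-T)/(A+T) ordering used for AT skew is an assumption, since this tutorial only gives the formula for GC skew.

```perl
use strict;
use warnings;

# Skew between two bases per non-overlapping window: (x - y) / (x + y).
# windowed_skew is a hypothetical helper, not a G-language System function.
sub windowed_skew {
    my ($seq, $window, $x, $y) = @_;
    my @skews;
    for (my $pos = 0; $pos + $window <= length $seq; $pos += $window) {
        my $chunk = substr($seq, $pos, $window);
        my $nx = () = $chunk =~ /\Q$x\E/gi;
        my $ny = () = $chunk =~ /\Q$y\E/gi;
        # A window containing neither base gets a skew of 0.
        push @skews, ($nx + $ny) ? ($nx - $ny) / ($nx + $ny) : 0;
    }
    return @skews;
}

my $seq = "aaatcccgttta";                      # toy sequence
my @gc  = windowed_skew($seq, 4, 'c', 'g');    # tutorial's (C-G)/(C+G)
my @at  = windowed_skew($seq, 4, 'a', 't');    # assumed (A-T)/(A+T)

print "GC: @gc\n";
print "AT: @at\n";
```

Changing the second argument is the plain-Perl counterpart of changing -window in the scripts above.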
  
-\\+
  
 === Exercise 2: === === Exercise 2: ===
  Compare the GC skew and AT skew tendencies of different species. Also see whether anything different appears when the window size is changed.  Compare the GC skew and AT skew tendencies of different species. Also see whether anything different appears when the window size is changed.
- 
- 
- 
- 
- 
- 
- 
- 
- 
-\\ 
  
 ====== Step 3-GC skew seen from different angles ​ ====== ====== Step 3-GC skew seen from different angles ​ ======
Line 149: Line 145:
 As explained in the introduction, GC skew is thought to be caused by different mutation probabilities in the leading and lagging strands, and by codon usage. As explained in the introduction, GC skew is thought to be caused by different mutation probabilities in the leading and lagging strands, and by codon usage.
  
-If the appearance of GC skew can be explained by different mutation in the leading and lagging strands, then conversely it is possible to determine the replication/end point from GC skew. Indeed, the GC skew shift points coincide with the replication/end point. So let's predict the replication/ending point using GC skew. +If the appearance of GC skew can be explained by different mutation in the leading and lagging strands, then conversely it is possible to determine the replication origin/terminus from GC skew. Indeed, the GC skew shift points coincide with the replication/end point. So let's predict the replication origin/terminus using GC skew.
- +
-Accumulated GC skew is useful for identifying shift points. It sums the GC skew value of each window, and its feature is that shift points stand out clearly. In G-language System this GC skew accumulation is provided by the function cum_gcskew(). +
- +
- +
-The function cum_gcskew() has the following options. +
-^option description +
-|-window |window size used when calculating cumulative GC skew (default: 10,000 bp)| +
-|-at |when 1, calculates AT skew instead (default: 0)| +
-|-output |f writes a file, g writes a graph, show displays the graph (default: show)| +
-|-filename |output file name (default: data/cum_gcskew.csv when -output is f, graph/gcskew.gif when -output is g)| +
- +
-   ​cum_gcskew($gb,​ -window=>​50000); ​+
  
 +Cumulative GC skew is useful for identifying shift points. It sums the GC skew value of each window, and its feature is that shift points stand out clearly. In G-language System, cumulative GC skew can be calculated using the option -cumulative=>1.
 +<code perl>
 +   ​gcskew($gb,​ -window=>​50000,​ -cumulative=>​1); ​
 +</​code>​
 It can be executed as above. For example, a graph like the one below should be displayed.  It can be executed as above. For example, a graph like the one below should be displayed. 
  
-{{http://​www.g-language.org/​data/​gaou/​gcskew/​cum_gcskew.gif}}+{{http://​www.g-language.org/​data/​gaou/​gcskew/​cum_gcskew.gif?400}}
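
The accumulation itself is a running sum of the per-window skew values. The following is a minimal plain-Perl sketch; cumulative_gc_skew is a hypothetical name and the sequence is a toy example, so this is not the implementation used inside the G-language System.

```perl
use strict;
use warnings;

# Running sum of the per-window (C-G)/(C+G) values.
# cumulative_gc_skew is a hypothetical helper, not a G-language System function.
sub cumulative_gc_skew {
    my ($seq, $window) = @_;
    my @cumulative;
    my $sum = 0;
    for (my $pos = 0; $pos + $window <= length $seq; $pos += $window) {
        my $chunk = substr($seq, $pos, $window);
        my $g = () = $chunk =~ /g/gi;
        my $c = () = $chunk =~ /c/gi;
        $sum += ($c - $g) / ($c + $g) if $c + $g;    # skip windows without G or C
        push @cumulative, $sum;
    }
    return @cumulative;
}

# Toy sequence: one G-rich window, two C-rich windows, one G-rich window.
print join(",", cumulative_gc_skew("ggggccccccccgggg", 4)), "\n";    # prints -1,0,1,0
```

Because consecutive windows with the same sign keep adding up, a change of sign shows as a clear peak or valley in the cumulative curve.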
  
 Now let's change the window size and display a few cumulative GC skew graphs. Notice that the shift points appear very clearly. Now let's change the window size and display a few cumulative GC skew graphs. Notice that the shift points appear very clearly.
  
-So now, let's predict the replication/end place using accumulative GC skew. The G-language System comes standard with the function find_ori_ter(), which internally uses accumulative GC skew to predict the replication/finish point. +So now, let's predict the replication origin/terminus using cumulative GC skew. The G-language System comes standard with the function find_ori_ter(), which internally uses cumulative GC skew to predict the replication origin/terminus. 
- +<code perl> 
--window is the only option of find_ori_ter(). It changes the window size of the accumulative GC skew, and defaults to 1000. Let's predict the replication/finish point by using it together with cum_gcskew() and changing the window size. +   find_ori_ter($gb, -window=>500);  
- +</​code>​
-   ​find_ori_ter(\$gb->{SEQ}, -window=>​500);​  +
 It is used as above. It is used as above.
  
-If the window size is 10000 bp, a prediction like the one below will be output for E.coli, for example.   +For //E.coli//, for example, a prediction like the one below will be output.   
  
-  find_ori_ter: ​ +  find_ori_ter:​ 
-     ​Window size = 10000  +     ​Predicted Origin: ​  3923622 
-     ​Predicted Origin: ​  3915000 ​ +     ​Predicted Terminus: ​1550412
-     ​Predicted Terminus: ​1545000 ​+
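
One way such a prediction could be made is to take the extremes of the cumulative curve. The sketch below is an assumption, not the documented behaviour of find_ori_ter(): with the (C-G)/(C+G) sign convention used in this tutorial, it takes the maximum of the cumulative skew as the origin and the minimum as the terminus. The name predict_ori_ter and the toy sequence are hypothetical.

```perl
use strict;
use warnings;

# Report the window end positions where the cumulative (C-G)/(C+G) skew
# is largest (taken here as the origin) and smallest (taken as the terminus).
# predict_ori_ter is a hypothetical helper; this max/min rule is an assumption.
sub predict_ori_ter {
    my ($seq, $window) = @_;
    my ($sum, $max, $min) = (0, undef, undef);
    my ($ori, $ter) = (0, 0);
    for (my $pos = 0; $pos + $window <= length $seq; $pos += $window) {
        my $chunk = substr($seq, $pos, $window);
        my $g = () = $chunk =~ /g/gi;
        my $c = () = $chunk =~ /c/gi;
        $sum += ($c - $g) / ($c + $g) if $c + $g;
        my $end = $pos + $window;
        if (!defined $max || $sum > $max) { $max = $sum; $ori = $end; }
        if (!defined $min || $sum < $min) { $min = $sum; $ter = $end; }
    }
    return ($ori, $ter);
}

# Toy "genome": G-rich for 20 bases, C-rich for 40, G-rich for 20.
my $seq = ("g" x 20) . ("c" x 40) . ("g" x 20);
my ($ori, $ter) = predict_ori_ter($seq, 10);
print "Predicted Origin:   $ori\n";    # prints 60
print "Predicted Terminus: $ter\n";    # prints 20
```

A smaller window gives a finer position estimate at the cost of a noisier curve, which is why the prediction changes with -window.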
  
 Because the generation of GC skew is said to be related to codon usage, interesting results may be seen by comparing the GC skew of the whole genome, the coding regions, the non-coding regions, and the third codon position, where variation is most pronounced. The G-language System also provides a standard function for this analysis, genomicskew(). Because the generation of GC skew is said to be related to codon usage, interesting results may be seen by comparing the GC skew of the whole genome, the coding regions, the non-coding regions, and the third codon position, where variation is most pronounced. The G-language System also provides a standard function for this analysis, genomicskew().
Line 194: Line 179:
 |-application |application used to show the image (default: gimp)| |-application |application used to show the image (default: gimp)|
  
 +<code perl>
    ​genomicskew($gb,​ -divide=>​250); ​    ​genomicskew($gb,​ -divide=>​250); ​
 +</​code>​
  
 Let's execute it as above. Note that the first argument is $gb. Let's execute it as above. Note that the first argument is $gb.
Line 200: Line 187:
 The output is as follows. Consider it together with all the previous analyses. The output is as follows. Consider it together with all the previous analyses.
  
-{{http://​www.g-language.org/​data/​gaou/​gcskew/​genomicskew.gif}}+{{http://​www.g-language.org/​data/​gaou/​gcskew/​genomicskew.gif?400}} 
  
-\\ 
  
 == Exercise 3: == == Exercise 3: ==
Line 209: Line 196:
  
  
-\\+
  
 ====== Step 4-for higher analysis ====== ====== Step 4-for higher analysis ======
Line 221: Line 208:
  
 For example, when investigating the relationship between each gene's expression and its second codon, one could use GC skew to detect the tendency around it. As one example, here is how to express this using the G-language System. For example, when investigating the relationship between each gene's expression and its second codon, one could use GC skew to detect the tendency around it. As one example, here is how to express this using the G-language System.
 +<code perl>
    use G;     use G; 
-   $gb = new G("​ecoli.gbk"​); ​+   $gb = new G("​ecoli"​); ​
    ​cai($gb); ​    ​cai($gb); ​
    ​$w_val = w_value($gb); ​    ​$w_val = w_value($gb); ​
Line 236: Line 223:
       }        } 
     }      } 
 +</​code>​
 First, start up the G-language System and load the genome database. The cai() function stores the CAI value (Codon Adaptation Index: an index of translation efficiency that can also be used as an indicator of gene expression level) for each gene. The W value, a measure of codon usage bias, is obtained with w_value() and stored in $w_val. First, start up the G-language System and load the genome database. The cai() function stores the CAI value (Codon Adaptation Index: an index of translation efficiency that can also be used as an indicator of gene expression level) for each gene. The W value, a measure of codon usage bias, is obtained with w_value() and stored in $w_val.
  
 +<code perl>
    ​foreach $cds ($gb->​cds()){ ​    ​foreach $cds ($gb->​cds()){ ​
        
    ​} ​    ​} ​
 +</​code>​
  
 is the most basic way to process every CDS using the G-language System. $gb->cds() returns the names of all CDSs in the genome database in $gb. In other words, with foreach it is possible to analyse every CDS. is the most basic way to process every CDS using the G-language System. $gb->cds() returns the names of all CDSs in the genome database in $gb. In other words, with foreach it is possible to analyse every CDS.
Line 251: Line 240:
 a structure such as the above, which stores the FEATURE information. That is, the information for each CDS is held in a structure named CDS+number, and each item is accessed hierarchically, as in $gb->{CDS534}->{start}. a structure such as the above, which stores the FEATURE information. That is, the information for each CDS is held in a structure named CDS+number, and each item is accessed hierarchically, as in $gb->{CDS534}->{start}.
  
 +<code perl>
       $secondcodon = $gb->​after_startcodon($cds,​ 3);        $secondcodon = $gb->​after_startcodon($cds,​ 3); 
       $w_second = $$w_val{$secondcodon}; ​       $w_second = $$w_val{$secondcodon}; ​
       $cai = $gb->​{$cds}->​{cai}; ​       $cai = $gb->​{$cds}->​{cai}; ​
 +</​code>​
  
 This part first takes the three letters after the start codon of $cds with the standard function after_startcodon() and stores them in $secondcodon. It also obtains the W value of that codon and the CAI value of that gene. This part first takes the three letters after the start codon of $cds with the standard function after_startcodon() and stores them in $secondcodon. It also obtains the W value of that codon and the CAI value of that gene.
  
 +<code perl>
       if ($w_second < 0.5 && $cai > 0.8){        if ($w_second < 0.5 && $cai > 0.8){ 
          ​$afterstart = $gb->​after_startcodon($cds,​ 99);           ​$afterstart = $gb->​after_startcodon($cds,​ 99); 
          ​gcskew(\$afterstart,​ -window=>​9,​ -filename=>"​$cds-gcskew.gif"​); ​          ​gcskew(\$afterstart,​ -window=>​9,​ -filename=>"​$cds-gcskew.gif"​); ​
       }        } 
 +</​code>​
  
 This section examines the GC skew of the 99 bp downstream of the start codon, in windows of 9 bp (three codons), when the CAI value is greater than 0.8 and the W value is less than 0.5; in other words, for highly expressed genes that have a rare codon placed just after the start codon. This section examines the GC skew of the 99 bp downstream of the start codon, in windows of 9 bp (three codons), when the CAI value is greater than 0.8 and the W value is less than 0.5; in other words, for highly expressed genes that have a rare codon placed just after the start codon.
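
The sequence handling in this step can be illustrated without the G-language System. The sketch below is hypothetical: bases_after_start stands in for what after_startcodon() is described as doing, it assumes the CDS string begins with a 3-base start codon, and the CDS itself is made up (27 downstream bases instead of 99 to keep the example short).

```perl
use strict;
use warnings;

# Return $length bases that follow the 3-base start codon of a CDS sequence.
# bases_after_start is a hypothetical stand-in, not a G-language System function.
sub bases_after_start {
    my ($cds_seq, $length) = @_;
    return substr($cds_seq, 3, $length);
}

# (C-G)/(C+G) for each non-overlapping window of a short sequence.
sub window_skews {
    my ($seq, $window) = @_;
    my @skews;
    for (my $pos = 0; $pos + $window <= length $seq; $pos += $window) {
        my $chunk = substr($seq, $pos, $window);
        my $g = () = $chunk =~ /g/gi;
        my $c = () = $chunk =~ /c/gi;
        push @skews, ($c + $g) ? ($c - $g) / ($c + $g) : 0;
    }
    return @skews;
}

# Made-up CDS: start codon, six "gcc" codons, three "ggg" codons (30 bases).
my $cds_seq      = "atg" . ("gcc" x 6) . ("ggg" x 3);
my $second_codon = bases_after_start($cds_seq, 3);
my @skews        = window_skews(bases_after_start($cds_seq, 27), 9);

print "second codon: $second_codon\n";    # prints second codon: gcc
print "windows: ", scalar(@skews), "\n";  # prints windows: 3
```

With a 9 bp window each value summarises three codons, so the resulting curve shows how the bias changes codon triplet by codon triplet downstream of the start codon.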
tutorialgcskewenglish.1187594227.txt.gz · Last modified: 2014/01/18 07:44 (external edit)