Introduction

GC skew is a parameter that represents the bias between G and C on a single strand of a DNA molecule; its formula is (C-G)/(C+G). Chargaff's second parity rule states that the amounts of G and C on a single strand are almost the same. This holds for the genome as a whole, but a bias can be seen in local regions. In fact, in some bacteria the skew shifts cleanly at particular positions, and those positions are known to coincide with the replication origin and terminus. There are two theories about how GC skew arises: different mutation probabilities in the leading and lagging strands, and mutation bias due to codon usage.

Reference:

  • Lobry, J.R., 1996. Asymmetric substitution patterns in the two DNA strands of bacteria. Mol. Biol. Evol. 13, 660-665
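As a concrete illustration of the formula, here is a minimal sketch in plain Perl that does not use G-language System at all. The function name gc_skew() is made up for this example, and returning 0 for a sequence with no G or C is an assumption of the sketch.

```perl
use strict;
use warnings;

# GC skew as defined in this tutorial: (C - G) / (C + G).
# Returns 0 when the sequence contains neither C nor G (an assumption).
sub gc_skew {
    my ($seq) = @_;
    my $c = ($seq =~ tr/cC//);   # tr/// with an empty replacement only counts
    my $g = ($seq =~ tr/gG//);
    return 0 if $c + $g == 0;
    return ($c - $g) / ($c + $g);
}

# Three C and two G give (3 - 2) / (3 + 2).
printf "%.3f\n", gc_skew("ATGCCCGA");   # prints 0.200
```

Note that many other texts define GC skew with the opposite sign, (G-C)/(G+C), so the sign convention should be checked before comparing graphs from different tools.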

Step 0 - Starting up G-language System

Now, let's start analyzing GC skew. First, though, start up G-language System. Genome analysis with G-language System is very simple. For example, to analyze the E. coli genome in GenBank format that is bundled with G-language System, just type the two lines below and the preparation is done.

filename: test.pl
  use G; 
  $gb = new G("ecoli"); 

To test it, execute the script. Output like the following will appear.

$ perl test.pl [ENTER] 

Length of Sequence :   4639221 
         A Content :   1142136 (24.62%) 
         T Content :   1140877 (24.59%) 
         G Content :   1176775 (25.37%) 
         C Content :   1179433 (25.42%) 
            Others :         0 (0.00%) 
        AT Content :    49.21% 
        GC Content :    50.79% 

By printing these base composition statistics, G-language System indicates that it has loaded the file successfully.
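The kind of counting behind such a summary can be sketched in plain Perl as follows. This is only an illustration on a toy sequence; base_counts() is a hypothetical helper, not a G-language function.

```perl
use strict;
use warnings;

# Count A, T, G and C in a sequence; everything else is counted as "others".
sub base_counts {
    my ($seq) = @_;
    my %n;
    $n{A} = ($seq =~ tr/aA//);
    $n{T} = ($seq =~ tr/tT//);
    $n{G} = ($seq =~ tr/gG//);
    $n{C} = ($seq =~ tr/cC//);
    $n{others} = length($seq) - ($n{A} + $n{T} + $n{G} + $n{C});
    return %n;
}

my $toy = "ATGCGCNNAT";              # toy sequence, not real data
my %n   = base_counts($toy);
for my $base (qw(A T G C others)) {
    printf "%7s : %d (%.2f%%)\n", $base, $n{$base}, 100 * $n{$base} / length($toy);
}
```

In the E. coli output above, the four base counts add up exactly to the sequence length, which is why "Others" is 0.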

To use a local genome flat file in a format other than GenBank (Fasta, EMBL, swiss, SCF, PIR, GCG, raw, ace, etc.), specify the data format as the second argument:

   use G; 
   $gb = new G("ecoli.fasta", "Fasta"); 

Step 1 - GC skew observation

G-language System comes with various standard functions for genome analysis. Fortunately, GC skew is one of them. Let's add a line that calls the GC skew function to the script made in Step 0 and execute it.

filename: test.pl
  use G; 
  $gb = new G("ecoli"); 
  gcskew($gb); 

A graph of the GC skew will appear.

This graph shows the GC skew of the E. coli genome calculated in windows of 10 kbp. There is a very distinctive trend, with shift points near 1,500,000 bp and 3,800,000 bp.
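The idea of calculating the skew window by window can be sketched in plain Perl. The sketch assumes consecutive, non-overlapping windows and drops a trailing partial window; whether gcskew() treats windows in exactly this way is not stated here, so take it as an illustration only. Both function names are made up.

```perl
use strict;
use warnings;

# (C - G) / (C + G) for one stretch of sequence; 0 if it has no C or G.
sub gc_skew {
    my ($seq) = @_;
    my $c = ($seq =~ tr/cC//);
    my $g = ($seq =~ tr/gG//);
    return 0 if $c + $g == 0;
    return ($c - $g) / ($c + $g);
}

# One skew value per consecutive window (assumption: non-overlapping
# windows, trailing partial window ignored).
sub windowed_skew {
    my ($seq, $window) = @_;
    my @skew;
    for (my $pos = 0; $pos + $window <= length($seq); $pos += $window) {
        push @skew, gc_skew(substr($seq, $pos, $window));
    }
    return @skew;
}

# Toy sequence with a window of 4 bases: CCCC, GGGG, CCGG.
print join(", ", windowed_skew("CCCCGGGGCCGG", 4)), "\n";   # prints 1, -1, 0
```

With a real genome the window would be something like 10,000 bp, and the returned values are what gets plotted along the genome position.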

All data loaded when G-language System starts up is stored in $gb. For example, the entire base sequence is held in

  $gb->{SEQ} 

Most of the standard functions work when given this $gb.

Exercise 1:

Change the genome loaded in $gb = new G("ecoli"); to B. subtilis ("bsub"), Mycoplasma genitalium ("mgen"), etc., and observe their characteristics. Then compare and discuss the results.

Step 2 - GC skew analysis

The functions of G-language System are not only useful but also have many options.

For example, gcskew() has the following options.

option     description
-window    window size used to calculate GC skew (default: 10,000 bp)
-at        calculates AT skew instead when set to 1 (default: 0)
-output    f for file output, g for graph output, show to display the graph (default: show)
-filename  output file name (default: data/gcskew.csv when -output is f, graph/gcskew.gif when -output is g)

Options are written as follows:

  gcskew($gb, -window=>50000, -at=>1); 

As shown above, a "-" is put at the head of each option name, and the option is connected to its value with "=>".
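For readers new to this calling style, the sketch below shows one common way a Perl function can receive such "-name => value" pairs and fall back to defaults. skew_options() is a hypothetical function written for this example; it is not how gcskew() is actually implemented. The default values are the ones listed in the table above.

```perl
use strict;
use warnings;

# Hypothetical example: the first argument is the target, the rest are
# "-name => value" pairs. Pairs given by the caller override the defaults.
sub skew_options {
    my ($target, %given) = @_;
    my %opt = (
        -window => 10000,
        -at     => 0,
        -output => 'show',
        %given,
    );
    return %opt;
}

my %opt = skew_options("dummy", -window => 50000, -at => 1);
print join(" ", @opt{'-window', '-at', '-output'}), "\n";   # prints 50000 1 show
```

Because the options are collected into a hash, they can be given in any order and any of them can be omitted.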

For a more detailed analysis of GC skew in E. coli, let's look at its AT skew and GC skew with a 50 kbp window. AT skew, like GC skew, measures the bias between the amounts of A and T; its trend differs from that of GC skew and is less pronounced.

The script is as follows:

filename: test.pl
  use G; 
  $gb = new G("ecoli"); 
  gcskew($gb, -window=>50000, -filename=>"gcskew50k.gif"); 
  gcskew($gb, -window=>50000, -at=>1, -filename=>"atskew50k.gif"); 

The graphs are as follows:

GC skew: 50 kbp window

AT skew: 50 kbp window

With a 50 kbp window, the shift points of the GC skew are much clearer. The AT skew, however, is not as pronounced as the GC skew.
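GC skew and AT skew are the same kind of calculation applied to a different pair of bases, which the following plain Perl sketch tries to show. Writing AT skew as (A-T)/(A+T) is an assumption made by analogy with the GC skew formula; this tutorial does not state the exact formula used by the -at option. base_skew() is a made-up name.

```perl
use strict;
use warnings;

# Skew between two bases $x and $y: (x - y) / (x + y), or 0 if neither occurs.
sub base_skew {
    my ($seq, $x, $y) = @_;
    my $nx = () = $seq =~ /\Q$x\E/gi;   # count case-insensitive matches
    my $ny = () = $seq =~ /\Q$y\E/gi;
    return 0 if $nx + $ny == 0;
    return ($nx - $ny) / ($nx + $ny);
}

my $toy = "AAATCCGG";   # toy sequence: 3 A, 1 T, 2 C, 2 G
printf "GC skew %.2f, AT skew %.2f\n",
    base_skew($toy, 'C', 'G'),
    base_skew($toy, 'A', 'T');          # prints GC skew 0.00, AT skew 0.50
```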

Exercise 2:

Look at how the GC skew and AT skew trends differ between species. Also see whether anything different appears when you change the window size.

Step 3 - GC skew seen from different angles

As explained in the introduction, GC skew is thought to result from different mutation rates in the leading and lagging strands and from codon usage.

If GC skew can be explained by different mutation in the leading and lagging strands, then conversely it should be possible to locate the replication origin and terminus from GC skew. Indeed, the shift points of GC skew coincide with the replication origin and terminus. So let's predict the replication origin and terminus using GC skew.

Cumulative GC skew is useful for identifying the shift points. It accumulates the GC skew values window by window, so that the shift points stand out as clear peaks and troughs. In G-language System, cumulative GC skew can be calculated with the option -cumulative=>1.

   gcskew($gb, -window=>50000, -cumulative=>1); 

Run it as shown above and the cumulative GC skew graph will be displayed.

Now let's change the window size and display a few more cumulative GC skew graphs. Note that the shift points appear very clearly.
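To see why the shift points become so clear, it may help to compute a cumulative skew by hand on a toy sequence. The plain Perl sketch below simply keeps a running sum of the per-window values; it assumes non-overlapping windows and is not taken from the G-language source. The function names are made up.

```perl
use strict;
use warnings;

# (C - G) / (C + G) for one stretch of sequence; 0 if it has no C or G.
sub gc_skew {
    my ($seq) = @_;
    my $c = ($seq =~ tr/cC//);
    my $g = ($seq =~ tr/gG//);
    return 0 if $c + $g == 0;
    return ($c - $g) / ($c + $g);
}

# Running sum of the per-window skew values.
sub cumulative_skew {
    my ($seq, $window) = @_;
    my $sum = 0;
    my @cumulative;
    for (my $pos = 0; $pos + $window <= length($seq); $pos += $window) {
        $sum += gc_skew(substr($seq, $pos, $window));
        push @cumulative, $sum;
    }
    return @cumulative;
}

# Toy "genome": G-rich, then C-rich, then G-rich again.
my $toy = ("G" x 30) . ("C" x 50) . ("G" x 20);
print join(" ", cumulative_skew($toy, 10)), "\n";
# prints -1 -2 -3 -2 -1 0 1 2 1 0
```

The plain skew only flips between -1 and 1 here, but the cumulative values form a trough at the third window and a peak at the eighth, which is exactly where the composition changes.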

Now let's predict the replication origin and terminus using cumulative GC skew. G-language System provides a standard function, find_ori_ter(), which uses cumulative GC skew internally to predict the replication origin and terminus.

   find_ori_ter($gb, -window=>500); 

It is used as shown above.

For E. coli, for example, a prediction like the following is output.

find_ori_ter:
   Predicted Origin:   3923622
   Predicted Terminus: 1550412
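The following plain Perl sketch shows one simple way such a prediction could be made from cumulative skew: take the positions of the maximum and the minimum. This is an assumption for illustration and not necessarily the algorithm inside find_ori_ter(). With the (C-G)/(C+G) convention of this tutorial and a G-rich leading strand, as in E. coli, the maximum corresponds to the origin and the minimum to the terminus; with the opposite sign convention the roles are swapped. predict_ori_ter() is a made-up name.

```perl
use strict;
use warnings;

# (C - G) / (C + G) for one stretch of sequence; 0 if it has no C or G.
sub gc_skew {
    my ($seq) = @_;
    my $c = ($seq =~ tr/cC//);
    my $g = ($seq =~ tr/gG//);
    return 0 if $c + $g == 0;
    return ($c - $g) / ($c + $g);
}

# Hypothetical predictor: origin at the maximum of the cumulative skew,
# terminus at the minimum (assumes (C-G) skew and a G-rich leading strand).
sub predict_ori_ter {
    my ($seq, $window) = @_;
    my ($sum, $max, $min) = (0, 0, 0);
    my ($ori, $ter) = (0, 0);
    for (my $pos = 0; $pos + $window <= length($seq); $pos += $window) {
        $sum += gc_skew(substr($seq, $pos, $window));
        my $end = $pos + $window;          # cumulative value applies up to here
        if ($sum > $max) { ($max, $ori) = ($sum, $end) }
        if ($sum < $min) { ($min, $ter) = ($sum, $end) }
    }
    return ($ori, $ter);
}

# Toy circular "genome": C-rich between positions 30 and 80, G-rich elsewhere.
my $toy = ("G" x 30) . ("C" x 50) . ("G" x 20);
my ($ori, $ter) = predict_ori_ter($toy, 10);
print "Predicted Origin: $ori\nPredicted Terminus: $ter\n";
# prints Predicted Origin: 80 and Predicted Terminus: 30
```

The E. coli result above follows the same pattern: the predicted terminus is near the 1,500,000 bp shift point and the predicted origin near the 3,800,000 bp shift point seen in Step 1.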

Because the generation of GC skew is said to be related to codon usage, interesting results may emerge from comparing the GC skew of the whole genome, the coding regions, the non-coding regions, and the third position of codons, which shows the most pronounced variation. G-language System provides this analysis as the standard function genomicskew().

The function genomicskew() has the following options.

option        description
-divide       number of windows into which the genome is divided for the GC skew calculation (default: 250)
-at           calculates AT skew instead when set to 1 (default: 0)
-output       f for file output, g for graph output, show to display the graph (default: show)
-filename     output file name (default: data/genomicskew.csv when -output is f, graph/genomicskew.gif when -output is g)
-application  application used to display the image (default: gimp)

   genomicskew($gb, -divide=>250); 

Let's execute it as shown above. Note that the first argument is $gb.

The result is displayed as a graph. Consider it together with all the previous analyses.
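To make the "third position of codons" idea concrete, here is a small plain Perl sketch that compares the skew of a whole coding sequence with the skew of its third codon positions only. The toy sequence and the helper names are made up, and this is not how genomicskew() itself is implemented.

```perl
use strict;
use warnings;

# (C - G) / (C + G) for one stretch of sequence; 0 if it has no C or G.
sub gc_skew {
    my ($seq) = @_;
    my $c = ($seq =~ tr/cC//);
    my $g = ($seq =~ tr/gG//);
    return 0 if $c + $g == 0;
    return ($c - $g) / ($c + $g);
}

# Collect every third base of a coding sequence (codon position 3).
sub third_positions {
    my ($cds) = @_;
    my $third = '';
    for (my $i = 2; $i < length($cds); $i += 3) {
        $third .= substr($cds, $i, 1);
    }
    return $third;
}

my $cds   = "ATGGCCAAGTTCGGCTAA";      # toy CDS: ATG GCC AAG TTC GGC TAA
my $third = third_positions($cds);     # GCGCCA
printf "all positions:  %.3f\n", gc_skew($cds);
printf "third position: %.3f\n", gc_skew($third);
```

In this toy example the whole sequence has slightly more G than C, while the third positions alone have more C than G, so the two values have opposite signs. Real genes would of course have to be pooled over many CDSs before drawing any conclusion.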

Exercise 3:

GC skew analysis can be used from many more angles. Try various analyses with the wide range of standard functions in G-language System. Comparing the results with one another may turn up something interesting.

Step 4 - For higher analysis

G-language System comes with various standard genome analysis functions, which make a wide range of analyses possible. As a bioinformatician, however, you should treat these functions only as tools for digging deeper into the phenomena of life.

G-language System provides not only genome analysis functions but also a platform for handling genome databases with ease. For details, refer to the documentation in G.pm. That platform is $gb, the instance created by G-language System, whose methods give access to each gene, the surroundings of start and stop codons, introns and exons, and so on.

GC skew analysis may not be the goal in itself, but it can be a necessary tool along the way.

For example, when looking for a relationship between the expression level of each gene and its second codon, one idea is to use GC skew to detect trends in that neighborhood. Here is one way to express this with G-language System.

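   use G; 
   $gb = new G("ecoli");                  # load the bundled E. coli genome
   cai($gb);                              # store a CAI value for every CDS
   $w_val = w_value($gb);                 # hash reference: codon => W value
 
   foreach $cds ($gb->cds()){             # loop over every CDS name
      $secondcodon = $gb->after_startcodon($cds, 3);   # codon right after the start codon
      $w_second = $$w_val{$secondcodon}; 
      $cai = $gb->{$cds}->{cai}; 
      if ($w_second < 0.5 && $cai > 0.8){               # rare second codon, high CAI
         $afterstart = $gb->after_startcodon($cds, 99); 
         gcskew(\$afterstart, -window=>9, -filename=>"$cds-gcskew.gif"); 
      } 
   } 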

First, start up G-language System and load the genome database. The cai() function stores the CAI value for each gene (Codon Adaptation Index: a measure of translation efficiency that can also serve as an indicator of gene expression level). The W value, which represents codon usage bias, is obtained with w_value() and stored in $w_val.
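For readers unfamiliar with these two quantities, the sketch below follows the usual textbook definitions: the W value (relative adaptiveness) of a codon is its count divided by the count of the most frequent synonymous codon in a reference set, and CAI is the geometric mean of the W values of a gene's codons. The tiny codon table, the reference set and the function names are all made up for illustration, and the actual cai() and w_value() functions may differ in detail.

```perl
use strict;
use warnings;

# Toy synonymous codon families (a hypothetical subset, not a full genetic code).
my %family = (
    aaa => 'Lys', aag => 'Lys',
    gaa => 'Glu', gag => 'Glu',
);

# W value: codon count divided by the count of the most frequent
# synonymous codon in the reference set. Returns a hash reference.
sub w_values {
    my (@reference_codons) = @_;
    my (%count, %max, %w);
    $count{$_}++ for grep { exists $family{$_} } @reference_codons;
    for my $codon (keys %count) {
        my $aa = $family{$codon};
        if (!defined($max{$aa}) || $count{$codon} > $max{$aa}) {
            $max{$aa} = $count{$codon};
        }
    }
    $w{$_} = $count{$_} / $max{ $family{$_} } for keys %count;
    return \%w;
}

# CAI: geometric mean of the W values of the codons of one gene.
sub cai_of {
    my ($w, @codons) = @_;
    my @known = grep { exists $w->{$_} } @codons;
    return 0 unless @known;
    my $log_sum = 0;
    $log_sum += log($w->{$_}) for @known;
    return exp($log_sum / scalar(@known));
}

# Reference set: aaa x3, aag x1, gaa x1, gag x4.
my $w = w_values(qw(aaa aaa aaa aag gaa gag gag gag gag));
printf "w(aag) = %.2f, w(gaa) = %.2f\n", $w->{aag}, $w->{gaa};   # 0.33 and 0.25
printf "CAI = %.3f\n", cai_of($w, qw(aaa gag aag));               # 0.693
```

The hash reference returned by w_values() is looked up by codon, in the same way that the tutorial script reads $$w_val{$secondcodon}.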

   foreach $cds ($gb->cds()){ 
 
   } 

is the most basic way to process every CDS with G-language System. $gb->cds() returns the names of all CDSs in the genome database held in $gb. In other words, looping over them with foreach makes it possible to analyze every CDS.

$gb holds the FEATURE information in a structure like the following:

LOCUS, HEADER, FEATURE1, FEATURE2 … FEATURE4000 …, CDS1, CDS2 … CDS4000 …, SEQ

That is, the information for each CDS is stored under a name of the form CDS+number, and each item is accessed hierarchically, as in $gb->{CDS534}->{start}.
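This layout can be pictured as an ordinary Perl hash of hashes. The mock below is only a toy stand-in for $gb: the SEQ key, the CDS+number names and {start} come from this tutorial, while the {end} key, the sequence and the helper functions are assumptions made for illustration.

```perl
use strict;
use warnings;

# Toy stand-in for $gb (not a real G instance).
my $gb = {
    SEQ  => 'ccatgaaataaggatgccctgacc',
    CDS1 => { start => 3,  end => 11 },   # {end} is an assumed key
    CDS2 => { start => 14, end => 22 },
};

# Names of all CDS entries, sorted by their number.
sub cds_names {
    my ($g) = @_;
    return sort { substr($a, 3) <=> substr($b, 3) } grep { /^CDS\d+$/ } keys %$g;
}

# Sequence of one CDS, using 1-based inclusive coordinates.
sub cds_sequence {
    my ($g, $name) = @_;
    my $f = $g->{$name};
    return substr($g->{SEQ}, $f->{start} - 1, $f->{end} - $f->{start} + 1);
}

for my $name (cds_names($gb)) {
    print "$name $gb->{$name}->{start} ", cds_sequence($gb, $name), "\n";
}
# prints:
# CDS1 3 atgaaataa
# CDS2 14 atgccctga
```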

      $secondcodon = $gb->after_startcodon($cds, 3); 
      $w_second = $$w_val{$secondcodon}; 
      $cai = $gb->{$cds}->{cai}; 

This part first takes the three bases after the start codon of $cds with the standard function after_startcodon() and stores them in $secondcodon. It then obtains the W value of that codon and the CAI value of the gene.
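What "the bases after the start codon" means can be sketched with a plain substr() call. The helper below is a hypothetical stand-in that only handles a CDS on the direct strand with a 1-based start position; the real after_startcodon() presumably also deals with genes on the complementary strand.

```perl
use strict;
use warnings;

# Hypothetical stand-in: the $n bases that follow the 3-base start codon
# of a CDS beginning at 1-based position $start on the direct strand.
sub bases_after_start {
    my ($seq, $start, $n) = @_;
    return substr($seq, $start - 1 + 3, $n);
}

my $toy = "ccatgaaagaataagg";                  # toy CDS at 3..14: atg aaa gaa taa
print bases_after_start($toy, 3, 3), "\n";     # prints aaa (the second codon)
print bases_after_start($toy, 3, 6), "\n";     # prints aaagaa
```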

      if ($w_second < 0.5 && $cai > 0.8){ 
         $afterstart = $gb->after_startcodon($cds, 99); 
         gcskew(\$afterstart, -window=>9, -filename=>"$cds-gcskew.gif"); 
      } 

This section looks at the GC skew of the 99 bp downstream of the start codon, in windows of 9 bp (three codons), for genes whose CAI value is greater than 0.8 (that is, genes with a high expression level) and whose codon immediately after the start codon has a W value of less than 0.5 (that is, a rare codon).

The script is somewhat complex, yet it is only 13 lines long. G-language System makes complex analyses like this efficient and easy.

From here on, set out on your own quest to discover the truth of life with G-language System.

tutorialgcskewenglish.txt · Last modified: 2014/01/18 07:44 (external edit)