GC skew is a measure of the bias between G and C on a single strand of a DNA molecule, calculated as (C-G)/(C+G). Chargaff's second parity rule states that the amounts of G and C on a single strand are almost equal. This holds for the genome as a whole, but a bias can be seen within local regions. In some bacteria the direction of this bias switches cleanly at particular positions, and these positions are known to coincide with the replication origin and terminus. There are two hypotheses for how GC skew arises: different mutation rates on the leading and lagging strands, and mutational bias due to codon usage.


  • Lobry, J.R., 1996. Asymmetric substitution patterns in the two DNA strands of bacteria. Mol. Biol. Evol. 13, 660-665

Step 0 - Starting up G-language System

Let's begin the GC skew analysis. First, start up the G-language System. Genome analysis with the G-language System is very simple. For example, to analyze the E. coli genome in GenBank format, just type the two lines below and the preparation is complete.


use G; 
$gb = new G("ecoli.gbk"); 

To test it, run the script. Output like the following will appear.

$ perl [ENTER] 

Length of Sequence :   4639221 
         A Content :   1142136 (24.62%) 
         T Content :   1140877 (24.59%) 
         G Content :   1176775 (25.37%) 
         C Content :   1179433 (25.42%) 
            Others :         0 (0.00%) 
        AT Content :    49.21% 
        GC Content :    50.79% 

By printing these base composition statistics, the G-language System indicates that it has loaded the file successfully.
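The statistics above are simple base counts. As a rough illustration of what is being tallied, the following standalone Python sketch computes the same kind of summary for a short string. It does not use the G-language System; the names base_composition and report and the toy sequence are assumptions made for this example.

```python
from collections import Counter

def base_composition(seq):
    """Return counts of A, T, G, C, other characters, and the total length."""
    seq = seq.upper()
    counts = Counter(seq)
    stats = {base: counts[base] for base in "ATGC"}
    stats["others"] = len(seq) - sum(stats.values())
    stats["length"] = len(seq)
    return stats

def report(seq):
    """Print a summary in the spirit of the output shown above."""
    s = base_composition(seq)
    total = s["length"] or 1  # avoid dividing by zero for an empty sequence
    print(f"Length of Sequence : {s['length']}")
    for base in "ATGC":
        print(f"{base} Content : {s[base]} ({100 * s[base] / total:.2f}%)")
    print(f"Others : {s['others']} ({100 * s['others'] / total:.2f}%)")
    print(f"AT Content : {100 * (s['A'] + s['T']) / total:.2f}%")
    print(f"GC Content : {100 * (s['G'] + s['C']) / total:.2f}%")

report("ATGCGCGTTA")  # toy sequence, not real genome data
```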

To use a sequence format other than GenBank (Fasta, EMBL, swiss, SCF, PIR, GCG, raw, ace, etc.),

 use G; 
 $gb = new G("ecoli.fasta", "Fasta"); 

specify the data format as the second argument and the file will be loaded.

Step 1 - GC skew observation

The G-language System comes with a variety of standard functions for genome analysis, and GC skew is one of them. Let's add a line that calls the GC skew function to the script from Step 0 and run it.

use G; 
$gb = new G("ecoli.gbk"); 
gcskew(\$gb->{SEQ}); 

A graph like the one below will appear.

This graph shows the GC skew of the E. coli genome, calculated with a window of 10 kbp. There is a very distinctive trend, with shift points near 1,500,000 bp and 3,800,000 bp.
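To make the windowed calculation concrete, here is a minimal standalone Python sketch that applies the (C-G)/(C+G) formula from the introduction to consecutive windows. It is not the implementation of gcskew(). The function names, the use of non-overlapping windows, the choice to drop a trailing partial window, and the value 0 for a window without G or C are all assumptions of this sketch.

```python
def gc_skew(seq):
    """GC skew of one stretch of sequence, in the (C-G)/(C+G) form."""
    seq = seq.upper()
    g = seq.count("G")
    c = seq.count("C")
    if g + c == 0:
        return 0.0  # assumption: report 0 when the stretch has no G or C
    return (c - g) / (c + g)

def windowed_gc_skew(seq, window=10000):
    """Skew of consecutive non-overlapping windows (partial last window dropped)."""
    return [gc_skew(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, window)]

# Toy sequence: a C-rich half followed by a G-rich half.
toy = "CCCG" * 2 + "GGGC" * 2
print(windowed_gc_skew(toy, window=8))
```

Plotting the returned list against the window start positions gives a curve of the same kind as the graph described above.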

All data loaded when the G-language System starts up is stored in $gb. For example, the entire base sequence is stored in

$gb->{SEQ}

Most standard functions work when given either $gb or a reference to $gb->{SEQ} (if $gb->{SEQ} is unfamiliar, simply memorize it as a set phrase). For gcskew(), a reference to $gb->{SEQ} is the first argument.

Exercise 1: Replace the file in $gb = new G("ecoli.gbk"); with the genome of B. subtilis, H. pylori, H. influenzae, etc., and observe their characteristics. Compare the results and consider what they show.

Step 2 - GC skew analysis

The functions of the G-language System are not only useful but also offer a wide variety of options.

For example, gcskew() has the following options.

option        description
-window       Window size used to calculate GC skew (default: 10,000 bp)
-at           Calculates AT skew instead when set to 1 (default: 0)
-output       f for file output, g for graph output, show to display the graph (default: show)
-filename     Output file name (default: data/gcskew.csv when -output is f, graph/gcskew.gif when -output is g)
-application  Application used to display the image (default: gimp)

Options are specified as follows:

gcskew(\$gb->{SEQ}, -window=>50000, -at=>1); 

As shown above, a "-" is placed at the head of the option name, and the name is joined to its value with "=>".

For a more detailed analysis of the E. coli GC skew, let's look at both the AT skew and the GC skew of E. coli with a 50 kbp window. AT skew, like GC skew, measures the bias between the amounts of A and T. Its trend differs from that of GC skew and is less pronounced.

The script is as follows:

use G; 
$gb = new G("ecoli.gbk"); 
gcskew(\$gb->{SEQ}, -window=>50000, -filename=>"gcskew50k.gif"); 
gcskew(\$gb->{SEQ}, -window=>50000, -at=>1, -filename=>"atskew50k.gif"); 

The graphs are as follows:

GC skew: 50 k bp window

AT skew: 50 k bp window

With a 50 kbp window, the shift points of the GC skew are much clearer. The AT skew, however, is not as pronounced as the GC skew.
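The effect of the window size and of switching between GC and AT skew can be tried on a toy sequence with the standalone Python sketch below. It is not G-language code. The at flag only mimics the idea of the -at option, and the (A-T)/(A+T) form used for AT skew is an assumption made by analogy with (C-G)/(C+G); the sign convention actually used by gcskew() for AT skew is not stated in this tutorial.

```python
def skew(seq, first, second):
    """(first - second) / (first + second) for two bases; 0 if neither occurs."""
    seq = seq.upper()
    x = seq.count(first)
    y = seq.count(second)
    return (x - y) / (x + y) if x + y else 0.0

def windowed_skew(seq, window, at=False):
    """Skew per non-overlapping window; at=True uses A and T instead of C and G."""
    first, second = ("A", "T") if at else ("C", "G")  # AT order is an assumption
    return [skew(seq[i:i + window], first, second)
            for i in range(0, len(seq) - window + 1, window)]

toy = "AAAT" * 2 + "CCCG" * 2  # toy sequence, not real genome data
for size in (8, 16):
    print(size, windowed_skew(toy, size), windowed_skew(toy, size, at=True))
```

A larger window averages over more bases, which smooths local fluctuations; this is why the shift points stand out more clearly at 50 kbp than at 10 kbp.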

Exercise 2: Compare the GC skew and AT skew trends of different species. Also check whether anything new appears when the window size is changed.

Step 3 - GC skew seen from different angles

As explained in the introduction, GC skew is thought to arise from different mutation rates on the leading and lagging strands and from codon usage.

If GC skew results from different mutation rates on the leading and lagging strands, then conversely the replication origin and terminus can be located from the GC skew. Indeed, the GC skew shift points coincide with the replication origin and terminus. So let's predict the replication origin and terminus using GC skew.

Cumulative GC skew is useful for identifying shift points. It sums the GC skew values window by window, and its characteristic feature is that the shift points stand out sharply. In the G-language System, cumulative GC skew is provided by the function cum_gcskew().

The function cum_gcskew() has the following options.

option        description
-window       Window size used to calculate cumulative GC skew (default: 10,000 bp)
-at           Calculates AT skew instead when set to 1 (default: 0)
-output       f for file output, g for graph output, show to display the graph (default: show)
-filename     Output file name (default: data/cum_gcskew.csv when -output is f, graph/gcskew.gif when -output is g)
-application  Application used to display the image (default: gimp)

 cum_gcskew(\$gb->{SEQ}, -window=>50000); 

It can be run as shown above. A graph like the one below should be displayed.

Now let's change the window size and display several cumulative GC skew graphs. Note how clearly the shift points appear.
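The accumulation itself is just a running sum of the per-window skew values. The standalone Python sketch below illustrates this on a toy sequence. It is not the implementation of cum_gcskew(); the function names and the handling of windows without G or C are assumptions of the sketch.

```python
from itertools import accumulate

def gc_skew(seq):
    """GC skew in the (C-G)/(C+G) form; 0 if the stretch has no G or C."""
    seq = seq.upper()
    g = seq.count("G")
    c = seq.count("C")
    return (c - g) / (c + g) if g + c else 0.0

def cumulative_gc_skew(seq, window=10000):
    """Running sum of the skew of consecutive non-overlapping windows."""
    per_window = [gc_skew(seq[i:i + window])
                  for i in range(0, len(seq) - window + 1, window)]
    return list(accumulate(per_window))

# Toy sequence: C-rich half, then G-rich half.
toy = "CCCG" * 4 + "GGGC" * 4
print(cumulative_gc_skew(toy, window=8))
```

Because the running sum keeps rising while the skew is positive and keeps falling while it is negative, a change of sign in the skew shows up as a peak or a valley, which is easier to spot than a zero crossing in a noisy curve.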

Now let's predict the replication origin and terminus using the cumulative GC skew. The G-language System provides a standard function, find_ori_ter(), which uses the cumulative GC skew internally to predict the replication origin and terminus.

-window is the only option of find_ori_ter(). It sets the window size used for the cumulative GC skew, and its default is 1000. Use it together with cum_gcskew(), changing the window size, to predict the replication origin and terminus.

 find_ori_ter(\$gb->{SEQ}, -window=>500); 

It is used as shown above.

With a window size of 10000 bp, for example, E. coli gives a prediction like the following:


 Window size = 10000 
 Predicted Origin:   3915000 
 Predicted Terminus: 1545000 
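One simple way to turn a cumulative GC skew curve into such a prediction is to take its extreme points. The standalone Python sketch below does this. It is an illustration under stated assumptions, not a description of how find_ori_ter() works internally: it assumes the (C-G)/(C+G) sign used in this tutorial and a G-rich leading strand, so that the cumulative curve peaks near the origin and bottoms out near the terminus, and it reports the end coordinate of the window in question. The function names are hypothetical.

```python
def gc_skew(seq):
    """GC skew in the (C-G)/(C+G) form; 0 if the stretch has no G or C."""
    seq = seq.upper()
    g = seq.count("G")
    c = seq.count("C")
    return (c - g) / (c + g) if g + c else 0.0

def predict_ori_ter(seq, window=1000):
    """Return (origin, terminus) as the window-end positions of the maximum
    and minimum of the cumulative skew (assumption: G-rich leading strand)."""
    running = 0.0
    cumulative = []
    for i in range(0, len(seq) - window + 1, window):
        running += gc_skew(seq[i:i + window])
        cumulative.append(running)
    if not cumulative:
        raise ValueError("sequence is shorter than one window")
    indices = range(len(cumulative))
    ori = max(indices, key=cumulative.__getitem__)
    ter = min(indices, key=cumulative.__getitem__)
    return (ori + 1) * window, (ter + 1) * window

# Toy circular "genome": skew is positive up to 1000, negative afterwards.
toy = "CCCG" * 250 + "GGGC" * 250
print(predict_ori_ter(toy, window=100))
```

If the opposite sign convention, (G-C)/(G+C), were used, the roles of the maximum and the minimum would be swapped.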

Because GC skew is said to be related to codon usage, interesting results may emerge from comparing the GC skew of the whole genome, the coding regions, the non-coding regions, and the third codon position, which shows the most pronounced variation of the three codon positions. The G-language System provides a standard function for this analysis as well: genomicskew().

The function genomicskew() has the following options.

option        description
-divide       Total number of windows used to calculate GC skew (default: 250)
-at           Calculates AT skew instead when set to 1 (default: 0)
-output       f for file output, g for graph output, show to display the graph (default: show)
-filename     Output file name (default: data/genomicskew.csv when -output is f, graph/genomicskew.gif when -output is g)
-application  Application used to display the image (default: gimp)

 genomicskew($gb, -divide=>250); 

Let's run it as shown above. Note that the first argument is $gb.

The output is as follows. Consider it together with all the previous analyses.
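To show what looking at the third codon position means in practice, the standalone Python sketch below extracts every third base of a coding sequence and compares its GC skew with that of the full sequence. It is not the implementation of genomicskew(); the function names and the toy coding sequence are assumptions, and the sequence is assumed to start in frame at the first base of its start codon.

```python
def gc_skew(seq):
    """GC skew in the (C-G)/(C+G) form; 0 if the stretch has no G or C."""
    seq = seq.upper()
    g = seq.count("G")
    c = seq.count("C")
    return (c - g) / (c + g) if g + c else 0.0

def third_position_bases(cds):
    """Third base of each codon of an in-frame coding sequence."""
    return cds[2::3]

cds = "ATGGCCGCGTAA"  # toy CDS: ATG GCC GCG TAA
print("all positions :", gc_skew(cds))
print("third position:", gc_skew(third_position_bases(cds)))
```

For a real comparison, the same two numbers would be computed over all coding regions in each genomic window, alongside the values for the non-coding regions and the whole genome.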

Exercise 3: GC skew analysis can be applied in many more ways. Use the wide range of standard functions in the G-language System to carry out various analyses. Comparing the results with one another may reveal something interesting.

Step 4 - For more advanced analysis

The G-language System comes with a variety of standard genome analysis functions, so a wide range of analyses is possible. As a bioinformatician, however, you should treat these functions only as tools for probing more deeply into the phenomena of life.

The G-language System provides not only genome analysis functions but also a platform for handling genome databases with ease. For more information, refer to the documentation. That platform is $gb, an instance of the G-language System whose methods can be called to process each gene, the surroundings of start and stop codons, introns and exons, etc.

GC skew analysis may not be an entire analysis in itself, but it can be a necessary tool along the way.

For example, when investigating the relationship between the expression of each gene and its second codon, one might use GC skew to examine the trend in the neighboring sequence. The following is one example of how to express this with the G-language System.

 use G; 
 $gb = new G("ecoli.gbk"); 
 cai($gb); 
 $w_val = w_value($gb); 

 foreach $cds ($gb->cds()){ 
    $secondcodon = $gb->after_startcodon($cds, 3); 
    $w_second = $$w_val{$secondcodon}; 
    $cai = $gb->{$cds}->{cai}; 
    if ($w_second < 0.5 && $cai > 0.8){ 
       $afterstart = $gb->after_startcodon($cds, 99); 
       gcskew(\$afterstart, -window=>9, -filename=>"$cds-gcskew.gif"); 
    } 
 } 

First, start up the G-language System and load the genome database. The cai() function stores the CAI value for each gene (Codon Adaptation Index: a measure of translation efficiency that can also serve as an indicator of gene expression level). The W value, which reflects codon usage bias, is obtained with w_value() and stored in $w_val.

 foreach $cds ($gb->cds()){ 


is the most basic way to process every CDS using the G-language System. $gb->cds() returns the names of all CDSs in the genome database held in $gb. In other words, looping over them with foreach makes it possible to analyze every CDS.

$gb stores the FEATURE information in a data structure. The information for each CDS is held in a structure named CDS plus a number, and its fields are accessed hierarchically, as in $gb->{CDS534}->{start}.

    $secondcodon = $gb->after_startcodon($cds, 3); 
    $w_second = $$w_val{$secondcodon}; 
    $cai = $gb->{$cds}->{cai}; 

This part first takes the three letters after the start codon of $cds with the standard function after_startcodon() and stores them in $secondcodon. It then obtains the W value of that codon and the CAI value of the gene.

    if ($w_second < 0.5 && $cai > 0.8){ 
       $afterstart = $gb->after_startcodon($cds, 99); 
       gcskew(\$afterstart, -window=>9, -filename=>"$cds-gcskew.gif"); 

This section examines the GC skew of the 99 bp downstream of the start codon, in 9 bp (three-codon) windows, for genes whose CAI value is above 0.8 (that is, highly expressed genes) and whose codon immediately after the start codon has a W value below 0.5 (that is, a rare codon).
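The selection and windowing logic of this script can be tried out without a genome file using the standalone Python sketch below. It does not call the G-language System: after_start is a hypothetical stand-in for the idea behind after_startcodon(), and the coding sequence, the W table and the CAI value are made-up toy inputs.

```python
def gc_skew(seq):
    """GC skew in the (C-G)/(C+G) form; 0 if the stretch has no G or C."""
    seq = seq.upper()
    g = seq.count("G")
    c = seq.count("C")
    return (c - g) / (c + g) if g + c else 0.0

def after_start(cds_seq, length):
    """Bases immediately following the 3 bp start codon (hypothetical helper)."""
    return cds_seq[3:3 + length]

def downstream_skew(cds_seq, span=99, window=9):
    """GC skew of the region after the start codon, in non-overlapping windows."""
    region = after_start(cds_seq, span)
    return [gc_skew(region[i:i + window])
            for i in range(0, len(region) - window + 1, window)]

# Toy inputs, all made up for illustration.
cds_seq = "ATG" + "CCCGGGCCC" * 11
w_table = {"CCC": 0.3}  # pretend CCC is a rarely used codon
cai = 0.85              # pretend the gene is highly expressed

second_codon = after_start(cds_seq, 3)
if w_table.get(second_codon, 1.0) < 0.5 and cai > 0.8:
    print(second_codon, downstream_skew(cds_seq))
```

A 99 bp region cut into 9 bp windows yields 11 skew values per selected gene, one for every three codons.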

This script is fairly complex, yet it is only 13 lines long. The G-language System makes complex analyses like this efficient and easy.

From here, set out to uncover the truths of life using the G-language System.

tutorialgcskewenglish.1187532552.txt.gz · Last modified: 2014/01/18 07:44 (external edit)