User Tools

Site Tools


problem_13

Practice 1-3 - Basic DNA sequence analysis (3)

A consecutive pair of nucleotides are called dinucleotide and there are 16 types of nucleotides as a combination of four bases. In the Practice 1-3, modify previous program for dinucleotide frequency analysis. Calculate the dinucleotide composition in the Mycoplasma genome and compare it with the result of Practice 1-2, considering the result of Practice 1-2 as the expected value. Key method in this practice is (1) the calculation of the expected value and (2) how to compare the observed and expected values. O/E (observed/expected ratio) values are defined as observed value divided by the expected value. By using O/E values, one can identify a certain bias in the observed value from the expected value, such as to say, "AA is only half abundant than expected, but CT is twice as abundant as what has been expected.

1 Counting the dinucleotide frequency

1.1 Setting variables

my %dinuc;   # Numbers of dinucleotides
my %diexp;   # Expectation of dinucleotides
my %diobs;   # Observed frequency of dinucleotides
my %oe;       # O/E value for dinucleotides

Above variables seem to be required in this analysis. “%” in front of variable name indicates a "hash" in Perl. The idea of hash is similar to array, which has “@” in front of the variable name, but is different from arrays since hash can accept character keys.

@array = (1, 2, 3);

Above is an example of array. To access to value in array users need to $array[1] to get value 2.

%hash = {'string'=>1, 'message'=>2, 'line'=>3}

But in hash, users can access to value 2 with key “message” like this: $hash{'message'}.

1.2 Improve counting

Since dinucleotides are composed of two characters, one cannot use tr/ or s/ because it does not return replaced number. To solve this problem, let's use the "for" statement, and slide along the sequence single character at a time, taking two characters in a frame. For example, if a sequence was “atgcggctg” first frame will be “at” and second will be “tg”. In this way, you can check a sequence from the first character until the one-minus-the-last character. Note, however, that Perl position number starts from 0 and not 1, so the last position for counting dinucleotides will be two minus the length of the sequence.

There are function in Perl called substr() which cuts out partial sequence from the given sequence. So to make a frame composed of two characters, use substr() to select two characters.

$parts = substr($seq, 0, 2);

Here, use hash to add its count by supplying the dinucleotide as a key.

$diobs{$parts} ++;

$parts in above code is a frame composed of two characters, dinucleotide, as a key for hash.

Make sure that “$” is in front of variable name, and not “%”, because here the counting value is a scalar part of the hash, and not the hash itself.

2 Calculating the O/E value

2.1 Calculating the expected value (Basic)

What is an expected value? Literally, it is a value that is expected for an event. Thinking about dinucleotide, theoretical expectation for each of them is 1/16. But as you have seen in Practices 1-1 and 1-2, every species possesses unique A, T, G and C composition. Therefore, some nucleotide is less than other nucleotide and same things can be said for the balance between the dinucleotides.

Therefore, one can calculate the expected value of dinucleotide by multiplying the percentages of the first and second nucleotides.

2.2 Calculate expectation (Advanced)

$diexp{'aa'} = $percent_a * $percent_a;
$diexp{'at'} = $percent_a * $percent_t;
                    :
                    :
                    :

Perl has "foreach" statement to access values in an array.

foreach $content (@array){
	print $content;
}

Above code shows that an element in the array is stored into variable $content. Beneath is an example of access to hash values.

foreach $key (sort keys %hash){
	print "$key = " , $hash{$key};
}

“sort keys” in the code sorts the keys to the %hash in an alphabetical order. Now use foreach statement to access the values in hash %diobs and calculate the O/E values.

2.3 Calculate the O/E value

O/E value is calculated by dividing the observed value by expected value. Calculate O/E value by previously acquired observed and expected values, and store it in a hash.

3 Output

All the values are now stored in hash %oe with dinucleotides as its key (aa, at, …). For each line of output let’s print out “Name of dinucleotide”, “O/E value”, “Observed value” and “Expected value”.

     O/E: Observation/Expectation
aa  *.22:        0.*5/0.*2
at  *.77:        *.*9/*.*2
                 :
                 :
                 :

Fix the decimal places in the numeral if needed.

print ("     O/E: Observation/Expectation\n");
printf ("%s  %.2f:       %.2f/%.2f\n", $key, $oe{$key}, $diobs{$key}, $diexp{$key});

Above code is one example for fixed numerals and hash handling.

problem_13.txt · Last modified: 2014/01/18 07:44 (external edit)