User Tools

Site Tools


Practice 1-2 - Basic DNA sequence analysis (2)

Refine previous program that enables to load whole genome sequence of M. genitalium and compare the compositional difference (bias). Practices 1-1 and 1-2 are similar, and the major difference comes from the type of DNA sequence format. In Problem 1-1 data are already prepared by cutting and pasting target sequence from data, but in Practice 1-2, users need to acquire the DNA sequence from genome data format which includes annotations and other information. If the DNA sequence is parsed out, counting can be done in the same way as done in 1-1. Again, read the passage and think logically, try to think about the meanings of every procedure. Then make a basic design of program.

1 Programming design

1.1 Improve data loading

The script for the Practice 1-1 looks like this.


open(FILE, ?????);

while (<FILE>) {
  ?????  # Remove linefeed code
  ?????  # Join a sequence into variable $seq

$A = ?????
$T = ?????
$G = ?????
$C = ?????

Look at the genome data to see what needs to be done for acquiring DNA sequence.

LOCUS       L43967     580074 bp    DNA   circular  BCT       05-NOV-1998
DEFINITION  Mycoplasma genitalium G37 complete genome.
        1 taagttatta tttagttaat acttttaaca atattattaa ggtatttaaa aaatactatt
   580021 gaaatgatca tatatttaaa tgattataat atttctttaa tactaaaaaa atac

Firstly, there are comments such as “LOCUS”, “DEFINITION” and “ACCESSION” in the beginning of data. Target DNA sequence is in the last part of the file and one line above the sequence is the tag “ORIGIN”. So, the strategy will be to skip until the comment comes out, and read the sequences until the very last line of the data marked with two slashes.

From these, loading data in the program can be improved to the following:



open(FILE, ?????);

while (<FILE>) {
     if (?????) {
          while (<FILE>) {
              last if (/\/\//);  # Terminate program if a sequence includes "//"
                   ?????              # Remove linefeed code
                   ?????              # Join a sequence into variable $seq


$A = ?????;
$T = ?????;
$G = ?????;
$C = ?????;

2 Comparing both results

2.1 Decimal calculation

In the Practice 1-2, comparison of result with that of Practice 1-1 is required. Sine here one would be comparing two sets of unequal populations, consider outputting results in percentages of up to the second decimal place.

If users are to count adenine for instance then the script should look like this.


length() function returns the length of a variable. If the variable was a string then it returns the number of characters so that length() function for $seq returns number of nucleotide in M. genitalium.

2.2 Output of decimals

Outputs in Perl can be formatted by printf() function like in C. So let’s cut up to the second decimal places.

printf("A:   %.2f\n", $percent);

“.2” in the front of f means to print the numeral to the second decimal places. In the same way, “.4” means to print the numeral to the forth decimal places.

3 Advanced: Combine two script into a single process

You have now made two programs, one from Practice 1-1 and the another from Practice 1-2. These two programs are very much similar so as an advanced problem, let’s combine the two into one. Some ideas to improve the program may be to make it prompt for the input file name. In this way, this script can be a generic program for calculating the nucleotide composition.

problem_12.txt · Last modified: 2014/01/18 07:44 (external edit)