User Tools

Site Tools


problem_11

Practice 1-1 - Basic DNA sequence analysis (1)

Write a program that counts all the number of each base (A,T,G and C) from a given specific area of genome. First, read the question carefully and think well about the meanings and solutions. Then design a scheme of the program.

1 Programing design

1.1 Overview

By dividing the scheme of program into three parts, one can clearly see what to make.

  1. Open the data file.
  2. Load the data file and count up the bases.
  3. Analyze the data file and save it to a file.

Here, users need some basic understanding on programming Perl such as standard functions for data inputs and outputs. Syntax on Perl is different from some other languages so reference books can help you to follow the Perl programming procedures.

Once steps and the logic for programming is ready, you can begin to write a script, but let’s prepare a little more for better programming.

1.2 Preparation

Now we are going to think about a global settings of the script. The Perl language has high extensibility that enables users to process any kind of processes. Therefore certain arrangements for the script can guide users to ease programming.

A variable is a place where values are stored in computer. Let’s think about variables first. In the Perl language users can ignore variable "types" such as strings and numerals that are different from other popular languages such as C or Java. $var is used in Perl to represent a string variable with “$” in the prefix of variable name. In this Practice 1-1, you are going to count up four types of bases so you will need four types of variables to store each value. Also, another variable for keeping a DNA sequence will be helpful.

Altogether, five variables are fundamental for this program.

In Perl programming, try to use “my” whenever declaring a variable. Declaration is not necessary in Perl, but declarations do make programming much more readable and as a result it leads to more comfortable programming experience. Scope for each variable is an important factor in programming so take care when declaring variables within a loop process where variables become a local variable.

Lastly, put the following line at the top of script to abbreviate perl command in the shell when the file permission is 755.

#!/usr/local/bin/perl

This makes running of the script shorter, from

% perl myprogram.pl

to

% myprogram.pl

From the above preparations, basic script should now look like this.

#!/usr/local/bin/perl

my($seq, $A, $T, $G, $C);

In the next section, let’s write an actual program. This time we will skip constructing subroutines since it is a short and simple script.

2 Open file

2.1 Filehandle

Perl has very powerful system for file handling. Inside the script, users declare special variable called the filehandle which is different from the normal variables mentioned above. All the letters in the filehandle is written in upper case like FILEHANDLE for instance. So within the script users do not use the filename directly, but use the filehandle to cope with file input and output.

2.2 Open

When opening a file, use the "open" function and "close" function when closing a file. This open and close is common for file loading and outputting.

open(FILEHANDLE, "filename");

Above isa typical way of Perl programming. Set absolute path for a file to open, and the system will load the file. If users want to save or append a file then put “>” or “»” identifier in front of the file name. For example if you want to save a data into “out.txt” then write a script like this.

open(OUT, ‘>out.txt’);

Metaphorically, filehandle is a pipe for a file which plays a role of interface between a file and a program. When handling a file user always use filehandle so that once a file is set to a filehandle there is no need to rewrite file name again. Moreover, filehandle keeps track of the position within the given file, starting from the top line, progressing line-by-line as programming loop continues.

2.3 Refine open

Adding some extra line to the script makes the program much more friendly to users and computer.

open(FILE, “my.seq”) || die(“ERROR: file does not exist\n”);
print STDERR“open data file my.seq\n";

The above script returns an error message and quits when the program fails to open the specified file, as "ERROR: file does not exist”, and continues by printing “open data file my.seq” upon success. STDERR is an identifier for standard error output.

2.4 Close

The following syntax closes a file. Remember to close the filehandle when file handling is finished. Once a filehandle is closed, the filehandle name can be reused for other files.

close(FILEHANDLE);

3 Data analysis

The following are a few hints to the practice.

3.1 Using the "while" statement

The logic of this program is to load the data and count the number of bases. This section will provide a clue for reading a file in Perl. The "while" statement provides a way to manage basic loop process that reads a file from top to bottom.

while (<FILE>) {


}

Above code is the syntax for "while" statement in Perl. A filehandle covered with brackets “<>” reads each line of a file. Brackets returns 1 if there is any line and 0 if it reaches the end of file, and finishes the process of "while" statement.

In each process of "while" statement, loaded line is stored into special variable $_ so use this variable within the loop for necessary procedures.

In the while statement, variables within are initialized in each loop process. Therefore, users need to declare each variable before the while statement to escape and avoid initialization problem.

3.2 Store the DNA sequence

Before calculating the number of bases, read the file and store the DNA sequence into a variable. Declare a variable $seq before while statement and join each sequence by each loop. Make sure when storing a new sequence into to the previous sequence remove linefeed code in the end of every line.

To remove linefeed code, use replace function "s/" in Perl. Replacement process is one of the most powerful functionality in Perl that replaces characters into another, separated by slashes. For example, when replacing linefeed code into ENTER then code will be like this. $seq =~ s/\n/ENTER/; Put a switch g so that replacement goes through all instances of "¥n". $seq =~ s/\n/ENTER/g; Above code will replace all linefeed code into ENTER. As advanced comment, function s can use regular expression for pattern matching. So to remove a character that is NOT lower case as an example, it can be like this. $seq =~ s/[^a-z]g;

Regular expression is useful in Perl so if users have any interest we recommend learning further.

Last tip in this section is the joining of two strings. In Perl, use .= operator to append a string to another.

$seq1 .= $seq2;

3.3 Counting the bases

You can count each base (A, T, G and C) in the C Programming language-like way.

if ($nucleotide eq ‘a’){
  $A ++;
}

However, since you are learning Perl, let’s write it in more Perlish way! In Perl, there is another replacement function "tr/ / /" for single character replacement. This function returns the number of conducted replacement. For example, if users replaced “a” into “t”, beneath code does the same process.

$seq =~ s/a/t/g;
$seq =~ tr/a/t/;

But, as mentioned tr/ / / returns the replaced number so changing the code into following will do nice work.

$count = $seq =~ tr/a/t/;

In variable $count, number of replaced “a” to “t” is stored.

$count = $seq =~ tr/a/a/;

By using this syntax, users can count up bases like this if target base was adenine.

3.4 Combining parts

The framework to solve Practice 1-1 should look like this.

#!/usr/local/bin/perl
my($A,$T,$G,$C,$seq);

open(FILE, ?????);

while (<FILE>) {
   ?????  # Remove linefeed code
   ?????  # Join a sequence into variable $seq
}
close(FILE);

$A = ?????
$T = ?????
$G = ?????
$C = ?????
problem_11.txt · Last modified: 2014/01/18 07:44 (external edit)