G::Seq Primitive
SummaryIncluded librariesPackage variablesDescriptionGeneral documentationMethods
Summary
  G::Seq::Primitivew - Basic sequence analysis methods
Package variables
Privates (from "my" definitions)
%CodonTable = ( 'gac', 'D', 'caa', 'Q', 'gca', 'A', 'ctg', 'L', 'gat', 'D', 'cag', 'Q', 'gcc', 'A', 'ctt', 'L', 'gaa', 'E', 'agc', 'S', 'gcg', 'A', 'ata', 'I', 'gag', 'E', 'agt', 'S', 'gct', 'A', 'atc', 'I', 'aga', 'R', 'tca', 'S', 'gga', 'G', 'att', 'I', 'agg', 'R', 'tcc', 'S', 'ggc', 'G', 'cca', 'P', 'cga', 'R', 'tcg', 'S', 'ggg', 'G', 'ccc', 'P', 'cgc', 'R', 'tct', 'S', 'ggt', 'G', 'ccg', 'P', 'cgg', 'R', 'aca', 'T', 'gta', 'V', 'cct', 'P', 'cgt', 'R', 'acc', 'T', 'gtc', 'V', 'atg', 'M', 'aaa', 'K', 'acg', 'T', 'gtg', 'V', 'tgg', 'W', 'aag', 'K', 'act', 'T', 'gtt', 'V', 'tgc', 'C', 'cac', 'H', 'tac', 'Y', 'tta', 'L', 'tgt', 'C', 'cat', 'H', 'tat', 'Y', 'ttg', 'L', 'taa', '/', 'aac', 'N', 'ttc', 'F', 'cta', 'L', 'tag', '/', 'aat', 'N', 'ttt', 'F', 'ctc', 'L', 'tga', '/' )
Included modules
G::Messenger
SelfLoader
SubOpt
Inherit
Exporter
Synopsis
No synopsis!
Description
           This class is a part of G-language Genome Analysis Environment,
           collecting basic sequence analysis methods.
Methods
DoubleHelixDescriptionCode
complementDescriptionCode
shuffleseqDescriptionCode
splitprintseqDescriptionCode
to_fastaDescriptionCode
to_fastqDescriptionCode
translateDescriptionCode
Methods description
DoubleHelixcode    nextTop
  Name: DoubleHelix   -   we all love the DNA molecule!

  Description:
    This method prints the given sequence as an ASCII text-based graphic
    depicting a DNA molecule. Enjoy :-)
    
  Usage: 
    DoubleHelix($seq);

  Options:
    -speed      controls the time to display one base-pair.
                default is 0.05 (second)

  REST:
    http://rest.g-language.org/NR_029721/DoubleHelix/speed=0
Author: Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
History: 20060511-01 initial posting
complementcodeprevnextTop
   Name: complement   -   get the complementary nucleotide sequence

   Description:
         Given a sequence, returns its complement.

         eg. complement('atgc');  # returns 'gcat'

   REST: 
      http://rest.g-language.org/complement/atgc 
Author: Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
History: 20010727-01 initial posting
shuffleseqcodeprevnextTop
  Name: shuffleseq   -   create randomized sequence with conserved k-mer composition

  Description:
    Shuffle and randomize the given sequence, conserving the nucleotide/peptide
    k-mer content of the original sequence. 

    For k=1, i.e. shuffling sequencing preserving single nucleotide composition, 
    Fisher-Yates Algorithm is employed. 

    For k>1, shuffling preserves all k-mers (all k where k=1~k). For example,
    k=3 preserves all triplet, doublet, and single nucleotide composition. 
    Algorithm for k-mer preserved shuffling is non-trivial, which is solved
    by graph theoretical approach with Eulerian random walks in the graph of 
    k-1-mers. See references 3-5 for details of this algorithm. 
    
  Usage:
    $shuffled_seq = shuffleseq($gb);

  Options:
    -k            k-mer to preserve the composition (default: 1)
                  k=1 means only the base composition is preserved and the 
                  order is shuffled. k=1 preserves dinucleotide composition
                  as well as the single nucleotide composotion. Similarly,
                  k=n preserves all of 1~n -mers. Note that k < 5
                  with nucleotide sequence will take several minutes to 
                  finish. 

  References:
    1. Fisher R.A. and Yates F. (1938) "Example 12", Statistical Tables, London
    2. Durstenfeld R. (1964) "Algorithm 235: Random permutation", CACM 7(7):420
    3. Jiang M., Anderson J., Gillespie J., and Mayne M. (2008) "uShuffle: 
       a useful ool for shuffling biological sequences while preserving the k-let
       counts", BMC Bioinformatics 9:192
    4. Kandel D., Matias Y., Unver R., and Winker P. (1996) "Shuffling biological 
       sequences", Discrete Applied Mathematics 71(1-3):171-185  
    5. Propp J.G. and Wilson D.B. (1998) "How to get a perfectly random sample 
       "from a generic Markov chain and generate a random spanning tree of a 
       directed graph", Journal of Algorithms 27(2):170-217

  Author:
    Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
History: 20080421-01 added k-mer count preservation option 20070612-01 initial posting
splitprintseqcodeprevnextTop
  Name: splitprintseq   -   format sequence data for printing

  Description:
    This method splits the given sequence string in segments of 
    certain length and inserts newline code ("\n") to print the
    sequence in a formatted way. By default, this function splits
    the string in segments of 60 letters. This length can be
    changed by supplying the second argument.

    
  Usage : 
    print splitprintseq($seq); # print sequence in a formatted way.
      or
    $formatted_seq = splitprintseq($seq, 100);  
    # split into segments of 100 characters and add "\n" to each line.

 Options:
    Length of segments can be specified as the second argument.
    Default is 60 characters to match GenBank.

  Author: 
    Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
History: 20050116-01 initial posting
to_fastacodeprevnextTop
  Name: to_fasta   -   output given sequence to a FASTA file

  Description:
    This method outputs the given sequence as a FASTA file.
    
  Usage : 
    $fasta_string = to_fasta($seq, -name=>"My sequence"); # output the sequence to out.fasta

 Options:
    -name        string for FASTA header (default: sequence)
                 if the first argument is an instance of new G(), 
                 $gb->{LOCUS}->{id} is used as default
    -length      number of characters per line (default: 60)
    -filename    output filename (default: "out.fasta")
    -output      "f" to output to file, and return the fasta as a string.
                 "return" to just return the fasta as a string (default: "return")

  Author: 
    Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
History: 20090312-01 changed the deafult behavior to return fasta as a string as well as outputting to a file 20070612-01 added -output option 20050116-01 initial posting
to_fastqcodeprevnextTop
  Name: to_fastq   -   output given sequence to a FASTQ file

  Description:
    This method outputs the given sequence as a FASTQ file. If the given sequence
    object does not contain $file->{QUAL} section, 'X' is used for quality score
    (corresponding to 50 in Phred quality).
    
  Usage : 
    $fastq_string = to_fastq($seq, -name=>"My sequence"); # output the sequence to out.fasta

 Options:
    -name        string for FASTQ header (default: sequence)
                 if the first argument is an instance of new G(), 
                 $gb->{LOCUS}->{id} is used as default
    -length      number of characters per line (default: 60)
    -filename    output filename (default: "out.fastq")
    -output      "f" to output to file, and return the fasta as a string.
                 "return" to just return the fasta as a string (default: "return")

  Author: 
    Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
History: 20091201-01 initial posting
translatecodeprevnextTop
  Name: translate   -   translate nucleotide sequence to amino acid sequence

  Description:
    This method translates a given nucleotide sequence into amino acid 
    sequence, using standard codon table. Change to the codon table or
    custom codon table can be supplied as optional arguments. Stop codons
    are denoted by '/' by default.

    eg. translate('ctggtg');  # returns 'LV'
    
  Usage : 
    $amino = translate($nucleotide);

    # to set stop codons to 'X'
    $amino = translate($nucleotide, tga=>"X", taa=>"X", tag=>"X");

    # to use custom codon table
    $amino = translate($nucleotide, %customCodonTable);

  Options:
    Custom codon table can be optionally used (see above).

  REST: 
      http://rest.g-language.org/translate/ctggtg
Author: Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
History: 20091203-01 added cutom codon table option 20010727-01 initial posting
Methods code
DoubleHelixdescriptionprevnextTop
sub DoubleHelix {
    SubOpt::opt_default(speed=>0.05);
    my @args = opt_get(@_);  
    my $gb = opt_as_gb(shift @args);
    my $speed = opt_val("speed");

    $| = 1;

    my $i;
    my (@offset) = qw/1 0 0 0 1 2 3 4 5 5 4 3 2 1 0 0 0 1/;
    my (@dist)   = qw/0 2 3 4 4 4 3 2 0 0 2 3 4 4 4 3 2 0/;

    foreach my $base (split(//, $gb->{SEQ})){
        my $msg = "         " .  q// /x$offset[($i%scalar(@offset))];
$msg .= uc($base); $i ++; $msg .= q//-/x$dist[($i%scalar(@dist))];
$msg .= uc(complement($base)) . "\n"; msg_send($msg); select(undef,undef,undef,$speed) unless($speed == 0); }
}
complementdescriptionprevnextTop
sub complement {
    my $nuc = reverse(shift);

    $nuc =~ tr
[acgturymkdhbvwsnACGTURYMKDHBVWSN]
[tgcaayrkmhdvbwsnTGCAAYRKMHDVBWSN];
return $nuc;
}
shuffleseqdescriptionprevnextTop
sub shuffleseq {
    opt_default("k"=>1);
    my @args = SubOpt::opt_get(@_);
    my $gb = SubOpt::opt_as_gb(shift @args);
    my $k = opt_val("k");

    if($k == 1){
	return join('', shuffle(split(//, $gb->{SEQ})));
    }

    my $graph = {};
    my $revgraph = {};
    my $arborescence = {};

    my $length = length($gb->{SEQ});
    my %oligo = oligomer_counter($gb, -length=>$k);
    
    foreach my $nuc (keys %oligo){
	$graph->{substr($nuc, 0, $k - 1)}->{substr($nuc, -1, 1)} = $oligo{$nuc};
	$revgraph->{substr($nuc, 1, $k - 1)}->{substr($nuc, 0, 1)} = $oligo{$nuc};
    }
    my $numoligos = keys(%$graph);
    undef %oligo;


    msg_error("Compiling arborescene...\n");
    
    my %edged;
    my $count = 1;
    $edged{substr($gb->{SEQ}, - ($k - 1))} ++;
    my %visited = %edged;

    while($numoligos > $count){
	my @keys = keys %edged;
	my $node = $keys[int(rand(@keys))];
	my @nucs = keys %{$revgraph->{$node}};
	my $rnuc = $nucs[int(rand(@nucs))];

	my $newnode = $rnuc . substr($node, 0, $k - 2);
	unless($visited{$newnode}){
	    $count ++;
	    my $first = substr($node, -1, 1);
	    $visited{$newnode} ++;
	    $edged{$newnode} ++;
	    $arborescence->{$newnode} = $first;
	    $graph->{$newnode}->{$first} --;
	}
	delete($revgraph->{$node}->{$rnuc});
	delete($edged{$node}) if(scalar(@nucs) == 1);
    }
    undef $revgraph;

    msg_error("Compilation of arborescence finished.\nNow generating shuffled sequence.\n");

    my $seed = substr($gb->{SEQ}, 0, $k - 1);
    my $seq = $seed;

    while(length($seq) < $length){
	my $num = 0;
	foreach my $val (values %{$graph->{$seed}}){
	    $num += $val;
	}
	if ($num < 1){
	    if(defined $arborescence->{$seed}){
		my $oldseed = $seed;

		$seq .= $arborescence->{$seed};
		$seed = substr($seed, 1) . $arborescence->{$seed};

		delete($arborescence->{$oldseed});

		next;
	    }else{
		die("Dead end: failed to trace the Eulerian. Try shuffling again.");
		#Mathematically, this should not happen.
} } my $rand = int(rand($num)); foreach my $key (keys %{$graph->{$seed}}){ if($rand <= $graph->{$seed}->{$key}){ $graph->{$seed}->{$key} --; delete($graph->{$seed}->{$key}) if($graph->{$seed}->{$key} < 1); $seed = substr($seed, 1) . $key; $seq .= $key; last; }else{ $rand -= $graph->{$seed}->{$key}; } } } msg_error("Done.\n"); return $seq;
}
splitprintseqdescriptionprevnextTop
sub splitprintseq {
    my $seq = shift;
    my $len = shift || 60;
    my $ret = '';

    while(length $seq){
	$ret .= substr($seq, 0, $len) . "\n";
	substr($seq, 0, $len) = '';
    }
    
    return $ret;
}
to_fastadescriptionprevnextTop
sub to_fasta {
    SubOpt::opt_default(length=>60, filename=>"out.fasta", name=>"sequence", output=>"return");
    my @args = SubOpt::opt_get(@_);
    my $first = shift @args;
    die("Error: No sequence passed to to_fasta") unless(length($first));
    my $gb = SubOpt::opt_as_gb($first);
    my $filename = SubOpt::opt_val("filename");
    my $name = SubOpt::opt_val("name");
    my $length = SubOpt::opt_val("length");
    my $output = SubOpt::opt_val("output");

    $name = $gb->{LOCUS}->{id} if($name eq 'sequence' && length $gb->{LOCUS}->{id});

    if($output eq 'f'){
	open(OUT, ">$filename") || die($!);
	printf OUT ">%s\n%s", $name, splitprintseq($gb->{SEQ}, $length);
	close(OUT);
	return sprintf ">%s\n%s", $name, splitprintseq($gb->{SEQ}, $length);
    }else{
	return sprintf ">%s\n%s", $name, splitprintseq($gb->{SEQ}, $length);
    }
}
to_fastqdescriptionprevnextTop
sub to_fastq {
    SubOpt::opt_default(length=>75, filename=>"out.fastq", name=>"sequence", output=>"return");
    my @args = SubOpt::opt_get(@_);
    my $first = shift @args;
    die("Error: No sequence passed to to_fastq") unless(length($first));
    my $gb = SubOpt::opt_as_gb($first);
    my $filename = SubOpt::opt_val("filename");
    my $name = SubOpt::opt_val("name");
    my $length = SubOpt::opt_val("length");
    my $output = SubOpt::opt_val("output");

    $name = $gb->{LOCUS}->{id} if($name eq 'sequence' && length $gb->{LOCUS}->{id});
    $gb->{QUAL} = 'X' x length($gb->{SEQ}) unless(length($gb->{QUAL}));

    if($output eq 'f'){
	open(OUT, ">$filename") || die($!);
	printf OUT "\@%s\n%s\+%s\n%s", $name, splitprintseq(uc($gb->{SEQ}), $length), $name, splitprintseq($gb->{QUAL}, $length);
	close(OUT);
	return sprintf "\@%s\n%s\+%s\n%s", $name, splitprintseq(uc($gb->{SEQ}), $length), $name, splitprintseq($gb->{QUAL}, $length);
    }else{
	return sprintf "\@%s\n%s\+%s\n%s", $name, splitprintseq(uc($gb->{SEQ}), $length), $name, splitprintseq($gb->{QUAL}, $length);
    }
}
translatedescriptionprevnextTop
sub translate {
    my $seq = lc(shift);
    my %option = @_;

    my %codons = %CodonTable;
    foreach my $key (%option){
	$codons{$key} = $option{$key};
    }

    my $amino = '';

    while(3 <= length($seq)){
	my $codon = substr($seq, 0, 3);
	substr($seq, 0, 3) = '';
	if ($codon =~ /[^atgc]/){
	    $amino .= 'X';
	}else{
	    $amino .= $codons{$codon};
	}
    }

    msg_error("Translation: illegal length.\n") if(length($seq));

    return $amino;
}
General documentation
No general documentation available.