G documentation.

 use G;                          # Imports G-language GAE module 
   
 $gb = new G("ecoli.gbk");       # Creates G's instance as $gb 
                                 # At the same time, read in ecoli.gbk. 
                                 # Read the annotation and sequence 
                                 # information 
                                 # See DESCRIPTION for details
   
 $gb->seq_info();                # Prints the basic sequence information.

 find_ori_ter($gb);              # Pass $gb as the first argument to 
                                 # most of the analysis functions

Description

 The G-language GAE supports most sequence database formats.

 Stored annotation information:

 LOCUS  
         $gb->{LOCUS}->{id}              -accession number 
         $gb->{LOCUS}->{length}          -length of sequence  
         $gb->{LOCUS}->{nucleotide}      -type of sequence ex. DNA, RNA  
         $gb->{LOCUS}->{circular}        -1 when the genome is circular.
                                          otherwise 0
         $gb->{LOCUS}->{type}            -type of species ex. BCT, CON  
         $gb->{LOCUS}->{date}            -date of accession 

 HEADER  
         $gb->{HEADER}  

 COMMENT  
         $gb->{COMMENT}  

 FEATURE  
         Each FEATURE is numbered (e.g. FEATURE1 .. FEATURE1172) and is a 
         hash structure that contains all the keys of the GenBank entry.
         In most cases, the FEATURE$i hash contains at least the
         information listed below: 
         $gb->{FEATURE$i}->{start}  
         $gb->{FEATURE$i}->{end}  
         $gb->{FEATURE$i}->{direction}
         $gb->{FEATURE$i}->{join}
         $gb->{FEATURE$i}->{note}  
         $gb->{FEATURE$i}->{type}        -CDS,gene,RNA,etc.
         $gb->{FEATURE$i}->{feature}     -same as $i

         To analyze each FEATURE, write: 

         foreach my $feature ($gb->feature()){
               print $gb->{$feature}->{type}, "\n";
         }  

         In the same manner, to analyze all CDS, write:  
 
         foreach my $cds ($gb->cds()){
               print $gb->{$cds}->{gene}, "\n";
         }

         Feature or gene information can also be accessed with CDS numbers:
         $gb->{CDS$i}->{start}

         or with locus_tags or gene names (for CDS, tRNA, and rRNA)
         $gb->{thrL}->{start}
         $gb->{b0001}->{start}
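
         The layout described above can be pictured with an ordinary Perl
         hash. The following is a minimal sketch, not the G module itself:
         $gb is a hand-built stand-in, the feature values are illustrative,
         and feature_ids() and cds_ids() are hypothetical helpers that only
         mimic what feature() and cds() are described as returning.

```perl
use strict;
use warnings;

# Hand-built stand-in for a G instance (illustrative values only).
my $gb = {
    FEATURE1 => { type => 'gene', start => 190, end => 255,  direction => 'direct',     feature => 1 },
    FEATURE2 => { type => 'CDS',  start => 190, end => 255,  direction => 'direct',     feature => 2, gene => 'thrL' },
    FEATURE3 => { type => 'CDS',  start => 337, end => 2799, direction => 'complement', feature => 3, gene => 'geneX' },
};

# Feature IDs in numeric order, so FEATURE2 sorts before FEATURE10.
sub feature_ids {
    my ($obj) = @_;
    return map  { "FEATURE$_" }
           sort { $a <=> $b }
           map  { /^FEATURE(\d+)$/ ? $1 : () } keys %$obj;
}

# Only the IDs whose type is CDS.
sub cds_ids {
    my ($obj) = @_;
    return grep { $obj->{$_}{type} eq 'CDS' } feature_ids($obj);
}

print "$_\t$gb->{$_}{type}\n" for feature_ids($gb);
print "$_\t$gb->{$_}{gene}\n" for cds_ids($gb);
```

         The direction values 'direct' and 'complement' are an assumption
         of this sketch; check an actual G instance for the exact strings.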

 BASE COUNT  
         $gb->{BASE_COUNT}  

 SEQ  
         $gb->{SEQ}              -sequence data following "ORIGIN" 

         or
 
         $gb->seq()

General documentation

Supported methods of G-language Genome Analysis Environment

$gb = new G("genome file")

     Name: $gb = new G("genome file")   -   create a G instance

         Creates a G instance.
         The first argument is the filename of the database. The default
         format is GenBank. The format is guessed from the file extension
         (eg. .gbk => GenBank, .fasta => FASTA, .embl => EMBL).

         There are also several sample bacterial genomes included in the system.
         $eco   = new G("ecoli"); # Escherichia coli K12 MG1655 - NC_000913
         $bsub  = new G("bsub");  # Bacillus subtilis           - NC_000964
         $mgen  = new G("mgen");  # Mycoplasma genitalium       - NC_000908
         $cyano = new G("cyano"); # Synechococcus sp.           - NC_005070
         $pyro  = new G("pyro");  # Pyrococcus furiosus         - NC_003413

         The second argument specifies detailed actions.

           'no msg'                  suppresses all STDOUT messages printed 
                                     when loading a database, including the
                                     copyright info and sequence statistics.

           'no cache'                suppresses the use of database caching.
                                     By default, databases are cached for
                                     optimized performance. (since v.1.6.4)

           'force cache'             rebuilds database cache.

           'multiple locus'          this option merges multiple loci in the 
                                     database and loads the information
                                     as a G instance.

           'bioperl'                 this option creates a G instance from 
                                     a bioperl object. 
                                     eg. $bp = $bp->next_seq();       # bioperl
                                         $gb = new G($bp, "bioperl"); # G

           'longest ORF annotation'  this option predicts genes with longest ORF
                                     algorithm (longest frame from start codon
                                     to stop codon, with more than 17 amino 
                                     acids) and annotates the sequence.

           'glimmer annotation'      this option predicts genes using glimmer2,
                                     a gene prediction software for microbial
                                     genomes available from TIGR.
                                     http://www.tigr.org/softlab/
                                     A local installation of glimmer2 is required,
                                     and the PATH environment variable must be set.

               - following options require bioperl installation -

           'Fasta'              this option loads a Fasta format database.
           'EMBL'               this option loads an EMBL format database.
           'swiss'              this option loads a swiss format database.
           'SCF'                this option loads an SCF  format database.
           'PIR'                this option loads a PIR   format database.
           'GCG'                this option loads a GCG   format database.
           'raw'                this option loads a raw   format database.
           'ace'                this option loads an ace  format database.
           'net GenBank'        this option loads a GenBank format database from 
                                NCBI database. With this option, the first value to 
                                pass to new() function will be the accession 
                                number of the database.

$gb->output()

   Name: $gb->output()   -   output the G instance data to file

   Description:
         Given a filename and an option, outputs the G-language data object 
         to the specified file in a flat-file database of a given format.
         The options are the same as those of new().  Default format is 'GenBank'.

         eg. $gb->output("my_genome.embl", "EMBL");
             $gb->output("my_genome.gbk"); # with GenBank you can omit the option.

complement

   Name: complement   -   get the complementary nucleotide sequence

   Description:
         Given a sequence, returns its reverse complement.

         eg. complement('atgc');  # returns 'gcat'
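
         The example returns 'gcat' for 'atgc', so the sequence is reversed
         as well as complemented. A minimal pure-Perl sketch of that
         operation follows; revcomp_sketch() is a hypothetical name and is
         not the G implementation.

```perl
use strict;
use warnings;

sub revcomp_sketch {
    my ($seq) = @_;
    my $rc = scalar reverse $seq;     # read the opposite strand 5' to 3'
    $rc =~ tr/acgtACGT/tgcaTGCA/;     # swap each base for its partner
    return $rc;
}

print revcomp_sketch('atgc'), "\n";   # prints gcat
```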

translate

   Name: translate   -   translate a nucleotide sequence to amino acid sequence

   Description:

         Given a nucleotide sequence, returns its translated amino acid sequence.
         The standard codon table is used.
         eg. translate('ctggtg');  # returns 'LV'
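
         Translation with the standard codon table can be sketched in plain
         Perl as follows. This is an illustration only: translate_sketch()
         is a hypothetical name, and writing stop codons as '*' and unknown
         codons as 'X' are assumptions of the sketch, not documented
         behaviour of translate().

```perl
use strict;
use warnings;

# Standard genetic code, listed with bases in the order t, c, a, g.
my @bases = qw(t c a g);
my $amino = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';

my %codon_table;
my $index = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $codon_table{"$b1$b2$b3"} = substr($amino, $index++, 1);
        }
    }
}

sub translate_sketch {
    my ($seq) = @_;
    $seq = lc $seq;
    my $protein = '';
    # Walk the sequence codon by codon; a trailing partial codon is ignored.
    for (my $pos = 0; $pos + 3 <= length $seq; $pos += 3) {
        my $codon = substr($seq, $pos, 3);
        $protein .= exists $codon_table{$codon} ? $codon_table{$codon} : 'X';
    }
    return $protein;
}

print translate_sketch('ctggtg'), "\n";   # prints LV
```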

$gb->seq()

   Name: $gb->seq()   -   get the sequence data from G instance

   Description:
         Returns the entire sequence. Same as $gb->{SEQ};

$gb->seq_info()

   Name: $gb->seq_info()   -   display basic statistics about the data

   Description:
         Prints the basic information of the genome to STDOUT.

$gb->find()

   Name: $gb->find()   -   search through the genome data object with keywords

   Description:
         This method provides powerful means to search within the G-language genome
         data object with keywords. Given a set of keywords, this method returns
         the list of feature IDs corresponding to the search query. In G-language Shell,
         search results are also directly printed out.

         eg. @features = $gb->find('RNA', 'tyrosine');    # multiple keywords are allowed.

         Keywords can be specific to each of the feature attributes:

         eg. $gb->find(-type=>'CDS', -product=>'metabolism', 'subunit');

         Regular expressions are allowed for keywords:

         eg. $gb->find(-type=>'CDS', -EC_number=>'^2.7.');
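
         The search semantics can be pictured with a small stand-in. The
         sketch below is not the find() implementation: find_sketch() is a
         hypothetical helper, the feature data is made up, and it assumes
         that all conditions must match (logical AND). Because keywords are
         treated as regular expressions, an unescaped '.' matches any
         character, so the sketch escapes the dots in the EC number.

```perl
use strict;
use warnings;

# Hand-built stand-in for a G instance (illustrative values only).
my $gb = {
    FEATURE1 => { type => 'CDS',  product => 'pyruvate kinase',            EC_number => '2.7.1.40' },
    FEATURE2 => { type => 'tRNA', product => 'tRNA-Tyr' },
    FEATURE3 => { type => 'CDS',  product => 'ATP synthase subunit alpha', EC_number => '7.1.2.2' },
};

sub find_sketch {
    my ($obj, $by_attr, @keywords) = @_;
    my @ids = sort { ($a =~ /(\d+)/)[0] <=> ($b =~ /(\d+)/)[0] }
              grep { /^FEATURE\d+$/ } keys %$obj;
    my @hits;
    FEATURE:
    for my $id (@ids) {
        my $f = $obj->{$id};
        # Attribute-specific patterns must match that attribute.
        for my $key (keys %$by_attr) {
            my $pattern = $by_attr->{$key};
            next FEATURE unless defined($f->{$key}) && $f->{$key} =~ /$pattern/;
        }
        # Free keywords may match any plain-string attribute.
        for my $kw (@keywords) {
            next FEATURE unless grep { defined($_) && !ref($_) && $_ =~ /$kw/ } values %$f;
        }
        push @hits, $id;
    }
    return @hits;
}

my @kinases  = find_sketch($gb, { type => 'CDS', EC_number => '^2\.7\.' });
my @subunits = find_sketch($gb, { type => 'CDS' }, 'subunit');
print "@kinases\n";    # prints FEATURE1
print "@subunits\n";   # prints FEATURE3
```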

$gb->getseq()

   Name: $gb->getseq()   -   get nucleotide sequence of the given positions (Perl coordinates)

   Description:
         Given the start and end positions (starting from 0 as in Perl),
         returns the sequence specified.

         eg. $gb->getseq(1,3); # returns the 2nd, 3rd, and 4th nucleotides.

   Options:
       -circular   when the first position is larger than the second position,
                   retrieves the sequence spanning the end of the circular
                   chromosome. (ex: $gb->getseq(4639670, 5, -circular))
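
         The 0-based, inclusive coordinates and the wrap-around behaviour
         can be sketched with substr(). getseq_sketch() is a hypothetical
         helper that works on a plain string; treating the end position as
         inclusive in the circular case is an assumption carried over from
         the linear example above.

```perl
use strict;
use warnings;

sub getseq_sketch {
    my ($seq, $start, $end, $circular) = @_;
    if ($circular && $start > $end) {
        # Tail of the sequence followed by its head, both ends inclusive.
        return substr($seq, $start) . substr($seq, 0, $end + 1);
    }
    return substr($seq, $start, $end - $start + 1);
}

my $genome = 'acgtacgt';
print getseq_sketch($genome, 1, 3), "\n";      # prints cgt  (2nd to 4th bases)
print getseq_sketch($genome, 6, 1, 1), "\n";   # prints gtac (wraps past the end)
```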

$gb->get_gbkseq()

   Name: $gb->get_gbkseq()   -   get nucleotide sequence of the given positions (GenBank coordinates)

   Description:
         Given the start and end positions (starting from 1 as in 
         Genbank), returns the sequence specified.

         eg. $gb->get_gbkseq(1,3); # returns the 1st, 2nd, and 3rd nucleotides.

   Options:
       -circular   when the first position is larger than the second position,
                   retrieves the sequence spanning the end of the circular
                   chromosome. (ex: $gb->get_gbkseq(4639670, 5, -circular))
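
         GenBank coordinates differ from Perl coordinates only by an offset
         of one. The sketch below shows the conversion on a plain string;
         gbkseq_sketch() is a hypothetical helper, not the G method.

```perl
use strict;
use warnings;

sub gbkseq_sketch {
    my ($seq, $start, $end) = @_;
    # GenBank positions are 1-based and inclusive; substr() offsets are 0-based.
    return substr($seq, $start - 1, $end - $start + 1);
}

my $genome = 'acgtacgt';
print gbkseq_sketch($genome, 1, 3), "\n";   # prints acg (1st to 3rd bases)
```

         In these terms, GenBank positions (s, e) select the same bases as
         Perl positions (s - 1, e - 1).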

$gb->get_cdsseq()

   Name: $gb->get_cdsseq()   -   get nucleotide sequence of the given CDS

   Description:
         Given a CDS ID, returns the CDS sequence. 
         'complement' is properly parsed.

         eg. $gb->get_cdsseq('CDS1'); # returns the 'CDS1' sequence.

$gb->get_geneseq()

   Name: $gb->get_geneseq()   -   get nucleotide sequence of the given gene

   Description:
         Given a CDS ID, returns the CDS sequence, or the exon sequence
         if introns are present.
         'complement' is properly parsed, and introns are spliced out.

         eg. $gb->get_geneseq('CDS1'); # returns the 'CDS1' sequence or exon.

$gb->feature()

   Name: $gb->feature()   -   get a list of feature IDs

   Description:
         Returns the array of all feature IDs.
         Features are ignored when $gb->{$feature}->{on} is 0.

         eg.
           foreach ($gb->feature()){
               print $gb->get_cdsseq($_), "\n";
           }
           # prints all feature sequences.

         Optionally, a feature type can be supplied to return only the
         specified features.

         eg. $gb->feature("tRNA"); # returns feature IDs only for tRNAs

         Option of "all" always returns all features regardless of the
         value of $gb->{$feature}->{on}.

$gb->cds()

   Name: $gb->cds()   -   get a list of CDS IDs

   Description:
         Returns the array of all feature IDs of CDS.
         Features are ignored when $gb->{FEATURE$i}->{on} OR
         $gb->{CDS$i}->{on} is 0.

         !CAUTION! the object name is actually the FEATURE ID,
         to enable access to all feature values. However, most of the
         time you do not need to be aware of this difference.

         eg.
           foreach ($gb->cds()){
               print $gb->get_geneseq($_), "\n";
           }
           # prints all gene sequences.

         Option of "all" always returns all features regardless of the
         value of $gb->{$feature}->{on}.

$gb->tRNA()

   Name: $gb->tRNA()   -   get a list of feature IDs of tRNAs

   Description:
         Returns the array of all feature IDs of tRNAs.

$gb->rRNA()

   Name: $gb->rRNA()   -   get a list of feature IDs of rRNAs

   Description:
         Returns the array of all feature IDs of rRNAs.

$gb->intergenic()

   Name: $gb->intergenic()   -   get a list of IDs of intergenic regions

   Description:
         Returns the array of all IDs of intergenic regions.

$gb->gene()

   Name: $gb->gene()   -   get a list of feature IDs of genes

   Description:
         Returns the array of all feature IDs of genes.

$gb->next_feature()

   Name: $gb->next_feature()   -   get the next feature ID

   Description:
         Given a feature ID, returns the ID of the next feature.
         Second argument can be used to specify the type of the 
         next feature.

         eg. $gb->next_feature('FEATURE1234'); # returns 'FEATURE1235'
             $gb->next_feature('FEATURE1234', 'tRNA'); 
             # returns next feature ID whose type is 'tRNA'
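
         Stepping through numbered features, optionally skipping to a given
         type, can be sketched as below. neighbor_sketch() is a hypothetical
         helper over a hand-built hash; it assumes features are numbered
         contiguously, and a step of -1 gives the behaviour described for
         the previous-feature methods.

```perl
use strict;
use warnings;

# Hand-built stand-in for a G instance (illustrative values only).
my $gb = {
    FEATURE1 => { type => 'gene' },
    FEATURE2 => { type => 'CDS'  },
    FEATURE3 => { type => 'tRNA' },
    FEATURE4 => { type => 'CDS'  },
};

sub neighbor_sketch {
    my ($obj, $id, $step, $type) = @_;
    my ($n) = $id =~ /^FEATURE(\d+)$/ or return;
    for ($n += $step; exists $obj->{"FEATURE$n"}; $n += $step) {
        my $candidate = "FEATURE$n";
        return $candidate if !defined $type || $obj->{$candidate}{type} eq $type;
    }
    return;   # ran off either end of the feature list
}

print neighbor_sketch($gb, 'FEATURE1', +1), "\n";           # prints FEATURE2
print neighbor_sketch($gb, 'FEATURE1', +1, 'tRNA'), "\n";   # prints FEATURE3
print neighbor_sketch($gb, 'FEATURE4', -1, 'CDS'), "\n";    # prints FEATURE2
```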

$gb->next_cds()

   Name: $gb->next_cds()   -   get the feature ID of next CDS

   Description:
         Given a feature ID, returns the ID of the next CDS.
         This is the same as $gb->next_feature($featureID, 'CDS');

$gb->previous_feature()

   Name: $gb->previous_feature()   -   get the previous feature ID

   Description:
         Given a feature ID, returns the ID of the previous feature.
         The second argument can be used to specify the type of the 
         previous feature.

         eg. $gb->previous_feature('FEATURE1234'); # returns 'FEATURE1233'
             $gb->previous_feature('FEATURE1234', 'tRNA'); 
             # returns previous feature ID whose type is 'tRNA'

$gb->previous_cds()

   Name: $gb->previous_cds()   -   get the feature ID of previous CDS

   Description:
         Given a feature ID, returns the ID of the previous CDS.
         This is the same as $gb->previous_feature($featureID, 'CDS');

$gb->startcodon()

   Name: $gb->startcodon()   -   get the start codon of the given CDS

   Description:
         Given a CDS ID, returns the start codon.

         eg. $gb->startcodon("FEATURE$i"); # returns 'atg'

$gb->stopcodon()

   Name: $gb->stopcodon()   -   get the stop codon of the given CDS

   Description:
         Given a CDS ID, returns the stop codon.

         eg. $gb->stopcodon("FEATURE$i"); # returns 'tag'

$gb->before_startcodon()

   Name: $gb->before_startcodon()   -   get the upstream sequence of the given CDS

   Description:
         Given a CDS ID and length, returns the sequence upstream of 
         start codon.

         eg. $gb->before_startcodon('CDS1', 100); 
             # returns 100 bp sequence upstream of the start codon of 'CDS1'.

   Options:
         Second argument specifying the length of sequence to retrieve is
         optional. (default: 100).
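
         "Upstream" depends on the strand of the CDS. The sketch below shows
         one way to compute it on a plain string; upstream_sketch() and
         revcomp() are hypothetical helpers, and the direction values
         'direct' and 'complement' as well as the strand handling are
         assumptions of the sketch rather than a description of the G code.

```perl
use strict;
use warnings;

sub revcomp {
    my $rc = scalar reverse $_[0];
    $rc =~ tr/acgtACGT/tgcaTGCA/;
    return $rc;
}

sub upstream_sketch {
    my ($seq, $feature, $length) = @_;
    $length = 100 unless defined $length;          # default mirrors the text above
    if ($feature->{direction} eq 'complement') {
        # Gene on the reverse strand: its upstream region lies after {end}.
        return revcomp(substr($seq, $feature->{end}, $length));
    }
    my $from = $feature->{start} - 1 - $length;    # 0-based offset
    if ($from < 0) { $length += $from; $from = 0 } # clamp at the sequence start
    return substr($seq, $from, $length);
}

my $seq = 'ccccgATGaaaTAAgtttt';   # CDS at GenBank positions 6..14
print upstream_sketch($seq, { start => 6, end => 14, direction => 'direct' }, 3), "\n";       # prints ccg
print upstream_sketch($seq, { start => 6, end => 14, direction => 'complement' }, 3), "\n";   # prints aac
```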

$gb->after_startcodon()

   Name: $gb->after_startcodon()   -   get the sequence downstream of start codon of the given CDS

   Description:
         Given a CDS ID and length, returns the sequence downstream of 
         start codon.

         eg. $gb->after_startcodon('CDS1', 100); 
             # returns 100 bp sequence downstream of the start codon of 'CDS1'.

   Options:
         Second argument specifying the length of sequence to retrieve is
         optional. (default: 100).

$gb->before_stopcodon()

   Name: $gb->before_stopcodon()   -   get the sequence upstream of stop codon of the given CDS

   Description:
         Given a CDS ID and length, returns the sequence upstream of 
         stop codon.

         eg. $gb->before_stopcodon('CDS1', 100); 
             # returns 100 bp sequence upstream of the stop codon of 'CDS1'.

   Options:
         Second argument specifying the length of sequence to retrieve is
         optional. (default: 100).

$gb->after_stopcodon()

   Name: $gb->after_stopcodon()   -   get the downstream sequence of the given CDS

   Description:
         Given a CDS ID and length, returns the sequence downstream of 
         stop codon.

         eg. $gb->after_stopcodon('CDS1', 100); 
             # returns 100 bp sequence downstream of the stop codon of 'CDS1'.

   Options:
         Second argument specifying the length of sequence to retrieve is
         optional. (default: 100).

$gb->get_exon()

   Name: $gb->get_exon()   -   get a list of exon sequences of the given CDS

   Description:
         Given a CDS ID, returns the exon sequence.
         'complement' is properly parsed, and introns are spliced out.

         eg. $gb->get_exon('CDS1'); # returns the 'CDS1' exon.

$gb->get_intron()

   Name: $gb->get_intron()   -   get a list of intron sequences of the given CDS

   Description:
         Given a CDS ID, returns the intron sequences as array of 
         sequences.

         eg. $gb->get_intron('CDS1'); 
             # returns ($1st_intron, $2nd_intron,..)
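
         The relationship between exons and introns can be sketched on a
         plain string. splice_sketch() is a hypothetical helper that takes
         exon ranges in GenBank coordinates (as in a join(...) location) and
         handles the direct strand only; it is not the G implementation.

```perl
use strict;
use warnings;

sub splice_sketch {
    my ($seq, @exons) = @_;   # each exon is [start, end], 1-based inclusive
    my (@exon_seqs, @intron_seqs);
    for my $i (0 .. $#exons) {
        my ($start, $end) = @{ $exons[$i] };
        push @exon_seqs, substr($seq, $start - 1, $end - $start + 1);
        next if $i == $#exons;
        # An intron is whatever lies between this exon and the next one.
        my $next_start = $exons[$i + 1][0];
        push @intron_seqs, substr($seq, $end, $next_start - $end - 1);
    }
    return (\@exon_seqs, \@intron_seqs);
}

# Exons in upper case, everything else in lower case, for readability.
my ($exons, $introns) = splice_sketch('ggATGgtaagTAAcc', [3, 5], [11, 13]);
print join('', @$exons), "\n";   # prints ATGTAA (introns spliced out)
print "@$introns\n";             # prints gtaag
```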

$gb->pos2feature()

   Name: $gb->pos2feature()   -   get a feature ID from position

   Description:
         Given a GenBank position (sequence starting from position 1) 
         returns the feature ID (ex. FEATURE123) of the feature at
         the given position. If multiple features exist for the given
         position, the first feature to appear is returned. Returns 
         NULL if no feature exists.
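
         A position lookup of this kind can be sketched as a scan over the
         features in numeric order. pos2feature_sketch() is a hypothetical
         helper over a hand-built hash; it returns undef when nothing
         matches, which is an assumption of the sketch, and its optional
         type argument is included only to show how a gene-only lookup
         could restrict the scan.

```perl
use strict;
use warnings;

# Hand-built stand-in for a G instance (illustrative values only).
my $gb = {
    FEATURE1 => { type => 'gene', start => 10, end => 50  },
    FEATURE2 => { type => 'CDS',  start => 10, end => 50  },
    FEATURE3 => { type => 'tRNA', start => 80, end => 120 },
};

sub pos2feature_sketch {
    my ($obj, $pos, $type) = @_;
    my @ids = sort { ($a =~ /(\d+)/)[0] <=> ($b =~ /(\d+)/)[0] }
              grep { /^FEATURE\d+$/ } keys %$obj;
    for my $id (@ids) {
        my $f = $obj->{$id};
        next if defined $type && $f->{type} ne $type;
        # The first feature whose range covers the position wins.
        return $id if $f->{start} <= $pos && $pos <= $f->{end};
    }
    return;
}

print pos2feature_sketch($gb, 30), "\n";          # prints FEATURE1
print pos2feature_sketch($gb, 30, 'CDS'), "\n";   # prints FEATURE2
```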

$gb->pos2gene()

   Name: $gb->pos2gene()   -   get a feature ID of CDS from position

   Description:
         Given a GenBank position (sequence starting from position 1) 
         returns the feature ID (ex. FEATURE123) of the gene at
         the given position. If multiple genes exist for the given
         position, the first gene to appear is returned. Returns 
         NULL if no gene exists.

$gb->gene2id()

   Name: $gb->gene2id()   -   get a feature ID from canonical gene name

   Description:
         Given a GenBank gene name, returns the feature ID (ex. FEATURE123). 
         Returns NULL if no gene exists.

$gb->next_locus()

   Name: $gb->next_locus()   -   read the next locus and update the G instance

   Description:
         Reads the next locus and updates the G instance.

         eg. 
           do{
  
           }while($gb->next_locus());
           #  Enables multiple loci analysis.

$gb->clone()

   Name: $gb->clone()   -   create a copy of the G instance

   Description:
         Returns cloned G instance, which is a new G instance with
         identical data.
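
         What "a new instance with identical data" means in practice is
         that changes to the copy do not affect the original. The sketch
         below illustrates this with the core Storable module on a
         hand-built hash; whether clone() itself uses Storable is not
         stated here and is not assumed.

```perl
use strict;
use warnings;
use Storable qw(dclone);

# Hand-built stand-in for a G instance (illustrative values only).
my $gb = {
    LOCUS    => { id => 'DEMO0001' },
    FEATURE1 => { type => 'CDS', start => 1, end => 9 },
};

my $copy = dclone($gb);            # deep copy: nested hashes are duplicated too
$copy->{FEATURE1}{start} = 4;      # modify only the copy

print $gb->{FEATURE1}{start}, "\n";     # prints 1 (original unchanged)
print $copy->{FEATURE1}{start}, "\n";   # prints 4
```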

$gb->del_key()

   Name: $gb->del_key()   -   delete a data object from G instance

   Description:
         Given an object key, deletes it from the G instance structure.
         eg. $gb->del_key('FEATURE1'); # deletes 'FEATURE1' hash

AUTHOR

Kazuharu Arakawa, gaou@sfc.keio.ac.jp