Kazuharu Arakawa, Ph.D.
G-language Project Leader
Institute for Advanced Biosciences
The REST Service for the G-language System provides URL-based access to all functions of the G-language Genome Analysis Environment. All analysis resources can be accessed through the web browser on your smartphone, tablet, or PC.
For example, for the Escherichia coli genome (ecoli),
http://useG.jp/ecoli shows genome information such as genome size and G+C content.
http://useG.jp/ecoli/recA retrieves information for recA gene.
http://useG.jp/ecoli/base_entropy calculates sequence conservation using entropy.
http://useG.jp/ecoli/codon_usage calculates codon usage for all genes.
http://useG.jp/ecoli/gcwin calculates G+C content along the genome.
Details for the methods are described in Arakawa K et al. (2008).
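The URLs above follow a simple pattern: a genome name, optionally followed by a gene or method name. A minimal Python sketch of composing such URLs is given here. The helper build_url is hypothetical (it is not part of G-language), and it assumes that options are appended as key=value path segments, which is the convention used by the option examples elsewhere on this page.

```python
from urllib.parse import quote

def build_url(genome, *path, base="http://useG.jp", **options):
    """Join a genome name, path segments, and key=value options into one URL.

    Hypothetical helper for illustration; it only builds the string and
    does not contact the service.
    """
    segments = [genome, *path]
    segments += [f"{key}={value}" for key, value in options.items()]
    # Keep '=' and '*' readable, since both appear in the URL examples.
    return base + "/" + "/".join(quote(str(s), safe="=*") for s in segments)

print(build_url("ecoli"))
print(build_url("ecoli", "recA"))
print(build_url("ecoli", "gcskew", window=100000))
```

The base argument can be switched to http://rest.g-language.org for the examples that use that host.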
Here we introduce examples of file upload, genome/gene information retrieval, and analysis methods for sequence conservation, DNA replication, and composition of nucleotides, amino acids, and codons.
You can upload your genome file at http://rest.g-language.org/upload/. Click [Choose file] to select a file, then click the [Submit] button. The service returns a unique reference ID for the uploaded file, which you can use for the rest of the analysis. For example, if you uploaded a GenBank file (e.g. NC_002127.gbk) and received an ID of "2F9509",
http://rest.g-language.org/organism_list/ shows NCBI RefSeq ACCESSION No. and DEFINITION for available genomes (chromosomes and plasmids) of bacteria as follows.
ACCESSION  DEFINITION
NC_000913  Escherichia coli str. K-12 substr. MG1655 chromosome, complete genome.
Genome information (e.g. Length of Sequence and GC Content) is given by
For example, information for Escherichia coli genome (NC_000913) is given by http://rest.g-language.org/NC_000913
The following genomes can also be accessed with G-language preset names.
ACCESSION  Genome (preset name)
NC_000913  Escherichia coli K12 MG1655 (ecoli)
NC_000964  Bacillus subtilis (bsub)
NC_000908  Mycoplasma genitalium (mgen)
NC_005070  Synechococcus sp. (cyano)
NC_003413  Pyrococcus furiosus (pyro)
NC_001318  Borrelia burgdorferi B31 (bbur)
NC_002483  Plasmid F (plasmidf)
NC_001416  Enterobacteria phage lambda (lambda)
For E.coli (ecoli), http://rest.g-language.org/ecoli/TAXONOMY prints taxonomic information, where '0' indicates all taxonomic ranks, and ranks 1, 2, 3, 4, and 5 correspond to domain, phylum, class, order, and family, respectively.
repB gene information for plasmid F (plasmidf) is given by http://rest.g-language.org/plasmidf/repB
G-language assigns a FEATURE number and CDS number to each gene. For example, repB gene has a FEATURE number of 110 and CDS number of 35.
'feature' => 110, 'cds' => 35,
As shown below, the genome coordinate for repB gene is 36643..37620, and its direction (either direct or complement) is complement.
'start' => '36643', 'end' => '37620', 'direction' => 'complement',
Functional annotations for repB gene are
'product' => 'replication initiator protein', 'note' => 'binds to repeated iterons in RepFIB',
Gene IDs for repB gene are
'protein_id' => 'NP_061412.1', 'db_xref' => 'GI:9507746 GeneID:1263561',
G-Links (http://link.g-language.org/) collects related information from multiple databases (e.g. GO, KEGG, Pfam, PubMed, and UniProtKB) to a given gene with gene IDs or nucleotide/amino acid sequences.
http://rest.g-language.org/plasmidf/repB/translation shows amino acid sequence for repB gene.
http://rest.g-language.org/plasmidf/repB/get_geneseq shows nucleotide sequence for repB gene.
http://rest.g-language.org/plasmidf/repB/before_startcodon/200 shows 200 bp sequence upstream of the start codon of repB gene.
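For a gene on the complement strand, such as repB, the region upstream of the start codon lies to the right of the gene's end coordinate and has to be reverse-complemented. The following Python sketch illustrates that logic. The function names are hypothetical, coordinates are assumed to be 1-based and inclusive as in the GenBank-style output above, and circular genomes are not handled; it is not the G-language implementation.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def upstream_of_start(genome, start, end, direction, length=200):
    """Return up to `length` bases upstream of a gene's start codon.

    start/end are 1-based inclusive coordinates on the forward strand;
    direction is 'direct' or 'complement'.
    """
    if direction == "direct":
        left = max(0, start - 1 - length)
        return genome[left:start - 1]
    # Complement strand: the start codon sits at `end`, so the upstream
    # region is the stretch after `end`, read on the opposite strand.
    return reverse_complement(genome[end:end + length])

toy_genome = "AAACCCATGGGGTTT"  # made-up sequence for illustration
print(upstream_of_start(toy_genome, 7, 12, "direct", length=3))
print(upstream_of_start(toy_genome, 4, 9, "complement", length=4))
```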
http://rest.g-language.org/plasmidf/*/product shows product names of all genes.
http://rest.g-language.org/plasmidf/product=replication/product shows product names of genes containing "replication" in the product feature tag.
For E.coli O157:H7 str. Sakai plasmid pOSAK1 (NC_002127),
For E.coli str. K-12 substr. MG1655 chromosome (NC_000913),
The PatSearch class is a collection of sequence analysis methods related to pattern searches for oligonucleotides, including oligomer_search and palindrome. For example, you can search for an inverted repeat (5' TTACGnnnnnnCGTAA 3') and a palindrome (5' TTACGCGTAA 3') as follows.
For E.coli (ecoli), http://rest.g-language.org/ecoli/oligomer_search/TTACGCGTAA searches for the oligomer "TTACGCGTAA" and returns a list of positions where it is found, as follows.
http://rest.g-language.org/ecoli/oligomer_search/TTACGnnnnnnCGTAA/return=both searches for an inverted repeat "TTACGnnnnnnCGTAA" and returns both positions and oligomers as follows.
An oligomer can be specified using the degenerate code (e.g. "grtggngg") or regular expressions (e.g. "g[ag]tgg[a-z]gg").
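The relationship between the degenerate code and regular expressions can be illustrated with a short Python sketch. It expands the standard IUPAC nucleotide codes into character classes and reports 1-based match positions on the given strand only. The function names are hypothetical, and how the actual oligomer_search method treats strands and overlapping matches is not assumed here.

```python
import re

# Standard IUPAC nucleotide codes expanded to regular-expression classes.
IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "r": "[ag]", "y": "[ct]", "s": "[cg]", "w": "[at]",
    "k": "[gt]", "m": "[ac]", "b": "[cgt]", "d": "[agt]",
    "h": "[act]", "v": "[acg]", "n": "[acgt]",
}

def degenerate_to_regex(oligo):
    """Translate a degenerate oligomer such as 'grtggngg' into a regex."""
    return "".join(IUPAC[base] for base in oligo.lower())

def find_oligomer(sequence, oligo):
    """Return (1-based position, matched text) pairs, overlaps included."""
    # The lookahead lets overlapping occurrences be reported as well.
    pattern = re.compile("(?=(" + degenerate_to_regex(oligo) + "))")
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(sequence.lower())]

print(degenerate_to_regex("grtggngg"))
print(find_oligomer("ccTTACGaaaaaaCGTAAcc", "TTACGnnnnnnCGTAA"))
```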
http://rest.g-language.org/plasmidf/palindrome/shortest=10 searches for palindromic sequences of 10 bp or longer in plasmid F (plasmidf).
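A DNA palindrome is a sequence that equals its own reverse complement, as 5' TTACGCGTAA 3' does. The Python sketch below finds maximal gapless palindromes by expanding outward from every position between two bases. It assumes palindromes without a central spacer; whether the palindrome method also allows a loop between the two arms is not assumed here, and find_palindromes is a hypothetical name.

```python
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def find_palindromes(seq, shortest=10):
    """Return (1-based start, palindrome) for maximal gapless palindromes."""
    seq = seq.upper()
    hits = []
    for center in range(1, len(seq)):
        # Expand around the boundary between seq[center - 1] and seq[center].
        left, right = center - 1, center
        while left >= 0 and right < len(seq) and PAIR.get(seq[left]) == seq[right]:
            left -= 1
            right += 1
        length = right - left - 1
        if length >= shortest:
            hits.append((left + 2, seq[left + 1:right]))
    return hits

print(find_palindromes("AATTACGCGTAAGG", shortest=10))
```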
Shannon information theory has been used to identify conserved regions such as transcription factor binding sites and Shine-Dalgarno/Kozak sequences for ribosome binding sites. (Arakawa K et al., 2008)
For E.coli (ecoli), http://rest.g-language.org/ecoli/base_entropy calculates and graphs the sequence conservation in regions around the "start" codons using Shannon uncertainty (entropy).
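As a rough illustration of the idea, the Python sketch below computes the plain Shannon uncertainty, H = -sum(p * log2(p)), at each position of a set of equal-length sequences, for example windows aligned on the start codon. A fully conserved position gives 0 bits and a position with four equally frequent bases gives 2 bits. This is a simplified sketch with a hypothetical function name; any corrections or normalization applied by base_entropy itself are not reproduced.

```python
import math
from collections import Counter

def column_entropy(sequences):
    """Shannon uncertainty in bits at each position of equal-length sequences."""
    entropies = []
    for column in zip(*sequences):
        counts = Counter(column)
        total = len(column)
        h = -sum((n / total) * math.log2(n / total) for n in counts.values())
        entropies.append(h)
    return entropies

# Made-up windows: the last three columns (ATG) are fully conserved.
windows = ["AATG", "CATG", "GATG", "TATG"]
print(column_entropy(windows))
```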
The excess of G over C and T over A in the leading strand of DNA replication relative to the lagging strand is observed in many bacteria, and this is thought to reflect strand-specific mutational bias. (Arakawa K et al., 2008)
http://rest.g-language.org/ecoli/gcskew calculates and graphs GC skew (C-G)/(C+G) for E.coli (ecoli). Replication origin and terminus are located around the GC skew shift points.
As shown in the manual for the gcskew analysis method (http://rest.g-language.org/help/gcskew), the default parameters are a 10,000-bp window size (window=10000) and graph output and display (output=show).
http://rest.g-language.org/ecoli/gcskew/output=f outputs the data in CSV format.
http://rest.g-language.org/ecoli/gcskew/window=100000 calculates GC skew with 100,000-bp window size.
http://rest.g-language.org/ecoli/gcskew/at=1 calculates AT skew (A-T)/(A+T) instead of GC skew.
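The skew formulas used above, (C-G)/(C+G) and (A-T)/(A+T), can be sketched in Python as follows. The window and at arguments mirror the option names in the URLs, but the function itself is hypothetical: it assumes consecutive non-overlapping windows and drops a trailing partial window, which may differ from what gcskew does.

```python
def skew(seq, window=10000, at=False):
    """Return (C-G)/(C+G), or (A-T)/(A+T) when at=True, per window."""
    first, second = ("A", "T") if at else ("C", "G")
    seq = seq.upper()
    values = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        x, y = chunk.count(first), chunk.count(second)
        # A window containing neither base is reported as 0.0.
        values.append((x - y) / (x + y) if x + y else 0.0)
    return values

print(skew("CCCCGGGG", window=4))
print(skew("AAATTTTA", window=4, at=True))
```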
http://rest.g-language.org/ecoli/gcskew/cumulative=1 calculates cumulative GC skew. The cumulative GC skew graph makes the shift points easier to identify: its maximum and minimum correspond to the replication origin and terminus (3.9e+06 bp and 1.5e+06 bp), respectively.
http://rest.g-language.org/ecoli/find_ori_ter predicts the replication origin and terminus based on the vertices of cumulative skew graphs.
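The idea of reading the origin and terminus off the cumulative skew can be sketched in Python as below. The sketch accumulates +1 for each C and -1 for each G, following the (C-G) sign convention used on this page, and takes the maximum as the origin and the minimum as the terminus. Working per base rather than per window, and the function name predict_ori_ter, are assumptions made for illustration; the actual find_ori_ter method may smooth or filter the data differently.

```python
from itertools import accumulate

def predict_ori_ter(seq):
    """Return 1-based (origin, terminus) from per-base cumulative C-G skew."""
    step = {"C": 1, "G": -1}
    cumulative = list(accumulate(step.get(base, 0) for base in seq.upper()))
    origin = cumulative.index(max(cumulative)) + 1    # maximum of the curve
    terminus = cumulative.index(min(cumulative)) + 1  # minimum of the curve
    return origin, terminus

# Made-up sequence: C-rich, then G-rich, then C-rich again.
print(predict_ori_ter("CCCC" + "GGGGGGGG" + "CCCC"))
```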
rep_ori_ter gets the position of replication origin and terminus by several means. For example, http://rest.g-language.org/ecoli/rep_ori_ter retrieves the position of replication origin (3924034) and terminus (1588773) in E.coli genome (ecoli) from the databases. If the positions of origin or terminus cannot be found in the databases, rep_ori_ter calls find_ori_ter method.
http://rest.g-language.org/ecoli/genomicskew calculates the GC skew in different regions of the given genome (whole genome, coding regions, intergenic regions, and third codon positions).
dnaA gene (http://rest.g-language.org/ecoli/dnaA) is located around the replication origin.
http://rest.g-language.org/ecoli/find_dnaAbox finds dnaA box (5'-TT A/T TNCACA-3') in both strands.
http://rest.g-language.org/plasmidf/find_iteron finds iteron (5'-TGAGGG G/A C/T-3') in both strands of plasmids.
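Searching a degenerate motif in both strands can be sketched in Python by scanning the sequence and its reverse complement and converting the reverse-strand hits back to forward coordinates. The consensus patterns are taken from the text above; the function name and the reported coordinate convention (leftmost forward-strand position, 1-based) are assumptions for illustration, not the behavior of find_dnaAbox or find_iteron.

```python
import re

DNAA_BOX = "TT[AT]T[ACGT]CACA"  # 5'-TT A/T TNCACA-3'
ITERON = "TGAGGG[GA][CT]"       # 5'-TGAGGG G/A C/T-3'
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def find_motif(seq, pattern):
    """Return sorted (1-based start, strand, match) hits on both strands."""
    seq = seq.upper()
    lookahead = re.compile("(?=(" + pattern + "))")
    hits = [(m.start() + 1, "direct", m.group(1)) for m in lookahead.finditer(seq)]
    reverse = seq.translate(COMPLEMENT)[::-1]
    for m in lookahead.finditer(reverse):
        # Map the hit back to its leftmost coordinate on the forward strand.
        start = len(seq) - (m.start() + len(m.group(1))) + 1
        hits.append((start, "complement", m.group(1)))
    return sorted(hits)

print(find_motif("GGTTATCCACAGG", DNAA_BOX))
print(find_motif("CCTGTGGATAACC", DNAA_BOX))
```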
Genomic G+C content (G+C)/(A+T+G+C) is correlated with a number of factors including genome size, aerobiosis, optimal growth temperature, and free-living lifestyle (Hildebrand F et al., 2010). To identify genomic islands (clusters of foreign genes), intragenomic variation in G+C content is computed using sliding windows (Karlin S., 2001).
http://rest.g-language.org/mgen/gcwin calculates and graphs the GC content along the genome of Mycoplasma genitalium (mgen).
http://rest.g-language.org/mgen/gcwin/window=1000 uses 1,000-bp window size instead of the default 10,000-bp window size.
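The windowed G+C content, (G+C)/(A+T+G+C), can be sketched in Python as follows. As an assumption for illustration, the sketch uses consecutive non-overlapping windows and ignores a trailing partial window; the function name is hypothetical and the windowing used by gcwin may differ.

```python
def gc_content_windows(seq, window=10000):
    """Return (G+C)/(A+T+G+C) for consecutive non-overlapping windows."""
    seq = seq.upper()
    values = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        counts = {base: chunk.count(base) for base in "ACGT"}
        total = sum(counts.values())  # ambiguous bases such as N are excluded
        values.append((counts["G"] + counts["C"]) / total if total else 0.0)
    return values

print(gc_content_windows("GGCCAATTGCAT", window=4))
```

Windows whose value departs strongly from the genome-wide G+C content are the candidates for the genomic islands mentioned above.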
Each organism has its characteristic “genomic signature” defined as the ratios between the observed and expected frequencies of oligonucleotides (di-, tri-, and tetranucleotides). The genomic signature has been applied to taxonomic classification and predicting plasmid hosts (Teeling H and Glöckner FO, 2012)(Campbell A et al., 1999)(Suzuki H et al., 2008)(Suzuki H et al., 2010).
For M.genitalium (mgen), http://rest.g-language.org/mgen/signature calculates 2-mer signature, and http://rest.g-language.org/mgen/signature/wordlength=3 calculates 3-mer signature.
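For dinucleotides, the ratio between observed and expected frequencies can be written as f(XY) / (f(X) * f(Y)). The Python sketch below computes this ratio for all 16 dinucleotides. It is a simplified single-strand version with a hypothetical function name and assumes a non-empty sequence of A, C, G, and T only; options of the signature method such as strand handling are not reproduced.

```python
from collections import Counter
from itertools import product

def dinucleotide_signature(seq):
    """Return observed/expected frequency ratios for all 16 dinucleotides."""
    seq = seq.upper()
    n = len(seq)
    base_freq = {base: seq.count(base) / n for base in "ACGT"}
    pair_counts = Counter(seq[i:i + 2] for i in range(n - 1))
    signature = {}
    for x, y in product("ACGT", repeat=2):
        expected = base_freq[x] * base_freq[y]
        observed = pair_counts[x + y] / (n - 1)
        signature[x + y] = observed / expected if expected else 0.0
    return signature

sig = dinucleotide_signature("ACACACAC")  # made-up sequence
print(sig["AC"], sig["CA"], sig["AA"])
```

A value above 1 means the dinucleotide is over-represented relative to its base composition, and a value below 1 means it is under-represented.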
http://rest.g-language.org/mgen/bui calculates the following base usage indices for protein-coding sequences (CDS) in M.genitalium (mgen).
http://rest.g-language.org/mgen/bui/position=3 uses only bases at the 3rd position of codons to calculate the base usage indices.
http://rest.g-language.org/mgen/bui/tag=start shows start positions of CDS in the genome. Values for the tag= option can be 'start', 'end', 'gene', 'product', 'locus_tag', 'protein_id', 'db_xref', and so on.
For M.genitalium (mgen), absolute frequencies (data=A0) and relative frequencies (data=A1) of amino acids summed across all proteins are given by
http://rest.g-language.org/mgen/aaui calculates the following amino acid usage indices for proteins in M.genitalium (mgen).
Multivariate statistical methods such as correspondence analysis (COA) have been used to identify major sources of variation in amino acid usage among proteins (Lobry JR et al., 1994)(Zavala A et al., 2002). For example, http://rest.g-language.org/mgen/codon_mva/method=coa/data=A0 performs COA of amino acid usage for the M.genitalium (mgen) proteome. The first axis is correlated with protein hydrophobicity (gravy) and discriminates integral membrane proteins from the others.
Synonymous codon usage varies both between organisms and among genes within a genome, and arises due to differences in G+C content, replication strand skew, or gene expression levels. Thus, synonymous codon usage analysis provides a way to identify horizontally transferred or highly expressed genes. (Arakawa K et al., 2008)
Different kinds of representations of codon usage data (termed here R0-R4) have been used in codon usage studies.
For plasmid F (plasmidf), codon usage data R0-R4 for all genes are graphically shown by
http://rest.g-language.org/plasmidf/codon_compiler/data=R0/output=stdout (standard output)
http://rest.g-language.org/plasmidf/codon_compiler/data=R0/output=stdout/id=FEATURE110 calculates codon counts for repB gene ('feature' ⇒ 110) accessible at http://rest.g-language.org/plasmidf/repB
The mean distance (Dmean) between all pairs of genes can be used to measure the level of diversity in synonymous codon usage among genes (Suzuki H et al., 2009). http://rest.g-language.org/plasmidf/Dmean calculates the Dmean value for plasmid F (plasmidf).
Several measures have been used to estimate the degree of deviation from equal usage of synonymous codons, including ENC (Effective Number of Codons), SCS (Scaled Chi-Square), CBI (Codon Bias Index), ICDI (Intrinsic Codon Deviation Index), and Ew (weighted sum of relative entropy). Ew ranges from 0 (maximum bias) to 1 (no bias). Ew takes into account all three aspects of amino acid usage (i.e., the number of different amino acids, their relative frequency, and their codon degeneracy), and indeed is little affected by amino acid usage biases (Suzuki H et al., 2004).
For plasmid F (plasmidf), values of ENC, SCS, CBI, ICDI, and Ew for each gene are calculated by
Various methods of predicting gene expression level from codon usage bias have been proposed, including P2 index, Fop (Frequency of OPtimal codons), CAI (Codon Adaptation Index), tAI (tRNA adaptation index), and PHX (Predicted Highly eXpressed).
For E.coli (ecoli), P2, Fop, CAI, tAI, and PHX are implemented by
http://rest.g-language.org/ecoli/cai/tag=product shows functional annotations ("product") instead of "locus_tag" for genes.
P2 indicates the efficiency of the codon–anticodon interaction, and highly expressed genes in E.coli have high P2 values (>0.7).
Fop takes values from 0.0 (where no optimal codons are used) to 1.0 (where only optimal codons are used).
CAI is a measure of the relative adaptiveness of the codon usage of a gene towards the codon usage of highly expressed genes. CAI ranges from 0.0 to 1.0.
PHX calculates codon usage difference of a gene from all genes (BgC) and from highly expressed genes (BgH), and Expression measure (E_g = BgC/BgH). A gene is deemed Predicted Highly eXpressed (phx = 1) if BgH is lower than BgC and thus E_g > 1.0. A gene is deemed Putative Alien (pa = 1) provided both BgH and BgC exceed the median value for all genes. http://rest.g-language.org/ecoli/phx/output=stdout (standard output)
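The PHX decision rules described above can be sketched in Python once BgC and BgH are known for each gene. The sketch does not compute the codon usage differences themselves. It assumes that BgC is compared with the median BgC and BgH with the median BgH, which is one reading of "the median value for all genes"; the function name and the input values are hypothetical.

```python
from statistics import median

def classify_phx(genes):
    """genes maps a name to (BgC, BgH); return E_g, phx, and pa per gene."""
    median_bgc = median(bgc for bgc, _ in genes.values())
    median_bgh = median(bgh for _, bgh in genes.values())
    result = {}
    for name, (bgc, bgh) in genes.items():
        e_g = bgc / bgh  # expression measure E_g = BgC / BgH
        result[name] = {
            "E_g": e_g,
            "phx": int(e_g > 1.0),
            "pa": int(bgc > median_bgc and bgh > median_bgh),
        }
    return result

# Made-up values for three hypothetical genes.
toy = {"g1": (0.5, 0.25), "g2": (0.25, 0.5), "g3": (1.0, 0.75)}
for name, flags in classify_phx(toy).items():
    print(name, flags)
```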
Values of CAI and PHX analyses derived from different genomes cannot be simply compared because these values are based on highly expressed genes in the genomes.
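The dependence on a reference set of highly expressed genes is easy to see in code. The Python sketch below computes Fop as the fraction of optimal codons and CAI as the geometric mean of per-codon relative adaptiveness values, the usual formulation of CAI. The set of optimal codons and the weights are inputs that would be derived from the highly expressed genes of one genome; the values used here are made up, the function names are hypothetical, and the simplified Fop counts every codon except Met, Trp, and stop codons.

```python
import math

# Codons without synonymous alternatives, plus stop codons, are skipped.
EXCLUDED = frozenset({"ATG", "TGG", "TAA", "TAG", "TGA"})

def split_codons(cds):
    """Split a coding sequence into codons, ignoring an incomplete tail."""
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]

def fop(cds, optimal_codons):
    """Fraction of scored codons that belong to the optimal set."""
    scored = [c for c in split_codons(cds) if c not in EXCLUDED]
    if not scored:
        return 0.0
    return sum(c in optimal_codons for c in scored) / len(scored)

def cai(cds, weights):
    """Geometric mean of relative adaptiveness over codons with a weight."""
    logs = [math.log(weights[c]) for c in split_codons(cds) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0

toy_weights = {"CTG": 1.0, "CTA": 0.25}  # hypothetical reference values
print(fop("ATGCTGCTGCTATAA", {"CTG"}))
print(cai("ATGCTGCTATAA", toy_weights))
```

Because the weights come from one genome's highly expressed genes, the same gene scored against another genome's weights would give a different value.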
In some bacteria exhibiting no evidence of translational selection on codon usage, highly expressed genes do not have unusual codon usage, and thus codon usage cannot be used to predict gene expression levels. (Henry I and Sharp PM, 2006)
S_value (http://rest.g-language.org/help/S_value) calculates the strength of translationally selected codon usage bias (S) (Sharp PM et al., 2005). Fast-growing bacteria tend to have more rRNA/tRNA genes and higher S-values. For example, S-values are higher in E.coli (http://rest.g-language.org/ecoli/S_value) and Bacillus subtilis (http://rest.g-language.org/bsub/S_value) than in B.burgdorferi (http://rest.g-language.org/bbur/S_value) and M.genitalium (http://rest.g-language.org/mgen/S_value).
Multivariate analysis methods, such as correspondence analysis (COA) and principal component analysis (PCA), are often used to identify gene features contributing to the variations in synonymous codon usage among genes.
Of the existing COA methods, Within-group Correspondence Analysis (WCA) performs best because it does not mask variation in synonymous codon usage caused by amino acid composition and codon degeneracy (Suzuki H et al., 2008).
codon_mva (http://rest.g-language.org/help/codon_mva) performs WCA of codon usage data for a given genome, and analyzes correlations between the WCA axes and various gene parameters (e.g. Laa, aroma, gravy, mmw, gcc3, gtc3, and P2). In the WCA plots, the first four axes (Comp1 to Comp4) obtained by WCA are shown in y-axes, and the gene features (e.g. gcc3 and gtc3) having the largest absolute correlation coefficients (|r|) are shown in x-axes.
http://rest.g-language.org/mgen/codon_mva/output=stdout (standard output) shows the contribution of each axis (%), the mean absolute standard score (z-score) for highly expressed genes on each axis, and the absolute correlation coefficient (|r|) between each axis and each gene parameter (e.g. Laa, aroma, gravy, mmw, gcc3, gtc3, and P2).
Of the five codon usage data representations (R0-R4), only R4 is independent of all three biases (gene length, amino acid composition, and codon degeneracy). Indeed, Principal Component Analysis (PCA) of R4 data (PCA-R4) performs best because it is not affected by any of these biases (Suzuki H et al., 2005). http://rest.g-language.org/mgen/codon_mva/method=pca/data=R4 performs PCA-R4 for M.genitalium (mgen).