Introduction

The REST Service for the G-language System provides URL-based access to all functions of the G-language Genome Analysis Environment. All analysis resources can be accessed through the web browser on your smartphone, tablet, or PC.

For example, for the Escherichia coli genome (ecoli):

http://useG.jp/ecoli shows genome information such as genome size and G+C content.

http://useG.jp/ecoli/recA retrieves information for recA gene.

http://useG.jp/ecoli/base_entropy calculates sequence conservation using entropy.

http://useG.jp/ecoli/codon_usage calculates codon usage for all genes.

http://useG.jp/ecoli/gcwin calculates G+C content along the genome.

http://useG.jp/ecoli/gcskew calculates GC skew (C-G)/(C+G) along the genome. Documentation for the gcskew function can be viewed at http://useG.jp/help/gcskew.

Details for the methods are described in Arakawa K et al. (2008).

Base URL
list of available methods and data

Here we introduce examples of file upload, genome/gene information retrieval, and analysis methods for sequence conservation, DNA replication, and composition of nucleotides, amino acids, and codons.

File upload

You can upload your genome file at http://rest.g-language.org/upload/. Click [Choose file] to select a file, then click the [Submit] button. This returns a unique reference ID for the uploaded file, which you can use for the rest of the analysis. For example, if you uploaded a GenBank file (e.g. NC_002127.gbk) and received an ID of "2F9509",

Genome information retrieval

http://rest.g-language.org/organism_list/ shows NCBI RefSeq ACCESSION No. and DEFINITION for available genomes (chromosomes and plasmids) of bacteria as follows.

ACCESSION  DEFINITION                    
NC_000913  Escherichia coli str. K-12 substr. MG1655 chromosome, complete genome.

Genome information (e.g. Length of Sequence and GC Content) is given by

http://rest.g-language.org/[ACCESSION]

For example, information for Escherichia coli genome (NC_000913) is given by http://rest.g-language.org/NC_000913

The following genomes can also be accessed with G-language preset names.

ACCESSION Genome                      (preset name)

NC_000913 Escherichia coli K12 MG1655 (ecoli) 
NC_000964 Bacillus subtilis           (bsub)
NC_000908 Mycoplasma genitalium       (mgen) 
NC_005070 Synechococcus sp.           (cyano)
NC_003413 Pyrococcus furiosus         (pyro)
NC_001318 Borrelia burgdorferi B31    (bbur)
NC_002483 Plasmid F                   (plasmidf)
NC_001416 Enterobacteria phage lambda (lambda)

For E.coli (ecoli), http://rest.g-language.org/ecoli/TAXONOMY prints taxonomic information where '0' indicates 'all' taxonomic ranks, and taxonomic ranks 1, 2, 3, 4, and 5 correspond with domain, phylum, class, order, and family, respectively.

Gene information retrieval

repB gene information for plasmid F (plasmidf) is given by http://rest.g-language.org/plasmidf/repB

G-language assigns a FEATURE number and CDS number to each gene. For example, repB gene has a FEATURE number of 110 and CDS number of 35.

'feature' => 110,
'cds' => 35,

As shown below, the genome coordinate for repB gene is 36643..37620, and its direction (either direct or complement) is complement.

'start' => '36643',
'end' => '37620',
'direction' => 'complement',
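
The coordinates above can be turned into a gene length with one line of arithmetic. The following is a minimal sketch, assuming the 'start' and 'end' values are 1-based and inclusive, as in GenBank feature locations; the function name gene_length is hypothetical and not part of G-language.

```python
def gene_length(start, end):
    """Length in bp of a feature given 1-based, inclusive coordinates."""
    return int(end) - int(start) + 1

# Values reported for repB on plasmid F (copied from the output above)
repB = {"start": "36643", "end": "37620", "direction": "complement"}

length_bp = gene_length(repB["start"], repB["end"])
codons = length_bp // 3  # number of codons, stop codon included

print(length_bp, codons)
```

Because the direction is complement, the coding sequence is read from position 37620 down to 36643 on the opposite strand; the length is the same either way.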

Functional annotations for repB gene are

'product' => 'replication initiator protein',
'note' => 'binds to repeated iterons in RepFIB',

Gene IDs for repB gene are

'protein_id' => 'NP_061412.1',
'db_xref' => 'GI:9507746	GeneID:1263561',

G-Links (http://link.g-language.org/) collects related information from multiple databases (e.g. GO, KEGG, Pfam, PubMed, and UniProtKB) for a given gene, identified by its gene IDs or its nucleotide/amino acid sequence.

http://rest.g-language.org/plasmidf/repB/translation shows amino acid sequence for repB gene.

http://rest.g-language.org/plasmidf/repB/get_geneseq shows nucleotide sequence for repB gene.

http://rest.g-language.org/plasmidf/repB/before_startcodon/200 shows 200 bp sequence upstream of the start codon of repB gene.

http://rest.g-language.org/plasmidf/*/product shows product names of all genes.

http://rest.g-language.org/plasmidf/product=replication/product shows product names of genes containing "replication" in the product feature tag.
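
All of the URLs in this section follow the same pattern: the base URL, then a genome, then path segments and key=value options separated by slashes. The following is a minimal sketch of a helper that assembles such URLs. The name build_url is hypothetical, and no percent-encoding is applied because this page does not say how the service treats encoded characters.

```python
BASE_URL = "http://rest.g-language.org"

def build_url(*segments, **options):
    """Join path segments and key=value options into a request URL."""
    parts = [str(segment) for segment in segments]
    # Options are appended in the order they are given
    parts += [f"{key}={value}" for key, value in options.items()]
    return BASE_URL + "/" + "/".join(parts)

print(build_url("plasmidf", "repB", "translation"))
print(build_url("plasmidf", "repB", "before_startcodon", 200))
print(build_url("plasmidf", "*", "product"))
```

The resulting strings can be opened in a browser or passed to any HTTP client.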

Output sequence data in FASTA format

For E.coli O157:H7 str. Sakai plasmid pOSAK1 (NC_002127),

Retrieving rRNA genes

For E.coli str. K-12 substr. MG1655 chromosome (NC_000913),

Pattern searches

The PatSearch class is a collection of sequence analysis methods related to pattern searches for oligonucleotides, including oligomer_search and palindrome. For example, you can search for an inverted repeat (5' TTACGnnnnnnCGTAA 3') and a palindrome (5' TTACGCGTAA 3') as follows.

For E.coli (ecoli), http://rest.g-language.org/ecoli/oligomer_search/TTACGCGTAA searches for the oligomer "TTACGCGTAA" and returns a list of the positions where it is found, as follows.

209570,1164188,1443204,1934579,2167198,2919269,4203297

http://rest.g-language.org/ecoli/oligomer_search/TTACGnnnnnnCGTAA/return=both searches for an inverted repeat "TTACGnnnnnnCGTAA" and returns both positions and oligomers as follows.

843936,ttacgaaacagcgtaa,3112312,ttacgcacaggcgtaa
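
The two outputs above are flat comma-separated lists, so they are easy to split into usable values. The following is a minimal sketch, assuming the response body has exactly the form shown on this page; the function names are hypothetical.

```python
def parse_positions(text):
    """Parse a positions-only result such as '209570,1164188'."""
    return [int(field) for field in text.strip().split(",") if field]

def parse_both(text):
    """Parse a return=both result into (position, oligomer) pairs."""
    fields = text.strip().split(",")
    return [(int(fields[i]), fields[i + 1]) for i in range(0, len(fields) - 1, 2)]

hits = parse_both("843936,ttacgaaacagcgtaa,3112312,ttacgcacaggcgtaa")
for position, oligomer in hits:
    print(position, oligomer)
```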

Oligomers can be specified using degenerate codes (e.g. "grtggngg") or regular expressions (e.g. "g[ag]tgg[a-z]gg").
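
To illustrate how a degenerate pattern relates to a regular expression, the following sketch expands IUPAC nucleotide codes into character classes and searches a local sequence. It is an illustration only, not the oligomer_search implementation: the function names are hypothetical, only one strand is searched, and the 1-based positions are an assumption.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes
IUPAC = {
    "a": "a", "c": "c", "g": "g", "t": "t",
    "r": "[ag]", "y": "[ct]", "s": "[cg]", "w": "[at]",
    "k": "[gt]", "m": "[ac]",
    "b": "[cgt]", "d": "[agt]", "h": "[act]", "v": "[acg]",
    "n": "[acgt]",
}

def degenerate_to_regex(pattern):
    """Translate a degenerate oligomer such as 'grtggngg' into a regex."""
    return "".join(IUPAC[base] for base in pattern.lower())

def find_positions(sequence, pattern):
    """Return (1-based position, matched oligomer) pairs, overlaps included."""
    # A lookahead keeps overlapping matches from being skipped
    regex = re.compile("(?=(" + degenerate_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence.lower())]

print(degenerate_to_regex("grtggngg"))
print(find_positions("ccgatggtggaa", "grtggngg"))
```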

http://rest.g-language.org/plasmidf/palindrome/shortest=10 searches for palindromic sequences of 10 bp or longer in plasmid F (plasmidf).
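
In DNA, a palindrome is a sequence that equals its own reverse complement. The following minimal sketch checks that property for the two example patterns in this section; the function names are hypothetical and the palindrome method itself may apply additional rules (such as allowed spacers) that are not described here.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def is_palindrome(seq):
    """True if the sequence reads the same on both strands."""
    upper = seq.upper()
    return upper == reverse_complement(upper)

print(is_palindrome("TTACGCGTAA"))        # the 10-bp palindrome used above
print(is_palindrome("ttacgaaacagcgtaa"))  # inverted repeat with a 6-bp spacer
```

The second sequence is an inverted repeat but not a perfect palindrome, because its spacer is not self-complementary.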

Sequence conservation

Shannon information theory has been used to identify conserved regions such as transcription factor binding sites and Shine-Dalgarno/Kozak sequences for ribosome binding sites. (Arakawa K et al., 2008)

For E.coli (ecoli), http://rest.g-language.org/ecoli/base_entropy calculates and graphs the sequence conservation in regions around the "start" codons using Shannon uncertainty (entropy).
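
As an illustration of the idea, the following sketch computes the Shannon uncertainty H = sum of p * log2(1/p) for each column of a set of aligned sequences. A fully conserved column gives 0 bits and a column with equal use of four bases gives 2 bits. This is a textbook formula applied to hypothetical toy sequences; the exact estimator and region used by base_entropy are not specified on this page.

```python
import math
from collections import Counter

def column_entropy(bases):
    """Shannon uncertainty (bits) of the bases observed at one position."""
    counts = Counter(bases.upper())
    total = sum(counts.values())
    return sum((n / total) * math.log2(total / n) for n in counts.values())

def positional_entropy(aligned):
    """Entropy at each position of equal-length, aligned sequences."""
    return [column_entropy("".join(column)) for column in zip(*aligned)]

# Hypothetical sequences aligned at the start codon (positions 1-3)
toy = ["ATGA", "ATGC", "ATGG", "ATGT"]
print(positional_entropy(toy))
```

Low values mark conserved positions, which is why the start codon itself stands out in such a plot.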

Replication strand analysis

The excess of G over C and T over A in the leading strand of DNA replication relative to the lagging strand is observed in many bacteria, and this is thought to reflect strand-specific mutational bias. (Arakawa K et al., 2008)

http://rest.g-language.org/ecoli/gcskew calculates and graphs GC skew (C-G)/(C+G) for E.coli (ecoli). Replication origin and terminus are located around the GC skew shift points.

As shown in the manual for the gcskew analysis method (http://rest.g-language.org/help/gcskew), the default parameters are a 10,000-bp window size (window=10000) and graph output and display (output=show).

http://rest.g-language.org/ecoli/gcskew/output=f outputs the data in CSV format.

http://rest.g-language.org/ecoli/gcskew/window=100000 calculates GC skew with 100,000-bp window size.

http://rest.g-language.org/ecoli/gcskew/at=1 calculates AT skew (A-T)/(A+T) instead of GC skew.

http://rest.g-language.org/ecoli/gcskew/cumulative=1 calculates cumulative GC skew. The cumulative GC skew graph is used to clarify the shift points, where the maximum and minimum points correspond to the replication origin and terminus (3.9e+06 bp and 1.5e+06 bp), respectively.

http://rest.g-language.org/ecoli/gcsi calculates the GC Skew Index (GCSI) to quantify the degree of GC skew. The E.coli genome has a GCSI of 0.09666.
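
The windowed and cumulative calculations can be sketched locally as follows. This is a minimal illustration of the (C-G)/(C+G) formula used on this page, assuming non-overlapping windows and a running sum for the cumulative curve; the function names and the toy sequence are hypothetical, and the gcskew method may differ in details such as window stepping.

```python
from itertools import accumulate

def gc_skew(window_seq):
    """GC skew (C-G)/(C+G) of one window; 0.0 if the window has no G or C."""
    seq = window_seq.upper()
    c, g = seq.count("C"), seq.count("G")
    return (c - g) / (c + g) if (c + g) else 0.0

def windowed_skew(sequence, window=10000):
    """GC skew for consecutive, non-overlapping windows."""
    return [gc_skew(sequence[i:i + window]) for i in range(0, len(sequence), window)]

def cumulative_skew(skews):
    """Running sum of the per-window skew values."""
    return list(accumulate(skews))

# Toy sequence: a C-rich half followed by a G-rich half
toy = "CCCG" * 2 + "GGGC" * 2
skews = windowed_skew(toy, window=4)
cumulative = cumulative_skew(skews)
peak_window = cumulative.index(max(cumulative))
print(skews, cumulative, peak_window)
```

In the toy sequence the cumulative curve peaks where the skew changes sign, which is the kind of vertex that marks a shift point in a real genome.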

http://rest.g-language.org/ecoli/find_ori_ter predicts the replication origin and terminus based on the vertices of cumulative skew graphs.

rep_ori_ter gets the positions of the replication origin and terminus by several means. For example, http://rest.g-language.org/ecoli/rep_ori_ter retrieves the positions of the replication origin (3924034) and terminus (1588773) in the E.coli genome (ecoli) from the databases. If the position of the origin or terminus cannot be found in the databases, rep_ori_ter calls the find_ori_ter method.

http://rest.g-language.org/ecoli/genomicskew calculates the GC skew in different regions of the given genome (whole genome, coding regions, intergenic regions, and third codon positions).

dnaA gene (http://rest.g-language.org/ecoli/dnaA) is located around the replication origin.

http://rest.g-language.org/ecoli/find_dnaAbox finds dnaA box (5'-TT A/T TNCACA-3') in both strands.

http://rest.g-language.org/plasmidf/find_iteron finds iteron (5'-TGAGGG G/A C/T-3') in both strands of plasmids.

Nucleotide composition analysis

G+C content along the genome

Genomic G+C content (G+C)/(A+T+G+C) is correlated with a number of factors including genome size, aerobiosis, optimal growth temperature, and free-living lifestyle (Hildebrand F et al., 2010). To identify genomic islands (clusters of foreign genes), intragenomic variation in G+C content is computed using sliding windows (Karlin S., 2001).

http://rest.g-language.org/mgen/gcwin calculates and graphs the GC content along the genome of Mycoplasma genitalium (mgen).

http://rest.g-language.org/mgen/gcwin/window=1000 uses 1,000-bp window size instead of the default 10,000-bp window size.
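
The sliding-window calculation can be sketched as follows, using the (G+C)/(A+T+G+C) formula given above. This is a minimal illustration assuming non-overlapping windows; the function names and toy sequence are hypothetical, and gcwin may step its windows differently.

```python
def gc_content(seq):
    """G+C content (G+C)/(A+T+G+C); ambiguous bases are ignored."""
    seq = seq.upper()
    a, t, g, c = (seq.count(base) for base in "ATGC")
    total = a + t + g + c
    return (g + c) / total if total else 0.0

def gc_windows(sequence, window=10000):
    """G+C content for consecutive, non-overlapping windows."""
    return [gc_content(sequence[i:i + window]) for i in range(0, len(sequence), window)]

toy = "ATATGCGCATGC"
print(gc_windows(toy, window=4))
```

A smaller window gives a noisier but more detailed profile, which is the trade-off behind the window= option.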

Oligonucleotide composition analysis

Each organism has its characteristic “genomic signature” defined as the ratios between the observed and expected frequencies of oligonucleotides (di-, tri-, and tetranucleotides). The genomic signature has been applied to taxonomic classification and predicting plasmid hosts (Teeling H and Glöckner FO, 2012)(Campbell A et al., 1999)(Suzuki H et al., 2008)(Suzuki H et al., 2010).

For M.genitalium (mgen), http://rest.g-language.org/mgen/signature calculates 2-mer signature, and http://rest.g-language.org/mgen/signature/wordlength=3 calculates 3-mer signature.
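
For dinucleotides, the observed/expected ratio can be sketched as the frequency of each 2-mer divided by the product of the frequencies of its two bases. The code below is a minimal illustration under that assumption, on a single strand; the function name is hypothetical, and the signature method may differ (for example by combining both strands).

```python
from collections import Counter
from itertools import product

def dinucleotide_signature(sequence):
    """Observed/expected frequency ratio for each of the 16 dinucleotides."""
    seq = sequence.upper()
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono, n_di = len(seq), len(seq) - 1
    signature = {}
    for x, y in product("ACGT", repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono)
        observed = di[x + y] / n_di
        # Ratio is reported as 0.0 when a base is absent from the sequence
        signature[x + y] = observed / expected if expected else 0.0
    return signature

sig = dinucleotide_signature("ATATATAT")
print(round(sig["AT"], 3), sig["AA"])
```

Ratios above 1 indicate over-represented dinucleotides and ratios below 1 indicate under-represented ones.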

Base Usage Indices (bui)

http://rest.g-language.org/mgen/bui calculates the following base usage indices for protein-coding sequences (CDS) in M.genitalium (mgen).

  • acgt: A + T + G + C
  • ryr: purine/pyrimidine ratio (A + G)/(T + C)
  • gcc: G+C content (G + C)/(A + T + G + C)
  • gcs: GC skew (C - G)/(C + G)
  • ats: AT skew (A - T)/(A + T)
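
The five indices listed above follow directly from base counts. The following minimal sketch computes them for a single coding sequence; the function names and the toy sequence are hypothetical, and undefined ratios are returned as NaN by assumption.

```python
def base_usage_indices(cds):
    """Compute acgt, ryr, gcc, gcs and ats for one coding sequence."""
    seq = cds.upper()
    a, c, g, t = (seq.count(base) for base in "ACGT")

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "acgt": a + c + g + t,
        "ryr": ratio(a + g, t + c),
        "gcc": ratio(g + c, a + t + g + c),
        "gcs": ratio(c - g, c + g),
        "ats": ratio(a - t, a + t),
    }

def third_positions(cds):
    """Bases at the 3rd position of each codon (cf. position=3)."""
    return cds[2::3]

toy_cds = "ATGGCCAAATAA"
indices = base_usage_indices(toy_cds)
print(indices)
print(base_usage_indices(third_positions(toy_cds)))
```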

http://rest.g-language.org/mgen/bui/position=3 uses only the bases at the 3rd position of codons to calculate the base usage indices.

http://rest.g-language.org/mgen/bui/tag=start shows start positions of CDS in the genome. Values for the tag= option can be 'start', 'end', 'gene', 'product', 'locus_tag', 'protein_id', 'db_xref', and so on.

Amino acid usage analysis

Amino acid frequency

For M.genitalium (mgen), absolute frequencies (data=A0) and relative frequencies (data=A1) of amino acids summed across all proteins are given by

Amino Acid Usage Indices (aaui)

http://rest.g-language.org/mgen/aaui calculates the following amino acid usage indices for proteins in M.genitalium (mgen).

  • Laa: length in amino acids
  • ndaa: number of different amino acids
  • aroma: relative frequency of aromatic amino acids
  • gravy: mean hydropathic indices of each amino acid
  • mmw: mean molecular weight

Multivariate analyses of amino acid usage

Multivariate statistical methods such as correspondence analysis (COA) have been used to identify major sources of variation in amino acid usage among proteins (Lobry JR et al., 1994)(Zavala A et al., 2002). For example, http://rest.g-language.org/mgen/codon_mva/method=coa/data=A0 performs COA of amino acid usage for the M.genitalium (mgen) proteome. The first axis is correlated with protein hydrophobicity (gravy), and it discriminates integral membrane proteins from the others.

Codon usage analysis

Synonymous codon usage varies both between organisms and among genes within a genome, and arises due to differences in G+C content, replication strand skew, or gene expression levels. Thus, synonymous codon usage analysis provides a way to identify horizontally transferred or highly expressed genes. (Arakawa K et al., 2008)

Representation of codon usage data

Different kinds of representations of codon usage data (termed here R0-R4) have been used in codon usage studies.

  • R0: absolute codon frequency (or codon count).
  • R1: relative codon frequency in a complete sequence.
  • R2: relative codon frequency in each amino acid.
  • R3: Relative Synonymous Codon Usage (RSCU).
  • R4: Relative adaptiveness (or W value).
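
The relationship between the five representations can be sketched with the standard definitions: RSCU is the observed count divided by the count expected under equal usage of synonymous codons, and relative adaptiveness is the count divided by the largest count among synonymous codons. The code below is a minimal illustration with hypothetical counts and only two amino acid families; it is not the codon_compiler implementation.

```python
# Two synonymous codon families from the standard genetic code
FAMILIES = {"Lys": ["AAA", "AAG"], "Phe": ["TTT", "TTC"]}

def codon_representations(counts, families=FAMILIES):
    """Return R0-R4 for every codon in the given families."""
    total = sum(counts.get(c, 0) for codons in families.values() for c in codons)
    reps = {}
    for codons in families.values():
        family = [counts.get(c, 0) for c in codons]
        fam_total, fam_max = sum(family), max(family)
        for codon, n in zip(codons, family):
            reps[codon] = {
                "R0": n,                                              # absolute count
                "R1": n / total if total else 0.0,                    # share of all codons
                "R2": n / fam_total if fam_total else 0.0,            # share within amino acid
                "R3": n * len(codons) / fam_total if fam_total else 0.0,  # RSCU
                "R4": n / fam_max if fam_max else 0.0,                # relative adaptiveness
            }
    return reps

toy_counts = {"AAA": 30, "AAG": 10, "TTT": 5, "TTC": 15}  # hypothetical counts
reps = codon_representations(toy_counts)
print(reps["AAA"])
```

R1 in this sketch is relative to the codons of the two listed families only; with a full codon table it would be relative to the complete sequence.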

For plasmid F (plasmidf), codon usage data R0-R4 for all genes are graphically shown by

http://rest.g-language.org/plasmidf/codon_compiler/data=R0/output=stdout (standard output)

http://rest.g-language.org/plasmidf/codon_compiler/data=R0/output=stdout/id=FEATURE110 calculates codon counts for repB gene ('feature' ⇒ 110) accessible at http://rest.g-language.org/plasmidf/repB

Synonymous codon usage diversity

The mean distance (Dmean) between all pairs of genes can be used to measure the level of diversity in synonymous codon usage among genes (Suzuki H et al., 2009). http://rest.g-language.org/plasmidf/Dmean calculates the Dmean value for plasmid F (plasmidf).

Synonymous codon usage evenness

Several measures have been used to estimate the degree of deviation from equal usage of synonymous codons, including ENC (Effective Number of Codons), SCS (Scaled Chi-Square), CBI (Codon Bias Index), ICDI (Intrinsic Codon Deviation Index), and Ew (weighted sum of relative entropy). Ew ranges from 0 (maximum bias) to 1 (no bias). Ew takes into account all three aspects of amino acid usage (i.e., the number of different amino acids, their relative frequency, and their codon degeneracy), and indeed is little affected by amino acid usage biases (Suzuki H et al., 2004).

For plasmid F (plasmidf), values of ENC, SCS, CBI, ICDI, and Ew for each gene are calculated by

Predicting gene expression level

Various methods of predicting gene expression level from codon usage bias have been proposed, including P2 index, Fop (Frequency of OPtimal codons), CAI (Codon Adaptation Index), tAI (tRNA adaptation index), and PHX (Predicted Highly eXpressed).

For E.coli (ecoli), P2, Fop, CAI, tAI, and PHX are implemented by

http://rest.g-language.org/ecoli/cai/tag=product shows functional annotations ("product") instead of "locus_tag" for genes.

P2 indicates the efficiency of the codon–anticodon interaction, and highly expressed genes in E.coli have high P2 values (>0.7).

Fop takes values from 0.0 (where no optimal codons are used) to 1.0 (where only optimal codons are used).

CAI is a measure of the relative adaptiveness of the codon usage of a gene towards the codon usage of highly expressed genes. CAI ranges from 0.0 to 1.0.
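
CAI is conventionally computed as the geometric mean of the relative adaptiveness (w) values of the codons in a gene. The following minimal sketch assumes that definition; the w values are hypothetical toy numbers rather than values derived from E.coli, codons without a w value are skipped, and the cai method may handle such cases differently.

```python
import math

def split_codons(cds):
    """Split a coding sequence into complete codons."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

def cai(codons, w):
    """Geometric mean of w over the codons that have a positive w value."""
    logs = [math.log(w[c]) for c in codons if w.get(c, 0) > 0]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")

toy_w = {"AAA": 1.0, "AAG": 0.25}  # hypothetical relative adaptiveness values
print(cai(split_codons("AAAAAA"), toy_w))
print(cai(split_codons("AAAAAG"), toy_w))
```

A gene that uses only the most adapted codon of each family reaches the upper bound of 1.0, and every less adapted codon pulls the value down.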

PHX calculates codon usage difference of a gene from all genes (BgC) and from highly expressed genes (BgH), and Expression measure (E_g = BgC/BgH). A gene is deemed Predicted Highly eXpressed (phx = 1) if BgH is lower than BgC and thus E_g > 1.0. A gene is deemed Putative Alien (pa = 1) provided both BgH and BgC exceed the median value for all genes. http://rest.g-language.org/ecoli/phx/output=stdout (standard output)

Values of CAI and PHX analyses derived from different genomes cannot be simply compared because these values are based on highly expressed genes in the genomes.

Detecting translational selection

In some bacteria exhibiting no evidence of translational selection on codon usage, highly expressed genes do not have unusual codon usage, and thus codon usage cannot be used to predict gene expression levels. (Henry I and Sharp PM, 2006)

S_value (http://rest.g-language.org/help/S_value) calculates the strength of translationally selected codon usage bias (S) (Sharp PM et al., 2005). Fast-growing bacteria tend to have more rRNA/tRNA genes and higher S-values. For example, S-values are higher in E.coli (http://rest.g-language.org/ecoli/S_value) and Bacillus subtilis (http://rest.g-language.org/bsub/S_value) than in B.burgdorferi (http://rest.g-language.org/bbur/S_value) and M.genitalium (http://rest.g-language.org/mgen/S_value).

Multivariate analyses of codon usage data

Multivariate analysis methods, such as correspondence analysis (COA) and principal component analysis (PCA), are often used to identify gene features contributing to the variations in synonymous codon usage among genes.

Of the existing COA methods, Within-group Correspondence Analysis (WCA) performs best because it does not mask variation in synonymous codon usage caused by amino acid composition and codon degeneracy (Suzuki H et al., 2008).

codon_mva (http://rest.g-language.org/help/codon_mva) performs WCA of codon usage data for a given genome, and analyzes correlations between the WCA axes and various gene parameters (e.g. Laa, aroma, gravy, mmw, gcc3, gtc3, and P2). In the WCA plots, the first four axes (Comp1 to Comp4) obtained by WCA are shown in y-axes, and the gene features (e.g. gcc3 and gtc3) having the largest absolute correlation coefficients (|r|) are shown in x-axes.

  • For E.coli (http://rest.g-language.org/ecoli/codon_mva), Comp1 (20.8% of variation) is correlated with gcc3 (G+C content at 3rd codon position) (r = 0.70). Comp2 (9.9% of variation) clearly separates highly expressed genes (red circles) from the other genes (black crosses) (the mean absolute standard score for highly expressed genes, z = 3.14), suggesting that translational selection is acting on synonymous codon usage.
  • For M.genitalium (http://rest.g-language.org/mgen/codon_mva), Comp1 is correlated with gcc3 (r = 0.96). Intragenomic variation in G+C content mostly reflects the existence of regions with anomalous nucleotide composition, putatively acquired by horizontal transfer. The exception to this is M.genitalium, in which intragenomic G+C variation is continuous along the genome (http://rest.g-language.org/mgen/gcwin).

http://rest.g-language.org/mgen/codon_mva/output=stdout (standard output) shows the contribution of each axis (%), the mean absolute standard score (z-score) for highly expressed genes on each axis, and the absolute correlation coefficient (|r|) between each axis and each gene parameter (e.g. Laa, aroma, gravy, mmw, gcc3, gtc3, and P2).

Of the five codon usage representations (R0-R4), only R4 is independent of all three biases (gene length, amino acid composition, and codon degeneracy). Indeed, Principal Component Analysis (PCA) of R4 data (PCA-R4) performs best because it is not affected by any of these biases (Suzuki H et al., 2005). http://rest.g-language.org/mgen/codon_mva/method=pca/data=R4 performs PCA-R4 for M.genitalium (mgen).

restgenomeanalysisenglish.txt · Last modified: 2014/12/13 08:25 by haruo