Rice Genome Annotation Project

Automated Annotation of the Rice Genome

All rice BAC/PAC sequences were downloaded from the HTGS and PLANT divisions of GenBank and assembled into 12 pseudomolecules. Each pseudomolecule sequence was processed by our annotation pipeline as described below. Please note that all data are produced by automated processes and are NOT manually curated.

Steps involved in the automated annotation:

The pseudomolecule sequences and results of all analyses are stored in our central relational database (Postgres).

Gene prediction programs
- FGENESH (FGENESH predictions were used as the default working gene models in the automated annotation)
- GeneMark.hmm (rice)
- Genscan (Maize)
- Genscan+ (Arabidopsis)
- GeneSplicer, to predict exon/intron splicing sites
- tRNAscan-SE, to predict tRNA
Database search
- Rice ESTs, FL-cDNAs, and transcript assemblies (PUTs) from PlantGDB were aligned to the pseudomolecules using GMAP. Only the FL-cDNA and PUT alignments are shown in the browser; only the EST and FL-cDNA alignments were used for gene model improvement by PASA.
- Each pseudomolecule sequence was also searched against a rice repeat database to identify known repeats and transposons (DNA transposons, retroelements, MITEs, etc.).
- Simple repeats were identified and annotated with RepeatMasker.
Improve gene model structures with rice EST/FL-cDNA using PASA
Criteria for the definition of genes
- Gene models with protein matches are named after the database entries to indicate similarity.
- Specifically, gene models with greater than 30% identity and greater than 50% coverage are annotated as "xxxx, putative".
- Genes that have transcript support are additionally annotated as "expressed".
- Genes that are not supported by transcript evidence but that do align to protein sequence from a known gene are annotated as "conserved hypothetical protein".
- Predicted genes with no alignments to known genes, transcripts or protein sequences are simply labeled as "hypothetical protein".
- Gene models that do not have homology to known genes or proteins but that are supported by rice transcript evidence are labeled as "expressed protein".
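The naming criteria above can be read as a small decision procedure. The following Python sketch is illustrative only and is not the project's actual code. The `Evidence` class, its field names, and the `name_gene_model` function are hypothetical. The order in which the rules are tested is also an assumption: the criteria do not say how to name a model whose protein match falls below the identity/coverage thresholds, so this sketch falls back to "expressed protein" when transcript support exists and to "conserved hypothetical protein" otherwise.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    """Hypothetical summary of the evidence attached to one gene model."""
    protein_hit_name: Optional[str] = None  # description of the best protein match, if any
    identity_pct: float = 0.0               # percent identity of that match
    coverage_pct: float = 0.0               # percent coverage of that match
    has_transcript_support: bool = False    # any supporting transcript alignment


def name_gene_model(ev: Evidence) -> str:
    """Return a product name following the stated criteria (rule order is assumed)."""
    has_hit = ev.protein_hit_name is not None
    # Thresholds from the criteria: greater than 30% identity and greater than 50% coverage.
    strong_hit = has_hit and ev.identity_pct > 30.0 and ev.coverage_pct > 50.0

    if strong_hit:
        name = f"{ev.protein_hit_name}, putative"
        if ev.has_transcript_support:
            name += ", expressed"
        return name
    if ev.has_transcript_support:
        # Assumption: a sub-threshold protein match is ignored when transcripts exist.
        return "expressed protein"
    if has_hit:
        # Assumption: a sub-threshold match without transcript support lands here.
        return "conserved hypothetical protein"
    return "hypothetical protein"


if __name__ == "__main__":
    example = Evidence("protein kinase", identity_pct=45.0, coverage_pct=80.0,
                       has_transcript_support=True)
    print(name_gene_model(example))
```

Note that the thresholds are strict, so a match at exactly 30% identity does not qualify for a "xxxx, putative" name in this sketch.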
For additional information, see these publications.
- Ouyang, S., Zhu, W., Hamilton, J., Lin, H., Campbell, M., Childs, K., Thibaud-Nissen, F., Malek, R.L., Lee, Y., Zheng, L., Orvis, J., Haas, B., Wortman, J. and Buell, C.R. 2007. The TIGR Rice Genome Annotation Resource: improvements and new features. Nucleic Acids Research 35: D883-D887
- Yuan, Q., Ouyang, S., Wang, A., Zhu, W., Maiti, R., Lin, H., Hamilton, J., Haas, B., Sultana, R., Cheung, F., Wortman, J., and Buell, C.R. 2005. The Institute for Genomic Research Osa1 Rice Genome Annotation Database. Plant Physiology 138: 18-2

The sequences of the annotated genes, along with supporting evidence, can also be found on our web site.

Software Links

FGENESH
GeneMark.hmm (Borodovsky and Lukashin, School of Biology, Georgia Institute of Technology)
Genscan (Chris Burge, Massachusetts Institute of Technology)
Genscan+ (Chris Burge, Massachusetts Institute of Technology)
tRNAscan-SE (T.M. Lowe, UCSC)
GMAP
dds/gap2, dps/nap (Xiaoqiu Huang, Dept of Computer Science, Michigan Technological University)
RepeatMasker (A.F.A. Smit & P. Green, University of Washington)

This work is supported by grants (DBI-0321538/DBI-0834043) from the National Science Foundation and funds from the Georgia Research Alliance, Georgia Seed Development, and the University of Georgia.