Rice Genome Annotation Project

Oryza Repeat Database Background

Simple repeats: Previous experimental and bioinformatic genome analyses show that simple repetitive sequences occur as tandemly repeated microsatellites (1-7 bp), longer and more complex minisatellite repeating units (up to 40 bp), and satellite DNAs with lengths of 140 to 360 bp. McCouch et al. calculated that the rice genome includes 5,700 to 10,000 microsatellites, with the relative frequency of a repeat decreasing as the size of its motif increases (McCouch et al., 1997). Simple repeat microsatellites can be identified and "masked" in genomic sequence with programs such as RepeatMasker.
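To make the idea of finding and masking microsatellites concrete, the following is a minimal sketch in Python. It is not how RepeatMasker works internally; it is a simple regular-expression scan. The function names and the `min_repeats` threshold are illustrative assumptions, and only the 1-7 bp motif range is taken from the text above.

```python
import re

def find_microsatellites(seq, max_motif=7, min_repeats=4):
    """Find perfect tandem repeats of a 1..max_motif bp motif.

    Returns a list of (start, end, motif, copies) tuples with 0-based,
    end-exclusive coordinates. min_repeats is an illustrative threshold,
    not a value taken from the text.
    """
    seq = seq.upper()
    # Group 2 is the motif (the shortest candidate is tried first); the
    # back-reference then requires at least min_repeats - 1 further copies.
    pattern = re.compile(
        r"(([ACGT]{1,%d}?)\2{%d,})" % (max_motif, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        motif = match.group(2)
        copies = (match.end() - match.start()) // len(motif)
        hits.append((match.start(), match.end(), motif, copies))
    return hits

def mask_repeats(seq, hits, mask_char="N"):
    """Replace every reported repeat interval with mask_char."""
    chars = list(seq)
    for start, end, _motif, _copies in hits:
        chars[start:end] = mask_char * (end - start)
    return "".join(chars)

# Toy sequence: five copies of the dinucleotide GA flanked by unique bases.
toy = "TT" + "GA" * 5 + "CC"
toy_hits = find_microsatellites(toy)
toy_masked = mask_repeats(toy, toy_hits)
```

This sketch only detects perfect repeats. Real microsatellites often contain mismatches and small indels, which is one reason dedicated tools are used in practice.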

Mobile elements: Mobile DNA sequences, such as transposons and retrotransposons, make up a high proportion of plant middle repetitive DNA. In maize these types of sequences make up more than 50% of the nuclear genome (San Miguel et al., 1996). Hirochika et al. estimate that there are at least 1,000 retroelements in rice (Hirochika et al., 1997). Retroelements are divided into retrotransposons with long terminal repeats (LTRs) and non-LTR retrotransposons (LINEs, long interspersed nuclear elements, and the related SINEs, short interspersed nuclear elements). Plant genomes may also contain solo-LTRs, miniature inverted-repeat transposable elements (MITEs), DNA transposons, and virus-like sequences. A comprehensive survey of wild-type Oryza sativa gene sequences by Bureau et al. found that mobile elements frequently occur within rice genes and in the regions flanking them (Bureau et al., 1996). The most prevalent elements found in these genes were MITEs, which account for ~5% of rice genomic sequences (Ning and Wessler, 2001). Rice intergenic regions also contain DNA transposons belonging to the Mutator family (Yoshida et al., 1998), the En/Spm family (Motohashi et al., 1996), the Ac/Ds family (Song et al., 1998), and others. The identification of mobile elements can be improved by searching genomic sequence against a curated set of known mobile elements. This Rice Repeat Database is a compilation of sequences that can aid in the identification and annotation of such elements in the rice genome.
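As a rough illustration of what searching a genomic sequence against a curated set of known elements means, here is a toy Python sketch based on shared k-mers. In practice this search is done with alignment-based tools. Everything here is an assumption made for illustration: the function names, the `k` and `min_shared` parameters, and the two library sequences, which are invented and are not entries from this database.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def screen_against_library(query, library, k=8, min_shared=0.5):
    """Report library elements whose k-mers are largely present in query.

    library maps an element name to its sequence. An element is reported
    when at least min_shared of its distinct k-mers occur in the query.
    Results are (name, fraction) pairs sorted by decreasing fraction.
    """
    query_kmers = kmers(query, k)
    hits = []
    for name, element in library.items():
        element_kmers = kmers(element, k)
        if not element_kmers:
            continue  # element shorter than k, nothing to compare
        shared = len(element_kmers & query_kmers) / len(element_kmers)
        if shared >= min_shared:
            hits.append((name, shared))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Hypothetical two-entry library; these sequences are made up.
toy_library = {
    "toy_MITE": "GGCCAGTCACAATGGGGGTTTCACTGGCC",
    "toy_other": "ATATATCGCGCGATATCGCGATCGATCG",
}
# A query that embeds one library element between unrelated flanks.
toy_query = "ACGT" * 3 + toy_library["toy_MITE"] + "TTGACA"
toy_result = screen_against_library(toy_query, toy_library)
```

Exact k-mer matching tolerates no sequence divergence and only checks one strand, so it understates what a real alignment search against a curated library would find. It is meant only to show the shape of the comparison.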

Other repeats: Analysis of rice centromeric sequences indicates that the centromere is a complex region with stretches of tandemly repeated sequences intermixed with middle repetitive sequences. At least seven centromeric repetitive DNA families have been described in the rice centromere: six middle repetitive sequences (50 to 300 copies) and one tandem 168 bp repeat, RCS2 (Dong et al., 1998), that is unique to rice centromeres. RCS2 monomers are estimated to be present in the rice genome at ~6,200 copies, and Fiber-FISH signals suggest that there may be up to 151 kb of uninterrupted RCS2 sequence and up to 556 kb of centromeric repeats with interspersed gaps. Telomeric sequences, found at the ends of most plant and animal chromosomes, are highly conserved across most species. In rice, telomeric DNA consists of conserved 7 bp repeats (CCCTAAA) (Ohmido and Fukui, 1997; Wu et al., 1994). A final class of repetitive sequences found in all eukaryotic genomes is the 18S-5.8S-25S and 5S rRNA gene loci, clustered at a small number of sites, which encode the structural RNA components of ribosomes. In Oryza sativa japonica these loci have been identified on chromosome 9 (with one 5S locus on chromosome 11), while in indica varieties a second locus on chromosome 10, and perhaps a weak one on chromosome 11, have been identified (Fukui et al., 1994). Where available, we have included these types of repeats in this database.
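The 7 bp telomeric repeat described above lends itself to a simple scan. The Python sketch below counts the longest run of back-to-back CCCTAAA copies on either strand of a sequence. The function names are illustrative assumptions, only perfect copies are counted, and the input is assumed to contain only the bases A, C, G, and T.

```python
import re

TELOMERE_MOTIF = "CCCTAAA"  # rice telomeric repeat given in the text

def reverse_complement(seq):
    """Return the reverse complement of an upper-case A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def longest_telomere_run(seq, motif=TELOMERE_MOTIF):
    """Return the largest number of consecutive perfect motif copies.

    Both the motif and its reverse complement are searched, because the
    repeat reads as CCCTAAA on one strand and TTTAGGG on the other.
    """
    seq = seq.upper()
    best = 0
    for unit in (motif, reverse_complement(motif)):
        for run in re.finditer("(?:%s)+" % unit, seq):
            best = max(best, len(run.group(0)) // len(unit))
    return best

# Toy sequences with a telomere-like array at one end.
left_end = "CCCTAAA" * 3 + "ACGTACGT"
right_end = "ACGT" + "TTTAGGG" * 5
left_copies = longest_telomere_run(left_end)
right_copies = longest_telomere_run(right_end)
```

Degenerate or partial copies, which are common near the internal boundary of a telomeric array, are ignored by this exact-match approach.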

This work is supported by grants (DBI-0321538/DBI-0834043) from the National Science Foundation and funds from the Georgia Research Alliance, Georgia Seed Development, and the University of Georgia.