The Rice (Oryza sativa) P450 Genbank Inventory

Last modified Nov. 14, 2001
David R. Nelson

There has been much progress in sequencing the rice genome in the past two years. Monsanto paid for a private sequencing effort that covered the genome to 5X depth. These data were given to the Rice Genome Initiative to be used in finishing the clone-by-clone strategy they are using. Unfortunately, the 5X data were not deposited in Genbank, where everyone could look at them. Monsanto/Pharmacia have decided to make the data available to researchers by password access after legal documents are signed. Some restrictions still apply.

Before gaining access to the 5X data, I searched Genbank exhaustively for rice P450 sequences to identify what is already available. Blast searches were performed on the nr, est, htgs and gss sections using one member from each P450 clan in plants (9 different sequences: CYP51A2, a 71B, 72A7, 74A, 86A1, 85A, 97A, 710A1, 711A1). Additional searches were done as indicated by what was found. For example, if a partial 90C-like sequence was found in HTGS, then HTGS was searched with 90C to find the rest of the sequence. An alphabetical list of accession numbers was prepared and each search output was compared against this list for new entries (see below for a convenient way to do this).

In Sept. 2000, 20 families of plant P450s were not found in the rice set. To be sure that small fragments in the EST or GSS section of Genbank had not been overlooked, especially fragments from less conserved regions like the extreme C-terminus, blast searches were done with one member from each of these 20 families against the EST or GSS section of Genbank, limited to rice. This identified six additional entries in three different families. Three of these were identical sequences containing a small fragment upstream of the I-helix in CYP73. One was from a different region of CYP73. One was a single small exon including the heme binding site of a CYP708 P450. One was from a CYP92 sequence.
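The initial search grid described above (nine clan representatives against four Genbank sections) can be enumerated with a short script. This is only an illustrative sketch: the query labels are the abbreviations used in the text, and the function name search_plan is a made-up helper, not part of any blast tool.

```python
from itertools import product

# One representative per plant P450 clan, labeled as abbreviated in the text.
CLAN_QUERIES = ["CYP51A2", "71B", "72A7", "74A", "86A1",
                "85A", "97A", "710A1", "711A1"]

# Genbank sections that were searched.
SECTIONS = ["nr", "est", "htgs", "gss"]


def search_plan(queries, sections):
    """Return every (query, section) pair; each pair is one blast search."""
    return list(product(queries, sections))


plan = search_plan(CLAN_QUERIES, SECTIONS)
print(len(plan))  # 36 searches in the initial grid
for query, section in plan[:4]:
    print(query, section)
```

The follow-up searches (for example, 90C against HTGS) are not part of this grid, since they depended on what each search turned up.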
After these 9 × 4 searches (nine query sequences against four Genbank sections) and the additional searches were done, 622 accession numbers of rice P450 sequences had been identified. These contained 756 P450 sequence fragments, since some accession numbers held clusters of nine or even 15 P450 genes. All sequences from the blast output were compared against each other by Do-It-Yourself Wu-Blast 2.0, or on our new rice blast server after it came online. This allowed overlapping fragments to be joined into larger contigs. The result was 296 contigs, of which 172 are full length P450 sequences, 52 are pseudogene fragments and 72 are partial P450s that might still be completed.

All 296 sequences were compared to a database of all 273 Arabidopsis P450s plus seven additional P450s from families not present in Arabidopsis (CYP80, CYP92, CYP99, CYP719, CYP723, CYP725, CYP726). This assigned the fragments to specific families. Once this had been done, there were still 15 of 53 plant P450 families that had no P450s present in the rice set (CYP80, 82, 83, 702, 705, 708, 712, 714, 716, 718, 719, 720, 721, 725, 726). Some of these were borderline cases: some rice sequences could be placed in these families at 38-39% identity. Some of these families are from unusual plants like euphorbias (CYP726), yew (CYP725) or Cryptomeria (CYP719), and it is not surprising that they do not match rice.

Comparing families between the rice and Arabidopsis genomes, 34 of 45 families (76%) were present in both species. CYP92 was found in rice but not in Arabidopsis. Given that the rice genome is still incomplete in the public databases, this overlap suggests that most plant P450 families existed before the monocot-dicot divergence. As more data come in, the number of families missing between the two species will probably drop to a very small number. These may be specific to eudicots or even of more limited range. As mentioned above, there are seven families that are not seen in Arabidopsis, and this kind of specialization may occur in each major lineage of plants.
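The counts quoted above can be cross-checked with a few lines of arithmetic. The numbers are taken directly from the text; the variable names are only illustrative.

```python
# Contig categories as reported in the text.
contigs = {
    "full length": 172,
    "pseudogene fragments": 52,
    "partial": 72,
}

total_contigs = sum(contigs.values())
print(total_contigs)  # 296

# Share of each category among all contigs.
for label, count in contigs.items():
    print(f"{label}: {100 * count / total_contigs:.1f}%")

# Families present in both rice and Arabidopsis.
shared_families, compared_families = 34, 45
shared_pct = 100 * shared_families / compared_families
print(f"{shared_pct:.1f}%")  # 75.6%, quoted as 76% in the text

# Average number of P450 fragments per accession number.
fragments, accessions = 756, 622
fragments_per_accession = fragments / accessions
print(f"{fragments_per_accession:.2f}")
```

The three contig categories add up to the 296 total, and 34 of 45 rounds to the 76% given above.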
The following files and servers are available.

Rice P450 Blast Server

Rice alphanumeric accession number list
Use the accession number to locate the sequence in the contig collection below.

Rice contig collection, 296 sequences, Nov. 14, 2001
These are sorted by clan and family.

Rice FASTA file
This is the same as the contig collection except that the extra information has been stripped out to leave only a single identifier line with each sequence. Duplicate sequences are removed. The sequence order is not the same. This file can be used with the Do-It-Yourself blast search.

Here is an easy way to compare a new blast search output against an alphabetical listing of accession numbers to identify new hits.

1. Open your file of alphabetized accession numbers in Word.
2. Select and copy only the accession number list from the blast output and place it at the top of the accession number file.
3. Change it to Courier 9 point font.
4. Select the blast output list and color it red.
5. Using the replace command, replace all gb, emb and dbj occurrences with nothing. This deletes them from the description list in the blast output and aligns most of the accession numbers directly above one another. Those with 6 or 7 digit gi numbers will need to be adjusted by hand so the accession numbers line up.
6. Holding down the alt key and the mouse button (on a Mac), select the vertical block of text that precedes the accession numbers (the gi numbers and some vertical bars ||) and delete this block. That leaves the accession numbers flush against the left edge.
7. Select the entire content of the file with apple key + A (on a Mac).
8. Use the sort command from the table menu on the tool bar and sort all the accession numbers. You may have to do this twice for some unknown reason.

This interleaves the accession numbers from the blast output (in red) with the accession numbers in your list (in black), and you can visually compare them in a minute or two to identify new hits.
This whole process takes only a couple of minutes and saves lots of time. I have done it with about 100 new accession numbers compared to about 700 old numbers.
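The same comparison can also be scripted instead of done by hand in Word. The sketch below is one possible approach, not the procedure actually used for this inventory. It assumes blast description lines in the gi|number|gb|accession| style described above, and the accession numbers in the example are made-up placeholders. The names accessions_from_blast and new_hits are hypothetical helpers.

```python
import re

# Matches the database tag (gb, emb or dbj), captures the accession number,
# and skips an optional version suffix such as ".1".
ACCESSION = re.compile(r"(?:gb|emb|dbj)\|([A-Z0-9_]+)(?:\.\d+)?\|")


def accessions_from_blast(description_lines):
    """Collect the accession numbers found in blast description lines."""
    found = set()
    for line in description_lines:
        match = ACCESSION.search(line)
        if match:
            found.add(match.group(1))
    return found


def new_hits(description_lines, known_accessions):
    """Return, sorted, the accession numbers not yet in the known list."""
    known = {acc.strip().upper() for acc in known_accessions if acc.strip()}
    return sorted(accessions_from_blast(description_lines) - known)


# Placeholder data: these are not real Genbank entries.
example_output = [
    "gi|0000001|gb|AP000001.1|AP000001 hypothetical rice clone",
    "gi|00000002|dbj|AU000002|AU000002 hypothetical rice EST",
    "gi|000003|emb|AJ000003.2|OSA000003 hypothetical rice mRNA",
]
known_list = ["AP000001", "AQ000004"]

print(new_hits(example_output, known_list))  # ['AJ000003', 'AU000002']
```

Using a set difference avoids the manual alignment and sorting steps. It assumes the known list holds accession numbers without version suffixes.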