From Chlamydomonas to rice: the evolution of green P450s.
LA P450 Diversity Meeting 
August 21, 2002
David Nelson

[slide 1 meeting Logo session title] (no link)
Genomic sequencing is creating raw sequence data at a tremendous pace.  
The number of eukaryotic genomes is leaving the single digit range and with 
that come many new P450 sequences.  [slide 2 homepage] My website is 
dedicated to cytochrome P450 nomenclature and evolution.  Since the 
nomenclature is based on sequence relatedness, the names mean something 
and are useful for rapid comparison across species, across phyla and 
sometimes across Kingdoms.  In Japan last month I talked about human and 
Fugu P450s.  Today the title of my talk is From Chlamydomonas to rice: the 
evolution of green P450s.  I will begin with Chlamydomonas, move on to 
rice and in the process we will run into Arabidopsis, but I will let Soren Bak 
do the honors for Arabidopsis.  [slide 3 chalmy, rice, Arab] (no link)  I chose 
Chlamydomonas because all green plants evolved from a single-celled green 
alga over 430 million years ago.  For perspective, tetrapods (that's us) 
diverged from ray-finned fishes about 420 million years ago, so fish and 
humans are about as far apart as green algae and rice.  If we want to 
understand the origins of plant P450s, we must look at green algae and 
Chlamydomonas just happens to have a genome project and an EST project.  
[slide 4 JGI chlamy]  I should say that there are several other labs involved 
in this, not just JGI. The ChlamyEST database is located at Duke and there 
was an EST project at Kazusa Japan, and some BAC clones were being 
sequenced at the University of Oklahoma.  

There are some detectable P450s in Chlamydomonas, but the whole genome 
is not yet searchable.  What I could find from searches of the ESTs, BAC 
ends and sequences in Genbank were 15 sequence fragments from P450s, 
but no full length sequences.  Looking at the C-terminal half only, there are 
at least seven P450s in Chlamydomonas, but this is a minimum count, since 
some of the N-terminal sequences may belong to other P450s.  This number 
is bound to go up, but probably not by a factor of 10.  So Chlamydomonas 
has a dozen or at most a few dozen P450s, not anything like Arabidopsis.  

What is there?  CYP51 is present as expected for a free living eukaryote.  If 
diet supplies the needed sterols, some eukaryotes have lost the CYP51.  It is 
not found in C. elegans, Drosophila or more recently Ciona (the sea squirt). 
However, it is not possible for plants to lose CYP51 because they have to 
make everything from scratch.  In addition to CYP51, CYP97 is present.  
There are both 97A and 97B-like sequences that are also seen in 
Arabidopsis.  The last recognizable sequence is CYP711, named in 
Arabidopsis.  [slide 5 711 align] (no link)  We really do not know what the CYP97 
and CYP711 sequences are doing, but they are ancient versions of the P450 
enzyme.  CYP711's best animal match is CYP5 or thromboxane A2 
synthase, which is closely related to CYP3, so 711 may share a common 
ancestor with the CYP3 family.  CYP97 clusters near the CYP4 clan in 
animals, so it might be related to an ancestor to CYP4. CYP97 has also been 
seen in diatoms which are Stramenopile protists, so CYP97 predated the 
divergence of the crown group of eukaryotes.  Because these sequences 
evolved so early, they are probably doing some fundamental biochemistry 
present even in single celled eukaryotes.  They would be important targets 
for analysis.  

The other CYPs in Chlamydomonas are too different or too short to 
recognize by family.  We will have to wait for the genome sequence 
promised this summer.  

But we do not have to wait for rice.  [slide 6 rice terraces] (no link) The rice genome has 
been sequenced by four projects:  Syngenta in Switzerland, Beijing Genomics 
Institute in China, Monsanto and the Rice Genome Project international 
consortium in Japan.  Syngenta is private so you have to sign legal 
agreements to see the data.  Monsanto has a similar arrangement, but they 
have given their data to the Japan consortium, where it is being released 
slowly as finished contigs. The Beijing group published in Science in April 
and their data is blast searchable at NCBI.  They are more complete than the 
public project, but the reads are shotgun assemblies and there are a few 
problems associated with that.  The public project should reach phase 2 high 
quality draft sequence for the whole genome by the end of this year.  

I have spent many hours assembling and annotating the P450s from rice.  As 
of this past Sunday morning I was able to get the sequences sorted in family 
bins and named.  There is still some work to be done to complete all the 
genes and finalize the naming, but I do have a gene count.  You may think I 
am joking, but I assure you I am not.  There are approximately 452 P450 
genes in rice.  Some of these are only named to the family level and they still 
need to be broken down into subfamilies, but that will follow soon.  I 
remember speaking at the 3rd Biodiversity meeting in Woods Hole in 1995 
where the title of my talk was 450 cytochrome P450s.  That was all that was 
known then.  Now we have that many in a single genome.  

The last count [slide 7 stats]  of named plant P450s, from May 16, showed 
607 named sequences (not counting alleles).  The total for all species was 
1925 named genes, so plants represented 32%.  Now with the addition of 
names for the rice P450s there will be 452 more named plant sequences for a 
total of 1059 from plants and 2377 total.  Plants are now 45% of the total.  
Also please note that all these (452/2377) genes were named in the last 
month. That is about 20% of all named P450s.
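
The percentages quoted above can be rechecked with a few lines of arithmetic.  This is a minimal sketch that uses only the counts given in the talk; the variable and function names are illustrative assumptions.

```python
# Counts quoted in the talk (named sequences, not counting alleles).
plant_before = 607    # named plant P450s as of May 16
total_before = 1925   # named P450s from all species as of May 16
rice_new = 452        # newly named rice P450s

plant_after = plant_before + rice_new
total_after = total_before + rice_new


def percent(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


share_before = percent(plant_before, total_before)  # plant share before rice
share_after = percent(plant_after, total_after)     # plant share after rice
share_rice = percent(rice_new, total_after)         # rice alone, of all named P450s

print(plant_after, total_after)
print(f"{share_before:.1f}% {share_after:.1f}% {share_rice:.1f}%")
```

The totals come out to 1059 and 2377 as stated, and the plant shares round to 32% and 45%.  The rice share works out to roughly 19%, which fits the "about 20%" figure.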

How were these sequences found and how accurate is the count?  The 
sequences were found by blast searching.  To cover the P450 sequence space 
at least one sequence from each P450 clan was used.  [slide 8 clans] There 
are 10 P450 clans in plants that include 53 P450 families.  The first sequence 
from a P450 clan usually gives new hits not seen with other P450 sequences.  
The percent identity detectable in a single search is fairly low so sequences 
can be detected from other plant clans.  More than one sequence was used 
from the larger clans, and the number of new hits recorded was monitored 
with each search to see if additional searches were needed.  Toward the end 
of the search process each new search only returned 4-6 new accession 
numbers and these were all short fragments, often pseudogene pieces.  This 
process was carried out first on the japonica data in Genbank.  After April 5, 
the same process was done on the indica subspecies data.  
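
The idea of watching the number of new hits fall off with each search can be sketched as below.  This is only an illustration under stated assumptions: the query names and accession IDs are invented, the function names are hypothetical, and the `window` parameter is an assumption.  The default `threshold` of 6 simply echoes the 4-6 new accessions per search mentioned above.

```python
def new_hits_per_search(search_results):
    """Count, for each search in order, the accessions not seen in earlier searches.

    search_results: list of (query_name, iterable_of_accession_ids) pairs.
    Returns a list of (query_name, number_of_new_accessions).
    """
    seen = set()
    report = []
    for query, hits in search_results:
        hits = set(hits)
        fresh = hits - seen          # accessions this query found first
        report.append((query, len(fresh)))
        seen |= hits
    return report


def is_saturated(report, threshold=6, window=2):
    """True if each of the last `window` searches added at most `threshold` new hits."""
    if len(report) < window:
        return False
    return all(count <= threshold for _, count in report[-window:])


# Invented example data: three queries with overlapping hit lists.
example = [
    ("clan_query_1", ["acc1", "acc2", "acc3", "acc4"]),
    ("clan_query_2", ["acc3", "acc4", "acc5"]),
    ("clan_query_3", ["acc5", "acc1"]),
]
report = new_hits_per_search(example)
print(report)
print(is_saturated(report, threshold=1))
```

A check like this only says that further queries of the same kind are unlikely to add much.  It does not prove that every P450 has been found.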

The sequences, initially from japonica, and later from indica were placed on 
a P450 blast server at the University of Tennessee [slide 9 blast server] .  
This server was set up by Rob Edwards, a Linux and bioinformatics 
aficionado in our Department.  When doing this type of genome work, I often 
update the sequences on the Blast server on an hourly basis so that the new 
sequences are searchable.  Here you can search 14 different sets of 
P450s from 13 species.  From this server, indica sequences were compared 
against japonica to find orthologs.  [slide 10 orthologs table] Here is a 
sample of the comparison between the P450s.  The first column is the indica 
accession number.  The letters a, b, c after the accession indicate multiple 
genes on a single accession in order of appearance.  $FI is shorthand for 
full length 
indica sequence.  I use the $ because it is a unique character that can be 
counted in a word processor very easily.  So the number of $FI = the number 
of full length indica sequences.  Orth means ortholog followed by the 
accession number of the ortholog and its status.  i means incomplete and $PI 
means pseudogene.  
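
The tag-counting trick works the same way outside a word processor.  Here is a minimal sketch, assuming the table is available as plain text.  The sample rows and accession numbers are made-up placeholders that only loosely imitate the layout described above, and `count_tags` is a hypothetical helper name.

```python
def count_tags(table_text, tags=("$FI", "$PI")):
    """Count literal occurrences of each status tag in the table text."""
    return {tag: table_text.count(tag) for tag in tags}


# Made-up rows; the accession numbers are placeholders, not real entries.
sample = """\
ACC0001a $FI orth ACC9001 i
ACC0001b $PI
ACC0002 $FI
"""

counts = count_tags(sample)
print(counts)
```

Because the $ prefix does not occur in ordinary accession numbers or annotation words, a plain substring count is enough and no parsing of the columns is needed.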

By making this type of table it was possible to tell if a sequence was 
complete and if it had an ortholog.  If one sequence was incomplete and its 
ortholog was complete, the ortholog could be used for additional blast 
searches to find the missing sequence.  It was also possible in some cases to 
construct a complete hybrid sequence from two incomplete orthologs.  Since 
they are 99% identical this can be used in tree building.  

The sequence discovery phase resulted in 545 accession numbers for the 
indica P450s and 762 japonica P450 sequences or fragments on 628 
accession numbers.  The sequences from japonica were done earlier and they 
were assembled into contigs by blast searching all fragments against the 
others to find overlaps.  This resulted in 341 japonica sequence contigs and 
452 total sequences from both subspecies.  
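
Grouping fragments into contigs from pairwise overlaps can be pictured as finding connected groups.  The sketch below is a simplification, not the procedure actually used for the rice data.  It assumes the overlaps have already been decided (for example from all-against-all blast hits), uses invented fragment names, and only groups fragments without merging their sequences.

```python
def group_into_contigs(fragments, overlaps):
    """Group fragments that are linked, directly or indirectly, by overlaps.

    fragments: iterable of fragment names.
    overlaps: iterable of (name_a, name_b) pairs judged to overlap.
    Returns a sorted list of sorted groups.
    """
    fragments = list(fragments)
    parent = {f: f for f in fragments}

    def find(x):
        # Follow parent links to the group representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in overlaps:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a   # merge the two groups

    groups = {}
    for f in fragments:
        groups.setdefault(find(f), []).append(f)
    return sorted(sorted(g) for g in groups.values())


# Invented example: six fragments, three overlap pairs.
contigs = group_into_contigs(
    ["f1", "f2", "f3", "f4", "f5", "f6"],
    [("f1", "f2"), ("f2", "f3"), ("f4", "f5")],
)
print(contigs)
```

Fragments with no overlaps remain as single-member groups, which corresponds to sequences that could not be extended.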

To name the sequences I did blast searches with one named member from 
each P450 family and in the cases of large families like CYP71, I used 
members from different subfamilies.  I tried to use rice, sorghum, corn or 
wheat sequences when available.  This identified the sequences to the most 
similar family, but not always to a known subfamily.  The sequences were 
then put in family bins and in some cases subfamily bins.  At this step I used 
a fairly strict cutoff of about 45-46% identity.  Starting with 543 indica 
sequences, after searching these with all 53 plant families, I only had 267 
identified to family and 276 not assigned.  That was a lot of unassigned 
sequences, so to get an idea what was going on I decided to make a tree of 
the full length or near full length sequences, using ClustalW and PHYLIP.  I 
went through the set of 276 and deleted all partials that would affect the tree 
making process and kept about 100 sequences.  After a first pass to remove 
sequences that were behaving in odd ways, the second generation tree is 
shown here  [slide 11 of 90 rice tree].  These were all unnamed rice 
sequences.
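
The binning step that came before the tree building can be sketched as a simple threshold test on the best blast hit.  This is a minimal illustration, assuming each sequence's best-matching named family and percent identity have already been extracted from the blast output.  The sequence IDs and identity values are invented, and the 45% default reflects the approximate cutoff mentioned above.

```python
def bin_by_family(best_hits, cutoff=45.0):
    """Place sequences in family bins when the best hit meets the identity cutoff.

    best_hits: dict mapping sequence id -> (family_name, percent_identity).
    Returns (bins, unassigned), where bins maps family -> list of sequence ids.
    """
    bins = {}
    unassigned = []
    for seq_id, (family, identity) in best_hits.items():
        if identity >= cutoff:
            bins.setdefault(family, []).append(seq_id)
        else:
            unassigned.append(seq_id)   # left for tree-based placement
    return bins, unassigned


# Invented example values.
hits = {
    "seq1": ("CYP71", 52.0),
    "seq2": ("CYP71", 38.5),
    "seq3": ("CYP86", 47.1),
    "seq4": ("CYP76", 44.9),
}
bins, unassigned = bin_by_family(hits)
print(bins)
print(unassigned)
```

A strict cutoff like this leaves many sequences unassigned, which matches the 267 assigned versus 276 unassigned split reported above.  That is why a tree was needed for the remainder.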

The sequences are clustered in three groups.  The top is the 86 clan with 
CYP86, 94, 96 and 704 members.  The bottom is the 85 clan and in between 
is one large group.  All these sequences were unnamed and I thought I had 
come across a unique rice set of P450s not belonging to the known families.  
That turned out not to be true.  [Slide 12 plant group A] Here is a section of 
a larger tree showing the CYP85, CYP86 and plant group A clans. After 
doing another tree with one member from each of the families in the plant 
group A added I found the following result.  [slide 13 ricetree90 ] The 
vertical red bars represent the plant group A sequences.  The rest are all from 
rice.  By looking at this tree it was possible to place all the rice sequences 
except for the top two and the bottom six, in existing families, which I could 
not do before by using the blast results.  Most of the new sequences fell in 
the labeled families.  36 were in CYP71, 11 were in CYP76, 11 more fell in 
the 86 clan (at the top).  Only three new families were created, the very top 
branch which may belong to the 86 clan, and the two lower deep branches 
that belong in the 85 clan.

Based on this tree and some more detailed trees of smaller regions like 
the 71 family by itself,  [slide 14] the 76 family  [slide 15] and the 86 clan, 
 [slide 16] names were assigned.  Remember that I only used 90 unidentified 
sequences to start this process and some were removed for tree building 
reasons.  That left about 200 shorter sequences still to be identified.  
However, after all the new subfamilies were created by this tree building 
process the 200 could be blast searched against the full length sequences in 
the tree to get an ID.  There were 16 new subfamilies created in the 71 
family.  This second series of blast searches identified all but 10 sequences 
down to the family level.  Three new families were created in the process for 
a total of 6 new rice P450 families.  Three of these are in the CYP85 clan.

The ten unnamed sequences are still too short and need completion of the 
genome to finish them or identify them as pseudogenes.  293 of the rice 
genes are full length either in indica or japonica.  I will be making a 
complete set of these available for download and for blast searching on the 
home page.  

I would like to finish by making a comparison to Arabidopsis, since I am 
sure that this is going through your minds now.  What I will do is show you 
tables of the P450 families with the number of genes given in rice and 
Arabidopsis.  That way expanded families will be obvious.  Dramatic 
differences are highlighted in red. (these last tables are only on the laptop right now)

The CYP51 family in rice has expanded to 13 sequences compared to 2 in 
Arabidopsis.  These will require the introduction of subfamilies in CYP51, 
something I have been avoiding in the past.  Rice has greatly expanded the 
71 family with 111 vs 54 sequences. These form 16 new subfamilies in the 
71 family.  CYP76 is also expanded.  Arabidopsis is not always the lesser in 
these comparisons.  Arabidopsis has more CYP79 sequences and 5 CYP82s 
where rice has none.  The CYP85 clan includes 87 and 90.  These two 
families have significantly more sequences in rice than in Arabidopsis.  Two 
new families appear in rice, CYP728 and CYP729, that belong in this clan.  Rice 
appears to be doing a lot of biochemistry with these related enzymes.  At 
least CYP85 and CYP90 are involved in brassinosteroid metabolism.  The 
86 clan is expanded in rice but only in the CYP94 family; CYP86, 96 and 
704 are nearly the same.  

Note that 92 has 16 members in rice but 0 in Arabidopsis.  92 is seen in 
Tobacco, so it will be curious to know what happened to this family in 
Arabidopsis.  On the flip side of that coin, Arabidopsis has 8 CYP702 and 33 CYP705 
members while rice has none.   

38 P450 families are seen in both plants, so these families existed before the 
monocots diverged from dicots 150 MYA.  Two others are probably 
nomenclature errors: CYP99 clusters inside the CYP71s now and CYP712 is 
probably a CYP705 subfamily.  This suggests that most of the present day 
P450 families were in place early on in the history of flowering plants.

My last slide is this plain picture of a bowl of rice.  To borrow from eastern 
philosophy, this picture of rice is not satisfying, not like eating the actual 
bowl of rice.  Genomes of rice or any other species are like pictures of a 
bowl of rice.  They are not satisfying until experienced by annotation.  So go 
get your chopsticks.