Comparative genomics of cytochrome P450 between human and Fugu (Japanese pufferfish)
This is a web version of a poster presented at a Tennessee Bioinformatics meeting held at Paris Landing State Park March 22-23, 2002
David R. Nelson, Dept. of Molecular Sciences
University of Tennessee Health Sciences Center, Memphis TN 38163
In 1958, cytochrome P450 was discovered by spectroscopy of turbid suspensions of rat liver microsomes. The name comes from the strong absorbance at 450 nm seen in a reduced CO difference spectrum (pigment at 450 nm). The 450 nm peak is unusual for heme proteins. It is caused by a thiolate anion (S-) from cysteine acting as the fifth ligand to the heme iron. In the 1960s, experiments showed that the P450 protein could be induced in rat liver by as much as 50-fold by various agents such as phenobarbital. SDS gels showed that there was more than one form of the enzyme, and the push to purify these cytochromes began. During the 1970s and 1980s, more and more P450s were found in rats, rabbits, mice and humans. At one point it was estimated that there might be as many as 100 different genes in mammals. This was a slight overestimate. The human genome sequence contains 57 probable functional P450s in 18 families, though the substrates are not yet known for all of them. Rats and mice have more than humans as a result of expansion of certain families and subfamilies. Since P450s are involved in steroid biosynthesis, retinoic acid metabolism, prostaglandin metabolism and other lipid pathways like cholesterol biosynthesis, their evolution may parallel the evolution of new signaling pathways in animals and plants. They may be key enzymes in the circuitry of developmental pathways. They are also linked to several human diseases. By following the evolution of P450 back to the first eukaryotic cell, it may be possible to link evolutionary innovation to the appearance of new classes of P450 at the family level.
The P450 superfamily exists in vertebrates other than mammals. In fact, the superfamily is ubiquitous, being found in every branch on the tree of life. The complete genome of the fruit fly showed 90 P450 genes (86 functional and 4 pseudogenes). C. elegans had a similar number (76 functional genes and 4 pseudogenes). However, the conservation of sequence at the family level between vertebrates and invertebrates is low. Questions about the evolutionary history of P450 in vertebrates required new genomes from outside the mammals. The recent assembly of the Fugu genome (Oct. 26, 2001) has provided a partial set of P450s from a fish. These genes are not all complete and some may be missing, but the genome does give a first view into vertebrate P450 evolution on a 400 million year time scale. Further analysis of the 14X coverage of the sea squirt (Ciona savignyi) will help to fill in gaps in our knowledge about when P450 families appeared and when they expanded. Comparison of the upstream regions of orthologous genes may reveal conserved promoter elements and help to clarify the regulation of these genes.
It is a fairly simple matter to do a BLAST search and find hits to a P450 sequence in a database. It is not so simple to find all members of a large gene family in a genome. A systematic search procedure needs to be implemented. With the Fugu genome, a BLAST server (the Fugu and zebrafish BLAST server) was already available to search several data sets:
Scaffolds (assembled genomic sequence from BACs and cosmids): 20,862 sequences, 309 Mbases
Cosmids: 47,048 sequences, 22 Mbases
Cosmid ends: 82,271 sequences, 82 Mbases
ESTs (cDNA): 3,208 sequences, 1.2 Mbases
All Fugu (sum of databases): 153,389 sequences, 416 Mbases
It was necessary to find all P450-containing fragments from all of these databases. An assumption was made that fish P450s would fall in similar family groups (greater than 40% sequence identity) to mammalian members of the same family. Therefore, it would only be necessary to search the All Fugu database with 18 sequences (one from each mammalian family) to find all members of P450 in Fugu. Some mammalian subfamilies are nearly distinct families (they are on the border of the 40% definition), so these subfamilies were also included in the search. Since positive hits were often in the mid 20% range and could still be identified as valid P450s, this strategy was deemed sufficient to pick up most vertebrate P450 members, even if they were in a new family not seen in mammals. It should also be able to detect a wide variety of pseudogenes.
Phase 1 BLAST Searching
The process of finding all members of the P450 family in Fugu can be divided into three stages. The first is the identification of all accession numbers in the dataset that have P450s or parts of P450s on them. The BLAST searches are done with each member of the query set and the output is examined to see which hits are legitimate. The accidental matches are thrown out. An expect value of 10 is used in this process. This will give some false matches, but there are true matches that are found very close to the expect value of 10. They are usually from the middle region of the P450s, which is poorly conserved. For automation of this process, it might be possible to make the expect value 1 instead of 10. This would make the false positives very rare, but it would miss some true hits. The rationale for doing this would be that the true hits would have an expect value lower than 1 when another query sequence was used.
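The rationale for a stricter cutoff can be illustrated with a minimal Python sketch. It assumes the search results have been exported as 12-column tab-separated lines (query, subject, percent identity, ..., expect value in column 11, bit score in column 12), which is not necessarily the output format of the server used here. The query names are human P450s, but the accession names and scores are made up for illustration. An accession is kept if its best expect value over all queries is at or below the cutoff, so a fragment that scores 6.5 against one query is still retained at a cutoff of 1 when another query matches it well.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    accession: str
    percent_identity: float
    evalue: float


def parse_tabular_hits(text):
    """Parse 12-column tab-separated hit lines (assumed export format)."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        hits.append(Hit(fields[0], fields[1], float(fields[2]), float(fields[10])))
    return hits


def best_evalue_per_accession(hits):
    """Return the lowest expect value seen for each accession over all queries."""
    best = {}
    for hit in hits:
        if hit.accession not in best or hit.evalue < best[hit.accession]:
            best[hit.accession] = hit.evalue
    return best


def accessions_passing(hits, cutoff):
    """Accessions whose best expect value is at or below the cutoff."""
    return {acc for acc, e in best_evalue_per_accession(hits).items() if e <= cutoff}


# Hypothetical example data: scaffold_B is a weak hit for CYP1A1
# but a strong hit for CYP2R1; scaffold_C is only ever a weak hit.
SAMPLE = "\n".join([
    "CYP1A1\tscaffold_A\t45.0\t200\t110\t2\t1\t200\t1\t600\t1e-40\t160",
    "CYP1A1\tscaffold_B\t24.0\t90\t68\t3\t250\t340\t10\t280\t6.5\t30",
    "CYP2R1\tscaffold_B\t31.0\t120\t83\t2\t240\t360\t1\t360\t2e-05\t48",
    "CYP1A1\tscaffold_C\t22.0\t40\t31\t1\t10\t50\t5\t125\t9.0\t28",
])

hits = parse_tabular_hits(SAMPLE)
strict = accessions_passing(hits, 1.0)       # candidates for automatic acceptance
permissive = accessions_passing(hits, 10.0)  # candidates for manual inspection
needs_review = permissive - strict
```

In this sketch, accessions that pass only the permissive cutoff are set aside for the manual inspection described above rather than discarded.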
Once the first search has been done and all false hits have been discarded, a file is made of the accession numbers for the true hits. This is sorted alphanumerically and saved for comparison to the next search. Search two is done and the same procedure is followed. The two lists of accession numbers are then compared, and any duplicate hits are deleted from the list. In an automated version of this process, it might be useful to keep a tally of the number of times an accession number is found and the percent identity of the BLAST result for the best match. The rare hits might be in new families. The process is repeated with all members of the query set. In an automated version, there is no reason why every human sequence could not be used instead of the 18-20 used in the manual search. This would be more comprehensive and might pick up a few more accession numbers that would otherwise be missed.
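The tally suggested for an automated version could look something like the following Python sketch. It is not the procedure that was used (the manual search compared sorted lists by hand). The input layout, the class name HitTally and the accession names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class HitTally:
    count: int = 0               # number of query searches that found this accession
    best_identity: float = 0.0   # highest percent identity seen
    best_query: str = ""         # query that gave the best match


def tally_hits(searches):
    """searches maps a query name to a list of (accession, percent identity) true hits."""
    tally = {}
    for query, hits in searches.items():
        # Count each accession once per search, even if it was hit several times.
        for accession in {acc for acc, _ in hits}:
            tally.setdefault(accession, HitTally()).count += 1
        for accession, identity in hits:
            entry = tally[accession]
            if identity > entry.best_identity:
                entry.best_identity = identity
                entry.best_query = query
    return tally


def rare_accessions(tally, max_count=1):
    """Accessions found by very few queries; these might belong to new families."""
    return sorted(acc for acc, t in tally.items() if t.count <= max_count)


# Hypothetical results from three query searches.
EXAMPLE_SEARCHES = {
    "CYP1A1": [("acc_001", 45.0), ("acc_002", 26.0)],
    "CYP2R1": [("acc_001", 31.0), ("acc_003", 52.0)],
    "CYP3A4": [("acc_001", 28.0)],
}

example_tally = tally_hits(EXAMPLE_SEARCHES)
merged_accessions = sorted(example_tally)  # duplicate-free, alphanumerically sorted list
```

Keeping the merged list as the sorted keys of a dictionary removes duplicates automatically, which replaces the list-by-list comparison of the manual procedure.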
Phase 2 Sorting into gene families and individual genes
Rob Edwards has set up a BLAST server on a Linux machine in the Bioinformatics suite that holds all P450 members from 12 different species, including human, rat and mouse. This is a curated dataset that is non-redundant and comprehensive. As new members are found, they are added to these databases. The server is advertised on the Cytochrome P450 Homepage and is available to the world as a service. The BLAST search results from phase 1 are compared to the complete human set of P450s to identify the best match. This is the process of family or subfamily identification for each accession number. This procedure can be shortened somewhat by early identification of multiple members of the same gene. These accession numbers do not have to be searched again. The phase 1 searches yielded 332 accession numbers. These were sorted into 17 gene families and numerous subfamilies by BLAST searching against the human set.
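The family identification step can be sketched in Python as follows. This is an illustration under stated assumptions, not the curated procedure: it takes the best human match for one accession, reads the family and subfamily from the CYP name, and flags whether the identity exceeds the 40% family definition mentioned earlier. The function names and the example identities are hypothetical, and percent identity from a short local alignment is only a rough stand-in for the full-length identity on which the nomenclature is based.

```python
import re

# CYP names are "CYP" + family number + subfamily letter(s) + gene number,
# for example CYP2R1 is family CYP2, subfamily CYP2R.
CYP_NAME = re.compile(r"^CYP(\d+)([A-Z]+)")
FAMILY_CUTOFF = 40.0  # percent identity; the family definition used in the text


def family_and_subfamily(name):
    """Split a CYP gene name into its family and subfamily names."""
    match = CYP_NAME.match(name.upper())
    if match is None:
        raise ValueError(f"not a CYP name: {name!r}")
    family = "CYP" + match.group(1)
    return family, family + match.group(2)


def assign_family(matches):
    """matches is a list of (human P450 name, percent identity) for one accession."""
    if not matches:
        return None
    best_name, best_identity = max(matches, key=lambda m: m[1])
    family, subfamily = family_and_subfamily(best_name)
    return {
        "best_match": best_name,
        "family": family,
        "subfamily": subfamily,
        "percent_identity": best_identity,
        # False means the hit deserves a closer look: it may be a fragment
        # from a poorly conserved region or a member of a new family.
        "meets_family_cutoff": best_identity > FAMILY_CUTOFF,
    }
```

For example, an accession whose best match is CYP2R1 at 62% identity would be placed in family CYP2, while one whose best match is CYP7B1 at 33% would be reported under CYP7 with the cutoff flag set to False.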
Individual genes were assembled into gene bins for later assembly. These often had multiple exons, but these were not yet assembled at the level of GT-AG boundaries. The exons were put in the order in which they occurred in the genomic DNA, and these rough gene translations were put on the P450 BLAST server. To identify all unique protein contigs from this data, each sequence was BLAST searched against all other P450 protein sequences from Fugu, and overlapping pieces were sorted into the same gene bins. This process reduced the 332 accessions to 75 non-overlapping gene contigs.
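Sorting overlapping pieces into bins is a connected-components problem, and one way to automate it is with a small union-find structure. The Python sketch below assumes that the all-against-all search has already been reduced to a list of accession pairs judged to overlap. How an overlap is judged is left outside the sketch, and the accession names in the usage note are hypothetical.

```python
def bin_accessions(accessions, overlapping_pairs):
    """Group accessions into gene bins: any two accessions linked by a chain
    of overlaps end up in the same bin."""
    parent = {acc: acc for acc in accessions}

    def find(x):
        # Follow parent links to the bin representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in overlapping_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two bins

    bins = {}
    for acc in accessions:
        bins.setdefault(find(acc), []).append(acc)
    # Sort within and between bins so the output is reproducible.
    return sorted(sorted(members) for members in bins.values())
```

With accessions a through f and overlaps (a, b), (b, c) and (d, e), this gives three bins: a, b and c together, d and e together, and f alone. Accession c joins a's bin even though the two were never directly compared as overlapping, which matches the way partial gene pieces chain together across scaffolds.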
Phase 3 Assembly
The genes were still not assembled to identify the intron-exon boundaries. This step required comparison to mammalian gene models and, as the process produced complete fish genes, to fish gene models. Many of the exon boundaries were in the same place in humans and Fugu. The phase was also the same at these boundaries. That made the process of gene assembly easier. Several gene clusters were found on the same scaffold and these genes tended to be highly similar, making assembly by comparison a possibility. The few pseudogenes that were found were recognized as pseudogenes by multiple frameshifts, in-frame stop codons and missing exons.
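Some of the checks used during assembly are simple enough to sketch in Python. The helper functions below are hypothetical illustrations, not the tools that were used. They test whether an intron has the canonical GT-AG ends, compute the phase of each intron from the lengths of the coding exons (phase 0 falls between codons, phase 1 after the first base of a codon, phase 2 after the second), and list in-frame stop codons in an assembled coding sequence. Frameshifts and missing exons are not detected by this sketch.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_gt_ag(intron):
    """True if the intron starts with GT and ends with AG."""
    seq = intron.upper()
    return len(seq) >= 4 and seq.startswith("GT") and seq.endswith("AG")


def intron_phases(exon_lengths):
    """Phase of the intron following each exon except the last.
    exon_lengths are coding lengths in bases, in gene order."""
    phases = []
    coding_so_far = 0
    for length in exon_lengths[:-1]:
        coding_so_far += length
        phases.append(coding_so_far % 3)
    return phases


def in_frame_stops(cds):
    """Start positions of stop codons in frame 0, skipping the last
    complete codon, which is where the normal stop codon is expected."""
    seq = cds.upper()
    end_of_last_codon = len(seq) - len(seq) % 3
    return [
        i
        for i in range(0, end_of_last_codon - 3, 3)
        if seq[i:i + 3] in STOP_CODONS
    ]
```

Comparing the output of intron_phases for a Fugu gene model with that of its human ortholog is one way to confirm that boundaries fall in the same place and phase. A non-empty result from in_frame_stops marks a candidate pseudogene or an assembly error that needs inspection.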
Results and Discussion
The cytochrome P450 set from Fugu is not yet complete. The genome project claims about 90% coverage of the genome in its assembly. However, judging from the number of incomplete P450 genes found (about half), this may be an optimistic estimate. There are 35 full-length genes and 12 more that are missing only a small portion. There are 29 partials. Some of these partials are pseudogenes and they are as complete as they can get. Some of the incomplete genes had orthologs in zebrafish or other fish like medaka or trout. The best match has been added on to the assembled genes in lower case to indicate that it is not from Fugu. These chimeric genes were used to search once more in Fugu for the missing pieces, but they have not been found. These parts of the genes are missing from the available data.
One family, CYP39, is missing in Fugu. This sequence is also missing from zebrafish and every other fish in GenBank, both in the EST database and in mRNA or genomic sequence. This gene is probably unique to mammals, or at least arose in the lineage leading to mammals after it diverged from fish. The sequence has not yet been found outside of mammals (for example, in birds or reptiles). The function is oxysterol 7 alpha-hydroxylase with a preference for 24-hydroxycholesterol (1). This gene provides an alternative pathway to the synthesis of 7 alpha-hydroxylated bile acids. The other gene that does this is another P450, CYP7B1, with very little sequence similarity. The two genes show sexually dimorphic expression, with CYP39 expressed at a higher level in females and CYP7B1 at a higher level in males.
Figure 1 shows a phylogenetic tree of 47 Fugu P450s compared with 60 human P450s (including three pseudogenes) and eight other fish P450s. Only full-length or nearly full-length protein sequences were used. Human sequences are in red, Fugu are in blue and other fish are in gray. The other fish sequences were included when there was no human ortholog or the fish sequence was a much better match. The sequence alignment is posted on the P450 Homepage under pufferfish.
As mentioned above, only one P450 family out of 18 was missing in Fugu and probably in fish in general. All 17 other P450 families have clear orthologs present. This means that the diversity of P450 families seen in mammals predated the tetrapod-ray finned fish divergence (about 420 million years ago (2)). For a simplified phylogenetic tree of animals see Figure 2.
Legend to Figure 2. The dates of divergence of the main animal lineages have been used to draw a tree showing the evolutionary history of animals. Mouse-human 96 MYA, zebrafish-Fugu 150 MYA (3), bird-reptile 222 MYA (5), mammal-bird 310 MYA (5), amphibians-amniotes 360 MYA, ray-finned fishes-tetrapods 420 MYA (2), vertebrates-Ciona = Cambrian explosion 530-544 MYA, chordates-echinoderms 600 MYA (4), protostomes-deuterostomes 670 MYA (4), Drosophila-C. elegans unknown but very deep, cnidarians-bilateria unknown but very deep.
The next branch on this tree after the ray-finned fishes is Amphioxus, a cephalochordate, followed by Ciona, a urochordate. The genome of Ciona has been sequenced and I will be searching for all P450 genes in this simple chordate to pursue the question of P450 evolution down the evolutionary scale. Since the P450 families in the Drosophila and C. elegans genomes are mostly different from the vertebrate families, the origin of most of the 17 vertebrate P450 families seen in Fugu must have occurred between 670 million years ago and 420 million years ago. These families should be found either in echinoderms or in the Ciona genome.
An overview of the relationships of the P450s found in human and Fugu is shown in Figure 3. 61 human sequences (57 functional genes and four pseudogenes) are linked to their orthologs in 73 Fugu contigs. Panel A shows the CYP2 family. This is the least conserved family, with most members falling in new subfamilies except CYP2R1 and 2U1, which show clear orthologous relationships. The functions of 2R1 and 2U1 are not known, but I suspect they act on endogenous substrates rather than exogenous substrates. The CYP2 family in general is involved in metabolism of foreign compounds, so it is not surprising that it is highly variable over 420 million years.
Panel B shows families 1, 3, 4, 5, 17 and 21. All six of these families are present in both species, but there are subfamilies that are new in each species, such as 1C and 3B in Fugu, and 4A, 4B, 4X and 4Z in human. Panel C shows nearly a one-to-one correspondence among the remaining 11 families. Here we may be seeing some remnants of a genome duplication event that took place in teleost fish but not in tetrapods. Fugu has two copies each of CYP19, CYP8A, CYP8B and CYP17 (the last shown in panel B). There are only single copies of these genes in mammals. The CYP19s are responsible for synthesis of estrogen from testosterone by aromatization of the A ring. Fugu has a brain form and an ovary form, showing that there could be specialization of the production and use of estrogen in these tissues.
1. Li-Hawkins, J. et al. (2000) Expression cloning of an oxysterol 7alpha-hydroxylase selective for 24-hydroxycholesterol. J. Biol. Chem. 275, 16543-16549.
2. Ahlberg, P. and Milner, A. (1994) The origin and early diversification of tetrapods. Nature 368, 507-514. (Tetrapods-ray-finned fishes 420 MYA)
3. Cantatore P, Roberti M, Pesole G, Ludovico A, Milella F, Gadaleta MN, Saccone C. (1994) Evolutionary analysis of cytochrome b sequences in some Perciformes: evidence for a slower rate of evolution than in mammals. J. Mol. Evol. 39, 589-597. (Zebrafish-Fugu 150 MYA)
4. Ayala FJ, Rzhetsky A, Ayala FJ. (1998) Origin of the metazoan phyla: molecular clocks confirm paleontological estimates. Proc Natl Acad Sci U S A. 95, 606-611. (Protostome-deuterostome 670 MYA, chordate-echinoderm 600 MYA)
5. Kumar, S. and Hedges, S.B. (1998) A molecular timescale for vertebrate evolution. Nature 392, 917-920. (Bird-mammal 310 MYA, bird-reptile 222 MYA)