The Yield of Information from the Human Genome
David Nelson, Sapporo, Japan July 26, 2002
Last modified July 18, 2PM.

It is a pleasure to be here on my first visit to Japan.  Many thanks to Dr. Kamataki, for inviting 
me, also to all the organizers who worked so hard to make this meeting a success.  I thought I 
would open with a picture [slide 1] from 
one of my favorite books: The Tale of Genji, about a mythical Japanese prince from 1000 
years ago.  This picture could be from one of the earliest scientific meetings as Prince Genji 
presents his paper to an attentive audience.

Two years ago, on the other side of the world in Stresa, I spoke on the new 
rough draft version of the human genome [slide 2 Science cover], just announced in June of 
that year.  
At that time only 21% of the human genome was in finished form and 66% was in 
draft condition, a less than optimal state.  [slide 3] The first genome assembly had 
about 120,000 gaps.  Today, the Human Genome Sequencing Center at the Baylor 
College of Medicine shows that 78% is in finished form.  Blast searches of the 
human genome assembly at NCBI give only 2044 sequences covering 2.9 billion 
base pairs, so most of the gaps have been filled in.  The genome has progressed pretty 
far in two years.

It is time to look at the human genome again and see what is new.  In Stresa I 
reported there were 53 full length P450 genes that probably made functional 
proteins.  I predicted there might be one or two more.  Today there are 57 
full length P450 genes, so I was off by a little bit.  The four new P450s are 
[slide 4] CYP2W1, CYP4A22, CYP20, 
CYP26C1.  These dates refer to when I became aware of the sequences, not their first 
appearance in the database.

2W1 is only 42% identical to CYP2D6, so it is a borderline CYP2 member.  However, it does 
have 9 exons with the same intron-exon boundaries as other mammalian CYP2s.  The 
exceptions to this gene structure are CYP2R1 and 2U1, which have only 5 exons.  The 2W1 
sequence is also seen in mouse, where it is 77% identical to human.  
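
For readers of this transcript who want to see what a figure like "42% identical" means in practice, here is a minimal Python sketch.  It assumes the two protein sequences have already been aligned (gaps written as "-") and that identity is counted only over columns where neither sequence has a gap.  Other conventions for the denominator exist, and the toy sequences are invented for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns that contain a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy aligned fragments (invented for illustration, not real CYP sequences).
seq_a = "MKT-LLVAGF"
seq_b = "MKSALLV-GF"
identity = percent_identity(seq_a, seq_b)
print(f"{identity:.1f}% identical")
```

Here 8 columns are compared and 7 match, so the sketch reports 87.5%.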

4A22 is 95% identical to 4A11.  The gene sequence for 4A22 was at first assumed to be the 
gene that matched the 4A11 mRNA sequences that have been known for a long time.  Three 
different gene sequences were identified for 4A22.  They all matched each other but were 
different from the mRNA for 4A11.  At this point it became clear that 4A22 was a new gene.  
No mRNA could be found that matched the 4A22 gene, and no gene could be found that 
matched the 4A11 mRNA.  That changed in April this year when the 4A11 gene [slide 5] 
was deposited in Genbank as accession AL731892.  The most recent revision to that accession 
was on June 27, less than one month ago, and the accession is now complete.  4A11 is the 
last human P450 gene discovered, even though the mRNA was known much earlier.

 [slide 6] CYP20 was found in human 
pheochromocytoma cells by the Chinese National Human Genome Center at Shanghai about 
two years ago.  It is the most recent new vertebrate P450 family to be discovered.  I did not 
find it by blast searching for new P450s, because it has only 23% sequence identity to 
CYP3A4 over 437 aa.  CYP20 is 27% identical to a sponge P450.  The heme signature is 
short by one amino acid, but the RYG, EXXR, WXXP and PERF motifs are present.  CYP20 
does not have the usual conserved I-helix motif AGX(D,E)T, so its substrate may carry its own 
oxygen.  The ortholog is found in cow (90% identical), mouse (82%) and Fugu (59%), so it is 
more than 420 million years old.
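
The motif notation used here (X for any residue, alternatives in parentheses) maps directly onto regular expressions.  The following is a minimal sketch of such a motif check, not the procedure actually used for CYP20.  The pattern strings are my transcription of the motif names, and the toy sequence is invented so that it carries EXXR, WXXP and PERF but lacks the I-helix motif.

```python
import re

# Motifs named in the talk, written as regular expressions (X = any residue).
MOTIFS = {
    "EXXR": r"E..R",
    "WXXP": r"W..P",
    "PERF": r"PERF",
    "I-helix AGX(D,E)T": r"AG.[DE]T",
}


def find_motifs(protein: str) -> dict:
    """Return the 0-based start positions of each motif in a protein sequence."""
    protein = protein.upper()
    return {
        name: [m.start() for m in re.finditer(pattern, protein)]
        for name, pattern in MOTIFS.items()
    }


# Invented toy sequence (not CYP20).
toy = "MLSAETLRLHPAWDKPLSFNPERFLG"
hits = find_motifs(toy)
for name, positions in hits.items():
    print(name, positions if positions else "absent")
```

Short patterns like EXXR occur by chance in many proteins, so a hit is only suggestive unless it falls at the expected position relative to the other motifs.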

[slide 7] 26C1 was discovered while 
doing a Blast search at the Stresa MDO 2000 meeting.  That demonstrates that interesting 
things can happen at these meetings, and science is not only done in the lab.  CYP26C1 is 
related to the retinoic acid metabolizing P450s CYP26A1 and CYP26B1.  

Some human P450s are not represented in the human EST database, or they only have one or 
two ESTs. [slide 8]

2A13 has only one EST in the Unigene database, from lung cystic fibrosis epithelial cells.  
A paper by Ding's lab showed CYP2A13 mRNA is expressed at the highest level in the nasal 
mucosa, followed by the lung and the trachea.  

2C19 had no ESTs in Unigene.  A blast search of 1000bp from the 3 prime UTR found no 
candidates for a 2C19 EST.  This sequence was cloned in Joyce Goldstein's lab as a single 
clone from 83 full length P450 clones in a liver cDNA library, so it was a rare cDNA even 
when trying to clone P450 mRNAs.

4A22 is not listed in Unigene since it is too new.  However, blasts of the human ESTs showed 
no hits. A blast search of 1000bp from the 3 prime UTR found 4A11 sequences (T95288 
AI261826 AV690226) but no 4A22 ESTs.  

26C1 is also not listed in Unigene yet.  Blasts of the human ESTs showed no hits. A blast 
search of 1000bp from the 3 prime UTR also found no hits.  

27C1 has 2 ESTs from astrocytoma and testis, so at least we know that this gene is expressed.

We don't really know yet whether 4A22 or 26C1 are expressed genes.

There are 4.5 million human ESTs in dbEST release 062802 from June 28, 2002.  The 
absence or very low representation of a p450 in this database indicates low levels of expression 
in the tissue libraries represented, or expression during limited time windows.  There is the 
possibility that some of these genes like 26C1 might be important transiently during 

The assembly of the human genome and the creation of several genome browsers, like Map 
Viewer at NCBI, Ensembl and the UC Santa Cruz browser, make it possible to map the 
locations of human genes.  I have done exhaustive blast searches of the human genome 
assembly and mapped all the P450 genes and pseudogenes on ideograms of the human 
chromosomes.  I will now show you the locations of these genes in the next five slides.  

 [slide 9] The 57 functional P450 
genes are shown in red, 47 pseudogenes are in blue.  The cluster of CYP4A, 4B, 4X and 4Z 
sequences on chromosome 1 has 5 functional genes and 3 pseudogenes.  46A4P and 2J2 are 
outside this block, but I could not show that, since there is limited room.  The 4Z2P sequence 
is a full length pseudogene with only one stop codon in exon 8, so it is probably a very 
recently formed pseudogene.  

On chromosome 2 notice the five 4F pseudogenes.  For some unknown reason the 4F subfamily 
has generated about 15 pseudogene fragments that appear on 5 different chromosomes.  
These are small pieces, not full length genes. 

Chromosome 3 has CYP51P1, one of three CYP51 pseudogenes, all on different 
chromosomes.  2D31P is the only 2D pseudogene outside the 2D locus on chromosome 22.

Chromosome 4 is interesting because it has 4V2 mapped to three different locations. The 
Santa Cruz browser has all three locations and the Ensembl browser has the lower two.  Two 
of these are probably errors, but it shows that the mapping and genome assembly process is 
still imperfect. 

 [slide 10] Chromosome 5 does not 
have any P450s, so it is not included here.  There are some CYP2 sequences mapped to 
chromosome 5, but they were sequenced in the same lab as the chromosome 19 clones, where the 
real CYP2 gene cluster is located, so this is apparently a mislabeled clone.  

The CYP39A1 gene is mapped to the centromere by MapViewer, but this is probably 
incorrect.  Ensembl maps it to the top of 6p12.3, so MapViewer is probably off in its locations 
for some genes.

Chromosome 7 has the 3A cluster with 4 active genes and 3 pseudogenes.  P450 clusters 
always seem to have pseudogene fragments interspersed with the whole genes.  

 [slide 11] Chromosome 10 has 
the 2C subfamily cluster.

 [slide 12] Chromosome 16 and 
17 have no p450s and 18 has only one small pseudogene fragment.  Chr 19 is very full of 
P450s in two different clusters, the 4F cluster with 6 functional genes and the CYP2 cluster 
also with 6 functional genes.  Both clusters have their share of pseudogene pieces.  

 [slide 13] The X chr has one 2C 
pseudogene fragment.  There are five of these scattered outside the 2C cluster.  About half 
of all the P450 pseudogenes are of 2F or 2C origin.

The mouse genome is not as complete as the human and I have not tried to map the mouse 
p450s on the genome assembly yet.  I can make some general remarks about the mouse 
P450s.  There are 84 known full length mouse p450s and 27C1 is expected because it is found 
in humans and Fugu.  4Z1 is also expected since it is found in human.  That is 29 more 
functional genes than seen in humans.  [slide 14] The CYP2 cluster in humans had 6 
functional genes.  Mice have 12 in those same subfamilies.  Humans have one 2D6 gene while 
mice have seven 2d genes.  Humans have 2J2 while mice have at least five 2js.  We have four 3As; 
mice have six.  Humans have two 4As and mice have four.  Humans have six 4Fs; mice have nine.

                          Human                  mouse
CYP2 cluster                6                      12
CYP2C cluster               4                      10
CYP2D                       1                       7
CYP2J                       1                       5
CYP3A                       4                       6
CYP4A                       2                       4
CYP4F cluster               6                       9

Total                      24                      53 (29 more)

Not shown in the slide is the 2C cluster.  Humans have 4 2Cs while mice have at least 9.
This accounts for 28 extra sequences seen in mouse as compared to human.
Aside from expansion of these three families, mouse and human are very similar.  The same 
families and subfamilies are present.  
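
The arithmetic behind these counts can be checked with a short Python sketch.  The numbers are copied from the table and the text above; only the variable names are mine.  Note that the table lists 10 mouse 2Cs, which gives a difference of 29, while counting "at least 9" mouse 2Cs gives the 28 quoted in the text.

```python
# Gene counts per cluster, copied from the table: (human, mouse).
counts = {
    "CYP2 cluster": (6, 12),
    "CYP2C cluster": (4, 10),
    "CYP2D": (1, 7),
    "CYP2J": (1, 5),
    "CYP3A": (4, 6),
    "CYP4A": (2, 4),
    "CYP4F cluster": (6, 9),
}

human_total = sum(h for h, _ in counts.values())
mouse_total = sum(m for _, m in counts.values())
extra_in_mouse = mouse_total - human_total

# Genome-wide check from the text: 84 known mouse genes plus two expected
# (27C1 and 4Z1) compared with 57 human genes.
genome_wide_extra = (84 + 2) - 57

# Same tally if mouse has 9 rather than 10 2Cs ("at least 9" in the text).
extra_with_nine_2c = extra_in_mouse - 1

print(human_total, mouse_total, extra_in_mouse, genome_wide_extra, extra_with_nine_2c)
```

This prints 24 53 29 29 28, so the expanded clusters account for essentially all of the difference between the two species.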

Where species comparisons get more interesting is in comparing human to Fugu, the Japanese 
pufferfish. [slide 15 of Fugu].  The 
Fugu genome has been assembled and it is nearly complete.  I have done searches to find all 
the P450s that I could find in this genome and I have named them.  There are 71 non-
overlapping contigs of P450 sequences assembled from the Fugu genome.  45 of these are 
complete P450 genes and one is a nearly intact pseudogene.  8 more are missing only one or 
two exons or less.  The next three slides show a comparison of the human and Fugu P450s 
side by side.  

Ray finned fishes and tetrapods like ourselves diverged 420 million years ago, so it is 
interesting to see how similar we are to fish at the level of our p450s.  This first slide 
[slide 16] shows the CYP2 
family.  It is the most diverged of all the P450 families.  Note that only CYP2R1 and 2U1 are 
conserved as subfamilies.  These genes have only 5 exons as compared to 9 exons in typical 
CYP2s.  The other subfamilies shown here have diverged so they are no longer recognizable 
as belonging to a common fish and mammalian subfamily.  The conservation of 2R1 and 2U1 
argues that they may be acting on conserved endogenous substrates, while the other genes are 
acting more on exogenous substrates or mammalian or fish specific endogenous substrates.  

[slide 17]  This next 
panel shows the 3, 4 and 5 families and 1, 17 and 21.  We can see more lines drawn 
connecting subfamilies in this section.  Notice that fish have a 1C subfamily and a 3B 
subfamily not seen in mammals.  The 4F subfamily has only one member in Fugu, probably 
the ancestral condition, with expansion of 4Fs in the mammals.  4A, B, Z and X may be 
derived from the 4T subfamily in fish.  This will be more apparent when I show you a tree of 
these sequences.  

[slide 18]  The third panel 
shows the most conserved sequences.  These are cholesterol, steroid, bile acid and retinoid 
metabolizing P450s (except for CYP20 which has no known function).  There is a one to one 
relationship between these sequences except for CYP39.  CYP39 does not exist in Fugu or 
any other fish genomes searched so far.  CYP39 seems to be the only mammalian innovation 
in p450s in 420 million years.  It catalyzes the same reaction as CYP7B1 to make 7 alpha 
hydroxylated bile acids.  The 7B subfamily is also missing in fish, though there is a 7C 
subfamily that could have an equivalent role.

The relationships between these sequences are shown in more detail in this phylogenetic tree 
 [slide 19] of 60 human, 54 
Fugu and 8 other fish sequences. The tree can be viewed as a small number of major branches 
that I have named clans.  These sequences always cluster together on trees and probably share 
a common ancestor over 400 million years ago.  Some of these clans are seen in invertebrates 
so they are older than 600 million years.  In the top half of the tree we see alternation of red 
and blue branches.  This indicates the 1:1 correspondence between Fugu and human sequences 
that we saw in the last slide.  This begins to change in the 4 clan where there are clusters of 
red human sequences and only 2 blue sequences.  This shows an expansion in human.  The 
4A, B, X and Z subfamilies seem to originate from the fish 4T sequences.  The 3 clan has 
expanded in both fish and human, but after they diverged.  The 2 clan shows the greatest 
expansion with clusters of red and blue sequences that have formed after fish and tetrapods 
diverged.

The next level of genome comparison as we step back on the evolutionary time scale is the 
urochordate Ciona, or the sea squirt.  Two genomes of Ciona species are being sequenced 
[slide 20, 21] Ciona savignyi 
and Ciona intestinalis shown 
here.  In the larval stage they have a tadpole appearance, with a notochord.  The two genomes 
are only about 70% identical to each other, which is less than mouse and human.  I have 
assembled over 800 sequence fragments from Ciona into about 200 contigs so far, with 18 full 
length genes assembled.  This is a slow process because the similarity to known P450s is much 
lower than that seen in the fish, and the genome is not assembled yet, so the reads are all out of order 
and unlinked, which makes assembly difficult.  Ciona has 68 unique heme signatures, so this is 
probably the approximate number of P450s.  A majority are related to the CYP2s, but their 
intron-exon structure does not preserve even one boundary.  The complete Ciona P450 set will 
afford a better view on the evolution of the P450 family in the deuterostome line and in the 
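
The idea of estimating gene number from unique heme signatures can be illustrated with a small Python sketch.  This is not the procedure used for Ciona.  It assumes a simplified signature pattern FXXGXRXCXG, which real P450s do not always follow (the CYP20 signature, for instance, is one residue short), and the peptide fragments are invented.  Overlapping reads from the same gene yield the same signature string, so counting distinct strings approximates the number of genes.

```python
import re

# Simplified heme-binding signature FXXGXRXCXG (an assumed pattern for this
# sketch; real P450s vary at several of these positions).
HEME_SIGNATURE = re.compile(r"F..G.R.C.G")


def unique_heme_signatures(peptides):
    """Collect the distinct heme-signature strings found in translated fragments."""
    found = set()
    for peptide in peptides:
        for match in HEME_SIGNATURE.finditer(peptide.upper()):
            found.add(match.group())
    return found


# Invented toy fragments: the first two overlap and carry the same signature.
fragments = [
    "PERFLPFGAGPRNCIGMRF",
    "LPFGAGPRNCIGMRFALME",
    "DPSRFQPFGLGKRSCPGRE",
    "MKTLLVAGWDEPLQ",
]
signatures = unique_heme_signatures(fragments)
print(len(signatures), sorted(signatures))
```

Recently duplicated genes with identical signatures would be undercounted by this approach, and sequencing errors in the signature would be overcounted, so the result is only an approximate number.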

I opened with one wood block print and I will close with another. [slide 22] Here is a curious 
human lifting the veil at the edge of the known and peering out into the inner workings of the 
universe.  I feel we are at this point today with the genome projects going forth.  In a very 
short time we should have a detailed view of the molecular evolution of life on earth.  It is a 
great time to be a biologist.