Bioinformatics and mining the human genome
Stresa Italy, July 10, 2000 MDO2000 meeting
David Nelson

Today, genome projects are the fashion. The news is full of the race to finish the human
genome. Two weeks ago, on June 26, Francis Collins and Craig Venter stood with
President Clinton in the White House as he heralded the completion of a rough draft of the
human genome. If one knows where to look, the progress can be monitored on genome
meters like this one from the Arabidopsis genome project, (92.9%
complete on July 25) slide 1 NCBI has a similar meter for the Human Genome Project
slide 2. Here we see that on June 25, the human genome was about 21% finished
and about 66% more was in draft form (meaning 99.9% accuracy or better). With all this
attention given to these genome projects, don’t be surprised if you get
a notice in the mail for a new journal called Nature Genomics.

I don’t mean to downplay the importance of all this work. It has been and continues to be
an exciting time, one that will not be repeated. Once the human and mouse genomes are
done and the genes have been identified by comparison of the two genomes, much of the
excitement will pass. But for now, we are in the middle of it. Bets are actually being
taken on the number of human genes, and the number ranges from 34000 to 153000. The human gene
count is a surprisingly elusive and slippery number. Here is a graph taken from the NHGRI showing
sequencing progress. Slide 3 As you can see from this figure, most of the human genome sequence
has been acquired in the past 15 months, since March 1999. If you had looked for a human gene
in Genbank before 1998 you probably would not have found it. If you looked today the chances
are about 90% that you would find it.

How can it be that we have 87% of the human genome sequence, but we cannot agree
on the number of genes within a factor of 4 to 5? This paradox begins to reveal
the difficulties of analyzing the DNA sequence. For those of you that have
looked at the human genome rough draft sequence, you have probably had some
frustrations. For those of you who have not, you may not know what is there.
This morning I would like to talk about data mining in the human genome and what
king of problems you may run into. I will use examples from the cytochrome P450
genes. We are really faced with numerous problems when looking at the rough
draft of the human genome. First, most of the sequence is fragmentary. The 66%
that is in draft form is at four levels of completion. Phase 0 sequence is the
most preliminary. Slide 4 A large BAC clone, is shotgun sequenced and all the
unassembled reads from this sequence set are given one accession number and
placed in Genbank in the HTGS (high throughput genomic) section. The fragments
are unordered and their correct orientation is unknown. The clone size may be
as large as 280000 bp and the number of reads in a single entry may be as high
as 360. The read lengths are about 600-750 bp. Because P450 genes are
moderately long, about 10000 to 40000 bp, the short reads in phase 0 sequence
cannot cover a whole gene. One read may only cover one exon or a part of one
exon. If there is only one P450 gene on a whole BAC clone it will be possible
to assemble the gene based on comparison to other complete P450s and ESTs, but
there are likely to be regions missing in the gaps between some of the reads.
You also do not know if there is more than one P450 on a clone. The pieces
found may belong to two or more genes.

The next level is phase 1. Slide 5 Here the reads have been assembled into
contigs by searching for overlaps. The number of contigs drops as the fragments
are joined. Phase 1 sequence is still composed of unordered pieces. The
accession number is given a version number when it is created. As revisions are
made and more fragments are joined the version number is increased. I have seen
version numbers reach 28, but they are usually in the 1-6 range. It is still
difficult and hazardous to assemble genes from phase 1 sequence. I’ll give
you an example from chromosome 19 which has a CYP2 family gene cluster. In late
March 2000, this accession number was version 2 and it consisted of 16 unordered
pieces. Ten of them had P450 sequence on them, but there was evidence for five
different subfamilies being present and probably 7 different genes. At the time
it was not possible to assemble all these genes with any confidence. On May 4,
the same accession number was at version 3. This was a phase two sequence with
only two fragments and they were ordered, so it was now possible to assemble the
genes correctly assuming that none of them fell in the single gap or ran off one
end or the other. It is also interesting to note that the version 3 sequence
was 137000 bp while the earlier version was 170000, so 33000 bp of sequence was
discarded. The P450 genes on this sequence look like this now. Slide 6. Six
genes are present, but the cluster is not complete. In moving from version 2 to
version 3, the 2A6 gene was deleted as well as the C-terminal exon of 2B7P1.
The 2A6 gene has reappeared on AC025769.3, but this sequence is labeled as being
from chromosome 5. The other sequences seen in this slide are all present on
this chromosome 5 sequence, so my guess is that there is a mistake in labeling.
Chromosome 19 is probably correct based on other mapping data. The genome
sequence is thus in a state of flux.

Phase 2 sequence slide 7 has fewer contigs and they are ordered. This
is a significant improvement over the unordered pieces of phase 0 and phase 1
sequence. There are still gaps and a long human P450 gene could be missing some
exons. The longest human P450 gene I know about so far is CYP5A1. This gene is
on a completed genomic sequence [NT_001551] and it spans 197000 bp. If this was
in a phase 2 sequence it might be missing half the gene.

After phase 2 the sequence reaches complete status. There are no more gaps in
complete sequences. The last stage in processing the sequence is annotation,
with identification of the coding regions, repeats, pseudogenes etc. The
annotation is always subject to change. There are often differences of opinion
about what constitutes a gene. I have had discussions with genome sequencing
experts that say they have found twice as many genes on the completed and
annotated human chromosome 22 as the published article claims exist there. In
my own experience, I have found two P450 genes in a cluster joined together with
the other parts of the two genes overlooked. I have seen this in Drosophila,
Arabidopsis and C. elegans, so it is a fairly common problem. This underscores
the need for expert human annotation in addition to computer generated
annotation.

So far, the problems I have mentioned with assembling P450 genes or any other
genes have had to do with the incomplete nature of the sequence data. This will
vanish as the sequences are all moved out of phase 0, 1 and 2 to complete
status. The problems of gene identification and assembly do not end there.
There is still the matter of fate in selection of the DNA to be sequenced. On
Chromosome 22, the first completed human chromosome, there is a small cluster of
P450 genes. These are the 2D6 gene and two pseudogenes near it. This cluster
has been well studied and documented. However, the individual whose DNA was
sequenced for the Human Genome Project carried an allele of CYP2D6 called 2D6*5
where the whole 2D6 coding region was deleted and only parts of the pseudogenes
remained. So when I tried to find 2D6 on chromosome 22, I could not find it.
This is the problem of polymorphisms. I show a comment in this slide that
emphasizes the problem. Slide 8 There are numerous deletions. This raises
questions about P450s that might be real genes or pseudogenes. On AC008537,
there is a gene named CYP2G1P. This gene is normal in every way except it is
missing exons 4 and 5. Is this due to a polymorphic deletion in this individual
or is CYP2G1P really a pseudogene. We won’t know until the gene can be
resequenced from other people. The Celera genomic data is taken from 5
individuals and sequencing on all five was completed June 23, so this question
is probably answered in the Celera data set.

Pseudogenes are also a difficult matter. Currently there are 24 human P450
pseudogenes. These fragments of genes often flank real genes and one has to
wonder if there might be alternative splicing in some cases. Among human P450s,
there seem to be a large number of 4F and 2C pseudogenes. It is not clear why
these parents should throw off so many more pseudogenes than other P450s, but
they seem to be more abundant. Pseudogenes may be a problem in humans in
general. Another gene that I work on is called the adenine nucleotide
translocator or ANT1 and this gene is reported to have at least 9 pseudogenes on
various chromosomes. With the draft sequence quality at more than 1 error in
10000bp, it is often difficult to tell a pseudogene from a sequence error. It
may take some time or some resequencing to clarify those P450 genes that have
one or two in frame stop codons. This is the case for the gene CYP2G2P.
Otherwise it looks like a normal gene. Is it real or not?

Another issue that comes up in such a large project is contamination. There is
one P450 sequence I found in human draft sequence that did not seem to be human.
The best match to this human sequence is a mushroom P450 with 50% sequence
identity. Slide 9 I sent notice to Genbank that this was so and asked them to
relay the message to the sequencers. When I repeated the search for this
sequence in June, I did not find it. Calling up the accession number, I found
that the version number had changed and the new sequence was shorter. Slide 10
Looking at the contigs in both the version 2 and version 3 sequence one can see
that the first two small contigs have been removed. The P450 sequence was on
contig 2. So the sequencers decided to drop those sequences in version 3. I
strongly suspect a fungal contamination of the library used in making this
clone. Unfortunately, this may affect other sequences in the library that are
undetected so far. This is especially hard if the sequence is chimeric, with
human DNA on both ends, then it will look like a human sequence and be joined
into larger contigs. Errors like this will need to be checked by sequencing
multiple individuals to eliminate contaminants.

Possibly the most difficult problem in assembling genes from the human genome is
not due to sequence errors. Even a perfect sequence would have this problem.
It is the difficulty in assembling any gene from genomic sequence, finding the
exons, especially the N-terminal exon. One example of this is the new human
P450 CYP3A43. This gene is 38000 bp long and it is composed of 13 exons. Slide 11
The first intron is over 8000 bp and the first exon is only 24 amino acids
long. Seven exons are less than 40 amino acids long and one is only 18.
Luckily the 3A subfamily is pretty well conserved and these short exon fragments
can be detected by a systematic search. This would not be the case if the
percent identity between a new sequence and some other known sequence was low.
In that case one would need to depend on cDNA sequence to help locate the
missing pieces, especially the N-terminal. This is exactly the problem with the
Dictyostelium discoideum P450 gene sequences. Most of them seem to have a very
short and poorly conserved N-terminal exon that just cannot be detected by
examining the genomic DNA sequence.

That brings up the flip side of the human genome project. In addition to
sequencing the genomic DNA, nearly two million EST sequences from human cDNA
have been deposited in GenBank. These can be very helpful in defining the
intron exon boundaries of new genes. More than 1.5 million of these sequences
have been incorporated into UNIGENE. Unigene scans the EST sequences and groups
them in clusters that are from the same gene, or very similar genes. The May 24
version had 89632 clusters. It is tempting to assume that this means there are
about 90000 human genes, but there is not a one to one correspondence between
UNIGENE clusters and genes. Over 30000 of these clusters have only one EST
sequence and it has been argued recently that many of these are probably
accidental missprimings. Serious attempts to estimate the number of human genes
ignore these singleton ESTs. That leaves almost 60000 clusters with two or more
ESTs. However, different clusters may be from different parts of the same gene.

I have tried to catalog the UNIGENE entries for human, mouse and rat, though
this is like shooting at a moving target. Slides 12, 13 see TABLE The data
changes so quickly that many of the UNIGENE numbers are retired due to splitting
and mergers. Of course pseudogenes that do not make mRNA will not have an entry
in UNIGENE. However, there are pseudogenes that do have ESTs such as the new
CYP2T2P with 12 ESTs mostly from breast and placenta. This makes one wonder if
it is really a pseudogene. A similar gene CYP2T1 has been found in the rat and
it is not a pseudogene. So perhaps humans have just recently lost a functional
version of this gene. Several other P450 pseudogenes have a UNIGENE entry.
These include CYP2B7P1, 2D8P and 3A5P2.

There are also six human P450s that are not pseudogenes yet they have no ESTs
associated with them. These are CYP2A13, 2C19, 2F1, 4F22, 7A1 and 27C1.
Slide 14 This shows that the EST database does not have every gene represented
even though there are 2 million sequences. These 6 P450s represent 11% of the
known human P450s, so gene coverage in the human EST database is about 89% based
on P450 genes.

Searching the human ESTs for a P450 will not always find a good hit, even when
there are ESTs for that gene. Some P450s have long 3 prime untranslated
sequences and ESTs are often from this part of the gene. CYP26B1 has a 3000bp
untranslated 3 prime sequence and no ESTs in the coding region.

I think I have given you a summary of the problems encountered in data mining
the human genome sequence data. I do not want to leave you with the impression
that it is not worth the trouble. On the contrary, in the past year, I have
found several new P450 genes from the incomplete and sometimes fragmentary human
genomic sequences. slide 15 These include, CYP3A43, CYP2G1P, 2G2P, 2T2P, 2T3P,
2U1, 4F22, 4V2 (related to a trout P450 fragment called 4V1), 26B1, and 27C1.
This last gene is missing some sequence at the end and in the middle, but it
may be an important new member of the mitochondrial P450s.

The discovery of new human P450s is winding down. With almost 90% of the genome
in the databases, one might expect only a 10% increase in gene counts. This is
actually misleading, because there are some P450s that are known from cloning
that have not been found yet in the genomic sequencing. Slide 16 These make up
a part of the missing 10%. CYP2C9, 3A4, 11B1, 11B2 and 26A1 still need to be
found in the genomic DNA. (Note, on July 7 CYP3A4 was on AC069294.3. This
includes parts of 3A5, 3A7 and 3A43 see file
) We have 53 human P450s now. If we subtract the five given above that are
not yet found in the genomic DNA that gives 48 and a 10% increase would be 53
again. Based on these data, I don’t expect to find more than one or two
more P450 genes in humans, and those are probably going to be in existing
families and subfamilies. There should not be any major surprises in store. Of
course, I like surprises. Slide 17

Note: 2C9 has been found on AL133513 joined to 2C19 to make a hybrid sequence.
The 2C cluster has four 2C genes close together. I did a blast of the intron
region at the joint of 2C9 and 2C19 and found that most of the intron region
about 800bp of the 1160bp region was a line1 repeat element. It had 91%
identity to many other line1 repeats in the human genomic sequence. If all four
2C genes have this Line1 repeat in this intron, they could recombine to make six
different chimeric P450s that would be functional genes. The original 2C17X
gene could have been caused by this phenomenon. The person chosen for
sequencing may have a deletion polymorphism of the 3 prime end of 2C9 and the 5
prime end of 2C19. This would be similar to the deletion of 2D6 on chromosome
22.