Bioinformatics and mining the human genome
Stresa Italy, July 10, 2000 MDO2000 meeting
David Nelson
Today, genome projects are the fashion. The news is full of the race to finish the human genome. Two weeks ago, on June 26, Francis Collins and Craig Venter stood with President Clinton in the White House as he heralded the completion of a rough draft of the human genome. If one knows where to look, the progress can be monitored on genome meters like this one from the Arabidopsis genome project (92.9% complete on July 25). Slide 1. NCBI has a similar meter for the Human Genome Project. Slide 2. Here we see that on June 25, the human genome was about 21% finished and about 66% more was in draft form (meaning 99.9% accuracy or better). With all this attention given to these genome projects, don’t be surprised if you get a notice in the mail for a new journal called Nature Genomics.
I don’t mean to downplay the importance of all this work. It has been and continues to be an exciting time, one that will not be repeated. Once the human and mouse genomes are done and the genes have been identified by comparison of the two genomes, much of the excitement will pass. But for now, we are in the middle of it. Bets are actually being taken on the number of human genes, and the number ranges from 34000 to 153000. The human gene count is a surprisingly elusive and slippery number. Here is a graph taken from the NHGRI showing sequencing progress. Slide 3 As you can see from this figure, most of the human genome sequence has been acquired in the past 15 months, since March 1999. If you had looked for a human gene in Genbank before 1998 you probably would not have found it. If you looked today the chances are about 90% that you would find it.
How can it be that we have 87% of the human genome sequence, but we cannot agree on the number of genes within a factor of 4 to 5? This paradox begins to reveal the difficulties of analyzing the DNA sequence. For those of you who have looked at the human genome rough draft sequence, you have probably had some frustrations. For those of you who have not, you may not know what is there. This morning I would like to talk about data mining in the human genome and what kind of problems you may run into. I will use examples from the cytochrome P450 genes. We are really faced with numerous problems when looking at the rough draft of the human genome. First, most of the sequence is fragmentary. The 66% that is in draft form is at four levels of completion. Phase 0 sequence is the most preliminary. Slide 4. A large BAC clone is shotgun sequenced, and all the unassembled reads from this sequence set are given one accession number and placed in GenBank in the HTGS (high throughput genomic) section. The fragments are unordered and their correct orientation is unknown. The clone size may be as large as 280000 bp and the number of reads in a single entry may be as high as 360. The read lengths are about 600-750 bp. Because P450 genes are moderately long, about 10000 to 40000 bp, the short reads in phase 0 sequence cannot cover a whole gene. One read may only cover one exon or a part of one exon. If there is only one P450 gene on a whole BAC clone, it will be possible to assemble the gene based on comparison to other complete P450s and ESTs, but there are likely to be regions missing in the gaps between some of the reads. You also do not know if there is more than one P450 on a clone. The pieces found may belong to two or more genes.
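To make the mismatch between read length and gene length concrete, here is a minimal sketch that works out how many reads it would take to span a P450 gene end to end, using only the figures quoted above. The function name is hypothetical, and the calculation assumes ideal, non-overlapping reads, so real coverage would need many more.

```python
import math


def min_reads_to_span(gene_bp, read_bp):
    """Fewest end-to-end, non-overlapping reads that could span a gene.

    Idealized lower bound: real shotgun reads overlap and land at random.
    """
    return math.ceil(gene_bp / read_bp)


# Figures from the talk: genes of 10000-40000 bp, reads of 600-750 bp.
for gene_bp in (10_000, 40_000):
    for read_bp in (600, 750):
        print(gene_bp, read_bp, min_reads_to_span(gene_bp, read_bp))
```

Even in the best case, a short P450 gene needs more than a dozen reads, which is why a single phase 0 read rarely holds more than one exon.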
The next level is phase 1. Slide 5 Here the reads have been assembled into contigs by searching for overlaps. The number of contigs drops as the fragments are joined. Phase 1 sequence is still composed of unordered pieces. The accession number is given a version number when it is created. As revisions are made and more fragments are joined the version number is increased. I have seen version numbers reach 28, but they are usually in the 1-6 range. It is still difficult and hazardous to assemble genes from phase 1 sequence. I’ll give you an example from chromosome 19 which has a CYP2 family gene cluster. In late March 2000, this accession number was version 2 and it consisted of 16 unordered pieces. Ten of them had P450 sequence on them, but there was evidence for five different subfamilies being present and probably 7 different genes. At the time it was not possible to assemble all these genes with any confidence. On May 4, the same accession number was at version 3. This was a phase two sequence with only two fragments and they were ordered, so it was now possible to assemble the genes correctly assuming that none of them fell in the single gap or ran off one end or the other. It is also interesting to note that the version 3 sequence was 137000 bp while the earlier version was 170000, so 33000 bp of sequence was discarded. The P450 genes on this sequence look like this now. Slide 6. Six genes are present, but the cluster is not complete. In moving from version 2 to version 3, the 2A6 gene was deleted as well as the C-terminal exon of 2B7P1. The 2A6 gene has reappeared on AC025769.3, but this sequence is labeled as being from chromosome 5. The other sequences seen in this slide are all present on this chromosome 5 sequence, so my guess is that there is a mistake in labeling. Chromosome 19 is probably correct based on other mapping data. The genome sequence is thus in a state of flux.
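The idea of joining reads into contigs by searching for overlaps can be illustrated with a toy greedy merge. This is only a sketch under simplifying assumptions, not the assembler the sequencing centers use: the reads are invented, they are error free, and reverse-complement matches are ignored. The function names are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the best overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(reads)
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to join
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


# Four made-up reads: three overlap, one does not.
reads = ["ATGGCC", "GCCTTA", "TTAGGA", "CAACAA"]
print(greedy_assemble(reads))
```

The four reads collapse to two contigs, and nothing in the result says which contig comes first or which way it points. That is the situation with phase 1 sequence.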
Phase 2 sequence (slide 7) has fewer contigs and they are ordered. This is a significant improvement over the unordered pieces of phase 0 and phase 1 sequence. There are still gaps, and a long human P450 gene could be missing some exons. The longest human P450 gene I know about so far is CYP5A1. This gene is on a completed genomic sequence [NT_001551] and it spans 197000 bp. If this were in a phase 2 sequence, it might be missing half the gene.
After phase 2 the sequence reaches complete status. There are no more gaps in complete sequences. The last stage in processing the sequence is annotation, with identification of the coding regions, repeats, pseudogenes etc. The annotation is always subject to change. There are often differences of opinion about what constitutes a gene. I have had discussions with genome sequencing experts who say they have found twice as many genes on the completed and annotated human chromosome 22 as the published article claims exist there. In my own experience, I have found cases where parts of two neighboring P450 genes in a cluster were joined together as one gene, with the other parts of the two genes overlooked. I have seen this in Drosophila, Arabidopsis and C. elegans, so it is a fairly common problem. This underscores the need for expert human annotation in addition to computer generated annotation.
So far, the problems I have mentioned with assembling P450 genes or any other genes have had to do with the incomplete nature of the sequence data. This will vanish as the sequences are all moved out of phase 0, 1 and 2 to complete status. The problems of gene identification and assembly do not end there. There is still the matter of fate in selection of the DNA to be sequenced. On chromosome 22, the first completed human chromosome, there is a small cluster of P450 genes. These are the 2D6 gene and two pseudogenes near it. This cluster has been well studied and documented. However, the individual whose DNA was sequenced for the Human Genome Project carried an allele of CYP2D6 called 2D6*5, where the whole 2D6 coding region was deleted and only parts of the pseudogenes remained. So when I tried to find 2D6 on chromosome 22, I could not find it. This is the problem of polymorphisms. I show a comment in this slide that emphasizes the problem. Slide 8. There are numerous deletions. This raises questions about P450s that might be real genes or pseudogenes. On AC008537, there is a gene named CYP2G1P. This gene is normal in every way except it is missing exons 4 and 5. Is this due to a polymorphic deletion in this individual, or is CYP2G1P really a pseudogene? We won’t know until the gene can be resequenced from other people. The Celera genomic data is taken from 5 individuals and sequencing on all five was completed June 23, so this question is probably answered in the Celera data set.
Pseudogenes are also a difficult matter. Currently there are 24 human P450 pseudogenes. These fragments of genes often flank real genes and one has to wonder if there might be alternative splicing in some cases. Among human P450s, there seem to be a large number of 4F and 2C pseudogenes. It is not clear why these parents should throw off so many more pseudogenes than other P450s, but they seem to be more abundant. Pseudogenes may be a problem in humans in general. Another gene that I work on is called the adenine nucleotide translocator or ANT1 and this gene is reported to have at least 9 pseudogenes on various chromosomes. With the draft sequence quality at more than 1 error in 10000bp, it is often difficult to tell a pseudogene from a sequence error. It may take some time or some resequencing to clarify those P450 genes that have one or two in frame stop codons. This is the case for the gene CYP2G2P. Otherwise it looks like a normal gene. Is it real or not?
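Checking a candidate gene for in frame stop codons is easy to automate once the exons have been joined. The sketch below assumes the input is an already assembled coding sequence starting at the first base of the start codon. The function name and the example sequences are invented for illustration. It cannot tell a true pseudogene from a sequencing error; it only flags where to look.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def in_frame_stops(cds):
    """Return 0-based codon positions of stop codons before the last codon.

    cds is assumed to be a spliced coding sequence in reading frame 0.
    Any trailing partial codon is ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # The final codon is expected to be the normal stop, so skip it.
    return [k for k, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]


print(in_frame_stops("ATGAAACCCTAA"))     # no internal stop
print(in_frame_stops("ATGAAATGACCCTAA"))  # internal TGA at codon 2
```

A sequence with one or two hits, like CYP2G2P, is exactly the kind that needs resequencing before it is called a pseudogene.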
Another issue that comes up in such a large project is contamination. There is one P450 sequence I found in human draft sequence that did not seem to be human. The best match to this human sequence is a mushroom P450 with 50% sequence identity. Slide 9. I sent notice to GenBank that this was so and asked them to relay the message to the sequencers. When I repeated the search for this sequence in June, I did not find it. Calling up the accession number, I found that the version number had changed and the new sequence was shorter. Slide 10. Looking at the contigs in both the version 2 and version 3 sequence, one can see that the first two small contigs have been removed. The P450 sequence was on contig 2. So the sequencers decided to drop those sequences in version 3. I strongly suspect a fungal contamination of the library used in making this clone. Unfortunately, this may affect other sequences in the library that are undetected so far. Contamination is especially hard to detect if the sequence is chimeric, with human DNA on both ends, because then it will look like a human sequence and be joined into larger contigs. Errors like this will need to be checked by sequencing multiple individuals to eliminate contaminants.
Possibly the most difficult problem in assembling genes from the human genome is not due to sequence errors. Even a perfect sequence would have this problem. It is the difficulty in assembling any gene from genomic sequence, finding the exons, especially the N-terminal exon. One example of this is the new human P450 CYP3A43. This gene is 38000 bp long and it is composed of 13 exons. Slide 11 The first intron is over 8000 bp and the first exon is only 24 amino acids long. Seven exons are less than 40 amino acids long and one is only 18. Luckily the 3A subfamily is pretty well conserved and these short exon fragments can be detected by a systematic search. This would not be the case if the percent identity between a new sequence and some other known sequence was low. In that case one would need to depend on cDNA sequence to help locate the missing pieces, especially the N-terminal. This is exactly the problem with the Dictyostelium discoideum P450 gene sequences. Most of them seem to have a very short and poorly conserved N-terminal exon that just cannot be detected by examining the genomic DNA sequence.
That brings up the flip side of the human genome project. In addition to sequencing the genomic DNA, nearly two million EST sequences from human cDNA have been deposited in GenBank. These can be very helpful in defining the intron-exon boundaries of new genes. More than 1.5 million of these sequences have been incorporated into UNIGENE. UNIGENE scans the EST sequences and groups them in clusters that are from the same gene, or very similar genes. The May 24 version had 89632 clusters. It is tempting to assume that this means there are about 90000 human genes, but there is not a one to one correspondence between UNIGENE clusters and genes. Over 30000 of these clusters have only one EST sequence, and it has been argued recently that many of these are probably accidental misprimings. Serious attempts to estimate the number of human genes ignore these singleton ESTs. That leaves almost 60000 clusters with two or more ESTs. However, different clusters may be from different parts of the same gene.
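The cluster arithmetic can be written out as a short sketch. The two totals are the ones quoted above, with "over 30000" taken as exactly 30000, so the result is an upper bound. The toy cluster table and the function name are invented for illustration and are not real UNIGENE data.

```python
TOTAL_CLUSTERS = 89_632       # May 24 UNIGENE build, as quoted in the talk
SINGLETONS_AT_LEAST = 30_000  # "over 30000" clusters hold a single EST

# Upper bound on clusters with two or more ESTs.
upper_bound_multi = TOTAL_CLUSTERS - SINGLETONS_AT_LEAST
print(upper_bound_multi)


def drop_singletons(clusters):
    """Keep only clusters supported by at least two ESTs."""
    return {cid: ests for cid, ests in clusters.items() if len(ests) >= 2}


# Hypothetical cluster table: cluster id -> EST ids.
toy = {
    "cluster_a": ["est1", "est2", "est3"],
    "cluster_b": ["est4"],
    "cluster_c": ["est5", "est6"],
}
kept = drop_singletons(toy)
print(sorted(kept))
```

Dropping singletons removes likely artifacts, but it does nothing about two clusters that come from opposite ends of one gene, so the remaining count still overestimates the number of genes.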
I have tried to catalog the UNIGENE entries for human, mouse and rat, though this is like shooting at a moving target. Slides 12, 13 see TABLE The data changes so quickly that many of the UNIGENE numbers are retired due to splitting and mergers. Of course pseudogenes that do not make mRNA will not have an entry in UNIGENE. However, there are pseudogenes that do have ESTs such as the new CYP2T2P with 12 ESTs mostly from breast and placenta. This makes one wonder if it is really a pseudogene. A similar gene CYP2T1 has been found in the rat and it is not a pseudogene. So perhaps humans have just recently lost a functional version of this gene. Several other P450 pseudogenes have a UNIGENE entry. These include CYP2B7P1, 2D8P and 3A5P2.
There are also six human P450s that are not pseudogenes yet they have no ESTs associated with them. These are CYP2A13, 2C19, 2F1, 4F22, 7A1 and 27C1. Slide 14 This shows that the EST database does not have every gene represented even though there are 2 million sequences. These 6 P450s represent 11% of the known human P450s, so gene coverage in the human EST database is about 89% based on P450 genes. Searching the human ESTs for a P450 will not always find a good hit, even when there are ESTs for that gene. Some P450s have long 3 prime untranslated sequences and ESTs are often from this part of the gene. CYP26B1 has a 3000bp untranslated 3 prime sequence and no ESTs in the coding region.
I think I have given you a summary of the problems encountered in data mining the human genome sequence data. I do not want to leave you with the impression that it is not worth the trouble. On the contrary, in the past year, I have found several new P450 genes from the incomplete and sometimes fragmentary human genomic sequences. Slide 15. These include CYP3A43, CYP2G1P, 2G2P, 2T2P, 2T3P, 2U1, 4F22, 4V2 (related to a trout P450 fragment called 4V1), 26B1, and 27C1. This last gene is missing some sequence at the end and in the middle, but it may be an important new member of the mitochondrial P450s.
The discovery of new human P450s is winding down. With almost 90% of the genome in the databases, one might expect only a 10% increase in gene counts. This is actually misleading, because there are some P450s that are known from cloning that have not been found yet in the genomic sequencing. Slide 16 These make up a part of the missing 10%. CYP2C9, 3A4, 11B1, 11B2 and 26A1 still need to be found in the genomic DNA. (Note, on July 7 CYP3A4 was on AC069294.3. This includes parts of 3A5, 3A7 and 3A43 see file ) We have 53 human P450s now. If we subtract the five given above that are not yet found in the genomic DNA that gives 48 and a 10% increase would be 53 again. Based on these data, I don’t expect to find more than one or two more P450 genes in humans, and those are probably going to be in existing families and subfamilies. There should not be any major surprises in store. Of course, I like surprises. Slide 17
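The projection in this paragraph is a two-step calculation, written out here with the numbers given in the talk. The variable names are only labels for this sketch, and the 10% figure is the rough share of the genome not yet in the databases.

```python
known_p450s = 53             # human P450s known at the time of the talk
cloned_not_in_genome = 5     # CYP2C9, 3A4, 11B1, 11B2 and 26A1
missing_genome_fraction = 0.10

# P450s already located in the genomic sequence.
found_in_genome = known_p450s - cloned_not_in_genome

# Scale up by the fraction of the genome still to come.
projected_total = found_in_genome * (1 + missing_genome_fraction)

print(found_in_genome)
print(round(projected_total))
```

The projection lands back on the current count, which is the basis for expecting very few additional genes.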
Note: 2C9 has been found on AL133513 joined to 2C19 to make a hybrid sequence. The 2C cluster has four 2C genes close together. I did a BLAST search of the intron region at the joint of 2C9 and 2C19 and found that most of the intron region, about 800 bp of the 1160 bp, was a LINE1 repeat element. It had 91% identity to many other LINE1 repeats in the human genomic sequence. If all four 2C genes have this LINE1 repeat in this intron, they could recombine to make six different chimeric P450s that would be functional genes. The original 2C17X gene could have been caused by this phenomenon. The person chosen for sequencing may have a deletion polymorphism of the 3 prime end of 2C9 and the 5 prime end of 2C19. This would be similar to the deletion of 2D6 on chromosome 22.
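The figure of six chimeras matches the number of ways to pick two genes out of four, which is presumably how it was reached. The sketch below enumerates those pairs. The names 2C8 and 2C18 for the other two cluster members are supplied here and are not in the note above. The list order is not meant to reflect gene order on the chromosome.

```python
from itertools import combinations

# Four genes of the human 2C cluster (2C8 and 2C18 are added for illustration).
genes = ["2C8", "2C9", "2C18", "2C19"]

# Each unordered pair could recombine across the shared repeat in the intron.
pairs = list(combinations(genes, 2))

for first, second in pairs:
    print(first + "/" + second)
print(len(pairs))
```

The 2C9/2C19 pair is the hybrid actually seen on AL133513. The other five pairs are only possibilities.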