Comments on new human P450s in genomic sequences

David R. Nelson May 8, 2000

The human HTGS (high-throughput genomic) DNA sequences were searched on April 15-27, 2000 for matches to human P450 sequences. The aim was to identify every accession number in this set that contains a P450 sequence. The result was 114 accession numbers, from all chromosomes except 17, 21, X and Y.

The current statistics for human genomic sequencing are 18.9% completed and 80.8% draft quality. Together these add up to 99.7% of the genome, which seems too high. Inquiries made at NCBI were answered with the assurance that the 80.8% figure does not include the finished sequence, but the draft sequence has many gaps, so the number may be too high. Draft quality means an error rate of 1% or better, and draft deposits in Genbank have numerous unordered fragments, sometimes over 100. One entry had 276 unordered fragments. Because these sequences contain gaps, many areas are not covered at all in the draft sequences.

The HTGS section of Genbank contains unfinished sequences. As they are finished they are moved into the NR section of Genbank. Therefore only the newest sequences will be in the HTGS section.

There are some problems in analyzing these data. The error rate in the draft-quality sequences is high enough to introduce some frameshifts and stop codons into the sequences. It is not possible to tell whether these are real or accidental until the sequence has moved from Phase 1 to Phase 3. In the meantime, names can be assigned to these sequences, but the names may reflect a pseudogene status that is due only to sequence errors. This may have to be corrected later. For example, 2G2P is on chromosome 19. It has two in-frame stops, but otherwise it looks like a complete gene. This sequence may become 2G2 if the stops are lost in later versions of the sequence. Another example is CYP4F23P, which has only one in-frame stop.

There is another fairly common problem with HTGS sequences. Most sequences (70/95) are assigned chromosome locations.
Sometimes these conflict with one another. The question is then whether there are two versions of a sequence on different chromosomes, whether one entry has been assigned to the wrong chromosome, or whether there is chimerism. Examples include 1B1 on chr 5 and chr 2, and CYP2J2 on chr 1 and chr 8. CYP39 seems to be on chr 6 and chr 18. CYP46 is found on chr 1 and chr 14. The sequences may not be 100% identical, but is that because they are different sequences, or is it due to the error rate in sequencing? The large cluster of P450s on AC008537 on chr 19 is also found on AC025769, which is from chr 5. I do not know if this is an accidental misassignment of the AC025769 DNA fragment to chr 5 or if there is a duplication of the chr 19 cluster on chr 5. We will have to watch this and see what happens.

I have posted three PDF files of tables listing the HTGS P450 sequences. These files are sorted by CYP name, accession number or chromosome. The location in the sequence where the P450 was found is given as xxxk, the base location in thousands. An entry may include over a dozen such numbers, and they may be from different parts of the sequence, since the fragments are often unordered. The locations depend entirely on the version number of the sequence, and this can change, so the location part of the table will go out of date quickly. The version number of the accession is included so you can tell whether the locations will be the same or different. Most version numbers are 1, 2, 3 or 4, but they can go as high as 28.

Some P450s are represented in HTGS as partial sequences. 4F2 only has an N-terminal sequence. Others are missing entirely. Some, like CYP26B1, have moved to the NR section.

There are some new P450s, and some incomplete P450 sequences are now complete. CYP2R1 is now completed. CYP2U1, CYP2T2P and CYP2T3P are new. In the 4 family, CYP4F22, CYP4F23P and CYP4AH1 are new. CYP4A20 replaces the old partial 4Z1 fragment. A new CYP27, CYP27C1, has been found, though it is not complete.
A hybrid 2C9/2C19 has been found, but this might be an artifact, and we need to wait for the Phase 3 sequence to confirm it. There are also numerous pseudogene fragments in the sequence data, especially from the 2C and 4F subfamilies.

A search of the NR database was done to find the sequences that are missing from the HTGS section. These include 2B6 (a related gene is present), 2C9, 2D6, 3A4, 4A11, 4F3, 4F8, 4F12, 5A1, 11B1, 11B2, 21A2, 26A1 and 51. Searches done on 4/26/00 show that 2B6, 4A11, 4F3, 4F8, 4F12, 5A1, 21A2 and 51 are known on genomic DNA and are present in the NR section of Genbank. The others (2C9, 2D6, 3A4, 11B1, 11B2 and 26A1) still need to be found in the genomic DNA from the Human Genome Project before it can be considered complete. (2D6 was deleted in the chr 22 sequence chosen for the Human Genome Project; see below.)

The 2D locus is on chromosome 22, and this whole chromosome is done. A search of the chromosome 22 data did find the 2D6 region. Because the gene is split into short exons in the genomic sequence, the scores for matches to this sequence were fairly low. When searching for genomic P450 sequences, it may be necessary to use individual exons as queries to move the hits to the top of the list; otherwise they can rank so low that they do not fall in the top 50 hits. It may be necessary to use a similar strategy for the last 6 P450s mentioned above that are not yet found in genomic DNA from the Human Genome Project.

The 2D6 locus AL021878 has two pseudogenes present but not 2D6. A search of all the Human Genome Project sequence did not find 2D6. This gene may lie in one of the 10 gaps shown in the foldout of the chr 22 sequence in Nature, Dec 2 1999. The AL021878 cosmid is on page 5 of the foldout. Alternatively, 2D6 may be deleted in this individual, in which case they would be a poor metabolizer of debrisoquine and susceptible to adverse reactions to drugs metabolized by this P450.
The sequence from 43541 to 41008 of AL021878 is 99.9% identical to M33388 (CYP2D6 genomic DNA) from 6910 to 9432 (outside the coding region). This is explained by a comment in OMIM that the 2D6 gene is deleted in this individual. This person has the CYP2D6*5 allele, which is a fairly common 12 kb deletion of the whole 2D6 gene.
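
The coordinates quoted above for the AL021878 / M33388 match can be sanity-checked with a little arithmetic. The sketch below is only an illustration, not the procedure used for the original search. It assumes the positions are 1-based and inclusive, as in Genbank records, and that a descending range (43541 to 41008) means the match lies on the opposite strand. The `Span` class and `compare` function are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A matched region on one accession (assumed 1-based, inclusive)."""
    accession: str
    start: int
    end: int

    @property
    def length(self) -> int:
        # Inclusive length, valid for ascending or descending ranges.
        return abs(self.end - self.start) + 1

    @property
    def strand(self) -> str:
        # Assumption: descending coordinates mean the reverse strand.
        return "+" if self.end >= self.start else "-"


def compare(a: Span, b: Span) -> dict:
    """Summarize how two matched regions relate to each other."""
    return {
        "lengths": (a.length, b.length),
        "same_orientation": a.strand == b.strand,
        "length_difference": abs(a.length - b.length),
    }


# Coordinates exactly as quoted in the text above.
al021878 = Span("AL021878", 43541, 41008)
m33388 = Span("M33388", 6910, 9432)

summary = compare(al021878, m33388)
print(summary)
```

Under these assumptions the two regions are 2534 and 2523 bases long and run in opposite orientations. The 11-base difference would presumably be taken up by gaps in the alignment, which fits a region that is nearly, but not exactly, identical.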