will be held in Swansea, Wales UK July 23-27 2006.
A Link to the website
CYP26C1 human GenEMBL AL358613.11 May 2, 2001 522 amino acids, 6 exons, (0) = phase 0 intron 52% to 26B1 human, also 15 amino acid insertion in exon 5 vs. 26B1 MFPWGLSCLSVLGAAGTALLCAGLLLSLAQHLWTLRWMLSRDRASTLPLPKGSMGWPFFGETLHWLVQ (0) GSRFHSSRRERYGTVFKTHLLGRPVIRVSGAENVRTILLGEHRLVRSQWPQSAHILLGSHTLLGAVGEPHRRRRK (0) VLARVFSRAALERYVPRLQGALRHEVRSWCAAGGPVSVYDASKALTFRMAARILLGLRL DEAQCATLARTFEQLVENLFSLPLDVPFSGLRK (0) GIRARDQLHRHLEGAISEKLHEDKAAEPGDALDLIIHSARELGHEPSMQELK (0) ESAVELLFAAFFTTASASTSLVLLLLQHPAAIAKIREELVAQGLGRACGCAPGAAGGSEGPPPD CGCEPDLSLAALGRLRYVDCVVKEVLRLLPPVSGGYRTALRTFELD (0) GYQIPKGWSVMYSIRDTHETAAVYRSPPEGFDPERFGAAREDSRGASSRLHYIPFGGGARSCLG QELAQAVLQLLAVELVRTARWELATPAFPAMQTVPIVHPVDGLRLFFHPLTPSVAGNGLCL* CYP27C1 AC027142 43% identical to 27A1 assembled gene intron starting with QIH ending in VDT is from Celera's public data CRA_Gene|hCG42613 /len=10487. This Celera sequence is still missing the C-terminal. Probable last exon is now found in AC027142. AG Intron boundary is in the same Location as CYP26B1. Stop codon is one codon away from 26B1s stop codon. Length is preserved from cys to intron. 
(n) = intron phase, 9 exons 1 85452 MQTSAMALLARILRAGLRPAPERGGLLGGGAPRRPQPAGARLPAGARAEDKGAGRPGSPPG 85634 61 62 85635 GGRAEGPRSLAAMPGPRTLANLAEFFCRDGFSRIHEIQ (0) 85748 99 100 39574 QKHTREYGKIFKSHFGPQFVVSIADRDMVAQVLRAEGAAPQRANMESWREYRDLRGRATGLISA (2) 39371 163 164 43984 EGEQWLKMRSVLRQRILKPKDVAIYSGEVNQVIADLIKRIYLLRSQAEDGETVTNVNDLFFKYSME (1) 43787 229 230 41743 GVATILYESRLGCLENSIPQLTVEYIEALELMFSMFKTSMYAGAIPRWLRPFIPKPWREFC 41564 290 291 41563 RSWDGLFKFS 41534 300 (1) 301 QIHVDNKLRDIQYQMDRGRRVSGGLLTYLFLSQALTLQEIYANVTEMLLAGVDT (0) 354 (Celera sequence) 355 110201 TSFTLSWTVYLLARHPEVQQTVYREIVKNLGERHVPTAADVPKVPLVRALLKETLR (2) 110034 410 411 108566 LFPVLPGNGRVTQEDLVIGGYLIPKG (0) 108489 436 437 108006 TQLALCHYATSYQDENFPRAKEFRPERWLRKGDLDRVDNFGSIPFGHGVRSCIGRRIAELEIHLVVIQ (0) 107794 504 505 102503 LLQHFEIKTSSQTNAVHAKTHGLLTPGGPIHVRFVNRK* 102619 542 new CYP4A22 sequence >new 4A11 like sequence AL390073.5 95% identical to 4A11 see alignment below MSVSVLSPSRRLGGVSGILQVTSLLILLLLLIKAAQLYLHRQWLLKALQQFPCPPSHWLFGHIQE FQHDQELQRIQERVKTFPSACPYWIWGGKVRVQLYDPDYMKVILGRS DPKSHGSYKFLAPRI GYGLLLLNGQTWFQHRRMLTPAFHNDILKPYVGLMADSVRVML DKWEELLGQDSPLEVFQHVSLMTLDTIMKSAFSHQGSIQVDR NSQSYIQAISDLNSLVFCCMRNAFHENDTIYSLTSAGRWTHRACQLAHQHT DQVIQLRKAQLQKEGELEKIKRKRHLDFLDILLLAK MENGSILSDKDLRAEVDTFMFEGHDTTASGISWILYALATHPKHQERCREEIHGLLGDGASITW NHLDQMPYTTMCIKEALRLYPPVPGIGRELSTPVTFPDGRSLPKG IMVLLSIYGLHHNPKVWPNLE VFDPSRFAPGSAQHSHAFLPFSGGSR NCIGKQFAMNQLKVARALTLLRFELLPDPTRIPIPMARLVLKSKNGIHLRLRRLPNPCEDKDQL* >CYP4A11 NM_000778 12 exons (n) = phase of introns MSVSVLSPSRLLGDVSGILQAASLLILLLLLIKAVQLYLHRQWLLKALQQFPCPPSHWLFGHIQE(0) LQQDQELQRIQKWVETFPSACPHWLWGGKVRVQLYDPDYMKVILGRS (1) DPKSHGSYRFLAPWI (1) GYGLLLLNGQTWFQHRRMLTPAFHYDILKPYVGLMADSVRVML (0) DKWEELLGQDSPLEVFQHVSLMTLDTIMKCAFSHQGSIQVDR (2) NSQSYIQAISDLNNLVFSRVRNAFHQNDTIYSLTSAGRWTHRACQLAHQHT (1) DQVIQLRKAQLQKEGELEKIKRKRHLDFLDILLLAK (0) MENGSILSDKDLRAEVDTFMFEGHDTTASGISWILYALATHPKHQERCREEIHSLLGDGASITW (2) NHLDQMPYTTMCIKEALRLYPPVPGIGRELSTPVTFPDGRSLPKG (1) IMVLLSIYGLHHNPKVWPNPEV 
(0) FDPSRFAPGSAQHSHAFLPFSGGSR (2) NCIGKQFAMNELKVATALTLLRFELLPDPTRIPIPIARLVLKSKNGIHLRLRRLPNPCEDKDQL* CYP4A22 new seq (top) vs CYP4A11 NM_000778 (bottom) 12 exons Length = 520 Score = 2607 (917.7 bits), Expect = 1.1e-276, P = 1.1e-276 Identities = 494/520 (95%), Positives = 504/520 (96%) Query: 1 MSVSVLSPSRRLGGVSGILQVTSLLILLLLLIKAAQLYLHRQWLLKALQQFPCPPSHWLF 60 MSVSVLSPSR LG VSGILQ SLLILLLLLIKA QLYLHRQWLLKALQQFPCPPSHWLF Sbjct: 1 MSVSVLSPSRLLGDVSGILQAASLLILLLLLIKAVQLYLHRQWLLKALQQFPCPPSHWLF 60 Query: 61 GHIQEFQHDQELQRIQERVKTFPSACPYWIWGGKVRVQLYDPDYMKVILGRSDPKSHGSY 120 GHIQE Q DQELQRIQ+ V+TFPSACP+W+WGGKVRVQLYDPDYMKVILGRSDPKSHGSY Sbjct: 61 GHIQELQQDQELQRIQKWVETFPSACPHWLWGGKVRVQLYDPDYMKVILGRSDPKSHGSY 120 Query: 121 KFLAPRIGYGLLLLNGQTWFQHRRMLTPAFHNDILKPYVGLMADSVRVMLDKWEELLGQD 180 +FLAP IGYGLLLLNGQTWFQHRRMLTPAFH DILKPYVGLMADSVRVMLDKWEELLGQD Sbjct: 121 RFLAPWIGYGLLLLNGQTWFQHRRMLTPAFHYDILKPYVGLMADSVRVMLDKWEELLGQD 180 Query: 181 SPLEVFQHVSLMTLDTIMKSAFSHQGSIQVDRNSQSYIQAISDLNSLVFCCMRNAFHEND 240 SPLEVFQHVSLMTLDTIMK AFSHQGSIQVDRNSQSYIQAISDLN+LVF +RNAFH+ND Sbjct: 181 SPLEVFQHVSLMTLDTIMKCAFSHQGSIQVDRNSQSYIQAISDLNNLVFSRVRNAFHQND 240 Query: 241 TIYSLTSAGRWTHRACQLAHQHTDQVIQLRKAQLQKEGELEKIKRKRHLDFLDILLLAKM 300 TIYSLTSAGRWTHRACQLAHQHTDQVIQLRKAQLQKEGELEKIKRKRHLDFLDILLLAKM Sbjct: 241 TIYSLTSAGRWTHRACQLAHQHTDQVIQLRKAQLQKEGELEKIKRKRHLDFLDILLLAKM 300 Query: 301 ENGSILSDKDLRAEVDTFMFEGHDTTASGISWILYALATHPKHQERCREEIHGLLGDGAS 360 ENGSILSDKDLRAEVDTFMFEGHDTTASGISWILYALATHPKHQERCREEIH LLGDGAS Sbjct: 301 ENGSILSDKDLRAEVDTFMFEGHDTTASGISWILYALATHPKHQERCREEIHSLLGDGAS 360 Query: 361 ITWNHLDQMPYTTMCIKEALRLYPPVPGIGRELSTPVTFPDGRSLPKGIMVLLSIYGLHH 420 ITWNHLDQMPYTTMCIKEALRLYPPVPGIGRELSTPVTFPDGRSLPKGIMVLLSIYGLHH Sbjct: 361 ITWNHLDQMPYTTMCIKEALRLYPPVPGIGRELSTPVTFPDGRSLPKGIMVLLSIYGLHH 420 Query: 421 NPKVWPNLEVFDPSRFAPGSAQHSHAFLPFSGGSRNCIGKQFAMNQLKVARALTLLRFEL 480 NPKVWPN EVFDPSRFAPGSAQHSHAFLPFSGGSRNCIGKQFAMN+LKVA ALTLLRFEL Sbjct: 421 NPKVWPNPEVFDPSRFAPGSAQHSHAFLPFSGGSRNCIGKQFAMNELKVATALTLLRFEL 480 Query: 481 
LPDPTRIPIPMARLVLKSKNGIHLRLRRLPNPCEDKDQL* 520 LPDPTRIPIP+ARLVLKSKNGIHLRLRRLPNPCEDKDQL* Sbjct: 481 LPDPTRIPIPIARLVLKSKNGIHLRLRRLPNPCEDKDQL* 520
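The identity figure in the alignment above (494/520, reported as 95%) is the number of aligned positions that carry the same residue, divided by the alignment length. The short Python sketch below shows that count. The function name percent_identity is an illustrative assumption, and the code assumes two equal-length strings that are already aligned; it does not reproduce the BLAST scoring itself.

```python
def percent_identity(query: str, subject: str) -> float:
    """Percent of aligned columns where both sequences carry the same residue.

    Assumes the two strings are already aligned and of equal length.
    Gap columns ('-') never count as identities.
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    if not query:
        raise ValueError("empty alignment")
    matches = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    return 100.0 * matches / len(query)


# First 20 residues of the CYP4A22 (query) and CYP4A11 (subject) alignment.
cyp4a22 = "MSVSVLSPSRRLGGVSGILQ"
cyp4a11 = "MSVSVLSPSRLLGDVSGILQ"
print(f"{percent_identity(cyp4a22, cyp4a11):.0f}% identical over 20 residues")

# The full-length figure reported by BLAST: 494 identities in 520 columns.
print(f"{100 * 494 / 520:.0f}% identical over 520 residues")
```

Only exact matches are counted here. The "Positives" figure in the BLAST output also counts conservative substitutions, which needs a scoring matrix and is not attempted in this sketch.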
Human CYP20 (phase of introns shown) AC011737.8|AC011737 Homo sapiens chromosome 2 clone RP11-33N4, WORKING MLDFAIFAVTFLLALVGAVLYLYP (0) ASRQAAGIPGITPTEEK (2) DGNLPDIVNSGSLHEFLVNLHERYGPVVSFWFGRRLVVSLGTVDVLKQHINPNKTS (1) DPFETMLKSLLRYQSGGGSVSENHMRKKLYENGVTDSLKSNFALLLK (0) LSEELLDKWLSYPETQHVPLSQHMLGFAMKSVTQMVMGSTFEDDQEVIRFQKNHGT (0) VWSEIGKGFLDGSLDKNMTRKKQYED (1) ALMQLESVLRNIIKERKGRNFSQHIFIDSLVQGNLNDQQ (0) ILEDSMIFSLASCIITAK (1) LCTWAICFLTTSEEVQKKLYEEINQVFGNGPVTPEKIEQLR (2) YCQHVLCETVRTAKLTPVSAQLQDIEGKIDRFIIPRE (0) TLVLYALGVVLQDPNTWPSPHK (2) genomic seq stops here the rest is cDNA FDPDRFDDELVMKTFSSLGFSGTQECPELR (2) intron site based on fish genomic DNA FAYMVTTVLLSVLVKRLHLLSVEGQVIETKYELVTSSREEAWITVSKRY AK020848 Mus musculus Cyp20 adult retina cDNA plus ESTs for C-term MLDFAIFAVTFLLALVGAVLYLYPASRQASGIPGLTPTEEKDGN LPDIVNSGSLHEFLVNLHERYGPVVSFWFGRRLVVSLGTTDVLKQHFNPNKTSDPFET MLKSLLGYQSGGGSAGEDHVRRKLYGDAVTASLHSNFPLLLQLSEELLDKWLSYPETQ HIPLSQHMLGFALKFVTRMVLGSTFEDEQEVIRFQKIHG TVWSEIGKGFLDGSLDKNTTRKKQYQEALMQLESTLKKIIKERKGGNFRQHT FIDSLTQGKLNEQQILEDCVVFSLASCIITAR LCTWTIHFLTTTGEVQKK LCKEIDQVLGEGPITSEKIEQLSYCQQVLFETVRTAKLTPVSARLQDIEGKVGPFVIPKE 360 TLVLYALGVVLQDPSTWPLPHRFDPDRFADEPVMKVFSSLGFSGTWECPELXFAYMVTAV 540 LVSVLLEKLRLLAVDRQVVEMKYELVTSAREEAWITVSKRH* Bovine CYP20 MLDFAIFAVTFLLALVGAVLYLYPASRQAAGIPGITPTEEKDGNLPDIV NSGSLHEFLVNLHERYGPVVSFWFGRRLVVSLGTVDVLKQHINPNKTLDPFETMLKSLLR YQSDSGNVSENHMRKKLYENGVTNCLRINFALLIKLSEELLDKWLSYPESQHVPLCQHML GFAMKSVTQMVMGSTFEDEQEVIRFQKNHGTVWSEIGKGFLDGSLDKSTTRKKQYEDALM QLESILKKIKERKGRNFSQHIFIDSLVQGNLNDQQILEDTMIFSLAS CMITAKLCTWAVCFLTTYEEIQKKLYEEIDQVLGKGPITSEKIEELRYCRQVLCETVRTA KLTPVSARLQDIEGKIDKFIIPRETLVLYALGVVLQEXGTWSSPYKFDPERFDDESVMKT FSLLGFSGTRECPELRFAYMVTAVLLSVLLRRLHLLSVE GQVIETKYELVTSSKEEAWITVSKRY 498 Fish homologs CYP20 From oryzias latipes (fish) MLDFAIFAVTFVVILVGAVLYLYPSSRRASGVPGLFPTDEKDGNLQDIVDRGSLHEFLV GLHEQFGPVASFWFGRQPVVSLGSVDPLRQHINPNHTTDSFETMLKSLLGYQAGAGGGAN ESVMRKKLYESAINNALKNSFPAVLKVAEELVDKWSSVPEDQHIPLCAHLLGLALKTV Human to fill gap in fish 
based sequence TQMVMGSTFEDDQEVIRFQKNHGT (phase 0) VWSEIGKGFLDGSLDKNMTRKKQYED (phase 1) 227 this part from Takifugu rubipes pufferfish ALSEMESTLLSVVKERKSQRNKSVFVDSLIQSTLTERQ 265 IMEDCMVFMLAGCAITAN 283 (1) Tetraodon nigroviridis freshwater pufferfish 284 VCIWALHFLSSSEDVQDRLHQELEEVLGSGPVSLEKIPQL RYCQQVLNETVRTAKLTPVAAGLQEVEGKVDQHLIPKE TLVIYALGVILQDSHTWDAPCR FHPDRFEEESVRKSFRLLGFSGSQTCPELR VAYTVATVLLSAVVRQLRLHRLEDTLVEVRSELVSTPREETWITFSRRN
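The gene models above mark each intron with its phase in parentheses: (0) for an intron that falls between two codons, (1) for one that falls after the first base of a codon, and (2) for one that falls after the second base. The Python sketch below splits a string annotated this way into exon peptides and the phase of the intron that follows each one. The function name split_exons is hypothetical, the excerpt is a truncated piece of the human CYP20 model, and only the plain "(n)" form is recognized, not variants such as "(phase 0)".

```python
import re


def split_exons(annotated: str):
    """Split a '(n)'-annotated protein string into (exon_peptide, phase) pairs.

    The phase belongs to the intron that follows the exon. The last exon
    has no following intron, so its phase is None.
    """
    # Splitting on a pattern with a capture group keeps the phase digits:
    # [peptide0, phase0, peptide1, phase1, ..., peptideN]
    parts = re.split(r"\(([012])\)", annotated)
    peptides = ["".join(p.split()) for p in parts[0::2]]  # drop whitespace
    phases = [int(p) for p in parts[1::2]] + [None]
    return list(zip(peptides, phases))


# Truncated excerpt of the human CYP20 model (first two introns only).
excerpt = "MLDFAIFAVTFLLALVGAVLYLYP (0) ASRQAAGIPGITPTEEK (2) DGNLPDIVNSGSLHEF"
for peptide, phase in split_exons(excerpt):
    print(f"{len(peptide):3d} aa, following intron phase: {phase}")
```

Keeping the phase with each exon makes it easy to compare intron positions between species, as is done above for the fish and human CYP20 sequences.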
The human chromosome 21 sequence was announced this week. I examined it for P450 genes and found that the authors have annotated one pseudogene of the 4F family on chromosome 21. It has been named CYP4F28P and I have posted it to my human P450 sequence file. It is 81% identical to CYP4F25P and 80% identical to CYP4F26P. The 4F family seems to generate a lot of pseudogenes, more so than other families.
I am trying to keep up with the new human sequences, but revisions of old sequences are rapidly replacing the ones I have already posted in my PDF files of human P450 genes. This means the nucleotide numbering of the exon locations will be changing.
The sequence previously named CYP4AH1 has been renamed CYP4V2 because it matches a rainbow trout P450 fragment that was sent to me long ago and named CYP4V1. The two are 69% identical.
Be aware that there is one P450 in the database that is labeled incorrectly as human (AC021892). This is a rice CYP75 sequence from rice chromosome 10. It was accidentally labeled as being from human chromosome 10 in the definition line. The authors are trying to get this corrected since it affects many entries.
>AC027142 CYP27C1 43% identical to 27A1 partially assembled gene
1 85452 MQTSAMALLARILRAGLRPAPERGGLLGGGAPRRPQPAGARLPAGARAEDKGAGRPGSPPG 48 85635 GGRAEGPRSLAAMPGPRTLANLAEFFCRDGFSRIHE 85742 83 84 39568 LQQKHTREYGKIFKSHFGPQFVVSIADRDMVAQVLRAEGAAPQRANMESWREYRDLRGRATGLISA 39371 149 150 43984 EGEQWLKMRSVLRQRILKPKDVAIYSGEVNQVIADLIKRIYLLRSQAEDGETVTNVNDLFFKYSME 43787 215 (GGT)intron G amino acid at boundary (AGGA) other end of intron 216 41743 GVATILYESRLGCLENSIPQLTVEYIEALELMFSMFKTSMYAGAIPRWLRPFIPKPWREFC 41564 41563 RSWDGLFKFSKRRIE 41519 287 gap of 51 amino acids 340 110201 TSFTLSWTVYLLARHPEVQQTVYREIVKNLGERHVPTAADVPKVPLVRALLKETLR 110034 395 intron LFPVLPGNGRVTQEDLVIGGYLIPKG intron 418 108006 TQLALCHYATSYQDENFPRAKEFRPERWLRKGDLDRVDNFGSIPFGHGVRSCIGRRIAELEIHLVVIQV 107791 493 missing about 30 aa at end
>AC012525 Homo sapiens chromosome 4. There is a mouse ortholog for this seq. Low 40% range with other mammalian 4 family members new subfamily of CYP4
223491 MAGLWLGLVWQKLLLWGAASAVSLAGASLVLSLLQRVASYARKWQQMRPIPTVARAYPLVGHALLMKPDGR 223279 220816 EFFQQIIEYTEEYRHMPLLKLWVGPVPMVALYNAENVEG 220700 219309 ILTSSKQIDKSSMYKFLEPWLGLGLLT 219232 218377 STGNKWRSRRKMLTPTFHFTILEDFLDIMNEQANILVKKLEKHINQEAFNCFFYITLCALDIIC 218186 217783 ETAMGKNIGAQSNDDSEYVRAVYR 217712 216357 MSEMIFRRIKMPWLWLDLWYLMFKEGWEHKKSLQILHTFTNSV 216229 214155 IAERANEMNANEDCRGDGRGSAPSKNKRRAFLDLLLSVTDDEGNRLSHEDIREEVDTFMFE 213973 210091 GHDTTAAAINWSLYLLGSNPEVQKKVDHELDDV 209993 206422 KSDRPATVEDLKKLRYLECVIKETLRLFPSVPLFARSVSED 206248 YFLTAGYRVLKGTEAVIIPYALHRDPRYFPNPEEFQPERFFPENAQG 206069 206068 RHPYAYVPFSAGPRNCIG 206015 204818 QKFAVMEEKTILSCILRHFWIESNQKREELGLEGQLILRPSNGIWIKLKRRNADER* 204648
>AC025090 CYP2U1 AC000016 has C-term 41% to 2N1 new CYP2 subfamily intron joints not yet defined
MSSPGPSQPPAEDPPWPARLLRAPLGLLRLDPSGGALLLCGLVALLGWSWLRRRRARGI 77036 PPGPTPWPLVGNFGHVLLPPFLRRRSWLSSRTRAAGIDPSVIGPQVLLAHLARVYGSI 76863 76862 FSFFIGHYLVVVLSDFHSVREALVQQAEVFSDRPRVPLISIVT 76734 105008 GPVWRQQRKFSHSTLRHFGLGKLSLEPKIIEEFKYVKAEMQKHGEDPFCPF 105160 105161 SIISNAVSNIICSLCFGQRFDYTNSEFKKMLGFMSRGLEICLNSQVLLVNICPWLYYLPF 105340 105341 GPFKELRQIEKDITSFLKKIIKDHQESLDRENPQDFIDMYLLHMEEERKNNSNSSFDEE 105517 105518 YLFYIIGDLFIAGTDTTTNSLLWCLLYMSLNPDVQ 105622 107396 KVHEEIERVIGANRAPSLTDKAQMPYTEATIMEVQRLTVVVPLAIPHMTSENT 107554 109370 LQGYTIPKGTLILPNLWSVHRDPAIWEKPEDFYPNRFLDDQGQLIKKETFIPFGIG 109540 KRVCMGEQLAKMELFLMFVSLMQSFAFALPEDSKKPLLTGRFGLTLAPHPFNITISRR
>CYP4F22 AC011492 assembled gene 13 exons 114537-140651 66% to 4F3, 65% to 4F11, 63% to 4F2, 59% to 4F8, 64% to 4F12, 57% to AC011537 exact intron boundaries need checking no ESTs MLPITDRLLHLLGLEKTAFRIYAVSTLLLFLLFFLFRLLLRFLRLCRSFYITCRRLRCFPQPPRRNWLLGHLGMVS PNEAGLQDEKKVLDNMHHVLLVWMGPVLPLLVLVHPDYIKPLLGAS AAIAPKDDLFYGFLKPWLG DGLLLSKGDKWSRHRRLLTPAFHFDILKPYMKIFNQSADIMH AKWRHLAEGSAVSLDMFEHISLMTLDSLQKCVFSYNSNCQE KMSDYISAIIELSALSVRRQYRLHHYLDFIYYRSADGRRFRQACDMVHHFTTEVIQERRR ALRQQGAEAWLKAKQGKTLDFIDVLLLAR DEDGKELSDEDIRAEADTFMFEG HDTTSSGISWMLFNLAKYPEYQEKCREEIQEVMKGRELEELEW DDLTQLPFTTMCIKESLRQYPPVTLVSRQCTEDIKLPDGRIIPK GIICLVSIYGTHHNPTVWPDSK VYNPYRFDPDNPQQRSPLAYVPFSAGPR NCIGQSFAMAELRVVVALTLLRFRLSVDRTRKVRPELILRTENGLWLKVEPLPPRA*
>CYP4F23P AC011492 assembled gene 76% to 4F3, 76% to 4F8, 76% to 4F11, 73% to 4F2, 75% to 4F12, 77% to 4F11, 60% to other 4F on this accession no ESTs MSLLSLSWLGLGPVAASPWLLLLLVGASWLLARVLAWTYAFYDNCHRLQCFQQPPKRNCF*GHLSLVS GNEEDMRLMEDLGHYFRDVQLWWLGSFYPVLHLVHPTFTAPVLQAS AAVALKDMSFYGFLKPWLG DGLLISAGDKWRWHRHLLTPAFHFKILKPYVKIFNESTNIMH AKWQRLALEGSVRLEMFEHISLMTLDSLQKCIFSFDSNCQE KPSEYIDAILELSALSLKRHQHIFLLTDFLYFLTPNGRRFCRACDIVHNFTDAVIQERRR TLTSQGVDDFLQAKAKSKTLDFIDVLLLAK DENGKKLSDENIRAEADTFMSG GHDTTASGLSWVLYNLARYPEYQEHCRQEVQELLKNGDPKEIEW DDLAQLPFLTMCLKESLRLHSPVSRIHRCCPQDGVLPDGRVIPK GNTCTISIFGIHHNPSVWPDPEV YDPFRFDPENLQKTSPLAFIPFSAVPR NCIGQTFAMAEMKVVLALTLLRFRVLPDHAEPRRKLELIVRAEDGLWLRVEPLSADLQ*
According to the April 6 issue of Nature, Monsanto has sequenced all 12 chromosomes of rice to around 5X coverage and will release the data to the International Rice Genome Sequence Project (IRGSP) with no strings attached. The IRGSP will combine this new data with their own and release it to public databases. The exact timetable for this release is not given. For a news item from the IRGSP click here. The gist of this article is that the transfer of the data from Monsanto will take place in May and June and will accelerate completion of the genome, but it still might take 3 years. The raw Monsanto data will not go right into public databases. It seems that only finished segments of the genome will be released. What would Craig Venter do with a 5X coverage genome? See below.
On April 7, before a congressional hearing, J. Craig Venter said Celera was through sequencing human DNA and would switch to mouse. He said they would begin assembling the human genome from the sequence data accumulated so far and that the finished sequence would be ready in 3-6 weeks. Francis Collins disputed this claim, saying it was not possible. Celera's stock fell 20% in one day after Collins' remarks.
New 4 Family tree
A second tree covering the remaining sequences including the 6, 9, 12 and 28 families is also here
New 6 and 9 Family tree
There are now 69 N-terminal sequences in an alignment under the Drosophila button on the main page. That means there are at least that many P450s in Drosophila, and probably a few more will be found, since C. elegans had 80 P450s. The FASTA file of P450 sequences has all the fragments I have assembled so far, and 50 full length sequences are available there. This file is a working scratch pad, so bear with me until I get them all in final order. There are a couple of sequences that are so different from existing P450s that I have not been able to assemble the middle section. Even though it is present in the sequence data, I cannot identify the exons.
Celera Genomics continues to deposit new Drosophila sequences into Genbank. On Nov. 16, 1612 fragments were deposited. I am trying to catch up to these. See the N-terminal page for an alignment of 65 Drosophila N-terminal sequences. The Drosophila list is also being updated. See the accession numbers beginning with AC0#####* followed by an asterisk. These are newly added since the July 14 revision of this page.
Celera Genomics deposited 551 Drosophila sequences into Genbank on Nov. 3, 1999. These represent about 10 million bases of sequence. The remaining Drosophila genome sequences from Celera should be in Genbank by the end of the year. This will allow identification of all the P450s in a macroscopic animal.
The Dictyostelium discoideum genome project has been making progress. In Nature 401, page 440 (30 Sept 99), the current status is given as "Over two-fold coverage of the 34 Mb genome is now available and there is already information on at least 90% of the genes." I have searched the Jena web Blast server for new sequence extensions of the 18 different P450 genes I had found earlier. Some of these are now complete sequences based on ESTs and shotgun sequences. A complete CYP51 is given. There is a complete CYP508A1 and a partial CYP508B1. At least two other sequences are complete or nearly so. I have also translated all the new P450 hits found on the Jena server and sorted them into contigs. There are now 67 different P450 contigs in Dictyostelium, though there will be fewer genes than this because some are non-overlapping N- and C-terminals of the same gene. Even though slime molds are multicellular only part of the time, they seem to have many P450s, possibly more than 30. See the Lower Eukaryote option on the homepage table.
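As a rough guide to what "two-fold coverage" means, the idealized Lander-Waterman (Poisson) model says that with reads placed uniformly at random, the expected fraction of bases hit at least once at coverage c is 1 - e^(-c). The Python sketch below evaluates that formula. It is a textbook approximation under stated assumptions, not the Dictyostelium project's own calculation, and the function name is an illustrative choice.

```python
import math


def expected_fraction_covered(coverage: float) -> float:
    """Expected fraction of bases sequenced at least once.

    Idealized Poisson model: reads land uniformly at random, so the chance
    that a given base is missed entirely is exp(-coverage).
    """
    if coverage < 0:
        raise ValueError("coverage must be non-negative")
    return 1.0 - math.exp(-coverage)


for c in (1, 2, 5, 10):
    print(f"{c:>2}X coverage: {expected_fraction_covered(c):.1%} of bases expected")
```

Under this model, two-fold coverage touches a little under 87% of the bases. A gene only needs to be partly hit to be detected, so a figure of "at least 90% of the genes" is plausible at that depth.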
A new human CYP family CYP39A is now represented by a complete sequence. This was assembled over several days by using every trick in the book to find all the pieces. I suspect this will be an important P450, with an ancient history, probably acting on sterols. The gene is also found as ESTs in mouse and chicken. See CYP39A1 in this file.
A new human P450 has been found named CYP26B1 that is 44% identical to 26A1 from mouse and humans. The sequence was found in genomic DNA AC007002. The reference is in press.
Nelson, D.R. (1999) A second CYP26 P450 in humans and zebrafish: CYP26B1.
Archives of Biochemistry and Biophysics in press.
The server has been upgraded from an ancient Quadra 650 circa 1994 to a PowerMac 7100. This is not new, but it is a step up from the old Quadra. You should notice some improvements in speed. If you have any trouble accessing any part of this site let me know so I can check it.
A new tree of the CYP81 family is posted under the trees button. This does not contain all CYP81 sequences, but it has all the subfamilies represented.
Soybean has a large number (86) of P450-containing ESTs. I have collected these and incorporated them in a list with 22 full length soybean P450s. This is a new format that uses the Entrez Nucleotide query output for each entry. These have been modified to include my own annotation and translations to make them more useful. The sequences have been sorted by family. This format provides easy access to the nucleotide and protein sequences (if available) from the Genbank records by a single mouse click. Let me know how you like it; it was a lot of work to make. See the soybean button under plants.
7938 new Arabidopsis ESTs were deposited in Genbank on Sept. 8. I have looked at them by Blast searches over the weekend and I found 57 that contain P450 sequences. 47 of these are close or exact matches to known P450s from Arabidopsis (96% or better identity). Nine ESTs are new sequences that range from 60% to 93% identical to known P450s. One is a poor sequence that is probably a 93D1 EST. See New Arabidopsis ESTs
On Sept. 8, 1999 Genome Systems, Inc., a wholly owned subsidiary of Incyte Pharmaceuticals, Inc. deposited 7938 Arabidopsis mRNA sequences into Genbank. These ESTs run from AI992384-AI999813 and from AW004082-AW004589. This is about 4 - 4.5Mb of sequence. Currently, these cannot be searched for P450 genes by BLAST. It may take a day to update the searchable database so these sequences are there. I suspect there will be some new P450s in this data set.
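The two accession ranges quoted above can be checked against the stated total of 7938 sequences with a few lines of Python. This is only a sketch: it assumes each range is contiguous and inclusive, and the helper name count_accessions is hypothetical.

```python
import re


def count_accessions(first: str, last: str) -> int:
    """Number of accessions in an inclusive range that shares one letter prefix."""
    m_first = re.fullmatch(r"([A-Z]+)(\d+)", first)
    m_last = re.fullmatch(r"([A-Z]+)(\d+)", last)
    if not m_first or not m_last or m_first.group(1) != m_last.group(1):
        raise ValueError("range endpoints must share the same letter prefix")
    return int(m_last.group(2)) - int(m_first.group(2)) + 1


ranges = [("AI992384", "AI999813"), ("AW004082", "AW004589")]
counts = [count_accessions(first, last) for first, last in ranges]
print(counts, sum(counts))
```

The two ranges hold 7430 and 508 accessions, which sum to the 7938 ESTs reported. Dividing 4 to 4.5 Mb by 7938 gives an average of roughly 500 to 570 bases per EST.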
The Cytochrome P450 homepage has a new look. A table has replaced the 15 gif images that used to serve as buttons. This should shorten loading time and improve the feel of the page. It will also allow addition of new buttons for different species as time goes on. (There are many soybean ESTs, and I have not worked on zebrafish yet)
A tomato EST project has generated more than 26,000 ESTs, with 8616 deposited in March, 11789 in June, and 5394 in July 1999. These have now been searched for P450-containing sequences. There are 235 P450-coding tomato ESTs. These sort into 58 contigs that include 5 complete genes (CYP51, 73A24, 76A6, 88B1 and 707A4) not known previously from tomato. For more info see the Tomato P450 page.
A new 68 sequence tree with the CYP4 family is given. This has 63 4 Family sequences and 5 additional sequences. The 48 family and the CYP4P subfamily are at a boundary between the insect cluster and other animal CYP4 members. On earlier versions of this tree they fell outside the 4 cluster. Here they are just inside. A 4 family tree with emphasis on insects, August 26, 1999 [PDF]
39 P450-containing ESTs from Zea mays have been translated and sorted by family. See Zea mays ESTs with P450s.
A note on Plant P450 evolution: new ESTs from pine are clarifying how old some plant P450 families are.
The diversity of plant P450s has been displayed in a new table that lists plant higher level taxonomic groups and shows which groups have known P450 sequences. The more than 400 plant P450s belong to 65 species. These include 3 conifers and 62 angiosperms. Ten species of monocots have known P450 sequences. Seven of these are crop plants among the grasses (Poaceae). The majority of species are among the eudicots, with 22 in the eurosids (11 in Fabales) and 24 in the asterids (10 in Solanales). P450s are known from only 19 of the 60 higher order groups listed. Go to the table.
The plant P450 database has been updated (see the Databases button on the home page). Plants now have over 400 known P450s. 212 are from Arabidopsis. The Deep Green meeting in St. Louis described in today's Science has found that plants should be split into three kingdoms, not just one. Green, brown and red plants split at about the same time from a common ancestor. Each of these divisions should be accorded kingdom status. This is different from the usual view, and it will be interesting to see how this sorts out. I will be interested to know which P450s are specific to each clade, and which predate the split.
A press release of July 28 from Celera stated that one million sequences (500 million bp of sequence) have been completed from the Drosophila genome. Below is a quote from the press release. "Celera expects to complete the random sequencing phase of Drosophila in early September when it will begin sequencing the human genome. This will entail completing another 2 million sequences-or about 1 billion letters of genetic code. Working with the Berkeley Drosophila Genome Project (BDGP), Celera will then fill gaps and resolve ambiguities in the sequence to produce finished sequence. Celera will begin making sequence data available to the public in October 1999, and anticipates release of the completed sequence by the end of the year and publication in collaboration with the BDGP in early 2000."
Celera Genomics will finish the Drosophila genome in August or September according to an NPR interview with Craig Venter that aired on July 27. It was not mentioned when the sequence data will be posted to Genbank. The rice P450 page has been updated. There are some new ESTs and some contigs have been joined.
A tree with 30 plant P450s covering the CYP93, 705, 706 and 712 families has been posted under the trees button on the home page. The CYP93D1 and CYP712A2 sequences had their names reversed in the bibliographic pages and FASTA files. This has been corrected.
The only possible mitochondrial P450 from C.elegans has been reexamined to better identify the intron exon boundaries based on similarity to CYP12A sequences of insects. One previously missed exon is added based on an EST sequence and another exon is extended to fill a critical gap. Only about 40% of the sequence is covered by ESTs, so five of the intron boundaries are theoretical. CELZK177 cosmid ZK177 U21321 CYP44 Probable mitochondrial P450 489 aa C70591 is an EST from the middle region that adds 49 amino acids not previously identified as coding region in ZK177. Another exon extended 13 amino acids helps fill in the missing I-helix region with DGLSTT matching with AGXDTT of the I-helix. * indicates predicted intron locations ** indicates verified introns based on ESTs. MRRSIRNLAENVEKCPYSPTSSPNTPPRTFSEIPGPREIPVIGNIGYFKYAVKS* DAKTIENYNQHLEEMYKKYGKIVKENLGFGRKYVVHIFDP*ADVQTVLAADGKTPFIVPL QETTQKYREMKGMNPGLGNL*NGPEWYRLRSSVQHAMMRPQSVQT*YLPFSQIVSN DLVCHVADQQKRFGLVDMQKVAGRWSLESAGQILFEKSLGSLGNRSEWADGLIEL NKKIFQLSAK**MRLGLPIFRLFSTPSWRKMVDLEDQFYSEVDRLMDDALDKLKVNDSDS** KDMRFASYLINRKELNRRDVKVILLSMFSDGLST*TAPMLIYNLYNLATHPEALKEIQKE IKEDPASSKLTFLRACIKETFRMFPIGTEVSRVTQKNLILSGYEVPAGTAVDINTNVL MR**HEVLFSDSPREFKPQRWLEKSKEVHPFAYLPFGFGPRMCAGRRFAEQDLLTSL AKLCGNYDIRHRGDPITQIYETLLLPRGDCTFEFKKL
Rice P450s have been updated by exhaustively searching the Genbank entries for new accession numbers. The sequences have been translated and added to the FASTA P450 rice list. See the rice button on the homepage for more info. There are 192 accession numbers for rice P450s.
Two new trees have been made with CYP4 sequences as the main content. These trees illustrate the difficulty of maintaining a sensible nomenclature when families get very large. There are some inconsistencies. The 4D subfamily has been split over time into two different clusters of sequences. Cyp4p should probably be in a separate family. The 4F subfamily has also been split. I have started using double letters, as in Cyp4aa1, for new subfamilies because the single letters have all been used. Please go to the Drosophila section to see these new trees.
A Tree of 56 Insect P450s. The tree with 56 insect P450s includes many new Drosophila sequences. Some are not yet named. This tree is based on an alignment that covers the I-helix to the ends of the sequences, since many are missing the N-terminal. The 4 family sequences are not included here. There are too many to fit, so they will be treated in a separate tree.
The Drosophila P450s have been found in Genbank by systematic BLAST searches of the nr, month, others ESTs, gss and htgs sections, using different P450 family representatives. The first search with Cyp4d2 yielded 101 new ESTs, 6 new sequences from month, one from htgs and none from gss or nr. The second search with Cyp6d2 found only 17 new ESTs and one sequence from month. The third search hit only 5 new ESTs and one sequence from nr. At this point the search was halted, since the returns were not worth the effort of scanning the output for new sequences. Some of the new sequences are very different from other P450s (AC005130) and cannot be easily assembled into a complete sequence by comparison with known P450s. I have identified exon-containing ORFs from this gene, but I cannot detect the exon boundaries. If you are brave, have a try at it. The new sequences (almost 300 total in the original FASTA file) have been compared with each other by repetitive Do-It-Yourself WU-BLASTs and condensed into 98 contigs. Ten of these are from other Drosophila species and 88 are from D. melanogaster. Given the 80 P450 genes of C. elegans, these 88 genes and gene fragments may represent nearly all the P450s from Drosophila, though some are probably N- and C-terminals of the same gene and the number of contigs will drop as the genome is completed.
The Drosophila P450 FASTA sequence file has been updated by including many new sequences starting with Genbank numbers AI (ESTs) and AL (genome survey sequences). Work continues to sort these and assign them to known sequences or related families.
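Condensing fragments into contigs, as described above, amounts to grouping sequences that overlap. The Python sketch below illustrates the idea with single-linkage grouping by union-find. It is a simplified stand-in, not the WU-BLAST procedure actually used: two fragments are linked here when they share any identical stretch of k residues. The function name, the value k = 8 and the three toy fragments are all assumptions made for illustration.

```python
def cluster_fragments(fragments: dict[str, str], k: int = 8) -> list[set[str]]:
    """Group fragments that share an identical k-residue stretch (single linkage)."""
    parent = {name: name for name in fragments}

    def find(x: str) -> str:
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    seen: dict[str, str] = {}  # k-mer -> first fragment in which it appeared
    for name, seq in fragments.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in seen:
                parent[find(name)] = find(seen[kmer])  # merge the two groups
            else:
                seen[kmer] = name

    groups: dict[str, set[str]] = {}
    for name in fragments:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


# Toy fragments: frag_a and frag_b overlap, frag_c is unrelated.
toy = {
    "frag_a": "MRRSIRNLAENVEKCPYSPT",
    "frag_b": "ENVEKCPYSPTSSPNTPPRT",
    "frag_c": "FDPSRFAPGSAQHSHAFLPF",
}
for group in cluster_fragments(toy):
    print(sorted(group))
```

Exact k-mer sharing is far stricter than a BLAST comparison, which tolerates mismatches. Non-overlapping N- and C-terminal pieces of one gene would still end up in separate groups, which is the reason given above for expecting the contig count to drop.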
I have done a preliminary search of the new Drosophila sequences and have translated many of them in the file Drosophila P450s. I am also finding some older P450s that have come out since I last worked on the Drosophila sequences.
On May 28, 1999 28,049 Drosophila genome survey sequences were deposited from Genoscope in France. These are BAC end sequences. The percent of the Drosophila genome sequenced as reported at the MOT tables jumped from 15% to 24%. I have not had a chance to search these for P450 hits, but there should be a number of new P450s in this large sequence collection of 9% of the Drosophila genome.
NCBI has started a new service called Locus Link. This is an attempt to collect information about a family of proteins/genes on a single page with links to the relevant databases like OMIM, Genbank, Unigene etc. There is a cytochrome P450 page. Some of the information comes from the Cytochrome P450 homepage, which is listed as a collaborator. The list of collaborators is short, with only seven names, including several major databases. The cytochrome P450 homepage is honored to be among these sources. Locus Link P450 page at NCBI
Having trouble with those pesky human P450 polymorphisms? Try The Official Human P450 Allele Nomenclature Site (in Sweden).
A possible chloroplast P450 has been reported from almond. This sequence is probably in the CYP71D subfamily. Therefore, the whole 71D subfamily might be localized in the chloroplast. Chloroplast P450
A large cluster of Arabidopsis P450s has been found on AB024038. This sequence contains 14 different P450s and only two were known previously. See CYP86C2 and CYP71B16 to CYP71B26. CYP71B3 and CYP71B4 were already known.
A discussion of early eukaryotic P450 evolution is available looking at P450s in fungi and the slime mold Dictyostelium. Read Me
The CYP72 family has been sequenced in Arabidopsis (accession number AB023038) and there are 8 genes and one pseudogene present. This is similar in size to the Arabidopsis Z97338 gene cluster of CYP702 and 705 genes that has 8 genes and two pseudogenes. The spacing between the genes is less than 1000bp except for one gap of about 6000bp. See the nomenclature files CYP72A7 to CYP72A15 for nucleotide positions. The sequences have been placed in the FASTA Arabidopsis file.
An estimate has been made for the number of P450 genes in Arabidopsis (372). See P450s sorted by family.
Nine new Arabidopsis P450s have been named 71A16, 79F1, 79F2, 705A5 (now complete), 705A10P, 705A11P, 705A12, 707A3 and 708A2.
Rat UNIGENE entries have now been included as a separate file to complement the mouse and human files.
Lists have been compiled of all the human and mouse P450s in the UNIGENE database. These do not include all pseudogenes, but most of the normal genes are present. The human UNIGENE is much more complete than mouse, with only 3 of 48 human genes not having a UNIGENE entry. Click on the mouse and human buttons to go to these files. In the files, the UNIGENE entries are hyperlinked to take you there immediately.
I have decided to make a data giveaway on human P450s lying undiscovered in the EST database. I did a search for these in 1995 and have been sitting on them for several years. About half are now cloned, but half are not. So here they are! New human P450s enjoy.
I have begun a housekeeping reorganization of the P450 homepage. It has grown to the point of becoming cluttered and hard to use. To make access to desired sections easier, I have added buttons to link to different subsections of the site. The new homepage also has a name http://drnelson.utmem.edu/CytochromeP450.html that is easier for search engines to locate. It may be a few days before all the subsections are properly configured.
All rice P450 fragments in the EST and GSS databases have been found, translated and sorted according to their best matches to other P450s. There are 134 of these fragments. Rice Cytochrome P450s
Plant P450s have been updated and the PlantCytochrome P450 Databasehas been completed. All the files have been made so the hyperlinks work. Public and Confidential sequence lists are available for each subfamily. The total count for plant P450s is now 289. This does not count all the ESTs and fragments produced from the Genome Survey Sequences. Lower eukaryote P450s have been updated and the Lower Eukaryote Cytochrome P450 Database has been completed. All the files have been made so the hyperlinks work. Public and Confidential sequence lists are available for each subfamily. The total count for lower eukaryote P450s is now 58. This does not count all the ESTs and fragments from genome projects on Dictyostelium discoideum(24) or trypanosoma cruzi(2).
I have just finished an analysis if Arabidopsis P450s and estimate there are a minimum of 137 P450 genes in Arabidopsis. All ESTs are translated and genes assembled from genomic sequence. See Cytochrome P450 ArabidopsisLinks.
An alignment of 60 of the 80 C. elegans P450s is posted, along with 45 other sequences from many other families. The 20 C. elegans sequences that are not included in the alignment are either pseudogenes (CYP13A9P, CYP25A6P, CYP33C10P, CYP33E3P CYP35D2P, or they are very similar to sequences in the alignment, such as the CYP13A2-CYP13A8, CYP13A10, CYP14A3-CYP14A5, CYP25A3-CYP25A5 sequences. CYP33C2 is the only sequence still missing the C- terminal at the end of a contig, so there might be more CYP33C genes in the missing region. .
The C. elegans genome is now nearing completion. I have just done a comprehensive search of C. elegans for P450s and named all the new sequences. There are 80 genes for P450s in C. elegans. Some of them are in large clusters of 10-13 genes, some of which are probably in operons. A small number (6-7) are pseudogenes. Since the whole genome of C. elegans is predicted to have about 16,000 genes, these P450s represent an unusually large subset comprising about 0.5% of the total. Some P450 clusters lie within clusters of olfactory receptor-related genes. Perhaps the P450s are in some way acting in concert with the olfactory receptors. Here is an excerpt from the bibliographic page with these sequences. There are 80 C. elegans P450s listed here, surpassing the known mouse and human complements. The C. elegans genome is now officially 77% complete. However, the amount of sequence in the Blast searchable database at Washington Univ. is 117Mb, more than the 100Mb size of the genome. Therefore, we can guess that this set includes all the P450 genes in C. elegans, although the distribution across chromosomes is not even. Most P450 genes (43 genes) are on chromosome V. For additional info on C. elegans P450s see this list. To see the actual sequences go to the C. elegans sequence file. So far we are missing CYP11A and CYP11B, CYP17, CYP19, CYP21, CYP24 and CYP27A and CYP27B. Does C. elegans make steroids? The present evidence would suggest not. Does C. elegans have mitochondrial P450s? There is one probable mitochondrial P450 in C. elegans on cosmid ZK177 named CYP44. This sequence is incomplete, missing part of the I-helix and some sequence upstream of that. It probably cannot code for a functional protein.
I have searched for new cytochrome P450s in Arabidopsis and found 37 full length P450s that are new or that complete fragments from ESTs. There are also 38 new ESTs for Arabidopsis P450s. New Arabidopsis sequences. There is a new file with the protein sequences of public Arabidopsis P450s: FASTA Arabidopsis sequences.
The crystal structure of CYP55A1 has appeared in Nature Structural Biology volume 4, number 10, October 1997 pp. 827-832. This enzyme is a nitric oxide reductase. The source is fungal (Fusarium oxysporum), though the protein is soluble and falls in phylogenetic trees in the CYP105 family. This probably represents a lateral gene transfer from a bacterial species like Streptomyces.
On naming P450s, problems and approaches . The four tables of the P450 database (plants, animals, bacteria and lower eukaryotes) are under construction. It may take a while to get the data entered that will make these tables useful. If you have additional suggestions, let me know.
. The Arabidopsis data listed for Arabidopsis families and Arabidopsis ESTs has been updated.
. I have revised the genes per species data listed here.
. Francis Durst has asked me to post an announcement and web link for the IVth International Symposium on Cytochrome P450 Biodiversity and Biotechnology to be held in Strasbourg, France July 12-17 1998. I wouldn't want to miss this one. Jump to Strasbourg .
. The new C. elegans sequences have been assigned names with 17 members in the CYP33 family and 9 members in the CYP35 family.
. I received permission to put the C. elegans translations on my webpage. To see the sequences that are presented in the same order as the list below click here .
*** Note: The CYP1B1 gene has been linked to primary congenital glaucoma. *** See the April 1997 issue of Human Molecular Genetics.
. After much effort, I have pieced together 71 cytochrome P450 protein sequences starting from C. elegans genomic sequences. Some have been in the GenBank database for some time. Many others are unfinished cosmid sequences in the ftp repositories at the Sanger Center and the Wash U. Genome Sequencing Center. These latter sequences were downloaded and translated in three frames to identify where the beginning and end of each sequence lie. The region was then fed through the Baylor College of Medicine GeneFinder website set for nematode sequences. If this successfully assembled the gene, I stopped there. If not, then I did it the hard way by inspection of the three translated frames to find the most probable protein sequence. These sequences may not have the correct intron exon structures, but they should be pretty close. I should mention that the GeneFinder program was not very successful in constructing P450 genes. I had to do most of them by myself. To see a list click here. I will try to post the actual sequences soon if I get permission from the sequencers. THIS REPRESENTS THE LARGEST COLLECTION OF P450S FROM A SINGLE ORGANISM. The genome of C. elegans is about 67% complete, so there might be as many as 100 P450 genes in C. elegans. However, a disproportionate number are on chromosome V for some reason. This may skew the estimate, so there might be many fewer than 100. The odd distribution is noticeable on CHR I also. There is only one P450 on CHR I so far.
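The three-frame translation step mentioned above can be sketched in Python. This is only a minimal illustration and not the procedure actually used for the cosmid sequences: the function names (translate, three_frame_translation, longest_open_stretch) are hypothetical, the standard genetic code is assumed, only the forward strand is translated, and introns are not handled, so a real gene will appear as pieces spread across different frames.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame=0):
    """Translate one forward reading frame (0, 1 or 2) of a DNA string.

    Stop codons become '*'; codons with ambiguous bases become 'X'.
    """
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame_translation(dna):
    """Return the translations of the three forward frames as a list."""
    return [translate(dna, frame) for frame in range(3)]


def longest_open_stretch(protein):
    """Return the longest run of residues between stop codons ('*')."""
    return max(protein.split("*"), key=len)


if __name__ == "__main__":
    # Toy sequence (hypothetical, not a real cosmid fragment).
    for frame, protein in enumerate(three_frame_translation("ATGGCCTGA")):
        print(frame, protein)
    # prints:
    # 0 MA*
    # 1 WP
    # 2 GL
```

Inspecting the three outputs side by side for long stretches without stops, and for where P450-like sequence jumps from one frame to another, is roughly what "doing it the hard way" amounts to.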
. I have started assembling a Rogue's Gallery (and email directory) of P450 researchers. As of August 28, 1996 there are 231 entries (and 47 pictures). If you dare to be seen in such a collection, send me your picture and any comment you might want to appear with it. You can compress a .gif file with BinHex and email it to me at: firstname.lastname@example.org or you can send a photo in the mail to: Dept. of Biochemistry, University of Tennessee, 858 Madison Ave., Memphis, TN 38163. To go to the Rogue's Gallery click here. I would advise you to turn off the autoload images option on your browser, otherwise all the images will be downloaded automatically. . My research interests . Biosketch . For Mitochondrial carrier info click here .
. In looking for new bacterial P450s I found 19 P450s that are new to the alignment. Mycobacterium tuberculosis has at least eleven P450 genes, a bacterial record. One shows some similarity to CYP51 (34% identity) and one must wonder if it has a similar function.
. I have just analyzed 163 Arabidopsis expressed sequence tags found in the ESTdb. These have all been translated and aligned with the other plant P450s in a new plant-only alignment of 125 public sequences. 39 confidential sequences have been removed from a comprehensive 165-sequence plant alignment. Some interesting features are noted. A CYP72 sequence has been assembled from Arabidopsis ESTs. Two EST fragments are CYP51 sequences for lanosterol 14 demethylase; one is from Zea mays.