CYP27C1 construction

A new human P450 was found in a BLAST search of the HTGS section of GenBank using 
CYP27A1 as the query.  The sequence that was found was not complete.  Various strategies 
were used to try to complete the sequence of this gene.

BLAST output from the original search, sorted into the correct sequence order:
gb|AC027142.1|AC027142 Homo sapiens clone RP11-30F3, WORKING DRAFT SEQUENCE, 40 unordered pieces
              Length = 188888
              

Score = 51.9 bits (122), Expect = 1e-04
 Identities = 23/55 (41%), Positives = 36/55 (64%)
 Frame = -2

Query: 84    LQVLYKAKYGPMWMSYLGPQMHVNLASAPLLEQVMRQEGKYPVRNDMELWKEHRD 138
             LQ  +  +YG ++ S+ GPQ  V++A   ++ QV+R EG  P R +ME W+E+RD
Sbjct: 39568 LQQKHTREYGKIFKSHFGPQFVVSIADRDMVAQVLRAEGAAPQRANMESWREYRD 39404

Score = 58.6 bits (139), Expect = 1e-06
 Identities = 24/66 (36%), Positives = 43/66 (64%)
 Frame = -2

Query: 150   EGHHWYQLRQALNQRLLKPAEAALYTDAFNEVIDDFMTRLDQLRAESASGNQVSDMAQLF 209
             EG  W ++R  L QR+LKP + A+Y+   N+VI D + R+  LR+++  G  V+++  LF
Sbjct: 43984 EGEQWLKMRSVLRQRILKPKDVAIYSGEVNQVIADLIKRIYLLRSQAEDGETVTNVNDLF 43805

Query: 210   YYFALE 215
             + +++E
Sbjct: 43804 FKYSME 43787

Score = 68.3 bits (164), Expect = 1e-09
 Identities = 29/71 (40%), Positives = 49/71 (68%), Gaps = 4/71 (5%)
 Frame = -2
Query: 217   ICYILFEKRIGCLQRSIPEDTVTFVRSIGL---MFQNSLYATFLPKWTRPVLPF-WKRYL 272
             +  IL+E R+GCL+ SIP+ TV ++ ++ L   MF+ S+YA  +P+W RP +P  W+ + 
Sbjct: 41743 VATILYESRLGCLENSIPQLTVEYIEALELMFSMFKTSMYAGAIPRWLRPFIPKPWREFC 41564

Query: 273   DGWNAIFSFGKKLID 287
               W+ +F F K+ I+
Sbjct: 41563 RSWDGLFKFSKRRIE 41519

Score = 52.7 bits (124), Expect = 6e-05
 Identities = 24/56 (42%), Positives = 38/56 (67%)
 Frame = -1

Query: 340    TSNTLTWALYHLSKDPEIQEALHEEVVGVVPAGQVPQHKDFAHMPLLKAVLKETLR 395
              TS TL+W +Y L++ PE+Q+ ++ E+V  +    VP   D   +PL++A+LKETLR
Sbjct: 110201 TSFTLSWTVYLLARHPEVQQTVYREIVKNLGERHVPTAADVPKVPLVRALLKETLR 110034

 Score = 71.8 bits (173), Expect = 1e-10
 Identities = 35/76 (46%), Positives = 49/76 (64%)
 Frame = -3

Query: 418    FPKNTQFVFCHYVVSRDPTAFSEPESFQPHRWLRNSQPATPRIQHPFGSVPFGYGVRACL 477
              FP+ TQ   CHY  S     F   + F+P RWLR       R+ + FGS+PFG+GVR+C+
Sbjct: 108006 FPQ-TQLALCHYATSYQDENFPRAKEFRPERWLRKGD--LDRVDN-FGSIPFGHGVRSCI 107839

Query: 478    GRRIAELEMQLLLARL 493
              GRRIAELE+ L++ ++
Sbjct: 107838 GRRIAELEIHLVVIQV 107791


The sequence is missing many parts, including the N- and C-termini and some internal 
sequence.  The job of the annotator is to assemble the gene. 
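
The size of each internal hole can be read off the query coordinates of the five 
alignments above.  The following is a minimal sketch of that bookkeeping; the function 
name internal_gaps is my own, and the coordinates are copied from the "Query:" lines.

```python
# Query (CYP27A1) start and end residues of the five alignments,
# copied from the "Query:" lines of the BLAST output above.
hsps = [(84, 138), (150, 215), (217, 287), (340, 395), (418, 493)]


def internal_gaps(hsps):
    """Return (last_covered, next_covered, n_missing) for neighbouring hits."""
    ordered = sorted(hsps)
    gaps = []
    for (_, end), (start, _) in zip(ordered, ordered[1:]):
        missing = start - end - 1  # query residues not covered by either hit
        if missing > 0:
            gaps.append((end, start, missing))
    return gaps


gaps = internal_gaps(hsps)
for end, start, missing in gaps:
    print(f"query residues {end + 1}-{start - 1} not covered ({missing} aa)")
```

Residues 1-83 of the query, and everything after residue 493, are also uncovered; these 
are the missing N- and C-terminal parts.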

To do this I started with the N-terminal.  I did BLAST searches of human ESTs and got 
no matches.  I next tried mouse ESTs, since there is a large amount of mouse EST data, 
and again got no matches.  This seemed odd, since there are almost 3 million mouse and 
human ESTs.  Even though I thought it would not work, I tried other ESTs.  This time I 
hit a strong match to a Xenopus egg library EST.  The match had 81% identity.  Because 
the P450s are more highly conserved in their C-terminal half, this suggested that the 
overall protein identity would be higher than 81%.  The two vertebrate lineages split 
over 300 million years ago.  This implies an important function for the new human P450, 
since it has been conserved over such a long time with such high percent identity.  It 
may also be significant that the EST was from an egg library: the protein may function 
during development.

EST from Xenopus conserved to 81%
gb|AW637606.1|AW637606 bl60c07.w1 Blackshear/Soares normalized Xenopus egg library Xenopus
           laevis cDNA clone PBX0060C07 5'.
           Length = 434
           
 Score =  101 bits (249), Expect = 8e-22
 Identities = 45/55 (81%), Positives = 52/55 (93%)
 Frame = +1

Query: 1   LQQKHTREYGKIFKSHFGPQFVVSIADRDMVAQVLRAEGAAPQRANMESWREYRD 55
           +QQKHTR+YG+IFKSHFGPQFVVSIAD+D+VAQV+RAE  APQRANMESW EYR+
Sbjct: 136 IQQKHTRQYGRIFKSHFGPQFVVSIADKDLVAQVIRAERDAPQRANMESWHEYRE 300
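
The identity figure can be checked by hand from the two aligned lines.  This is a small 
sketch, assuming an ungapped alignment like the one above; count_identities is a helper 
name chosen for illustration, and the percentage is truncated the way BLAST reports it.

```python
def count_identities(a, b):
    """Count positions where two equal-length aligned sequences agree."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return sum(x == y for x, y in zip(a, b))


# Query and Sbjct lines copied from the alignment above.
human = "LQQKHTREYGKIFKSHFGPQFVVSIADRDMVAQVLRAEGAAPQRANMESWREYRD"
xenopus = "IQQKHTRQYGRIFKSHFGPQFVVSIADKDLVAQVIRAERDAPQRANMESWHEYRE"

identities = count_identities(human, xenopus)
percent = identities * 100 // len(human)  # truncated to a whole percent
print(f"Identities = {identities}/{len(human)} ({percent}%)")
```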

Translation of this EST 
EAEGELGARAKEAPMMKSLKDMPGPSTLANLVEFFWRDGFGRIHE 
IQQKHTRQYGRIFKSHFGPQFVVSIADKDLVAQVIRAERDAPQRANMESWHEYRE
LRGRSTGLISA 
EGEKWLNMRSVLRQKILRPRDVAMYTGGVNEGI

By looking at the nucleotide numbers on this EST I could see that it would extend in 
both directions.  The translation is shown above.  The human sequence I had seemed to 
stop at the 5 prime end: it did not match the EST upstream, so there was probably an 
intron there.  To see if I could find the next exon upstream I blasted the HTGS human 
sequences again with the new N-terminal of the Xenopus EST.  

I hit a good match on a different contig around 85000 in the sequence.  I got this 
region of the DNA and translated it.

3 frames N-term region 
LNLPLCS*KINVGRARQHFFFFKITASVIPATREAEVRGSLGPWGRRLLSRNRAITLQLW*QSETVSKKKKKKKKMLSNNTL*TEK
NLIFYV*MHESQRFEIFK*CNCSEQFNY*LGKSSLP*NHMAHLWPFPCSIPPCAATFLCNQALGSESRLADSSPRLGMEGASLPQA
LCNRTRPCEGPDKRPMLKVAHPHLVLTASNGVRNKSDEPEESHRKRL*LSLRKLVAR**CGHFTTVGTTCKIVSLFSPPLKVELQ*
FKLKP*RTASKMQKRPKISSVYLQ*SFRRTVIIYKIPVDFSFFGHGWKWG*P*NLVVSRNA*V*INMM*RNNNSLILLQSNHSLFP
SPRGGSL**GSV*TDAPAAGVRRSTALTS*DAGNSNPKSSPRNKVAQTNKPMQTSA  
MALLARILRAGLRPAPERGGLLGGGAPRRPQPAGARLPAGARAEDKGAGRPGSPPG  
GGRAEGPRSLAAMPGPRTLANLAEFFCRDGFSRIHE  IQVARAAPGGSQPGLLPRR
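
A three-frame translation like the one above can be produced with a few lines of code.  
This is a minimal sketch using the standard genetic code, not the tool that was used 
for these notes; the function names and the toy DNA string are illustrative only.  
Because the original hits were on minus frames, a reverse-complement helper is included.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate from the first base; stops print as '*', unknown codons as 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def reverse_complement(dna):
    """Return the opposite strand, read 5 prime to 3 prime."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def three_frames(dna):
    """Translate the given strand in reading frames 1, 2 and 3."""
    return [translate(dna[offset:]) for offset in range(3)]


toy_dna = "ATGGCTTAAG"  # hypothetical sequence, not from the contig
for number, protein in enumerate(three_frames(toy_dna), start=1):
    print(f"frame +{number}: {protein}")
for number, protein in enumerate(three_frames(reverse_complement(toy_dna)), start=1):
    print(f"frame -{number}: {protein}")
```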

The region that matched is separated 

By looking upstream I could see the whole N-terminal back to the start Met.

Now looking at the other end of the N-terminal I could see there was an 11 amino acid 
gap between the end of the blast output and the next fragment.  

By looking at the translation of this region in the human DNA sequence, it was possible 
to extend the protein translation by comparison with the Xenopus sequence out to the 
end of the exon.

DISLLLLHTAKVVLKTYP*RIRKLP*SCL*SQHFGRLRQEDHLRPRVRDQPGNIERPLFLQKIKN*LGVEVCTCSPSYLGG*GGRI
AWAQELEAAVSYVYATALQPGQQSKTLSLKNKQTNKQTNKRKQQKRDGYVKTCLSQQYVNYCSDSVLC*IALPSAPIFCSSSTQDV
YVKGISYLCIFSFFDFSMV
LQQKHTREYGKIFKSHFGPQFVVSIADRDMVAQVLRAEGAAPQRANMESWREYRD  LRGRATGLISA
*VCGPGLSHGLEGELRGRPPPPK

The next two exon ends were at amino acids 215 and 217.  It was a simple matter to 
look at the DNA and find the GT-AG boundaries, which completed the sequence up to 
amino acid 287.  
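
Looking for GT-AG boundaries amounts to listing the GT dinucleotides just past the end 
of one exon and the AG dinucleotides just before the start of the next.  The sketch 
below shows the idea on a made-up sequence; the helper names are my own, and a real 
gene needs the extra check that the joined exons keep the reading frame and continue 
to match the related protein.

```python
def dinucleotide_positions(dna, motif, start, end):
    """Return 0-based positions of a two-base motif starting within dna[start:end]."""
    dna = dna.upper()
    last = min(end, len(dna) - 1)
    return [i for i in range(start, last) if dna[i:i + 2] == motif]


def splice(dna, donor, acceptor):
    """Remove the intron running from the GT at donor through the AG at acceptor."""
    return dna[:donor] + dna[acceptor + 2:]


# Hypothetical toy gene: exon, intron (lower case), exon.
toy = "ATGGCCAAG" + "gtaagtcttttctcag" + "GATTGGCGA"

donors = dinucleotide_positions(toy, "GT", 9, 25)     # candidate intron starts
acceptors = dinucleotide_positions(toy, "AG", 9, 25)  # candidate intron ends
print("candidate GT donors:", donors)
print("candidate AG acceptors:", acceptors)
print("spliced:", splice(toy, donors[0], acceptors[-1]))
```

Each GT and AG pair is only a candidate; the pair to keep is the one whose spliced 
product translates into sequence that lines up with the query protein.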

There was a small gap of about 51 amino acids in this region to get to the next exon.  
I used the CYP27A1 sequence for this region and tried every trick I knew to find the 
missing piece, but I could not find it.  I tried the typical BLAST of the human HTGS 
section of GenBank with the missing part of 27A1 and got nothing.  I translated the 
DNA sequence upstream and downstream of the two exons and BLAST searched the protein 
sequence against the EST database to see if any sequence matched a P450, but none did.  
I even tried blasting the mouse HTGS section to spot an orthologous sequence 
that was conserved.  I had to conclude that the missing region is in a sequence gap.  

The next two BLAST output regions were both on the same contig, but they were 
separated by about 22 amino acids.  I could not find these amino acids by looking 
upstream or downstream of the existing sequences.  Because they were both on the same 
contig, the missing sequence had to be there, but I could not find it by eye.  I 
decided to use the Do-It-Yourself WU BLAST in a TBLASTN search of the DNA sequence.  I 
used only the 22 amino acid gap region of the 27A1 sequence as the query.  This did 
work and I found the missing short exon.  
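
The idea behind searching a translated contig with a short protein query can be shown 
with a much cruder method than TBLASTN.  The sketch below slides a query along one 
translated frame without gaps and counts identities only; it is a stand-in for 
illustration, not the WU BLAST algorithm, which scores with a substitution matrix and 
searches all six frames.  The query is the CYP27A1 segment from the first alignment in 
these notes, and the frame is pieced together from the translation shown earlier on the 
assumption that the pieces are contiguous in one frame.

```python
def best_ungapped_hit(query, target):
    """Slide query along target with no gaps; return (offset, identities) of the best window."""
    best_offset, best_identities = -1, -1
    for offset in range(len(target) - len(query) + 1):
        window = target[offset:offset + len(query)]
        identities = sum(q == t for q, t in zip(query, window))
        if identities > best_identities:
            best_offset, best_identities = offset, identities
    return best_offset, best_identities


# CYP27A1 residues 84-138, copied from the first alignment in the BLAST output.
query = "LQVLYKAKYGPMWMSYLGPQMHVNLASAPLLEQVMRQEGKYPVRNDMELWKEHRD"

# Pieces of one translated frame of the human DNA, copied from the notes.
upstream = "YVKGISYLCIFSFFDFSMV"
exon = "LQQKHTREYGKIFKSHFGPQFVVSIADRDMVAQVLRAEGAAPQRANMESWREYRD"
downstream = "LRGRATGLISA"
frame = upstream + exon + downstream

offset, identities = best_ungapped_hit(query, frame)
print(f"best window starts at residue {offset} of the frame, "
      f"{identities}/{len(query)} identical")
```

Restricting the query to just the missing residues, as was done here with the 22 amino 
acid region, keeps the stronger flanking matches from dominating the search.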

LFIYLP*IKETVFRF**DIRCYMELDEFDFVLCVFPGDFLPATASVSVPGVTHCLPVSATFSP*T
SFTLSWTVYLLARHPEVQQTVYREIVKNLGERHVPTAADVPKVPLVRALLKETLR*
KARQNPSTNCVP*VPGSLSQQVLQMPPRLTCSG*GCS*LLFFLCSCQNSRKTILFYGST*SLFSVKIFGPGTVAHACNPSTLGGQG
RLPEVRSSRAAWPTW*NCVSTKNTKISRAWW*APVVPATWEAGARESLEVR*WSQDCTSALQPGQQSEALSQKKKKKKKDLWSSH*
ILSLL*PIGLLGVSQNVPGSMAFDFVPLKITRNKIDLLGK*FDLFS*FADLDTNLMVLE*MFSSGFRRHPVRLNYSQTGLS*QTEK
PEGCGLSWVEGMVWVGGGIRDGLHPFPQTRHPL*GNAATKSKACFLQYILKNSFKADLIMTVNLYSMLLNLKIKKKEAAGCGVYWY
LSIKVLFRVF*DNPIAEVSFLQMHFMGP*VL*TIP**SGPVVT*NIRNLEFPSWEHNGM*KAWLCVLQMFCFGSLFHIPQNTRW*S
KPTKKQQRSLQNVSVENVPCVTF*EFLMEMGIYPGF*TFNKIL*KSPRHSRPRWLFSR
LFPVLPGNGRVTQEDLVIGGYLIPKG
VSLGPWGWPGLAGRG*IGICFYLQEVFVPKQRSIGFLFFCMYFKMKMGCWEKGREWCDLTALETYCSPARQPMLLGLLPRPCGSGQ
SPLGVSEGLDGLLSGCSDIWGAVWSGKGGAAAILGVSGPGRPPGGSTPGDCPARALGPALLLSDSAPFPFSVIPDF

That brings us to the last of the BLAST output, but there were about 30 amino acids 
missing and they seemed to be in a small exon.  The contig ended a few thousand bases 
later and I could not find the end.  I suspect it is also in a sequence gap.  Searches 
of the EST database did not find any extensions of the C-terminal.  In fact there were 
no ESTs found for any of the protein regions except for the single Xenopus sequence.  
I looked up the clone name on that sequence and it did not have a partner.

The only thing left to try is to search the DNA sequence downstream of the end for 
matches to human ESTs, which would indicate a 3 prime UTR.  This is still possible.

It is necessary to avoid searching Alu repeat regions because you can get thousands of 
matches, so you may want to turn the BLAST filter on for this kind of search.

That is the temporary end of the story for this P450 gene.  And it is the end of the 
class and the end of the course. Go out into the post genomic world and have a BLAST!