MSCI814 Module 2.

Superfamily Genomics/Eukaryotic Gene Assembly, Rhesus Monkey

David Nelson Jan. 19, 2006

This module deals with assembling genes from genomic DNA sequence. This is a little tricky, so we will begin with a newly sequenced genome that is very similar to another well annotated genome. Rhesus monkey, Macaca mulatta has been sequenced and assembled. The Macaque Genome Sequencing Consortium is led by the Baylor College of Medicine Human Genome Sequencing Center, and in collaboration with the J. Craig Venter Institute Joint Technology Center, and the Genome Sequencing Center at Washington University, St. Louis. This genome is aligned against the human genome at the UC Santa Cruz genome browser.

LINKS FOR THIS MODULE

Bioinformatics home page (Under construction)

Module 1 Intro to NCBI, BLAST of Xenopus tropicalis ESTs

Module 1 results Xenopus P450s found by the class

Rhesus monkey genome project Baylor Project page

NCBI sequence viewer for retreiving a Genbank sequence

JavaScript DNA translator For translating nucleotide sequence to protein

Human P450 Blast server For comparing a sequence against human P450s

Human P450s FASTA format

NCBI TBLASTN Server for BLASTing a protein against Genbank nucleotide sequences

Do-it-yourself WU Blast server For BLASTing a new sequence against your own set of sequences

UCSC bioinformatics server for genome browsers use BLAT search of Rhesus

Vista genome browser for comparing genomes Has some precomputed genome comparisons, but not rhesus monkey

Before we begin on Rhesus monkey, I would like to show you how well you did with last weeks assignment. I was very happy with the results of the class effort. I did some searches in the nr section of Genbank for Xenopus tropicalis P450s. I found some and added them to our growing collection. I also made some extensions to your sequences by walking the ESTs (the optional part of assignment 1). The current sequence set is at the link above called Module 1 results. The yellow regions were submitted by the class.

Starting from no sequences on Thursday afternoon last week, we now have 49 different P450 sequences from Xenopus tropicalis. 38 of these are full length P450s. Fugu (the Japanese pufferfish) has 54 and humans have 57, so the frog will probably have a similar number. We may only be missing about 5 sequences including CYP7A1, CYP11B1 and some of these (like CYP21) are not present in the frog EST collection at Genbank.

Look at the frog sequence collection. I have given the names of everyone who turned in sequences, and I have colored the regions yellow that were identified by students in this class. There is something to learn from this data. The yellow regions are mostly near the C-terminal end of the P450 sequences. That is caused by P450s having a more conserved C-terminal sequence than other parts of the protein. The middle region from about 130-300 amino acids is poorly conserved. You were asked to find the best hits in the database, so you naturally found the C-terminal parts most of the time. In some cases where your best hit was to the N-terminal part, this was due to the C-terminal part being missing in the ESTdb.

Sometimes the very last part of the sequence was not found (DT436641, CYP4T8). One to 16 amino acids were missed in the blast alignment. This is often the case. If the last few amino acids are not in a conserved motif, then they will differ between sequences, even if many other regions are highly conserved. The way to check for this is to translate the EST DNA sequence and look for the stop codon shown as an *.

The logical next step to continue your work from the first class would be to extend the sequences you found to include the whole protein coding region. This was the optional part of the assignment. Some sequences could not be completed because the EST collection is not comprehensive and some sequences are missing. CYP21 was not in the dataset, and most of CYP8A1 was missing. Only 83 amino acids at the C-terminal were found. Only ESTs were being used here, but you could easily search in nr or use genomic sequence from the GSS section of Genbank. There are no Xenopus tropicalis sequences in the HTGS or WGS sections, but there are 22.9 million in the Trace Archive.

Let us select an example sequence CYP4V (a chicken sequence) starting with the partial sequence shown below. If a BLAST is done using this sequence against EST others, limited to Gallus gallus (chicken) we find
>gi|25890949|gb|BU382948.1|BU382948  UniGene info 603858448F1 CSEQCHN75 Gallus gallus cDNA clone ChEST866o11 5'.
          Length = 798

 Score =  390 bits (1001), Expect = e-109
 Identities = 188/188 (100%), Positives = 188/188 (100%)
 Frame = +2

Query: 1   KREAFLDMLLNATDDEGKKLSYKDIREEVDTFMFEGHDTTAAAMNWVLYLLGHHPEAQKK 60
           KREAFLDMLLNATDDEGKKLSYKDIREEVDTFMFEGHDTTAAAMNWVLYLLGHHPEAQKK
Sbjct: 29  KREAFLDMLLNATDDEGKKLSYKDIREEVDTFMFEGHDTTAAAMNWVLYLLGHHPEAQKK 208

Query: 61  VHQELDEVFGNAERPVTVDDLKKLRYLECVVKEALRLFPSVPMFARSLQEDCYISGYKLP 120
           VHQELDEVFGNAERPVTVDDLKKLRYLECVVKEALRLFPSVPMFARSLQEDCYISGYKLP
Sbjct: 209 VHQELDEVFGNAERPVTVDDLKKLRYLECVVKEALRLFPSVPMFARSLQEDCYISGYKLP 388

Query: 121 KGTNVLVLTYVLHRDPEIFPEPDEFRPERFFPENSKGRHPYAYVPFSAGPRNCIGQRFAQ 180
           KGTNVLVLTYVLHRDPEIFPEPDEFRPERFFPENSKGRHPYAYVPFSAGPRNCIGQRFAQ
Sbjct: 389 KGTNVLVLTYVLHRDPEIFPEPDEFRPERFFPENSKGRHPYAYVPFSAGPRNCIGQRFAQ 568

Query: 181 MEEKTLLA 188
           MEEKTLLA
Sbjct: 569 MEEKTLLA 592

Note the length of the hit = 798, while the alignment stops at 592 (frame is + strand) so there are 206 bases of sequence beyond the end of the alignment. This would code for 206/3 = 68 and 2/3 codons or amino acids. Since P450s are about 500 amino acids long and about 50 amino acids after the heme signature FSAGPRNCIG This should extend the sequence to the stop codon. To find out we need to get the sequence and translate it. There are two ways to get the sequence. The easiest way is to click on the accession number in the blast output. It is hyperlinked and it will retrieve the sequence.
BU382948
GCGGCTCGAGGAAGAAAGTGGTTCCAAAAAGAGAGAAGCTTTCTTAGACATGCTGCTGAA
TGCCACAGATGATGAAGGGAAAAAACTCAGCTACAAGGACATTCGTGAAGAAGTGGATAC
TTTTATGTTTGAGGGTCATGATACAACAGCAGCCGCTATGAACTGGGTCCTATACTTGCT
TGGTCATCATCCTGAAGCCCAGAAGAAGGTTCACCAAGAACTGGATGAGGTGTTTGGCAA
CGCAGAGCGTCCTGTTACAGTGGATGATTTGAAGAAACTTCGATACCTCGAGTGTGTTGT
GAAAGAAGCCCTGAGGCTCTTCCCTTCAGTTCCCATGTTCGCCCGTTCCTTGCAAGAGGA
TTGCTATATTAGTGGATATAAGCTACCAAAAGGCACGAATGTCCTTGTCTTAACTTATGT
GCTGCACAGAGATCCTGAGATCTTCCCTGAGCCAGATGAATTCAGGCCTGAGCGCTTCTT
CCCTGAAAATAGCAAAGGAAGGCACCCATATGCTTATGTGCCCTTCTCTGCTGGCCCCAG
GAACTGCATTGGCCAACGCTTTGCACAAATGGAAGAGAAAACTCTTCTAGCC C CTCATCC
TGCGGCGCTTTTGGGTGGACTGTTCTCAAAAGCCAGAAGAGCTTGGTCTGTCAGGAGAAC
TAATTCTTCGTCCAAATAATGGCATCTGGGGTCAACTGAAGAGGAGACCAAAAACTGTAA
CAGAATGACAGGAATACAAGATTCTGATTTTCCAGAAACTTCTAAGCTATTGGACTGGAG
ATGTGTTTAAATCAGATG
The other way is to go to the NCBI sequence viewer link above and paste in the accession number in the search window (set for nucleotide in the pull down menu). Either way will get you to the sequence shown above. Copy the sequence and take it to the JavaScript DNA translator above and paste it in. Use the frame pull down menu to select one lettter code and 3 forward frames. Then press translate to get the following translation in three frames numbered 1, 2, 3. The reverse frames are 4, 5, 6.
>_1
AARGRKWFQKERSFLRHAAECHR**REKTQLQGHS*RSGYFYV*GS*YNSSRYELGPILA
WSSS*SPEEGSPRTG*GVWQRRASCYSG*FEETSIPRVCCERSPEALPFSSHVRPFLARG
LLY*WI*ATKRHECPCLNLCAAQRS*DLP*AR*IQA*ALLP*K*QRKAPICLCALLCWPQ
ELHWPTLCTNGRENSSSPSSCGAFGWTVLKSQKSLVCQEN*FFVQIMASGVN*RGDQKL*
QNDRNTRF*FSRNF*AIGLEMCLNQM
>_2
RLEEESGSKKREAFLDMLLNATDDEGKKLSYKDIREEVDTFMFEGHDTTAAAMNWVLYLL
GHHPEAQKKVHQELDEVFGNAERPVTVDDLKKLRYLECVVKEALRLFPSVPMFARSLQED
CYISGYKLPKGTNVLVLTYVLHRDPEIFPEPDEFRPERFFPENSKGRHPYAYVPFSAGPR
NCIGQRFAQMEEKTLLAPHPAALLGGLFSKARRAWSVRRTNSSSK*WHLGSTEEETKNCN
RMTGIQDSDFPETSKLLDWRCV*IR
>_3
GSRKKVVPKREKLS*TCC*MPQMMKGKNSATRTFVKKWILLCLRVMIQQQPL*TGSYTCL
VIILKPRRRFTKNWMRCLATQSVLLQWMI*RNFDTSSVL*KKP*GSSLQFPCSPVPCKRI
AILVDISYQKARMSLS*LMCCTEILRSSLSQMNSGLSASSLKIAKEGTHMLMCPSLLAPG
TALANALHKWKRKLF*PLILRRFWVDCSQKPEELGLSGELILRPNNGIWGQLKRRPKTVT
E*QEYKILIFQKLLSYWTGDVFKSD
Look at all three frames. Notice that frames 1 and three have many * in them. These represent stop codons. These occur by chance about once every 21 amino acids in non-coding nucleotide sequence. Of course in coding sequence they should not appear except at the end of a coding region. Otherwise they are found in introns, the non-coding regions of genes. Remember that we are looking at ESTs (no introns).

Frame 2 has very few * and they do not break up the sequence. If you look at this frame you will find the heme signature mentioned above that was part of the starting sequence in CYP4V. This unbroken sequence is the P450 CDS (coding sequence) up to the stop codon. The sequence we have now is shown below. Magenta is our starting sequence.
>_2
RLEEESGSKKREAFLDMLLNATDDEGKKLSYKDIREEVDTFMFEGHDTTAAAMNWVLYLL
GHHPEAQKKVHQELDEVFGNAERPVTVDDLKKLRYLECVVKEALRLFPSVPMFARSLQED
CYISGYKLPKGTNVLVLTYVLHRDPEIFPEPDEFRPERFFPENSKGRHPYAYVPFSAGPR
NCIGQRFAQMEEKTLLAPHPAALLGGLFSKARRAWSVRRTNSSSK*
We now have completed the C-terminal of CYP4V.

Facing up to imperfect sequence data


The sequence we have just extended looked like a good sequence up to the end. However, it is always a good idea to check your work. If you take this sequence to the human P450 blast page above and do the blast, we find that the C-terminal we added does not match the human sequence of CYP4V2. In fact it stops at TLLA. This suggests an error in the sequence that has caused a frame shift, probably one base is missing or one extra base is added in the sequence. To check for frameshifts you need to blast the other two frames in the region after TLLA. One of them will probably match the human sequence. If you blast the last two lines of frame three you will find that frame three is the correct frame after TLLA. You would also see this if you did a blastx search of the nucleotide sequence.

>CYP4V2 formerly CYP4AH1 AC012525 Homo sapiens chromosome 4
        Length = 525

 Score = 143 (50.3 bits), Expect = 1.8e-12, P = 1.8e-12
 Identities = 26/37 (70%), Positives = 32/37 (86%)

Query:    19 ILRRFWVDCSQKPEELGLSGELILRPNNGIWGQLKRR 55
             ILR FW++ +QK EELGL G+LILRP+NGIW +LKRR
Sbjct:   483 ILRHFWIESNQKREELGLEGQLILRPSNGIWIKLKRR 519

Where is the frameshift?

GAACTGCATTGGCCAACGCTTTGCACAAATGGAAGAGAAAACTCTTCTAGCC C CTCATCC
                                        T  L  L  A     L  I  L
TGCGGCGCTTTTGGGTGGACTGTTCTCAAAAGCCAGAAGAGCTTGGTCTGTCAGGAGAAC
  R  R  F  W
The sequence from above is shown with the TLLA and LILRRFW sequences in translation below it. There is an extra C base in a run of four Cs. This sequence was read as four when there were only three Cs here. Now we can say with more confidence that the correct CYP4V sequence is

RLEEESGSKKREAFLDMLLNATDDEGKKLSYKDIREEVDTFMFEGHDTTAAAMNWVLYLL
GHHPEAQKKVHQELDEVFGNAERPVTVDDLKKLRYLECVVKEALRLFPSVPMFARSLQED
CYISGYKLPKGTNVLVLTYVLHRDPEIFPEPDEFRPERFFPENSKGRHPYAYVPFSAGPR
NCIGQRFAQMEEKTLLALILRRFWVDCSQKPEELGLSGELILRPNNGIWGQLKRRPKTVTE*
A similar strategy can be used to extend the sequence upstream to the start codon. You must keep the possibility of frameshifts in your mind while you are doing this. The process is:
  • Do a blast search and find an accession that overlaps your known sequence and extends it upstream.
  • Get the sequence and translate it in the correct frames based on the frame of the blast match (+ or -, Forward or Reverse).
  • Look for the open reading frame (ORF), the one without stop codons in it that extends your known sequence.
  • Check your work by blasting the new sequence against human.
  • The start codon (almost always Met) should be in about the same place as the start codon in the best human match.

Going after bigger fish, searching genomic DNA

ESTs are very nice to learn on, but now that you are more familiar with BLAST and frameshifts, you are ready for harder problems. Eukaryotic genomes have introns. Some genes have none (CYP8B1 for example), but that is rare, 5-10 introns is more common. Introns make your life difficult, because you have to find them all and find all the intron-exon boundaries.

Genome sequences are being generated world wide at the rate of 200-300 million bases of sequence a day and that is probably a low estimate now. Celera Genomics has a 100 million bases per day capacity. The new J. Craig Venter Institute will be about the same and can increase to 4 times that level. The Broad Institute at MIT is near 60 million a day, probably more by now. The Joint Genome Institute (DOE) is of similar capacity and that does not count the Sanger Center and the genome centers like Kazusa in Japan and other sites in the world. These centers are producing more sequence than they can possibly annotate in any detail. They are relying on automated gene finding programs and automated blast comparisons to annotate these genomes. It will be a moderate success, but it will not be 100% correct. The gene finding programs only get about 60% accuracy. They fuse adjacent genes together. They skip exons. Some exons are very short, as short as eight nucleotides. See the GHE sequence below from several P450 genes from the white rot fungus. Automated programs miss short exons and they fail to detect bad exon boundaries that are probably sequence errors. In short, they do not have expert knowledge of a single protein family. Celera realized this when they did the Drosophila genome and they held two Gene Jamborees, where expert annotators were brought in to work on assembling genes from individual familes. The Riken mouse cDNA project had a similar meeting with about 50 invited annotators.



Figure showing a very short exon GHE from several P450s

Phanerochaete chrysosporium (white rot fungus) Scaffold_388a very similar 
to sequences 77 417 112 129
gene model complete all boundaries checked, 16 exons
3583 MISDTFALAISSGLSLFLCLKAFIDYRAGLRSI (2) 3684 ex1
3732 NHSYLPGFRALISSFGILGLFFKEPKRGLWGGRRRFWLRKHLDFEEAGVDIISH (0) 3893 ex2
3954 IAFLPSVSTYLLLADAAAIK (0) 4013 ex3
4069 EVTGHRARFPKPTYKTLRIFGGNVLASEGEEWKRHRKVVGPAFSE (0) 4203 c-helix ex4
4255 HNNRLVWNETVKIVNDLFANVWGSQSEVYVDNVVQSVTLP (0) 4374 ex5
4423 MALYVISIAGFGKRALWQADGNLPPGHKLSFQ (0) 4521 ex6
4576 DALHILGTDLWIKAATPTLLMNWAPTTRIANVKLAFDEVK (0) 4692 ex7
4747 QYMLELIQERRNSEKRDERYDLFSSLLDANDLNEDGNGNVTLTNDELL (1) 4890 ex9
     GNIFIFMLA (1) 4973 ex9 split
     GHE (0) ex9 split
5087 TTAHTLAFTFGLLALHPDYQETVYQQIKSIVPDNRPP (0) 5197 ex10
     MYEEMNSLTECMA (2) ex11
5351 YETLRLFPP (0) 5380 ex12
5436 TATIPKIAAEDTYLVTIDRAGNRVVVPVPCGTALHLNVIALHHN (1) 5564 ex13
5614 PRYWDNPSAFKPERFRGDWPRDAFIPFSTGSRSCIGRR (2) 5730 ex14
5780 FFETESIAILTMILSRYKIELRNDPRFADETYEERWQRVLRVKDGLTPA* 5932 ex15

compare to scaffold 388b not a separate exon

      GNIFIFLLAGHE(0) 14307 ex9
14247 TTAHTLAFTFGLLALYPEQQDKLYKHIKHVIPDGRIP (0) 14137 ex10

and scaffold 129

      GNIFIFMLA (1) ex9 split
      GHE (0) ex9 split
25150 TTAHTLAFTFGLLALHSDYQEKVHQQIKSIMPDNRLP (0) 25260 ex10

and scaffold 12a not a separate exon (exons have been fused)

54432 VKANMTEDAKSRLSEEEMYAEMR (2) 
54552 TILFAGHETTSTTISWVLLE



Expert knowledge reveals GHE to be a real exon, by comparison with other P450s. This happens to be in a motif region that is conserved (AGXETT) and easily recognized, but automated gene finding programs are not going to see this. Please note that the intron exon phases are indicated in (). The number 0, 1 or 2 tells the phase of the junction. Exons that join between codons are phase 0, those that join one base into a codon are phase 1 and those that join two bases in are phase 2. You cannot join exons together in frame unless the phase is preserved at both ends of the intron. In the example above, the sequence GNIFIFMLA ends in a phase 1 boundary, while the sequence TTAHTLAFTFGLLA starts with a phase 0 boundary. The two cannot be joined unless there is an exon in between with a phase 1 start and a phase 0 end. That is the GHE exon. This short eight base pair exon is seen in six different genes in the white rot fungus.

Introns begin with GT and end with AG. Some rare introns begin with GC. These bases, and all the bases between them, are cut out when the full length transcript is processed in the nucleus to make the mature mRNA (messenger RNA). Below is an example of a Drosophila P450 gene Cyp4e2, showing the 5 introns in place. Please look at each intron and find the GT and AG pairs. Also notice the phase of each intron as described above. This figure is the BLAT search output of the UCSC genome browser.
The exons are in blue capital letters. Introns are lower case black letters. The Capitals are all in blocks of three per codon. This allows determining the phase of intron boundaries, since the last capital letter is at the end of a complete codon. If gt follows then the phase is 0. If ngt ( n is any base) follows then the phase is 1. If nngt follows the phase is 2. In this figure intron 1 and intron 5 are phase 1, though intron 5 has the correct boundary off by one codon. Introns 2 and 4 are phase 0. Intron 3 is phase 2. The numbering on the right is nucleotide numbering for Chromosome 2R.

In this module, you will learn how to identify intron-exon boundaries, by comparing two very similar genes, one from human, and the other from rhesus monkey. This will be fairly easy since the genes are so similar. In real life situations, this is not going to be so clear. We have all 57 P450 genes identified from human, but we are starting with none identified from rhesus monkey. The human P450 sequences are linked above.

We are very fortunate to have the UCSC genome browser, which has an alignment of the two primate genomes as of Jan. 2005. By doing a BLAT search (Slightly different than a BLAST search, optimized for near exact matches), with a Human P450, we will get the region of the rhesus genome with the best match to our query sequence. Then we can link to a view of the genomic DNA sequence showing the location of the protein coding exons over the region of our gene. From this figure and our original protein sequence we can identify the GT and AG intron-exon boundaries. The introns can be edited out, and the assembled gene can be translated to give the rhesus monkey protein sequence.

The step by step procedure is now given:
  • Link to the Human P450s list and copy your protein sequence.
  • Paste your protein sequence in a Word Document for later use. You will be identifying the intron boundaries and phases in this sequence.
  • Link to the UCSC bioinformatics server click on BLAT and then select Rhesus under the genome pull down menu.
  • Paste your P450 sequence in the window and click the submit file button. You will see a display like the one below. If you get an internal error message shorten your sequence and repeat. Notice under the identity column the top hit is 93.6%. That is your gene. The others are different P450 sequences. The BLAT search only returns strong matches. Notice there are no hits less than 61%. The gene is on Scaffold 111671 on the minus strand at the nucleotide location given. The +- signs indicate your query sequence is always treated as Plus Strand and the match is Minus Strand. ++ would mean the hit is in the Plus Strand.

    This example was made with the bottom part of CYP2A13. The whole 2A13 sequence caused an error, so I shortened it. I think the error may have happened because this sequence is at the end of a scaffold and the whole sequence is not on the scaffold. Notice START END and QSIZE. This is the amino acid position of your query that matched. QSIZE is the query length. Here the query was 218 aa long and 2-218 matched.

  • Click on details on the left side. This gives the image of the gene structure. This gene fragment has 4 exons, but only two are shown in the first image. The top of the figure has your query sequence color-coded to show matches, mismatches and gaps (introns).



    The second image shows all four exons.



    The bottom of the details page shows the protein translation of each exon. Conservative changes are in green, other changes are in red.



  • Go back and click on browser. This takes you to the browser page showing the gene from one end to the other (exons 7-9 in this case). Multiple levels of evidence are given on the page including your query sequence at the top. Genbank RefSeq genes and other curated and non-curated predictions and sequence matches, including ESTs. The wide bars in the image are exons, or predicted exons. The connecting line covers the region from the beginning of your query sequence to the end. The arrows on the connecting line show the orientation of the gene. This gene piece has 4 exons and it is on the minus strand.

  • Click on the purple bar. This links you to both the human and rhesus sequences in the region of the browser window. Farther down on this page is an alignment of the two sequences. You can also open the human browser at the equivalent spot to see what is there. This can be very helpful when dealing with gene clusters.

  • Go back and examine the sequence showing the exons in blue and the introns in lower case black. It is time to identify the GT and AG boundaries of the introns. Keep in mind that these figures are drawn with intact codons in Blue. The codons are never split. This means that you can tell the phase if you can identify the GT that is the beginning of the intron. It should be the first GT in the black lower case region. This may be one or two codons away, but usually it is in the first codon in the black. The GT may also be in the Blue region, or the last letter of the Blue region may be G with the T in the black. The rules for determining the intron phase are:

    Phase 0. G and T are the first two letters of the black sequence, 1,2 or or they are 4,5 or 7,8. The intron is Phase 0 and it occurs between codons. Phase 0 codons always code for Val, since all valine codons begin with GT.

    GATCGTTgtagtca

    Phase 1. G and T are at positions 2,3 or 5,6 or 8,9 in the black sequence. The most common phase one codon is GGT = glycine. Phase one boundaries break the codon one nucleotide in.

    GATCAGTggtcgtaa

    Phase 2. G and T may overlap the blue and black sequence -1,1, or 3,4 or 6,7 in the black sequence. Arginine is often a phase 2 codon AGGT Phase two boundaries break the codon two nucleotides in.

    GATCAGGtcgtagtaa

    The other end of the intron will have an AG before the coding sequence begins again.

    Phase 0. AG may be the last two letters of the black intron sequence. They may be multiples of three from that boundary. Frequently CAG = gln is at the phase 0 boundary.

    gagcagAAGTACGAA

    Phase 1. The AG can span the black and blue boundary with A in the black and G in the blue. They can also be multiples of three from that location.

    taagaaGCAGACGAA

    Phase 2. AG can be one letter away from the blue boundary as AGX. The AG can also be the first two letters in the blue boundary.

    gatagtAGAAGCGAA

  • Copy each exon sequence upto but not including the GT of the intron. Mark each intron boundary on the end of an exon with the correct phase in parentheses (0), (1) or (2). The phase at both ends of the intron has to be the same so you only need to mark one end. Start copying the next exon after the AG boundary. Repeat until you have all exons copied and marked for phase
    
    The CYP2A13 sequence (4 exons, introns removed)
    GAGGAGAAGA ACCCCAACAC GGAGTTCTAC TTGAAGAACC TGatgATGAC CACGCTGAAC  14520
    CTCTTCattG CAGGCACCGA GACCGTCAGC ACCACCCTGC GCTATGGCTT CCTGCTGCTC  14460
    ATGAAGtatC CAGAGGTGGA Gg (1)
    CCAAG GTCCATGAGG AGATTGACAG AGTGATCGGC AAGAACCGGC AGCCCAAGTT  13920
    TGAGGACCGG gtcAAGATGC CCTACatgGA GGCAGTGATC CATGAGATCC AAAGATTTGG  13860
    AGACgtgatc CCCATGagcT TGGCCcgcAG GGTCAACAAG GACACCAAGT TTCGGGATTT  13800
    CTTCCTCCCT AAG (0)
    GGCACC  13260
    GAAGTGTTCC CTATGCTGGG CTCCgtgCTG AGAGACCCCA GGTTCTTCTC CAACCCCCAG  13200
    GACttcaatC CCCAGCACTT CTTGGATGAG AAGGGGCAGT TTAAGAAGAG TGACGCTTTT  13140
    GTGCCCTTTT CCATCg (1)
    GAAAG CGGaacTGTT TCGGAGAAGG CCTGGCCAGA  12120
    ATGGAGCTCT TTCTCTTCTT CACCACCATC ATGCAGAACT TCCGCTTCAA GTCCCCCCAG  12060
    ttgCCCAAGG ACATCGACGT GTCCCCCAAA CACGTGGGCT TTGCCACGAT CCCAccaAAC  12000
    TACACCATGA GCTTCCTGCC CCGC 
    
  • take your complete coding sequence with all introns removed to the Javascript DNA translator and translate your sequence. Sometimes there are errors made in cutting out the intron sequences. This can make a frameshift and introduce stop codons into your translation, If you see stop codons then look for the location of the frameshift by blast searching your three frame translations against Human at the human P450 blast server page. The place where your sequence stops matching the human sequence is where the frameshift is. This will probably be at one of the intron boundaries where you made an error in removing the intron.
  • Paste your newly assembled rhesus monkey protein sequence below the human starting sequence in your word file. Mark the location and pahse of the introns and do a blast search of human to get the sequence alignment as shown below. Email me these three items.
    
    >human protein CYP2A13 last 4 exons
    EEEKNPNTEFYLKNLVMTTLNLFFAGTETVSTTLRYGFLLLMKHPEVE
    AKVHEEIDRVIGKNRQPKFEDRAKMPYTEAVIHEIQRFGDMLPMGLAHRVNKDTKFRDFFLPK
    GTEVFPMLGSELRDPRFFSNPQDCSPQHFLDEKGQFKKSDAFVPFSI
    GKRYCFGEGLARMELFLFFTTIMQNFRFKSPQSPKDIDVSPKHVGFATIPRNYTMSFLPR
    
    >Rhesus match
    EEKNPNTEFYLKNLMMTTLNLFIAGTETVSTTLRYGFLLLMKYPEVE (1)
    AKVHEEIDRVIGKNRQPKFEDRVKMPYMEAVIHEIQRFGDVIPMSLARRVNKDTKFRDFFLPK (0)
    GTEVFPMLGSVLRDPRFFSNPQDFNPQHFLDEKGQFKKSDAFVPFSI (1)
    GKRNCFGEGLARMELFLFFTTIMQNFRFKSPQLPKDIDVSPKHVGFATIPPNYTMSFLPR*
    
    Alignment of the two sequences 93% identical
    Query: 2   EEKNPNTEFYLKNLVMTTLNLFFAGTETVSTTLRYGFLLLMKHPEVEAKVHEEIDRVIGK 61
               EEKNPNTEFYLKNL+MTTLNLF AGTETVSTTLRYGFLLLMK+PEVEAKVHEEIDRVIGK
    Sbjct: 1   EEKNPNTEFYLKNLMMTTLNLFIAGTETVSTTLRYGFLLLMKYPEVEAKVHEEIDRVIGK 60
    
    Query: 62  NRQPKFEDRAKMPYTEAVIHEIQRFGDMLPMGLAHRVNKDTKFRDFFLPKGTEVFPMLGS 121
               NRQPKFEDR KMPY EAVIHEIQRFGD++PM LA RVNKDTKFRDFFLPKGTEVFPMLGS
    Sbjct: 61  NRQPKFEDRVKMPYMEAVIHEIQRFGDVIPMSLARRVNKDTKFRDFFLPKGTEVFPMLGS 120
    
    Query: 122 ELRDPRFFSNPQDCSPQHFLDEKGQFKKSDAFVPFSIGKRYCFGEGLARMELFLFFTTIM 181
                LRDPRFFSNPQD +PQHFLDEKGQFKKSDAFVPFSIGKR CFGEGLARMELFLFFTTIM
    Sbjct: 121 VLRDPRFFSNPQDFNPQHFLDEKGQFKKSDAFVPFSIGKRNCFGEGLARMELFLFFTTIM 180
    
    Query: 182 QNFRFKSPQSPKDIDVSPKHVGFATIPRNYTMSFLPR 218
               QNFRFKSPQ PKDIDVSPKHVGFATIP NYTMSFLPR
    Sbjct: 181 QNFRFKSPQLPKDIDVSPKHVGFATIPPNYTMSFLPR 217
    
    
    




    Assignment 2.

    Do the gene assemblies as shown above for two Rhesus P450s. Note: CYP3 and CYP4 genes have 12 and 13 exons. You do not have to do all exons if you have a gene with 10 or more. CYP2 genes have 9 exons Please try to do all 9 of these if they are all present. Not all genes will show complete coverage In the Genome browser. If an exon is missing just make a note saying an exon was missing here. Look at your seq at the top of the details page to see what is missing. There will be black lower case letter between cyan Capital letter that mark the boudaries of gaps. These are usually exons, but they might just be short exons.
    
    N. Abdeltawab              2C19  5A1
    S. Aggarwal                1A2  3A43
    C. Blackwell              26C1  21A2
    A. Bolen                   2F1   7A1
    L. Chen                   2G2P   2U1
    S. Hill                     17  11B1
    K. Iyer                   26A1    19
    S. Jain                   26B1    24
    N. Liao                    2C9  26A1
    X. Liu                    27B1  11B1
    E. Mahrous                11A1   2W1
    D. Mao                     1A1   3A4
    A. Naguib                   46  4A11
    R. Panakanti               1B1   4B1
    Y. Peng                    2A6    51
    H. Penmatsa               26B1   4F3
    M. Puljic                  2B6  4A22
    S. Sarva                   2C8   4V2
    Q. Tran                     39  27A1
    G. Vasser                  2D6   4X1
    F. Zhang                   2E1  27C1
    Z. Zhang                   5A1   2J2
    G. Zhu                     2R1   7B1
    L. Zhu                    11B1   8A1