MSCI814 Module 2.

Superfamily Genomics/Eukaryotic Gene Assembly, Orangutan

David Nelson Jan. 14, 2010 (under revision)

This module deals with assembling genes from genomic DNA sequence. This is a little tricky, so we will begin with a newly sequenced genome that is similar to another well annotated genome. The orangutan, Pongo pygmaeus abelii, has been sequenced and assembled with about 6X coverage. The orangutan sequencing was done by the Genome Sequencing Center at Washington University, St. Louis. This genome is aligned against the human genome at the UC Santa Cruz genome browser. Orangutan P450 sequences are about 95% identical to their human orthologs.

The 2008 module 2 was on duck-billed platypus.

The 2006 module 2 was on Rhesus monkey.

LINKS FOR THIS MODULE

Bioinformatics home page

Module 1 Intro to NCBI, BLAST of maize ESTs

Module 1 results maize P450s found by the class

Orangutan genome project Genome Browser Gateway

NCBI sequence viewer for retrieving a Genbank sequence

Expasy DNA translator A second tool for translating nucleotide sequence to protein

Human P450 Blast server For comparing a sequence against human P450s

Human P450s FASTA format

NCBI TBLASTN Server for BLASTing a protein against Genbank nucleotide sequences

Do-it-yourself WU Blast server For BLASTing a new sequence against your own set of sequences

UCSC bioinformatics server for genome browsers use BLAT search of orangutan
Vista genome browser for comparing genomes Has some precomputed genome comparisons
Before we begin on orangutan, I would like to show you how well you did with last week's assignment. I was very happy with the results of the class effort. I also made some extensions to your sequences by walking the ESTs (the optional part of assignment 1). The current sequence set is at the link above called Module 1 results.

Starting from no sequences on Thursday afternoon last week, we now have 15 different P450 sequences from maize. 14 of these are full length P450s.

Look at the maize sequence collection. I have given the names of everyone who turned in sequences. P450s have a more conserved C-terminal sequence than other parts of the protein. The middle region from about 130-300 amino acids is poorly conserved. You were asked to find the best hits in the database, so you naturally found the C-terminal parts most of the time. In some cases part of the sequence was missing in the EST database. I extended some sequences (in blue) by searching GSS or nr.

Sometimes the very last part of the sequence was not found. One to 16 amino acids were missed in the blast alignment. This is often the case. If the last few amino acids are not in a conserved motif, then they will differ between sequences, even if many other regions are highly conserved. The way to check for this is to translate the EST DNA sequence and look for the stop codon shown as an *.

The logical next step to continue your work from the first class would be to extend the sequences you found to include the whole protein coding region. This was the optional part of the assignment. Some sequences could not be completed because the EST collection is not comprehensive and some sequences are missing. Only ESTs were being used here, but you could easily search in nr or use genomic sequence from the GSS section of Genbank.

Let us select an example sequence CYP4V (a chicken sequence) starting with the partial sequence shown below. If a BLAST is done using this sequence against EST others, limited to Gallus gallus (chicken) we find
>gi|25890949|gb|BU382948.1|BU382948 UniGene info 603858448F1 CSEQCHN75 Gallus gallus cDNA clone ChEST866o11 5'.
Length = 798
Score = 390 bits (1001), Expect = e-109
Identities = 188/188 (100%), Positives = 188/188 (100%)
Frame = +2

Query: 1   KREAFLDMLLNATDDEGKKLSYKDIREEVDTFMFEGHDTTAAAMNWVLYLLGHHPEAQKK 60
           KREAFLDMLLNATDDEGKKLSYKDIREEVDTFMFEGHDTTAAAMNWVLYLLGHHPEAQKK
Sbjct: 29  KREAFLDMLLNATDDEGKKLSYKDIREEVDTFMFEGHDTTAAAMNWVLYLLGHHPEAQKK 208

Query: 61  VHQELDEVFGNAERPVTVDDLKKLRYLECVVKEALRLFPSVPMFARSLQEDCYISGYKLP 120
           VHQELDEVFGNAERPVTVDDLKKLRYLECVVKEALRLFPSVPMFARSLQEDCYISGYKLP
Sbjct: 209 VHQELDEVFGNAERPVTVDDLKKLRYLECVVKEALRLFPSVPMFARSLQEDCYISGYKLP 388

Query: 121 KGTNVLVLTYVLHRDPEIFPEPDEFRPERFFPENSKGRHPYAYVPFSAGPRNCIGQRFAQ 180
           KGTNVLVLTYVLHRDPEIFPEPDEFRPERFFPENSKGRHPYAYVPFSAGPRNCIGQRFAQ
Sbjct: 389 KGTNVLVLTYVLHRDPEIFPEPDEFRPERFFPENSKGRHPYAYVPFSAGPRNCIGQRFAQ 568

Query: 181 MEEKTLLA 188
           MEEKTLLA
Sbjct: 569 MEEKTLLA 592
Note the length of the hit = 798, while the alignment stops at 592 (frame is + strand), so there are 798 - 592 = 206 bases of sequence beyond the end of the alignment. This would code for 206/3 = 68 and 2/3 codons, or 68 complete amino acids. P450s are about 500 amino acids long, with about 50 amino acids after the heme signature FSAGPRNCIG, so this should extend the sequence to the stop codon. To find out we need to get the sequence and translate it. There are two ways to get the sequence. The easiest way is to click on the accession number in the blast output. It is hyperlinked and it will retrieve the sequence.
BU382948 GCGGCTCGAGGAAGAAAGTGGTTCCAAAAAGAGAGAAGCTTTCTTAGACATGCTGCTGAA TGCCACAGATGATGAAGGGAAAAAACTCAGCTACAAGGACATTCGTGAAGAAGTGGATAC TTTTATGTTTGAGGGTCATGATACAACAGCAGCCGCTATGAACTGGGTCCTATACTTGCT TGGTCATCATCCTGAAGCCCAGAAGAAGGTTCACCAAGAACTGGATGAGGTGTTTGGCAA CGCAGAGCGTCCTGTTACAGTGGATGATTTGAAGAAACTTCGATACCTCGAGTGTGTTGT GAAAGAAGCCCTGAGGCTCTTCCCTTCAGTTCCCATGTTCGCCCGTTCCTTGCAAGAGGA TTGCTATATTAGTGGATATAAGCTACCAAAAGGCACGAATGTCCTTGTCTTAACTTATGT GCTGCACAGAGATCCTGAGATCTTCCCTGAGCCAGATGAATTCAGGCCTGAGCGCTTCTT CCCTGAAAATAGCAAAGGAAGGCACCCATATGCTTATGTGCCCTTCTCTGCTGGCCCCAG GAACTGCATTGGCCAACGCTTTGCACAAATGGAAGAGAAAACTCTTCTAGCC C CTCATCC TGCGGCGCTTTTGGGTGGACTGTTCTCAAAAGCCAGAAGAGCTTGGTCTGTCAGGAGAAC TAATTCTTCGTCCAAATAATGGCATCTGGGGTCAACTGAAGAGGAGACCAAAAACTGTAA CAGAATGACAGGAATACAAGATTCTGATTTTCCAGAAACTTCTAAGCTATTGGACTGGAG ATGTGTTTAAATCAGATG
The other way is to go to the NCBI sequence viewer link above and paste in the accession number in the search window (set for nucleotide in the pull down menu). Either way will get you to the sequence shown above. Copy the sequence and take it to the DNA translator above and paste it in and translate.
>_1
AARGRKWFQKERSFLRHAAECHR**REKTQLQGHS*RSGYFYV*GS*YNSSRYELGPILA
WSSS*SPEEGSPRTG*GVWQRRASCYSG*FEETSIPRVCCERSPEALPFSSHVRPFLARG
LLY*WI*ATKRHECPCLNLCAAQRS*DLP*AR*IQA*ALLP*K*QRKAPICLCALLCWPQ
ELHWPTLCTNGRENSSSPSSCGAFGWTVLKSQKSLVCQEN*FFVQIMASGVN*RGDQKL*
QNDRNTRF*FSRNF*AIGLEMCLNQM
>_2
RLEEESGSKKREAFLDMLLNATDDEGKKLSYKDIREEVDTFMFEGHDTTAAAMNWVLYLL
GHHPEAQKKVHQELDEVFGNAERPVTVDDLKKLRYLECVVKEALRLFPSVPMFARSLQED
CYISGYKLPKGTNVLVLTYVLHRDPEIFPEPDEFRPERFFPENSKGRHPYAYVPFSAGPR
NCIGQRFAQMEEKTLLAPHPAALLGGLFSKARRAWSVRRTNSSSK*WHLGSTEEETKNCN
RMTGIQDSDFPETSKLLDWRCV*IR
>_3
GSRKKVVPKREKLS*TCC*MPQMMKGKNSATRTFVKKWILLCLRVMIQQQPL*TGSYTCL
VIILKPRRRFTKNWMRCLATQSVLLQWMI*RNFDTSSVL*KKP*GSSLQFPCSPVPCKRI
AILVDISYQKARMSLS*LMCCTEILRSSLSQMNSGLSASSLKIAKEGTHMLMCPSLLAPG
TALANALHKWKRKLF*PLILRRFWVDCSQKPEELGLSGELILRPNNGIWGQLKRRPKTVT
E*QEYKILIFQKLLSYWTGDVFKSD
Look at all three frames. Notice that frames 1 and 3 have many * in them. These represent stop codons. Since 3 of the 64 possible codons are stops, they occur by chance about once every 21 codons in non-coding nucleotide sequence. In coding sequence they should not appear except at the end of a coding region. Otherwise they are found in introns, the non-coding regions of genes. Remember that we are looking at ESTs (no introns).
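As a rough illustration of what a DNA translator does, here is a minimal Python sketch that translates the first 60 bases of the EST above in the three forward frames and counts the stops in each. The function names are made up for this illustration, and this is not the code behind the Expasy tool.

```python
from itertools import product

# Standard genetic code, listed in TCAG order (NCBI translation table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate DNA codon by codon; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frames(dna):
    """Return the translations of forward frames 1, 2 and 3."""
    return [translate(dna[offset:]) for offset in range(3)]


# First 60 bases of EST BU382948 as listed above.
est_start = "GCGGCTCGAGGAAGAAAGTGGTTCCAAAAAGAGAGAAGCTTTCTTAGACATGCTGCTGAA"

frames = three_frames(est_start)
for number, protein in enumerate(frames, start=1):
    print(f"frame {number}: {protein} ({protein.count('*')} stops)")

# 3 of the 64 codons are stops, so random sequence averages
# one stop every 64/3 codons.
print(f"expected codons per stop in random sequence: {64 / 3:.1f}")
```

Even in this short stretch, frame 3 already contains two stops while frame 2 reads straight through, which is the pattern described above.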

Frame 2 has very few * and they do not break up the sequence. If you look at this frame you will find the heme signature mentioned above that was part of the starting sequence in CYP4V. This unbroken sequence is the P450 CDS (coding sequence) up to the stop codon. The sequence we have now is shown below. Magenta is our starting sequence.
>_2 RLEEESGSKKREAFLDMLLNATDDEGKKLSYKDIREEVDTFMFEGHDTTAAAMNWVLYLL GHHPEAQKKVHQELDEVFGNAERPVTVDDLKKLRYLECVVKEALRLFPSVPMFARSLQED CYISGYKLPKGTNVLVLTYVLHRDPEIFPEPDEFRPERFFPENSKGRHPYAYVPFSAGPR NCIGQRFAQMEEKTLLAPHPAALLGGLFSKARRAWSVRRTNSSSK*
We have now completed the C-terminus of chicken CYP4V.

Facing up to imperfect sequence data

The sequence we have just extended looked like a good sequence up to the end. However, it is always a good idea to check your work. If you take this sequence to the human P450 blast page above and do the blast, you will find that the C-terminus we added does not match the human sequence of CYP4V2. In fact the match stops at TLLA. This suggests an error in the sequence that has caused a frameshift: probably one base is missing or one extra base has been added. To check for frameshifts you need to blast the other two frames in the region after TLLA. One of them will probably match the human sequence. If you blast the last two lines of frame 3 you will find that frame 3 is the correct frame after TLLA. You would also see this if you did a blastx search of the EST nucleotide sequence.
>CYP4V2 AC012525 Homo sapiens chromosome 4
Length = 525
Score = 143 (50.3 bits), Expect = 1.8e-12, P = 1.8e-12
Identities = 26/37 (70%), Positives = 32/37 (86%)

Query: 19  ILRRFWVDCSQKPEELGLSGELILRPNNGIWGQLKRR 55
           ILR FW++ +QK EELGL G+LILRP+NGIW +LKRR
Sbjct: 483 ILRHFWIESNQKREELGLEGQLILRPSNGIWIKLKRR 519
Where is the frameshift?
GAACTGCATTGGCCAACGCTTTGCACAAATGGAAGAGAAAACTCTTCTAGCC C CTCATCC
T L L A L I L
TGCGGCGCTTTTGGGTGGACTGTTCTCAAAAGCCAGAAGAGCTTGGTCTGTCAGGAGAAC
R R F W
The sequence from above is shown with the TLLA and LILRRFW sequences in translation below it. There is an extra C base in the run of four Cs: the run was read as four Cs when there are really only three. Now we can say with more confidence that the correct CYP4V sequence is
RLEEESGSKKREAFLDMLLNATDDEGKKLSYKDIREEVDTFMFEGHDTTAAAMNWVLYLL GHHPEAQKKVHQELDEVFGNAERPVTVDDLKKLRYLECVVKEALRLFPSVPMFARSLQED CYISGYKLPKGTNVLVLTYVLHRDPEIFPEPDEFRPERFFPENSKGRHPYAYVPFSAGPR NCIGQRFAQMEEKTLLALILRRFWVDCSQKPEELGLSGELILRPNNGIWGQLKRRPKTVTE*
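The frameshift correction can be checked with a short Python sketch. It translates the region around the run of Cs as it was read, then drops one C and translates again. The region is copied from the EST, starting at the AAC codon of the heme signature; the function name and the way the run of Cs is located are just illustrative choices.

```python
from itertools import product

# Standard genetic code, listed in TCAG order (NCBI translation table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate DNA codon by codon; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


# Region around the suspected frameshift, in the frame of the P450 (frame 2).
region = (
    "AACTGCATTGGCCAACGCTTTGCACAAATGGAAGAGAAAACTCTTCTAGCCCCTCATCC"
    "TGCGGCGCTTTTGGGTGGACTGTTCTCAAAAGCCAGAAGAGCTTGGTCTGTCAGGAGAAC"
)

as_read = translate(region)

# Remove one C from the run of four Cs and translate again.
run = region.index("CCCC")
corrected = translate(region[:run] + region[run + 1:])

print("as read:  ", as_read)
print("corrected:", corrected)
```

The first translation runs on from TLLA into PHPAALLGG, which has no match in human CYP4V2. The second continues from TLLA into LILRRFWVDCSQKPEELGLSGE, the sequence that aligned with human.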
A similar strategy can be used to extend the sequence upstream to the start codon. You must keep the possibility of frameshifts in your mind while you are doing this. The process is:

Do a blast search and find an accession that overlaps your known sequence and extends it upstream.
Get the sequence and translate it in the correct frames based on the frame of the blast match (+ or -, Forward or Reverse).
Look for the open reading frame (ORF), the one without stop codons in it that extends your known sequence.
Check your work by blasting the new sequence against human for an animal P450, or Arabidopsis for a plant P450.
The Met start codon should be in about the same place as the start codon in the best match.
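Steps 2 and 3 of this process can be sketched in Python as follows. The sketch translates a sequence in all six frames, finds the frame that contains a peptide you already know, and reports what lies upstream of it back to the nearest stop. The helper names are hypothetical, and an exact string match is an assumption that only works when the new sequence is identical to the known peptide; a BLAST alignment is more forgiving.

```python
from itertools import product

# Standard genetic code, listed in TCAG order (NCBI translation table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate DNA codon by codon; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def reverse_complement(dna):
    """Return the reverse complement, for reading the minus strand."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def find_frame(dna, known_peptide):
    """Return (strand, frame, protein) for the first frame containing known_peptide."""
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            protein = translate(seq[offset:])
            if known_peptide in protein:
                return strand, offset + 1, protein
    return None


def upstream_extension(protein, known_peptide):
    """Residues upstream of known_peptide, back to the nearest stop codon."""
    start = protein.index(known_peptide)
    return protein[:start].rsplit("*", 1)[-1]


# First 60 bases of EST BU382948 and the start of the CYP4V query.
est_start = "GCGGCTCGAGGAAGAAAGTGGTTCCAAAAAGAGAGAAGCTTTCTTAGACATGCTGCTGAA"
strand, frame, protein = find_frame(est_start, "KREAFLDMLL")
print(strand, frame, upstream_extension(protein, "KREAFLDMLL"))
```

For this EST the known sequence is found on the plus strand in frame 2, in agreement with the Frame = +2 line of the BLAST output, and nine more residues lie upstream of it. The extension must still be checked against the best match as described in step 4.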

Going after bigger fish, searching genomic DNA
ESTs are very nice to learn on, but now that you are more familiar with BLAST and frameshifts, you are ready for harder problems. Eukaryotic genomes have introns. Some genes have none (CYP8B1 for example), but that is rare; 5-10 introns is more common. Introns make your life difficult, because you have to find them all and find all the intron-exon boundaries.

Genome sequences are being generated worldwide at the rate of more than 1 billion bases of sequence a day and that is probably a low estimate now. Celera Genomics had a 100 million bases per day capacity. The new J. Craig Venter Institute will be about the same and can increase to 4 times that level. The Broad Institute at MIT is near 60 million a day, probably more by now. The Joint Genome Institute (DOE) did 82 million capillary bases/day and 24 million 454 bases/day in 2007. The Wellcome Trust Sanger Institute press release from Dec 6, 2007 said "When fully deployed, the new platforms will boost DNA sequence capacity of the Institute from 110M bases per day to more than 6500M bases - or a complete diploid human genome - per day." Baylor College of Medicine in a Nov. 21, 2006 press release said "The combined sequence output from the centers, using current technologies, is expected to be about 12 billion DNA base pairs per month - the equivalent of four human genomes." That is 400 million bases per day. The Beijing Genomics Institute was mentioned in a Dec 24 news story from GenomeWeb Daily News "...updates the institute's capacity to 250 million base pairs a day with standard equipment, and up to four billion a day using next-generation sequencers." Washington University Genome Sequencing Center is one of the three large scale genome centers funded by NHGRI (the others are Baylor College of Medicine and the Broad Institute). JGI is funded by DOE. The private company GATC Biotech in Germany will boost its current sequencing capacity from 130 gigabases to 250 gigabases a year (685 million bases per day). There are more genome centers like Riken and Kazusa in Japan and Genoscope in France (30 million/day) and other sites in the world.

These centers are producing more sequence than they can possibly annotate in any detail. They are relying on automated gene finding programs and automated blast comparisons to annotate these genomes. These programs will have moderate success, but they will not be 100% correct. The gene finding programs only get about 60% accuracy. They fuse adjacent genes together. They skip exons. Some exons are very short, as short as eight nucleotides. See the GHE sequence below from several P450 genes from the white rot fungus. Automated programs miss short exons and they fail to detect bad exon boundaries that are probably sequence errors. In short, they do not have expert knowledge of a single protein family. Celera realized this when they did the Drosophila genome and they held two Gene Jamborees, where expert annotators were brought in to work on assembling genes from individual families. The Riken mouse cDNA project had a similar meeting with about 50 invited annotators. These types of expert groups are now usually organized online without a physical meeting. I have been involved with a number of these genome annotation groups including Drosophila, the cottonwood tree, the moss genome (Science paper just out Jan 4, 2008), the Tribolium (beetle) genome, the papaya genome, Daphnia (water flea), brown algae, green algae, several fungal genomes, Anopheles and Aedes mosquitoes.

Figure showing a very short micro exon GHE from several P450s

Phanerochaete chrysosporium (white rot fungus)
Scaffold_388a very similar to sequences 77 417 112 129
gene model complete all boundaries checked, 16 exons
3583 MISDTFALAISSGLSLFLCLKAFIDYRAGLRSI (2) 3684 ex1
3732 NHSYLPGFRALISSFGILGLFFKEPKRGLWGGRRRFWLRKHLDFEEAGVDIISH (0) 3893 ex2
3954 IAFLPSVSTYLLLADAAAIK (0) 4013 ex3
4069 EVTGHRARFPKPTYKTLRIFGGNVLASEGEEWKRHRKVVGPAFSE (0) 4203 c-helix ex4
4255 HNNRLVWNETVKIVNDLFANVWGSQSEVYVDNVVQSVTLP (0) 4374 ex5
4423 MALYVISIAGFGKRALWQADGNLPPGHKLSFQ (0) 4521 ex6
4576 DALHILGTDLWIKAATPTLLMNWAPTTRIANVKLAFDEVK (0) 4692 ex7
4747 QYMLELIQERRNSEKRDERYDLFSSLLDANDLNEDGNGNVTLTNDELL (1) 4890 ex9
GNIFIFMLA (1) 4973 ex9 split
GHE (0) ex9 split
5087 TTAHTLAFTFGLLALHPDYQETVYQQIKSIVPDNRPP (0) 5197 ex10
MYEEMNSLTECMA (2) ex11
5351 YETLRLFPP (0) 5380 ex12
5436 TATIPKIAAEDTYLVTIDRAGNRVVVPVPCGTALHLNVIALHHN (1) 5564 ex13
5614 PRYWDNPSAFKPERFRGDWPRDAFIPFSTGSRSCIGRR (2) 5730 ex14
5780 FFETESIAILTMILSRYKIELRNDPRFADETYEERWQRVLRVKDGLTPA* 5932 ex15

compare to scaffold 388b not a separate exon
GNIFIFLLAGHE(0) 14307 ex9
14247 TTAHTLAFTFGLLALYPEQQDKLYKHIKHVIPDGRIP (0) 14137 ex10

and scaffold 129
GNIFIFMLA (1) ex9 split
GHE (0) ex9 split
25150 TTAHTLAFTFGLLALHSDYQEKVHQQIKSIMPDNRLP (0) 25260 ex10

and scaffold 12a not a separate exon (exons have been fused)
54432 VKANMTEDAKSRLSEEEMYAEMR (2) 54552
TILFAGHETTSTTISWVLLE
Expert knowledge reveals GHE to be a real exon, by comparison with other P450s. This happens to be in a motif region that is conserved (AGXETT) and easily recognized, but automated gene finding programs are not going to see this. Please note that the intron exon phases are indicated in (). The number 0, 1 or 2 tells the phase of the junction. Exons that join between codons are phase 0, those that join one base into a codon are phase 1 and those that join two bases in are phase 2. You cannot join exons together in frame unless the phase is preserved at both ends of the intron. In the example above, the sequence GNIFIFMLA ends in a phase 1 boundary, while the sequence TTAHTLAFTFGLLA starts with a phase 0 boundary. The two cannot be joined unless there is an exon in between with a phase 1 start and a phase 0 end. That is the GHE exon. This short eight base pair exon is seen in six different genes in the white rot fungus. These small exons have been called micro exons.
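The phase bookkeeping in this paragraph is simple modular arithmetic, shown here as a small Python sketch. The two function names are invented for the illustration.

```python
def exon_end_phase(start_phase, exon_length):
    """Phase of the intron after an exon, given the phase of the intron before it.

    start_phase bases of a split codon are carried in from the upstream exon,
    so the end phase is the running base count modulo 3.
    """
    return (start_phase + exon_length) % 3


def can_join(upstream_end_phase, downstream_start_phase):
    """Two exons can be joined in frame only if the intron phases match."""
    return upstream_end_phase == downstream_start_phase


# GNIFIFMLA ends in phase 1 and TTAHTLAFTFGLLA starts in phase 0.
print(can_join(1, 0))  # False: something is missing in between

# The GHE micro exon is 8 bases long and follows the phase 1 intron.
print(exon_end_phase(1, 8))  # 0

# With GHE in place, both junctions are consistent.
print(can_join(1, 1) and can_join(exon_end_phase(1, 8), 0))  # True
```

The eight bases of the micro exon supply the last two bases of the Gly codon plus the complete His and Glu codons, which is why the exon starts in phase 1 and ends in phase 0.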

Introns begin with GT and end with AG. Some rare introns begin with GC. These bases, and all the bases between them, are cut out when the full length transcript is processed in the nucleus to make the mature mRNA (messenger RNA). Below is an example of a Drosophila P450 gene Cyp4e2, showing the 5 introns in place. Please look at each intron and find the GT and AG pairs. Also notice the phase of each intron as described above. This figure is the BLAT search output of the UCSC genome browser.
The exons are in blue capital letters. Introns are lower case black letters. The Capitals are all in blocks of three per codon. This allows determining the phase of intron boundaries, since the last capital letter is at the end of a complete codon. If gt follows then the phase is 0. If ngt ( n is any base) follows then the phase is 1. If nngt follows the phase is 2. In this figure intron 1 and intron 5 are phase 1, though intron 5 has the correct boundary off by one codon. Introns 2 and 4 are phase 0. Intron 3 is phase 2. The numbering on the right is nucleotide numbering for Chromosome 2R.

In this module, you will learn how to identify intron-exon boundaries, by comparing two similar genes, one from human, and the other from orangutan. This will be fairly easy since the genes are similar. In real life situations, this is not going to be so clear. We have all 57 P450 genes identified from human, but we are starting with none identified from orangutan. The human P450 sequences are linked below.

We are very fortunate to have the UCSC genome browser, which has an alignment of the human and orangutan genomes as of July 2007. By doing a BLAT search (Slightly different than a BLAST search, optimized for near exact matches), with a human P450, we will get the region of the orangutan genome with the best match to our query sequence. Then we can link to a view of the genomic DNA sequence showing the location of the protein coding exons over the region of our gene. From this figure and our original protein sequence we can identify the GT and AG intron-exon boundaries. The introns can be edited out, and the assembled gene can be translated to give the orangutan protein sequence.

The step by step procedure is now given:

Link to the Human P450s for this assignment and copy your protein sequence.
Paste your protein sequence in a Word Document for later use. You will be identifying the intron boundaries and phases in this sequence.
Link to the UCSC bioinformatics server, click on BLAT, and then select orangutan under the genome pull down menu.
Paste your P450 sequence in the window and click the submit file button. You will see a display like the one below. This example is human CYP2A13 compared to rhesus monkey. Notice under the identity column the top hit is 93.6%. That is your gene. The others are different P450 sequences. The BLAT search only returns strong matches. Notice there are no hits less than 61%. The gene is on Scaffold 111671 on the minus strand at the nucleotide location given. The +- signs indicate your query sequence is always treated as Plus Strand and the match is Minus Strand. ++ would mean the hit is in the Plus Strand.

This example was made with the bottom part of CYP2A13. The whole 2A13 sequence caused an error, so I shortened it. I think the error may have happened because this sequence is at the end of a scaffold and the whole sequence is not on the scaffold. Notice START END and QSIZE. This is the amino acid position of your query that matched. QSIZE is the query length. Here the query was 218 aa long and 2-218 matched.

Click on details on the left side. This gives the image of the gene structure. This gene fragment has 4 exons, but only two are shown in the first image. The top of the figure has your query sequence color-coded to show matches, mismatches and gaps (introns).

The second image shows all four exons.

The bottom of the details page shows the protein translation of each exon. Conservative changes are in green, other changes are in red. The top line is the query sequence with the query amino acid numbering. The bottom line is the translated nucleotide sequence with nucleotide numbering.

Go back and click on browser. This takes you to the browser page showing the gene from one end to the other (exons 6-9 in this case). Multiple levels of evidence are given on the page, including your query sequence at the top, Genbank RefSeq genes, and other curated and non-curated predictions and sequence matches, including ESTs. The wide bars in the image are exons, or predicted exons. The connecting line covers the region from the beginning of your query sequence to the end. The arrows on the connecting line show the orientation of the gene. This gene piece has 4 exons and it is on the minus strand.

Click on the purple bar (color indicates chromosome, see code at the bottom of the graphic). This links you to both the human and rhesus nucleotide sequences in the region of the browser window. Farther down on this page is an alignment of the two sequences. You can also open the human browser at the equivalent spot to see what is there. This can be very helpful when dealing with gene clusters.

Go back to details and examine the sequence showing the exons in blue and the introns in lower case black. It is time to identify the GT and AG boundaries of the introns. Keep in mind that these figures are drawn with intact codons in Blue. The codons are never split. This means that you can tell the phase if you can identify the GT that is the beginning of the intron. It should be the first GT in the black lower case region. This may be one or two codons away, but usually it is in the first codon in the black. The GT may also be in the Blue region, or the last letter of the Blue region may be G with the T in the black. The rules for determining the intron phase are:

Phase 0. G and T are the first two letters of the black sequence (positions 1,2), or they are at positions 4,5 or 7,8. The intron is Phase 0 and it occurs between codons. Phase 0 codons always code for Val, since all valine codons begin with GT.

GATCGTTgtagtca

Phase 1. G and T are at positions 2,3 or 5,6 or 8,9 in the black sequence. The most common phase one codon is GGT = glycine. Phase one boundaries break the codon one nucleotide in.

GATCAGTggtcgtaa

Phase 2. G and T may overlap the blue and black sequence at positions -1,1, or they may be at 3,4 or 6,7 in the black sequence. Arginine is often a phase 2 codon (AGGT). Phase two boundaries break the codon two nucleotides in.

GATCAGGtcgtagtaa
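The three rules above reduce to one calculation: find the G of the GT relative to the first black letter and take that position modulo 3. A minimal Python sketch follows. It assumes the exon is all capitals, the intron is all lower case, and the intron starts at the first GT found at or after the last capital letter. The function name is hypothetical, and a real boundary still has to be confirmed against the protein alignment.

```python
def donor_phase(display):
    """Infer intron phase from exon capitals followed by lower-case intron.

    Assumes the capitals end on a complete codon, as in the BLAT details view.
    """
    boundary = next((i for i, ch in enumerate(display) if ch.islower()), None)
    if boundary is None:
        raise ValueError("no lower-case intron sequence found")
    # The G may be the last capital, so start looking one letter early.
    gt = display.upper().find("GT", max(boundary - 1, 0))
    if gt == -1:
        raise ValueError("no GT found near the boundary")
    # Position of the G counted from the first intron letter, modulo 3.
    return (gt - boundary) % 3


for example in ("GATCGTTgtagtca", "GATCAGTggtcgtaa", "GATCAGGtcgtagtaa"):
    print(example, "phase", donor_phase(example))
```

The three example strings are the ones given above, and the sketch returns phases 0, 1 and 2 for them in that order.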

The other end of the intron will have an AG before the coding sequence begins again.

Phase 0. AG may be the last two letters of the black intron sequence. They may be multiples of three from that boundary. Frequently CAG = gln is at the phase 0 boundary.

gagcagAAGTACGAA

Phase 1. The AG can span the black and blue boundary with A in the black and G in the blue. They can also be multiples of three from that location.

taagaaGCAGACGAA

Phase 2. AG can be one letter away from the blue boundary as AGX. The AG can also be the first two letters in the blue boundary.

gatagtAGAAGCGAA
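The AG end of the intron can be handled the same way. This Python sketch assumes lower-case intron followed by exon capitals that begin on a complete codon, and it takes the AG closest to the boundary, allowing the AG to reach up to two letters into the capitals. Both the function name and that choice of AG are assumptions made for the illustration; the phase found here must agree with the phase at the GT end of the same intron.

```python
def acceptor_phase(display):
    """Infer intron phase from lower-case intron followed by exon capitals.

    Assumes the capitals start on a complete codon, as in the BLAT details view.
    """
    boundary = next((i for i, ch in enumerate(display) if ch.isupper()), None)
    if boundary is None:
        raise ValueError("no upper-case exon sequence found")
    # Rightmost AG that starts at or before the first capital letter.
    ag = display.upper().rfind("AG", 0, boundary + 2)
    if ag == -1:
        raise ValueError("no AG found near the boundary")
    # The exon starts right after the AG; its offset from the first
    # capital letter, modulo 3, is the phase.
    return (ag + 2 - boundary) % 3


for example in ("gagcagAAGTACGAA", "taagaaGCAGACGAA", "gatagtAGAAGCGAA"):
    print(example, "phase", acceptor_phase(example))
```

For the three examples above this gives phases 0, 1 and 2. In the last example the sketch picks the AG in the capitals, but the AG one letter before the boundary gives the same phase.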

Copy each exon sequence up to but not including the GT of the intron. Mark each intron boundary on the end of an exon with the correct phase in parentheses (0), (1) or (2). The phase at both ends of the intron has to be the same so you only need to mark one end. Start copying the next exon after the AG boundary. Repeat until you have all exons copied and marked for phase.
The CYP2A13 sequence (4 exons, introns removed) GAGGAGAAGA ACCCCAACAC GGAGTTCTAC TTGAAGAACC TGatgATGAC CACGCTGAAC 14520 CTCTTCattG CAGGCACCGA GACCGTCAGC ACCACCCTGC GCTATGGCTT CCTGCTGCTC 14460 ATGAAGtatC CAGAGGTGGA Gg (1) CCAAG GTCCATGAGG AGATTGACAG AGTGATCGGC AAGAACCGGC AGCCCAAGTT 13920 TGAGGACCGG gtcAAGATGC CCTACatgGA GGCAGTGATC CATGAGATCC AAAGATTTGG 13860 AGACgtgatc CCCATGagcT TGGCCcgcAG GGTCAACAAG GACACCAAGT TTCGGGATTT 13800 CTTCCTCCCT AAG (0) GGCACC 13260 GAAGTGTTCC CTATGCTGGG CTCCgtgCTG AGAGACCCCA GGTTCTTCTC CAACCCCCAG 13200 GACttcaatC CCCAGCACTT CTTGGATGAG AAGGGGCAGT TTAAGAAGAG TGACGCTTTT 13140 GTGCCCTTTT CCATCg (1) GAAAG CGGaacTGTT TCGGAGAAGG CCTGGCCAGA 12120 ATGGAGCTCT TTCTCTTCTT CACCACCATC ATGCAGAACT TCCGCTTCAA GTCCCCCCAG 12060 ttgCCCAAGG ACATCGACGT GTCCCCCAAA CACGTGGGCT TTGCCACGAT CCCAccaAAC 12000 TACACCATGA GCTTCCTGCC CCGC

Take your complete coding sequence with all introns removed to the DNA translator and translate your sequence. Sometimes there are errors made in cutting out the intron sequences. This can make a frameshift and introduce stop codons into your translation. If you see stop codons then look for the location of the frameshift by blast searching your three frame translations against Human at the human P450 blast server page. The place where your sequence stops matching the human sequence is where the frameshift is. This will probably be at one of the intron boundaries where you made an error in removing the intron.
Paste your newly assembled orangutan protein sequence below the human starting sequence in your word file. Mark the location and phase of the introns and do a blast search against human to get the sequence alignment as shown below. Email me these three items.
>human protein CYP2A13 last 4 exons
EEEKNPNTEFYLKNLVMTTLNLFFAGTETVSTTLRYGFLLLMKHPEVE
AKVHEEIDRVIGKNRQPKFEDRAKMPYTEAVIHEIQRFGDMLPMGLAHRVNKDTKFRDFFLPK
GTEVFPMLGSELRDPRFFSNPQDCSPQHFLDEKGQFKKSDAFVPFSI
GKRYCFGEGLARMELFLFFTTIMQNFRFKSPQSPKDIDVSPKHVGFATIPRNYTMSFLPR

>Rhesus match (yours will be orangutan)
EEKNPNTEFYLKNLMMTTLNLFIAGTETVSTTLRYGFLLLMKYPEVE (1)
AKVHEEIDRVIGKNRQPKFEDRVKMPYMEAVIHEIQRFGDVIPMSLARRVNKDTKFRDFFLPK (0)
GTEVFPMLGSVLRDPRFFSNPQDFNPQHFLDEKGQFKKSDAFVPFSI (1)
GKRNCFGEGLARMELFLFFTTIMQNFRFKSPQLPKDIDVSPKHVGFATIPPNYTMSFLPR*

Alignment of the two sequences 93% identical

Query: 2   EEKNPNTEFYLKNLVMTTLNLFFAGTETVSTTLRYGFLLLMKHPEVEAKVHEEIDRVIGK 61
           EEKNPNTEFYLKNL+MTTLNLF AGTETVSTTLRYGFLLLMK+PEVEAKVHEEIDRVIGK
Sbjct: 1   EEKNPNTEFYLKNLMMTTLNLFIAGTETVSTTLRYGFLLLMKYPEVEAKVHEEIDRVIGK 60

Query: 62  NRQPKFEDRAKMPYTEAVIHEIQRFGDMLPMGLAHRVNKDTKFRDFFLPKGTEVFPMLGS 121
           NRQPKFEDR KMPY EAVIHEIQRFGD++PM LA RVNKDTKFRDFFLPKGTEVFPMLGS
Sbjct: 61  NRQPKFEDRVKMPYMEAVIHEIQRFGDVIPMSLARRVNKDTKFRDFFLPKGTEVFPMLGS 120

Query: 122 ELRDPRFFSNPQDCSPQHFLDEKGQFKKSDAFVPFSIGKRYCFGEGLARMELFLFFTTIM 181
            LRDPRFFSNPQD +PQHFLDEKGQFKKSDAFVPFSIGKR CFGEGLARMELFLFFTTIM
Sbjct: 121 VLRDPRFFSNPQDFNPQHFLDEKGQFKKSDAFVPFSIGKRNCFGEGLARMELFLFFTTIM 180

Query: 182 QNFRFKSPQSPKDIDVSPKHVGFATIPRNYTMSFLPR 218
           QNFRFKSPQ PKDIDVSPKHVGFATIP NYTMSFLPR
Sbjct: 181 QNFRFKSPQLPKDIDVSPKHVGFATIPPNYTMSFLPR 217
The genes in orangutan will be very close to human so finding exons should be quite easy to do.
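The joining and translating steps can be sketched in Python as follows. The sketch joins exon pieces, checks that the running length agrees with each marked phase, and translates the result. The two fragments are the end of the first exon and the start of the second exon from the CYP2A13 rhesus example above. The function names and the (sequence, phase) input format are invented for this illustration.

```python
from itertools import product

# Standard genetic code, listed in TCAG order (NCBI translation table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate DNA codon by codon; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def assemble(exons):
    """Join (sequence, end_phase) pairs given in 5' to 3' order.

    Assumes the first piece starts on a complete codon. Use None as the
    phase of the last piece, which is not followed by an intron.
    """
    coding = ""
    for number, (sequence, end_phase) in enumerate(exons, start=1):
        coding += sequence.upper()
        if end_phase is not None and len(coding) % 3 != end_phase:
            raise ValueError(f"piece {number} does not end in phase {end_phase}")
    return coding


pieces = [
    ("CCAGAGGTGGAGg", 1),         # end of exon 1, marked (1)
    ("CCAAGGTCCATGAGGAG", None),  # start of exon 2
]
protein = translate(assemble(pieces))
print(protein)
if "*" in protein:
    print("stop codon found: check the intron boundaries for a frameshift")
```

The joined fragments translate to PEVEAKVHEE, which is the junction between the first two lines of the rhesus protein above. A piece cut one base too long or too short would fail the phase check before any translation is done.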

Assignment 2.

Do the gene assembly as shown above for an orangutan P450. Not all genes will show complete coverage in the Genome browser. There may be gaps in the sequence. If an exon is missing, just make a note saying an exon was missing here. The BLAT search at the UCSC browser is geared for near exact matches. Short or weak matches do not show up. You may be able to find missing exons by blastx searching the genomic DNA against human P450s in the P450 blast server. Look at your sequence at the top of the details page to see what is missing. Any long region of black text was not found by the BLAT program. In the list below you have two P450s given. You only have to do one of these for the assignment. The second is extra if you feel you want to do more than one.
J. Barnes 2S1 26B1
R. Bauer 1A1 11A1
A. Fathi Ahmed 8A1 27A1
H. Ghoneim 51A1 2W1
R. Horton 17A1 27C1
C. Hovinga 19A1 2R1
N. Khan 26C1 27A1
C. Liu 26B1 2C8
A. Lasiter 26A1 7B1
D. Mohamed 7A1 2B6
S. Whaley 2C8 27B1
Y. Zhao 2U1 7B1
Human P450s for this assignment