MSCI814 Module 2.

Superfamily Genomics/Eukaryotic Gene Assembly, Orangutan

David Nelson Jan. 14, 2010 (under revision)

This module deals with assembling genes from genomic DNA sequence. This is a little tricky, so we will begin with a newly sequenced genome that is similar to another well-annotated genome. The orangutan, Pongo pygmaeus abelii, has been sequenced and assembled with about 6X coverage. The orangutan sequencing was done by the Genome Sequencing Center at Washington University, St. Louis. This genome is aligned against the human genome at the UC Santa Cruz genome browser. Orangutan P450 sequences are about 95% identical to their human orthologs.

The 2008 module 2 was on duck-billed platypus.

LINKS FOR THIS MODULE

The 2006 module 2 was on Rhesus monkey.

Bioinformatics home page

Module 1 Intro to NCBI, BLAST of maize ESTs

Module 1 results maize P450s found by the class

Orangutan genome project Genome Browser Gateway

NCBI sequence viewer for retrieving a Genbank sequence

Expasy DNA translator A second tool for translating nucleotide sequence to protein

Human P450 Blast server For comparing a sequence against human P450s

Human P450s FASTA format

NCBI TBLASTN Server for BLASTing a protein against Genbank nucleotide sequences

Do-it-yourself WU Blast server For BLASTing a new sequence against your own set of sequences

UCSC bioinformatics server for genome browsers Use the BLAT search of orangutan

Vista genome browser for comparing genomes Has some precomputed genome comparisons

Before we begin on orangutan, I would like to show you how well you did with last week's assignment. I was very happy with the results of the class effort. I also made some extensions to your sequences by walking the ESTs (the optional part of assignment 1). The current sequence set is at the link above called Module 1 results.

Starting from no sequences on Thursday afternoon last week, we now have 15 different P450 sequences from maize. 14 of these are full length P450s.

Look at the maize sequence collection. I have given the names of everyone who turned in sequences. P450s have a more conserved C-terminal sequence than other parts of the protein. The middle region from about 130-300 amino acids is poorly conserved. You were asked to find the best hits in the database, so you naturally found the C-terminal parts most of the time. In some cases part of the sequence was missing in the EST database. I extended some sequences (in blue) by searching GSS or nr.

Sometimes the very last part of the sequence was not found. One to 16 amino acids were missed in the blast alignment. This is often the case. If the last few amino acids are not in a conserved motif, then they will differ between sequences, even if many other regions are highly conserved. The way to check for this is to translate the EST DNA sequence and look for the stop codon shown as an *.

The logical next step to continue your work from the first class would be to extend the sequences you found to include the whole protein coding region. This was the optional part of the assignment. Some sequences could not be completed because the EST collection is not comprehensive and some sequences are missing. Only ESTs were being used here, but you could easily search in nr or use genomic sequence from the GSS section of Genbank.

Let us select an example sequence, CYP4V (a chicken sequence), starting with the partial sequence shown below as the query. If a BLAST is done using this sequence against EST others, limited to Gallus gallus (chicken), we find:
>gi|25890949|gb|BU382948.1|BU382948  UniGene info 603858448F1 CSEQCHN75 Gallus gallus cDNA clone ChEST866o11 5'.
          Length = 798

 Score =  390 bits (1001), Expect = e-109
 Identities = 188/188 (100%), Positives = 188/188 (100%)
 Frame = +2

Query: 1   KREAFLDMLLNATDDEGKKLSYKDIREEVDTFMFEGHDTTAAAMNWVLYLLGHHPEAQKK 60
           KREAFLDMLLNATDDEGKKLSYKDIREEVDTFMFEGHDTTAAAMNWVLYLLGHHPEAQKK
Sbjct: 29  KREAFLDMLLNATDDEGKKLSYKDIREEVDTFMFEGHDTTAAAMNWVLYLLGHHPEAQKK 208

Query: 61  VHQELDEVFGNAERPVTVDDLKKLRYLECVVKEALRLFPSVPMFARSLQEDCYISGYKLP 120
           VHQELDEVFGNAERPVTVDDLKKLRYLECVVKEALRLFPSVPMFARSLQEDCYISGYKLP
Sbjct: 209 VHQELDEVFGNAERPVTVDDLKKLRYLECVVKEALRLFPSVPMFARSLQEDCYISGYKLP 388

Query: 121 KGTNVLVLTYVLHRDPEIFPEPDEFRPERFFPENSKGRHPYAYVPFSAGPRNCIGQRFAQ 180
           KGTNVLVLTYVLHRDPEIFPEPDEFRPERFFPENSKGRHPYAYVPFSAGPRNCIGQRFAQ
Sbjct: 389 KGTNVLVLTYVLHRDPEIFPEPDEFRPERFFPENSKGRHPYAYVPFSAGPRNCIGQRFAQ 568

Query: 181 MEEKTLLA 188
           MEEKTLLA
Sbjct: 569 MEEKTLLA 592

Note that the length of the hit is 798, while the alignment stops at base 592 (the frame is on the + strand), so there are 206 bases of sequence beyond the end of the alignment. This would code for 206/3 = 68 and 2/3 codons or amino acids. Since P450s are about 500 amino acids long, with about 50 amino acids after the heme signature FSAGPRNCIG, this should extend the sequence to the stop codon. To find out, we need to get the sequence and translate it. There are two ways to get the sequence. The easiest way is to click on the accession number in the BLAST output. It is hyperlinked and it will retrieve the sequence.
BU382948
GCGGCTCGAGGAAGAAAGTGGTTCCAAAAAGAGAGAAGCTTTCTTAGACATGCTGCTGAA
TGCCACAGATGATGAAGGGAAAAAACTCAGCTACAAGGACATTCGTGAAGAAGTGGATAC
TTTTATGTTTGAGGGTCATGATACAACAGCAGCCGCTATGAACTGGGTCCTATACTTGCT
TGGTCATCATCCTGAAGCCCAGAAGAAGGTTCACCAAGAACTGGATGAGGTGTTTGGCAA
CGCAGAGCGTCCTGTTACAGTGGATGATTTGAAGAAACTTCGATACCTCGAGTGTGTTGT
GAAAGAAGCCCTGAGGCTCTTCCCTTCAGTTCCCATGTTCGCCCGTTCCTTGCAAGAGGA
TTGCTATATTAGTGGATATAAGCTACCAAAAGGCACGAATGTCCTTGTCTTAACTTATGT
GCTGCACAGAGATCCTGAGATCTTCCCTGAGCCAGATGAATTCAGGCCTGAGCGCTTCTT
CCCTGAAAATAGCAAAGGAAGGCACCCATATGCTTATGTGCCCTTCTCTGCTGGCCCCAG
GAACTGCATTGGCCAACGCTTTGCACAAATGGAAGAGAAAACTCTTCTAGCC C CTCATCC
TGCGGCGCTTTTGGGTGGACTGTTCTCAAAAGCCAGAAGAGCTTGGTCTGTCAGGAGAAC
TAATTCTTCGTCCAAATAATGGCATCTGGGGTCAACTGAAGAGGAGACCAAAAACTGTAA
CAGAATGACAGGAATACAAGATTCTGATTTTCCAGAAACTTCTAAGCTATTGGACTGGAG
ATGTGTTTAAATCAGATG
The other way is to go to the NCBI sequence viewer link above and paste in the accession number in the search window (set for nucleotide in the pull down menu). Either way will get you to the sequence shown above. Copy the sequence and take it to the DNA translator above and paste it in and translate.
>_1
AARGRKWFQKERSFLRHAAECHR**REKTQLQGHS*RSGYFYV*GS*YNSSRYELGPILA
WSSS*SPEEGSPRTG*GVWQRRASCYSG*FEETSIPRVCCERSPEALPFSSHVRPFLARG
LLY*WI*ATKRHECPCLNLCAAQRS*DLP*AR*IQA*ALLP*K*QRKAPICLCALLCWPQ
ELHWPTLCTNGRENSSSPSSCGAFGWTVLKSQKSLVCQEN*FFVQIMASGVN*RGDQKL*
QNDRNTRF*FSRNF*AIGLEMCLNQM
>_2
RLEEESGSKKREAFLDMLLNATDDEGKKLSYKDIREEVDTFMFEGHDTTAAAMNWVLYLL
GHHPEAQKKVHQELDEVFGNAERPVTVDDLKKLRYLECVVKEALRLFPSVPMFARSLQED
CYISGYKLPKGTNVLVLTYVLHRDPEIFPEPDEFRPERFFPENSKGRHPYAYVPFSAGPR
NCIGQRFAQMEEKTLLAPHPAALLGGLFSKARRAWSVRRTNSSSK*WHLGSTEEETKNCN
RMTGIQDSDFPETSKLLDWRCV*IR
>_3
GSRKKVVPKREKLS*TCC*MPQMMKGKNSATRTFVKKWILLCLRVMIQQQPL*TGSYTCL
VIILKPRRRFTKNWMRCLATQSVLLQWMI*RNFDTSSVL*KKP*GSSLQFPCSPVPCKRI
AILVDISYQKARMSLS*LMCCTEILRSSLSQMNSGLSASSLKIAKEGTHMLMCPSLLAPG
TALANALHKWKRKLF*PLILRRFWVDCSQKPEELGLSGELILRPNNGIWGQLKRRPKTVT
E*QEYKILIFQKLLSYWTGDVFKSD
Look at all three frames. Notice that frames 1 and 3 have many * in them. These represent stop codons. They occur by chance about once every 21 codons in non-coding nucleotide sequence, since 3 of the 64 possible codons are stops. In coding sequence they should not appear except at the end of a coding region. Otherwise they are found in introns, the non-coding regions of genes. Remember that we are looking at ESTs (no introns).
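The three-frame translation done by the web translator can also be sketched in a few lines of Python. This is only an illustration, assuming the standard genetic code and the forward strand; the names translate and three_frames are made up for this example and are not part of any of the tools linked above. The test sequence is the first 60 bases of BU382948.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, offset=0):
    """Translate dna starting at a 0-based offset; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(offset, len(dna) - 2, 3))


def three_frames(dna):
    """Return forward-strand translations keyed by frame number 1, 2, 3."""
    return {frame: translate(dna, frame - 1) for frame in (1, 2, 3)}


# First 60 bases of the chicken EST BU382948 shown above.
est_start = "GCGGCTCGAGGAAGAAAGTGGTTCCAAAAAGAGAGAAGCTTTCTTAGACATGCTGCTGAA"

for frame, protein in three_frames(est_start).items():
    print(frame, protein, "stops:", protein.count("*"))
```

On this short piece, frame 2 is the only frame that contains the starting sequence KREAFLDMLL, and frame 3 already shows two stop codons.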

Frame 2 has very few * and they do not break up the sequence. If you look at this frame you will find the heme signature mentioned above that was part of the starting sequence in CYP4V. This unbroken sequence is the P450 CDS (coding sequence) up to the stop codon. The sequence we have now is shown below. Magenta is our starting sequence.
>_2
RLEEESGSKKREAFLDMLLNATDDEGKKLSYKDIREEVDTFMFEGHDTTAAAMNWVLYLL
GHHPEAQKKVHQELDEVFGNAERPVTVDDLKKLRYLECVVKEALRLFPSVPMFARSLQED
CYISGYKLPKGTNVLVLTYVLHRDPEIFPEPDEFRPERFFPENSKGRHPYAYVPFSAGPR
NCIGQRFAQMEEKTLLAPHPAALLGGLFSKARRAWSVRRTNSSSK*
We have now completed the C-terminus of chicken CYP4V.

Facing up to imperfect sequence data


The sequence we have just extended looked like a good sequence up to the end. However, it is always a good idea to check your work. If you take this sequence to the human P450 blast page above and do the blast, you will find that the C-terminus we added does not match the human sequence of CYP4V2. In fact, the match stops at TLLA. This suggests an error in the sequence that has caused a frameshift: probably one base is missing or one extra base has been added. To check for frameshifts you need to blast the other two frames in the region after TLLA. One of them will probably match the human sequence. If you blast the last two lines of frame 3 you will find that frame 3 is the correct frame after TLLA. You would also see this if you did a blastx search of the EST nucleotide sequence.

>CYP4V2 AC012525 Homo sapiens chromosome 4
        Length = 525

 Score = 143 (50.3 bits), Expect = 1.8e-12, P = 1.8e-12
 Identities = 26/37 (70%), Positives = 32/37 (86%)

Query:    19 ILRRFWVDCSQKPEELGLSGELILRPNNGIWGQLKRR 55
             ILR FW++ +QK EELGL G+LILRP+NGIW +LKRR
Sbjct:   483 ILRHFWIESNQKREELGLEGQLILRPSNGIWIKLKRR 519

Where is the frameshift?

GAACTGCATTGGCCAACGCTTTGCACAAATGGAAGAGAAAACTCTTCTAGCC C CTCATCC
                                        T  L  L  A     L  I  L
TGCGGCGCTTTTGGGTGGACTGTTCTCAAAAGCCAGAAGAGCTTGGTCTGTCAGGAGAAC
  R  R  F  W
The sequence from above is shown with the TLLA and LILRRFW sequences in translation below it. There is an extra C base in a run of four Cs: the run was read as four Cs when there were only three here. Removing the extra C puts the downstream sequence back in frame with TLLA. Now we can say with more confidence that the correct CYP4V sequence is

RLEEESGSKKREAFLDMLLNATDDEGKKLSYKDIREEVDTFMFEGHDTTAAAMNWVLYLL
GHHPEAQKKVHQELDEVFGNAERPVTVDDLKKLRYLECVVKEALRLFPSVPMFARSLQED
CYISGYKLPKGTNVLVLTYVLHRDPEIFPEPDEFRPERFFPENSKGRHPYAYVPFSAGPR
NCIGQRFAQMEEKTLLALILRRFWVDCSQKPEELGLSGELILRPNNGIWGQLKRRPKTVTE*
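The frameshift repair can be checked with a short Python sketch. It assumes the standard genetic code, and it assumes the extra base is one of the Cs in the first CCCC run of the two EST lines shown under "Where is the frameshift?". The helper name translate is made up for this illustration. The reading frame starts one base into the fragment, which corresponds to frame 2 of the whole EST.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, offset=0):
    """Translate dna starting at a 0-based offset; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(offset, len(dna) - 2, 3))


# The two EST lines around the frameshift, with the display spaces removed.
fragment = ("GAACTGCATTGGCCAACGCTTTGCACAAATGGAAGAGAAAACTCTTCTAGCCCCTCATCC"
            "TGCGGCGCTTTTGGGTGGACTGTTCTCAAAAGCCAGAAGAGCTTGGTCTGTCAGGAGAAC")

# Delete one C from the run of four Cs (assumed location of the error).
run_start = fragment.find("CCCC")
corrected = fragment[:run_start] + fragment[run_start + 1:]

as_sequenced = translate(fragment, 1)   # frame 2 of the EST, uncorrected
repaired = translate(corrected, 1)      # same frame after removing one C

print(as_sequenced)
print(repaired)
```

The first line printed ends in the PHPAALLGG sequence that failed to match human CYP4V2. The second continues from TLLA into LILRRFWVDCSQ, the part that aligned with the human protein.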
A similar strategy can be used to extend the sequence upstream to the start codon. You must keep the possibility of frameshifts in your mind while you are doing this. The process is:
  • Do a blast search and find an accession that overlaps your known sequence and extends it upstream.
  • Get the sequence and translate it in the correct frames based on the frame of the blast match (+ or -, Forward or Reverse).
  • Look for the open reading frame (ORF), the one without stop codons in it that extends your known sequence.
  • Check your work by blasting the new sequence against human for an animal P450, or Arabidopsis for a plant P450.
  • The Met start codon should be in about the same place as the start codon in the best match.
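The third step in the list, finding the open reading frame that extends your known sequence, can be sketched as follows. This is a rough illustration only: it assumes the standard genetic code, it looks only at the three forward frames (a minus-strand hit would first need to be reverse complemented), and the function name frame_extending is hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, offset=0):
    """Translate dna starting at a 0-based offset; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(offset, len(dna) - 2, 3))


def frame_extending(dna, known_peptide):
    """Find the forward frame whose stop-free stretch contains known_peptide.

    Returns (frame_number, open_stretch), or None if no frame contains it.
    """
    for frame in (1, 2, 3):
        for stretch in translate(dna, frame - 1).split("*"):
            if known_peptide in stretch:
                return frame, stretch
    return None


# First 60 bases of BU382948 and the start of our known CYP4V sequence.
est_start = "GCGGCTCGAGGAAGAAAGTGGTTCCAAAAAGAGAGAAGCTTTCTTAGACATGCTGCTGAA"
result = frame_extending(est_start, "KREAFLDMLL")
print(result)
```

A frameshift in the new sequence would make the known peptide and its true extension fall in different frames, so a short open stretch here is a hint to check the other frames by blast as described above.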

Going after bigger fish, searching genomic DNA

ESTs are very nice to learn on, but now that you are more familiar with BLAST and frameshifts, you are ready for harder problems. Eukaryotic genomes have introns. Some genes have none (CYP8B1 for example), but that is rare; 5-10 introns is more common. Introns make your life difficult, because you have to find them all and find all the intron-exon boundaries.

Genome sequences are being generated worldwide at the rate of more than 1 billion bases of sequence a day, and that is probably a low estimate now. Celera Genomics had a 100 million bases per day capacity. The new J. Craig Venter Institute will be about the same and can increase to 4 times that level. The Broad Institute at MIT is near 60 million a day, probably more by now. The Joint Genome Institute (DOE) did 82 million capillary bases/day and 24 million 454 bases/day in 2007. The Wellcome Trust Sanger Institute press release from Dec 6, 2007 said "When fully deployed, the new platforms will boost DNA sequence capacity of the Institute from 110M bases per day to more than 6500M bases - or a complete diploid human genome - per day." Baylor College of Medicine in a Nov. 21, 2006 press release said "The combined sequence output from the centers, using current technologies, is expected to be about 12 billion DNA base pairs per month - the equivalent of four human genomes." That is 400 million bases per day. The Beijing Genomics Institute was mentioned in a Dec 24 news story from GenomeWeb Daily News "...updates the institute's capacity to 250 million base pairs a day with standard equipment, and up to four billion a day using next-generation sequencers." Washington University Genome Sequencing Center is one of the three large scale genome centers funded by NHGRI (the others are Baylor College of Medicine and the Broad Institute). JGI is funded by DOE. The private company GATC Biotech in Germany will boost its current sequencing capacity from 130 gigabases to 250 gigabases a year (685 million bases per day). There are more genome centers like Riken and Kazusa in Japan and Genoscope in France (30 million/day) and other sites in the world.

These centers are producing more sequence than they can possibly annotate in any detail. They are relying on automated gene finding programs and automated blast comparisons to annotate these genomes. These programs will have moderate success, but they will not be 100% correct. The gene finding programs only get about 60% accuracy. They fuse adjacent genes together. They skip exons. Some exons are very short, as short as eight nucleotides. See the GHE sequence below from several P450 genes from the white rot fungus. Automated programs miss short exons and they fail to detect bad exon boundaries that are probably sequence errors. In short, they do not have expert knowledge of a single protein family. Celera realized this when they did the Drosophila genome and they held two Gene Jamborees, where expert annotators were brought in to work on assembling genes from individual families. The Riken mouse cDNA project had a similar meeting with about 50 invited annotators. These types of expert groups are now usually organized online without a physical meeting. I have been involved with a number of these genome annotation groups including Drosophila, the cottonwood tree, the moss genome (Science paper just out Jan 4, 2008), the Tribolium (beetle) genome, the papaya genome, Daphnia (water flea), brown algae, green algae, several fungal genomes, and Anopheles and Aedes mosquitoes.



Figure showing a very short micro exon GHE from several P450s

Phanerochaete chrysosporium (white rot fungus) Scaffold_388a very similar 
to sequences 77 417 112 129
gene model complete all boundaries checked, 16 exons
3583 MISDTFALAISSGLSLFLCLKAFIDYRAGLRSI (2) 3684 ex1
3732 NHSYLPGFRALISSFGILGLFFKEPKRGLWGGRRRFWLRKHLDFEEAGVDIISH (0) 3893 ex2
3954 IAFLPSVSTYLLLADAAAIK (0) 4013 ex3
4069 EVTGHRARFPKPTYKTLRIFGGNVLASEGEEWKRHRKVVGPAFSE (0) 4203 c-helix ex4
4255 HNNRLVWNETVKIVNDLFANVWGSQSEVYVDNVVQSVTLP (0) 4374 ex5
4423 MALYVISIAGFGKRALWQADGNLPPGHKLSFQ (0) 4521 ex6
4576 DALHILGTDLWIKAATPTLLMNWAPTTRIANVKLAFDEVK (0) 4692 ex7
4747 QYMLELIQERRNSEKRDERYDLFSSLLDANDLNEDGNGNVTLTNDELL (1) 4890 ex8
     GNIFIFMLA (1) 4973 ex9 split
     GHE (0) ex9 split
5087 TTAHTLAFTFGLLALHPDYQETVYQQIKSIVPDNRPP (0) 5197 ex10
     MYEEMNSLTECMA (2) ex11
5351 YETLRLFPP (0) 5380 ex12
5436 TATIPKIAAEDTYLVTIDRAGNRVVVPVPCGTALHLNVIALHHN (1) 5564 ex13
5614 PRYWDNPSAFKPERFRGDWPRDAFIPFSTGSRSCIGRR (2) 5730 ex14
5780 FFETESIAILTMILSRYKIELRNDPRFADETYEERWQRVLRVKDGLTPA* 5932 ex15

compare to scaffold 388b not a separate exon

      GNIFIFLLAGHE(0) 14307 ex9
14247 TTAHTLAFTFGLLALYPEQQDKLYKHIKHVIPDGRIP (0) 14137 ex10

and scaffold 129

      GNIFIFMLA (1) ex9 split
      GHE (0) ex9 split
25150 TTAHTLAFTFGLLALHSDYQEKVHQQIKSIMPDNRLP (0) 25260 ex10

and scaffold 12a not a separate exon (exons have been fused)

54432 VKANMTEDAKSRLSEEEMYAEMR (2) 
54552 TILFAGHETTSTTISWVLLE



Expert knowledge reveals GHE to be a real exon, by comparison with other P450s. This happens to be in a motif region that is conserved (AGXETT) and easily recognized, but automated gene finding programs are not going to see this. Please note that the intron-exon phases are indicated in (). The number 0, 1 or 2 tells the phase of the junction. Exons that join between codons are phase 0, those that join one base into a codon are phase 1 and those that join two bases in are phase 2. You cannot join exons together in frame unless the phase is preserved at both ends of the intron. In the example above, the sequence GNIFIFMLA ends in a phase 1 boundary, while the sequence TTAHTLAFTFGLLA starts with a phase 0 boundary. The two cannot be joined unless there is an exon in between with a phase 1 start and a phase 0 end. That is the GHE exon. It is only eight base pairs long: the last two bases of the Gly codon (the first base is at the end of the previous exon) plus the complete His and Glu codons. This short exon is seen in six different genes in the white rot fungus. These small exons have been called micro exons.
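The phase rule can be written down as a small Python check. This is a sketch under one assumption about the notation in the figure: the amino acid of a split codon is written with the downstream exon (as the G of GHE is). The Exon class and the function phases_consistent are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Exon:
    peptide: str      # amino acids written for this exon in the figure
    start_phase: int  # phase of the intron in front of the exon (0, 1 or 2)
    end_phase: int    # phase of the intron after the exon (0, 1 or 2)

    def nt_length(self):
        """Exon length in bases.

        Assumes a split codon's amino acid is written with the downstream
        exon: start_phase bases of the first codon lie in the previous exon,
        and end_phase bases of the next exon's first codon lie in this one.
        """
        return 3 * len(self.peptide) - self.start_phase + self.end_phase


def phases_consistent(exons):
    """True if every intron has the same phase at both of its ends."""
    return all(up.end_phase == down.start_phase
               for up, down in zip(exons, exons[1:]))


# Exons from scaffold 388a in the figure above.
ex9_first = Exon("GNIFIFMLA", start_phase=1, end_phase=1)
micro = Exon("GHE", start_phase=1, end_phase=0)
ex10 = Exon("TTAHTLAFTFGLLALHPDYQETVYQQIKSIVPDNRPP", start_phase=0, end_phase=0)

print(phases_consistent([ex9_first, ex10]))         # micro exon skipped
print(phases_consistent([ex9_first, micro, ex10]))  # micro exon included
print(micro.nt_length())
```

Skipping the GHE exon leaves a phase 1 end next to a phase 0 start, which is the kind of mismatch that tells an annotator an exon is missing or a boundary is wrong.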

Introns begin with GT and end with AG. Some rare introns begin with GC. These bases, and all the bases between them, are cut out when the full length transcript is processed in the nucleus to make the mature mRNA (messenger RNA). Below is an example of a Drosophila P450 gene Cyp4e2, showing the 5 introns in place. Please look at each intron and find the GT and AG pairs. Also notice the phase of each intron as described above. This figure is the BLAT search output of the UCSC genome browser.
The exons are in blue capital letters. Introns are lower case black letters. The capitals are all in blocks of three per codon. This allows the phase of each intron boundary to be determined, since the last capital letter is at the end of a complete codon. If gt follows, then the phase is 0. If ngt (n is any base) follows, then the phase is 1. If nngt follows, the phase is 2. In this figure intron 1 and intron 5 are phase 1, though the correct boundary of intron 5 is off by one codon from where it is drawn. Introns 2 and 4 are phase 0. Intron 3 is phase 2. The numbering on the right is nucleotide numbering for Chromosome 2R.
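The gt, ngt, nngt rule can be sketched in Python as a first guess at the phase of the start of an intron. The function name donor_phase is made up for this example, and the input is assumed to be a piece of BLAT-style display: capital letters (whole codons) followed by lower case intron letters. It simply takes the first GT at or after the last capital letter, so the answer still has to be checked against the protein alignment, because the real boundary can be a later GT or one inside the capitals. The three test strings are the phase examples given in the step-by-step procedure later in this module.

```python
import re


def donor_phase(display):
    """Guess the intron phase from exon capitals followed by intron lower case.

    Returns 0, 1 or 2, or None if no GT is found. The search starts at the
    last capital letter, so a G in capitals followed by a lower case t
    counts as a GT that overlaps the boundary (phase 2).
    """
    match = re.fullmatch(r"([A-Z]+)([a-z]+)", display.replace(" ", ""))
    if match is None:
        raise ValueError("expected capitals followed by lower case letters")
    exon, intron = match.groups()
    window = (exon[-1] + intron).lower()
    index = window.find("gt")
    if index < 0:
        return None
    # index 1 means gt is the first two intron letters (phase 0),
    # index 2 means ngt (phase 1), index 3 or index 0 means phase 2.
    return (index - 1) % 3


for example in ("GATCGTTgtagtca", "GATCAGTggtcgtaa", "GATCAGGtcgtagtaa"):
    print(example, donor_phase(example))
```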

In this module, you will learn how to identify intron-exon boundaries, by comparing two similar genes, one from human, and the other from orangutan. This will be fairly easy since the genes are similar. In real life situations, this is not going to be so clear. We have all 57 P450 genes identified from human, but we are starting with none identified from orangutan. The human P450 sequences are linked below.

We are very fortunate to have the UCSC genome browser, which has an alignment of the human and orangutan genomes as of July 2007. By doing a BLAT search (slightly different from a BLAST search; it is optimized for near exact matches) with a human P450, we will get the region of the orangutan genome with the best match to our query sequence. Then we can link to a view of the genomic DNA sequence showing the location of the protein coding exons over the region of our gene. From this figure and our original protein sequence we can identify the GT and AG intron-exon boundaries. The introns can be edited out, and the assembled gene can be translated to give the orangutan protein sequence.
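Editing out an intron and translating across the junction can be sketched in Python. The two pieces below are the end of the first exon and the start of the second exon from the CYP2A13 exon listing given later in this module; the intron between them is phase 1. The standard genetic code is assumed, and the names assemble and translate are made up for this illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, offset=0):
    """Translate dna starting at a 0-based offset; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(offset, len(dna) - 2, 3))


def assemble(exons):
    """Join exon pieces (introns already removed) into one coding sequence."""
    return "".join(piece.replace(" ", "").upper() for piece in exons)


# End of one exon (the final g is the single base of the split codon)
# and start of the next exon, copied from the CYP2A13 listing.
exon_end = "CCAGAGGTGGAGg"
next_exon_start = "CCAAGGTCCATGAGGAG"

joined = translate(assemble([exon_end, next_exon_start]))

# The same join with the phase 1 base left off by mistake.
misjoined = translate(assemble([exon_end[:-1], next_exon_start]))

print(joined)
print(misjoined)
```

The correct join reads PEVE followed by AKVHEE, as in the rhesus protein shown later. Dropping the single base shifts the frame, and the sequence after PEVE no longer matches the human protein.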

The step by step procedure is now given:
  • Link to the Human P450s for this assignment and copy your protein sequence.
  • Paste your protein sequence in a Word Document for later use. You will be identifying the intron boundaries and phases in this sequence.
  • Link to the UCSC bioinformatics server, click on BLAT, and then select orangutan under the genome pull down menu.
  • Paste your P450 sequence in the window and click the submit file button. You will see a display like the one below. This example is human CYP2A13 compared to rhesus monkey. Notice under the identity column the top hit is 93.6%. That is your gene. The others are different P450 sequences. The BLAT search only returns strong matches. Notice there are no hits less than 61%. The gene is on Scaffold 111671 on the minus strand at the nucleotide location given. The +- signs indicate your query sequence is always treated as Plus Strand and the match is Minus Strand. ++ would mean the hit is in the Plus Strand.

    This example was made with the bottom part of CYP2A13. The whole 2A13 sequence caused an error, so I shortened it. I think the error may have happened because this sequence is at the end of a scaffold and the whole sequence is not on the scaffold. Notice START END and QSIZE. This is the amino acid position of your query that matched. QSIZE is the query length. Here the query was 218 aa long and 2-218 matched.

  • Click on details on the left side. This gives the image of the gene structure. This gene fragment has 4 exons, but only two are shown in the first image. The top of the figure has your query sequence color-coded to show matches, mismatches and gaps (introns).



    The second image shows all four exons.



    The bottom of the details page shows the protein translation of each exon. Conservative changes are in green, other changes are in red. The top line is the query sequence with the query amino acid numbering. The bottom line is the translated nucleotide sequence with nucleotide numbering.



  • Go back and click on browser. This takes you to the browser page showing the gene from one end to the other (exons 6-9 in this case). Multiple levels of evidence are given on the page, including your query sequence at the top, Genbank RefSeq genes, and other curated and non-curated predictions and sequence matches, including ESTs. The wide bars in the image are exons, or predicted exons. The connecting line covers the region from the beginning of your query sequence to the end. The arrows on the connecting line show the orientation of the gene. This gene piece has 4 exons and it is on the minus strand.

  • Click on the purple bar (color indicates chromosome, see code at the bottom of the graphic). This links you to both the human and rhesus nucleotide sequences in the region of the browser window. Farther down on this page is an alignment of the two sequences. You can also open the human browser at the equivalent spot to see what is there. This can be very helpful when dealing with gene clusters.

  • Go back to details and examine the sequence showing the exons in blue and the introns in lower case black. It is time to identify the GT and AG boundaries of the introns. Keep in mind that these figures are drawn with intact codons in Blue. The codons are never split. This means that you can tell the phase if you can identify the GT that is the beginning of the intron. It should be the first GT in the black lower case region. This may be one or two codons away, but usually it is in the first codon in the black. The GT may also be in the Blue region, or the last letter of the Blue region may be G with the T in the black. The rules for determining the intron phase are:

    Phase 0. G and T are the first two letters of the black sequence (positions 1,2), or they are at positions 4,5 or 7,8. The intron is Phase 0 and it occurs between codons. Phase 0 codons always code for Val, since all valine codons begin with GT.

    GATCGTTgtagtca

    Phase 1. G and T are at positions 2,3 or 5,6 or 8,9 in the black sequence. The most common phase one codon is GGT = glycine. Phase one boundaries break the codon one nucleotide in.

    GATCAGTggtcgtaa

    Phase 2. G and T may overlap the blue and black sequence (positions -1,1), or they are at positions 3,4 or 6,7 in the black sequence. Arginine is often a phase 2 codon (AGGT). Phase two boundaries break the codon two nucleotides in.

    GATCAGGtcgtagtaa

    The other end of the intron will have an AG before the coding sequence begins again.

    Phase 0. AG may be the last two letters of the black intron sequence. They may be multiples of three from that boundary. Frequently CAG = gln is at the phase 0 boundary.

    gagcagAAGTACGAA

    Phase 1. The AG can span the black and blue boundary with A in the black and G in the blue. They can also be multiples of three from that location.

    taagaaGCAGACGAA

    Phase 2. AG can be one letter away from the blue boundary as AGX. The AG can also be the first two letters in the blue boundary.

    gatagtAGAAGCGAA

  • Copy each exon sequence up to but not including the GT of the intron. Mark each intron boundary on the end of an exon with the correct phase in parentheses (0), (1) or (2). The phase at both ends of the intron has to be the same, so you only need to mark one end. Start copying the next exon after the AG boundary. Repeat until you have all exons copied and marked for phase.
    
    The CYP2A13 sequence (4 exons, introns removed)
    GAGGAGAAGA ACCCCAACAC GGAGTTCTAC TTGAAGAACC TGatgATGAC CACGCTGAAC  14520
    CTCTTCattG CAGGCACCGA GACCGTCAGC ACCACCCTGC GCTATGGCTT CCTGCTGCTC  14460
    ATGAAGtatC CAGAGGTGGA Gg (1)
    CCAAG GTCCATGAGG AGATTGACAG AGTGATCGGC AAGAACCGGC AGCCCAAGTT  13920
    TGAGGACCGG gtcAAGATGC CCTACatgGA GGCAGTGATC CATGAGATCC AAAGATTTGG  13860
    AGACgtgatc CCCATGagcT TGGCCcgcAG GGTCAACAAG GACACCAAGT TTCGGGATTT  13800
    CTTCCTCCCT AAG (0)
    GGCACC  13260
    GAAGTGTTCC CTATGCTGGG CTCCgtgCTG AGAGACCCCA GGTTCTTCTC CAACCCCCAG  13200
    GACttcaatC CCCAGCACTT CTTGGATGAG AAGGGGCAGT TTAAGAAGAG TGACGCTTTT  13140
    GTGCCCTTTT CCATCg (1)
    GAAAG CGGaacTGTT TCGGAGAAGG CCTGGCCAGA  12120
    ATGGAGCTCT TTCTCTTCTT CACCACCATC ATGCAGAACT TCCGCTTCAA GTCCCCCCAG  12060
    ttgCCCAAGG ACATCGACGT GTCCCCCAAA CACGTGGGCT TTGCCACGAT CCCAccaAAC  12000
    TACACCATGA GCTTCCTGCC CCGC 
    
  • Take your complete coding sequence with all introns removed to the DNA translator and translate your sequence. Sometimes there are errors made in cutting out the intron sequences. This can make a frameshift and introduce stop codons into your translation. If you see stop codons, then look for the location of the frameshift by blast searching your three frame translations against human at the human P450 blast server page. The place where your sequence stops matching the human sequence is where the frameshift is. This will probably be at one of the intron boundaries where you made an error in removing the intron.
  • Paste your newly assembled orangutan protein sequence below the human starting sequence in your Word file. Mark the location and phase of the introns and do a blast search against human to get the sequence alignment as shown below. Email me these three items.
    
    >human protein CYP2A13 last 4 exons
    EEEKNPNTEFYLKNLVMTTLNLFFAGTETVSTTLRYGFLLLMKHPEVE
    AKVHEEIDRVIGKNRQPKFEDRAKMPYTEAVIHEIQRFGDMLPMGLAHRVNKDTKFRDFFLPK
    GTEVFPMLGSELRDPRFFSNPQDCSPQHFLDEKGQFKKSDAFVPFSI
    GKRYCFGEGLARMELFLFFTTIMQNFRFKSPQSPKDIDVSPKHVGFATIPRNYTMSFLPR
    
    >Rhesus match (yours will be orangutan)
    EEKNPNTEFYLKNLMMTTLNLFIAGTETVSTTLRYGFLLLMKYPEVE (1)
    AKVHEEIDRVIGKNRQPKFEDRVKMPYMEAVIHEIQRFGDVIPMSLARRVNKDTKFRDFFLPK (0)
    GTEVFPMLGSVLRDPRFFSNPQDFNPQHFLDEKGQFKKSDAFVPFSI (1)
    GKRNCFGEGLARMELFLFFTTIMQNFRFKSPQLPKDIDVSPKHVGFATIPPNYTMSFLPR*
    
    Alignment of the two sequences 93% identical
    Query: 2   EEKNPNTEFYLKNLVMTTLNLFFAGTETVSTTLRYGFLLLMKHPEVEAKVHEEIDRVIGK 61
               EEKNPNTEFYLKNL+MTTLNLF AGTETVSTTLRYGFLLLMK+PEVEAKVHEEIDRVIGK
    Sbjct: 1   EEKNPNTEFYLKNLMMTTLNLFIAGTETVSTTLRYGFLLLMKYPEVEAKVHEEIDRVIGK 60
    
    Query: 62  NRQPKFEDRAKMPYTEAVIHEIQRFGDMLPMGLAHRVNKDTKFRDFFLPKGTEVFPMLGS 121
               NRQPKFEDR KMPY EAVIHEIQRFGD++PM LA RVNKDTKFRDFFLPKGTEVFPMLGS
    Sbjct: 61  NRQPKFEDRVKMPYMEAVIHEIQRFGDVIPMSLARRVNKDTKFRDFFLPKGTEVFPMLGS 120
    
    Query: 122 ELRDPRFFSNPQDCSPQHFLDEKGQFKKSDAFVPFSIGKRYCFGEGLARMELFLFFTTIM 181
                LRDPRFFSNPQD +PQHFLDEKGQFKKSDAFVPFSIGKR CFGEGLARMELFLFFTTIM
    Sbjct: 121 VLRDPRFFSNPQDFNPQHFLDEKGQFKKSDAFVPFSIGKRNCFGEGLARMELFLFFTTIM 180
    
    Query: 182 QNFRFKSPQSPKDIDVSPKHVGFATIPRNYTMSFLPR 218
               QNFRFKSPQ PKDIDVSPKHVGFATIP NYTMSFLPR
    Sbjct: 181 QNFRFKSPQLPKDIDVSPKHVGFATIPPNYTMSFLPR 217
    
    
    
    The genes in orangutan will be very close to human so finding exons should be quite easy to do.
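The percent identity reported for the alignment above can be recomputed with a short Python sketch. It assumes an ungapped alignment, which holds for this example: the rhesus protein lines up with the human query from its second residue onward, as the Query numbering 2-218 shows. The function name percent_identity is made up for this illustration, and a real BLAST alignment with gaps would need a proper alignment program instead.

```python
def percent_identity(seq_a, seq_b):
    """Count identical positions in two equal-length, ungapped sequences.

    Returns (number_of_matches, percent_identity).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches, 100.0 * matches / len(seq_a)


# Human CYP2A13 last 4 exons, one string per exon, as listed above.
human = ("EEEKNPNTEFYLKNLVMTTLNLFFAGTETVSTTLRYGFLLLMKHPEVE"
         "AKVHEEIDRVIGKNRQPKFEDRAKMPYTEAVIHEIQRFGDMLPMGLAHRVNKDTKFRDFFLPK"
         "GTEVFPMLGSELRDPRFFSNPQDCSPQHFLDEKGQFKKSDAFVPFSI"
         "GKRYCFGEGLARMELFLFFTTIMQNFRFKSPQSPKDIDVSPKHVGFATIPRNYTMSFLPR")

# Rhesus match, without the phase marks and the final stop.
rhesus = ("EEKNPNTEFYLKNLMMTTLNLFIAGTETVSTTLRYGFLLLMKYPEVE"
          "AKVHEEIDRVIGKNRQPKFEDRVKMPYMEAVIHEIQRFGDVIPMSLARRVNKDTKFRDFFLPK"
          "GTEVFPMLGSVLRDPRFFSNPQDFNPQHFLDEKGQFKKSDAFVPFSI"
          "GKRNCFGEGLARMELFLFFTTIMQNFRFKSPQLPKDIDVSPKHVGFATIPPNYTMSFLPR")

# The alignment starts at query residue 2, so skip the first human residue.
matches, pct = percent_identity(human[1:], rhesus)
print(f"{matches}/{len(rhesus)} identical = {pct:.1f}%")
```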



    Assignment 2.

    Do the gene assembly as shown above for an orangutan P450. Not all genes will show complete coverage in the genome browser. There may be gaps in the sequence. If an exon is missing, just make a note saying an exon was missing here. The BLAT search at the UCSC browser is geared for near exact matches. Short or weak matches do not show up. You may be able to find missing exons by blastx searching the genomic DNA against human P450s in the P450 blast server. Look at your sequence at the top of the details page to see what is missing. Any long region of black text was not found by the BLAT program. In the list below you have two P450s given. You only have to do one of these for the assignment. The second is extra if you feel you want to do more than one.
    
    J. Barnes                2S1   26B1
    R. Bauer                 1A1   11A1
    A. Fathi Ahmed           8A1   27A1
    H. Ghoneim               51A1  2W1
    R. Horton                17A1  27C1
    C. Hovinga               19A1  2R1
    N. Khan                  26C1  27A1
    C. Liu                   26B1  2C8
    A. Lasiter               26A1  7B1
    D. Mohamed               7A1   2B6
    S. Whaley                2C8   27B1
    Y. Zhao                  2U1   7B1
    
    
    
    
    

    Human P450s for this assignment