MSCI814 Module 2.
Superfamily Genomics/Eukaryotic Gene Assembly, Rhesus Monkey
David Nelson Jan. 19, 2006
This module deals with assembling genes from genomic DNA sequence. This is a little tricky, so we will begin with a newly sequenced genome that is very similar to another well annotated genome. Rhesus monkey, Macaca mulatta has been sequenced and assembled. The Macaque Genome Sequencing Consortium is led by the Baylor College of Medicine Human Genome Sequencing Center, and in collaboration with the J. Craig Venter Institute Joint Technology Center, and the Genome Sequencing Center at Washington University, St. Louis. This genome is aligned against the human genome at the UC Santa Cruz genome browser.
LINKS FOR THIS MODULE
Bioinformatics home page (Under construction)
Module 1 Intro to NCBI, BLAST of Xenopus tropicalis ESTs
Module 1 results Xenopus P450s found by the class
Rhesus monkey genome project Baylor Project page
NCBI sequence viewer for retreiving a Genbank sequence
Human P450 Blast server For comparing a sequence against human P450s
Human P450s FASTA format
NCBI TBLASTN Server for BLASTing a protein against Genbank nucleotide sequences
Do-it-yourself WU Blast server For BLASTing a new sequence against your own set of sequences
UCSC bioinformatics server for genome browsers use BLAT search of Rhesus
Vista genome browser for comparing genomes Has some precomputed genome comparisons, but not rhesus monkey
Before we begin on Rhesus monkey, I would like to show you how well you did with last weeks assignment. I was very happy with the results of the class effort. I did some searches in the nr section of Genbank for Xenopus tropicalis P450s. I found some and added them to our growing collection. I also made some extensions to your sequences by walking the ESTs (the optional part of assignment 1). The current sequence set is at the link above called Module 1 results. The yellow regions were submitted by the class.
Starting from no sequences on Thursday afternoon last week, we now have 49 different P450 sequences from Xenopus tropicalis. 38 of these are full length P450s. Fugu (the Japanese pufferfish) has 54 and humans have 57, so the frog will probably have a similar number. We may only be missing about 5 sequences including CYP7A1, CYP11B1 and some of these (like CYP21) are not present in the frog EST collection at Genbank.
Look at the frog sequence collection. I have given the names of everyone who turned in sequences, and I have colored the regions yellow that were identified by students in this class. There is something to learn from this data. The yellow regions are mostly near the C-terminal end of the P450 sequences. That is caused by P450s having a more conserved C-terminal sequence than other parts of the protein. The middle region from about 130-300 amino acids is poorly conserved. You were asked to find the best hits in the database, so you naturally found the C-terminal parts most of the time. In some cases where your best hit was to the N-terminal part, this was due to the C-terminal part being missing in the ESTdb.
Sometimes the very last part of the sequence was not found (DT436641, CYP4T8). One to 16 amino acids were missed in the blast alignment. This is often the case. If the last few amino acids are not in a conserved motif, then they will differ between sequences, even if many other regions are highly conserved. The way to check for this is to translate the EST DNA sequence and look for the stop codon shown as an *.
The logical next step to continue your work from the first class would be to extend the sequences you found to include the whole protein coding region. This was the optional part of the assignment. Some sequences could not be completed because the EST collection is not comprehensive and some sequences are missing. CYP21 was not in the dataset, and most of CYP8A1 was missing. Only 83 amino acids at the C-terminal were found. Only ESTs were being used here, but you could easily search in nr or use genomic sequence from the GSS section of Genbank. There are no Xenopus tropicalis sequences in the HTGS or WGS sections, but there are 22.9 million in the Trace Archive.
Let us select an example sequence CYP4V (a chicken sequence) starting with the partial sequence shown below. If a BLAST is done using this sequence against EST others, limited to Gallus gallus (chicken) we find
>gi|25890949|gb|BU382948.1|BU382948 UniGene info 603858448F1 CSEQCHN75 Gallus gallus cDNA clone ChEST866o11 5'.
Length = 798
Score = 390 bits (1001), Expect = e-109
Identities = 188/188 (100%), Positives = 188/188 (100%)
Frame = +2
Query: 1 KREAFLDMLLNATDDEGKKLSYKDIREEVDTFMFEGHDTTAAAMNWVLYLLGHHPEAQKK 60
Sbjct: 29 KREAFLDMLLNATDDEGKKLSYKDIREEVDTFMFEGHDTTAAAMNWVLYLLGHHPEAQKK 208
Query: 61 VHQELDEVFGNAERPVTVDDLKKLRYLECVVKEALRLFPSVPMFARSLQEDCYISGYKLP 120
Sbjct: 209 VHQELDEVFGNAERPVTVDDLKKLRYLECVVKEALRLFPSVPMFARSLQEDCYISGYKLP 388
Query: 121 KGTNVLVLTYVLHRDPEIFPEPDEFRPERFFPENSKGRHPYAYVPFSAGPRNCIGQRFAQ 180
Sbjct: 389 KGTNVLVLTYVLHRDPEIFPEPDEFRPERFFPENSKGRHPYAYVPFSAGPRNCIGQRFAQ 568
Query: 181 MEEKTLLA 188
Sbjct: 569 MEEKTLLA 592
Note the length of the hit = 798, while the alignment stops at 592 (frame is + strand) so there are 206 bases of sequence beyond the end of the alignment. This would code for 206/3 = 68 and 2/3 codons or amino acids. Since P450s are about 500 amino acids long and about 50 amino acids after the heme signature FSAGPRNCIG This should extend the sequence to the stop codon. To find out we need to get the sequence and translate it. There are two ways to get the sequence. The easiest way is to click on the accession number in the blast output. It is hyperlinked and it will retrieve the sequence.
GAACTGCATTGGCCAACGCTTTGCACAAATGGAAGAGAAAACTCTTCTAGCC C CTCATCC
Look at all three frames. Notice that frames 1 and three have many * in them. These represent stop codons. These occur by chance about once every 21 amino acids in non-coding nucleotide sequence. Of course in coding sequence they should not appear except at the end of a coding region. Otherwise they are found in introns, the non-coding regions of genes. Remember that we are looking at ESTs (no introns).
Frame 2 has very few * and they do not break up the sequence. If you look at this frame you will find the heme signature mentioned above that was part of the starting sequence in CYP4V. This unbroken sequence is the P450 CDS (coding sequence) up to the stop codon. The sequence we have now is shown below. Magenta is our starting sequence.
We now have completed the C-terminal of CYP4V.
Facing up to imperfect sequence data
The sequence we have just extended looked like a good sequence up to the end. However, it is always a good idea to check your work. If you take this sequence to the human P450 blast page above and do the blast, we find that the C-terminal we added does not match the human sequence of CYP4V2. In fact it stops at TLLA. This suggests an error in the sequence that has caused a frame shift, probably one base is missing or one extra base is added in the sequence. To check for frameshifts you need to blast the other two frames in the region after TLLA. One of them will probably match the human sequence. If you blast the last two lines of frame three you will find that frame three is the correct frame after TLLA. You would also see this if you did a blastx search of the nucleotide sequence.
>CYP4V2 formerly CYP4AH1 AC012525 Homo sapiens chromosome 4
Length = 525
Score = 143 (50.3 bits), Expect = 1.8e-12, P = 1.8e-12
Identities = 26/37 (70%), Positives = 32/37 (86%)
Query: 19 ILRRFWVDCSQKPEELGLSGELILRPNNGIWGQLKRR 55
ILR FW++ +QK EELGL G+LILRP+NGIW +LKRR
Sbjct: 483 ILRHFWIESNQKREELGLEGQLILRPSNGIWIKLKRR 519
Where is the frameshift?
GAACTGCATTGGCCAACGCTTTGCACAAATGGAAGAGAAAACTCTTCTAGCC C CTCATCC
T L L A L I L
R R F W
The sequence from above is shown with the TLLA and LILRRFW sequences in translation below it. There is an extra C base in a run of four Cs. This sequence was read as four when there were only three Cs here. Now we can say with more confidence that the correct CYP4V sequence is
A similar strategy can be used to extend the sequence upstream to the start codon. You must keep the possibility of frameshifts in your mind while you are doing this. The process is:
- Do a blast search and find an accession that overlaps your known sequence and extends it upstream.
- Get the sequence and translate it in the correct frames based on the frame of the blast match (+ or -, Forward or Reverse).
- Look for the open reading frame (ORF), the one without stop codons in it that extends your known sequence.
- Check your work by blasting the new sequence against human.
- The start codon (almost always Met) should be in about the same place as the start codon in the best human match.
Going after bigger fish, searching genomic DNA
ESTs are very nice to learn on, but now that you are more familiar with BLAST and frameshifts, you are ready for harder problems. Eukaryotic genomes have introns. Some genes have none (CYP8B1 for example), but that is rare, 5-10 introns is more common. Introns make your life difficult, because you have to find them all and find all the intron-exon boundaries.
Genome sequences are being generated world wide at the rate of 200-300 million bases of sequence a day and that is probably a low estimate now. Celera Genomics has a 100 million bases per day capacity. The new J. Craig Venter Institute will be about the same and can increase to 4 times that level. The Broad Institute at MIT is near 60 million a day, probably more by now. The Joint Genome Institute (DOE) is of similar capacity and that does not count the Sanger Center and the genome centers like Kazusa in Japan and other sites in the world. These centers are producing more sequence than they can possibly annotate in any detail. They are relying on automated gene finding programs and automated blast comparisons to annotate these genomes. It will be a moderate success, but it will not be 100% correct. The gene finding programs only get about 60% accuracy. They fuse adjacent genes together. They skip exons. Some exons are very short, as short as eight nucleotides. See the GHE sequence below from several P450 genes from the white rot fungus. Automated programs miss short exons and they fail to detect bad exon boundaries that are probably sequence errors. In short, they do not have expert knowledge of a single protein family. Celera realized this when they did the Drosophila genome and they held two Gene Jamborees, where expert annotators were brought in to work on assembling genes from individual familes. The Riken mouse cDNA project had a similar meeting with about 50 invited annotators.
Figure showing a very short exon GHE from several P450s
Phanerochaete chrysosporium (white rot fungus) Scaffold_388a very similar
to sequences 77 417 112 129
gene model complete all boundaries checked, 16 exons
3583 MISDTFALAISSGLSLFLCLKAFIDYRAGLRSI (2) 3684 ex1
3732 NHSYLPGFRALISSFGILGLFFKEPKRGLWGGRRRFWLRKHLDFEEAGVDIISH (0) 3893 ex2
3954 IAFLPSVSTYLLLADAAAIK (0) 4013 ex3
4069 EVTGHRARFPKPTYKTLRIFGGNVLASEGEEWKRHRKVVGPAFSE (0) 4203 c-helix ex4
4255 HNNRLVWNETVKIVNDLFANVWGSQSEVYVDNVVQSVTLP (0) 4374 ex5
4423 MALYVISIAGFGKRALWQADGNLPPGHKLSFQ (0) 4521 ex6
4576 DALHILGTDLWIKAATPTLLMNWAPTTRIANVKLAFDEVK (0) 4692 ex7
4747 QYMLELIQERRNSEKRDERYDLFSSLLDANDLNEDGNGNVTLTNDELL (1) 4890 ex9
GNIFIFMLA (1) 4973 ex9 split
GHE (0) ex9 split
5087 TTAHTLAFTFGLLALHPDYQETVYQQIKSIVPDNRPP (0) 5197 ex10
MYEEMNSLTECMA (2) ex11
5351 YETLRLFPP (0) 5380 ex12
5436 TATIPKIAAEDTYLVTIDRAGNRVVVPVPCGTALHLNVIALHHN (1) 5564 ex13
5614 PRYWDNPSAFKPERFRGDWPRDAFIPFSTGSRSCIGRR (2) 5730 ex14
5780 FFETESIAILTMILSRYKIELRNDPRFADETYEERWQRVLRVKDGLTPA* 5932 ex15
compare to scaffold 388b not a separate exon
GNIFIFLLAGHE(0) 14307 ex9
14247 TTAHTLAFTFGLLALYPEQQDKLYKHIKHVIPDGRIP (0) 14137 ex10
and scaffold 129
GNIFIFMLA (1) ex9 split
GHE (0) ex9 split
25150 TTAHTLAFTFGLLALHSDYQEKVHQQIKSIMPDNRLP (0) 25260 ex10
and scaffold 12a not a separate exon (exons have been fused)
54432 VKANMTEDAKSRLSEEEMYAEMR (2)
Expert knowledge reveals GHE to be a real exon, by comparison with other P450s. This happens to be in a motif region that is conserved (AGXETT) and easily recognized, but automated gene finding programs are not going to see this. Please note that the intron exon phases are indicated in (). The number 0, 1 or 2 tells the phase of the junction. Exons that join between codons are phase 0, those that join one base into a codon are phase 1 and those that join two bases in are phase 2. You cannot join exons together in frame unless the phase is preserved at both ends of the intron. In the example above, the sequence GNIFIFMLA ends in a phase 1 boundary, while the sequence TTAHTLAFTFGLLA starts with a phase 0 boundary. The two cannot be joined unless there is an exon in between with a phase 1 start and a phase 0 end. That is the GHE exon. This short eight base pair exon is seen in six different genes in the white rot fungus.
Introns begin with GT and end with AG. Some rare introns begin with GC. These bases, and all the bases between them, are cut out when the full length transcript is processed in the nucleus to make the mature mRNA (messenger RNA). Below is an example of a Drosophila P450 gene Cyp4e2, showing the 5 introns in place. Please look at each intron and find the GT and AG pairs. Also notice the phase of each intron as described above.
This figure is the BLAT search output of the UCSC genome browser.
The exons are in blue capital letters. Introns are lower case black letters. The Capitals are all in blocks of three per codon. This allows determining the phase of intron boundaries, since the last capital letter is at the end of a complete codon. If gt follows then the phase is 0. If ngt ( n is any base) follows then the phase is 1. If nngt follows the phase is 2. In this figure intron 1 and intron 5 are phase 1, though intron 5 has the correct boundary off by one codon. Introns 2 and 4 are phase 0. Intron 3 is phase 2. The numbering on the right is nucleotide numbering for Chromosome 2R.
In this module, you will learn how to identify intron-exon boundaries, by comparing two very similar genes, one from human, and the other from rhesus monkey. This will be fairly easy since the genes are so similar. In real life situations, this is not going to be so clear. We have all 57 P450 genes identified from human, but we are starting with none identified from rhesus monkey. The human P450 sequences are linked above.
We are very fortunate to have the UCSC genome browser, which has an alignment of the two primate genomes as of Jan. 2005. By doing a BLAT search (Slightly different than a BLAST search, optimized for near exact matches), with a Human P450, we will get the region of the rhesus genome with the best match to our query sequence. Then we can link to a view of the genomic DNA sequence showing the location of the protein coding exons over the region of our gene. From this figure and our original protein sequence we can identify the GT and AG intron-exon boundaries. The introns can be edited out, and the assembled gene can be translated to give the rhesus monkey protein sequence.
The step by step procedure is now given:
- Link to the Human P450s list and copy your protein sequence.
- Paste your protein sequence in a Word Document for later use. You will be identifying the intron boundaries and phases in this sequence.
- Link to the UCSC bioinformatics server
click on BLAT and then select Rhesus under the genome pull down menu.
- Paste your P450 sequence in the window and click the submit file button. You will see a display like the one below. If you get an internal error message shorten your sequence and repeat. Notice under the identity column the top hit is 93.6%. That is your gene. The others are different P450 sequences. The BLAT search only returns strong matches. Notice there are no hits less than 61%. The gene is on Scaffold 111671 on the minus strand at the nucleotide location given.
The +- signs indicate your query sequence is always treated as Plus Strand and the match is Minus Strand. ++ would mean the hit is in the Plus Strand.
This example was made with the bottom part of CYP2A13. The whole 2A13 sequence caused an error, so I shortened it. I think the error may have happened because this sequence is at the end of a scaffold and the whole sequence is not on the scaffold. Notice START END and QSIZE. This is the amino acid position of your query that matched. QSIZE is the query length. Here the query was 218 aa long and 2-218 matched.
- Click on details on the left side. This gives the image of the gene structure. This gene fragment has 4 exons, but only two are shown in the first image. The top of the figure has your query sequence color-coded to show matches, mismatches and gaps (introns).
The second image shows all four exons.
The bottom of the details page shows the protein translation of each exon. Conservative changes are in green, other changes are in red.
- Go back and click on browser. This takes you to the browser page showing the gene from one end to the other (exons 7-9 in this case). Multiple levels of evidence are given on the page including your query sequence at the top. Genbank RefSeq genes and other curated and non-curated predictions and sequence matches, including ESTs. The wide bars in the image are exons, or predicted exons. The connecting line covers the region from the beginning of your query sequence to the end. The arrows on the connecting line show the orientation of the gene. This gene piece has 4 exons and it is on the minus strand.
- Click on the purple bar. This links you to both the human and rhesus sequences in the region of the browser window. Farther down on this page is an alignment of the two sequences. You can also open the human browser at the equivalent spot to see what is there. This can be very helpful when dealing with gene clusters.
- Go back and examine the sequence showing the exons in blue and the introns in lower case black. It is time to identify the GT and AG boundaries of the introns.
Keep in mind that these figures are drawn with intact codons in Blue. The codons are never split. This means that you can tell the phase if you can identify the GT that is the beginning of the intron. It should be the first GT in the black lower case region. This may be one or two codons away, but usually it is in the first codon in the black. The GT may also be in the Blue region, or the last letter of the Blue region may be G with the T in the black. The rules for determining the intron phase are:
Phase 0. G and T are the first two letters of the black sequence, 1,2 or or they are 4,5 or 7,8. The intron is Phase 0 and it occurs between codons. Phase 0 codons always code for Val, since all valine codons begin with GT.
Phase 1. G and T are at positions 2,3 or 5,6 or 8,9 in the black sequence.
The most common phase one codon is GGT = glycine.
Phase one boundaries break the codon one nucleotide in.
Phase 2. G and T may overlap the blue and black sequence -1,1, or 3,4 or 6,7 in the black sequence. Arginine is often a phase 2 codon AGGT
Phase two boundaries break the codon two nucleotides in.
The other end of the intron will have an AG before the coding sequence begins again.
Phase 0. AG may be the last two letters of the black intron sequence. They may be multiples of three from that boundary. Frequently CAG = gln is at the phase 0 boundary.
Phase 1. The AG can span the black and blue boundary with A in the black and G in the blue. They can also be multiples of three from that location.
Phase 2. AG can be one letter away from the blue boundary as AGX. The AG can also be the first two letters in the blue boundary.
- Copy each exon sequence upto but not including the GT of the intron. Mark each intron boundary on the end of an exon with the correct phase in parentheses (0), (1) or (2). The phase at both ends of the intron has to be the same so you only need to mark one end. Start copying the next exon after the AG boundary. Repeat until you have all exons copied and marked for phase
The CYP2A13 sequence (4 exons, introns removed)
GAGGAGAAGA ACCCCAACAC GGAGTTCTAC TTGAAGAACC TGatgATGAC CACGCTGAAC 14520
CTCTTCattG CAGGCACCGA GACCGTCAGC ACCACCCTGC GCTATGGCTT CCTGCTGCTC 14460
ATGAAGtatC CAGAGGTGGA Gg (1)
CCAAG GTCCATGAGG AGATTGACAG AGTGATCGGC AAGAACCGGC AGCCCAAGTT 13920
TGAGGACCGG gtcAAGATGC CCTACatgGA GGCAGTGATC CATGAGATCC AAAGATTTGG 13860
AGACgtgatc CCCATGagcT TGGCCcgcAG GGTCAACAAG GACACCAAGT TTCGGGATTT 13800
CTTCCTCCCT AAG (0)
GAAGTGTTCC CTATGCTGGG CTCCgtgCTG AGAGACCCCA GGTTCTTCTC CAACCCCCAG 13200
GACttcaatC CCCAGCACTT CTTGGATGAG AAGGGGCAGT TTAAGAAGAG TGACGCTTTT 13140
GTGCCCTTTT CCATCg (1)
GAAAG CGGaacTGTT TCGGAGAAGG CCTGGCCAGA 12120
ATGGAGCTCT TTCTCTTCTT CACCACCATC ATGCAGAACT TCCGCTTCAA GTCCCCCCAG 12060
ttgCCCAAGG ACATCGACGT GTCCCCCAAA CACGTGGGCT TTGCCACGAT CCCAccaAAC 12000
TACACCATGA GCTTCCTGCC CCGC
Paste your newly assembled rhesus monkey protein sequence below the human starting sequence in your word file. Mark the location and pahse of the introns and do a blast search of human to get the sequence alignment as shown below.
Email me these three items.
>human protein CYP2A13 last 4 exons
Alignment of the two sequences 93% identical
Query: 2 EEKNPNTEFYLKNLVMTTLNLFFAGTETVSTTLRYGFLLLMKHPEVEAKVHEEIDRVIGK 61
Sbjct: 1 EEKNPNTEFYLKNLMMTTLNLFIAGTETVSTTLRYGFLLLMKYPEVEAKVHEEIDRVIGK 60
Query: 62 NRQPKFEDRAKMPYTEAVIHEIQRFGDMLPMGLAHRVNKDTKFRDFFLPKGTEVFPMLGS 121
NRQPKFEDR KMPY EAVIHEIQRFGD++PM LA RVNKDTKFRDFFLPKGTEVFPMLGS
Sbjct: 61 NRQPKFEDRVKMPYMEAVIHEIQRFGDVIPMSLARRVNKDTKFRDFFLPKGTEVFPMLGS 120
Query: 122 ELRDPRFFSNPQDCSPQHFLDEKGQFKKSDAFVPFSIGKRYCFGEGLARMELFLFFTTIM 181
LRDPRFFSNPQD +PQHFLDEKGQFKKSDAFVPFSIGKR CFGEGLARMELFLFFTTIM
Sbjct: 121 VLRDPRFFSNPQDFNPQHFLDEKGQFKKSDAFVPFSIGKRNCFGEGLARMELFLFFTTIM 180
Query: 182 QNFRFKSPQSPKDIDVSPKHVGFATIPRNYTMSFLPR 218
QNFRFKSPQ PKDIDVSPKHVGFATIP NYTMSFLPR
Sbjct: 181 QNFRFKSPQLPKDIDVSPKHVGFATIPPNYTMSFLPR 217
Do the gene assemblies as shown above for two Rhesus P450s.
Note: CYP3 and CYP4 genes have 12 and 13 exons. You do not have to do all
exons if you have a gene with 10 or more. CYP2 genes have 9 exons
Please try to do all 9 of these if they are all present. Not all genes will show
In the Genome browser. If an exon is missing just make a note saying an exon
was missing here. Look at your seq at the top of the details page to see
what is missing. There will be black lower case letter between cyan Capital
letter that mark the boudaries of gaps. These are usually exons, but they might
just be short exons.
N. Abdeltawab 2C19 5A1
S. Aggarwal 1A2 3A43
C. Blackwell 26C1 21A2
A. Bolen 2F1 7A1
L. Chen 2G2P 2U1
S. Hill 17 11B1
K. Iyer 26A1 19
S. Jain 26B1 24
N. Liao 2C9 26A1
X. Liu 27B1 11B1
E. Mahrous 11A1 2W1
D. Mao 1A1 3A4
A. Naguib 46 4A11
R. Panakanti 1B1 4B1
Y. Peng 2A6 51
H. Penmatsa 26B1 4F3
M. Puljic 2B6 4A22
S. Sarva 2C8 4V2
Q. Tran 39 27A1
G. Vasser 2D6 4X1
F. Zhang 2E1 27C1
Z. Zhang 5A1 2J2
G. Zhu 2R1 7B1
L. Zhu 11B1 8A1