MSCI814 Module 9.

Comparative Genomics/Human and Amphioxus

David Nelson March 6, 2008

The following image and legend are from JGI Branchiostoma v.1.0

LINKS FOR THIS MODULE

Bioinformatics home page

NCBI sequence viewer for retrieving a GenBank sequence

DNA translator For translating nucleotide sequence to protein

Human P450 Blast server For comparing a sequence against all human P450s

Do-it-yourself WU Blast server For BLASTing a new sequence against your own set of sequences

JGI Amphioxus home page

UCSC genome browser

The human genome has been sequenced by both public and private groups to better than 10X coverage (meaning every base has, on average, been sequenced 10 times or more). The most recent build of the human genome (Build 36.2, Oct. 17, 2006) has 278 contigs, which means there are 278 pieces of continuous ordered sequence, with a few small regions of uncalled bases indicated by runs of NNNNNNN (see the release statistics for Build 36.2). Since humans have 23 pairs of chromosomes (22 autosomes plus X and Y, so 24 distinct chromosomes), we should have 24 contigs if the whole genome were completely done. However, chr 14 is the only human chromosome to be in one contig. The regions around the centromeres are often repetitive and difficult to sequence, so this is one place where many of the human chromosomes are not finished. The 278 contigs less the 24 complete chromosomes leaves 254 gaps in the human genome. This is not bad considering that the publications in Nature and Science in Feb. 2001 had about 120,000 gaps remaining. Recent work has continued closing these gaps: "At present about 184 euchromatic gaps remain in build hg18" (Bovee, D. et al. Nature Genetics 40, 96-101, Dec. 23, 2007). The official gene count is now 20,488 (Clamp M. et al. PNAS 104, 19428-19433, Dec. 4, 2007).
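The gap count above follows from a simple rule: a chromosome assembled into n contigs has n - 1 gaps, so the total number of gaps is the total number of contigs minus the number of chromosomes. The short Python sketch below illustrates the rule. The per-chromosome contig counts in `toy_counts` are made up for illustration and are not taken from Build 36.2; only the totals 278 and 24 come from the text.

```python
def count_gaps(contigs_per_chromosome):
    """Total gaps, given the number of contigs on each chromosome."""
    # A chromosome in n contigs has n - 1 gaps between them.
    return sum(n - 1 for n in contigs_per_chromosome.values())

# Hypothetical counts, for illustration only.
toy_counts = {"chrA": 1, "chrB": 3, "chrC": 5}
toy_gaps = count_gaps(toy_counts)

# The same rule applied to the totals quoted in the text.
build_contigs = 278
build_chromosomes = 22 + 2  # 22 autosomes plus X and Y
build_gaps = build_contigs - build_chromosomes

print("toy example gaps:", toy_gaps)
print("Build 36.2 gaps:", build_gaps)
```

A chromosome in a single contig, like chr 14, contributes no gaps at all.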

From the JGI site: "The genome of Branchiostoma floridae is estimated to be approximately 575 Mb contained in 19 pairs of chromosomes, and is being sequenced to approximately 8.1 X depth. There are a total of 3,032 scaffolds, with a total length of 923 Mb [both haplotypes assembled] composed of 81,073 contigs. Half of the assembly is contained in 174 scaffolds, all at least 1.6 Mb in length. The length-weighted mean contig size (L50) is 26kb."
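The statement that half of the assembly is contained in 174 scaffolds of at least 1.6 Mb describes a statistic that is today usually called N50 (the JGI text uses "L50" for the contig version). As a minimal sketch, assuming you have a plain list of scaffold lengths, it could be computed in Python as below. The function name `half_assembly_stats` and the toy lengths are illustrative only and are not JGI data.

```python
def half_assembly_stats(lengths):
    """Return (min_length, count): the smallest set of longest sequences
    that together hold at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # 'length' is the shortest sequence needed to reach the halfway point.
            return length, count
    raise ValueError("lengths must not be empty")

# Toy scaffold lengths (arbitrary units), not real amphioxus data.
toy_lengths = [10, 8, 6, 4, 2]
min_length, count = half_assembly_stats(toy_lengths)
print(f"half of the assembly is in {count} scaffolds, all at least {min_length} long")
```

For the amphioxus assembly the same calculation over all 3,032 scaffold lengths would, by the JGI description, give 174 scaffolds and about 1.6 Mb.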



This module is designed to look at comparative genomics between two chordate genomes. Amphioxus is the sister group to all vertebrates, so it should be an ideal organism to study to see what was common to all vertebrates before they diverged into hagfish, lampreys, sharks, fish, and tetrapods. Gene clusters are ideal places to look for differences between amphioxus and humans, because they were presumably simpler in amphioxus and became more complex in mammals. It would be very interesting if we could find the ancestor of each human P450 gene cluster and see if it had only one gene or a few genes.

Gene clusters are groups of related genes that arose by tandem duplication. Once a cluster forms, it can rearrange and undergo gene conversion, where two similar genes are paired up and a region of one gene replaces the homologous region in the other gene. This results in two genes with one region that is highly similar and flanking regions that are less similar. There may be just two genes or as many as 15 genes or more in a gene cluster. Gene clusters are also active sites of pseudogene formation, often leaving just spare exons scattered in the cluster (called detritus exons).

For a simple example I will use CYP11B1 and CYP11B2 on human chromosome 8. These genes form a two-gene cluster. Let's BLAT search the chimp genome at the UCSC browser with the human CYP11B2 gene and see what is there.

>CYP11B1 NM_000497
MALRAKAEVCMAVPWLSLQRAQALGTRAARVPRTVLPFEAMPRR
PGNRWLRLLQIWREQGYEDLHLEVHQTFQELGPIFRYDLGGAGMVCVMLPEDVEKLQQ
VDSLHPHRMSLEPWVAYRQHRGHKCGVFLLNGPEWRFNRLRLNPEVLSPNAVQRFLPM
VDAVARDFSQALKKKVLQNARGSLTLDVQPSIFHYTIEASNLALFGERLGLVGHSPSS
ASLNFLHALEVMFKSTVQLMFMPRSLSRWTSPKVWKEHFEAWDCIFQYGDNCIQKIYQ
ELAFSRPQQYTSIVAELLLNAELSPDAIKANSMELTAGSVDTTVFPLLMTLFELARNP
NVQQALRQESLAAAASISEHPQKATTELPLLRAALKETLRLYPVGLFLERVASSDLVL
QNYHIPAGTLVRVFLYSLGRNPALFPRPERYNPQRWLDIRGSGRNFYHVPFGFGMRQC
LGRRLAEAEMLLLLHHVLKHLQVETLTQEDIKMVYSFILRPSMCPLLTFRAIN

>CYP11B2 NM_000498
MALRAKAEVCVAAPWLSLQRARALGTRAARAPRTVLPFEAMPQH
PGNRWLRLLQIWREQGYEHLHLEMHQTFQELGPIFRYNLGGPRMVCVMLPEDVEKLQQ
VDSLHPCRMILEPWVAYRQHRGHKCGVFLLNGPEWRFNRLRLNPDVLSPKAVQRFLPM
VDAVARDFSQALKKKVLQNARGSLTLDVQPSIFHYTIEASNLALFGERLGLVGHSPSS
ASLNFLHALEVMFKSTVQLMFMPRSLSRWISPKVWKEHFEAWDCIFQYGDNCIQKIYQ
ELAFNRPQHYTGIVAELLLKAELSLEAIKANSMELTAGSVDTTAFPLLMTLFELARNP
DVQQILRQESLAAAASISEHPQKATTELPLLRAALKETLRLYPVGLFLERVVSSDLVL
QNYHIPAGTLVQVFLYSLGRNAALFPRPERYNPQRWLDIRGSGRNFHHVPFGFGMRQC
LGRRLAEAEMLLLLHHVLKHFLVETLTQEDIKMVYSFILRPGTSPLLTFRAIN



In our example, a search for 11B1 finds the same results as for 11B2. That is due to the high sequence identity (93%) between these two genes. Looking at the browser window after zooming out, we see two P450 11B genes: 11B1 and 11B2.



But there is something very unusual about these genes. They seem to share the first two exons (on the right side). So what used to be two independent genes appear to have become alternative splice variants of one gene with 7 variable exons and two constant exons. The chimp genome assembly could be wrong here, and there might be some missing sequence that holds the two missing exons of CYP11B1. Humans have about 30 kb between the two genes, while chimps have only 4 kb. If the assembly is correct, this is a highly unusual difference between chimp and human. This could be checked by comparing to rhesus monkey.

This is a zoomed-in view of the chimp CYP11B genes.
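The high identity between CYP11B1 and CYP11B2 can be explored with a simple position-by-position comparison. The Python sketch below counts identical residues between two equal-length sequences with no gaps allowed. It uses only the first 44 residues of each protein as printed in the FASTA entries above, so the percentage it reports is for that fragment only and will differ from the 93% figure quoted for the genes as a whole. The function name `ungapped_identity` is an illustrative choice, and a real comparison would use an alignment program such as BLAST.

```python
def ungapped_identity(seq_a, seq_b):
    """Count identical positions between two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length for an ungapped comparison")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches, len(seq_a)

# First line of each FASTA record above (44 residues each).
CYP11B1_START = "MALRAKAEVCMAVPWLSLQRAQALGTRAARVPRTVLPFEAMPRR"
CYP11B2_START = "MALRAKAEVCVAAPWLSLQRARALGTRAARAPRTVLPFEAMPQH"

matches, length = ungapped_identity(CYP11B1_START, CYP11B2_START)
print(f"{matches}/{length} identical residues = {100 * matches / length:.1f}%")
```

To compare the full proteins, join the remaining FASTA lines of each record into one string before calling the function.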

Among the larger P450 gene clusters, the human CYP2ABFGST cluster is unusual in that it has a mirror symmetry due to an inverted duplication of one half of the cluster. Notice that the 2T genes are on the outside, followed by 2F, 2A, 2G and 2B as you move to the center. The UCSC genome browser has this display of the human CYP2ABFGST cluster. Notice that the genes are shown, but not the pseudogenes. A similar picture can be seen after BLAT searching the chimp genome and adjusting the scale (chimp CYP2ABFGST cluster). Notice how large the 2B6 gene looks compared to the other genes. The gene model is probably joining parts of 2B6 with some pseudogene exons in this region. The 2A gene on the left is also large, and it may contain some exons of the CYP2T2P pseudogene.

Today we are going to look at human-amphioxus P450 gene clusters. We will look at the number of genes in a cluster and the order and orientation of the genes in the cluster. We will be creating a map of these P450 gene clusters. You can look at the human clusters in the paper I gave you comparing human and mouse P450s. Humans have 57 P450 genes, with 58 pseudogenes. Some mouse clusters with multiple genes have only one gene in humans (Figure 5 C and D, the CYP2J cluster).

The human clusters shown in the paper are

2ABFGST 6 genes + 8 pseudogenes
2C 4 genes + 4 pseudogenes
3A 4 genes + 9 pseudogenes
4ABXZ 5 genes + 5 pseudogenes
4F 6 genes + 5 pseudogenes
2D 1 gene + 2 pseudogenes
2J 1 gene + 0 pseudogenes

Updated versions with rat included are linked below.

A map of the mouse, human and rat CYP4ABX gene clusters

A map of the human, mouse and rat CYP2J gene cluster

A map of the human, mouse and rat CYP3A gene cluster

A map of the human, mouse and rat CYP2C gene cluster

A map of the human, mouse and rat CYP2D gene cluster

A map of the human, mouse and rat CYP2ABFGST gene cluster

A map of the human, mouse and rat CYP4F gene cluster

These clusters contain 27 of the 57 human P450 genes and 33 of the 58 pseudogenes. For locations of all genes and pseudogenes on the human chromosomes, see the ideogram links at the human P450 data page.
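The totals of 27 genes and 33 pseudogenes can be checked directly against the list of human clusters given earlier. The Python sketch below just adds up the counts from that list and expresses them as a share of all 57 genes and 58 pseudogenes. The variable names are illustrative.

```python
# (genes, pseudogenes) per human cluster, copied from the list in this module.
clusters = {
    "2ABFGST": (6, 8),
    "2C": (4, 4),
    "3A": (4, 9),
    "4ABXZ": (5, 5),
    "4F": (6, 5),
    "2D": (1, 2),
    "2J": (1, 0),
}

TOTAL_GENES = 57        # all human P450 genes
TOTAL_PSEUDOGENES = 58  # all human P450 pseudogenes

cluster_genes = sum(genes for genes, _ in clusters.values())
cluster_pseudogenes = sum(pseudo for _, pseudo in clusters.values())

print(f"genes in clusters: {cluster_genes} of {TOTAL_GENES} "
      f"({cluster_genes / TOTAL_GENES:.0%})")
print(f"pseudogenes in clusters: {cluster_pseudogenes} of {TOTAL_PSEUDOGENES} "
      f"({cluster_pseudogenes / TOTAL_PSEUDOGENES:.0%})")
```

Roughly half of the human P450 genes and a little over half of the pseudogenes sit in these seven clusters.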

This page also has the human P450 sequences and the pseudogene sequences in a FASTA file.

I tried to find direct 1:1 human:amphioxus orthologous clusters. This approach is successful in fish and I hoped we could extend it to Amphioxus. This did not work. Amphioxus has too many P450s and the synteny (preservation of gene order) is not conserved. I have not yet found any P450s whose neighboring non-P450 genes are preserved as orthologs. This indicates a randomization of the genes in Amphioxus, and we will not be able to identify gene clusters equivalent to those in human. Instead I would like you to try to build up your own Amphioxus P450 gene cluster map.

LINKS FOR THIS MODULE

JGI Amphioxus home page

JGI Amphioxus page on my P450 site

This has links to 480 Amphioxus P450 sequences, sorted by P450 clan and smaller gene groupings of closely related sequences. See the clan 2, 3, 4 and all other sections. We will probably work mostly in the clan 2 section since it has 304 sequences and will contain the most gene clusters.

Bioinformatics home page

Human P450 Blast server For comparing a sequence against all human P450s, including pseudogenes

Do-it-yourself WU Blast server For BLASTing a new sequence against your own set of sequences

UCSC genome browser use BLAT search of human

NCBI sequence viewer for retrieving a GenBank sequence

DNA translator For translating nucleotide sequence to protein



Assignment: scroll through the P450 clan pages looking for gene models with adjacent numbers like those listed below, then try to build a map of the cluster. Use the browser as shown below to get a view of the whole region including your P450 genes. Print the browser page (you might need to print it in landscape orientation to prevent clipping). Label the P450s on the map. You can hand it to me on paper, since labeling it electronically would require a drawing program or PowerPoint and that is not necessary.

New cluster

e_gw.110.123.1 Brafl1/scaffold_110:507223-515410

fgenesh2_pg.scaffold_110000030 Brafl1/scaffold_110:516372-520208

fgenesh2_pg.scaffold_110000031 Brafl1/scaffold_110:520927-527727

e_gw.110.18.1 Brafl1/scaffold_110:528976-533638

fgenesh2_pm.scaffold_110000006 Brafl1/scaffold_110:534879-539286

e_gw.110.103.1 Brafl1/scaffold_110:559940-564804

80000 bp space

fgenesh2_pm.scaffold_110000009 Brafl1/scaffold_110:644987-648996
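When building a map it helps to turn the coordinates into gene sizes and spacings. The Python sketch below parses the model names and positions listed above, sorts them by start position, and reports the distance between neighboring models. The regular expression assumes every line has the form "name Brafl1/scaffold_N:start-end" as in this list, and the function names are illustrative.

```python
import re

# Gene models and positions copied from the cluster list above.
MODELS = """\
e_gw.110.123.1 Brafl1/scaffold_110:507223-515410
fgenesh2_pg.scaffold_110000030 Brafl1/scaffold_110:516372-520208
fgenesh2_pg.scaffold_110000031 Brafl1/scaffold_110:520927-527727
e_gw.110.18.1 Brafl1/scaffold_110:528976-533638
fgenesh2_pm.scaffold_110000006 Brafl1/scaffold_110:534879-539286
e_gw.110.103.1 Brafl1/scaffold_110:559940-564804
fgenesh2_pm.scaffold_110000009 Brafl1/scaffold_110:644987-648996
"""

# Assumed line format: <model name> Brafl1/<scaffold>:<start>-<end>
LOCUS = re.compile(r"^(\S+)\s+Brafl1/(scaffold_\d+):(\d+)-(\d+)$")

def parse_models(text):
    """Return (name, scaffold, start, end) tuples sorted by position."""
    models = []
    for line in text.splitlines():
        match = LOCUS.match(line.strip())
        if match:
            name, scaffold, start, end = match.groups()
            models.append((name, scaffold, int(start), int(end)))
    return sorted(models, key=lambda m: (m[1], m[2]))

def spacings(models):
    """Number of bases between the end of each model and the start of the next."""
    return [nxt[2] - cur[3] - 1 for cur, nxt in zip(models, models[1:])]

models = parse_models(MODELS)
gaps = spacings(models)

for index, (name, scaffold, start, end) in enumerate(models):
    print(f"{name:32s} {start:>7d}-{end:<7d} length {end - start + 1} bp")
    if index < len(gaps):
        print(f"    ... {gaps[index]} bp to next model")
```

The last spacing comes out at a little over 80 kb, which corresponds to the "80000 bp space" noted in the list.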

Tips on how to use the JGI browser.

The search page (tool bar top left) has a "search Models" button at the bottom of the page.

Paste in the gene model name from the sequence collection into the search window and click on search models.

examples: fgenesh2_pg.scaffold_110000031

e_gw.110.103.1

Do not include the |Brafl1 on the end of the name; the search will not work with it.
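If you are copying many names out of the sequence collection, the suffix can be stripped automatically. The Python sketch below assumes each name appears as a FASTA-style header such as ">fgenesh2_pg.scaffold_110000031|Brafl1". The leading ">" and the helper name `clean_model_name` are assumptions for illustration.

```python
def clean_model_name(header):
    """Return the bare gene model name, ready to paste into the JGI search box."""
    # Drop surrounding whitespace and a leading FASTA '>' if present.
    name = header.strip().lstrip(">")
    # Keep only the part before the first '|' (removes '|Brafl1').
    return name.split("|", 1)[0].strip()

examples = [
    ">fgenesh2_pg.scaffold_110000031|Brafl1",
    "e_gw.110.103.1|Brafl1",
    "e_gw.110.18.1",
]
for header in examples:
    print(clean_model_name(header))
```

Names that already lack the suffix pass through unchanged.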



The result will look like this.

Click on the Model Id to go to the gene page.



To open the genome browser, click on "To genome Browser" near the bottom.



Once at the browser window, click on zoom -10X to zoom out 10-fold and see the gene neighborhood. This will allow you to find the other gene models. Mousing over the gene structures will allow you to go to the gene page for each model.

On the gene page you can get the following information near the top of the page.



This gives the model name and the nucleotide position of the model on the scaffold. You may want to have this to help you make your map.