MSCI814 Module 9.

Comparative Genomics/Human and Chimpanzee

David Nelson April 6, 2006

LINKS FOR THIS MODULE

Bioinformatics home page

NCBI sequence viewer for retrieving a GenBank sequence

DNA translator For translating nucleotide sequence to protein

Human P450 Blast server For comparing a sequence against all human P450s, including pseudogenes

Do-it-yourself WU Blast server For BLASTing a new sequence against your own set of sequences

UCSC genome browser use BLAT search of chimp

62 chimp-human P450 sequence alignments

The human genome has been sequenced by both public and private groups to better than 10X coverage (meaning every base on average has been sequenced 10 times or more). The most recent build of the human genome (Build 36.1, March 2006) has 280 contigs, which means there are 280 pieces of continuous ordered sequence, with a few small regions of uncalled bases indicated by runs of NNNNNNN (release statistics for Build 36.1). Since humans have 23 pairs of chromosomes (22 autosomes plus X and Y), we would have 24 contigs if the whole genome were completely done. However, chr 14 is the only human chromosome to be in one contig. The regions around the centromeres are often repetitive and difficult to sequence, so this is one place where many of the human chromosomes are not finished. The 280 contigs less the 24 complete chromosomes leaves 256 gaps in the human genome. This is not bad considering that the publications in Nature and Science in Feb. 2001 had about 120,000 gaps remaining. The official gene count is now 28,976.

On Dec. 10, 2003, the Chimpanzee genome was released. The chimp genome has been sequenced to 4X coverage, which leaves a lot of uncovered regions, but the sequence could be aligned to the human genome. This meant that almost every piece of chimp sequence could be correctly placed and oriented by using human as a guide. The UCSC Genome browser has the aligned sequence available so it can be viewed in alignment with the human sequence. The publication of the chimp genome occurred in the Sept. 1, 2005 issue of Nature.

These are a few interesting facts about the chimp and human genomes.

After 6 million years of divergence, there are about 35 million DNA base pair differences between the shared portions of the two genomes.

Another 5 million sites differ because of an insertion or deletion in one of the lineages. The larger indels are rare, one of a kind events that can be used to determine evolutionary relationships among species.

3 million of the differences may lie in protein-coding genes or other functional areas of the genome.

More than 50 genes present in the human genome are missing or partially deleted from the chimp genome.

The average protein sequence between humans and chimps differs at only two amino acids. This is unlikely to change the function of an enzyme, transporter or a structural protein. Even so, chimps and humans do not look alike or behave alike. There are key differences at the DNA level that are responsible for the outward differences between the two species. Keep in mind that humans differ from each other at 0.1% of their DNA sequence (3 million base pairs), mostly in the form of SNPs.

One item not addressed in this article was divergence of pseudogenes between the two genomes. Pseudogenes are defective genes that have been broken by frameshifts, stop codons, loss of exons or insertions of large pieces of DNA to separate the ends of the gene. Sometimes chromosome rearrangements can occur in the middle of a functional gene causing it to be disrupted. In humans this sometimes contributes to cancer by breaking a tumor suppressor gene.

Most pseudogenes are free to change because no selection is keeping them constant. However, some pseudogenes have been shown to regulate other genes. Knockout of mouse Makorin1-p1 is lethal. Therefore, when a pseudogene is conserved over millions of years, one has to wonder whether it might have a regulatory role.

Some pseudogenes are not whole genes, but just one or a few exons that were duplicated. These are often close to the original gene or even inside it. See figure 2B in the paper (Pharmacogenetics 14, 1-18 2004) for examples. An updated version of this CYP2ABFGST gene cluster including rat is given here. Each dot represents an exon. The full length genes have 9 exons. The pseudogenes often have fewer than that, but a pseudogene can be full length (see Figure 2A human 2G2P, 2B7P1, 2T2P). The open circle in mouse 2s1 is an extra exon between exons 3 and 4. The arrows below each gene/pseudogene show the orientation of the gene. The scale on the bottom is in millions of base pairs. These gene clusters are about a half million to a million base pairs long.

This module is designed to look at comparative genomics between two very similar genomes. Because the genes are so similar, it does not make much sense to look at one gene at a time: the average ortholog pair has only two amino acid differences. Gene clusters are ideal places to look for differences between chimps and humans, because they have some long range order with many similar genes and some pseudogenes in blocks of 500kb to 1000kb. Gene clusters are groups of related genes that arose by tandem duplication. Once a cluster forms, it can rearrange and undergo gene conversion, where two similar genes are paired up and one of them replaces the homologous region in the other gene. This results in two genes with one region that is highly similar and flanking regions that are less similar. There may be just two genes or as many as 15 genes or more in a gene cluster. These are also active sites of pseudogene formation, often just spare exons scattered in the gene cluster (called detritus exons).
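The gene conversion signature described above (one highly similar region with less similar flanks) can be looked for with a sliding-window identity profile. The following is a minimal sketch, not part of the class procedure. The function name `windowed_identity` and the two toy sequences are invented for illustration, and the sequences are assumed to be already aligned without gaps.

```python
def windowed_identity(seq_a, seq_b, window=10, step=5):
    """Return (window start, percent identity) pairs for an ungapped alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to equal length")
    results = []
    for start in range(0, len(seq_a) - window + 1, step):
        a = seq_a[start:start + window]
        b = seq_b[start:start + window]
        matches = sum(1 for x, y in zip(a, b) if x == y)
        results.append((start, 100.0 * matches / window))
    return results


# Toy sequences (invented): the middle 10 bases are identical,
# while the flanking blocks differ at several positions.
gene_1 = "ACGTACGTAC" + "GGGCCCAAAT" + "TTGACCGTAA"
gene_2 = "ACCTACTTAG" + "GGGCCCAAAT" + "TAGACGGTTA"

profile = windowed_identity(gene_1, gene_2, window=10, step=10)
for start, pct in profile:
    print(start, pct)
```

A window whose identity stands well above its neighbors is a candidate converted region. Real genes would need a proper alignment first and much larger windows.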

For a simple example I will use CYP11B1 and CYP11B2 on human chromosome 8. These genes form a two gene cluster (not in the paper).

>CYP11B1 NM_000497
MALRAKAEVCMAVPWLSLQRAQALGTRAARVPRTVLPFEAMPRR
PGNRWLRLLQIWREQGYEDLHLEVHQTFQELGPIFRYDLGGAGMVCVMLPEDVEKLQQ
VDSLHPHRMSLEPWVAYRQHRGHKCGVFLLNGPEWRFNRLRLNPEVLSPNAVQRFLPM
VDAVARDFSQALKKKVLQNARGSLTLDVQPSIFHYTIEASNLALFGERLGLVGHSPSS
ASLNFLHALEVMFKSTVQLMFMPRSLSRWTSPKVWKEHFEAWDCIFQYGDNCIQKIYQ
ELAFSRPQQYTSIVAELLLNAELSPDAIKANSMELTAGSVDTTVFPLLMTLFELARNP
NVQQALRQESLAAAASISEHPQKATTELPLLRAALKETLRLYPVGLFLERVASSDLVL
QNYHIPAGTLVRVFLYSLGRNPALFPRPERYNPQRWLDIRGSGRNFYHVPFGFGMRQC
LGRRLAEAEMLLLLHHVLKHLQVETLTQEDIKMVYSFILRPSMCPLLTFRAIN

>CYP11B2 NM_000498
MALRAKAEVCVAAPWLSLQRARALGTRAARAPRTVLPFEAMPQH
PGNRWLRLLQIWREQGYEHLHLEMHQTFQELGPIFRYNLGGPRMVCVMLPEDVEKLQQ
VDSLHPCRMILEPWVAYRQHRGHKCGVFLLNGPEWRFNRLRLNPDVLSPKAVQRFLPM
VDAVARDFSQALKKKVLQNARGSLTLDVQPSIFHYTIEASNLALFGERLGLVGHSPSS
ASLNFLHALEVMFKSTVQLMFMPRSLSRWISPKVWKEHFEAWDCIFQYGDNCIQKIYQ
ELAFNRPQHYTGIVAELLLKAELSLEAIKANSMELTAGSVDTTAFPLLMTLFELARNP
DVQQILRQESLAAAASISEHPQKATTELPLLRAALKETLRLYPVGLFLERVVSSDLVL
QNYHIPAGTLVQVFLYSLGRNAALFPRPERYNPQRWLDIRGSGRNFHHVPFGFGMRQC
LGRRLAEAEMLLLLHHVLKHFLVETLTQEDIKMVYSFILRPGTSPLLTFRAIN
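To see how similar these two proteins are, you can count matching positions directly. The sketch below compares only the first line (44 residues) of each record above. The helper name `identity_stats` is an assumption for illustration, and the comparison is position by position with no gaps, which only works when both sequences are the same length.

```python
def identity_stats(seq_a, seq_b):
    """Return (number of matches, 1-based mismatch positions) for equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("ungapped comparison needs equal-length sequences")
    mismatches = [i + 1 for i, (x, y) in enumerate(zip(seq_a, seq_b)) if x != y]
    matches = len(seq_a) - len(mismatches)
    return matches, mismatches


# First line of each FASTA record above (44 residues each).
cyp11b1_head = "MALRAKAEVCMAVPWLSLQRAQALGTRAARVPRTVLPFEAMPRR"
cyp11b2_head = "MALRAKAEVCVAAPWLSLQRARALGTRAARAPRTVLPFEAMPQH"

matches, mismatches = identity_stats(cyp11b1_head, cyp11b2_head)
percent = 100.0 * matches / len(cyp11b1_head)
print(matches, mismatches, round(percent, 1))
```

The same function can be run on the full sequences after joining the lines of each record, provided the two joined sequences turn out to be the same length.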

If we do a blat search with CYP11B2 the results page looks like this.



Pay attention to the way numbering is reported. In the first line the orientation is +-, that means the search sequence was plus orientation and the match was minus orientation. Notice that in minus orientation the numbering has the end of the protein first and the start is second. That will be reversed if the orientation is ++.

Click on the details link. We see the search seq at the top with differences in black lower case. Below that is the genomic sequence with exons in blue and mismatches in lower case black. Only part of this is shown here.



Farther down on the details page is the side by side alignment of each exon.

From this you can get the start and stop nucleotide numbering information for each exon of the coding sequence (start Methionine, exon 1 = 147279934, the last amino acid in the last exon = 147266643). We also need to know the chimp chromosome # and the gene orientation. This is at the top of the details page. [Chimp.chr7 (reverse strand):]. It is also on the first results page.
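Because a reverse-strand match reports the start of the protein at the higher coordinate, it helps to normalize the two numbers before recording them. This is a small sketch using the two coordinates quoted above. The function name `genomic_span` is an assumption, and inferring the strand from the order of the two positions is a shortcut; the strand shown on the details page is the authoritative source.

```python
def genomic_span(first_residue_pos, last_residue_pos):
    """Return (low coordinate, high coordinate, strand, length in bp) for a match.

    first_residue_pos: genomic position of the start Methionine (exon 1).
    last_residue_pos:  genomic position of the last amino acid in the last exon.
    """
    strand = "+" if first_residue_pos <= last_residue_pos else "-"
    low = min(first_residue_pos, last_residue_pos)
    high = max(first_residue_pos, last_residue_pos)
    return low, high, strand, high - low + 1


# Coordinates from the CYP11B2 example on chimp chr7 (reverse strand).
low, high, strand, length = genomic_span(147279934, 147266643)
print(f"chr7:{low}-{high} strand {strand}, {length} bp from start to stop")
```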

In our example, a search for 11B1 finds the same results as for 11B2. That is due to the high sequence identity (93%) between these two genes. Looking at the browser window after zooming out, we see two P450 11B genes: 11B1 and 11B2.



But there is something very unusual about these genes. They seem to share the first two exons (on the right side). So what used to be two independent genes have become alternative splice variants of one gene with 7 variable exons and two constant exons. The chimp genome assembly could be wrong here and there might be some missing sequence that holds the two missing exons of CYP11B1. Humans have about 30kb between the two genes; chimps have only 4kb. If the assembly is correct, this is a highly unusual difference between chimp and human. Look for odd stuff like this.

Among the larger P450 gene clusters, the human CYP2ABFGST cluster is unusual in that it has a mirror symmetry due to an inverted duplication of one half of the cluster. Notice that 2T genes are on the outside, followed by 2F, 2A, 2G and 2B as you move to the center. The UCSC genome browser has this display of the human CYP2ABFGST cluster. Notice that the genes are shown, but not the pseudogenes. A similar picture can be seen after BLAT searching the chimp genome and adjusting the scale (chimp CYP2ABFGST cluster). Notice how large the 2B6 gene looks compared to the other genes. The browser is probably joining parts of 2B6 with some pseudogene exons in this region. The 2A gene on the left is also large and it may contain some exons of the CYP2T2P pseudogene.

Today we are going to look at Human-Chimp P450 gene clusters and look for changes in the number of genes in a cluster, the order and orientation of genes in the cluster and pseudogenes in the cluster. We will be creating a map of six gene clusters of P450 genes. These are in the paper I gave you comparing human and mouse P450s. Humans have 57 P450 genes, with 58 pseudogenes. Some mouse clusters with multiple genes have only one gene in humans (Figure 5 C and D, the CYP2J cluster). A Blat search of chimp shows that chimp also has only one CYP2J gene.

The clusters shown in the paper are

2ABFGST 6 genes + 8 pseudogenes
2C 4 genes + 4 pseudogenes
3A 4 genes + 9 pseudogenes
4ABXZ 5 genes + 5 pseudogenes
4F 6 genes + 5 pseudogenes
2D 1 gene + 2 pseudogenes
2J 1 gene + 0 pseudogenes

updated versions with rat included are linked below

A map of the mouse, human and rat CYP4ABX gene clusters

A map of the human, mouse and rat CYP2J gene cluster

A map of the human, mouse and rat CYP3A gene cluster

A map of the human, mouse and rat CYP2C gene cluster

A map of the human, mouse and rat CYP2D gene cluster

A map of the human, mouse and rat CYP2ABFGST gene cluster

A map of the human, mouse and rat CYP4F gene cluster

These clusters contain 27 of the 57 human P450 genes and 33 of 58 pseudogenes. For locations of all genes and pseudogenes on the human chromosomes see the ideogram links at the human P450 data page
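The totals quoted above can be checked against the cluster table with a few lines of arithmetic. This is only a sketch; the dictionary simply restates the gene and pseudogene counts listed for each cluster, and the variable names are arbitrary.

```python
# (genes, pseudogenes) per cluster, copied from the list of clusters in the paper.
clusters = {
    "2ABFGST": (6, 8),
    "2C": (4, 4),
    "3A": (4, 9),
    "4ABXZ": (5, 5),
    "4F": (6, 5),
    "2D": (1, 2),
    "2J": (1, 0),
}

total_genes = sum(genes for genes, _ in clusters.values())
total_pseudogenes = sum(pseudo for _, pseudo in clusters.values())

# Whole-genome counts quoted in the text: 57 genes and 58 pseudogenes.
print(f"{total_genes} of 57 genes are in these clusters")
print(f"{total_pseudogenes} of 58 pseudogenes are in these clusters")
```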

This page also has the human P450 sequences and the pseudogene sequences in a FASTA file. You will need to get these for blat searching the chimp genome at the UCSC browser.
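If you would rather pull individual sequences out of the FASTA file with a script than by hand, a minimal parser looks like the sketch below. The function name `parse_fasta` is an assumption, and the example text is a shortened excerpt of the two CYP11B records shown earlier (the second line of CYP11B1 is truncated on purpose), not the real file.

```python
def parse_fasta(lines):
    """Return {record name: sequence} from an iterable of FASTA-formatted lines."""
    parts = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # The record name is the first word after ">"
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {record: "".join(chunks) for record, chunks in parts.items()}


# Shortened excerpt for illustration only.
example = """>CYP11B1 NM_000497
MALRAKAEVCMAVPWLSLQRAQALGTRAARVPRTVLPFEAMPRR
PGNRWLRLLQ

>CYP11B2 NM_000498
MALRAKAEVCVAAPWLSLQRARALGTRAARAPRTVLPFEAMPQH
"""

seqs = parse_fasta(example.splitlines())
for record, sequence in seqs.items():
    print(record, len(sequence))
```

To read a saved file instead, pass an open file handle, for example `parse_fasta(open("human_p450.fasta"))`, where the file name is a placeholder for wherever you saved the download.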

What we need to do is search the chimp genome at the UCSC browser with one of the P450 sequences from each human P450 cluster. They will have nearly exact matches, though the genes will be broken into exons. Go to the browser window. Once the region of the chimp genome is identified, zoom out 10X (probably twice) to get the region around the gene. This should look like the genome browser figures I showed you above. You can shorten the search process by using parts of the two outside sequences from the gene cluster. For example use the bottom of CYP2S1 and the bottom of CYP2T2P to get matches on both ends of the cluster at once.

After you have the cluster in view, we need to break it down into manageable size pieces to search for P450 exons. We will not assemble any genes today. We are just trying to map the clusters. At the top of the browser window, there is a toolbar that has DNA on it. Clicking on DNA will take you to a window that allows recovery of any DNA sequence. We would like to get 100,000bp pieces, each starting just after a multiple of 100,000 (that is, at a coordinate ending in 00,001). [Example: the CYP2ABFGST cluster in chimp is on chr 20 from 43,000,000 to 43,400,000 bp (chimp CYP2ABFGST cluster).]



I have entered chr20:43,000,001-43,100,000 into the window. Clicking get DNA will retrieve 100,000bp of seq at the left edge of the gene cluster. The next 100,000 would be chr20:43,100,001-43,200,000. Use commas but don't leave any spaces. Once you have the DNA, copy it and paste it into the P450 blast server window and select blastx as the Program option. Human is the default database. Search the human P450s to get matches. Each exon should be found. The result for this segment is shown here.
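The window strings can also be generated rather than typed. The sketch below tiles the chimp CYP2ABFGST region from the example into 100,000bp pieces in the comma format used above. The function name `tile_region` is an assumption; it only builds the text to paste into the browser and does not contact the UCSC site.

```python
def tile_region(chrom, first_base, last_base, size=100_000):
    """Return browser-style position strings covering first_base..last_base."""
    windows = []
    pos = first_base
    while pos <= last_base:
        stop = min(pos + size - 1, last_base)
        # The ":," format spec inserts thousands separators; no spaces are added.
        windows.append(f"{chrom}:{pos:,}-{stop:,}")
        pos = stop + 1
    return windows


windows = tile_region("chr20", 43_000_001, 43_400_000)
for w in windows:
    print(w)
```

Passing `size=50_000` gives the 50kb pieces used for some of the regions.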



There are hits around 8kb, 17kb, 31kb, 65kb, 68kb, and 95kb. The hit around 17 kb looks pretty weak and may be a false positive.

Copy the text results to a Word document (only the top hits, not the whole file). At the top of this document, note which gene cluster you are working on, which chromosome, and what nucleotide range, for example: CYP2ABFGST cluster chr20:43100001-43200000. This way we only need to add 43,100,000 to your numbers to get the right location in the chromosome. You can identify the hits by blasting the chimp amino acid sequence against the human P450s in blastp mode. Email me your result. Come up to the front and mark your exons on the white board.
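The offset arithmetic described above (add the window's starting offset to each hit position) can be sketched as follows. The function names are assumptions, and the hit position of 8,000 is a made-up value for illustration, not a result from the BLAST server.

```python
def parse_window(label):
    """Split 'chr20:43100001-43200000' (commas optional) into (chrom, start, end)."""
    chrom, span = label.split(":")
    start_text, end_text = span.replace(",", "").split("-")
    return chrom, int(start_text), int(end_text)


def to_chromosome_position(label, position_in_window):
    """Convert a 1-based position within a retrieved piece to a chromosome position."""
    chrom, start, _end = parse_window(label)
    offset = start - 1  # e.g. 43,100,000 for a piece starting at 43,100,001
    return chrom, offset + position_in_window


# Hypothetical hit at base 8,000 of the piece named in the example header.
chrom, position = to_chromosome_position("chr20:43100001-43200000", 8000)
print(chrom, position)
```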

To make this process efficient I have broken the 6 clusters into 100 kb regions to work on. These are shown below. Some are 50 kb regions.

1. CYP2ABFGST chr20:42,900,001-43,000,000 in the 2T2P region

2. CYP2ABFGST chr20:43,000,001-43,100,000 in 2F1P to 2A18PC region

3. CYP2ABFGST chr20:43,100,001-43,200,000 in 2B7P1 to 2B6 region

4. CYP2ABFGST chr20:43,200,001-43,250,000 in the 2B6 to 2G2P region

5. CYP2ABFGST chr20:43,250,001-43,300,000 in the 2G2P to 2A13 region (blast mouse)

6. CYP2ABFGST chr20:43,300,001-43,400,000 in the CYP2S, 2F1 region

7. CYP2C cluster chr8:98,100,001-98,200,000 2C19 region

8. CYP2C cluster chr8:98,200,001-98,300,000 2C19/2C58P region

9. CYP2C cluster chr8:98,300,001-98,400,000 2C9 region

10. CYP2C cluster chr8_random:47,100,001-47,200,000 2C8 region

11. CYP2C cluster chr8_random:47,200,001-47,300,000 2C8 region

12. CYP2D cluster chr23:41,100,001-41,200,000 2D6 region, whole cluster

13. CYP3A cluster chr6:100,600,001-100,700,000 3A5/ CYP3A5-de1b2b region

14. CYP3A cluster chr6:100,700,001-100,750,000 3A7 region

15. CYP3A cluster chr6:100,750,001-100,800,000 3A4 region

16. CYP3A cluster chr6:100,800,001-100,900,000 3A4/3A43 region

17. CYP4ABXZ cluster chr1_random:66,500,001-66,600,000 4B region

18. CYP4ABXZ cluster chr1:45,200,001-45,300,000 4A11 region

19. CYP4ABXZ cluster chr1:45,300,001-45,400,000 4X1 and pseudogene region

20. CYP4ABXZ cluster chr1:45,400,001-45,455,000 4Z1 region

21. CYP4ABXZ cluster chr1:45,450,001-45,510,000 4A22 region (disregard 1-5000 region = 4Z1)

22. CYP4F cluster chr20:16,200,001-16,300,000 4F22/4F23P/4F8 region (complex)

23. CYP4F cluster chr20:16,300,001-16,400,000 4F8 to 4F12 region (complex)

24. CYP4F cluster chr20:16,400,001-16,500,000 4F24P region

25. CYP4F cluster chr20:16,500,001-16,600,000 4F2/4F11 region

26. CYP4F cluster chr20:16,600,001-16,700,000 4F9P region

I will map out the structure of the gene clusters "live" in class with this information.

Compare your chimp gene cluster with the orthologous one in the paper.

I do not know what to expect for the pseudogenes. This will be new for me as well as you. Chimp and human have been separated for about 6 million years, so they may be pretty different in the pseudogenes.

Some things of interest to look for:

The CYP27C1 gene is missing in rodents due to a chromosome rearrangement that deleted the CYP27C1 gene. Is it in chimps? Yes.

The CYP4Z1 gene is not assembled very well in human. Can you find a full length CYP4Z1 in chimp? This gene is absent in mice and may be primate specific or human specific.

Humans have several pseudogenes that are nearly intact and they may be real genes in chimps, not pseudogenes. CYP2T2P, CYP2T3P, CYP2G2P, CYP2AB1P, CYP2AC1P

To help keep this organized, please come to the front and select a numbered region to work on. Mark it on the white board. That way duplication of effort will be small. Get the human sequences at this link