2.2.7 Selected Per-Organism Statistics
The following tables give the number of entries and bases of DNA/RNA for the most-sequenced organisms in Release 121.0, Release 139.0, Release 151.0, Release 163.0, Release 175 and Release 187 (Dec. 15, 2011). Species shown in red were not in the top group in the release beneath it.
These sequences are in several sections, and it is important to know about them so you can find what you are looking for at Genbank. This table shows the breakdown of the four main sequence divisions in Genbank in Jan 2010. For 2004 data see this older table. The four divisions are nr (non-redundant), est (expressed sequence tags), htgs (high throughput genomic sequences) and gss (genome survey sequences). STS (sequence tagged sites) is a small section used in mapping. PAT is a patent section. We will talk about each of them in more detail.
Nr is the default section of the database that you will search when you do a BLAST search, unless you select another section. It is a good place to start. It contains all the genes that have been sequenced by individuals over the years, both cDNA and gene sequences. Genome project sequences move into nr as they are finished. Unless a submitter is sending in a batch submission of large numbers of sequences for the other sections of the database, their sequence goes here. Note that as of Dec. 1, 2011 nr has 34.8 billion bases, or 25% of the total Genbank bases. If you only search here for a match to your sequence, you are neglecting 75% of Genbank data.
The only new species this year compared to 2009 is Solanum lycopersicum (tomato). The tomato genome has been sequenced and a paper on the genome is currently in review at Nature.

Release 175 Dec. 15, 2009
 Entries        Bases  Species
13159466  13908099312  Homo sapiens (human)
 7860659   8832081330  Mus musculus (mouse)
 1998941   6279057623  Rattus norvegicus (rat)
 2178209   5348086238  Bos taurus (cow)
 3889116   5030646878  Zea mays (corn, maize)
 2944910   4583294332  Sus scrofa (pig)
 1696309   3058371049  Danio rerio (zebrafish)
  228206   1352872121  Strongylocentrotus purpuratus (urchin)
 1228296   1191090309  Oryza sativa Japonica Group (rice)
 1753617   1185599468  Nicotiana tabacum (tobacco)
 1423562   1146893384  Xenopus (Silurana) tropicalis (frog)
 1204509   1042164589  Drosophila melanogaster (fruitfly)
  213485    999212659  Pan troglodytes (chimp)
 2283144    991227695  Arabidopsis thaliana (thale cress)
 1434990    931947188  Canis lupus familiaris (dog)
  655914    911045354  Vitis vinifera (grapevine)
  807724    885496928  Gallus gallus (chicken)
 1839992    864328945  Glycine max (soybean)
   79135    812636075  Macaca mulatta (rhesus monkey)
 1118338    793077812  uncultured bacterium
 1216178    748172925  Ciona intestinalis (sea squirt)
 1139607    697955222  Triticum aestivum (wheat)
 1234458    673633441  Oryzias latipes (Medaka, ricefish)

Release 163.0 Dec. 15, 2007
 Entries        Bases  Species
11279087  12854169426  Homo sapiens
 7228215   8323105071  Mus musculus
 1305109   5778218766  Rattus norvegicus
 2880831   4205748958  Zea mays
 2057685   3845789573  Bos taurus
 1584337   2832324577  Danio rerio
 2121316   2304579871  Sus scrofa
  227854   1352499678  Strongylocentrotus purpuratus (sea urchin)
 1418414   1136849350  Xenopus tropicalis
 1130666   1005948286  Oryza sativa Japonica Group
  212624    974669347  Pan troglodytes
  738789    900252577  Drosophila melanogaster
 1951847    881157707  Arabidopsis thaliana
  802672    864709966  Gallus gallus
  530099    838165327  Vitis vinifera (grapevine)
   76348    784577143  Macaca mulatta
 1221140    700538475  Canis lupus familiaris
 1011784    660118364  Sorghum bicolor
 1103658    656437787  Triticum aestivum
  422523    536075160  Medicago truncatula

Release 151.0 Dec. 15, 2005 (chloroplast and mitochondrial sequences not included, and Whole Genome Shotgun sequences not included)
 Entries        Bases  Species
 9899176  11953879540  Homo sapiens (human)
 6625240   7917536708  Mus musculus (mouse)
 1035223   5682651636  Rattus norvegicus (rat)
 1356047   2424040780  Bos taurus (cattle)
  881280   2215171371  Danio rerio (zebrafish)
 2678230   1780527217  Zea mays (maize)
  357923   1171720224  Oryza sativa (rice, japonica cultivar-group)
 1179107    954916818  Xenopus tropicalis (frog, diploid)
 1409440    805877955  Canis familiaris (dog)
  534811    785832602  Drosophila melanogaster (fruit fly)
 1120673    785475842  Sus scrofa (pig)
  787971    684548519  Gallus gallus (chicken)
  981930    648235477  Arabidopsis thaliana (thale cress)
  207825    533669351  Pan troglodytes (chimp)
  784734    463997658  Sorghum bicolor (Sorghum)
   68350    434454818  Macaca mulatta (Rhesus monkey)
  695995    420688666  Ciona intestinalis (sea squirt)
  387953    420276295  Medicago truncatula (barrel medic, plant)
  596255    404132383  Brassica oleracea (cabbage, cauliflower, kale)
  612712    349010415  Triticum aestivum (wheat)

Release 139.0 December 15, 2003 (Whole Genome Shotgun sequences not included)
 Entries        Bases  Species
 7152768  10307972332  Homo sapiens (human)
 5440187   6293834876  Mus musculus (mouse)
  843877   5553994470  Rattus norvegicus (rat)
  569066   1186984387  Danio rerio (zebrafish)
 1700546   1060366537  Zea mays (maize)
  266693    714394030  Oryza sativa (rice, japonica cultivar-group)
  361257    697671246  Drosophila melanogaster (fruit fly)
  887467    508114342  Canis familiaris (dog)
  596909    503066134  Gallus gallus (chicken)
  589739    424174855  Arabidopsis thaliana (thale cress)
  595964    403835636  Brassica oleracea (cabbage, cauliflower, kale)
  650813    396135104  Bos taurus (cattle)
  175582    385807223  Pan troglodytes (chimpanzee)
   25432    337349532  Macaca mulatta (Rhesus monkey)
  553918    303123150  Triticum aestivum (wheat)
  499249    294204103  Ciona intestinalis (sea squirt, tunicate)
  250141    237196637  Medicago truncatula (barrel medic, plant)
  222595    233083018  Caenorhabditis elegans (nematode worm)
  364210    232049749  Xenopus laevis (African clawed frog, tetraploid)
  324322    211144144  Zea mays subsp. mays (another maize subspecies)
  297857    210279451  Silurana tropicalis (another frog)

Release 121.0 December 15, 2000
 Entries        Bases  Species
 3918724   6702881570  Homo sapiens (human)
 2456194   1291602139  Mus musculus (mouse)
  166554    487561384  Drosophila melanogaster (fruit fly)
  181388    242674129  Arabidopsis thaliana (thale cress)
  114553    203544197  Caenorhabditis elegans (nematode worm)
  188993    165539271  Tetraodon nigroviridis (freshwater pufferfish)
  151411    125948974  Oryza sativa (rice)
  218598    106344366  Rattus norvegicus (rat)
  159473     71215626  Bos taurus (cattle)
  141802     62817102  Glycine max (soybeans)
  104535     50991920  Medicago truncatula (barrel medic, plant)
   91334     49855996  Trypanosoma brucei (protozoan)
   97112     49415566  Lycopersicon esculentum (tomato)
   54328     47639714  Giardia intestinalis (hikers Diarrhea protist)
   77532     47590936  Strongylocentrotus purpuratus (sea urchin)
   49938     44522016  Entamoeba histolytica (parasite)
   57779     44489692  Hordeum vulgare (barley)
   83726     40906902  Danio rerio (zebrafish)
   77506     36885212  Zea mays (maize)
   18361     32779082  Saccharomyces cerevisiae (bakers yeast)
Notice that mouse and human have over 13 million ESTs between them. Note the red line is Zea mays (maize). We will use these ESTs in our assignment today.
dbEST release 123109 Summary by Organism - December 31, 2009
Number of public entries: 64,526,769
Homo sapiens (human)  8,296,280
Mus musculus + domesticus (mouse)  4,852,144
Zea mays (maize)  2,018,798
Bos taurus (cattle)  1,558,493
Sus scrofa (pig)  1,538,441
Arabidopsis thaliana (thale cress)  1,527,298
Danio rerio (zebrafish)  1,481,930
Glycine max (soybean)  1,422,982
Xenopus (Silurana) tropicalis (western clawed frog)  1,271,375
Oryza sativa (rice)  1,249,110
Ciona intestinalis  1,205,674
Triticum aestivum (wheat)  1,067,291
Rattus norvegicus + sp. (rat)  1,009,817
Drosophila melanogaster (fruit fly)  821,005
Xenopus laevis (African clawed frog)  677,806
Oryzias latipes (Japanese medaka)  665,382
Brassica napus (oilseed rape)  643,601
Gallus gallus (chicken)  600,323
Hordeum vulgare + subsp. vulgare (barley)  501,614
Salmo salar (Atlantic salmon)  494,392
Panicum virgatum (switchgrass)  442,269
Phaseolus coccineus  391,138
Canis lupus familiaris (dog)  365,909
Physcomitrella patens subsp. patens  362,131
Vitis vinifera (wine grape)  357,856
Caenorhabditis elegans (nematode)  355,321
Ictalurus punctatus (channel catfish)  354,434

dbEST release 121407 Summary by Organism - December 14, 2007
Homo sapiens (human)  8,135,176
Mus musculus + domesticus (mouse)  4,850,243
Bos taurus (cattle)  1,497,772
Sus scrofa (pig)  1,475,731
Danio rerio (zebrafish)  1,379,829
Arabidopsis thaliana (thale cress)  1,276,695
Xenopus tropicalis (western clawed frog)  1,271,375
Oryza sativa (rice)  1,214,088
Zea mays (maize)  1,174,690
Triticum aestivum (wheat)  1,051,175
Rattus norvegicus + sp. (rat)  895,827
Ciona intestinalis  686,396
Xenopus laevis (African clawed frog)  677,784
Gallus gallus (chicken)  599,330
Brassica napus (oilseed rape)  567,177
Drosophila melanogaster (fruit fly)  542,677
Hordeum vulgare + subsp. vulgare (barley)  467,916
Salmo salar (Atlantic salmon)  432,815
Glycine max (soybean)  392,393
Canis familiaris (dog)  365,909
Vitis vinifera (wine grape)  352,984
Caenorhabditis elegans (nematode)  346,107
Oryzias latipes (Japanese medaka)  343,846
Pinus taeda (loblolly pine)  328,628
Physcomitrella patens subsp. patens  305,606
Aedes aegypti (yellow fever mosquito)  298,060
Branchiostoma floridae (Florida lancelet)  277,538
Gasterosteus aculeatus (three spined stickleback)  276,992
Picea glauca (white spruce)  272,464
Oncorhynchus mykiss (rainbow trout)  260,886
Solanum lycopersicum (tomato)  257,540
Malus x domestica (apple tree)  255,111
Pimephales promelas (fish)  249,941
Medicago truncatula (barrel medic)  249,450
Saccharum officinarum (sugarcane)  246,301
Solanum tuberosum (potato)  230,780
Sorghum bicolor (sorghum)  209,776
Chlamydomonas reinhardtii  202,044
Ixodes scapularis (black-legged tick)  193,480
Ovis aries (sheep)  186,678
Bombyx mori (domestic silkworm)  184,509
Aplysia californica (California sea hare)  179,000
Gossypium hirsutum (upland cotton)  177,244
Hydra magnipapillata (pond hydra)  174,162
Nematostella vectensis (sea anemone)  162,714
Schistosoma mansoni (blood fluke)  158,841
Nicotiana tabacum (tobacco)  158,008
Dictyostelium discoideum (cellular slime mold)  155,032
Anopheles gambiae (African malaria mosquito)  153,165
Lotus japonicus (plant)  150,631
Brassica rapa subsp. pekinensis (Chinese cabbage)  147,217
Trichosurus vulpecula (brushtail possum)  147,199
Strongylocentrotus purpuratus (purple urchin)  141,833
Paracentrotus lividus (sea urchin)  140,897
Picea sitchensis (Sitka spruce)  139,569
Capitella sp. I ECS-2004 (marine polychaete worm)  138,404
Toxoplasma gondii (parasite, pathogen)  129,421
Gadus morhua (Atlantic cod)  126,325
Acyrthosiphon pisum (pea aphid)  120,255
Molgula tectiformis (sea squirt, ascidian)  106,863
Helobdella robusta (leech, annelid worm)  101,359
Macaca fascicularis (cynomolgus monkey)  101,192

Summary by Organism - Nov. 11, 2005
Entries with more than 100,000 ESTs
Number of public entries: 31,307,034
Homo sapiens (human)  7,057,754
Mus musculus + domesticus (mouse)  4,688,047
Xenopus tropicalis (frog)  1,038,272
Rattus sp. (rat)  704,494
Bos taurus (cattle)  702,645
Danio rerio (zebrafish)  689,581
Ciona intestinalis (sea squirt, a tunicate)  686,396
Zea mays (maize)  656,945
Triticum aestivum (wheat)  600,039
Gallus gallus (chicken)  578,445
Sus scrofa (pig)  502,501
Xenopus laevis (African clawed frog)  473,792
Arabidopsis thaliana (thale cress)  420,789
Oryza sativa (rice)  406,790
Hordeum vulgare + subsp. vulgare (barley)  395,019
Drosophila melanogaster (fruit fly)  383,407
Glycine max (soybean)  355,978
Canis familiaris (dog)  349,306
Pinus taeda (loblolly pine)  329,469
Caenorhabditis elegans (nematode, round worm)  302,080
Branchiostoma floridae (Florida lancelet, Amphioxus)  277,538
Pimephales promelas (minnow)  249,938
Saccharum officinarum (sugar cane)  246,301
Oncorhynchus mykiss (rainbow trout)  239,327
Oryzias latipes (Japanese medaka)  221,546
Solanum tuberosum (potato)  219,765
Medicago truncatula (barrel medic)  216,703
Aedes aegypti (yellow fever mosquito)  213,805
Sorghum bicolor (sorghum)  208,466
Lycopersicon esculentum (tomato)  199,279
Malus x domestica (apple tree)  197,774
Vitis vinifera (grapes)  190,434
Hydra magnipapillata (hydra, radial animal)  174,162
Gasterosteus aculeatus (stickleback fish)  170,994
Chlamydomonas reinhardtii (green alga)  167,641
Schistosoma mansoni (blood fluke)  158,841
Dictyostelium discoideum (cellular slime mold)  155,032
Anopheles gambiae (African malaria mosquito)  153,165
Strongylocentrotus purpuratus (purple sea urchin)  130,988
Bombyx mori (domestic silkworm)  127,748
Toxoplasma gondii (protozoan parasite)  125,741
Physcomitrella patens subsp. patens (moss)  120,702
Salmo salar (Atlantic salmon)  113,002
Lotus corniculatus (Birdsfoot trefoil plant, legume)  111,623
Molgula tectiformis (ascidian, tunicate, like Ciona)  106,863
Picea glauca (white spruce)  104,305

Summary by Organism - January 2, 2004
Number of public entries: 19,635,256
Homo sapiens (human genome sequenced)  5,469,433
Mus musculus + domesticus (mouse genome sequenced)  4,030,839
Rattus sp. (rat genome almost sequenced)  558,402
Triticum aestivum (wheat)  549,915
Ciona intestinalis (sea squirt, a urochordate, genome seq)  492,511
Gallus gallus (chicken)  451,655
Danio rerio (zebrafish)  405,962
Zea mays (maize)  391,145
Xenopus laevis (African clawed frog, tetraploid)  357,038
Hordeum vulgare + subsp. vulgare (barley)  348,282
Glycine max (soybean)  344,524
Bos taurus (cattle)  331,139
Silurana tropicalis (frog, diploid smaller genome)  297,086
Drosophila melanogaster (fruit fly genome sequenced)  267,332
Oryza sativa (rice genome sequenced)  266,949
Saccharum officinarum (sugar cane)  246,301
Sus scrofa (pig genome is being sequenced)  240,001
Caenorhabditis elegans (nematode worm, genome sequenced)  215,200
Arabidopsis thaliana (thale cress genome sequenced)  196,904
Medicago truncatula (barrel medic)  187,763
Sorghum bicolor (sorghum a grass)  161,766
Dictyostelium discoideum (slime mold genome sequenced)  155,032
Chlamydomonas reinhardtii (green algae, genome sequenced)  154,600
Lycopersicon esculentum (tomato)  150,410
Schistosoma mansoni (blood fluke)  139,135
Oncorhynchus mykiss (rainbow trout)  137,127
Vitis vinifera (grape)  135,712
Anopheles gambiae (African malaria mosquito genome seq)  134,784
Solanum tuberosum (potato)  132,122
Pinus taeda (loblolly pine)  110,622
Oryzias latipes (Japanese medaka)  103,098
Physcomitrella patens subsp. patens (moss)  82,313
Toxoplasma gondii (protozoan parasite)  72,859
Lactuca sativa (garden lettuce)  68,188
Populus tremula x Populus tremuloides (aspen tree hybrid)  65,981
Helianthus annuus (sunflower)  59,841
Salmo salar (salmon)  59,420
Strongylocentrotus purpuratus (purple sea urchin)  51,744
GSS is a database of genomic fragments about the same size as ESTs, but these of course contain introns and non-coding regions as well as exons. Please note that the organisms represented are different from those in the ESTs. This can be of help sometimes. Notice there are 10 different species of rice (Oryza). Oryzias latipes is not rice, but a small fish that grows in the rice paddies. Oryzias, Tetraodon nigroviridis (the freshwater pufferfish) and Danio (zebrafish) are all fish. Their genes can be surprisingly similar to human genes, and they are helpful when trying to assemble genes from genomic DNA when you cannot find the intron-exon boundaries. Do not forget them; they can help you.
dbGSS release 123109 Summary by Organism - December 31, 2009
Number of public entries: 27,301,327
marine metagenome  2,642,942
Mus musculus + domesticus (mouse)  2,437,343
Zea mays + subsp. mays (maize)  2,091,983
Nicotiana tabacum (tobacco)  1,420,579
Homo sapiens (human)  1,268,770
Sus scrofa (pig)  981,378
Rattus norvegicus (rat)  866,532
Canis lupus familiaris (dog)  853,938
Sorghum bicolor  796,840
Brassica oleracea  680,836
Oryzias latipes (ricefish)  566,625
Bos taurus (bovine)  515,056
Arabidopsis thaliana  509,506
Ascaris suum  451,746
Ovis aries (sheep)  425,689
Bombyx mori (silkworm)  411,562
Glycine max (soybean)  368,551
Chlorocebus aethiops (African green monkey)  360,973
Solanum phureja (a kind of potato)  343,872
Mus musculus molossinus  337,492
Solanum lycopersicum (tomato)  319,766
Equus caballus (horse)  315,533
Oryza sativa (rice)  299,369

Summary by Organism - December 14, 2007
marine metagenome  2,642,942
Zea mays + subsp. mays  2,083,122
Mus musculus  1,845,032
Homo sapiens (human)  1,211,538
Canis familiaris (dog)  853,938
Sorghum bicolor  794,962
Brassica oleracea  595,474
Sus scrofa  595,150
Oryzias latipes  566,625
Bos taurus (bovine)  514,766
Arabidopsis thaliana  488,181
Ascaris suum  451,746
Ovis aries (sheep)  425,679
Glycine max (soybean)  368,246
Mus musculus molossinus  337,492
Solanum lycopersicum (tomato)  319,461
Equus caballus  315,464
Rattus norvegicus  307,774
Oryza sativa (rice)  280,447
Oryza australiensis  271,727
Chlorocebus aethiops  253,760
Oryza ridleyi  215,820
Oryza coarctata  204,809
Pan troglodytes  197,579
Ustilago maydis  193,251
Tetraodon nigroviridis  188,963
Brassica rapa subsp. pekinensis  179,678
Oryza minuta  179,008
Macropus eugenii  178,807
Medicago truncatula (barrel medic)  168,815
Gallus gallus (chicken)  164,629
Danio rerio  161,275
Oryza granulata  144,859
Oryza rufipogon  143,074
Solanum tuberosum  141,471
Oryza alta  136,269
Xenopus tropicalis  131,325
Aedes aegypti (yellow fever mosquito)  118,058
Oryza nivara  109,315
Vitis vinifera  109,147
Oryza officinalis  105,939
Drosophila melanogaster  102,113

dbGSS release 111105 Summary by Organism - November 11, 2005
Number of public entries: 13,340,273
Entries with more than 100,000 GSS reads
Zea mays + subsp. mays  1,942,149
Mus musculus  1,515,692
Homo sapiens (human)  936,533
Canis familiaris (dog)  853,905
Brassica oleracea  595,420
Sorghum bicolor  573,724
Bos taurus (bovine)  488,965
Arabidopsis thaliana  416,123
Ovis aries (sheep)  376,602
Mus musculus molossinus  337,492
Sus scrofa  589,992
Rattus norvegicus  307,774
Oryza sativa (rice)  242,464
Oryza coarctata  195,285
Tetraodon nigroviridis  188,963
Lycopersicon esculentum (tomato)  184,832
Pan troglodytes  176,409
Oryza minuta  169,651
Medicago truncatula (barrel medic)  168,679
Gallus gallus (chicken)  164,557
Danio rerio  159,832
Oryza australiensis  137,530
Oryzias latipes  135,781
Xenopus tropicalis  131,325
Oryza alta  128,732
Aedes aegypti (yellow fever mosquito)  118,056
Oryza nivara  106,124
Oryza officinalis  103,251

dbGSS release 010204 Summary by Organism - January 2, 2004
Number of public entries: 7,968,534
Zea mays + subsp. mays (maize, corn)  1,618,438
Mus musculus (mouse)  1,156,190
Homo sapiens (human)  896,054
Canis familiaris (dog)  853,888
Brassica oleracea (cabbage, cauliflower)  595,321
Bos taurus (cattle)  309,619
Rattus norvegicus (rat)  307,593
Arabidopsis thaliana (thale cress)  286,669
Tetraodon nigroviridis (pufferfish)  188,963
Danio rerio (zebrafish)  159,024
Pan troglodytes (chimpanzee)  158,594
Gallus gallus (chicken)  137,133
Aedes aegypti (yellow fever mosquito)  118,044
Oryza sativa (rice)  108,060
Trypanosoma brucei (parasite)  90,847
Entamoeba histolytica (parasite)  79,674
Strongylocentrotus purpuratus (sea urchin)  76,019
Xenopus tropicalis (frog)  66,547
Medicago truncatula (barrel medic)  61,232
Anopheles gambiae (mosquito)  60,352
Drosophila melanogaster (fruit fly)  56,954
Takifugu rubripes (Japanese pufferfish)  50,808
The newest data on divisions from EMBL, Dec. 1, 2011. EMBL has some differences from Genbank. Fungi is broken out from Plants, there is no Primate section, etc.
Data from Jan 6, 2010
1. PRI - primate sequences  11,461,667
2. ROD - rodent sequences  3,916,991
3. MAM - other mammalian sequences  14,250,223
4. VRT - other vertebrate sequences  2,856,393
5. INV - invertebrate sequences  6,550,758
6. PLN - plant, fungal, and protist sequences  3,681,540
7. BCT - bacterial sequences (includes Archaea)  1,485,534
8. VRL - viral sequences  795,464
9. PHG - bacteriophage sequences  5,952
10. SYN - synthetic sequences  91,574
11. UNA - unannotated sequences  225
12. EST - EST sequences (expressed sequence tags)  64,569,628
13. PAT - patent sequences  12,388,931
14. STS - STS sequences (sequence tagged sites)  1,311,239
15. GSS - GSS sequences (genome survey sequences)  27,049,249
16. HTG - HTG sequences (high-throughput genomic sequences)  143,922
17. HTC - unfinished high-throughput cDNA sequencing  551,480
18. CON - constructed (for details on assembly of contigs)  7,648,171
19. ENV - environmental samples  19,702,621

Data from Jan 9, 2008
1. PRI - primate sequences  4,286,364
2. ROD - rodent sequences  3,851,113
3. MAM - other mammalian sequences  9,825,135
4. VRT - other vertebrate sequences  1,810,578
5. INV - invertebrate sequences  2,589,791
6. PLN - plant, fungal, and protist sequences  1,369,982
7. BCT - bacterial sequences (includes Archaea)  450,292
8. VRL - viral sequences  524,240
9. PHG - bacteriophage sequences  3,741
10. SYN - synthetic sequences  55,697
11. UNA - unannotated sequences  4,124,711
12. EST - EST sequences (expressed sequence tags)  49,040,258
13. PAT - patent sequences  4,745,089
14. STS - STS sequences (sequence tagged sites)  933,316
15. GSS - GSS sequences (genome survey sequences)  21,371,486
16. HTG - HTG sequences (high-throughput genomic sequences)  119,165
17. HTC - unfinished high-throughput cDNA sequencing  493,976
18. CON - constructed (for details on assembly of contigs)  4,228,826
19. ENV - environmental samples  2,044,479

Data from Jan 11, 2006
1. PRI - primate sequences  1,962,405
2. ROD - rodent sequences  2,640,641
3. MAM - other mammalian sequences  5,575,208
4. VRT - other vertebrate sequences  532,022
5. INV - invertebrate sequences  1,428,656
6. PLN - plant, fungal, and protist sequences  735,872
7. BCT - bacterial sequences (includes Archaea)  261,480
8. VRL - viral sequences  331,748
9. PHG - bacteriophage sequences  2,886
10. SYN - synthetic sequences  25,747
11. UNA - unannotated sequences  1,037,720
12. EST - EST sequences (expressed sequence tags)  32,633,559
13. PAT - patent sequences  2,685,324
14. STS - STS sequences (sequence tagged sites)  880,049
15. GSS - GSS sequences (genome survey sequences)  13,671,106
16. HTG - HTG sequences (high-throughput genomic sequences)  80,350
17. HTC - unfinished high-throughput cDNA sequencing  435,673
18. CON - contig (for details on assembly of contigs)  309,981
19. ENV - environmental samples  206,786

Data from April 15, 2001
1. PRI - primate sequences  1,111,961
2. ROD - rodent sequences  484,969
3. MAM - other mammalian sequences  1,146,543
4. VRT - other vertebrate sequences  143,570
5. INV - invertebrate sequences  375,247
6. PLN - plant, fungal, and protist sequences  530,861
7. BCT - bacterial sequences (includes Archaea)  239,646
8. VRL - viral sequences  214,974
9. PHG - bacteriophage sequences  2,481
10. SYN - synthetic sequences  9,793
11. UNA - unannotated sequences  1,146
12. EST - EST sequences (expressed sequence tags)  19,667,401
13. PAT - patent sequences  1,489,482
14. STS - STS sequences (sequence tagged sites)  258,123
15. GSS - GSS sequences (genome survey sequences)  8,109,223
16. HTG - HTG sequences (high-throughput genomic sequences)  67,813
17. HTC - unfinished high-throughput cDNA sequencing  149,393
18. CON - contig (for details on assembly of contigs)  3,639

Note: these numbers can be found by an Entrez nucleotide search for gbdiv_pri etc.
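The Entrez search mentioned in the note can also be scripted. The following is a minimal sketch, not an official NCBI recipe: it only builds the URL for an E-utilities esearch request and does not contact NCBI. The nuccore database name, the [PROP] field tag and the helper name division_count_url are my assumptions about how the gbdiv_pri search maps onto E-utilities, so check them against the current NCBI documentation before relying on any counts.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities esearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def division_count_url(division, db="nuccore"):
    """Build an esearch URL that asks only for the record count of one division.

    division="PRI" produces the search term gbdiv_pri[PROP].
    The db name and the [PROP] field tag are assumptions (see the text above).
    """
    params = {
        "db": db,
        "term": f"gbdiv_{division.lower()}[PROP]",
        "rettype": "count",
    }
    return f"{ESEARCH}?{urlencode(params)}"


# Print one URL per division of interest; paste a URL into a browser to see the count.
for code in ("PRI", "ROD", "EST", "GSS"):
    print(code, division_count_url(code))
```

Building the URL separately from fetching it keeps the example runnable offline and makes it easy to inspect the exact query before sending it.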
>Dictyostelium CYP51 P450 C-terminal
ETQKDINDIVQKENQGEINFDGLKRMNRLETVIREVLRLHPPLIFLMRK
VMTPMEYKGKTIPAGHILAVSPQVGMRLPTVYKNPDSFEPKRFDVED
KTPFSFIAFGGGKHGCPGENFGILQIKTIWTVLSTKYNLEVGPVPPTD
FTSLVAGPKGPCMVKYSKKQK*
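If you prefer to handle the test sequence in a script before pasting it, the sketch below reads the FASTA record above into a header and a single sequence string. The function name read_single_fasta is hypothetical, and dropping the trailing * (the stop symbol) is my choice for convenience, not something BLAST requires.

```python
FASTA = """>Dictyostelium CYP51 P450 C-terminal
ETQKDINDIVQKENQGEINFDGLKRMNRLETVIREVLRLHPPLIFLMRK
VMTPMEYKGKTIPAGHILAVSPQVGMRLPTVYKNPDSFEPKRFDVED
KTPFSFIAFGGGKHGCPGENFGILQIKTIWTVLSTKYNLEVGPVPPTD
FTSLVAGPKGPCMVKYSKKQK*
"""


def read_single_fasta(text):
    """Return (header, sequence) for a FASTA text holding exactly one record."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a FASTA header line starting with '>'")
    header = lines[0][1:]
    # Join the wrapped sequence lines and drop the trailing stop symbol.
    sequence = "".join(lines[1:]).rstrip("*")
    return header, sequence


header, sequence = read_single_fasta(FASTA)
print(header)
print(len(sequence), "amino acids")
```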
Once your sequence is pasted in the window, you need to select a database from a pull down menu. Nr/nt is the default. Below the Choose Search Set section
is the BLAST button. Do not click this yet. Underneath the BLAST button are the algorithm parameters. These include the filter option. I usually recommend turning the filter off, unless
there is an unusual repetitive sequence, like a run of QQQQQ, in your sequence. If you
leave the filter on, you may find some unpleasant runs of XXXXXXX in the blast
output where you do not want it. One of the best features of this blast
page is the limitation of sequences to be searched. This is found in the Choose Search Set section. Here you can type in a species like
mouse (Mus musculus) or a whole taxonomic range of
organisms like green plants (Viridiplantae) or Fungi. (All organisms is the
default). You can also choose to type in a very specific set yourself like
Diplomonadida or Rhodophyta. The program follows your spelling and offers you the closest options so you often do not have to type in more than three letters. In the Entrez Query window you can enter boolean queries such as metazoa[orgn] NOT vertebrata[orgn]. This will give you invertebrate sequences.
I use this feature all the time to restrict the search output to just what I
want. For the Dictyostelium test sequence above, you may select Dictyostelium discoideum from the organism menu if you
want to find the exact sequence above, or you may expand the search to closely
related organisms by typing in Mycetozoa. It may be more interesting to use
Fungi on this first blast, since there are many CYP51s
known from fungi in the nr database.
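The Boolean organism limits described above follow a simple pattern, so they are easy to generate. This is a small sketch under my own naming (organism_query is a hypothetical helper); it only assembles the text you would type into the Entrez Query window and does not talk to NCBI.

```python
def organism_query(include, exclude=()):
    """Assemble an Entrez organism limit such as 'metazoa[orgn] NOT vertebrata[orgn]'.

    include: taxon names to search (joined with OR when more than one is given).
    exclude: taxon names to remove from the search with NOT.
    """
    if not include:
        raise ValueError("at least one taxon to include is required")
    query = " OR ".join(f"{name}[orgn]" for name in include)
    if len(include) > 1:
        # Parentheses keep the OR group together before any NOT terms.
        query = f"({query})"
    for name in exclude:
        query += f" NOT {name}[orgn]"
    return query


# The invertebrate example from the text.
print(organism_query(["metazoa"], ["vertebrata"]))
# Widening the Dictyostelium search to Mycetozoa or Fungi.
print(organism_query(["Mycetozoa", "Fungi"]))
```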
Once you press the BLAST button, you will get a screen that refreshes every few seconds until the search is done. The final output will give you
another window with a graphical box at the top that has
colored bars representing the hits. Red is for the highest scoring hit and the
colors get cooler as the scores drop. Red hits are generally good matches.
Since we searched a slime mold vs. Fungi here, the best hits are magenta not
red.
Below the graphics box is a text list of your hits with the best scores at the
top. The E value numbers (6e-20 etc.) refer to the chance that this was an
accidental match. The larger the negative exponent the better the match and the
lower the probability that it was a chance occurrence.
This list is hyperlinked to the actual Genbank entries for each sequence. Below
the text list are the actual alignments. This is where you see what you have
found.
The output here shows a gap in the Dicty sequence compared to most of the
fungal sequences. Since fungi have relatively few introns, this is probably not
an intron in the fungi, but it may be an insertion that is found in fungi and
not in the slime mold sequence. Notice that all the top hits are identified as
CYP51 sequences or 14 alpha demethylase enzymes. This is because the cytochrome
P450 family is extensively annotated and more than 16,000 of the genes are named.
Even the new sequences that are not officially named are usually identified by
this type of blast search and the authors often tag them as a P450 (in this
case CYP51).
Your first search was intended to get you started in blast searching with a simple example. There are many aspects of this tool that we need to discuss in more detail. The first is the selection of the different databases to search. Most first-time users of a software package tend to use the default settings. In this case that could be disastrous, because you could miss a large number of sequences that are not found in the nr section of Genbank (the default). Some organisms have large EST projects that produce only EST sequences, and these will not be found in nr. Two examples are Triticum aestivum (wheat) and Zea mays (maize or corn), with >1,000,000 EST sequences each. If you are after a plant gene, it may be present in these ESTs but not in nr. The 1,500,000 Bos taurus (cattle) ESTs can also be used to help assemble human, mouse or other vertebrate genes, by indicating the location of introns. On human chromosome 6 there is a gene with an intron that is 479 kb long. This gene could not have been assembled without an mRNA or EST sequence to show that the first exon was that far from the rest of the gene. Another example is Giardia (hikers' diarrhea). Giardia is a common human pathogen, and it is thought to be a very ancient eukaryote. Therefore, it is valuable to sequence for medical and evolutionary studies. The Giardia Genome project is being conducted at Woods Hole Oceanographic Institute and the data is searchable at NCBI in the HTGS section of Genbank (HTGS = High Throughput Genomic Sequences).
If you have not added the blast page at NCBI to your bioinformatics bookmark file, please do it now. NCBI Blast page
I recommend that you add the TBLASTN page also. TBLASTN
You will be needing these over and over as we progress. As a general rule, it is good policy to search nr, HTGS, EST and GSS sections of Genbank when hunting for matches to a new sequence. If your species is in WGS you will definitely want to look there also.
The following is a real example taken from an alignment of mitochondrial carrier sequences. Mitochondrial carriers are membrane transport proteins of the inner mitochondrial membrane (and peroxisomal and hydrogenosomal membranes). These carriers enable communication between the mitochondrial matrix and the cytosol. They are responsible for exporting ATP to the cytosol and importing ADP into the matrix to be remade into ATP. There are 46 of these genes in humans and similar numbers in other species (35 in yeast). I am posting a link to a sequence alignment of 265 of these carriers that has some gaps in some of the sequences. For demonstration purposes we will now try to fill in some of these gaps. Once you are at the page, use the find command to search for GMPA-. This should take you to a small gap in sequence 92. This gap could not be filled when the alignment was made, because this sequence did not exist in Genbank. The sequence looks like this, with eight missing amino acids.
90 -------------ADENGRVGGINLLTAGALAGVPAASLVTPADVIKTRLQVAARAGQ-----TTYSGVV
91 -------------ANEDGQVSPGSLLLAGAIAGMPAASVVTP-DVIKTRLQVAARAGQ-----TTYSGVI
92 -------------ANEDGQVSPGSLLLAGAIAGMPA--------VIKTRLQVAARAGQ-----TTYNGVT
92A ------------PSPEDKTYNILRKVIIAGLASSLACVMSVTLDMAKCRIQGPQPVKG----EVKYQWTI
The second gap is an alignment gap, and there are no missing amino acids there. Copy sequence 92 above and paste it into the blast window at NCBI (use TBLASTN, since the query is a protein sequence and the ESTs are nucleotide sequences). Leave the dashes in the sequence. Select the EST database and turn the filter off. To avoid searching unwanted sequences, type mouse in the organism window so that the search is limited to mouse ESTs. Do the blast.
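The difference between the two gaps in sequence 92 can be checked against a neighboring row of the alignment. The sketch below is only an illustration of that idea (classify_gaps is a hypothetical helper, and the leading dashes of both rows have been trimmed): a run of dashes counts as missing sequence when the reference row has residues in the same columns, and as an alignment gap when the reference row is gapped there too.

```python
import re

# Rows 91 and 92 from the alignment above, with the leading dashes removed.
ROW_91 = "ANEDGQVSPGSLLLAGAIAGMPAASVVTP-DVIKTRLQVAARAGQ-----TTYSGVI"
ROW_92 = "ANEDGQVSPGSLLLAGAIAGMPA--------VIKTRLQVAARAGQ-----TTYNGVT"


def classify_gaps(target, reference):
    """Label each run of dashes in target by comparing it with reference.

    Returns a list of (start, end, kind) with 0-based, end-exclusive columns.
    """
    if len(target) != len(reference):
        raise ValueError("aligned rows must have the same length")
    results = []
    for run in re.finditer(r"-+", target):
        reference_slice = reference[run.start():run.end()]
        # Any residue in the reference at these columns means target lacks data.
        has_residues = reference_slice.replace("-", "") != ""
        kind = "missing sequence" if has_residues else "alignment gap"
        results.append((run.start(), run.end(), kind))
    return results


for start, end, kind in classify_gaps(ROW_92, ROW_91):
    print(f"columns {start + 1}-{end}: {end - start} dashes, {kind}")
```

A single neighbor is a weak reference, since it may carry gaps of its own, so treat the labels as a hint about where to look rather than a verdict.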
>gi|1387895|gb|W76821.1|W76821 me73f10.r1 Soares mouse embryo NbME13.5 14.5 Mus musculus cDNA
clone IMAGE:401227 5' similar to WP:K02F3.2 CE01348 ;.
Length = 455
Score = 48.1 bits (113), Expect = 2e-05
Identities = 23/23 (100%), Positives = 23/23 (100%)
Frame = +3
Query: 1 ANEDGQVSPGSLLLAGAIAGMPA 23
ANEDGQVSPGSLLLAGAIAGMPA
Sbjct: 384 ANEDGQVSPGSLLLAGAIAGMPA 452
>gi|1387483|gb|W77458.1|W77458 me66c04.r1 Soares mouse embryo NbME13.5 14.5 Mus musculus cDNA
clone IMAGE:400518 5' similar to WP:K02F3.2 CE01348 ;.
Length = 471
Score = 43.9 bits (102), Expect = 3e-04
Identities = 21/21 (100%), Positives = 21/21 (100%)
Frame = +2
Query: 24 VIKTRLQVAARAGQTTYNGVT 44
VIKTRLQVAARAGQTTYNGVT
Sbjct: 2 VIKTRLQVAARAGQTTYNGVT 64
Shown above are the two ESTs that were at the existing boundaries of the small gap. Look at the first one, W76821.1. This is the accession number for the sequence, and the .1 is its version number (some version numbers may be as high as 29). Notice on the 3rd line that the Length is 455 nucleotides. Now look at the alignment. The upper sequence in the alignment is your query sequence. The lower sequence is the mouse EST hit. The line in between shows the matches between the two sequences. Look at the numbering on the lower sequence, from 384 to 452. These are nucleotide numbers, three per amino acid. Position 452 is only three nucleotides from the end of the sequence, so this sequence stops where our gap was. The second EST, W77458.1, starts where the gap ends. It has only one nucleotide protruding into the gap.
If you click on the accession number of the first sequence, it takes you to the sequence entry, which was created on June 20, 1996. The second EST was created on the same date. Notice that both ESTs have another identifier that looks like this: me73f10.r1. The r1 on the end of this name indicates it was sequenced from a reverse primer. This information may be valuable, since the insert of this clone may have been sequenced from the opposite end of the clone. That sequence would have a different accession number, but it would have the same clone name with .s1 at the end instead of .r1. The s1 indicates a standard read from a forward primer. The two sequences would be from the same clone, but from opposite ends. This can unite sequence fragments that do not overlap in the middle. You can search for the clone name (without the .r1 part) in the NCBI homepage search window (select nucleotide in the pull down menu first). If you find the .s1 match, then you have more sequence information to use.
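The coordinate arithmetic in this paragraph is worth doing once by hand or in a few lines of code. The sketch below uses only the numbers printed in the two hits (Sbjct start and end, and Length); the function name describe_hit and the dictionary keys are hypothetical, and the frame calculation assumes a forward-strand hit, which is what the + sign on the Frame line indicates.

```python
def describe_hit(sbjct_start, sbjct_end, est_length):
    """Summarize where a forward-strand TBLASTN hit sits on an EST.

    Coordinates are 1-based nucleotide positions, as printed by BLAST.
    """
    span_nt = sbjct_end - sbjct_start + 1
    amino_acids, leftover_nt = divmod(span_nt, 3)
    return {
        "span_nt": span_nt,
        "amino_acids": amino_acids,
        "leftover_nt": leftover_nt,
        # Reading frame 1, 2 or 3 on the plus strand.
        "frame": (sbjct_start - 1) % 3 + 1,
        "nt_before_hit": sbjct_start - 1,
        "nt_after_hit": est_length - sbjct_end,
    }


first = describe_hit(384, 452, 455)   # W76821.1
second = describe_hit(2, 64, 471)     # W77458.1
print(first)
print(second)
```

For W76821.1 this gives 69 nucleotides, or 23 amino acids, in frame 3 with 3 nucleotides left after the hit. For W77458.1 it gives 21 amino acids in frame 2 with a single nucleotide before the hit, which is the one nucleotide protruding into the gap.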
We have seen that the gap in the original sequence was caused by these sequences ending just a little short of overlapping. The two best hits in this search do span the gap, allowing the missing sequence to be filled in.
>gi|9904727|gb|BE624311.1|BE624311 uu45a06.y1 Soares_thymus_2NbMT Mus musculus cDNA clone
IMAGE:3374866 5' similar to TR:O14575 O14575 SIMILAR TO
ADP/ATP CARRIER PROTEINS ;.
Length = 480
Score = 80.1 bits (196), Expect = 4e-15
Identities = 44/52 (84%), Positives = 44/52 (84%), Gaps = 8/52 (15%)
Frame = +1
Query: 1   ANEDGQVSPGSLLLAGAIAGMPA--------VIKTRLQVAARAGQTTYNGVT 44
           ANEDGQVSPGSLLLAGAIAGMPA        VIKTRLQVAARAGQTTYNGVT
Sbjct: 283 ANEDGQVSPGSLLLAGAIAGMPAASLVTPADVIKTRLQVAARAGQTTYNGVT 438
>gi|3519846|gb|AI119522.1|AI119522 uf04h04.y1 Sugano mouse liver mlia Mus musculus cDNA clone
IMAGE:1499671 5' similar to WP:K02F3.2 CE01348 ;.
Length = 670
Score = 80.1 bits (196), Expect = 4e-15
Identities = 44/52 (84%), Positives = 44/52 (84%), Gaps = 8/52 (15%)
Frame = +2
Query: 1   ANEDGQVSPGSLLLAGAIAGMPA--------VIKTRLQVAARAGQTTYNGVT 44
           ANEDGQVSPGSLLLAGAIAGMPA        VIKTRLQVAARAGQTTYNGVT
Sbjct: 482 ANEDGQVSPGSLLLAGAIAGMPAASLVTPADVIKTRLQVAARAGQTTYNGVT 637
BE624311 (uu45a06.y1) was created August 24, 2000. The second one, AI119522, was created Sept. 2, 1998. Note that the clone names end in .y1. Such a read pairs up with a .z1 read of the same clone name. The hit shown below does not come from the same gene. It is nearly identical on the right side, but not on the left. It is from sequence 90 in the alignment (see above). You need to be careful, especially with short sequences, that you are not making chimeric or hybrid sequences when you assemble protein sequences from fragments.
>gi|10753840|gb|BF022507.1|BF022507 uy43g06.y1 NCI_CGAP_Lu30 Mus musculus cDNA clone IMAGE:3662362 5'
similar to TR:O75746 O75746 ARALAR1 PROTEIN. ;.
Length = 450
Score = 56.6 bits (135), Expect = 5e-08
Identities = 32/51 (62%), Positives = 39/51 (75%), Gaps = 8/51 (15%)
Frame = +3
Query: 1   ANEDGQVSPGSLLLAGAIAGMPA--------VIKTRLQVAARAGQTTYNGV 43
           A+E+G+V   +LL AGA+AG+PA        VIKTRLQVAARAGQTTY+GV
Sbjct: 294 ADENGRVGGINLLTAGALAGVPAASLVTPADVIKTRLQVAARAGQTTYSGV 446
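When deciding whether a hit like BF022507 is the same gene or only a relative, it can help to recount the Identities, Positives and Gaps figures from the three alignment rows. The sketch below assumes the usual BLAST convention that the middle row carries a letter for an identical residue and a plus sign for a conservative substitution. The function name is hypothetical.

```python
def blast_counts(query_row, midline, sbjct_row):
    """Return (identities, positives, gaps, alignment_length) for one block.

    Only the characters are counted, so the result does not depend on how
    the rows are padded or indented.
    """
    identities = sum(ch.isalpha() for ch in midline)      # letters = identical
    positives = identities + midline.count("+")           # '+' = conservative
    gaps = query_row.count("-") + sbjct_row.count("-")    # gap characters
    return identities, positives, gaps, len(query_row)


# Rows copied from the BF022507 hit above.
query = "ANEDGQVSPGSLLLAGAIAGMPA--------VIKTRLQVAARAGQTTYNGV"
middle = "A+E+G+V   +LL AGA+AG+PA        VIKTRLQVAARAGQTTY+GV"
sbjct = "ADENGRVGGINLLTAGALAGVPAASLVTPADVIKTRLQVAARAGQTTYSGV"

print(blast_counts(query, middle, sbjct))
```

For this hit the mismatches are concentrated in the left half of the alignment, which is the pattern the text warns about when fragments from related genes are joined.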
That was an easy example of using the EST database to find a missing sequence fragment from a partial gene. Now we will try a harder problem with a larger gap, also in a mouse carrier. Go back to the sequence alignment link at the top of this page and search for -SFLAG in the alignment. You should find sequence 96 as shown below.
segment 3
94A ----------PRKSDGSGEAVFYWSLIAGLLSGMTSAFMVTPFDVVKTRLQADGEKK--------FKGIM 0
95 ------------FNELAGKASFAHSFVSGCVAGSIAAVAVTPLDVLKTRIQTLKKGLGE----DMYSGIT 0
96 ------------------------SFLAGCVAGSAAAVAVNPCDVVKTRLQSLERGVNE----DTYSGFL 0
96A ----------PRRNDGSGEAVFWCSFLAGLAAGSTAALAVNPFDVVKTRLQAIKKADGE----KEFKGIS 0
This gap extends upstream over several pages of the alignment. For your convenience they are shown here with some surrounding sequence. This will be important later. Notice that sequences 95 and 96 in the first segment below are nearly 100% identical. Sequence 95 is a human sequence and it is the ortholog of the mouse sequence with the gap. That will be useful to us.
segment 1
94 GMYRGAAVNLTLVTPEKAIKLAANDFFRHQLS-KDG-----QKLTLLKEMLAGCGAGTCQVIVTTPMEML 112
94A GMYRGSAVNIVLITPEKAIKLTANDFFRYHLASDDG------VIPLSRATLAGGLAGLFQIVVTTPMELL 0
95 GMYRGAAVNLTLVTPEKAIKLAANDFFRRLLM-EDG-----MQRNLKMEMLAGCGAGMCQVVVTCPMEML 0
96 GMYRGAAVNLTLVTPEKAIKLAANDFLRH----------------------------------------- 0
96A GMYRGSGVNILLITPEKAIKLTANDYFRHKLTTKDG------KLPLTSQMVAGGLAGAFQIIVTTPMELL 0
segment 2
94 KIQLQDAGRIAAQRKILAPRSTATQLTRDLLRSRG-IAGLYKGLGATLLRDVPFSVVYFPLFANLNQLG- 0
94A KIQMQDAGRVDRAAGREVKTITALGLTKTLLRERG-IFGLYKGVGATGVRDITFSMVYFPLMAWINDQG- 0
95 KIQLQDAGRLAVHHQGSARRPSATLIAWELLRTQG-LAGLYRGLGATLLRDIPFSIIYFPLFANLNNLG- 0
96 ---------------------------------------------------------------------- 0
96A KIQMQDAGRVAKLAGKTVEKVSATQLASQLIKDKG-IFGLYKGIGATGLRDVTFSIIYFPLFATLNDLG- 0
Use the same technique as before. Copy the beginning and end of sequence 96 and place them in the tblastn window. Select mouse ESTs and turn off the filter. Run the blast. The results will show hits that extend into the gap region.
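If you prefer to prepare the query outside the browser, the copy-and-paste step can be sketched as below: strip the gap characters from the two alignment rows of sequence 96 and join the flanks into one FASTA record. The record name seq96_flanks is invented for this example.

```python
def ungap(row):
    """Remove alignment gap characters from one row of the alignment."""
    return row.replace("-", "")


# Rows for sequence 96 copied from segment 1 and segment 3 of the alignment.
left_row = "GMYRGAAVNLTLVTPEKAIKLAANDFLRH-----------------------------------------"
right_row = "------------------------SFLAGCVAGSAAAVAVNPCDVVKTRLQSLERGVNE----DTYSGFL"

left_flank = ungap(left_row)     # known sequence before the gap
right_flank = ungap(right_row)   # known sequence after the gap
query = left_flank + right_flank

# FASTA text that can be pasted into the tblastn query window.
fasta = f">seq96_flanks\n{query}\n"
print(fasta)
print(len(left_flank), len(query))
```

The flank lengths tell you where the gap falls in query coordinates, which makes it easy to see in the BLAST output which hits run past the junction.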
>gi|11620519|gb|BF533156.1|BF533156 602073695F1 NCI_CGAP_Li9 Mus musculus cDNA clone IMAGE:4210775 5'.
Length = 890
Score = 86.3 bits (212), Expect = 4e-17
Identities = 42/42 (100%), Positives = 42/42 (100%)
Frame = +1
Query: 30  SFLAGCVAGSAAAVAVNPCDVVKTRLQSLERGVNEDTYSGFL 71
           SFLAGCVAGSAAAVAVNPCDVVKTRLQSLERGVNEDTYSGFL
Sbjct: 124 SFLAGCVAGSAAAVAVNPCDVVKTRLQSLERGVNEDTYSGFL 249
>gi|10379135|gb|BE861312.1|BE861312 UI-M-AK0-adi-c-11-0-UI.r1 NIH_BMAP_MHY Mus musculus cDNA clone
UI-M-AK0-adi-c-11-0-UI 5'.
Length = 538
Score = 65.5 bits (158), Expect = 8e-11
Identities = 39/73 (53%), Positives = 43/73 (58%), Gaps = 13/73 (17%)
Frame = +3
Query: 1   GMYRGAAVNLTLVTPEKAIKLAANDFLRH-------------SFLAGCVAGSAAAVAVNP 47
           GMYRGAAVNLTLVTPEKAIKLAANDFLR                LAGC AG        P
Sbjct: 309 GMYRGAAVNLTLVTPEKAIKLAANDFLRQLLMQDGTQRNLKMEMLAGCGAGICQVGITCP 488
Query: 48  CDVVKTRLQSLER 60
            +++K +LQ   R
Sbjct: 489 KEMLKIQLQDAGR 527
The first hit above (BF533156) extends 123 bp into the gap upstream. You can see this from the numbering of the lower sequence, which starts at 124. To get this upstream sequence, click on the accession number in the blast search and go to the sequence entry. Copy the sequence and take it to the EXPASY DNA translator.
Paste it in the window and then translate it. Compare the translations to the mouse and the human sequence above it to find the correct frame.
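The coordinate arithmetic used here and in the earlier W76821 example can be written down as a small sketch. The function names are hypothetical. The inputs are the Sbjct start and end numbers and the Length value from the BLAST output.

```python
def unaligned_flanks(sbjct_start, sbjct_end, est_length):
    """Nucleotides of the EST lying before and after the aligned region.

    Returned as (before_lowest_coordinate, after_highest_coordinate), in EST
    coordinates. For a minus-frame hit the two values swap roles relative
    to the protein.
    """
    low, high = sorted((sbjct_start, sbjct_end))
    return low - 1, est_length - high


def aligned_codons(sbjct_start, sbjct_end):
    """Number of codons spanned by an ungapped tblastn alignment."""
    span = abs(sbjct_end - sbjct_start) + 1
    if span % 3:
        raise ValueError("span is not a whole number of codons")
    return span // 3


# BF533156: Sbjct 124..249, Length = 890
print(unaligned_flanks(124, 249, 890), aligned_codons(124, 249))
# W76821: Sbjct 384..452, Length = 455
print(unaligned_flanks(384, 452, 455), aligned_codons(384, 452))
```

The first value returned for BF533156 is the 123 bp of upstream sequence mentioned above. Dividing it by three gives the number of amino acids the EST can add to the gap if the reading frame holds.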
The DNA sequence of BF533156
GCCAACCTGAATCAGCTGGGCAAACAGGGGGAAGTAGACAATGGAGAAGGGAACATCCCC
CTGTTTGCCAACCTGAATCAGCTGGGCCGCCCATCCTCTGAGGAGAAGTCGCCTTTCTAT
GTGTCCTTCCTAGCAGGCTGCGTGGCTGGGAGTGCAGCCGCTGTGGCCGTCAACCCTTGT
GATGTGGTGAAGACTCGGCTCCAGTCCCTTGAGAGAGGTGTTAATGAGGACACTTACTCT
GGGTTTCTGGACTGTGCAAGGAAGATCTGGAGACATGAAGGTCCCTCAGCCTTCCTGAAA
GGCGCATACTGCCGTGCGCTGGTCATTGCCCCGCTGTTTGGCATCGCCCAGGTGGTCTAC
TTTCTGGGCATTGCCGAGTCCCTGCTGGGGCTGCTGCAAGAACCCCAGGCCTGAGCCCAT
GGCTGCTTCTCTCCAGCCTATGGGCAGGGGCCAGAACAGGGTGACCAGCACAAGCCTGAG
GAGGAGTGGTCTCTCCCCGGTCCTCCTCATTAAGATGGGAAGGCAAGGGGAGGGTGCAGG
GTCCACATGGGTGATGCACACATAAGCCCCTGTGTGGTCCTGAAGGGACAACAAATGGGA
TCGAGGTCTTATCTATGTAGAAAATGCAGAAATCTGTACATCCCTCAAGCCAGTTCTGTC
CCATCCTTGTTACTCAAACCCAGTCCACTGGCTGAACACCCATGGGACAGAGCTGGTCTC
TGGGTGGGGGCCCCAGGCCTGGTTTGGGAGGGGGACCTACCTGGGGTTCACTGGGCCTGG
CCCTGGGGGCCCTGGCTTCCATAGGGGCCCACCCCCGATTTTTTGGGTTCCCCGCCAGGG
GTCTCGGCCGGCGAGCTTGCCGGTGGTCCCCTCGACCTGTCCCCGCTTGG
Only the first three lines are needed to include the first 124 bp.
Translating the first 180 bp in phase 1 gives the known sequence SFLAG... and no stop codons, so this frame is probably correct if there are no frameshifts.
Compare it to the human sequence.
At least from PLFAN the sequences match, but upstream they do not and this region must be considered suspect. You have now filled in part of the gap. The region between FANLNNLG and GCVAG is not very strong and may contain a frameshift. Watch for better matches in this region.
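The three-phase translation can also be reproduced locally. The sketch below is not the EXPASY tool. It is a minimal stand-in that assumes the standard genetic code and uses a hypothetical function name, applied to the first 180 bp of BF533156.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame=0):
    """Translate dna starting at offset frame (0, 1 or 2); '*' marks a stop.

    Codons containing anything other than A, C, G, T are written as 'X'.
    """
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


# First three lines (180 bp) of the BF533156 sequence shown above.
FIRST_180 = (
    "GCCAACCTGAATCAGCTGGGCAAACAGGGGGAAGTAGACAATGGAGAAGGGAACATCCCC"
    "CTGTTTGCCAACCTGAATCAGCTGGGCCGCCCATCCTCTGAGGAGAAGTCGCCTTTCTAT"
    "GTGTCCTTCCTAGCAGGCTGCGTGGCTGGGAGTGCAGCCGCTGTGGCCGTCAACCCTTGT"
)

for phase in range(3):
    protein = translate(FIRST_180, phase)
    print(f"Phase {phase + 1}: {protein}  stops={protein.count('*')}")
```

Counting the stop codons in each phase is a quick first filter. The phase without stops that also contains the known SFLAG motif is the candidate to compare against the human sequence.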
Look at the other hit, BE861312. This has sequence going into the gap from the other side. Compare the sequence with the human sequence. The top sequence is from the blast output, starting after the known sequence NDFLR.
Phase 1
ANLNQLGKQGEVDNGEGNIPLFANLNQLGRPSSEEKSPFYVSFLAGCVAGSAAAVAVNPC
Phase 2
PT*ISWANRGK*TMEKGTSPCLPT*ISWAAHPLRRSRLSMCPS*QAAWLGVQPLWPSTL
Phase 3
QPESAGQTGGSRQWRREHPPVCQPESAGPPIL*GEVAFLCVLPSRLRGWECSRCGRQPL
Phase 1 mouse
ANLNQLGKQGEVDNGEGNIPLFANLNQLGRPSSEEKSPFYVSFLAGCVAGSAAAVAVNPC
                   ||||||| ||      |  |  ||  |||||| ||||| |
YRGLGATLLRDIPFSIIYFPLFANLNNLGFNELAGKASFAHSFVSGCVAGSIAAVAVTPL
Human 95
mouse
QLLMQDGTQRNLKMEMLAGCGAGICQVGITCPKEMLKIQLQDAGR
 ||| || ||||||||||||||| |||  ||| ||||||||||||
RLLMEDGMQRNLKMEMLAGCGAGMCQVVVTCPMEMLKIQLQDAGR
Human 95
They are almost identical and so the frame is correct. This fills in much of the gap from the other side. There are only about 48 amino acids missing in the middle now. To try to fill this gap you can repeat the search of the EST database or you can go to another section to find the missing part, the HTGS section. I recommend this. You will find genomic DNA with introns here, so you need to have the human protein sequence to help identify the missing mouse exon sequence. Construct a hybrid sequence with the new mouse sequence you just found and human from seq 95 to fill in the gap. Then blast the mouse HTGS section of Genbank. In the sequence below human is in lower case.
RLLMEDGMQRNLKMEMLAGCGAGMCQVVVTCPMEMLKIQLQDAGR
lavhhqgsarrpsatliawellrtqglaglyrglgatllrdipfsiiyf
PLFANLNNLGFNELAGKASFAHSFVSGCVAGSIAAVAVTPL
The result from this search shows a good match, which is probably to the same gene. There are some differences; however, this is a genomic sequence, and the gene is on a 90,000 bp contig that is probably more accurate than the EST sequences. There are also three different exons here (see the nucleotide numbering). One of the regions with the greatest number of sequence differences is the region that looked earlier as if it might contain frameshifts. The differences in the 2nd exon are mouse-human differences, because we searched with human sequence, not mouse.
>gb|AC084273.11|AC084273 Mus musculus clone rp23-313e8, WORKING DRAFT SEQUENCE, 5 unordered pieces
Length = 182050
Score = 81.6 bits (200), Expect = 2e-15
Identities = 38/45 (84%), Positives = 40/45 (88%)
Frame = -2
Query: 9 QRNLKMEMLAGCGAGMCQVVVTCPMEMLKIQLQDAGRLAVHHQGS 53
QRNLKMEMLAGCGAG+CQVV+TCPMEMLKIQLQDAGRL H S
Sbjct: 142635 QRNLKMEMLAGCGAGICQVVITCPMEMLKIQLQDAGRLGEAHPNS 142501
Score = 58.5 bits (140), Expect = 2e-08
Identities = 30/34 (88%), Positives = 31/34 (90%)
Frame = -3
Query: 55 RRPSATLIAWELLRTQGLAGLYRGLGATLLRDIP 88
RRPSATLIA ELLRTQGL+GLYRGLGATLLR P
Sbjct: 139970 RRPSATLIARELLRTQGLSGLYRGLGATLLR*AP 139869
Score = 92.4 bits (228), Expect = 1e-18
Identities = 44/53 (83%), Positives = 47/53 (88%)
Frame = -3
Query: 83 LLRDIPFSIIYFPLFANLNNLGFNELAGKASFAHSFVSGCVAGSIAAVAVTPL 135
L RDIPFSIIYFPLFANLN LG +EL GKASF HSFV+GC AGS+AAVAVTPL
Sbjct: 139628 LYRDIPFSIIYFPLFANLNQLGVSELTGKASFTHSFVAGCTAGSVAAVAVTPL 139470
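The nucleotide numbering of the three alignments against AC084273 can be checked with a short sketch. It confirms that each ungapped alignment spans three nucleotides per amino acid and estimates how much genomic sequence separates neighboring alignments. The class and function names are hypothetical. The separations are only rough upper limits on intron size, because alignment ends are not exact splice sites.

```python
from typing import NamedTuple


class Hsp(NamedTuple):
    q_start: int   # first query amino acid
    q_end: int     # last query amino acid
    s_start: int   # Sbjct coordinate printed at the start of the alignment
    s_end: int     # Sbjct coordinate printed at the end of the alignment


# Coordinates copied from the three alignments above (all on the minus strand).
HSPS = [
    Hsp(9, 53, 142635, 142501),
    Hsp(55, 88, 139970, 139869),
    Hsp(83, 135, 139628, 139470),
]


def codon_consistent(hsp):
    """True if the subject span is exactly three nucleotides per residue."""
    return abs(hsp.s_start - hsp.s_end) + 1 == 3 * (hsp.q_end - hsp.q_start + 1)


def subject_separations(hsps):
    """Nucleotides between consecutive alignments, ordered along the query."""
    ordered = sorted(hsps, key=lambda h: h.q_start)
    separations = []
    for a, b in zip(ordered, ordered[1:]):
        a_low, a_high = sorted((a.s_start, a.s_end))
        b_low, b_high = sorted((b.s_start, b.s_end))
        # Distance between the nearest edges; negative if the spans overlap.
        separations.append(max(a_low, b_low) - min(a_high, b_high) - 1)
    return separations


print([codon_consistent(h) for h in HSPS])
print(subject_separations(HSPS))
```

Separations of a few hundred to a few thousand nucleotides between alignments that are consecutive in the query are what you would expect if each alignment sits on a separate exon.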
Probable sequence after assembly
QRNLKMEMLAGCGAGICQVVITCPMEMLKIQLQDAGRLXXXXXXX
RRPSATLIARELLRTQGLSGLYRGLGATLLRDIPFSIIYFPLFANLNQLGVSELTGKASFTHSFVAGCTAGSVAAVAVTPL
The sequence is almost done now, but there is the small matter of an intron-exon junction to be settled. Introns are intervening sequences in genes that separate exons of coding sequence. Introns usually begin with GT and end with AG. The human sequence has a phase 1 intron after DAGRL and the mouse is expected to have the same intron location. However, the human and mouse sequences do not match here for 7 amino acids. It would be desirable to find a mouse EST that bridges this gap. The following search shows that there is an EST that does this, and this sequence indicates that the mouse sequence is longer than the human sequence from the alignment.
>gb|AI848503.1|AI848503 UI-M-AP1-agf-c-11-0-UI.s2 NIH_BMAP_MST_N Mus musculus cDNA clone
UI-M-AP1-agf-c-11-0-UI 3'.
Length = 438
Score = 200 bits (508), Expect = 5e-51
Identities = 106/131 (80%), Positives = 106/131 (80%), Gaps = 18/131 (13%)
Frame = -3
Query: 1 QRNLKMEMLAGCGAGICQVVITCPMEMLKIQLQDAGRLXX------------------XX 42
QRNLKMEMLAGCGAGICQVVITCPMEMLKIQLQDAGRL
Sbjct: 415 QRNLKMEMLAGCGAGICQVVITCPMEMLKIQLQDAGRLAVCHQASASATPTSRPYSTGST 236
Query: 43 XXXRRPSATLIARELLRTQGLSGLYRGLGATLLRDIPFSIIYFPLFANLNQLGVSELTGK 102
RRPSATLIARELLRTQGLSGLYRGLGATLLRDIPFSIIYFPLFANLNQLGVSELTGK
Sbjct: 235 STHRRPSATLIARELLRTQGLSGLYRGLGATLLRDIPFSIIYFPLFANLNQLGVSELTGK 56
Query: 103 ASFTHSFVAGC 113
ASFTHSFVAGC
Sbjct: 55 ASFTHSFVAGC 23
This actually identifies a flaw in the alignment where a part of the human gene was left out because it was too long to fit into the existing alignment. The match below is human on the bottom and mouse on the top, showing the extra sequence also exists in a human EST.
>gb|BF329616.1|BF329616 CM2-BN0273-080600-227-b08 BN0273 Homo sapiens cDNA.
Length = 298
Score = 167 bits (423), Expect = 9e-41
Identities = 84/95 (88%), Positives = 88/95 (92%)
Frame = -3
Query: 4 LKMEMLAGCGAGICQVVITCPMEMLKIQLQDAGRLAVCHQASASATPTSRPYSTGSTSTH 63
LKMEMLAGCGAG+CQVV+TCPMEMLKIQLQDAGRLAV HQ SASA TSR Y+TGS STH
Sbjct: 287 LKMEMLAGCGAGMCQVVVTCPMEMLKIQLQDAGRLAVHHQGSASAPSTSRSYTTGSASTH 108
Query: 64 RRPSATLIARELLRTQGLSGLYRGLGATLLRDIPF 98
RRPSATLIA ELLRTQGL+GLYRGLGATLLRDIPF
Sbjct: 107 RRPSATLIAWELLRTQGLAGLYRGLGATLLRDIPF 3
This effort now successfully fills in the gap in mouse sequence 96.
Assignment 1
To test out your new skills at blast searching, let's assemble a gene or two from ESTs. Since there are about 10 of us registered for the course, I have assigned each of you two Arabidopsis (or rice) sequences to search against the Zea mays EST database to find the maize orthologs of these sequences. The whole Arabidopsis sequences can be obtained at another linked file which has the FASTA format of the selected rice P450 sequences.
E. Akano 75A11 77A9
C. Bricker 77B2 79A7
S. Chintalapudi 89B1 96B2
G. Kuntamallappanavar 89C1 96E1
B. Manda 97B4 701A6
A. Pandey 97C2 703A3
P. Sharma 706C1 707A5
Guest 1 710A5
Guest 2 728B1
Guest 3 735A3
If you finish and would like more to do, you can try the same sequences searched against the wheat (Triticum aestivum) ESTs. These should be available since there are over 1 million wheat ESTs.
Go to the FASTA file and search for your sequence name. The CYP prefix is part of the name for Cytochrome P450, but you don't need to search for the whole name. Copy the sequence and take it to the NCBI blast page. Do a tblastn search of the est others database division (not just EST, since that will include mouse and human) and set the species in the organism window to Zea mays. When you get your results, scroll down and look for those sequences with the best match. Shorter sequences will appear later even if they are better matches than longer sequences, so don't look just at the top. When you find your top two or three best matches, you can check if they are the orthologs of the rice sequence by blast searching against the rice P450 data set. I have a server just for this purpose at P450 Blast server. Paste your sequence in this window, select rice from the pull-down menu and run the blast. The default database in this blast server is all the human P450 sequences, so you will need to change from the default. If your best hit to rice P450s is the sequence you started with, then you probably have the ortholog. This is called the best reciprocal blast hit. Some sequences may not have ortholog hits in the rice ESTs. Do not despair. Just find the matches that are the best you can find to your starting sequence.
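The best reciprocal blast hit test is just a two-way lookup, sketched below. The hit lists are typed in by hand from the two searches. The EST names, P450 names and bit scores in the example are placeholders, not real results, and the function names are hypothetical.

```python
def best_hit(hits):
    """Return the subject name with the highest bit score, or None.

    hits is a list of (subject_name, bit_score) pairs from one search.
    """
    if not hits:
        return None
    return max(hits, key=lambda hit: hit[1])[0]


def is_reciprocal_best_hit(start_name, forward_hits, reverse_hits_by_est):
    """True if the best EST for start_name has start_name as its own best hit.

    forward_hits: hits of the starting P450 against the maize ESTs.
    reverse_hits_by_est: for each EST name, its hits against the P450 set.
    """
    est = best_hit(forward_hits)
    if est is None:
        return False
    return best_hit(reverse_hits_by_est.get(est, [])) == start_name


# Placeholder data for illustration only.
forward = [("EST_A", 210.0), ("EST_B", 180.5)]
reverse = {
    "EST_A": [("CYP_START", 205.0), ("CYP_OTHER", 150.0)],
    "EST_B": [("CYP_OTHER", 190.0), ("CYP_START", 120.0)],
}
print(is_reciprocal_best_hit("CYP_START", forward, reverse))
```

Ranking by bit score favors longer alignments, which is the reason the text suggests scrolling past the top of the list. A short EST with a higher percent identity may be the better candidate to test in the reverse search.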
Another search option in the P450 blast page is to use the program blastx with the full length EST nucleotide sequence. You can get the nucleotide sequence by clicking on the hyperlinked accession number in the blast output from your NCBI search. You can compare it to rice on the same blast server. Your Zea mays sequences should match best to the rice P450 sequences.
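Which BLAST program to pick depends only on whether the query and the database are protein or nucleotide. The lookup below is a small reminder of the four combinations used in these notes. The type labels and the function name are assumptions made for the example, and tblastx (translated query against translated database) is left out.

```python
# (query type, database type) -> program
PROGRAMS = {
    ("protein", "protein"): "blastp",        # protein query, protein database
    ("nucleotide", "nucleotide"): "blastn",  # DNA query, DNA database
    ("protein", "nucleotide"): "tblastn",    # protein query, translated DNA database
    ("nucleotide", "protein"): "blastx",     # translated DNA query, protein database
}


def choose_program(query_type, database_type):
    """Return the BLAST program name for a query/database combination."""
    try:
        return PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(f"unsupported combination: {query_type}, {database_type}")


# Protein query against the EST division at NCBI.
print(choose_program("protein", "nucleotide"))
# Full-length EST nucleotide sequence against the P450 protein set.
print(choose_program("nucleotide", "protein"))
```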
Important tip: When working with sequences in a word processor, set the font to Courier 9 point. This is a fixed space font that will keep your alignments properly aligned. The 9 point size prevents wrapping of blast results when you paste them into a document.
This is an example of the output from a human CYP20 search of Salmo salar ESTs.
The starting sequence was human CYP20. You can check that by comparing the top (query sequence) with the P450 blast server. Genes are genes, so for this example I used human and fish. The results will be very similar when you use rice and Zea mays.
>gi|117525684|gb|EG857411.1| EST_ssal_eve_50960 ssaleve thyroid Salmo salar cDNA Salmo salar
cDNA clone ssal_eve_569_158_fwd 5', mRNA sequence.
Length=735
Score = 266 bits (679), Expect = 1e-70, Method: Compositional matrix adjust.
Identities = 148/222 (66%), Positives = 180/222 (81%), Gaps = 1/222 (0%)
Frame = +2
this is the 5th hit on the left side of the graphic (N-terminal part of CYP20)
Query 1 MLDFAIFAVTFllalvgavlylyPASRQAAGIPGITPTEEKDGNLPDIVNSGSLHEFLVN 60
MLDFAIFAVTF++ LVGAVLYLYP+SR A+GIPG+ PTEEKDGNL DIVN GSLHEFL +
Sbjct 35 MLDFAIFAVTFVIFLVGAVLYLYPSSRSASGIPGLNPTEEKDGNLQDIVNRGSLHEFLAS 214
Query 61 LHERYGPVVSFWFGRRLVVSLGTVDVLKQHINPNKTSDPFETMLKSLLRYQSG-GGSVSE 119
LH ++GPV SFWFG R VVSLG+VD L+QHINPN+T+D FETMLKSLL YQSG GG +E
Sbjct 215 LHGQFGPVASFWFGGRPVVSLGSVDQLRQHINPNRTTDSFETMLKSLLGYQSGTGGGATE 394
Query 120 NHMRKKLYENGVTDSLKSNFAlllklseelldkwlsYPETQHVPLSQHMLGFAMKSVTQM 179
MRKKLYE+ V ++ + NF +LLKL EEL+ KW S+P+ QH PL H+ G AMK+VTQ+
Sbjct 395 AVMRKKLYESAVNNTPEKNFPMLLKLVEELVGKWQSFPKDQHTPLCAHLQGLAMKAVTQL 574
Query 180 VMGSTFEDDQEVIRFQKNHGTVWSEIGKGFLDGSLDKNMTRK 221
+G F +D EVI F+KNH +WSEIGKG+LDGS++K+ RK
Sbjct 575 ALGDRFRNDAEVIGFRKNHEAIWSEIGKGYLDGSMEKSSIRK 700
Use the bottom protein sequence (the sequence marked Sbjct) to search with the human P450 blast server. Or you can click on the accession number of your NCBI blast hit to get the nucleotide sequence. The whole nucleotide sequence can be pasted in the P450 Blast server window, and the program must then be set to blastx instead of blastp. Searches of human or another vertebrate like Fugu (in the pull-down menu) will align the translated sequence against the P450 database entries.
The output from this search will look something like this
>gi|89852388|gb|DY708511.1| EST_ssal_rgb2_64250 ssalrgb2 mixed_tissue Salmo salar cDNA Salmo salar
cDNA clone ssal_rgb2_603_269_rev 5', mRNA sequence.
Length=866
Score = 355 bits (911), Expect = 2e-97, Method: Compositional matrix adjust.
Identities = 175/287 (60%), Positives = 230/287 (80%), Gaps = 2/287 (0%)
Frame = +3
This is the top hit in the graphic (middle part of CYP20)
Query 31 GIPGITPTEEKDGNLPDIVNSGSLHEFLVNLHERYGPVVSFWFGRRLVVSLGTVDVLKQH 90
GIPG+ PTEEKDGNL DIVN GSLHEFL +LH ++GPV SFWFGR VVSLG+VD L+QH
Sbjct 3 GIPGLNPTEEKDGNLQDIVNRGSLHEFLASLHGQFGPVASFWFGGRPVVSLGSVDQLRQH 182
Query 91 INPNKTSDPFETMLKSLLRYQSG-GGSVSENHMRKKLYENGVTDSLKSNFAlllklseel 149
INPN+T+D FETMLKSLL YQSG-GG +E MRKKLYE+ V ++L+ NF +LLKL EEL
Sbjct 183 INPNRTTDSFETMLKSLLGYQSGTGGGATEAVMRKKLYESAVNNTLEKNFPMLLKLVEEL 362
Query 150 ldkwlsYPETQHVPLSQHMLGFAMKSVTQMVMGSTFEDDQEVIRFQKNHGTVWSEIGKGF 209
+ KW S+P+ QH PL H+LG AMK+VTQ+ +G F +D EVI F+KNH + WSEIGKG+
Sbjct 363 VGKWQSFPKDQHTPLCAHLLGLAMKAVTQLALGDRFRNDAEVIGFRKNHEAIWSEIGKGY 542
Query 210 LDGSLDKNMTRKKQYEDALMQLESVLRNIIKERKGRNFSQHIFIDSLVQGNLNDQQILED 269
LDGS++K+ RK+ YE AL ++E+VL ++K+RKG+ SQ F+D+L+Q NL ++Q++ED
Sbjct 543 LDGSMEKSSIRKEHYESALAEMETVLMSVAKDRKGQR-SQTAFVDTLLQSNLTERQVMED 719
Query 270 SMIFSLASCIITAKLCTWAICFLTTSEEVQKKLYEEINQVFGNGPVT 316
SM+F+LA C+ITA LC WA+ FL+TSE+VQ+KL++E+ V G+ PV+
Sbjct 720 SMVFTLAGCVITANLCIWAVHFLSTSEDVQEKLHQELEDVLGSEPVS 860
>gi|24393371|gb|CA063128.1| ssalrgb509318 mixed_tissue Salmo salar cDNA, mRNA sequence.
Length=695
Score = 201 bits (512), Expect = 3e-51, Method: Compositional matrix adjust.
Identities = 103/168 (61%), Positives = 136/168 (80%), Gaps = 0/168 (0%)
Frame = -1
This is the 5th hit on the right side of the graphic (C-terminal part of CYP20)
Query 294 TSEEVQKKLYEEINQVFGNGPVTPEKIEQLRYCQHVLCETVRTAKLTPVSAQLQDIEGKI 353
TSE+VQ+KL++E+ V G+ PV+ +KI QLRY Q VL ETVRTAKLTP++A+LQ EGK+
Sbjct 695 TSEDVQEKLHQELEDVLGSEPVSLDKIPQLRYFQQVLNETVRTAKLTPIAARLQXNEGKV 516
Query 354 DRFIIPRETLVLYALGVVLQDPNTWPSPHKFDPDRFDDELVMKTFSSLGFSGTQECPELR 413
D+ IIP+ETLV+YALGVVLQD +TW P+KFDPDRF ++ K+FS LGFSG Q CPELR
Sbjct 515 DQHIIPKETLVIYALGVVLQDADTWSCPYKFDPDRFTEDSARKSFSLLGFSGNQACPELR 336
Query 414 FAYMVTTvllsvlvkrlhllsvEGQVIETKYELVTSSREEAWITVSKR 461
FAY V TV+LS +V++L L V+GQV+E + ELV++ +++ WITVS+R
Sbjct 335 FAYTVATVVLSTVVRQLKLYQVKGQVVEARSELVSTPKDDTWITVSRR 192
Assembled salmon CYP20 sequence from the three ESTs (bottom lines)
Overlaps have been removed
MLDFAIFAVTFVIFLVGAVLYLYPSSRSASGIPGLNPTEEKDGNLQDIVNRGSLHEFLAS
LHGQFGPVASFWFGGRPVVSLGSVDQLRQHINPNRTTDSFETMLKSLLGYQSGTGGGATE
AVMRKKLYESAVNNTPEKNFPMLLKLVEELVGKWQSFPKDQHTPLCAHLQGLAMKAVTQL
ALGDRFRNDAEVIGFRKNHEAIWSEIGKGYLDGSMEKSSIRK
EHYESALAEMETVLMSVAKDRKGQRSQTAFVDTLLQSNLTERQVMED
SMVFTLAGCVITANLCIWAVHFLSTSEDVQEKLHQELEDVLGSEPVS
LDKIPQLRYFQQVLNETVRTAKLTPIAARLQXNEGKV
DQHIIPKETLVIYALGVVLQDADTWSCPYKFDPDRFTEDSARKSFSLLGFSGNQACPELR
FAYTVATVVLSTVVRQLKLYQVKGQVVEARSELVSTPKDDTWITVSRRS*
Note: I added the last amino acid and the stop codon from the translation of the CA063128 DNA sequence.
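Removing overlaps when joining translated ESTs can be sketched as a suffix-to-prefix merge. The function name and the min_overlap default are assumptions for this example, and the merge only works when the overlapping residues agree exactly. Where two ESTs disagree at a few positions, as they do further along in the EG857411 and DY708511 translations, the junction still has to be chosen by hand.

```python
def merge_by_overlap(left, right, min_overlap=10):
    """Join two protein fragments that share an exact overlap.

    The longest suffix of left that equals a prefix of right is kept once.
    Raises ValueError if no overlap of at least min_overlap residues exists.
    """
    longest = min(len(left), len(right))
    for k in range(longest, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]
    raise ValueError("no exact overlap of the required length")


# First Sbjct row of the EG857411 hit and first Sbjct row of the DY708511 hit.
frag1 = "MLDFAIFAVTFVIFLVGAVLYLYPSSRSASGIPGLNPTEEKDGNLQDIVNRGSLHEFLAS"
frag2 = "GIPGLNPTEEKDGNLQDIVNRGSLHEFLASLHGQFGPVASFWFGGRPVVSLGSVDQLRQH"

merged = merge_by_overlap(frag1, frag2)
print(merged)
print(len(merged))
```

Requiring a minimum overlap guards against joining fragments on a chance match of two or three residues, which is one way chimeric sequences get built.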
Whole salmon sequence compared to human in the P450 blast server
Note: the top seq is now the salmon sequence.
>CYP20 AC011737.8 chr 2 (missing exons 12,13) AC080075.2 (missing exons 1,7,8)
Length = 462
Score = 595 bits (1534), Expect = e-173
Identities = 288/462 (62%), Positives = 373/462 (80%), Gaps = 2/462 (0%)
Query: 1 MLDFAIFAVTFVIFLVGAVLYLYPSSRSASGIPGLNPTEEKDGNLQDIVNRGSLHEFLAS 60
MLDFAIFAVTF++ LVGAVLYLYP+SR A+GIPG+ PTEEKDGNL DIVN GSLHEFL +
Sbjct: 1 MLDFAIFAVTFLLALVGAVLYLYPASRQAAGIPGITPTEEKDGNLPDIVNSGSLHEFLVN 60
Query: 61 LHGQFGPVASFWFGGRPVVSLGSVDQLRQHINPNRTTDSFETMLKSLLGYQSGTGGGATE 120
LH ++GPV SFWFG R VVSLG+VD L+QHINPN+T+D FETMLKSLL YQSG GG +E
Sbjct: 61 LHERYGPVVSFWFGRRLVVSLGTVDVLKQHINPNKTSDPFETMLKSLLRYQSG-GGSVSE 119
Query: 121 AVMRKKLYESAVNNTPEKNFPMLLKLVEELVGKWQSFPKDQHTPLCAHLQGLAMKAVTQL 180
MRKKLYE+ V ++ + NF +LLKL EEL+ KW S+P+ QH PL H+ G AMK+VTQ+
Sbjct: 120 NHMRKKLYENGVTDSLKSNFALLLKLSEELLDKWLSYPETQHVPLSQHMLGFAMKSVTQM 179
Query: 181 ALGDRFRNDAEVIGFRKNHEAIWSEIGKGYLDGSMEKSSIRKEHYESALAEMETVLMSVA 240
+G F +D EVI F+KNH +WSEIGKG+LDGS++K+ RK+ YE AL ++E+VL ++
Sbjct: 180 VMGSTFEDDQEVIRFQKNHGTVWSEIGKGFLDGSLDKNMTRKKQYEDALMQLESVLRNII 239
Query: 241 KDRKGQR-SQTAFVDTLLQSNLTERQVMEDSMVFTLAGCVITANLCIWAVHFLSTSEDVQ 299
K+RKG+ SQ F+D+L+Q NL ++Q++EDSM+F+LA C+ITA LC WA+ FL+TSE+VQ
Sbjct: 240 KERKGRNFSQHIFIDSLVQGNLNDQQILEDSMIFSLASCIITAKLCTWAICFLTTSEEVQ 299
Query: 300 EKLHQELEDVLGSEPVSLDKIPQLRYFQQVLNETVRTAKLTPIAARLQXNEGKVDQHIIP 359
+KL++E+ V G+ PV+ +KI QLRY Q VL ETVRTAKLTP++A+LQ EGK+D+ IIP
Sbjct: 300 KKLYEEINQVFGNGPVTPEKIEQLRYCQHVLCETVRTAKLTPVSAQLQDIEGKIDRFIIP 359
Query: 360 KETLVIYALGVVLQDADTWSCPYKFDPDRFTEDSARKSFSLLGFSGNQACPELRFAYTVA 419
+ETLV+YALGVVLQD +TW P+KFDPDRF ++ K+FS LGFSG Q CPELRFAY V
Sbjct: 360 RETLVLYALGVVLQDPNTWPSPHKFDPDRFDDELVMKTFSSLGFSGTQECPELRFAYMVT 419
Query: 420 TVVLSTVVRQLKLYQVKGQVVEARSELVSTPKDDTWITVSRR 461
TV+LS +V++L L V+GQV+E + ELV++ +++ WITVS+R
Sbjct: 420 TVLLSVLVKRLHLLSVEGQVIETKYELVTSSREEAWITVSKR 461
For the assignment, please turn in the following for each of your two P450 sequences:
1) The accession numbers of your maize hits that you feel are your best hits. The accession number in the example above is EG857411. I cannot use your data if you do not supply the accession number.
2) The sequence name you were using for the search (CYP51G1 or CYP701A3 etc.)
3) The output from the human P450 blast server that shows the best match to your Zea mays EST.
4) Your name
This will look like this:
salmon EST sequence EG857411
human starting sequence CYP20
>CYP20 AC011737.8 chr 2 (missing exons 12,13) AC080075.2 (missing exons 1,7,8)
Length = 462
Score = 595 bits (1534), Expect = e-173
Identities = 288/462 (62%), Positives = 373/462 (80%), Gaps = 2/462 (0%)
Query: 1 MLDFAIFAVTFVIFLVGAVLYLYPSSRSASGIPGLNPTEEKDGNLQDIVNRGSLHEFLAS 60
MLDFAIFAVTF++ LVGAVLYLYP+SR A+GIPG+ PTEEKDGNL DIVN GSLHEFL +
Sbjct: 1 MLDFAIFAVTFLLALVGAVLYLYPASRQAAGIPGITPTEEKDGNLPDIVNSGSLHEFLVN 60
Query: 61 LHGQFGPVASFWFGGRPVVSLGSVDQLRQHINPNRTTDSFETMLKSLLGYQSGTGGGATE 120
LH ++GPV SFWFG R VVSLG+VD L+QHINPN+T+D FETMLKSLL YQSG GG +E
Sbjct: 61 LHERYGPVVSFWFGRRLVVSLGTVDVLKQHINPNKTSDPFETMLKSLLRYQSG-GGSVSE 119
Your Name
Do this for each of your best hits (up to three).
Optional: walk along the Zea mays EST by using the translated protein sequence you found as the starting point for another search of the Zea mays EST database. This new search will find overlapping ESTs that may let you walk to the beginning or end of the Zea P450. Try to assemble a whole P450 from overlapping ESTs. Please keep track of the accession numbers and turn them in with your sequences.
We did this in the example above by using the graphic to pick out overlapping ESTs that would continue the sequence.
Links for this class
NCBI main page
http://www.ncbi.nlm.nih.gov/
TBLASTN page at NCBI
http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Web&LAYOUT=TwoWindows&AUTO_FORMAT=Semiauto&ALIGNMENTS=50&ALIGNMENT_VIEW=Pairwise&CLIENT=web&DATABASE=nr&DESCRIPTIONS=100&ENTREZ_QUERY=(none)&EXPECT=10&FILTER=L&FORMAT_OBJECT=Alignment&FORMAT_TYPE=HTML&GENETIC_CODE=0&HITLIST_SIZE=100&NCBI_GI=on&PAGE=Translations&PROGRAM=tblastn&SERVICE=plain&SET_DEFAULTS.x=23&SET_DEFAULTS.y=10&SHOW_OVERVIEW=on&UNGAPPED_ALIGNMENT=no&END_OF_HTTPGET=Yes
NCBI BLAST page
http://www.ncbi.nlm.nih.gov/BLAST/
P450 Blast server
http://blast.uthsc.edu/
EXPASY DNA translator
http://ca.expasy.org/tools/dna.html
Mito carrier sequence alignment
http://drnelson.uthsc.edu/car202.class.html
Rice cytochrome P450 sequences (FASTA format of the selected rice P450 sequences)
http://drnelson.uthsc.edu/rice.search.seqs.htm
my email address
drnelson1@gmail.com