Annotating P450s in Cottonwood Populus trichocarpa
David R. Nelson rev. Dec. 30, 2004
The Populus genome is being annotated with a gene jamboree meeting Dec. 6-10, 2004. This page is to advise those annotating cytochrome P450 genes in Populus using the JGI browser and other web-based tools. Some of the advice given here may be used for annotating any gene, but some is specific to P450 genes.
As of Dec. 30, 2004, 566 P450 sequence fragments have been identified by chromosome location, and all of them have now been named. Many are short but may belong to complete genes. Some will be joined together and some are duplicates, so the gene count may drop. See the gene list link below for statistics on family members, etc. For comparison, Arabidopsis has about 272 P450 sequences (248 genes + 24 pseudogenes) and rice has 455 (356 genes + 99 pseudogenes). A BLAST server has been set up that contains all the rice P450s and all the Arabidopsis P450s.
Populus P450 gene list in scaffold order in progress
Populus P450 Sequence Collection
Populus P450 Sequence Collection A more structured file with fields for data parsing
Populus P450s compared side by side with Arabidopsis and rice P450s
All P450 families in Arabidopsis are present in Populus except two (CYP702 and CYP708). For a distribution of P450 families in plants see
Populus P450 families compared to other plant P450s (see Malpighiales)
Another very useful web-based tool for annotation is a general purpose DNA translator. I set the options to 1 letter code and three forward frame translations only.
The JGI browser has many levels. I am linking a most useful page in the browser, since it can be used to retrieve DNA sequence for a specific chromosome or scaffold and a given nucleotide range.
To obtain DNA from this page, enter the scaffold name (e.g. LG_IX or scaffold_18069; the underscore is required) and the nucleotide range to retrieve. The two numbers do not have to be in any order: 100 in the top window and 1000 in the bottom window will retrieve the same sequence as 1000 in the top and 100 in the bottom. If you want the reverse complement sequence, copy and paste the sequence into the DNA translator and hit the reverse complement button.
Most P450s in JGI have a gene model associated with them, but many of these models are wrong. However, it is very useful to view the model. The collection of P450 sequences has the model name included with each sequence if there is a model. These names look like this:
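For anyone who prefers to script this step, here is a minimal Python sketch of the two operations just described: putting a range in order no matter which window each number was typed into, and reverse complementing a retrieved sequence. The function names are hypothetical and are not part of the JGI browser or the DNA translator.

```python
# Hypothetical helpers for illustration; not part of any JGI tool.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def normalize_range(a, b):
    """Return (start, end) with start <= end, whichever order was entered."""
    return (a, b) if a <= b else (b, a)


def reverse_complement(dna):
    """Reverse complement a DNA string of A, C, G, T, N (case is preserved)."""
    return dna.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(normalize_range(1000, 100))
    print(reverse_complement("ATGC"))
```

Characters other than A, C, G, T and N pass through unchanged, so ambiguity codes would need to be added to the table if the sequence contains them.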
You can search for a model from the search page (upper left button on the DNA retrieval page).
Paste the gene model name into the box that says Gene Models, Name equals. Click on Search Models. The output appears at the bottom of the page.
Click on (GB) to go to the genome browser and see the gene there, or click on the number in front of (GB) to go to the gene model. The top part of the model page looks like this:
This page is very useful because it includes a link to the DNA and three frame translation for the gene model. This is very valuable in correcting a bad model or verifying a good model.
To see this page click on “view nucleotide and 3-frame translation” (in green about half way down on the left.)
I had some problems with this sequence viewer on older versions of Mac Netscape and IE. The sequences were not aligned properly. However, the viewer worked in Mac Netscape 7.1 and IE 5.2 for Mac.
The viewer shows the translated regions in red nucleotide sequence, making it easy to spot the GT and AG boundaries. If the model is short on either end, the buffer length at the top can be increased to look upstream or downstream for more coding regions.
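Candidate boundaries can also be listed mechanically. The sketch below simply reports every GT and AG dinucleotide in a stretch of DNA. It is an illustration only, the function name is made up, and most hits will not be real splice sites, so each one still has to be checked against the reading frame.

```python
def splice_site_candidates(dna):
    """Return 0-based start positions of every GT and every AG in dna.

    GT marks a possible intron start (donor) and AG a possible intron
    end (acceptor). No scoring is done; this is only a list of candidates.
    """
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return donors, acceptors


if __name__ == "__main__":
    donors, acceptors = splice_site_candidates("AAGGTAAGTTTCAGGC")
    print("GT at", donors)
    print("AG at", acceptors)
```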
When assembling a gene it is good to start by pasting the gene model sequence from the Populus P450 Sequence Collection file into the P450 blast server and blast it against the Arabidopsis set of P450s. Rice combined is also available there, but the cottonwood sequences are more like Arabidopsis. There are a few exceptions, where Arabidopsis is missing some families, like CYP92, CYP727, CYP728. These are in rice. CYP736 is not in Arabidopsis or rice, but it is in soybean and lotus. If the match in Arabidopsis or rice is not very good, it may be necessary to BLAST the nr section of GenBank or the EST section limited to Viridiplantae.
If the model is quite close to Arabidopsis, it may be an easy gene to assemble, because it has been done right already. Do a search for the gene model as described above. View the 3-frame translation and determine the intron boundaries. I code these as shown below, with the intron phase in parentheses. Phase 1 boundaries show the amino acid of the broken codon as the first amino acid of the next exon. Phase 2 boundaries keep the amino acid from the broken codon as the last amino acid of the exon. Finished models are green. Others are yellow.
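The phase rule can be worked out from the coding lengths of the exons. The Python sketch below is one way to do it, assuming the exon lengths count coding nucleotides only and are given in 5' to 3' order. The function names and the "(phase)" marker format are illustrative assumptions, not a fixed standard.

```python
def intron_phases(exon_lengths):
    """Phase of each intron: coding bases before it, modulo 3."""
    phases = []
    total = 0
    for length in exon_lengths[:-1]:
        total += length
        phases.append(total % 3)
    return phases


def mark_introns(protein, exon_lengths):
    """Insert "(phase)" markers into a protein at the exon boundaries.

    A codon split by a phase 1 intron is given to the next exon, and a
    codon split by a phase 2 intron stays with the current exon, as
    described in the text. (total + 1) // 3 gives that boundary in
    amino acids for all three phases.
    """
    pieces = []
    start_aa = 0
    total = 0
    for length, phase in zip(exon_lengths, intron_phases(exon_lengths)):
        total += length
        end_aa = (total + 1) // 3
        pieces.append(protein[start_aa:end_aa] + "(%d)" % phase)
        start_aa = end_aa
    pieces.append(protein[start_aa:])
    return "".join(pieces)


if __name__ == "__main__":
    # Toy example: 15 coding bases in three exons, 5 amino acids.
    print(mark_introns("MKLVT", [4, 5, 6]))
    print(mark_introns("MKLVT", [5, 4, 6]))
```

In the first toy call the second codon has one base in exon 1 and two in exon 2 (phase 1), so K opens the second exon. In the second call it has two bases in exon 1 (phase 2), so K closes the first exon.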
If your gene model is not complete, you may compare it to the closest Arabidopsis sequence while in the 3-frame translation viewer to look for missing regions. The Arabidopsis P450 set can be seen at
The rice P450 set can be seen at
If the sequence is more difficult and it is not obvious what is missing, I recommend retrieving the DNA from the DNA retrieval page. The cottonwood P450 sequence list has the scaffold name in contig order and a DNA location.
LG_I (-) 173344 716A LIKE
This may be one number or a range. Paste the scaffold name in the retrieval window and paste the nucleotide number in both windows.
Once you have that done decrease the Start number by 2000 bp and increase the End number by 2000. Then get the sequence. You should have 4001 bp around the region you wanted. Paste the sequence in the P450 blast server and select DNA sequence for a blastX search. Blast Arabidopsis with the DNA and you will get back matches to any exons in the sequence. This should help in finding missing pieces, though N-terminals can be hard because they are not very well conserved.
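The window arithmetic is simple but easy to get wrong by one. Here is a small Python sketch of it. The function name is hypothetical, and clamping the start at 1 near the beginning of a scaffold is my assumption about what you would want there, not something the retrieval page is known to do.

```python
def padded_window(start, end=None, flank=2000):
    """Widen a location by `flank` bp on each side.

    `start` and `end` are 1-based scaffold coordinates. If only one
    number is given, it is used for both. Returns (new_start, new_end,
    length_in_bp).
    """
    if end is None:
        end = start
    if start > end:
        start, end = end, start
    new_start = max(1, start - flank)  # assumption: never go below base 1
    new_end = end + flank
    return new_start, new_end, new_end - new_start + 1


if __name__ == "__main__":
    # The single location from the example list entry above.
    print(padded_window(173344))
```

A single position padded by 2000 on each side gives 4001 bp, which matches the count in the text. A range gives 4000 bp plus the length of the range.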
Once you have the information on which exons are present, you can paste the sequence in the DNA translation tool. If it is on the minus strand, press the reverse complement button. Select 1 letter code and three forward frames and translate. I especially like this translator, because it gives the open reading frames at the bottom for each frame. This can help you zoom in on an exon. Then you can go back up to the 3-frame translation and look for the GT and AG boundaries.
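If the web translator is not available, the same kind of output can be approximated in a few lines of Python. This is a sketch, not a reimplementation of that tool: it uses the standard genetic code, writes X for any codon containing a base other than A, C, G or T, and treats an "open reading frame" simply as a stretch between stop codons of at least a chosen length. The function names and the min_len default are assumptions for illustration.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(_BASES, repeat=3), _AMINO)}


def translate_frame(dna, frame):
    """Translate one forward frame (0, 1 or 2) into 1 letter code."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_forward_frames(dna):
    """Return the three forward frame translations as a list of strings."""
    return [translate_frame(dna, frame) for frame in range(3)]


def open_stretches(protein, min_len=30):
    """Stretches between stop codons that are at least min_len residues."""
    return [seg for seg in protein.split("*") if len(seg) >= min_len]


if __name__ == "__main__":
    for frame, protein in enumerate(three_forward_frames("ATGGCCTAA")):
        print(frame + 1, protein)
```

For a minus strand gene, reverse complement the DNA first and then translate, exactly as with the web tool.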
To get the exact nucleotide numbering as I have done for the finished annotated genes in green, you will need to do a blast search at JGI with your assembled sequence as query. The Blast search page is linked at the top menu bar as the second link on the bar. Select tblastn and uncheck the filter button. Paste your sequence and blast. The output from a search with Arabidopsis CYP82C4 looks like this:
Clicking on a colored bar takes you to a graphical view of the hits. This view shows parts of two P450 genes. The missing parts (N-term and middle) are below the threshold for your blast search, so they do not show up.
A text display of each exon region is shown below the graphical display.
Clicking on the green Seq at the bottom of this section brings up the DNA sequence. Often you would like to see more sequence around this region. To do that, click on the green scaffold number at the top left of the DNA sequence. It will take you to the DNA retrieval page we talked about earlier. You will need to copy the DNA numbering before going to the DNA retrieval page, because it shows you all the info for the whole scaffold and not just the part you are interested in. Paste in the number range and widen the range as needed to get what you want. You have to work backwards on minus strands. Minus strands are displayed backwards, so it is hard to read the codons. It may be easier to paste into the DNA translator and reverse complement, or look at the gene model that is always shown in the forward orientation.
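Working backwards on a minus strand is mostly a matter of converting positions. The Python sketch below assumes you retrieved a window from scaffold coordinate start to end (1-based, inclusive) and, for a minus strand gene, reverse complemented it before reading. The function name is hypothetical.

```python
def scaffold_coord(index, start, end, strand):
    """Convert a 1-based position in a retrieved window to a scaffold coordinate.

    For strand "+", the window is read as retrieved. For strand "-", the
    window is assumed to have been reverse complemented, so position 1
    is the highest scaffold coordinate.
    """
    length = end - start + 1
    if not 1 <= index <= length:
        raise ValueError("index is outside the retrieved window")
    if strand == "+":
        return start + index - 1
    if strand == "-":
        return end - index + 1
    raise ValueError("strand must be '+' or '-'")


if __name__ == "__main__":
    # First and last base of a reverse complemented 4001 bp window.
    print(scaffold_coord(1, 171344, 175344, "-"))
    print(scaffold_coord(4001, 171344, 175344, "-"))
```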
Note that at the top of the page there is a “view on browser” link. Clicking on that will take you to the genome browser window showing the region with your genes in it. You can zoom in or out and move left or right. This is quite helpful when looking at a gene cluster. It is also helpful when two or more gene models are really part of the same gene and they should be joined together.