Plants and bacteria are the most difficult kingdoms in P450 nomenclature. This is because plants have so many sequences (more than 230 so far) that there are no clear boundaries between what should be distinct family groups. Bacteria do not have quite as many sequences yet, but they use a good sequence over and over until there are very crowded families. Then it becomes a nightmare to decide on subfamily divisions. If you have ever made a phylogenetic tree by more than one method using the same data, you know that the trees are often different. Grouping sequences by their affinity then becomes a matter not only of making the proper alignment, but also of choosing the most appropriate scoring matrix and tree building method.
Two common choices for scoring matrices are PAM 250 and BLOSUM 62. A scoring matrix is just an array with a score for each amino acid pair. Since tryptophan (trp) is the least common of the 20 amino acids, a trp-trp score is usually the highest. These matrices are derived from statistics of sequence alignments. Based on the chosen matrix, two aligned sequences are given a score, the average of the amino acid pair scores, and this is the similarity between the two sequences. If a UNIT matrix is used, identities are scored as 1 and non-identities are scored as 0. The score is then the fraction of identical positions, which expressed as a percentage is the percent identity between the two sequences.
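The scoring described above can be sketched in a few lines of Python. This is only an illustration, not the program I actually use: the function names are made up for this example, the two sequences are assumed to be already aligned to the same length, and skipping positions where either sequence has a gap is an assumption, since gap handling is not specified above.

```python
def unit_score(a, b):
    """UNIT matrix: 1 for an identical pair, 0 for anything else."""
    return 1 if a == b else 0


def table_score(table):
    """Wrap a dict of pair scores, e.g. {("A", "W"): -6}, as a scoring function.

    The table is treated as symmetric, so ("W", "A") falls back to ("A", "W").
    """
    def score(a, b):
        return table[(a, b)] if (a, b) in table else table[(b, a)]
    return score


def similarity(seq1, seq2, score=unit_score, gap="-"):
    """Average pair score over the aligned positions of two sequences.

    Assumption: positions where either sequence has a gap are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != gap and b != gap]
    if not pairs:
        return 0.0
    return sum(score(a, b) for a, b in pairs) / len(pairs)


def percent_identity(seq1, seq2):
    """Similarity under the UNIT matrix, expressed as a percentage."""
    return 100.0 * similarity(seq1, seq2, unit_score)
```

For example, percent_identity("ACDW", "ACEW") gives 75.0, since three of the four aligned positions are identical. A PAM or BLOSUM table could be passed in through table_score in place of the UNIT matrix.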
In the idealized world of P450 nomenclature, some rules were established early on to sort sequences into families and subfamilies. These were very simple rules. If two sequences were 40% identical or more, then they belonged in the same family. If they were 55% identical or more, then they belonged in the same subfamily. PLEASE NOTE: THIS MAKES NO ASSUMPTIONS ABOUT FUNCTION. IN THIS SYSTEM OF NOMENCLATURE FUNCTION IS NOT A CONCERN. It has been shown that even a one amino acid difference in a P450 can change its function. Therefore, this system cannot sort its nomenclature based on function. It is very possible, and I believe it has happened, that sequences in the same subfamily do not have the same function. As Mayor Marion Barry said of his cocaine conviction: GET OVER IT!
Sequences will be named by their sequence similarity and not by their function.
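Written as code, the two cutoff rules look like the sketch below. The function and constant names are invented for this illustration; only the 40% and 55% values come from the rules above.

```python
FAMILY_CUTOFF = 40.0     # percent identity needed to share a family
SUBFAMILY_CUTOFF = 55.0  # percent identity needed to share a subfamily


def relationship(percent_identity):
    """Apply the simple cutoff rules to one pair of sequences."""
    if percent_identity >= SUBFAMILY_CUTOFF:
        return "same subfamily"
    if percent_identity >= FAMILY_CUTOFF:
        return "same family"
    return "different family"
```

Note that nothing about function appears anywhere in this rule; the only input is the percent identity of the pair.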
As I said above, plants have many P450s. There are estimates that plants make over 12,000 alkaloids, many requiring P450 steps in their synthesis, so we can only anticipate more and more sequences. Francis Durst comments that "Alkaloids are just a sub-class of plant 'secondary metabolites' (which is the term the plant people use to designate the whole set). Alkaloids contain nitrogen, but terpenes are even more numerous and then there are phenylpropanoids, etc... The last figure I have is from Pr. Hartmann (Braunschweig) who told me 100,000 [plant 'secondary metabolites'] are identified. The estimated total number is 200-400,000". The 200-400,000 figure for all plant secondary metabolites is calculated from a paper by T. Swain (1977), Ann. Rev. Plant Physiol. 28, 479-501.
With the simple definitions of 40% and 55% cutoffs for family and subfamily designations, and the fact of hundreds and perhaps thousands of plant P450 sequences, it becomes obvious that percent identities between sequences will begin to fall in the gray regions at these set boundaries. What should be done when a sequence is less than 40% identical with one family member, but more than 40% identical with another family member? Is the sequence in the family or in a new, closely related family? This is the dilemma of nomenclature. How does one decide where to make the divisions between families and subfamilies when the choice is not clear cut?
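The dilemma can be made concrete with a small sketch that compares one new sequence against every member of a family and reports when the 40% rule gives conflicting answers. The function name, the verdict labels, and the member names and numbers in the usage note are hypothetical, chosen only for illustration.

```python
def classify_against_family(identities, cutoff=40.0):
    """Compare a new sequence with every member of one family.

    identities maps a family member's name to its percent identity
    with the new sequence. Returns a verdict plus the members at or
    above the cutoff and the members below it.
    """
    above = sorted(name for name, pid in identities.items() if pid >= cutoff)
    below = sorted(name for name, pid in identities.items() if pid < cutoff)
    if above and below:
        verdict = "gray region"      # the rule says both yes and no
    elif above:
        verdict = "in family"
    else:
        verdict = "outside family"
    return verdict, above, below
```

With made-up values such as {"member_1": 42.0, "member_2": 38.0}, the verdict is "gray region": the simple rule alone cannot decide, and that is where the judgment call comes in.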
I have done this by making phylogenetic trees and looking for the best place to make divisions. When trees are not so full, this is easy to do. It is like grading exams and looking for the break between the As and the Bs. This type of decision making leads to historical choices that might have been different if the early collection of sequences had been different. For example: When the first 2D sequence was reported, there was a wide gap between the 2 family and the 1 family. The 2D sequence was in the gray region. I chose to include it in the 2 family, rather than make it into a new family of its own. I was a lumper rather than a splitter. This is my inclination, and I have been criticised for it. Others tend to be splitters. IT IS THE NATURE OF ANY RESEARCHER TO WANT HIS OR HER SEQUENCE TO BE IN A NEW FAMILY. THAT TENDS TO MAKE RESEARCHERS FALL IN THE SPLITTER CAMP. Many researchers who send me new sequences are unhappy when they do not get a new family name, but only a new subfamily name. Many argue that the function is different and that, on that basis, the sequence should not be kept in the given family. As I said above, function does not enter into the decision. The appearance of the clusters on a tree does enter into the discussion.
AN EXAMPLE: The CYP81 family is very close to the CYP91 family. When the first 91 member was sent in, the two were in the gray region at 39% identity. Looking at the tree, I decided this time to give them separate family status. I split them. I was probably reacting to many criticisms of my lumping tendencies. Now there are 5 members in CYP91A, and the last one was 53% identical to 91A4, but it also turned out to be 48% identical to 81B1. In this case, the two families seem to belong to one family and CYP91 should be CYP81D. However, it has been a couple of years since I named CYP91A1, this name has been used, and it is difficult to change a name once it has been in circulation for a while.
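The CYP81/CYP91 case can be checked mechanically with a sketch like the one below, which lists every family a new sequence matches at the 40% cutoff. The function name and the input layout are assumptions made for this illustration; the two percent identities are the ones quoted above.

```python
def families_matched(hits, cutoff=40.0):
    """Return the families a new sequence matches at or above the cutoff.

    hits is a list of (member_name, family, percent_identity) tuples.
    More than one family in the result means the sequence bridges
    families that the cutoff rule would otherwise keep apart.
    """
    return sorted({family for _, family, pid in hits if pid >= cutoff})


# The fifth CYP91A sequence, using the identities quoted in the text.
hits = [
    ("CYP91A4", "CYP91", 53.0),
    ("CYP81B1", "CYP81", 48.0),
]
bridged = families_matched(hits)
```

Here bridged contains both CYP81 and CYP91, which is the signal that the two families behave as one under the 40% rule.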
This example shows the opposite problem from the 2D subfamily, which should really be its own family. Here, two families really should be a single family. These problems will arise, and there will be no way to avoid them, since this naming process is historical. The question then becomes what to do about it. Do you rename sequences based on revised trees made with more sequences? I am worried about this because it damages the continuity that was the original goal of setting up this nomenclature: a unified name that sticks with a sequence from lab to lab and year to year and that indicates its relationships to other sequences. To achieve the last goal, it would be better to lump, because a new family name does not give a relationship to other families. So, if I make errors, it seems that it would be better to make them as a lumper rather than a splitter.
Legitimate differences can be caused by tree building methods and scoring matrices. There are now more than 750 P450 sequences in my collection. No tree builder in their right mind would want to make a tree with that many sequences. Therefore, I have split these sequences into groups that are easier to work with, and I try to do trees with fewer than 100-130 members in them. Plants almost always cluster with other plants, though the CYP51 family is an exception, and CYP74, whose I helix is not conserved, is so different from other P450s that it clusters with nothing. Insects cluster with insects, and bacteria usually cluster together, though there are E-like bacterial sequences that are more eukaryote-like than bacteria-like (CYP102, CYP110, CYP118). The sequences are compared to a set of other sequences in various folders by MacVector software to find the most similar sequence group, then they are aligned and put in a tree with that group.
You may find fault with my choices of tree building algorithms and scoring matrices. I have used the same programs for 10 years so there is consistency in the process, but I started in this knowing very little about tree building. I sought help at Masatoshi Nei's Center for Demographic and Population Genetics at the University of Texas Health Science Center in Houston. Clay Stephens gave me a program he had in Fortran for calculating UPGMA trees, and I converted it to Basic to run on an IBM. I also wrote programs to compute percent identities from sequence alignments and store them in distance matrices for the UPGMA program. I modified this program to calculate similarity based on PAM120 and PAM250 as well as the unit matrix, so I could do trees by each scoring matrix. This is what I have been using all this time. It is simple, it works, but it is not state of the art. I feel the need more and more to switch to MEGA or PHYLIP to do these trees, and I plan to do this this winter, but I do not know what will happen if significant changes are found in the existing trees and nomenclature. I suspect there will be some problems that may indicate revision of the nomenclature for some sequences. I will have to consult other members of the nomenclature committee before we do anything too drastic.
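For readers who have not met UPGMA, here is a minimal Python sketch of the standard algorithm. It is not the Fortran or Basic program mentioned above, and all names in it are invented for this example. It assumes a full symmetric distance matrix as input; taking the distance as 100 minus the percent identity is also an assumption made only for the illustration.

```python
from itertools import combinations


def upgma(names, matrix):
    """Build a UPGMA tree from a symmetric distance matrix.

    names  : list of sequence labels
    matrix : matrix[i][j] is the distance between names[i] and names[j]
    Returns a nested tuple (left, right, height); a leaf is just its name.
    """
    # Each cluster id maps to (subtree, number of sequences in it).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Pair distances, keyed as (smaller id, larger id).
    dist = {(i, j): matrix[i][j] for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)

    while len(clusters) > 1:
        # Join the closest pair of clusters.
        (i, j), d_ij = min(dist.items(), key=lambda item: item[1])
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)

        # Distance from the merged cluster to every other cluster is the
        # size-weighted average of the two old distances.
        merged = {}
        for k in clusters:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            merged[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)

        # Drop every distance that involved the two joined clusters.
        dist = {pair: d for pair, d in dist.items()
                if i not in pair and j not in pair}
        dist.update(merged)

        # The new node sits at half the distance between its two children.
        clusters[next_id] = ((tree_i, tree_j, d_ij / 2), n_i + n_j)
        next_id += 1

    (tree, _size), = clusters.values()
    return tree


# Hypothetical example: distances taken as 100 minus percent identity.
names = ["A", "B", "C"]
matrix = [
    [0, 20, 60],
    [20, 0, 56],
    [60, 56, 0],
]
tree = upgma(names, matrix)
```

In this made-up example A and B join first at a height of 10, and C joins that pair at a height of 29, the average of 60 and 56 divided by two. Ties are broken by whichever pair happens to come first, which is one of the ways two programs can legitimately produce different trees from the same data.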
If you care to comment on this posting, feel free to send me email at dnelson@utmem1.utmem.edu