Problems with naming new P450s and how I deal with them.
Plants and bacteria are the most difficult kingdoms in P450 nomenclature. This is because
plants have so many sequences (more than 230 so far) that there are no clear boundaries
between what should be distinct family groups. Bacteria do not have quite as many
sequences yet, but they use a good sequence over and over until there are very crowded
families. Then it becomes a nightmare to decide on subfamily divisions. If you have ever
made a phylogenetic tree by more than one method using the same data, you know that the
trees are often different. Then, grouping sequences by their affinity becomes not only a
matter of making the proper alignment, but of choosing the most appropriate scoring matrix
and tree building method.
Two common choices for scoring matrices are PAM 250 and BLOSUM 62. A scoring
matrix is just an array with a score for each amino acid pair. Since trp is the least
common of the 20 amino acids, a trp-trp score is usually the highest. These matrices are
derived from statistics of sequence alignments. Based on the chosen matrix, two
sequences that are aligned are given a score, the average of the amino acid pair scores, and
this is the similarity between the two sequences. If a UNIT matrix is used, identities are
scored as 1 and non-identities are scored as 0. The score is then the percent identity
between the two sequences.
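As a rough illustration of the scoring just described, here is a minimal Python sketch. It assumes the two sequences are already aligned to the same length, and it skips any column with a gap in either sequence (that gap rule is an assumption, not something stated above). The function names and the toy matrix values are made up for the example. A real PAM or BLOSUM table would be loaded from a published source.

```python
def pair_score(a, b, matrix=None):
    """Score one aligned residue pair.

    With matrix=None the unit matrix is used: 1 for an identity, 0 otherwise.
    Otherwise `matrix` is a dict keyed by residue pairs, e.g. ("W", "W").
    """
    if matrix is None:
        return 1.0 if a == b else 0.0
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]  # scoring matrices are symmetric


def similarity(seq1, seq2, matrix=None, gap="-"):
    """Average pair score over the aligned columns of two sequences.

    Assumption: columns with a gap in either sequence are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    scores = [pair_score(a, b, matrix)
              for a, b in zip(seq1, seq2)
              if a != gap and b != gap]
    if not scores:
        return 0.0
    return sum(scores) / len(scores)


def percent_identity(seq1, seq2, gap="-"):
    """Unit-matrix similarity expressed as a percentage."""
    return 100.0 * similarity(seq1, seq2, matrix=None, gap=gap)


# Toy fragment of a substitution matrix (illustrative values only).
toy_matrix = {("W", "W"): 17, ("A", "A"): 2, ("A", "W"): -6}

identity = percent_identity("MKTAW", "MKSAW")   # 4 of 5 columns match
toy_score = similarity("AW", "AW", toy_matrix)  # mean of the A-A and W-W scores
```

Passing a different matrix to the same `similarity` function is all it takes to compare scoring schemes on one alignment.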
In the idealized world of P450 nomenclature, some rules were established early on
to sort sequences into families and subfamilies. These were very simple rules. If two
sequences were 40% identical or more then they belonged in the same family. If they were
55% identical or more then they belonged in the same subfamily. PLEASE NOTE: THIS
MAKES NO ASSUMPTIONS ABOUT FUNCTION. IN THIS SYSTEM OF
NOMENCLATURE FUNCTION IS NOT A CONCERN. It has been shown that even a
one amino acid difference in a P450 can change its function. Therefore, this system cannot
sort its nomenclature based on function. It is very possible, and I believe it has happened,
that sequences in the same subfamily do not have the same function. As Mayor Marion
Barry said of his cocaine conviction: GET OVER IT!
Sequences will be named by their sequence similarity and not by their function.
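The two cutoffs can be written down in a few lines. This is only a sketch of the idealized rule for a single pair of sequences. The function name and the returned labels are invented for the example.

```python
FAMILY_CUTOFF = 40.0      # percent identity needed for the same family
SUBFAMILY_CUTOFF = 55.0   # percent identity needed for the same subfamily


def relationship(percent_identity):
    """Apply the idealized 40% / 55% rule to one pairwise percent identity."""
    if percent_identity >= SUBFAMILY_CUTOFF:
        return "same subfamily"
    if percent_identity >= FAMILY_CUTOFF:
        return "same family, different subfamily"
    return "different family"


labels = [relationship(p) for p in (62.0, 48.0, 39.0)]
```

Note that function appears nowhere in the rule. The only input is percent identity.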
Problems with plants
As I said above, plants have many P450s. There are estimates that plants make
over 12,000 alkaloids, many requiring P450 steps in their synthesis, so we can only
anticipate more and more sequences. Francis Durst comments that “Alkaloids are just a
sub-class of plant “secondary metabolites” (which is the term the plant people use to
designate the whole set). Alkaloids contain nitrogen, but terpenes are even more numerous
and then there are phenylpropanoids, etc… The last figure I have is from Pr. Hartmann
(Braunschweig) who told me 100,000 [plant “secondary metabolites”] are identified. The
estimated total number is 200-400,000”. The 200-400,000 figure for all plant secondary
metabolites is calculated from a paper by T. Swain (1977), Ann. Rev. Plant Physiol. 28,
479-501.
With the simple definitions of 40% and 55% cutoffs for family and subfamily
designations, and the fact of hundreds and perhaps thousands of plant P450 sequences, it
becomes obvious that percent identities between sequences will begin to fall in the gray
regions at these set boundaries. What should be done when a sequence is less than 40%
identical with one family member, but more than 40% identical with another family
member? Is the sequence in the family or in a new, closely related family? This is the
dilemma of nomenclature. How does one decide to make the divisions between families
and subfamilies when the choice is not clear cut?
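The gray-region situation can be made concrete with a small sketch. It takes the percent identities of one new sequence against each existing member of a family and reports whether the 40% rule gives a clean answer. The member names and numbers below are hypothetical, and the function does not decide the gray cases, it only flags them.

```python
def family_call(identities, cutoff=40.0):
    """Compare a new sequence with every member of one family.

    identities: dict mapping member name -> percent identity with the new
    sequence. Returns (verdict, members_at_or_above, members_below).
    """
    if not identities:
        raise ValueError("need at least one family member to compare against")
    above = sorted(name for name, pid in identities.items() if pid >= cutoff)
    below = sorted(name for name, pid in identities.items() if pid < cutoff)
    if not below:
        verdict = "in family"
    elif not above:
        verdict = "outside family"
    else:
        verdict = "gray zone"   # the rule alone cannot settle this case
    return verdict, above, below


# Hypothetical comparison of one new sequence with two family members.
result = family_call({"member_1": 42.0, "member_2": 38.5})
```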
I have done this by making phylogenetic trees and looking for the best place to make
divisions. When trees are not so full, this is easy to do. It is like grading exams and
looking for the break between the As and the Bs. This type of decision making leads to
historical choices that might have been different if the early collection of sequences was
different. For example: When the first 2D sequence was reported, there was a wide gap
between the 2 family and the 1 family. The 2D sequence was in the gray region. I chose to
include it in the 2 family, rather than make it into a new family of its own. I was a
lumper rather than a splitter. This is my inclination, and I have been criticised for it.
Others tend to be splitters. IT IS THE NATURE OF ANY RESEARCHER TO WANT
HIS OR HER SEQUENCE TO BE IN A NEW FAMILY. THAT TENDS TO MAKE
RESEARCHERS FALL IN THE SPLITTER CAMP. Many researchers who send me
new sequences are unhappy when they do not get a new family name, but only a new
subfamily name. Many argue that the function is different and it should not be kept in the
given family based on the function. As I said above, function does not enter into the
decision. The appearance of the clusters on a tree does enter into the discussion.
AN EXAMPLE: The CYP81 family is very close to the CYP91 family. When the first 91
member was sent in, the two were at the gray region of 39% identity. Looking at the tree, I
decided this time to give them separate family status. I split them. I was probably reacting
to many criticisms of my lumping tendencies. Now there are 5 members in CYP91A, and
the last one was 53% identical to 91A4, but it also turned out to be 48% identical to 81B1.
In this case, the two families seem to belong to one family and CYP91 should be CYP81D.
However, it has been a couple of years since I named CYP91A1 and this name has been
used and it is difficult to change a name once it has been in circulation for a while.
This example shows the opposite problem from the 2D subfamily, which should really
be its own family. Here, two families really should be a single family. These problems
will arise, and there will be no way to avoid them, since this naming process is historical.
The question then becomes what to do about it. Do you rename sequences based on
revised trees made with more sequences? I am worried about this because it damages the
continuity that was the original goal of setting up this nomenclature: a unified name that
sticks with a sequence from lab to lab and year to year and is indicative of relationships to
other sequences. To achieve the last goal, it would be better to lump, because a new family
name does not give a relationship to other families. So, if I make errors, it seems that it
would be better to make them as a lumper rather than a splitter.
Legitimate differences can be caused by tree building methods and scoring matrices.
There are now more than 750 P450 sequences in my collection. No tree builder in their
right mind would want to make a tree with that many sequences. Therefore, I have split
these sequences into groups that are easier to work with. Plants almost always cluster with
other plants, though the CYP51 family is an exception, and CYP74 is so different from
other P450s, because its I helix is not conserved, that it clusters with nothing. Insects cluster
with insects, and bacteria usually cluster together, though there are E-like bacteria that are more
eukaryote-like than bacteria-like (CYP102, CYP110, CYP118). For this reason, I try to
do trees with fewer than 100-130 members in them. The sequences are compared to a set of
other sequences in various folders by MacVector software to find the most similar sequence
group, then they are aligned and put in a tree with that group.
You may find fault with my choices of tree building algorithms and scoring
matrices. I have used the same programs for 10 years so there is consistency in the
process, but I started in this knowing very little about tree building. I sought help in
Masatoshi Nei’s Center for Demographics and Population Genetics at the University of
Texas Health Science Center in Houston. Clay Stephens gave me a program he had in
Fortran for calculating UPGMA trees, and I converted it to Basic to run on an IBM. I also
wrote programs to compute percent identities from sequence alignments and store them in
distance matrices for the UPGMA program. I modified this program to calculate similarity
based on PAM120 and PAM250 as well as the unit matrix, so I could do trees by each scoring
matrix. This is what I have been using all this time. It is simple, it works, but it is not
state of the art. I feel the need more and more to switch to MEGA or PHYLIP to do these
trees, and I plan to do this this winter, but I do not know what will happen if significant
changes are found in the existing trees and nomenclature. I suspect there will be some
problems that may indicate revision of the nomenclature for some sequences. I will have to
consult other members of the nomenclature committee before we do anything too drastic.
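For readers who have not seen UPGMA, here is a minimal Python sketch of the standard algorithm working from a pairwise distance table. It is not the Fortran or Basic program described above. Taking distance as 100 minus percent identity is an assumption made for the example, and the sequence names and numbers are hypothetical.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a UPGMA tree.

    names: list of sequence labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns (tree, merges): the tree as nested tuples, and the list of merges
    as (left, right, height), where height is half the distance at the merge.
    """
    d = dict(dist)                        # working copy; new clusters are added
    sizes = {name: 1 for name in names}   # cluster -> number of sequences in it
    merges = []
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        height = d[frozenset((a, b))] / 2.0
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # Distance to every other cluster is the size-weighted average.
        for other in sizes:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
        merges.append((a, b, height))
    return next(iter(sizes)), merges


def distances_from_identity(percent_identity):
    """Assumption for this sketch: distance = 100 - percent identity."""
    return {frozenset(pair): 100.0 - pid
            for pair, pid in percent_identity.items()}


# Hypothetical percent identities for three made-up sequences.
pid = {("seqA", "seqB"): 90.0, ("seqA", "seqC"): 60.0, ("seqB", "seqC"): 50.0}
tree, merges = upgma(["seqA", "seqB", "seqC"], distances_from_identity(pid))
```

Feeding the same function distance tables computed with different scoring matrices is one way to see how much the clusters move from matrix to matrix.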
If you care to comment on this posting, feel free to send me email at firstname.lastname@example.org