DISCUSSIONS WITH THE RUSSIANS
August 4, 2000 D. Nelson
In July, I attended a meeting in Moscow, the International Workshop "From sequence
to function: experimental and bioinformatic studies of the cytochrome P450 superfamily".
During the meeting many young Russian scientists presented their work on a Cytochrome
P450 Database, and bioinformatic methods for analyzing the P450 sequences coming
from around the planet. There has been much effort expended in assembling this new
database, and I have placed a link to it here (CPD)
and on the P450 WEBMATRIX page. I encourage you all to visit this site. This
database has 1200 P450 sequences in it.
Credit for the database is given in the user's guide.
Institute of Biomedical Chemistry
of the Russian Academy of Medical Sciences
Center for Molecular Design,
Division of the Janssen Research Foundation
Staff: Alexander I. Archakov, I.I. Karuzina, Semen Gusev and Andrey Lisitsa
Contact: Andrey Lisitsa email: firstname.lastname@example.org
I spoke at some length with Dr. Archakov, Andrey Lisitsa and Semen Gusev. This
discussion continued at the MDO2000 meeting in Stresa, Italy.
One of the novel results coming from this group is the application of new statistical
methods to the analysis of protein sequences. The method can be applied to consensus
sequences of P450 families or subfamilies to identify motifs characteristic of that set of
sequences. The method is called Sherman statistics, which is an old method dating from
the 1950s that has applications in other areas of mathematics besides bioinformatics. One
of the strengths of the Russian bioinformatics group is that it knows of these tools and applies
them in a novel way to the problem of protein sequence analysis.
Andrey Lisitsa and Semen Gusev have applied the Sherman statistic method to P450s
from Mycobacterium tuberculosis to identify motifs in the sequences that may be
involved in the activity and substrate specificity of the enzymes. A paper has been
submitted on this.
Another topic of discussion was P450 nomenclature. The Russian group has been
interested for a long time in tree building methods and ways to automate the naming of
new P450 genes. They have spent much effort devising an automated P450
naming system that stays as close to the existing nomenclature as possible. The existing
nomenclature is based on clustering of sequences on phylogenetic trees, so what they
have done is to create a tree building method that assigns names based on a strict cutoff
value of relatedness. To optimize the method, trees were computed with gap penalties
varying from 1-20 and gap extension penalties varying from 1-20 (400 combinations) and
the trees from each were used to assign nomenclature using the cutoff. Under the best
conditions found, the clusters generated by these programs matched the nomenclature
82% of the time for families and 85% of the time for subfamilies. I was told that it took
about two months of computing time to optimize all these variables. Of course, the end
goal is to use this automated method to assign names. A paper has been submitted with
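
The optimization described above can be sketched as a simple grid search. This is only an illustration, not the Russian group's program. The agreement measure (the fraction of sequences whose manual name is the majority name in their automated cluster), the name gap_open for the first penalty, and the toy_cluster_fn stand-in for "align, build a tree, cut it at the cutoff" are all assumptions made for the example.

```python
from collections import Counter
from itertools import product


def purity(manual_names, cluster_ids):
    """Fraction of sequences whose manual name is the majority name
    within their automated cluster (an assumed agreement measure)."""
    by_cluster = {}
    for name, cid in zip(manual_names, cluster_ids):
        by_cluster.setdefault(cid, Counter())[name] += 1
    matched = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return matched / len(manual_names)


def grid_search(cluster_fn, manual_names, penalties=range(1, 21)):
    """Try every (gap_open, gap_extend) pair and keep the best-scoring one."""
    best = None
    tried = 0
    for gap_open, gap_extend in product(penalties, penalties):
        tried += 1
        score = purity(manual_names, cluster_fn(gap_open, gap_extend))
        if best is None or score > best[0]:
            best = (score, gap_open, gap_extend)
    return best, tried


# Hypothetical stand-in for the real pipeline: it recovers the manual
# families only at one invented setting and lumps everything otherwise.
def toy_cluster_fn(gap_open, gap_extend):
    if (gap_open, gap_extend) == (10, 2):
        return [0, 0, 1, 1, 1]
    return [0, 0, 0, 0, 0]


manual = ["CYP1", "CYP1", "CYP2", "CYP2", "CYP2"]
best, tried = grid_search(toy_cluster_fn, manual)
print(tried, best)
```

With 20 values for each penalty the loop visits 20 x 20 = 400 settings, which is why the real optimization, with a full alignment and tree at every setting, took so long.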
Of course, there were some discrepancies between the computer generated nomenclature
and the manual nomenclature. We spent quite a bit of time talking about why that is true.
Part of the reason for differences is historical. An automated system working on
representative sets of all present day sequences has advantages that were not present
when fewer sequences were known. This is especially true for large families with many
subfamilies. I give two examples, the CYP2 family and the CYP4 family. CYP2 has
many members and many subfamilies. When few sequences were known, there was a
large gap on trees between the CYP2 and CYP1 families. New sequences that belonged
to either family clearly sorted into two clusters on trees. As sampling of the CYP2 family
sequences became more complete, more distant members were added and named CYP2,
because they were still clearly in the CYP2 cluster and separated from the CYP1 cluster.
Eventually, CYP2D was added. Some CYP2D members were more than 40% identical
to other CYP2 sequences, but some were less than 40%. This was the first case of the
family threshold creeping to lower numbers. Later a lobster CYP2L sequence was added.
Again it was just a little farther back on the tree, but it did not seem to be in a new family
cluster, so more threshold creep was allowed.
In a rigid automated system the 2D and 2L subfamilies would probably be given separate
family status. I did not do this because I felt they belonged to the 2 family cluster, so
there is some aesthetic decision making that must go on in naming genes. There is some
art to it that a computer is unable to discern.
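
The difference between a rigid cutoff and threshold creep can be shown with a toy example. This is a sketch under stated assumptions: the four sequence labels and all percent identities are invented, and the two clustering rules are not taken from the Russian group's software. "Strict" here means every pair in a family must meet the cutoff (complete linkage). "Creep" means a sequence joins a family if it meets the cutoff with at least one member (single linkage).

```python
def cluster(names, identity, cutoff, linkage):
    """Agglomerative clustering on percent identity.

    Repeatedly merges the two clusters with the highest linkage value
    and stops when that value falls below the cutoff.
    linkage=max gives single linkage, linkage=min gives complete linkage.
    """
    clusters = [frozenset([n]) for n in names]

    def link(a, b):
        return linkage(identity[frozenset((x, y))] for x in a for y in b)

    while len(clusters) > 1:
        pairs = [(link(a, b), i, j)
                 for i, a in enumerate(clusters)
                 for j, b in enumerate(clusters) if i < j]
        score, i, j = max(pairs)
        if score < cutoff:
            break
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return set(clusters)


# Invented percent identities. "distant" plays the role of a sequence
# that is above 40% to one family member and below 40% to another.
names = ["core1", "core2", "distant", "other"]
identity = {
    frozenset(("core1", "core2")): 55,
    frozenset(("core1", "distant")): 42,
    frozenset(("core2", "distant")): 37,
    frozenset(("core1", "other")): 30,
    frozenset(("core2", "other")): 31,
    frozenset(("distant", "other")): 28,
}

creep = cluster(names, identity, cutoff=40, linkage=max)
strict = cluster(names, identity, cutoff=40, linkage=min)
print(sorted(sorted(c) for c in creep))
print(sorted(sorted(c) for c in strict))
```

Under the creep rule "distant" stays in the family with core1 and core2. Under the strict rule it is split off into a family of its own, which is the kind of outcome a rigid system would probably give for 2D and 2L.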
The CYP4 family is the worst-case scenario. Here the insects have made a bastion of
P450 sequences. Also for historical reasons, the 4 family was allowed to creep
backwards beyond the 40% threshold for a family. As long as it stayed well separated
from other clusters, it was allowed to move back. The Russians have found that a percent
identity of 36% was optimal for family clusters. That is pretty near what has become my
default, but I still like to see the trees in naming sequences.
One particular problem with insect sequences is not insects, but people. People want to
get the most sequences for the least money in a hurry. The best way to do that is PCR.
But PCR requires conserved regions to make degenerate primers and these are not found
at the ends of the P450 genes. They are found in the I-helix and at the heme binding
region. As a result, I have been sent hundreds of PCR fragments from insects that cover
this region of the protein and no more. To name them I have had to adjust the process to
include trees made only from the C-terminal half of the proteins. So names in the CYP4
family are assigned with different criteria than other full length sequences. This of
course affects nomenclature. What happens when full length sequences such as the 86
Drosophila sequences are assigned names using an automated system? The names will
not always agree with those assigned based on half length sequences.
Another nagging problem is what to do with merging families. The CYP105 and
CYP107 families are running into one another. Some members are more than 40%
identical to members of the other family, but most members are less than 40%. When
they were first named, this was not the case. 107A1 and 105A1 were distinct. Now it
looks like the two families should be joined, which would require the threshold to be lowered.
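
One way to watch for this kind of collision is to list the cross-family pairs that meet the family cutoff. The sketch below is an illustration only. The labels A1, A2, B1, B2, B3 and all percent identities are invented and do not describe real CYP105 or CYP107 sequences.

```python
def cross_family_overlap(family_a, family_b, identity, cutoff=40.0):
    """Return the cross-family pairs at or above the cutoff and the
    fraction of all cross-family pairs that they represent."""
    pairs = [(a, b) for a in family_a for b in family_b]
    above = [(a, b) for a, b in pairs
             if identity[frozenset((a, b))] >= cutoff]
    return above, len(above) / len(pairs)


# Invented percent identities between two hypothetical families.
family_a = ["A1", "A2"]
family_b = ["B1", "B2", "B3"]
identity = {
    frozenset(("A1", "B1")): 43,
    frozenset(("A1", "B2")): 35,
    frozenset(("A1", "B3")): 33,
    frozenset(("A2", "B1")): 38,
    frozenset(("A2", "B2")): 34,
    frozenset(("A2", "B3")): 31,
}

above, fraction = cross_family_overlap(family_a, family_b, identity)
print(above, round(fraction, 2))
```

A report like this only flags the overlap. Whether to merge the families, or to lower the threshold, is still a decision for the curator.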
A final note on nomenclature. Nomenclature is supposed to serve people. It is supposed
to make sense and be useful. This means that there will be inconsistencies in the
nomenclature for good reasons. Computers do not understand this. An example is
CYP51. There are CYP51s in most eukaryotes (insects and some other invertebrates
have lost CYP51 and must eat sterols). Most eukaryotes have only a single CYP51 and
this is recognized across phyla. For simplicity these have not been named as the
nomenclature rules say they should be. If the rules were followed, there would be a new
CYP51 subfamily for nearly every species, and Mycobacterium tuberculosis would not
qualify to be in the family. This is nonsense, so the CYP51s have all been named
CYP51, without subfamily designations. The practical result is that everyone knows
what is being discussed when someone says CYP51. This would not be true if every
sequence was in a new subfamily. This is good nomenclature, but it breaks the rules.
These several problems are why a curator is needed. Someone to make these decisions
and make a few mistakes also.
The automated system could be an aid in making an assignment. It should not be
implemented as an alternative to a human curator. Perhaps such a system could be
applied to a family that has no systematic nomenclature established, as a tool to sort
genomes and databases into manageable groups of sequences. I would argue that humans
are not obsolete yet and there is value to be gained from the art of nomenclature.