August 4, 2000
D. Nelson

In July, I attended a meeting in Moscow called the International Workshop "From sequence to function: experimental and bioinformatic studies of the cytochrome P450 superfamily." During the meeting many young Russian scientists presented their work on a Cytochrome P450 Database and on bioinformatic methods for analyzing the P450 sequences coming from around the planet. Much effort has gone into assembling this new database, and I have placed a link to it here (CPD) and on the P450 WEBMATRIX page. I encourage you all to visit this site. The database has 1200 P450 sequences in it. Credit for the database is given in the user's guide:

Institute of Biomedical Chemistry of the Russian Academy of Medical Sciences
Center for Molecular Design, Division of the Janssen Research Foundation
Staff: Alexander I. Archakov, I.I. Karuzina, Semen Gusev and Andrey Lisitsa
Contact: Andrey Lisitsa, email: fox@ibmh.msk.su

I spoke at some length with Dr. Archakov, Andrey Lisitsa and Semen Gusev. This discussion continued at the MDO2000 meeting in Stresa, Italy. One of the novel results coming from this group is the application of statistical methods that are new to the analysis of protein sequences. The method can be applied to consensus sequences of P450 families or subfamilies to identify motifs characteristic of that set of sequences. The method is called Sherman statistics. It is an old method, dating from the 1950s, that has applications in areas of mathematics other than bioinformatics. One of the strengths of the Russian bioinformatic group is that they are aware of these tools and apply them in a novel way to the problem of protein sequence analysis. Andrey Lisitsa and Semen Gusev have applied the Sherman statistic method to P450s from Mycobacterium tuberculosis to identify motifs in the sequences that may be involved in the activity and substrate specificity of the enzymes. A paper has been submitted on this.

Another topic of discussion was P450 nomenclature.
The Russian group has been interested for a long time in tree building methods and ways to automate the naming of new P450 genes. They have spent much effort in devising a P450 naming system that is as close to the existing nomenclature as possible. The existing nomenclature is based on clustering of sequences on phylogenetic trees, so what they have done is to create a tree building method that assigns names based on a strict cutoff value of relatedness. To optimize the method, trees were computed with gap penalties varying from 1 to 20 and gap extension penalties varying from 1 to 20 (400 combinations), and the trees from each combination were used to assign nomenclature using the cutoff. Under the best conditions found, the clusters generated by these programs matched the nomenclature 82% of the time for families and 85% of the time for subfamilies. I was told that it took about two months of computing time to optimize all these variables. The end goal, of course, is to use this automated method to assign names. A paper has been submitted with these results.

As expected, there were some discrepancies between the computer generated nomenclature and the manual nomenclature. We spent quite a bit of time talking about why that is so. Part of the reason for the differences is historical. An automated system working on representative sets of all present day sequences has advantages that were not available when fewer sequences were known. This is especially true for large families with many subfamilies. I give two examples, the CYP2 family and the CYP4 family.

CYP2 has many members and many subfamilies. When few sequences were known, there was a large gap on trees between the CYP2 and CYP1 families. New sequences that belonged to either family clearly sorted into two clusters on trees.
As sampling of the CYP2 family sequences became more complete, more distant members were added and named CYP2, because they were still clearly in the CYP2 cluster and separated from the CYP1 cluster. Eventually, CYP2D was added. Some CYP2D members were more than 40% identical to other CYP2 sequences, but some were less than 40%. This was the first case of the family threshold creeping to lower numbers. Later a lobster CYP2L sequence was added. Again it was just a little farther back on the tree, but it did not seem to be in a new family cluster, so more threshold creep was allowed. In a rigid automated system the 2D and 2L subfamilies would probably be given separate family status. I did not do this because I felt they belonged to the CYP2 family cluster, so there is some aesthetic decision making that must go on in naming genes. There is some art to it that a computer is unable to discern.

The CYP4 family is the worst case scenario. Here the insects have made a bastion of P450 sequences. Also for historical reasons, the CYP4 family was allowed to creep backwards beyond the 40% threshold for a family. Because it was quite well separated from other clusters, it was allowed to move back. The Russians have found that a percent identity of 36% was optimal for family clusters. That is pretty near what has become my default, but I still like to see the trees when naming sequences.

One particular problem with insect sequences is not insects, but people. People want to get the most sequences for the least money in a hurry. The best way to do that is PCR. But PCR requires conserved regions for designing degenerate primers, and these are not found at the ends of the P450 genes. They are found in the I-helix and at the heme binding region. As a result, I have been sent hundreds of PCR fragments from insects that cover this region of the protein and no more. To name them I have had to adjust the process to include trees made only from the C-terminal half of the proteins.
So names in the CYP4 family are assigned with different criteria than those used for full length sequences elsewhere. This of course affects nomenclature. What happens when full length sequences, such as the 86 Drosophila sequences, are assigned names using an automated system? Those names will conflict in places with the names assigned based on half length sequences.

Another nagging problem is what to do with merging families. The CYP105 and CYP107 families are running into one another. Some members are more than 40% identical to members of the other family, but most members are less than 40%. When they were first named, this was not the case. 107A1 and 105A1 were distinct. Now it looks like the two families should be joined, which requires the threshold to be lowered.

A final note on nomenclature. Nomenclature is supposed to serve people. It is supposed to make sense and be useful. This means that there will be inconsistencies in the nomenclature for good reasons. Computers do not understand this. An example is CYP51. There are CYP51s in most eukaryotes (insects and some other invertebrates have lost CYP51 and must eat sterols). Most eukaryotes have only a single CYP51 and this is recognized across phyla. For simplicity these have not been named as the nomenclature rules say they should be. If the rules were followed, there would be a new CYP51 subfamily for nearly every species, and Mycobacterium tuberculosis would not qualify to be in the family. This is nonsense, so the CYP51s have all been named CYP51, without subfamily designations. The practical result is that everyone knows what is being discussed when someone says CYP51. This would not be true if every sequence were in a new subfamily. This is good nomenclature, but it breaks the rules.

These several problems are why a curator is needed: someone to make these decisions, and to make a few mistakes also. The automated system could be an aid in making an assignment. It should not be implemented as an alternative to a human curator.
Perhaps such a system could be applied to a family that has no systematic nomenclature established, as a tool to sort genomes and databases into manageable groups of sequences. I would argue that humans are not obsolete yet and there is value to be gained from the art of nomenclature.
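
To make the idea of a strict cutoff concrete, here is a minimal sketch in Python. It is not the Russian group's program. Their method builds trees, whereas this sketch swaps in simple single-linkage grouping on pairwise percent identity. The function names, the sequence labels (x1, y1, and so on) and every identity value are invented for illustration. The agreement score (fraction of sequence pairs on which the automatic groups and the curated families agree) is also an assumption, since the way the 82% and 85% figures were computed is not given above.

```python
from itertools import combinations


def cluster_by_identity(names, identity, cutoff):
    """Single-linkage grouping: two sequences end up in the same cluster
    if a chain of pairs with percent identity >= cutoff connects them."""
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the cluster representative,
        # shortening the path along the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        # The identity table may store a pair in either order.
        pct = identity.get((a, b), identity.get((b, a), 0.0))
        if pct >= cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), set()).add(n)
    return list(clusters.values())


def pair_agreement(names, clusters, curated_family):
    """Fraction of sequence pairs on which the automatic clusters and the
    curated families agree (both 'same group' or both 'different group')."""
    cluster_of = {n: i for i, c in enumerate(clusters) for n in c}
    pairs = list(combinations(names, 2))
    agree = sum(
        (cluster_of[a] == cluster_of[b])
        == (curated_family[a] == curated_family[b])
        for a, b in pairs
    )
    return agree / len(pairs)


# Hypothetical toy data, not real measurements. x3 plays the part of a
# distant member that a curator kept in family X through threshold creep.
names = ["x1", "x2", "x3", "y1", "y2"]
identity = {
    ("x1", "x2"): 55.0, ("x1", "x3"): 38.0, ("x2", "x3"): 37.0,
    ("y1", "y2"): 60.0,
    ("x1", "y1"): 30.0, ("x1", "y2"): 29.0,
    ("x2", "y1"): 31.0, ("x2", "y2"): 28.0,
    ("x3", "y1"): 27.0, ("x3", "y2"): 26.0,
}
curated_family = {"x1": "X", "x2": "X", "x3": "X", "y1": "Y", "y2": "Y"}

for cutoff in (40.0, 36.0):
    clusters = cluster_by_identity(names, identity, cutoff)
    score = pair_agreement(names, clusters, curated_family)
    print(cutoff, sorted(sorted(c) for c in clusters), score)
```

With these made-up numbers, a rigid 40% cutoff splits x3 into its own family and agrees with the curated names on 8 of 10 pairs, while a 36% cutoff reproduces the curated families exactly. Single-linkage grouping also shows the merging problem in miniature: one pair above the cutoff is enough to join two families, which is one more reason to look at the trees before accepting what the computer proposes.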