MSCI Module 6.

Proteomics II

George Hilliard, February 14, 2008

Identification of Proteins with Proteomics

This module deals with advanced proteomics database searching.  The specific difference from the previous module is that peptide sequence data is used as input to the search engine.

As established in the last module, unknown proteins contained in polyacrylamide gel pieces can be identified by proteolysis combined with mass spectrometry.  Steps A and B in the following flow chart were covered in the last module.  In this session, we will continue along the path through steps C and D, which use "peptide sequence tags".  Recall that steps A and B use the masses of intact peptides created by trypsin.  The difference here is that any one of those peptides can be fragmented further to deduce part of its sequence.  As stated in the last module, the majority of a pool of unknown proteins is identified with peptide mass fingerprinting in steps A and B; steps C and D pick up the rest.



The proteomics processing steps create proteolyzed peptides to initiate the analysis, typically with the enzyme trypsin.  Any single one of these peptides can be fragmented further in the mass spectrometer when desired.  The degree of fragmentation can be controlled, and the fragmentation occurs around the peptide bond.  A standard nomenclature describes these fragments, and for our purposes we will use the "b" and "y" ions for database searches.  Fragments that keep the charge on the N-terminal portion of the peptide are "b" ions, and fragments that keep the charge on the C-terminal portion are "y" ions.
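To make the "b" and "y" nomenclature concrete, here is a minimal sketch that lists the singly charged b and y ions of a short peptide.  The peptide "GAK" is hypothetical and chosen only for illustration, the function name is my own, and the residue masses are the standard monoisotopic values rounded to five decimals.  Real search engines handle multiple charge states, modifications, and other ion types that this sketch ignores.

```python
# Monoisotopic residue masses in daltons (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O, carried by the C-terminal (y) fragments
PROTON = 1.00728   # mass of the charge-carrying proton


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    running = 0.0
    for m in masses[:-1]:            # b1 .. b(n-1), read from the N-terminus
        running += m
        b_ions.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):   # y1 .. y(n-1), read from the C-terminus
        running += m
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


# Hypothetical tryptic peptide, chosen only for illustration.
b_ions, y_ions = fragment_ions("GAK")
for i, mz in enumerate(b_ions, start=1):
    print(f"b{i} = {mz:.4f}")
for i, mz in enumerate(y_ions, start=1):
    print(f"y{i} = {mz:.4f}")
```

A useful check on any such list is that each complementary pair (b1 with y2, b2 with y1 in this example) adds up to the intact peptide mass plus two protons.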

Consecutive ions of a "b" or "y" ion series differ in mass by one amino acid residue, and the 20 amino acid side chains, or R groups, give those residues their characteristic masses.  The amino acid sequence of a peptide can therefore be interpreted from the fragments created in the mass spectrometer.  As was specified in the last module, peptides have a chemical formula, so molecular weights can be calculated in silico for an entire database, including these sequence fragments.  The search algorithms simply ask which entry in the database best matches the measured, or experimentally derived, mass list.  The mass of a peptide is always measured with some error relative to the calculated molecular weight.  The magnitude of this error is a function of the performance of the mass spectrometer, and it has a direct impact on the use of the MS data in a database search: the smaller the error, the more stringent the database search can be.  You will see this effect in the database search engine software.
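The link between mass differences, sequence, and measurement error can be sketched in a few lines.  The example below reads residues from the gaps between neighbouring ions of one series and shows how a looser mass tolerance makes the reading ambiguous.  The ion masses, the tolerance values, and the function name are hypothetical; this is an illustration of the idea, not the algorithm used by any particular search engine.

```python
# Monoisotopic residue masses in daltons (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_ladder(ion_masses, tolerance_da):
    """Translate gaps between consecutive ions of one series into residues.

    Each gap is compared with every residue mass; all residues within
    tolerance_da are reported, joined by "/", or "?" if none match.
    """
    ordered = sorted(ion_masses)
    residues = []
    for lighter, heavier in zip(ordered, ordered[1:]):
        gap = heavier - lighter
        matches = [aa for aa, mass in RESIDUE_MASS.items()
                   if abs(mass - gap) <= tolerance_da]
        residues.append("/".join(matches) if matches else "?")
    return residues


# Hypothetical b-ion series (b1, b2, b3), for illustration only.
ions = [58.02874, 129.06585, 257.16081]

tight = read_ladder(ions, tolerance_da=0.02)   # accurate instrument
loose = read_ladder(ions, tolerance_da=0.05)   # less accurate instrument
print("tolerance 0.02 Da:", tight)
print("tolerance 0.05 Da:", loose)
```

With the tighter tolerance the second gap is read as lysine alone, while the looser tolerance can no longer separate lysine from glutamine.  Leucine and isoleucine have identical masses, so no tolerance distinguishes them.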


The parent or "precursor" peptide that was fragmented is shown in the upper mass spectrum, and the fragments of that precursor are shown in the lower spectrum.  In the lower spectrum, a few b and y fragment ions are labeled.  Each of these masses resulted from a random fragmentation event of the precursor mass 786.3, a tryptic peptide.

These fragments can be used in a database search query.  Why is a database search with sequence tags more specific than one with peptide mass fingerprint data?  First, realize that a query to a database with peptide sequence data alone already provides a very specific search; it is the basis for practically all of the BLAST search tools at NCBI.  A peptide sequence tag search, however, uses the sequence data in combination with mass values.  It is then easy to appreciate at least a 1-million-fold increase in the stringency of the search.  This concept follows the simple rules of probability presented below; the mass values are from the spectra above.



Search Criteria for MS Sequence Tags

      match criteria                                                  probability factor

 1    C-terminal cleavage site of preceding peptide (Arg or Lys)      1/10
 2    mass of region 1                                                1/110
 3    tag sequence                                                    1/20 x 1/20 x 1/20
 4    mass of region 3                                                1/110
 5    C-terminal cleavage (Arg or Lys)                                1/10

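 $P_{\text{false positive}} = \frac{1}{10} \cdot \frac{1}{110} \cdot \frac{1}{20} \cdot \frac{1}{20} \cdot \frac{1}{20} \cdot \frac{1}{110} \cdot \frac{1}{10} \approx 1 \times 10^{-10}$

 $P_{\text{nonrandom match}} = (1 - P_{\text{false positive}})^{n}$, where $n$ = number of amino acids in the database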

 Adapted from:  Mann and Wilm, 1994, Anal. Chem., 66, 4390-4399.
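The arithmetic behind these criteria is easy to check.  The sketch below multiplies the probability factors as listed, which works out to roughly 1 x 10^-10, and compares the result with a search on the three-residue tag sequence alone.  The database size n is a hypothetical value picked only to show how the second formula behaves, and the variable and function names are my own.

```python
from fractions import Fraction
import math

# Probability factors exactly as listed in the table above.
FACTORS = {
    "preceding cleavage site (Arg or Lys)": Fraction(1, 10),
    "mass of region 1": Fraction(1, 110),
    "tag sequence (3 residues)": Fraction(1, 20) ** 3,
    "mass of region 3": Fraction(1, 110),
    "C-terminal cleavage (Arg or Lys)": Fraction(1, 10),
}

# Independent criteria multiply.
p_false_positive = math.prod(FACTORS.values())

# How much more stringent is the full tag than the sequence alone?
p_sequence_only = FACTORS["tag sequence (3 residues)"]
stringency_gain = p_sequence_only / p_false_positive


def p_nonrandom_match(p_fp, n):
    """(1 - p_fp)**n, computed with log1p to stay accurate for tiny p_fp."""
    return math.exp(n * math.log1p(-float(p_fp)))


n_hypothetical = 10_000_000  # assumed database size in amino acids
print(f"P(false positive) = {float(p_false_positive):.3e}")
print(f"gain over tag sequence alone = {float(stringency_gain):,.0f}x")
print(f"P(nonrandom match) for n = {n_hypothetical:,}: "
      f"{p_nonrandom_match(p_false_positive, n_hypothetical):.6f}")
```

The gain over the tag sequence alone comes entirely from the two flanking masses and the two cleavage-site criteria (10 x 110 x 110 x 10 = 1,210,000), which is where the million-fold figure comes from.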


The masses from the example spectra are listed here and can be used in a peptide sequence search with the Sonar search software.  Be sure to pay attention to the database searched and to the charge (z) value in the settings.  These searches take quite a bit longer than those in the last module.
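The charge setting matters because the mass spectrometer reports mass-to-charge ratio (m/z), not mass.  As a rough sketch, the conversion below shows how the same observed peak corresponds to very different peptide masses depending on the charge assumed.  The function names are my own, and the charge states tried are assumptions, since the charge of the 786.3 precursor is not stated here.

```python
PROTON = 1.00728  # mass of one charge-carrying proton, in daltons


def neutral_mass(mz, z):
    """Neutral peptide mass implied by an observed m/z at charge z."""
    return z * (mz - PROTON)


def mz_from_mass(mass, z):
    """Expected m/z for a peptide of the given neutral mass at charge z."""
    return mass / z + PROTON


observed_mz = 786.3  # precursor value from the example spectrum
for z in (1, 2, 3):  # assumed charge states, for comparison only
    print(f"z = {z}: neutral mass = {neutral_mass(observed_mz, z):.4f}")
```

Entering the wrong charge therefore sends the search engine looking for a peptide of the wrong mass, no matter how accurate the measurement was.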

Access the proteomics search software in my laboratory: Engine 1, Engine 2



    email identities to me at

Unknown 1

A mammalian peptide; the fragment mass list is available here.




Unknown 2

A bacterial peptide; the fragment mass list is available here.



Extra Credit

Raw Data Only, no hints:





Access my laboratory's web page here: Protein Analysis and Proteomics Laboratory

References for Module 6

52.  Mann, M., and Wilm, M., 1994, Error-Tolerant Identification of Peptides in Sequence Databases by Peptide Sequence Tags, Anal. Chem. 66, 4390-4399.


53.  Yates, J.R., Eng, J.K., McCormack, A.L., and Schieltz, D., 1995, Method to Correlate Tandem Mass Spectra of Modified Peptides to Amino Acid Sequences in the Protein Database, Anal. Chem. 67, 1426-1436.