Qunhua Li
Protein identification using mass spectrometry is a high-throughput
way to identify proteins in biological samples. In this talk, we use
statistical approaches to model two key steps in this process, namely,
identifying peptides from mass spectra using protein database search
and identifying proteins from putative peptide identifications. Both
problems are characterized by high-dimensional data and a low
signal-to-noise ratio.
For the problem of peptide identification, we developed a
likelihood-based algorithm built on a latent variable model, which
measures the likelihood that the observed spectrum arises from the
theoretical spectrum predicted for each peptide in a protein
database. By carefully modeling the noise structure, our probability
model accounts for multiple sources of noise in the data and
extracts some of the subtle signals that other methods miss. In
addition, our likelihood-based approach provides natural measures
for assessing the uncertainty of each identification.
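
To make the idea of likelihood-based scoring concrete, the following is a minimal sketch and not the model presented in the talk. It assumes a deliberately simple noise model in which each theoretical peak is observed with probability p_signal when the peptide is the true source and with probability p_noise otherwise. The function names, the mass tolerance, and both probabilities are hypothetical choices made for illustration. Normalizing the scores across candidates, as if exactly one candidate were correct with equal prior weight, gives a rough uncertainty measure for each identification.

```python
import math


def log_likelihood_ratio(observed_mz, theoretical_mz, tol=0.5,
                         p_signal=0.7, p_noise=0.05):
    """Log-likelihood ratio of 'peptide generated the spectrum' vs. 'noise only'.

    Toy assumption: each theoretical peak is matched independently, with
    probability p_signal under the peptide and p_noise under pure noise.
    """
    score = 0.0
    for t in theoretical_mz:
        # A theoretical peak counts as matched if any observed peak is within tol.
        matched = any(abs(o - t) <= tol for o in observed_mz)
        if matched:
            score += math.log(p_signal / p_noise)
        else:
            score += math.log((1.0 - p_signal) / (1.0 - p_noise))
    return score


def candidate_posteriors(observed_mz, candidates, **kwargs):
    """Normalize likelihood ratios over candidates (equal prior, one true source)."""
    scores = {name: log_likelihood_ratio(observed_mz, peaks, **kwargs)
              for name, peaks in candidates.items()}
    top = max(scores.values())
    # Subtract the maximum before exponentiating for numerical stability.
    weights = {name: math.exp(s - top) for name, s in scores.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}
```

Calling candidate_posteriors with an observed peak list and a dictionary mapping candidate peptide names to their theoretical peak lists returns one normalized weight per candidate. A real model would also use peak intensities and a richer description of the noise.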
The task of protein identification is essentially to assess the
evidence for the presence of proteins assembled from putative peptide
identifications. We developed an unsupervised protein identification
algorithm based on a nested mixture model, which incorporates the
evidence feedback between the peptide level and the protein level. Our
model is essentially a model-based clustering method, which jointly
validates the correctness of peptide identifications and infers the
evidence for the presence of proteins by simultaneously clustering the
labels of peptides and proteins. Using a yeast dataset, we show that
our method performs competitively against leading products on
protein identification and peptide validation.
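
As an illustration of model-based clustering in this setting, the sketch below fits a flat two-component Gaussian mixture to one-dimensional peptide scores with the EM algorithm. It is a simplification and not the nested mixture model described above: it ignores the protein level entirely and assumes that higher scores indicate correct identifications. The function names and the initialization are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std. dev. sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_two_component_mixture(scores, n_iter=100):
    """EM for a two-component 1-D Gaussian mixture.

    Returns (resp, mu), where resp[i] is the posterior probability that
    scores[i] belongs to the higher-mean ("correct") component.
    """
    lo, hi = min(scores), max(scores)
    mu = [lo, hi]                      # start the components at the extremes
    spread = (hi - lo) / 4.0
    if spread == 0.0:
        spread = 1.0
    sigma = [spread, spread]
    weight = [0.5, 0.5]
    n = len(scores)
    resp = [0.5] * n
    for _ in range(n_iter):
        # E-step: posterior probability of the higher-mean component.
        resp = []
        for x in scores:
            a = weight[0] * normal_pdf(x, mu[0], sigma[0])
            b = weight[1] * normal_pdf(x, mu[1], sigma[1])
            resp.append(b / (a + b) if a + b > 0.0 else 0.5)
        # M-step: update weights, means, and standard deviations.
        n1 = sum(resp)
        n0 = n - n1
        if n0 < 1e-9 or n1 < 1e-9:
            break                      # one component has collapsed; stop
        mu[0] = sum((1.0 - r) * x for r, x in zip(resp, scores)) / n0
        mu[1] = sum(r * x for r, x in zip(resp, scores)) / n1
        var0 = sum((1.0 - r) * (x - mu[0]) ** 2 for r, x in zip(resp, scores)) / n0
        var1 = sum(r * (x - mu[1]) ** 2 for r, x in zip(resp, scores)) / n1
        # Floor the standard deviations to avoid degenerate components.
        sigma = [max(math.sqrt(var0), 1e-3), max(math.sqrt(var1), 1e-3)]
        weight = [n0 / n, n1 / n]
    return resp, mu
```

The returned posterior probabilities play the role of soft cluster labels for the peptides. A nested model would additionally tie these labels to the presence or absence of the proteins the peptides map to.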