Youyi Fong
Genome sequencing and annotation projects generate large numbers of predicted protein sequences. To learn more about the biological functions of these proteins, we investigate the subfamily structure of protein families. A protein family is a group of proteins that share similar sequences and functions, but it is generally not a homogeneous group. We model the protein sequences belonging to a protein family as i.i.d. realizations of a mixture of high-dimensional (p = 20 to 50) generalized Bernoulli distributions, and we seek to identify the order of this mixture model, that is, its number of components. Several methods are applied to a simulated dataset; we take a close look at the best-performing approach and discuss the challenges in applying this method to real datasets, as well as possible extensions of the model.
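To make the model concrete, the following is a minimal sketch of a mixture of generalized Bernoulli (categorical) distributions over sequences of length p. It rests on assumptions not stated in the abstract: positions are taken to be independent given the subfamily, the alphabet is the 20 standard amino acids, and the position-wise probabilities are drawn at random purely for illustration. The function names (`random_profile`, `sample_sequences`, `log_likelihood`) are hypothetical and do not come from the paper.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet: 20 standard residues


def random_profile(p, rng, concentration=1.0):
    """Draw one categorical distribution per position (illustrative only).

    Normalized gamma draws give a Dirichlet-distributed probability vector;
    this prior is an assumption made for the sketch, not the paper's choice.
    """
    profile = []
    for _ in range(p):
        weights = [rng.gammavariate(concentration, 1.0) for _ in AMINO_ACIDS]
        total = sum(weights)
        profile.append([w / total for w in weights])
    return profile


def sample_sequences(n, mix_weights, profiles, rng):
    """Draw n i.i.d. sequences: pick a subfamily, then sample each position."""
    seqs, labels = [], []
    for _ in range(n):
        k = rng.choices(range(len(mix_weights)), weights=mix_weights)[0]
        seq = "".join(
            rng.choices(AMINO_ACIDS, weights=pos_probs)[0]
            for pos_probs in profiles[k]
        )
        seqs.append(seq)
        labels.append(k)
    return seqs, labels


def log_likelihood(seq, mix_weights, profiles):
    """Log mixture density of one sequence, computed with log-sum-exp."""
    comps = []
    for w, profile in zip(mix_weights, profiles):
        lp = math.log(w)
        for j, residue in enumerate(seq):
            lp += math.log(profile[j][AMINO_ACIDS.index(residue)])
        comps.append(lp)
    m = max(comps)
    return m + math.log(sum(math.exp(c - m) for c in comps))


if __name__ == "__main__":
    rng = random.Random(0)
    p, weights = 30, [0.5, 0.3, 0.2]  # a mixture of order 3, p within 20 to 50
    profiles = [random_profile(p, rng) for _ in weights]
    seqs, labels = sample_sequences(5, weights, profiles, rng)
    for s, k in zip(seqs, labels):
        print(k, s, round(log_likelihood(s, weights, profiles), 2))
```

In this sketch the order of the mixture is the length of `weights`. Identifying the order would amount to comparing fitted models with different numbers of components, which is not attempted here.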