Computational deconvolution of gene expression data from heterogeneous biological samples

Biological samples almost always consist of different kinds of cells mixed in varying proportions. The variability introduced by differences in cell type composition often masks other differences between samples that may be of greater interest. This confounding is both widely known and routinely ignored. One way to deal with it is to physically separate the cell types that make up the samples and perform experiments only on the purified cell types, but this is challenging and often infeasible in practice. In addition, physical separation disrupts the cellular context and can alter the very processes the researcher is trying to understand. An alternative, which has been gaining traction in recent years, is to use computational methods to infer the composition of the samples. Knowledge of the sample composition can then be used to help understand the biological processes taking place within specific cell types. This talk will review computational methods for deconvolving heterogeneous biological samples using gene expression data and describe a new method we have developed, based on a semi-supervised version of non-negative matrix factorization. It will also provide an overview of our software package, CellMix, which aims to provide a unified framework for developing and applying gene expression deconvolution methods.
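
To make the idea of expression deconvolution concrete, the following is a minimal sketch under a common simplifying assumption: the measured expression matrix X (genes by samples) is approximately the product of a cell-type signature matrix W (genes by cell types) and a proportion matrix H (cell types by samples). The sketch assumes W is already known and estimates only H, using standard non-negative multiplicative updates. It is not the semi-supervised method described in this talk and does not use the CellMix interface; all names (estimate_proportions, W_SIGNATURES, H_TRUE, X_MIXED) and all numbers are hypothetical and chosen purely for illustration.

```python
# Toy linear mixing model: X (genes x samples) = W (genes x cell types) * H (cell types x samples).
# All values below are invented for illustration only.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*a)]

def estimate_proportions(X, W, n_iter=500, eps=1e-12):
    """Estimate a non-negative H with W held fixed.

    Uses the multiplicative update H <- H * (W^T X) / (W^T W H),
    which keeps every entry of H non-negative, then rescales each
    sample (column) so that its proportions sum to 1.
    """
    n_types, n_samples = len(W[0]), len(X[0])
    H = [[1.0 / n_types] * n_samples for _ in range(n_types)]  # uniform start
    Wt = transpose(W)
    WtX, WtW = matmul(Wt, X), matmul(Wt, W)
    for _ in range(n_iter):
        WtWH = matmul(WtW, H)
        H = [[H[k][j] * WtX[k][j] / (WtWH[k][j] + eps)
              for j in range(n_samples)] for k in range(n_types)]
    col_sums = [sum(H[k][j] for k in range(n_types)) for j in range(n_samples)]
    return [[H[k][j] / col_sums[j] for j in range(n_samples)]
            for k in range(n_types)]

# Hypothetical signatures: 4 genes, 2 cell types (genes 1-2 mark type A, genes 3-4 mark type B).
W_SIGNATURES = [[10.0, 1.0], [8.0, 2.0], [1.0, 9.0], [2.0, 12.0]]
# Hypothetical true proportions for 3 mixed samples (each column sums to 1).
H_TRUE = [[0.8, 0.5, 0.1], [0.2, 0.5, 0.9]]
# Noise-free simulated mixtures.
X_MIXED = matmul(W_SIGNATURES, H_TRUE)

H_EST = estimate_proportions(X_MIXED, W_SIGNATURES)
```

On this noise-free toy input the estimated proportions should closely match the simulated ones. Real data are noisy, signatures are often only partially known, and that partial-knowledge setting is where semi-supervised factorization approaches, which estimate W and H jointly while constraining them with prior information such as marker genes, come in.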