QLS Featured Seminar - David Rocke
Excess False Positives in Negative-Binomial Based Analysis of Data from RNA-Seq Experiments
David M. Rocke, PhD (1,2) and Yilun Zhang, MS (1)
(1) Division of Biostatistics, Department of Public Health Sciences, UC Davis
(2) Department of Biomedical Engineering, UC Davis
Key Words: RNA-Seq, Gene Expression, Negative Binomial, DESeq, edgeR, limma-voom
RNA-Seq data are increasingly used for whole-genome differential mRNA expression analysis in lieu of gene expression arrays such as those from Affymetrix and Illumina. Because the raw data in RNA-Seq consist of counts of fragments mapping to each gene or exon, and because the counts are over-dispersed, it is common to model the distribution as negative binomial. Yet empirically, methods based on the negative binomial often generate massively inflated numbers of false positives, whether they are applied to real data or to simulated negative binomial data. This appears to be a consequence of the fact that the negative binomial with unknown scale is not an exponential family distribution, and that, when it is used as a quasi-likelihood, the link function, and thus the natural parameter, are functions of the scale parameter. Consequently, a linear model with negative binomial quasi-likelihood is not a proper generalized linear model unless the scale is known. We demonstrate that, even when the data are truly negative binomial, it is better to use transformation or weighting followed by standard linear models than it is to fit a version of a generalized linear model with estimated scale.
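The dependence of the natural parameter on the scale can be made concrete with a short derivation. This is a sketch under one common parametrization, not necessarily the notation used in the talk: the symbols \mu (mean) and k (the scale, or size, parameter) are assumptions introduced here for illustration.

```latex
% Negative binomial with mean \mu and scale (size) parameter k; symbols assumed
P(Y = y \mid \mu, k)
  = \frac{\Gamma(y + k)}{\Gamma(k)\, y!}
    \left(\frac{k}{k + \mu}\right)^{k}
    \left(\frac{\mu}{k + \mu}\right)^{y},
  \qquad y = 0, 1, 2, \dots

% Log-density, grouped the way an exponential family would require
\log P(Y = y \mid \mu, k)
  = y \log\frac{\mu}{\mu + k}
  + k \log\frac{k}{\mu + k}
  + \log\Gamma(y + k) - \log\Gamma(k) - \log y!

% Natural parameter (and hence canonical link) for fixed k
\theta = \log\frac{\mu}{\mu + k},
\qquad
\operatorname{Var}(Y) = \mu + \frac{\mu^{2}}{k}
```

For a fixed, known k, the density is a one-parameter exponential family in \theta, but \theta itself is a function of k, so the canonical link changes with the scale. When k is unknown, the term \log\Gamma(y + k) ties y and k together in a way that cannot be written as a product of a statistic of y and a function of the parameters, which is why the two-parameter family falls outside the exponential family.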