Study Highlights Potential Gene Expression Data Analysis Bias

November 16, 2019

New research in PLOS Biology describes how RNA sequencing (RNA-seq) analysis can be affected by a technical bias that leads to recurrent false results. The authors detected a prevalent sample-specific length effect that leads to a strong association between gene length and fold-change estimates between samples. They report that “This stochastic sample-specific effect is not corrected by common normalization methods.” The study authors are Shir Mandelbaum, Zohar Manber, Orna Elroy-Stein, and Ran Elkon from Tel Aviv University. They added that “Importantly, we demonstrate that this bias causes recurrent false positive calls by gene-set enrichment analysis (GSEA) methods, thereby leading to frequent functional misinterpretation of the data.”

RNA-seq is one of the most widely used techniques in biological and biomedical research and is used for tasks such as uncovering transcriptional networks as well as diagnostic and prognostic expression signatures of disease or drug response.

A critical step in RNA-seq analysis is data normalization, which aims to remove systematic effects from the data to ensure that technical biases have minimal impact on results. Mandelbaum and colleagues noticed a problem with this process after analyzing dozens of publicly available RNA-seq datasets, which profiled cellular responses to numerous stresses. They found that sets of particularly short or long genes repeatedly showed changes in expression level. They noted that “Gene sets characterized by markedly short genes (e.g., ribosomal protein genes) or long genes (e.g., extracellular matrix genes) are particularly prone to such false calls.”

Importantly, the effect was not corrected by common normalization methods, including reads per kilobase of transcript length per million reads (RPKM), Trimmed Mean of M values (TMM), relative log expression (RLE), and quantile and upper-quartile normalization.

The study authors wondered whether this phenomenon reflected a biological response common to many stressors or stemmed from an experimental artifact. To answer this question, they compared replicate samples that had undergone similar testing. They were surprised to find the same pattern of particularly short or long genes showing changes in expression level in these comparisons. Because replicates are not expected to differ biologically, this suggests that the phenomenon is the result of a technical bias related to gene length.
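One way to picture this kind of replicate check is to correlate gene length with the fold change between two samples that should not differ. The sketch below is only an illustration, not the authors' procedure: the function names, the pseudocount, and the synthetic counts (including the hypothetical length exponent of 1.1 used to simulate a sample-specific effect) are all assumptions.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def length_bias(counts_a, counts_b, lengths_bp, pseudocount=1.0):
    """Correlation between log2 gene length and log2 fold change (b versus a)."""
    log_len = [math.log2(length) for length in lengths_bp]
    log_fc = [
        math.log2((b + pseudocount) / (a + pseudocount))
        for a, b in zip(counts_a, counts_b)
    ]
    return pearson(log_len, log_fc)


# Synthetic data (assumption): 2000 genes with identical true expression
# in every replicate, so any length trend in fold change is technical.
random.seed(0)
n_genes = 2000
lengths = [int(2 ** random.uniform(8, 14)) for _ in range(n_genes)]
abundance = [2 ** random.uniform(0, 6) for _ in range(n_genes)]

rep_a = [e * l for e, l in zip(abundance, lengths)]
# Replicate B: reads scale slightly more steeply with length (hypothetical effect).
rep_b = [e * l ** 1.1 * 2 ** random.gauss(0, 0.3) for e, l in zip(abundance, lengths)]
# Replicate C: noise only, no length effect.
rep_c = [e * l * 2 ** random.gauss(0, 0.3) for e, l in zip(abundance, lengths)]

r_biased = length_bias(rep_a, rep_b, lengths)
r_unbiased = length_bias(rep_a, rep_c, lengths)
print(f"A vs B: r = {r_biased:.2f}; A vs C: r = {r_unbiased:.2f}")
```

In this toy setup the A versus B comparison shows a clear positive correlation while A versus C stays near zero, which mirrors the idea that a length trend between replicates points to a technical rather than biological cause.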

The authors report that this sample-specific length bias is effectively removed by the conditional quantile normalization (cqn) and EDASeq methods, which allow the integration of gene length as a sample-specific covariate. In their study, using these normalization methods led to a substantial reduction in GSEA false results while retaining true ones. In addition, they found that applying gene-set tests that take gene–gene correlations into account attenuated the false positive rates caused by the length bias, but also reduced statistical power.
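To give a feel for what treating gene length as a sample-specific covariate means, here is a deliberately simplified sketch. It is not the cqn or EDASeq algorithm (both are R/Bioconductor packages with more flexible models); it only fits a straight line of log expression against log length within one sample and subtracts that sample's own trend. The function names and the example values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def remove_length_trend(log_expr, log_len):
    """Subtract this sample's fitted linear length trend, keeping its mean level."""
    slope, _ = fit_line(log_len, log_expr)
    mean_len = sum(log_len) / len(log_len)
    return [y - slope * (x - mean_len) for x, y in zip(log_len, log_expr)]


# Hypothetical log2 gene lengths and log2 expression values for one sample.
log_len = [8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
sample_log_expr = [4.7, 4.6, 5.1, 5.3, 5.2, 5.8]

corrected = remove_length_trend(sample_log_expr, log_len)
residual_slope, _ = fit_line(log_len, corrected)
print(f"slope after correction: {residual_slope:.3g}")
```

Because the fit is done separately for each sample, a length trend that differs from sample to sample is removed before fold changes are computed. A linear fit is a crude stand-in chosen for brevity; it would also flatten any length trend shared by all samples, which a real pipeline would handle more carefully.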

The authors note that “A well-known inherent technical effect in RNA-seq experiments relates to gene length and stems from the fact that in standard RNA-seq protocols, RNA (or cDNA) molecules are fragmented prior to sequencing in such a way that longer transcripts are sheared into more fragments than shorter ones are. Therefore, the number of reads for a given transcript is proportional not only to its expression level but also to its length.”
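The proportionality between read count and transcript length is what RPKM is designed to undo. As a minimal sketch, assuming made-up gene names and counts, the standard RPKM formula divides each count by transcript length in kilobases and by the sample's total reads in millions:

```python
def rpkm(read_counts, gene_lengths_bp):
    """Reads per kilobase of transcript per million mapped reads, for one sample."""
    total_millions = sum(read_counts.values()) / 1e6
    return {
        gene: count / (gene_lengths_bp[gene] / 1e3) / total_millions
        for gene, count in read_counts.items()
    }


# Hypothetical example: two genes at the same expression level, so the
# ten-fold longer gene yields ten times as many reads.
lengths = {"short_gene": 500, "long_gene": 5000}
counts = {"short_gene": 100, "long_gene": 1000}

values = rpkm(counts, lengths)
print(values)
```

Both genes end up with the same RPKM, so the global length effect is removed within a sample. Note, however, that the same length divisor is applied in every sample and therefore cancels in any between-sample ratio, which is consistent with the report that RPKM does not correct a length effect that varies from sample to sample.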

They then conclude that “Our results advocate the inspection and correction of sample-specific length biases as default steps in RNA-seq analysis pipelines and reiterate the need to account for intergene correlations when performing gene-set enrichment tests to lessen false interpretation of transcriptomic data.”