RNA-Seq Research Expands Human Genome by Nearly 5K Genes

May 29, 2018

Estimates of the number of genes in the human genome have been trending downward: from 50,000–100,000 (a figure widely cited in the years preceding the Human Genome Project), to 20,000–25,000 (a figure proposed in 2004), to as few as 19,000–20,000 (a range reflecting recent surveys that have made use of improved analytical technology). A brand-new survey, however, indicates that the estimated gene count should rise. This survey, from researchers based at Johns Hopkins University, puts the count at 43,162 genes, of which 21,306 are protein-coding genes and 21,856 are noncoding genes.

These numbers are of more than academic interest. They establish a baseline that is used to orient genetic studies of many kinds. These include exome sequencing projects, genome-wide association studies, gene expression analyses, and efforts to identify disease-causing mutations.

The new findings were generated by the computational biology lab of Johns Hopkins’ Steven L. Salzberg, Ph.D., a professor of biomedical engineering, computer science, and biostatistics. Dr. Salzberg and colleagues presented their findings May 28 on the bioRxiv website, in an article titled “Thousands of large-scale RNA sequencing experiments yield a comprehensive new human gene list and reveal extensive transcriptional noise.”

The authors of this article noted that estimates of the number of human genes have not only been shrinking but have also been stated as narrower ranges, indicating growing precision. Although the new survey departs from the trend toward a shrinking human gene catalog, it sustains the commitment to precision, mainly by carefully distinguishing between genes that encode functional proteins and those that encode nonfunctional proteins, but also by painstakingly separating experimental signals from potentially confounding noise.

“Our total gene count,” noted the study, “corresponds to the total number of distinct chromosomal intervals, or loci, that encode either proteins or noncoding RNAs; in addition, we report the total number of gene variants, which includes all alternative transcripts expressed at each locus.” After assembling sequences from 9,795 RNA sequencing experiments, collected from 31 human tissues and hundreds of subjects as part of the Genotype-Tissue Expression (GTEx) project, Dr. Salzberg’s team counted a total of 323,824 transcripts, for an average of 7.5 transcripts per gene.
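The headline figures can be cross-checked with simple arithmetic. The short Python sketch below uses only numbers quoted in this article; the variable names are illustrative and are not drawn from the study, and the calculation is a reader's sanity check rather than the authors' method.

```python
# Figures as quoted in this article; variable names are illustrative only.
protein_coding_genes = 21_306
noncoding_genes = 21_856
total_transcripts = 323_824

# The total gene count is the sum of the coding and noncoding loci.
total_genes = protein_coding_genes + noncoding_genes

# Average number of transcripts (gene variants) per locus.
transcripts_per_gene = total_transcripts / total_genes

print(f"Total genes: {total_genes:,}")
print(f"Average transcripts per gene: {transcripts_per_gene:.1f}")
```

Under these inputs, the two gene categories sum to the reported total of 43,162, and the average rounds to the 7.5 transcripts per gene cited above.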

“Our expanded gene list includes 4,998 novel genes (1,178 coding and 3,819 noncoding) and 97,511 novel splice variants of protein-coding genes as compared to the most recent human gene catalogs,” the bioRxiv article indicated. “We detected over 30 million additional transcripts at more than 650,000 sites, nearly all of which are likely to be nonfunctional, revealing a heretofore unappreciated amount of transcriptional noise in human cells.”

The authors of the new study allow “that the proper determination of function can be a lengthy, complex process, and that at present the function of many human genes is unknown or only partially understood.” Nonetheless, the authors also insist that their approach has led to a more comprehensive catalog of genes and splice variants, one that should provide a better foundation for RNA-seq experiments, exome sequencing experiments, genome-wide association studies, and many other studies that rely on human gene annotation as the basis for their analysis.

“Although [this catalog] represents only a modest increase in the number of protein-coding genes (1,178, or 5.5% out of 21,306 total),” the article’s authors noted, “it more than doubles the number of splice variants and other isoforms of these genes, to 267,476.” The authors added that their findings suggest that the cell is a fairly inefficient machine, one that transcribes more DNA into RNA than it needs.
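The quoted 5.5% share follows directly from the two counts given. The Python sketch below reproduces it as a rough check; the variable names are illustrative, and treating the remainder as "previously cataloged" genes is an assumption based on the article's description of the 1,178 genes as novel relative to recent catalogs.

```python
# Figures as quoted in this article; variable names are illustrative only.
novel_coding_genes = 1_178
protein_coding_genes = 21_306

# Share of the protein-coding list made up of novel genes.
novel_share_percent = novel_coding_genes / protein_coding_genes * 100

# Assumption: every protein-coding gene that is not novel was already
# present in the earlier catalogs used for comparison.
previously_cataloged = protein_coding_genes - novel_coding_genes

print(f"Novel share of protein-coding genes: {novel_share_percent:.1f}%")
print(f"Previously cataloged protein-coding genes: {previously_cataloged:,}")
```

On that assumption, roughly 20,000 protein-coding genes were already cataloged, which is broadly in line with the recent estimates mentioned at the top of this article.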

“Based on the results described here,” the authors concluded, “it appears that nearly 99% of the transcriptional variety produced in human cells has no apparent function, although most of these variants appear at such low levels that they cumulatively account for only 32% of transcriptional volume.”