An online registry of over 1,200,000 human and mouse candidate functional elements that regulate genes has been generated as part of the third phase of the global ENCyclopedia of DNA Elements (ENCODE) project. The latest ENCODE findings, reported in a collection of 14 papers in Nature, Nature Methods and Nature Communications, provide new insights into genome organization and function. In an overview paper in Nature titled “Expanded encyclopaedias of DNA elements in the human and mouse genomes,” Zhiping Weng, PhD, and colleagues describe the generation of nearly 6,000 new experiments (4,834 involving human samples and 1,158 involving mouse samples) that extend the earlier phases of ENCODE.
“When the first draft of the human genome was completed . . . it became immediately clear that while we had the primary sequence of the genome, or we had a draft of it . . . we needed to have an annotation for the genome,” said Cold Spring Harbor Laboratory professor Thomas Gingeras, PhD, whose team has been contributing to the ENCODE project since its inception. “We knew where the genes were located. Where the regulatory mechanisms and loci were located was significantly underdeveloped.”
With completion of its latest phase, the ENCODE project has added millions of candidate DNA “switches” from the human and mouse genomes that appear to regulate when and where genes are turned on. “The data generated in ENCODE 3 dramatically increase our understanding of the human genome,” said Brenton Graveley, PhD, professor and chair of the Department of Genetics and Genome Sciences at UConn Health. “The project has added tremendous resolution and clarity for previous data types, such as DNA-binding proteins and chromatin marks, and new data types, such as long-range DNA interactions and protein-RNA interactions.”
The online registry comprises 926,535 human and 339,815 mouse candidate cis-regulatory elements (regions of non-coding DNA that regulate the transcription of genes), covering 7.9% and 3.4% of their respective genomes. A web-based tool called SCREEN allows users to visualize the data supporting these interpretations. “There are 3 billion base pairs in our genome and not every one of them has a known function,” said Weng, the Li Weibo Chair in Biomedical Research, professor of biochemistry & molecular pharmacology and director of the Program in Bioinformatics & Integrative Biology at the University of Massachusetts Medical School. “Identifying and annotating the specific regions of DNA that help control our genes is key to understanding the complexity of the genome and how it works … If our genome is like a car, then the protein coding part of the car is the engine. It propels us forward. How we control and make use of that engine – accelerating, turning, braking – is controlled by other mechanisms. In the genome, one family of these mechanisms is the cis-regulatory elements that promote and enhance, turn on or off, and fine-tune our genes.”
ENCODE is funded by the National Human Genome Research Institute (NHGRI), part of the National Institutes of Health (NIH). NHGRI director, Eric Green, MD, PhD, commented, “A major priority of ENCODE 3 was to develop means to share data from the thousands of ENCODE experiments with the broader research community to help expand our understanding of genome function. ENCODE 3 search and visualization tools make these data accessible, thereby advancing efforts in open science.”
Whereas much of the research during the previous phases had been conducted using model cell lines, the latest phase includes 503 cell or tissue types from more than 1,369 biological sample sources, Weng et al. wrote. “… phase III of the Encyclopedia of DNA Elements (ENCODE) Project has expanded analysis of the cell and tissue repertoires of RNA transcription, chromatin structure and modification, DNA methylation, chromatin looping, and occupancy by transcription factors and RNA-binding proteins … Collectively, the ENCODE data and registry provide an expansive resource for the scientific community to build a better understanding of the organization and function of the human and mouse genomes.”
Other papers in the published collection describe use of the ENCODE datasets to reveal the principles that govern how some of these functional elements work. For example, Michael Snyder and colleagues mapped the interactions of chromatin in 24 human cell types and found that differences in chromatin looping between cell types can affect gene expression. “Maps of 3D chromatin interactions have become increasingly useful in explaining how distal regulatory elements can exert their influence,” Snyder et al. noted. “Here, we have demonstrated how knowledge of cell-type-specific interactions can further expand the utility of such maps … This atlas of chromatin loops complements the diverse maps of regulatory architecture that comprise the ENCODE Encyclopedia, and will help to support emerging analyses of genome structure and function.”
The ENCODE researchers also used mice to explore the role of cis-regulatory elements during prenatal mammalian development. Three papers published in Nature present information on fetal mice during eight developmental stages. In two of these papers, Bing Ren, PhD, and colleagues, and Joseph Ecker, PhD, and colleagues, demonstrate that the human equivalents of some enhancer elements (regulatory DNA sequences that can enhance the transcription of a gene) that are active in specific tissues in mice during development are enriched for disease-associated genetic variants. The findings may provide a starting point for the investigation of regulatory elements involved in human developmental disorders.
“The mouse ENCODE data sets provide a compendium of resources for biomedical researchers and achieve, to our knowledge, the most comprehensive view of chromatin dynamics during mammalian fetal development to date,” Ren et al. noted in their paper.
Ecker and colleagues discovered more than 1.8 million regions of the mouse genome that had variations in methylation based on tissue, developmental stage or both. Commenting on their work, the researchers concluded, “Overall, we present, to our knowledge, the most comprehensive set of temporal fetal tissue epigenome mapping data available in terms of the number of developmental stages and tissue types investigated, expanding upon the previous phase of the mouse ENCODE project, which focused exclusively on adult mouse tissues … These spatiotemporal epigenome maps provide a resource for studies of gene regulation during tissue or organ progression, and a starting point for investigating regulatory elements that are involved in human developmental disorders.”
The data could help to narrow down regions of the human genome that play roles in diseases such as schizophrenia and Rett syndrome. Howard Hughes Medical Institute Investigator Ecker, a professor in Salk’s Genomic Analysis Laboratory, further explained, “This is the only available data set that looks at the methylation in a developing mouse over time, tissue by tissue. It’s going to be a valuable resource to help in narrowing down the causal tissues of human developmental diseases.” First author Yupeng He, PhD, who was previously a Salk postdoctoral research fellow and is now a senior bioinformatics scientist at Guardant Health, added, “The breadth of samples that we applied this technology to is what’s really key. We think that the removal of methylation makes the whole genome more open to dynamic regulation during development. After birth, genes critical for early development need to be more stably silenced because we don’t want them turned on in mature tissue, so that’s when methylation comes in and helps shut down the early developmental enhancers.”
Gingeras’s team is investigating genome elements that instruct cells about how and when to transcribe DNA sequences into RNA. In a companion publication to the ENCODE report, a team led by Gingeras and collaborator Roderic Guigó, at the Centre for Genomic Regulation, details work identifying molecular fingerprints that can be used to distinguish five groups of human cells. “Our work redefines, based on gene expression, the basic histological types in which tissues have been traditionally classified,” Guigó said. In their paper, the authors stated, “Our analyses suggest that a large fraction of human cells and cell types in tissues belong to a few major cell types, providing a high-level transcriptionally based hierarchical classification of human cells … the data collected here on the transcriptomics of human primary cells, and the links that we have established between these data and the phenotypic traits of organs constitute a unique resource, serving as an intermediate resolution of complexity between single-cell and whole-organ transcriptomics.”
Established in 2003, the ENCODE project is a worldwide effort to understand how the human genome functions. The human body is composed of trillions of cells, with thousands of types of cells. While all these cells share a common set of DNA instructions, the diverse cell types (e.g., heart, lung and brain) carry out distinct functions by using the information encoded in DNA differently. The DNA regions that act as switches to turn genes on or off, or tune the exact levels of gene activity, help drive the formation of distinct cell types in the body and govern their functioning in health and disease.
The researchers aim to develop a comprehensive map of the functional elements—regions of DNA that code for molecular products or biochemical activities with roles in gene regulation—of the human and mouse genomes. “The human genome comprises a vast repository of DNA-encoded instructions that are read, interpreted, and executed by the cellular protein and RNA machinery to enable the diverse functions of living cells and tissues,” Weng and colleagues noted. “The ENCODE Project aims to delineate precisely and comprehensively the segments of the human and mouse genomes that encode functional elements.”
The global project is founded on extensive collaborative research involving groups across the US and internationally, comprising over 500 scientists with diverse expertise. It has benefited from and built upon decades of research on gene regulation performed by independent researchers around the world. ENCODE teams have created a community resource to make the project’s data accessible to scientists everywhere. These efforts in open science have resulted in over 2,000 publications from non-ENCODE researchers who used data generated by the ENCODE Project. “This demonstrates that the encyclopedia is widely used, which is what we had always aimed for,” said Elise Feingold, PhD, scientific advisor for strategic implementation in the Division of Genome Sciences at NHGRI and a lead on ENCODE for the institute. “Many of these publications are related to human disease, attesting to the resource’s value for relating basic biological knowledge to health research.”
As part of phase III of ENCODE, and to assess the potential functions of different DNA regions, researchers studied biochemical processes that are typically associated with the switches that regulate genes. This biochemical approach is an efficient way to explore the entire genome rapidly and comprehensively. This method helps to locate regions in the DNA that are “candidate functional elements”—DNA regions that are predicted to be functional elements based on these biochemical properties. Candidates can then be tested in further experiments to identify and characterize their functional roles in gene regulation.
Significant progress has been made in characterizing protein-coding genes, which comprise less than 2% of the human genome. Researchers know much less about the remaining 98% of the genome, including how much and which parts of it perform other functions. ENCODE is helping to fill in this significant knowledge gap. “A key challenge in ENCODE is that different genes and functional regions are active in different cell types,” explained Feingold. “This means that we need to test a large and diverse number of biological samples to work towards a catalog of candidate functional elements in the genome.”
During the third phase of ENCODE, researchers performed nearly 6,000 experiments (4,834 in humans and 1,158 in mice) to uncover details of the genes and their potential regulators in their respective genomes. The ENCODE 3 researchers studied developing embryonic mouse tissues to understand the timeline of various genomic and biochemical changes that occur during mouse development. Mice, due to their genomic and biological similarity to humans, can help to inform our understanding of human biology and disease.
These experiments in humans and mice were carried out in several biological contexts. Researchers analyzed how chemical modifications of DNA, proteins that bind to DNA, and RNA (a sister molecule to DNA) interact to regulate genes. Results from ENCODE 3 can thus help to explain how variations in DNA sequences outside of protein-coding regions can influence the expression of genes, even genes located far away from a specific variant itself.
“Across multiple data types, the increase in the scale of experimental data has provided new insights into genome organization and function, and catalyzed new capabilities for deriving biological understandings and principles …” Weng and collaborators noted in their paper in Nature. “Ultimately, we anticipate that the ENCODE Encyclopedia will help researchers to decode the molecular mechanisms that underpin the genetic bases of human traits and diseases.” Gingeras further concluded, “This encyclopedia is a living resource. It has a beginning but really no end. It will continue to be improved, and grown, as time goes on.”
In an accompanying perspective, also published in Nature, Snyder and colleagues note that elements that govern genome control and function are densely encoded in the human genome. However, despite the discovery of a large number of these elements, many elements that affect particular cell types or states remain to be identified. As part of ENCODE phase IV, considerable effort is being devoted to expanding the cell types and tissues analyzed.
“Importantly, although very large numbers of noncoding elements have been defined, the functional annotation of ENCODE-identified elements is still in its infancy,” Snyder and colleagues noted. “High-throughput reporter-based assays, CRISPR-based genome and epigenome editing methods, and other high-throughput approaches are being used in the current phase of ENCODE to assess the functions of many thousands of elements and to relate those functional results to their biochemical signatures. These targeted functional assays, combined with the large-scale annotation of biochemical features, should further enhance the value of ENCODE data.”