Teaching Machines to Sift Big Data

June 26, 2017

Kristen Slawinski, Ph.D.

Biodata gold is all around us, but it is hard to find, distributed as it is in traces throughout a vast biodata landscape, one that has scarcely been explored properly, even as it continues to be reshaped by next-generation sequencing, the effects of which are, well, tectonic. Thus far, biodata prospecting has been limited to isolated deposits, to independent claims staked here and there.

It is about to become more systematic, however, now that powerful biodata tools and techniques are becoming more widely available.

The challenge now is to sift through the fast-accumulating petabytes of biodata, extracting the 24-karat nuggets while filtering out the fool’s gold. Going forward, ambitious data mining efforts are bound to enrich drug development in areas such as target identification and clinical trial design. Data mining will also demonstrate clinical value, from diagnostics to patient stratification to the prediction of patient responses to therapeutic interventions.

At the genomics capital of the world, San Diego, bioinformaticians, biostatisticians, CEOs, and data scientists gathered to discuss and present innovations in biological data management at the second annual Next-Generation Sequencing (NGS) Data Analysis & Informatics Conference. The conference expanded beyond issues of NGS data management to address general challenges in Big Biodata processing.

Machine learning, artificial intelligence (AI), and cloud processing were themes at the event, which kept a tight focus on the field of cancer genomics. The early link between cancer and genomic aberrations has resulted in a glut of omics studies stored as discrete datasets or collected into inconvenient databases.

Several years ago, after several large-scale collaborations left Pfizer with a large amount of sequencing data to analyze, the company saw the need for a platform that combined publicly available cancer genome and transcriptome datasets. OASIS, an open-access cancer genomics web portal, was Pfizer’s solution.

“Raw omics data is difficult to analyze,” said Julio Fernández Banet, Ph.D., principal scientist on Pfizer’s oncology research team. “Our aim in developing OASIS was to minimize barriers for scientists to access trusted omics datasets and, as easily as possible, perform ad hoc analyses to answer their questions.”

The OASIS platform integrates numerous datasets from Pfizer collaborations and established sources such as The Cancer Genome Atlas (TCGA), the Cancer Cell Line Encyclopedia (CCLE), and the Genotype-Tissue Expression (GTEx) project. The portal includes sample-level annotations and gene-level mutation, copy number variation, and expression data on tens of thousands of samples across dozens of cancers and tissues.

Not only can the data be easily and freely mined, but OASIS also has built-in analytics. Researchers can plot expression data and compare it with sample mutation status, find cell lines with a particular mutation of interest, and generate reports summarizing alterations for a list of genes across multiple cancer types. In version 3.0, expected in the second half of this year, OASIS will also include proteomic data, an enhancement that will help connect genetic aberrations with protein translation.

As more relevant connections are made among biological data, more meaningful conclusions can be drawn to drive drug discovery and development. Large datasets are a starting point for integrating and connecting data; however, critical biological information also lies buried within millions of scientific publications. Accessing and linking this information requires an intelligence beyond human processing.

Data4Cure’s all-inclusive bioinformatics platform, the Biomedical Intelligence Cloud, is powered by a dynamic graphical knowledge base named CURIE, which uses advanced machine learning to automatically mine molecular datasets as well as published texts.

“CURIE combines a variety of technologies including bioinformatics, machine learning, and natural language processing, as well as a proprietary semantic AI integration engine to continuously extract and integrate knowledge from multiple sources,” explained Data4Cure’s CEO, Janusz Dutkowski, Ph.D. “The knowledge graph can also be automatically expanded with user’s proprietary data in a private cloud mode, an important feature for the pharmaceutical industry.”

The CURIE graph is continually expanded and integrates omics data, methylation profiles, genotype/phenotype associations, biomarker databases, and even patient sample metadata from clinical trials. The graph currently spans more than 150,000 nodes, each representing a different biological entity or property, and 100 million relationships among them. Relationships are strengthened when data is supported by multiple sources or agrees with known pathways, which improves the model’s predictive power.
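The idea that a relationship gains strength as independent sources support it can be illustrated with a minimal Python sketch. This is not Data4Cure’s implementation: the class, its method names, and the example entities and source labels are hypothetical, and the weight here is assumed to be nothing more than a count of distinct supporting sources.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Toy undirected graph whose edges track which sources support them."""

    def __init__(self):
        # Maps a sorted (entity, entity) pair to the set of supporting sources.
        self._support = defaultdict(set)

    def add_evidence(self, a, b, source):
        """Record that `source` supports a relationship between a and b."""
        self._support[tuple(sorted((a, b)))].add(source)

    def weight(self, a, b):
        """Number of distinct sources supporting the a-b relationship."""
        return len(self._support.get(tuple(sorted((a, b))), ()))


# Illustrative usage with made-up source labels.
graph = KnowledgeGraph()
graph.add_evidence("TP53", "MDM2", "publication_1")
graph.add_evidence("MDM2", "TP53", "expression_dataset_A")
graph.add_evidence("TP53", "MDM2", "publication_1")  # repeat adds no new support
print(graph.weight("TP53", "MDM2"))
```

Storing sources in a set means that the same source reported twice does not inflate the weight; a production system would presumably also weigh source quality and pathway agreement, which this sketch ignores.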

The robust network of biological information contained within the Biomedical Intelligence Cloud allows users to visualize their favorite biological molecule at a systems level. Even more impressive is the way in which the data can be analyzed to stratify patients into disease subtypes, a capability that can lead to predictions of therapeutic efficacy.

For example, a pathway activity analysis using patient expression data can identify immune cell-infiltrated and noninfiltrated tumor subtypes and associate them with specific genomic alterations. This information can then be used in immuno-oncology clinical trial planning by linking subtypes to treatment response, giving drug developers a better-informed development plan.

Standardizing NGS in the Clinic

In an effort to treat patients using a targeted approach, more and more hospitals are incorporating sequencing programs for patients with diseases in which genetic variation plays a role. Unfortunately, the programs are suffering from a lack of standardization, a deficit that can lead to discrepancies in care among healthcare facilities. Sophia Genetics hopes to change that with a collective AI for genomic data processing and analytics named SOPHiA.

“Our model takes raw NGS data directly from laboratory sequencers and delivers a variant report to the clinician in only a few hours,” said Bernardo J. Foth, Ph.D., senior bioinformatician at Sophia Genetics. “SOPHiA artificial intelligence analyzes raw NGS data to obtain an accurate variant detection report for each lab independently from the technologies they use.

“Some variants are fairly easy to detect, like SNPs. More difficult are variants with a string of base-pair repeats or long indels [DNA deletions or insertions] on the order of kilobases. Our platform uses specific modules to recognize variants that are more challenging to detect, giving clinicians a comprehensive and reliable report.”

SOPHiA was trained early on with data from hospital collaborations and publicly available NGS datasets and is continuously learning from experts using it. An example of the AI’s remarkable ability to detect challenging variants is seen in the resolution of long base-pair repeats within the gene responsible for cystic fibrosis, CFTR.

The CFTR gene has 11 thymine-guanine (TG) repeats followed by 7 T’s. Changes in these regions can be clinically relevant and must be resolved for a proper diagnosis. SOPHiA’s algorithm identifies the noise inherent in the raw NGS data, which is found in both the reference genome and the variant sample, and is able to determine the CFTR sequence with 99.99% sensitivity.
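To make the repeat structure concrete, the following Python sketch counts the TG repeats and the T’s that follow them in a clean, error-free string. It is a toy illustration only, not SOPHiA’s algorithm, which has to work on noisy raw reads; the function name and the flanking bases in the example are invented for illustration and are not the real CFTR sequence.

```python
import re

# One or more "TG" units immediately followed by one or more "T" bases.
TRACT_PATTERN = re.compile(r"((?:TG)+)(T+)")


def count_tg_t_tract(sequence):
    """Return (tg_repeats, t_count) for the longest TG/T tract, or None."""
    longest = max(
        TRACT_PATTERN.finditer(sequence.upper()),
        key=lambda match: match.end() - match.start(),
        default=None,
    )
    if longest is None:
        return None
    tg_repeats = len(longest.group(1)) // 2  # each repeat unit is 2 bases
    t_count = len(longest.group(2))
    return tg_repeats, t_count


# Hypothetical flanking bases around an 11-TG, 7-T tract.
example = "ACGA" + "TG" * 11 + "T" * 7 + "AACAG"
print(count_tg_t_tract(example))
```

On real sequencing data, homopolymer and dinucleotide tracts like this one are prone to read errors, which is why a simple pattern match is not enough and a noise model is needed.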

Accurate and automated NGS reporting is only half the battle for clinicians to direct treatment decisions. They must also be informed of what it means for a patient to harbor a particular variant. Sophia Genetics’ platform helps to link gene variants with pathogenicity by sharing knowledge among the 250 institutions that are part of its community. Clinical classifications for gene variants are assigned by users and shared within the system, giving clinicians more confidence when using sequencing data to diagnose and treat patients.

Another resource to help with interpretation of cancer variants is the publicly available Clinical Knowledgebase (CKB) from The Jackson Laboratory (JAX). The database contains gene and variant descriptions, drug indication statuses, clinical trial information, treatment approaches, and efficacy evidence related to oncology.

“Curation of the Clinical Knowledgebase is semiautomated and undergoes careful review by our dedicated curation team,” noted Guru Ananda, Ph.D., scientist in the computational sciences group at JAX. “The tool provides connections among variant information, therapies, and literature-based efficacy evidence to justify why a specific variant is associated with a particular therapy.”

The CKB was originally developed as a resource for cancer gene panel assays offered by JAX, assays that analyze genes identified as clinically relevant or associated with response or resistance to FDA-approved targeted therapies. However, the database is a convenient tool for all oncologists to connect variants with appropriate therapies for an informed treatment plan.

High-Performance Computing of Big Biodata

Petabytes of data stored as 1’s and 0’s are generated in clinical NGS programs. Analyzing that much data calls for a high-performance computing (HPC) environment and an optimized set of computing instructions, referred to as a bioinformatic pipeline. If not properly organized, complex pipelines will compute at a snail’s pace, debilitating NGS data analysis.

As its participation in clinical programs grew, JAX soon discovered a need for a computing framework that allowed for the development of multiple, easily maintainable, robust, and reproducible bioinformatics pipelines. In response, the research institution developed Civet, an open-source framework for automation of bioinformatic analyses within an HPC environment.

Civet allows pipelines to be easily created and modified while automatically optimizing the schedule of computing steps, such as aligning sequencing data to a reference genome or determining copy number variations among sequencing samples. In addition, the framework incorporates traceability logs exceeding regulatory standards to ensure that files and libraries used in the execution of each pipeline are tracked and accessible.
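Scheduling pipeline steps generally comes down to ordering them by their dependencies so that independent steps can run at the same time. The Python sketch below shows that general idea with a batch-wise topological sort. It does not use Civet’s actual API or pipeline format; the function name and the step names are hypothetical.

```python
def schedule_batches(dependencies):
    """Group steps into batches; steps within a batch can run in parallel.

    `dependencies` maps each step to the set of steps it must wait for.
    """
    # Copy the input and add steps that appear only as prerequisites.
    remaining = {step: set(prereqs) for step, prereqs in dependencies.items()}
    for prereqs in dependencies.values():
        for prereq in prereqs:
            remaining.setdefault(prereq, set())

    batches = []
    while remaining:
        # A step is ready once all of its prerequisites have been scheduled.
        ready = sorted(step for step, prereqs in remaining.items() if not prereqs)
        if not ready:
            raise ValueError("circular dependency between pipeline steps")
        batches.append(ready)
        for step in ready:
            del remaining[step]
        for prereqs in remaining.values():
            prereqs.difference_update(ready)
    return batches


# Hypothetical pipeline: alignment feeds two independent analyses.
pipeline = {
    "align": {"trim_reads"},
    "call_variants": {"align"},
    "copy_number": {"align"},
    "report": {"call_variants", "copy_number"},
}
for batch in schedule_batches(pipeline):
    print(batch)
```

In this example, variant calling and copy number analysis land in the same batch because both depend only on alignment, which is the kind of parallelism a scheduler on an HPC cluster can exploit.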

While Civet optimizes bioinformatic pipelines, helping to achieve faster data processing, processing speed is ultimately determined by the performance of the HPC environment. HPC uses a network of parallel processors that work on multiple bioinformatic jobs within the pipeline simultaneously. Performance is determined by how quickly the pipeline is able to access files located on network-attached storage (NAS). The NAS may be remote to the HPC environment, causing a significant access time delay known as latency. Latency is one of the most substantial problems in the computing of Big Data over the cloud.

Cloud computing uses a network of remote servers hosted on the internet to store, manage, and process data, rather than a local server. If all data is stored and processed on the cloud, latency is low. However, this method requires the uploading of massive amounts of data and storage on a third-party vendor cloud system.

Another method to limit latency is to build on-premises storage and computing facilities, though this is very costly, and such facilities struggle to keep up with a growing amount of data. Avere Systems proposes a cloud hybrid model in which data in on-premises NAS is sent to the cloud for processing. Typically, sending data to the cloud increases latency due to the physical distance between the NAS and cloud processors, but Avere Systems has a solution for that.

“We use a technique called bursting to solve the problem with latency,” explained Scott Jeschonek, director of cloud products at Avere Systems. “Bursting is a type of read-ahead caching using algorithms that learn the access patterns of data transfer from on-premises NAS to the cloud and only transfer relevant data when needed during processing.”
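The basic mechanics of read-ahead caching can be sketched in a few lines of Python. The version below uses the simplest possible policy, fetching a fixed number of blocks beyond the one requested; it is not Avere Systems’ technique, which, as described above, learns access patterns. The class name, the `fetch_block` callback, and the block labels are hypothetical, and the prefetch runs synchronously here, whereas a real system would presumably do it in the background.

```python
class ReadAheadCache:
    """Toy cache that prefetches the next few blocks after each read."""

    def __init__(self, fetch_block, read_ahead=2):
        self._fetch_block = fetch_block  # stands in for a slow remote read
        self._read_ahead = read_ahead
        self._cache = {}                 # unbounded; no eviction in this sketch
        self.hits = 0
        self.misses = 0

    def _load(self, index):
        self._cache[index] = self._fetch_block(index)

    def read(self, index):
        """Return block `index`, then prefetch the blocks that follow it."""
        if index in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._load(index)
        for ahead in range(index + 1, index + 1 + self._read_ahead):
            if ahead not in self._cache:
                self._load(ahead)
        return self._cache[index]


# Sequential reads: only the first request has to wait on remote storage.
cache = ReadAheadCache(lambda i: f"block-{i}", read_ahead=2)
for i in range(6):
    cache.read(i)
print(cache.hits, cache.misses)
```

With a sequential access pattern, every read after the first is served from the cache, which is how read-ahead hides the round trip to distant storage. Random access patterns would defeat this fixed policy, which is one reason a learned policy is attractive.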

Avere Systems has optimization solutions for any on-premises or cloud-computing model; however, the hybrid cloud model is an appealing option as it allows for on-site data storage and the cost-saving benefits of cloud computing.

This story originally appeared in the May 15, 2017 edition of Genetic Engineering & Biotechnology News (GEN).