A San Francisco startup is partnering with the NIH to address a key challenge posed by the agency collecting the largest set of COVID-19 patient records since June: How can access to that repository be broadened for researchers without compromising the privacy of patients who contributed all that data?
Syntegra plans to tackle that challenge by applying its synthetic data engine to the NIH’s National COVID Cohort Collaborative (N3C). The company uses machine learning to create validated “synthetic data”: replicas of healthcare data designed to precisely duplicate the statistical properties of the original records, with patient privacy protected by removing all links to those records. Syntegra markets its algorithm to a customer base that includes large health systems, life science companies, insurance providers, data scientists, and clinical research organizations.
Syntegra said this week that it will generate and validate a non-identifiable synthetic version of the entire N3C dataset: All 2.6 billion rows of data collected from more than 2.7 million screened individuals, including over 413,000 COVID-19 positive patients.
“Upfront, what we’re focused on at Syntegra is democratizing healthcare data,” Carter Prince, Syntegra’s Head of Business Development, told GEN. “We’re learning from an entire dataset and all the billions of relationships between every data point within an underlying dataset. And that allows us, using our own advanced machine learning techniques, to generate down a new synthetic dataset that has all of the statistical properties of the original dataset.”
The synthetic dataset that Syntegra will generate for the NIH will enable the agency to widen access to N3C data, the agency’s largest available repository of patient-level COVID-19 electronic medical records, as well as lay the foundation for greater access to data for life sciences researchers studying other diseases or drug development.
More than 70 healthcare organizations worldwide have contributed data to N3C, a public-private collaboration supported by the Bill and Melinda Gates Foundation through the COVID-19 Therapeutic Accelerator. The Foundation joined Wellcome and Mastercard in committing $125 million to launch the Accelerator in March 2020.
COVID-Related Pivot
The Foundation has awarded Syntegra contracts of approximately $150,000 and $175,000 toward the development of COVID-19 synthetic data for sharing. The Foundation first connected with Syntegra before the pandemic about creating synthetic versions of clinical trials for its HIV and Maternal, Newborn & Child Health programs, to allow sharing of data without breaching privacy laws.
“When COVID hit, both the Gates Foundation and us pivoted towards trying to develop the synthetic COVID data,” Syntegra’s co-founder and CEO, Michael D. Lesh, MD, recalled.
Syntegra was spun out last year from University of California, San Francisco (UCSF), where Lesh is a professor of medicine and formerly served as Executive Director of Health Technology Innovation in the Office of the Vice Chancellor.
Lesh, a cardiac electrophysiologist who once served as UCSF’s Chief of Cardiac Electrophysiology, began as an entrepreneur more than two decades ago by co-founding Atrionix, a company created in 1997 to commercialize a catheter he invented to treat atrial fibrillation. Atrionix was acquired in 2000 by Johnson & Johnson’s Cordis unit—now part of Cardinal Health—for $62.8 million.
After years of co-founding and leading several medical device companies, Lesh returned in 2017 to UCSF to jumpstart its tech commercialization effort, with a focus on data.
“Universities know how to patent and license molecules or devices,” Lesh recalled, “but I thought, There’s all this data in our silos. Wouldn’t it be great if we could just take all that data, which is really where I would say the wisdom of patient care sits, all the interactions between the patients and the health system, and we could share that while upholding our ethical obligation to maintain privacy?”
“I thought about all the stuff that we could learn and new drugs that could be developed,” Lesh continued. “Well, that was a bit of a naive concept at that point because it turns out it’s really, really difficult to share data.”
Privacy Challenge
Privacy laws like the Health Insurance Portability and Accountability Act of 1996 (HIPAA) compelled healthcare stakeholders such as providers and insurers to invest in de-identification technologies that too often failed, leaving them very reluctant to share data.
Lesh confronted the problem by pursuing the idea of applying synthetic data to healthcare. The concept had been used with success in settings ranging from manufacturing to autonomous vehicles tested by engineers using simulated “synthetic” miles.
Lesh credits Syntegra co-founder and Chief Technology Officer Ofer Mendelevitch with developing a way of creating synthetic data for healthcare: “At that point, we decided to start this company and spin it out from UCSF,” Lesh recalled.
Syntegra has received $3.1 million in seed funding from investors including Hike Ventures, Impact Venture Capital, Innovation Global Capital, Village Global, Wisconn Valley Ventures, and Sweat Equity Ventures, as well as other unnamed investors.
Sweat Equity Ventures, established by LinkedIn co-founder Reid Hoffman, also helped Syntegra by supplying approximately $800,000 of in-kind talent, giving it access to software engineers and other professionals without having to hire costly staff early on. Of Syntegra’s roughly 12 people, half are full-time, and the rest were brought on through Sweat Equity.
In addition to the NIH, Syntegra is working with the FDA to assess how synthetic data can be applied both within and beyond COVID-19. That work is still exploratory. It grew out of interest shown by FDA Principal Deputy Commissioner Amy Abernethy, MD, PhD, in the potential of synthetic data to help inform agency decisions on approving drugs for new indications or subpopulations.
Since the start of the pandemic, Lesh said, Abernethy and FDA statisticians have met with Syntegra and engaged with N3C.
“She has been very interested in synthetic data since well before COVID,” Lesh said. “We’ve been meeting with her and the FDA to see whether synthetic data would be something that they could use in compliance and approval.”
“The goal is—and not just for COVID—does this synthetic data work well enough as something that the FDA would consider in their regulatory decisions?”