Pharmaceutical development has a large impact on the nation’s economy and public health. Despite substantial annual spending on pharmaceutical development, many drugs fail in clinical trials, and most of those that reach the market fail to yield a profit. Data science methods promise to address this high-cost, low-return problem. Machine learning, for example, can sift through large and complex datasets and make useful preliminary predictions about which compounds are likely to be effective. A key obstacle, however, is how to pull together insights from many highly restricted pharmaceutical and patient datasets to enable better predictions.

To develop solutions to these problems, Drs. Kayvan Najarian, H. V. Jagadish and Jonathan Gryak were recently awarded a planning grant by the National Science Foundation to establish the Center for Data-Driven Drug Development and Treatment Assessment (DATA), which will produce new methodologies and infrastructure for industry-wide collaborative drug discovery. Partnering with the MIDAS faculty team to develop this center are 25 pharmaceutical companies, data science companies, health organizations, and health informatics companies.

The envisioned Center will focus on three areas of research:

1) developing, testing, and validating cutting-edge machine learning methods
2) providing an industry-wide and vendor-agnostic Secure Data Hub for pharmaceutical and patient data
3) enabling federated machine learning over encrypted databases, a method that can combine insights generated from multiple datasets without having to combine the datasets themselves.
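
To illustrate the federated idea in the third area, here is a minimal sketch of federated averaging on a toy linear model. It is not the Center's method: the function names (`local_update`, `federated_average`, `train`), the learning rate, and the choice of model are all assumptions made for illustration, and no encryption is included. Each site trains on its own private data and shares only model parameters, which a coordinator averages by sample count.

```python
from typing import List, Tuple

Dataset = List[Tuple[float, float]]  # (feature, target) pairs held privately by one site
Params = Tuple[float, float]         # (slope, intercept) of a toy linear model


def local_update(params: Params, data: Dataset,
                 lr: float = 0.05, steps: int = 10) -> Params:
    """Run a few gradient-descent steps on one site's private data."""
    w, b = params
    n = len(data)
    for _ in range(steps):
        # Gradients of the mean squared error for y ~ w * x + b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def federated_average(updates: List[Tuple[Params, int]]) -> Params:
    """Combine site parameters, weighting each site by its sample count."""
    total = sum(n for _, n in updates)
    w = sum(p[0] * n for p, n in updates) / total
    b = sum(p[1] * n for p, n in updates) / total
    return w, b


def train(sites: List[Dataset], rounds: int = 300) -> Params:
    """Hypothetical coordinator loop: raw data never leaves a site."""
    params: Params = (0.0, 0.0)
    for _ in range(rounds):
        updates = [(local_update(params, data), len(data)) for data in sites]
        params = federated_average(updates)
    return params
```

Calling `train` with two small site datasets drawn from the same underlying line should recover roughly the same slope and intercept that pooled training would, even though the coordinator only ever sees parameters.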

The team will take advantage of efficient, fully homomorphic encryption, which allows computation directly on encrypted data, and of the newest computational prediction methods, such as coupled tensor-matrix and tensor-tensor completion.
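
To give a feel for what computing on encrypted data means, the sketch below uses a toy version of the Paillier cryptosystem. This is a deliberately swapped-in, simpler technique: Paillier is only additively homomorphic, not fully homomorphic, and the tiny primes used here offer no security. The function names and parameters are assumptions for illustration and do not describe the Center's implementation.

```python
import math
import random
from typing import Tuple

PublicKey = Tuple[int, int]   # (n, g)
PrivateKey = Tuple[int, int]  # (lam, mu)


def keygen(p: int, q: int) -> Tuple[PublicKey, PrivateKey]:
    """Build a toy Paillier key pair from two distinct primes (insecure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    g = n + 1               # standard simple choice of generator
    mu = pow(lam, -1, n)    # modular inverse; needs Python 3.8+
    return (n, g), (lam, mu)


def encrypt(pub: PublicKey, m: int) -> int:
    """Encrypt an integer 0 <= m < n with fresh randomness."""
    n, g = pub
    n_sq = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq


def decrypt(pub: PublicKey, priv: PrivateKey, c: int) -> int:
    """Recover the plaintext from a ciphertext."""
    n, _ = pub
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


def add_encrypted(pub: PublicKey, c1: int, c2: int) -> int:
    """Multiply ciphertexts to add the underlying plaintexts (mod n)."""
    n, _ = pub
    return (c1 * c2) % (n * n)
```

For example, two parties could each encrypt a count under the same public key, a third party could combine the ciphertexts with `add_encrypted` without seeing either value, and only the private-key holder could decrypt the sum. Fully homomorphic schemes extend this idea to support both addition and multiplication on ciphertexts.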

This project brings together data scientists, mathematicians, biomedical researchers, and healthcare providers to produce reproducible methodologies that will make a broad impact on drug discovery and biomedical applications of data science.

DATA is funded through the NSF Industry-University Cooperative Research Center (IUCRC) program, which enables long-term research partnerships among industry, academia, and government by funding thematic research centers focused on pre-competitive research projects. The initial phase of DATA involves planning the Center’s organizational structure, operating procedures, intellectual property policies, and initial experimental plan.