Researchers develop new technique to identify relationships between medical concepts
A team of researchers from the Department of Veterans Affairs, Oak Ridge National Laboratory, Harvard T.H. Chan School of Public Health, Harvard Medical School and Brigham and Women’s Hospital has developed a new machine learning-based technique to explore and identify relationships between medical concepts using electronic health record data from multiple health care providers.
The method, called Knowledge Extraction via Sparse Embedding Regression, or KESER, was recently published in Nature’s npj Digital Medicine. The process integrates electronic health record data from two major institutions, the Boston-based VA and Partners Healthcare, and provides automated feature selection that leads to phenotype identification algorithms and knowledge discovery.
“KESER provides a high-level view of the relationships between clinical knowledge that we cannot always see when caring for patients on an individual or group level. We look forward to translating the methods and results of the study from applications in clinical research into advances in clinical care.”
Dr. Katherine Liao, KESER Principal Investigator at VA Boston and Associate Professor of Medicine at Harvard Medical School
The project is part of the groundwork on phenomics led by Drs. Kelly Cho and Mike Gaziano of VA Boston and Harvard as part of VA’s Million Veteran Program, or MVP, a “national research program to learn how genes, lifestyle, and military exposures affect health and disease”, according to the VA Office of Research and Development MVP website.
In 2016, ORNL began collaborating with the VA on MVP-CHAMPION, a big data initiative under the MVP program, to create a large precision medicine platform to house VA’s vast data set of medical records, which covers some 24 million veterans. To strengthen cross-cutting innovation in support of many research projects under this joint VA-DOE program, ORNL worked closely with the MVP Data Core from VA Boston and Harvard to identify specific research areas to pursue. Among these was an effort to answer the question: what elements do we need to find in electronic health records to correctly identify a given phenotype?
Working with what they believe to be the largest cohort of health data used for this type of research in the United States, the team set out to automate the identification of phenotypic relationships while providing visibility into the assumptions underlying the machine learning and decision processes.
To do this, they designed and built the KESER methodology in four steps: converting the data into a structured format, constructing a low-dimensional vector representation of each medical code, selecting features and assigning importance to them, and mapping the resulting relationships as a network.
Data processing and representation learning
ORNL played a key role in the tedious but essential work of processing and structuring various medical data (patient procedures, diagnoses and measurements, as well as doctors’ notes, prescription information and more) for millions of patients across the VA and Partners Healthcare.
“There’s a lot of unstructured data processing that’s done before you end up with structured information that can be put into statistical methods,” said Edmon Begoli, head of the ORNL AI Systems section and principal investigator of the MVP-CHAMPION project. “The team spent years working on the data to get it into a state where we could start using it for research.”
With the processed data, the team constructed a co-occurrence matrix made up of more than 100,000 event types, or health care codes. It is essentially a massive but sparse table of data with one row and one column for every possible health care code. Each co-occurrence in time between two events helps create a clearer and more detailed picture of a given phenotype.
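As a rough illustration of what such a table holds, the sketch below counts how often two codes appear close together in time. It is not the project's code: the record layout, the 30-day window, the once-per-patient counting rule and the example codes are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(patient_events, window_days=30):
    """Count pairs of distinct codes occurring within window_days of each other.

    patient_events maps a patient ID to a list of (day, code) tuples.
    Each pair is counted at most once per patient (an assumption), and
    keys are sorted tuples, so each unordered pair is stored only once.
    """
    counts = Counter()
    for events in patient_events.values():
        pairs = set()
        for (day_a, code_a), (day_b, code_b) in combinations(events, 2):
            if code_a != code_b and abs(day_a - day_b) <= window_days:
                pairs.add(tuple(sorted((code_a, code_b))))
        counts.update(pairs)
    return counts


# Hypothetical toy records: patient ID -> [(day offset, code)]
records = {
    "p1": [(0, "DX:RA"), (3, "RX:methotrexate"), (200, "LAB:CRP")],
    "p2": [(10, "DX:RA"), (12, "LAB:CRP"), (15, "RX:methotrexate")],
}

counts = cooccurrence_counts(records)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Only pairs that were actually observed are stored, which is why a table with one row and one column per code can stay manageable even though most of its cells are empty. The result also contains no patient identifiers, only totals.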
Leveraging ORNL’s big data infrastructure and expertise in scientific computing, both essential when working at this scale of data, the team worked to automate data preprocessing and make the process publicly available.
“A researcher or institution can download the code, store their data in the correct format, and our process will perform all the necessary steps to integrate their data with everyone else’s,” said Everett Rush, research scientist at ORNL and lead data engineer on the project.
The research team took great care to protect patient privacy throughout the project. The team processed all VA data inside ORNL’s secure Protected Health Data infrastructure. After aggregating the data to an anonymized summary level, they shared it with Harvard and other collaborators. The resulting KESER matrix retains no connection to individual patients.
“There’s no way to trace the end results back to an individual patient because they’re aggregates,” said Dallas Sacca, ORNL’s senior solutions engineer. Sacca manages the protected health data enclave at ORNL and reviews each piece of data to ensure it meets HIPAA guidelines for anonymization before allowing it to leave the enclave.
The matrix is full of anonymized information about this huge patient cohort that can be probed with different methods, such as KESER, to gain new insights into human health. Using a series of modern statistical methods, the team transformed the summary data into vectors, tuned a model that encodes the relationships among the vectors, and extracted the most important features and feature weights for each phenotype.
“These statistical methods, which include graphical Gaussian models for sparse modeling of covariance structures, are particularly capable of assigning importance that exposes potential causal relationships, a concept with which classical AI technology, such as deep learning, tends to struggle,” said George Ostrouchov, principal researcher at ORNL and principal statistician of the MVP-CHAMPION project.
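The article does not spell out how summary counts become vectors. One common approach, used here purely as an illustrative assumption, is to weight each co-occurrence count by positive pointwise mutual information (PPMI) and compare the resulting rows with cosine similarity. The sketch leaves out the low-dimensional reduction and the sparse regression that KESER performs, and the codes and counts are invented.

```python
import math


def ppmi_vectors(counts, codes):
    """Build one PPMI vector per code from pair counts.

    counts holds each unordered pair once, under either key order.
    """
    def pair_count(a, b):
        return counts.get((a, b), 0) + counts.get((b, a), 0)

    row = {a: sum(pair_count(a, b) for b in codes if b != a) for a in codes}
    total = sum(row.values())
    vectors = {}
    for a in codes:
        vec = []
        for b in codes:
            n = pair_count(a, b) if a != b else 0
            if n == 0:
                vec.append(0.0)
            else:
                # PMI compares the observed count with what independence predicts.
                pmi = math.log(n * total / (row[a] * row[b]))
                vec.append(max(pmi, 0.0))  # keep only positive associations
        vectors[a] = vec
    return vectors


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Hypothetical summary-level counts (no patient-level data involved).
codes = ["DX:RA", "LAB:CRP", "RX:methotrexate", "RX:adalimumab",
         "DX:depression", "RX:sertraline"]
pair_counts = {
    ("DX:RA", "RX:methotrexate"): 6,
    ("DX:RA", "RX:adalimumab"): 4,
    ("LAB:CRP", "RX:methotrexate"): 3,
    ("LAB:CRP", "RX:adalimumab"): 2,
    ("DX:RA", "LAB:CRP"): 5,
    ("DX:depression", "RX:sertraline"): 5,
    ("DX:RA", "DX:depression"): 1,
}

vectors = ppmi_vectors(pair_counts, codes)
sim_ra_drugs = cosine(vectors["RX:methotrexate"], vectors["RX:adalimumab"])
sim_unrelated = cosine(vectors["RX:methotrexate"], vectors["RX:sertraline"])
print(sim_ra_drugs, sim_unrelated)
```

In this toy data the two rheumatoid arthritis drugs share the same neighbors and score far higher than the arthritis drug paired with the antidepressant. That is the kind of signal a feature selection step can then rank and weight.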
After running the KESER method, the team selected eight phenotypes, including depression, rheumatoid arthritis and ulcerative colitis, to explore. Using the features selected by KESER, they trained models to identify the phenotypes of interest.
The possibilities offered by KESER’s new ability to anonymize, integrate and analyze data from multiple healthcare facilities seem limitless.
Tianxi Cai, professor of biomedical informatics at Harvard Medical School and a principal investigator on KESER, said, “We are excited to have a highly scalable approach that can handle arrays an order of magnitude larger than what we are currently working with.”
The team is already integrating more clinical descriptors into knowledge graphs. Additionally, the team has begun to explore knowledge graphs to better understand emerging diseases.
“In a situation like COVID, for example, where everyone needs to share data and we need to start investigating all the different things that are related to this specific disease, you could potentially do that with this system,” said Chuan Hong, assistant professor at Duke University, who led research on the KESER project as an instructor at Harvard last year. “It’s essentially plug-and-play; you go into the data warehouse, follow the four-step process, and integrate your results directly.”
The potential for future collaboration and discovery may be the project’s greatest success. “This innovation will facilitate multi-center collaborations,” the team wrote in Nature, “and bring the field closer to the promise of creating distributed networks for learning across institutions while maintaining patient privacy.”