Email: dix2 at pitt dot edu
Research Interests:
Artificial Intelligence in Medicine
Investigation of a Data and Knowledge Driven System for Sequential Diagnosis
The goal of this project was to combine electronic health record (EHR) data with medical knowledge to build a sequential diagnostic system.
I developed an algorithm that generates medical diagnostic trees, which recommend diagnostic actions by considering clinical workflow, diagnostic accuracy, and misdiagnosis costs.
Evaluated on EHR data for a heart disease classification task in the emergency room (116,797 encounters, 37,376 features), the new system showed better clinical alignment, higher diagnostic accuracy, and lower misdiagnosis costs than baseline models built with a traditional multi-label classification tree algorithm (ML-C4.5) and a deep reinforcement learning algorithm (deep Q-learning).
Heart Failure Readmission Prediction
The goal of this project was to use a large amount of EHR data to develop models that assess patients' risk of being readmitted to the hospital within 30 days.
We collected data on 15,703 visits from 14 hospitals in the University of Pittsburgh Medical Center Health System and retrieved six types of data: demographic information, diagnoses, healthcare utilization information (e.g., number of ER visits within 12 months), 6,148 distinct drug categories (RxNorm and NDF-RT), 3,545 distinct laboratory tests, and 20,304 distinct UMLS CUIs from reports.
I developed a predictive risk assessment model whose discriminatory power (AUROC 0.650) was statistically significantly better than that of the HOSPITAL score and LACE score methods. I found that healthcare utilization was the most valuable data type for readmission risk assessment.
An Evaluation of a Natural Language Processing Tool for Identifying and Encoding Information in Emergency Department Clinical Notes
The goal of this project was to determine the accuracy of the cTAKES system in identifying predetermined factors from unstructured emergency department (ED) electronic health records. We created a list of 10 factors related to cardiac medical diagnosis and disease status. Two physicians reviewed 419 clinical records for each of the 10 factors and coded each factor as "positive," "negative," or "missing." The clinical reviewers' assignments were used as the gold standard for the analysis. I used cTAKES to retrieve clinical information and analyzed its performance. The results showed that cTAKES identified 84.4% of the factors correctly.
Classification of Positive Valence System Symptom Severity using Initial Psychiatric Evaluation Records
The goal of this project was to develop a framework that automatically classifies initial psychiatric evaluation records into one of four positive valence system severity levels, as part of the CEGS N-GRID 2016 Shared Task in Clinical Natural Language Processing competition. We derived question-answer features and UMLS concept features (using MedLEE) from the initial reports and built two decision tree models and one Bayesian network model. I built the two decision tree models using the "rpartScore" package; they achieved a macro-averaged inverse normalized mean absolute error score of 82.56. There were 24 participating teams and 65 valid submitted runs. Our models earned 4th place, and we were invited to publish a paper in the Journal of Biomedical Informatics.
Predicting Clinical Outcomes Using High-dimensional Genomic Datasets
The objective of this investigation was to evaluate binary prediction methods for predicting disease status from high-dimensional genomic data, which included one hundred simulated 1,000-SNP (single nucleotide polymorphism) datasets, ten 10,000-SNP datasets, six semi-synthetic datasets, and two real genome-wide association study (GWAS) datasets. I conducted experiments to predict clinical outcomes using different machine learning algorithms, including Naive Bayes, model averaging Naive Bayes, feature selection Naive Bayes, the efficient Bayesian multivariate classifier (EBMC), logistic regression, support vector machines (SVM), and lasso. In five-fold cross-validation studies, the SVM performed best on the 1,000-SNP datasets, while the Bayesian network (BN)-based methods performed best on the other datasets, with EBMC performing best overall among them. The results were published in the Journal of the American Medical Informatics Association.
Predicting Patient Survivorship Using Efficient Bayesian Network Learning
The purpose of this project was to develop and evaluate a new Bayesian network (BN)-based patient survivorship prediction method. We developed the EBMC-Survivorship (EBMC-S) method, which predicts survivorship for each year individually. EBMC-S is based on the EBMC BN algorithm, which has been shown to handle high-dimensional data. We evaluated EBMC-S using the Molecular Taxonomy of Breast Cancer International Consortium dataset. Results showed that EBMC-S performs better than the Cox proportional hazards model and is comparable to the random survival forest method. The results were published in Cancer Informatics.