
Biobank discovery

Biobank data are often missing. We develop phenotype recognition methods that impute rare and common diseases for all Biobank participants to empower downstream analyses such as genetic association studies.

A ligand in a binding pocket

Structural Informatics & Molecular Level Analysis

Goal: We apply artificial intelligence and other computational methods to chemical and protein structure data to gain insight into the molecular mechanisms of pharmacology.

Our current project areas include:

Protein site representations: We develop methods for representing local functional sites in protein structures to improve computational analysis of structural and functional sites. Applications of these methods include functional site prediction and annotation, binding site identification, pocket matching, protein-ligand interaction modeling, and more.

- COLLAPSE: We apply state-of-the-art self-supervised learning methods to large datasets of protein structure to learn information-rich embeddings of local structural sites. We use transfer learning and fine-tuning of COLLAPSE embeddings to improve performance on downstream tasks and develop tools that leverage direct comparisons in the embedding space to generate new insights about protein structure and function.

- FEATURE: We developed a system for modeling functional sites in protein structures. FEATURE represents the local 3D environment around sites of interest using many physicochemical properties (at the atomic, molecular, residue, and secondary structure level) collected in radial, spherical volumes centered on the site.
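As a rough illustration of the radial-shell idea described for FEATURE (not the FEATURE implementation itself), the sketch below sums per-atom properties inside concentric spherical shells around a site. The function name `shell_features`, the shell count, the shell width, and the two toy properties are all assumptions made for this example.

```python
import numpy as np

def shell_features(coords, properties, center, n_shells=6, shell_width=1.25):
    """Sum each per-atom property inside concentric shells around `center`.

    coords:     (n_atoms, 3) atom coordinates
    properties: (n_atoms, n_props) per-atom property values (hypothetical)
    Returns a flat vector of length n_shells * n_props.
    """
    coords = np.asarray(coords, dtype=float)
    properties = np.asarray(properties, dtype=float)
    # Distance of every atom from the site center.
    dist = np.linalg.norm(coords - np.asarray(center, dtype=float), axis=1)
    # Index of the shell each atom falls into (0 = innermost).
    shell_idx = np.floor(dist / shell_width).astype(int)
    inside = shell_idx < n_shells  # drop atoms beyond the outermost shell
    features = np.zeros((n_shells, properties.shape[1]))
    # Unbuffered add so several atoms in the same shell all accumulate.
    np.add.at(features, shell_idx[inside], properties[inside])
    return features.ravel()

# Toy site: four atoms, two made-up properties per atom.
coords = [[0, 0, 0], [1, 0, 0], [0, 2, 0], [0, 0, 10]]
properties = [[1, 0], [0, 1], [1, 1], [1, 1]]
vec = shell_features(coords, properties, center=[0, 0, 0],
                     n_shells=3, shell_width=1.25)
print(vec)
```

The last atom lies outside all three shells and is ignored, so the resulting vector describes only the local environment of the site.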

Machine Learning for 3D Structure: We apply the state of the art in neural networks to investigate binding relationships between proteins and their small molecule ligands, identify novel binding sites, and model the interaction between proteins and small molecules.

Chemoinformatics: Small molecule drugs are the foundation of pharmacy. We seek to apply and advance the state of the art in chemical informatics methodologies, using small molecule structure data to predict properties, activities, and clinical outcomes. Our primary areas of interest are cancer, tropical diseases, drug metabolism, and drug-drug interactions.

Social Media Pharmacovigilance

Goal: Leverage social media as a data source for pharmacovigilance, focusing on analyzing trends in the opioid epidemic.

Pharmacovigilance involves the post-market detection, assessment, analysis, and prevention of drug side effects and abuse. The opioid epidemic is a national crisis and pharmacovigilance of opioid abuse is critical for mitigating further death. Social media data holds promise for assisting in pharmacovigilance of drug abuse, as its informal and pseudo-anonymous nature makes users more forthcoming with details about their experiences with illicit drug use. However, as social media is a noisier and less complete data source than official reporting mechanisms, additional work is required to adequately leverage the knowledge it contains. Our group has several projects in this area.

Recent tools for social media pharmacovigilance include:

- RedMed: A word embedding model and resulting lexicon of drug slang terms for use in social media pharmacovigilance, trained on Reddit. See the RedMed paper, Zenodo record (embedding model and lexicon), and GitHub repository (annotation tool using RedMed lexicon).

- SAEDR and DRIP: Quantitative severity estimate scores for adverse drug reactions and drugs, respectively, derived from social media word embeddings. See the paper (with SAEDR and DRIP scores available from the supplement).

- GPT-3 for building colloquial drug lexicons: A pipeline to use OpenAI's GPT-3 to generate drug slang terms and misspellings for use in social media pharmacovigilance. See the paper and GitHub repository (lexicon and code to rerun pipeline and create new lexicons).
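The embedding-based tools above rest on the idea that terms used in similar contexts end up close together in the embedding space. The sketch below shows that idea in its simplest form: ranking candidate terms by cosine similarity to a seed drug name. It is not the RedMed pipeline; the helper `nearest_terms` and the two-dimensional toy vectors are invented for illustration, whereas real embeddings would be loaded from a trained model.

```python
import numpy as np

def nearest_terms(seed, embeddings, top_k=3):
    """Return the `top_k` terms most cosine-similar to `seed`."""
    seed_vec = embeddings[seed]
    scores = {}
    for term, vec in embeddings.items():
        if term == seed:
            continue
        # Cosine similarity between the seed vector and the candidate vector.
        scores[term] = float(
            np.dot(seed_vec, vec)
            / (np.linalg.norm(seed_vec) * np.linalg.norm(vec))
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Made-up 2-D vectors standing in for a trained word embedding model.
toy_embeddings = {
    "oxycodone": np.array([1.0, 0.1]),
    "oxy": np.array([0.9, 0.2]),
    "percs": np.array([0.8, 0.3]),
    "coffee": np.array([0.0, 1.0]),
}
neighbors = nearest_terms("oxycodone", toy_embeddings, top_k=2)
print(neighbors)
```

In practice, candidate terms found this way would still need filtering or manual review before being added to a lexicon, since nearby terms are not always synonyms.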


Adverse drug reactions (ADRs) are the unwanted result of variable response to drugs. ADRs are one of the leading causes of mortality in health care, resulting in over 7,000 deaths annually. While not all cases are as extreme, adverse reaction, no reaction, or poor response to drugs can be observed for up to 95% of medications.

Pharmacogenomics studies show that interindividual genetic differences explain some of the variance in drug response. However, emerging data suggest that epigenetic mechanisms may also contribute to this variance. Pharmacoepigenetics (PEGx), the study of epigenetic variability and response to drugs, currently focuses on drugs that target epigenetic mechanisms; however, little is known about how epigenetic modifications influence our response to ‘classical’ or non-epigenetic drugs. Moreover, many classical drugs likely modify our epigenetic makeup, although the full effects of these modifications are unknown. To fully understand variable and adverse drug responses, it is crucial to understand how epigenetic markers and ‘classical’ drugs interact.

Datasets correlating epigenetic modifications with molecular and phenotypic information have recently grown in number due to technological advances, allowing machine-learning approaches to characterize potential pharmacoepigenetic relationships. We aim to leverage these newly available datasets to 1) identify PEGx relationships, 2) predict novel PEGx relationships, and 3) predict the epigenetic effect of classical drugs. These computational insights are key because they provide a starting point for subsequent experimental validation, which will advance the field of PEGx and clarify how it contributes to ADRs.


Check out our recent review on this topic.

Extracting Biomedical Relations from Text

Goal: Develop algorithms using natural language processing techniques to extract biomedical relationships from text.

Publication of results is the primary goal of all scientific research. With this focus on publication, the biomedical literature has become the ultimate source of all known information about drugs, genes, and other biomedical entities. While high-quality curated databases provide structured relationships for browsing and download, the majority of these relationships remain buried in the biomedical literature. Our lab is focused on developing text mining approaches for extracting these biomedical relationships from text, using both unsupervised and supervised methods.

Recent algorithms and text mining tools include:

The Ensemble Biclustering for Classification (EBC) algorithm to automatically cluster biomedical relationships from text. Code available on GitHub.

A gene-gene extractor for the DeepDive system (collaboration with Chris Ré at Stanford University). Code available on GitHub.

More information about DeepDive can be found at

Links: EBC code is available on GitHub; the gene-gene application for DeepDive is also available on GitHub.

Find more on our GitHub