Publications | Wentao (Winston) Li

2025

Machine-learning driven strategies for adapting immunotherapy in metastatic NSCLC

Maliazurina B Saad, Qasem Al-Tashi, Lingzhi Hong, and 8 more authors

Nature Communications, 2025

From Classical Machine Learning to Emerging Foundation Models: Review on Multimodal Data Integration for Cancer Research

Amgad Muneer, Muhammad Waqas, Maliazurina B Saad, and 8 more authors

arXiv preprint arXiv:2507.09028, 2025

2024

FedGMMAT: Federated generalized linear mixed model association tests

Wentao Li, Han Chen, Xiaoqian Jiang, and 1 more author

PLOS Computational Biology, Jul 2024

Increasing genetic and phenotypic data size is critical for understanding the genetic determinants of diseases. Evidently, establishing practical means for collaboration and data sharing among institutions is a fundamental methodological barrier for performing high-powered studies. As the sample sizes become more heterogeneous, complex statistical approaches, such as generalized linear mixed effects models, must be used to correct for the confounders that may bias results. On another front, due to privacy concerns around Protected Health Information (PHI), the sharing of genetic information is restricted by regulations such as the Health Insurance Portability and Accountability Act (HIPAA). This limits data sharing among institutions and hampers efforts around executing high-powered collaborative studies. Federated approaches are promising to alleviate the issues around privacy and performance, since sensitive data never leaves the local sites. Motivated by these considerations, we developed FedGMMAT, a federated genetic association testing tool that utilizes a federated statistical testing approach for efficient association tests that can correct for confounding fixed and additive polygenic random effects among different collaborating sites. Genetic data is never shared among collaborating sites, and the intermediate statistics are protected by encryption. Using simulated and real datasets, we demonstrate that FedGMMAT can achieve virtually the same results as pooled analysis under a privacy-preserving framework with practical resource requirements.
Attention-Fusion Model for Multi-omics (AMMO) Data Integration in Lung Adenocarcinoma

Wentao Li, Amgad Muneer, Muhammad Waqas, and 2 more authors

In International Workshop on Computational Mathematics Modeling in Cancer Analysis, Jul 2024

Estimating the Average Treatment Effect Using Weighting Methods in Lung Cancer Immunotherapy

Maliazurina B Saad, Qasem Al-Tashi, Lingzhi Hong, and 7 more authors

In International Workshop on Computational Mathematics Modeling in Cancer Analysis, Jul 2024

1236 Machine learning-based clinico-genomic prediction of benefits to add chemotherapy to immunotherapy in metastatic non-small cell lung cancer

Maliazurina Binti Saad, Qasem Al-Tashi, Lingzhi Hong, and 8 more authors

Jul 2024

2023

Federated generalized linear mixed models for collaborative genome-wide association studies

Wentao Li, Han Chen, Xiaoqian Jiang, and 1 more author

iScience, Jul 2023

Federated association testing is a powerful approach to conduct large-scale association studies where sites share intermediate statistics through a central server. There are, however, several standing challenges. Confounding factors like population stratification should be carefully modeled across sites. In addition, it is crucial to consider disease etiology using flexible models to prevent biases. Privacy protections for participants pose another significant challenge. Here, we propose distributed Mixed Effects Genome-wide Association study (dMEGA), a method that enables federated generalized linear mixed model-based association testing across multiple sites without explicitly sharing genotype and phenotype data. dMEGA employs a reference projection to correct for population-stratification and utilizes efficient local-gradient updates among sites, incorporating both fixed and random effects. The accuracy and efficiency of dMEGA are demonstrated through simulated and real datasets. dMEGA is publicly available at https://github.com/Li-Wentao/dMEGA.
Non-invasive arterial blood pressure measurement and SpO2 estimation using PPG signal: a deep learning framework

Yan Chu, Kaichen Tang, Yu-Chun Hsu, and 6 more authors

BMC Medical Informatics and Decision Making, Jul 2023

Monitoring blood pressure and peripheral capillary oxygen saturation plays a crucial role in healthcare management for patients with chronic diseases, especially hypertension and vascular disease. However, current blood pressure measurement methods have intrinsic limitations; for instance, arterial blood pressure is measured by inserting a catheter in the artery, which can cause discomfort and infection.
COLLAGENE enables privacy-aware federated and collaborative genomic data analysis

Wentao Li, Miran Kim, Kai Zhang, and 3 more authors

Genome Biology, Sep 2023

Growing regulatory requirements set barriers around genetic data sharing and collaborations. Moreover, existing privacy-aware paradigms are challenging to deploy in collaborative settings. We present COLLAGENE, a tool base for building secure collaborative genomic data analysis methods. COLLAGENE protects data using shared-key homomorphic encryption and combines encryption with multiparty strategies for efficient privacy-aware collaborative method development. COLLAGENE provides ready-to-run tools for encryption/decryption, matrix processing, and network transfers, which can be immediately integrated into existing pipelines. We demonstrate the usage of COLLAGENE by building a practical federated GWAS protocol for binary phenotypes and a secure meta-analysis protocol. COLLAGENE is available at https://zenodo.org/record/8125935.
Efficient Federated Kinship Relationship Identification

Xinyue Wang, Leonard Dervishi, Wentao Li, and 3 more authors

AMIA Summits on Translational Science Proceedings, Jun 2023

Kinship relationship estimation plays a significant role in today’s genome studies. Since genetic data are mostly stored and protected in different silos, retrieving the desirable kinship relationships across federated data warehouses is a non-trivial problem. The ability to identify and connect related individuals is important for both research and clinical applications. In this work, we propose a new privacy-preserving kinship relationship estimation framework: Incremental Update Kinship Identification (INK). The proposed framework includes three key components that allow us to control the balance between privacy and accuracy (of kinship estimation): an incremental process coupled with the use of auxiliary information and informative scores. Our empirical evaluation shows that INK can achieve higher kinship identification correctness while exposing fewer genetic markers.

2022

Facilitating federated genomic data analysis by identifying record correlations while ensuring privacy

Leonard Dervishi, Xinyue Wang, Wentao Li, and 4 more authors

AMIA Annual Symposium Proceedings, Jun 2022

With the reduction of sequencing costs and the pervasiveness of computing devices, genomic data collection is continually growing. However, data collection is highly fragmented and the data is still siloed across different repositories. Analyzing all of this data would be transformative for genomics research. However, the data is sensitive, and therefore cannot be easily centralized. Furthermore, there may be correlations in the data, which if not detected, can impact the analysis. In this paper, we take the first step towards identifying correlated records across multiple data repositories in a privacy-preserving manner. The proposed framework, based on random shuffling, synthetic record generation, and local differential privacy, allows a trade-off of accuracy and computational efficiency. An extensive evaluation on real genomic data from the OpenSNP dataset shows that the proposed solution is efficient and effective.
Federated learning algorithms for generalized mixed-effects model (GLMM) on horizontally partitioned data from distributed sources

Wentao Li, Jiayi Tong, Md. Monowar Anjum, and 3 more authors

BMC Medical Informatics and Decision Making, Oct 2022

This paper developed federated solutions based on two approximation algorithms to achieve federated generalized linear mixed effect models (GLMM). The paper also proposed a solution for numerical errors and singularity issues. It further showed that the two proposed methods perform well in revealing the significance of parameters in distributed datasets, compared with a centralized GLMM algorithm from the R package ‘lme4’ as the baseline model.
Privacy preserving collaborative learning of generalized linear mixed model

Md. Monowar Anjum, Noman Mohammed, Wentao Li, and 1 more author

Journal of Biomedical Informatics, Oct 2022

The Generalized Linear Mixed Model is one of the most pervasive classes of statistical models. It is widely used in the medical domain. Training such models in a collaborative setting often entails privacy risks. Standard privacy-preserving mechanisms such as differential privacy can be used to mitigate the privacy risk while training the model. However, experimental evidence suggests that adding differential privacy to the training of the model can cause significant utility loss, which makes the model impractical for real-world usage. Therefore, it becomes clear that the specific class of generalized linear mixed models which lose their usability under differential privacy requires a different approach for privacy-preserving model training. In this work, we propose a value-blind training method in a collaborative setting for generalized linear mixed models. In our proposed training method, the central server optimizes model parameters for a generalized linear mixed model without ever getting access to the raw training data or intermediate computation values. Intermediate computation values that are shared by the collaborating parties with the central server are encrypted using homomorphic encryption. Experimentation on multiple datasets suggests that the model trained by our proposed method achieves a very low error rate while preserving privacy. To the best of our knowledge, this is the first work that performs a systematic privacy analysis of generalized linear mixed model training in a collaborative setting.
Privacy-aware estimation of relatedness in admixed populations

Su Wang, Miran Kim, Wentao Li, and 3 more authors

Briefings in Bioinformatics, Nov 2022

Estimation of genetic relatedness, or kinship, is used occasionally for recreational purposes and in forensic applications. While numerous methods were developed to estimate kinship, they suffer from high computational requirements and often make an untenable assumption of homogeneous population ancestry of the samples. Moreover, genetic privacy is generally overlooked in the usage of kinship estimation methods. There can be ethical concerns about finding unknown familial relationships in third-party databases. Similar ethical concerns may arise while estimating and reporting sensitive population-level statistics such as inbreeding coefficients for the concerns around marginalization and stigmatization.

Here, we present SIGFRIED, which makes use of existing reference panels with a projection-based approach that simplifies kinship estimation in the admixed populations. We use simulated and real datasets to demonstrate the accuracy and efficiency of kinship estimation. We present a secure federated kinship estimation framework and implement a secure kinship estimator using homomorphic encryption-based primitives for computing relatedness between samples in two different sites while genotype data are kept confidential. Source code and documentation for our methods can be found at https://doi.org/10.5281/zenodo.7053352.

Analysis of relatedness is fundamentally important for identifying relatives, in association studies, and for estimation of population-level estimates of inbreeding. As the awareness of individual and group genomic privacy is growing, privacy-preserving methods for the estimation of relatedness are needed. Presented methods alleviate the ethical and privacy concerns in the analysis of relatedness in admixed, historically isolated and underrepresented populations.

Genetic relatedness is a central quantity used for finding relatives in databases, correcting biases in genome wide association studies and for estimating population-level statistics. Methods for estimating genetic relatedness have high computational requirements, and occasionally do not consider individuals from admixed ancestries. Furthermore, the ethical concerns around using genetic data and calculating relatedness are not considered. We present a projection-based approach that can efficiently and accurately estimate kinship. We implement our method using encryption-based techniques that provide provable security guarantees to protect genetic data while kinship statistics are computed among multiple sites.
Privacy-Aware Kinship Inference in Admixed Populations using Projection on Reference Panels

Su Wang, Miran Kim, Wentao Li, and 3 more authors

bioRxiv, Nov 2022

Publisher: Cold Spring Harbor Laboratory

2021

Open Imputation Server provides secure Imputation services with provable genomic privacy

Arif O. Harmanci, Miran Kim, Su Wang, and 4 more authors

bioRxiv, Nov 2021

Summary: As DNA sequencing data is available for personal use, genomic privacy is becoming a major challenge. Nevertheless, high-throughput genomic data analysis outsourcing is performed using pipelines that tend to overlook these challenges.

Results: We present a client-server-based outsourcing framework for genotype imputation, an important step in genomic data analyses. Genotype data is encrypted by the client and encrypted data are used by the server that never observes the data in plain. The cloud-based framework can benefit from virtually unlimited computational resources while providing provable confidentiality. We demonstrate the server’s utility from several aspects using genotype data from the 1000 Genomes datasets. First, we benchmark the accuracy of common variant imputation in comparison to BEAGLE, a state-of-the-art imputation method. We also provide the detailed time requirements of the server to showcase scaling of time usage in different steps of imputation. We also present a simple correlation metric that can be used to estimate imputation accuracy using only the reference panels. This is important for filtering the variants in downstream analyses. As a further demonstration and a different use case, we performed a simulated genomewide association study (GWAS) using imputed and known genotypes and highlight the potential utility of the server for association studies. Overall, our study presents multiple lines of evidence for the usability of the secure imputation service.

Availability: The server is publicly available at https://www.secureomics.org/OpenImpute. Users can anonymously test and use the imputation server without registration.

Contact: Arif.O.Harmanci@uth.tmc.edu

Competing Interest Statement: The authors have declared no competing interest.
VERTIcal Grid lOgistic regression with Confidence Intervals (VERTIGO-CI)

Jihoon Kim, Wentao Li, Tyler Bath, and 2 more authors

AMIA Summits Transl. Sci. Proc., May 2021

Federated learning of data from multiple participating parties is getting more attention and has many healthcare applications. We have previously developed VERTIGO, a distributed logistic regression model for vertically partitioned data. The model takes advantage of the linear separation property of kernel matrices of a dual space model to harmonize information in a privacy-preserving manner. However, this method does not handle the variance estimation and only provides point estimates: it cannot report test statistics and associated P-values. In this work, we extend VERTIGO by introducing a novel ring-structure protocol to pass on intermediary statistics among clients and successfully reconstruct the covariance matrix in the dual space. This extension, VERTIGO-CI, is a complete protocol to construct a logistic regression model from vertically partitioned datasets as if it were trained on combined data in a centralized setting. We evaluated our results on synthetic and real data, showing equivalent accuracy and tolerable performance overhead compared to the centralized version. This novel extension can be applied to other types of generalized linear models that have dual objectives.

2020

A tutorial on calibration measurements and calibration models for clinical prediction models

Yingxiang Huang, Wentao Li, Fima Macheret, and 2 more authors

Journal of the American Medical Informatics Association, Feb 2020

Our primary objective is to provide the clinical informatics community with an introductory tutorial on calibration measurements and calibration models for predictive models using existing R packages and custom implemented code in R on real and simulated data. Clinical predictive model performance is commonly published based on discrimination measures, but use of models for individualized predictions requires adequate model calibration. This tutorial is intended for clinical researchers who want to evaluate predictive models in terms of their applicability to a particular population. It is also for informaticians and for software engineers who want to understand the role that calibration plays in the evaluation of a clinical predictive model, and to provide them with a solid starting point to consider incorporating calibration evaluation and calibration models in their work.

Covered topics include (1) an introduction to the importance of calibration in the clinical setting, (2) an illustration of the distinct roles that discrimination and calibration play in the assessment of clinical predictive models, (3) a tutorial and demonstration of selected calibration measurements, (4) a tutorial and demonstration of selected calibration models, and (5) a brief discussion of limitations of these methods and practical suggestions on how to use them in practice.