Welcome to our research page featuring recent publications in biostatistics and epidemiology. These fields play a crucial role in advancing our understanding of the causes, prevention, and treatment of health conditions, and our team contributes to them through innovative studies and cutting-edge statistical analyses. On this page you will find our collection of research publications describing the development of new statistical methods and their application to real-world data. Please feel free to contact us with any questions or comments.
Background: With many disease-modifying therapies currently approved for the management of multiple sclerosis, there is a growing need to evaluate the comparative effectiveness and safety of those therapies from real-world data sources. Propensity score methods have recently gained popularity in multiple sclerosis research to generate real-world evidence. Recent evidence suggests, however, that the conduct and reporting of propensity score analyses are often suboptimal in multiple sclerosis studies.
Objectives: To provide practical guidance to clinicians and researchers on the use of propensity score methods within the context of multiple sclerosis research.
Methods: We summarize recommendations on the use of propensity score matching and weighting based on the current methodological literature, and provide examples of good practice.
Results: Step-by-step recommendations are presented, starting with covariate selection and propensity score estimation, followed by guidance on the assessment of covariate balance and implementation of propensity score matching and weighting. Finally, we focus on treatment effect estimation and sensitivity analyses.
Conclusion: This comprehensive set of recommendations highlights key elements that require careful attention when using propensity score methods.
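To make these steps concrete, the following is a minimal Python sketch of the workflow outlined above: propensity score estimation, inverse probability weighting, a covariate balance check, and weighted effect estimation. The dataset, covariates (age and a hypothetical disability score), and the simple logistic and linear models are simulated assumptions for illustration, not the analysis from the paper.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 2000

# Simulated baseline covariates and a treatment whose assignment
# depends on them (i.e., confounding by indication).
age = rng.normal(40, 10, n)
edss = rng.normal(3.0, 1.5, n)  # hypothetical disability score
p_treat = 1 / (1 + np.exp(-(-2 + 0.03 * age + 0.2 * edss)))
treat = rng.binomial(1, p_treat)
y = 0.5 * treat + 0.02 * age + 0.1 * edss + rng.normal(0, 1, n)
df = pd.DataFrame({"age": age, "edss": edss, "treat": treat, "y": y})

# 1) Estimate the propensity score from pre-treatment covariates.
X = sm.add_constant(df[["age", "edss"]])
ps = sm.Logit(df["treat"], X).fit(disp=0).predict(X)

# 2) Inverse probability of treatment weights (targeting the ATE).
w = np.where(df["treat"] == 1, 1 / ps, 1 / (1 - ps))

# 3) Assess balance via weighted standardized mean differences.
def smd(x, t, w):
    m1 = np.average(x[t == 1], weights=w[t == 1])
    m0 = np.average(x[t == 0], weights=w[t == 0])
    pooled_sd = np.sqrt((x[t == 1].var() + x[t == 0].var()) / 2)
    return (m1 - m0) / pooled_sd

for col in ["age", "edss"]:
    print(col, round(smd(df[col].to_numpy(), df["treat"].to_numpy(), w), 3))

# 4) Weighted outcome model to estimate the treatment effect.
fit = sm.WLS(df["y"], sm.add_constant(df["treat"]), weights=w).fit()
print("estimated effect:", round(fit.params["treat"], 3))
```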
As the scientific research community, health care professionals, and decision-makers around the world fight tirelessly against the COVID-19 pandemic, the need for comparative effectiveness research (CER) on preventive and therapeutic interventions for COVID-19 is immense. Randomized controlled trials markedly underrepresent the frail and complex patients seen in routine care, and they do not typically have data on long-term treatment effects. The increasing availability of electronic health records (EHRs) for clinical research offers the opportunity to generate timely real-world evidence reflective of routine care for optimal management of COVID-19. However, there are many potential threats to the validity of CER based on EHR data that were not originally generated for research purposes. To ensure unbiased and robust results, we need high-quality healthcare databases, rigorous study designs, and proper implementation of appropriate statistical methods. We aimed to describe opportunities and challenges in EHR-based CER for COVID-19-related questions and to introduce best practices in pharmacoepidemiology to minimize potential biases. We structured our discussion into the following topics: 1) study population identification based on exposure status; 2) ascertainment of outcomes; 3) common biases and potential solutions; and 4) data operational challenges specific to COVID-19 CER using EHRs. We provide structured guidance for the proper conduct and appraisal of drug and vaccine effectiveness and safety research using EHR data during the pandemic. This manuscript is endorsed by the International Society for Pharmacoepidemiology (ISPE).
Objectives: Among infectious disease (ID) studies that seek to make causal inferences and pool individual-level longitudinal data from multiple cohorts, we sought to assess which methods are being used, how those methods are reported, and whether these practices have changed over time.
Study design and setting: Systematic review of longitudinal observational infectious disease studies pooling individual-level patient data from two or more studies, published in English in 2009, 2014, or 2019. This systematic review protocol is registered with PROSPERO (CRD42020204104).
Results: Our search yielded 1,462 unique articles. Of these, 16 were included in the final review. Our analysis showed a lack of causal inference methods and of clear reporting on methods and the required assumptions.
Conclusion: There are many approaches to causal inference that may facilitate valid inference in the presence of unmeasured and time-varying confounding. In observational ID studies leveraging pooled, longitudinal individual participant data (IPD), the absence of these causal inference methods and gaps in the reporting of key methodological considerations suggest there is ample opportunity to enhance the rigor and reporting of research in this field. Interdisciplinary collaborations between substantive and methodological experts would strengthen future work.
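As one hedged illustration of the kind of causal inference method the review has in mind, the sketch below fits a marginal structural model with stabilized inverse probability of treatment weights to handle a time-varying confounder. The two-visit data, variable names, and models are entirely simulated and hypothetical, not drawn from any of the reviewed studies.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000

# Visit 1: confounder L1 affects treatment A1 and the outcome.
L1 = rng.normal(size=n)
A1 = rng.binomial(1, 1 / (1 + np.exp(-L1)))
# Visit 2: confounder L2 is itself affected by earlier treatment
# (the time-varying confounding that standard adjustment mishandles).
L2 = 0.5 * L1 + 0.5 * A1 + rng.normal(size=n)
A2 = rng.binomial(1, 1 / (1 + np.exp(-L2)))
Y = A1 + A2 + L1 + L2 + rng.normal(size=n)
df = pd.DataFrame({"L1": L1, "A1": A1, "L2": L2, "A2": A2, "Y": Y})

def pscore(y, X):
    return sm.Logit(y, sm.add_constant(X)).fit(disp=0).predict(sm.add_constant(X))

# Stabilized weights: P(treatment | past treatment) in the numerator,
# P(treatment | past treatment and confounders) in the denominator,
# multiplied over visits.
num1 = np.where(df.A1 == 1, df.A1.mean(), 1 - df.A1.mean())
den1p = pscore(df.A1, df[["L1"]])
den1 = np.where(df.A1 == 1, den1p, 1 - den1p)
num2p = pscore(df.A2, df[["A1"]])
num2 = np.where(df.A2 == 1, num2p, 1 - num2p)
den2p = pscore(df.A2, df[["A1", "L1", "L2"]])
den2 = np.where(df.A2 == 1, den2p, 1 - den2p)
sw = (num1 / den1) * (num2 / den2)

# Weighted regression of the outcome on treatment history estimates
# the marginal structural model parameters.
msm = sm.WLS(df.Y, sm.add_constant(df[["A1", "A2"]]), weights=sw).fit()
print(msm.params)
```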
While the opportunities of machine learning (ML) and artificial intelligence (AI) in healthcare are promising, the growth of complex data-driven prediction models requires careful quality and applicability assessment before they are applied and disseminated in daily practice. This scoping review aimed to identify actionable guidance for those closely involved in AI-based prediction model (AIPM) development, evaluation, and implementation, including software engineers, data scientists, and healthcare professionals, and to identify potential gaps in this guidance. We performed a scoping review of the relevant literature providing guidance or quality criteria regarding the development, evaluation, and implementation of AIPMs, using a comprehensive multi-stage screening strategy. PubMed, Web of Science, and the ACM Digital Library were searched, and AI experts were consulted. Topics were extracted from the identified literature and summarized across the six phases at the core of this review: (1) data preparation, (2) AIPM development, (3) AIPM validation, (4) software development, (5) AIPM impact assessment, and (6) AIPM implementation into daily healthcare practice. From 2683 unique hits, 72 relevant guidance documents were identified. Substantial guidance was found for data preparation, AIPM development, and AIPM validation (phases 1-3), while the later phases (software development, impact assessment, and implementation) have clearly received less attention in the scientific literature. The six phases of the AIPM development, evaluation, and implementation cycle provide a framework for the responsible introduction of AI-based prediction models in healthcare. Additional domain- and technology-specific research may be necessary, and more practical experience with implementing AIPMs is needed to support further guidance.
Objectives: Missing data is a common problem during the development, evaluation, and implementation of prediction models. Although machine learning (ML) methods are often said to be capable of circumventing missing data, it is unclear how these methods are used in medical research. We aim to find out if and how well prediction model studies using machine learning report on their handling of missing data.
Study design and setting: We systematically searched the literature for papers published between 2018 and 2019 reporting primary studies that developed and/or validated clinical prediction models using any supervised ML methodology across medical fields. From the retrieved studies, we extracted information about the amount and nature (e.g., missing completely at random, potential reasons for missingness) of missing data and the way they were handled.
Results: We identified 152 machine learning-based clinical prediction model studies. A substantial proportion of these 152 papers reported nothing on missing data (n = 56/152). A majority (n = 96/152) reported details on the handling of missing data (e.g., methods used), though many of these (n = 46/96) did not report the amount of missingness in the data. In these 96 papers, the authors only sometimes reported possible reasons for missingness (n = 7/96) and information about missing data mechanisms (n = 8/96). The most common approach for handling missing data was deletion (n = 65/96), mostly via complete-case analysis (CCA) (n = 43/96). Very few studies used multiple imputation (n = 8/96) or built-in mechanisms such as surrogate splits (n = 7/96) that directly address missing data during the development, validation, or implementation of the prediction model.
Conclusion: Though missing values are highly common in all types of medical research, and certainly in research based on routine healthcare data, a majority of prediction model studies using machine learning do not report sufficient information on the presence and handling of missing data. Strategies in which patient data are simply omitted are unfortunately the most commonly used, even though this practice is generally advised against and is well known to cause bias and loss of analytical power in prediction model development and in estimates of predictive accuracy. Prediction model researchers should be much more aware of alternative methodologies to address missing data.
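The sketch below contrasts the two approaches discussed above, complete-case analysis versus imputation, on simulated data; the variables and missingness mechanism are illustrative assumptions, and a full multiple-imputation analysis would repeat the imputation several times and pool the results (e.g., with Rubin's rules).

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (still experimental)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = (x1 + x2 + rng.normal(size=n) > 0).astype(int)

X = pd.DataFrame({"x1": x1, "x2": x2})
# Make x2 missing more often when x1 is high (missing at random, not MCAR).
X.loc[rng.random(n) < 1 / (1 + np.exp(-x1)), "x2"] = np.nan

# Complete-case analysis: rows with any missing value are dropped,
# which discards information and can bias the model.
cc = X.dropna()
model_cc = LogisticRegression().fit(cc, y[cc.index])

# Chained-equations-style imputation retains all rows; repeating this
# with several imputed datasets would give true multiple imputation.
X_imp = IterativeImputer(sample_posterior=True, random_state=0).fit_transform(X)
model_mi = LogisticRegression().fit(X_imp, y)

print("complete cases used:", len(cc), "of", n)
print("CCA coef:", model_cc.coef_.round(2), "| imputed coef:", model_mi.coef_.round(2))
```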
This commentary describes the creation of the Zika Virus Individual Participant Data Consortium, a global collaboration to address outstanding questions in Zika virus (ZIKV) epidemiology through conducting an individual participant data meta-analysis (IPD-MA). The aims of the IPD-MA are to (1) estimate the absolute and relative risks of miscarriage, fetal loss, and short- and long-term sequelae of fetal exposure; (2) identify and quantify the relative importance of different sources of heterogeneity (e.g., immune profiles, concurrent flavivirus infection) for the risk of adverse fetal, infant, and child outcomes among infants exposed to ZIKV in utero; and (3) develop and validate a prognostic model for the early identification of high-risk pregnancies, to inform both communication between health care providers and their patients and public health interventions (e.g., vector control strategies, antenatal care, and family planning programs). By leveraging data from a diversity of populations across the world, the IPD-MA will provide a more precise estimate of the risk of adverse ZIKV-related outcomes within clinically relevant subgroups and a quantitative assessment of the generalizability of these estimates across populations and settings. The ZIKV IPD Consortium effort is indicative of the growing recognition that data sharing is a central component of global health security and outbreak response.
Precision medicine research often searches for treatment-covariate interactions, which occur when a treatment effect (eg, measured as a mean difference, odds ratio, or hazard ratio) changes across values of a participant-level covariate (eg, age, gender, a biomarker). Single trials do not usually have sufficient power to detect genuine treatment-covariate interactions, which motivates the sharing of individual participant data (IPD) from multiple trials for meta-analysis. Here, we provide statistical recommendations for conducting and planning an IPD meta-analysis of randomized trials to examine treatment-covariate interactions. For conduct, two-stage and one-stage statistical models are described, and we recommend: (i) interactions should be estimated directly, and not by calculating differences in meta-analysis results for subgroups; (ii) interaction estimates should be based solely on within-study information; (iii) continuous covariates and outcomes should be analyzed on their continuous scale; (iv) nonlinear relationships should be examined for continuous covariates, using a multivariate meta-analysis of the trend (eg, using restricted cubic spline functions); and (v) translation of interactions into clinical practice is nontrivial, requiring individualized treatment effect prediction. For planning, we describe first why the decision to initiate an IPD meta-analysis project should not be based on between-study heterogeneity in the overall treatment effect; and second, how to calculate the power of a potential IPD meta-analysis project in advance of IPD collection, conditional on characteristics (eg, number of participants, standard deviation of covariates) of the trials (potentially) promising their IPD. Real IPD meta-analysis projects are used for illustration throughout.
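As a hedged sketch of the two-stage approach described above, the Python example below estimates the treatment-covariate interaction directly within each trial (keeping the covariate continuous, per recommendations i-iii) and then pools the within-study estimates by inverse-variance weighting. The trials, covariate, and effect sizes are simulated for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
rows = []
for trial in range(5):
    n = 300
    age = rng.normal(50 + 5 * trial, 10, n)  # trial-level differences in age
    treat = rng.binomial(1, 0.5, n)
    y = 0.3 * treat + 0.02 * age + 0.01 * treat * (age - 50) + rng.normal(0, 1, n)
    rows.append(pd.DataFrame({"trial": trial, "age": age, "treat": treat, "y": y}))
ipd = pd.concat(rows, ignore_index=True)

# Stage 1: estimate the interaction directly within each trial, so only
# within-study information contributes to the pooled estimate.
est, var = [], []
for _, d in ipd.groupby("trial"):
    fit = smf.ols("y ~ treat * age", data=d).fit()
    est.append(fit.params["treat:age"])
    var.append(fit.bse["treat:age"] ** 2)
est, var = np.array(est), np.array(var)

# Stage 2: inverse-variance (fixed-effect) pooling; a random-effects
# model would add a between-trial variance component here.
w = 1 / var
pooled = np.sum(w * est) / np.sum(w)
se = np.sqrt(1 / np.sum(w))
print(f"pooled interaction: {pooled:.3f} (SE {se:.3f})")
```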
Many randomized trials evaluate an intervention effect on time-to-event outcomes. Individual participant data (IPD) from such trials can be obtained and combined in a so-called IPD meta-analysis (IPD-MA), to summarize the overall intervention effect. We performed a narrative literature review to provide an overview of methods for conducting an IPD-MA of randomized intervention studies with a time-to-event outcome. We focused on identifying good methodological practice for modeling frailty of trial participants across trials, modeling heterogeneity of intervention effects, choosing appropriate association measures, dealing with (trial differences in) censoring and follow-up times, and addressing time-varying intervention effects and effect modification (interactions).
We discuss how to achieve this using parametric and semi-parametric methods, and describe how to implement these in a one-stage or two-stage IPD-MA framework. We recommend exploring heterogeneity of the effect(s) through interaction and non-linear effects. Random effects should be applied to account for residual heterogeneity of the intervention effect. We provide further recommendations, many of which are specific to IPD-MA of time-to-event data from randomized trials examining an intervention effect.
We illustrate several key methods in a real IPD-MA, in which IPD from 1225 participants in 5 randomized clinical trials were combined to compare the effects of carbamazepine and valproate on the incidence of epileptic seizures.
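The sketch below illustrates one common one-stage formulation for such data, assuming the lifelines library is available: a Cox model stratified by trial, so each trial retains its own baseline hazard while a shared log hazard ratio is estimated. The survival data are simulated and are not the epilepsy IPD described above.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(3)
rows = []
for trial in range(5):
    n = 250
    treat = rng.binomial(1, 0.5, n)
    lam = 0.10 * (1 + 0.3 * trial)            # trial-specific baseline hazard
    rate = lam * np.exp(np.log(0.7) * treat)  # hazard ratio 0.7 for treatment
    t = rng.exponential(1 / rate)             # event times
    c = rng.exponential(15, n)                # random censoring times
    rows.append(pd.DataFrame({
        "trial": trial,
        "treat": treat,
        "time": np.minimum(t, c),
        "event": (t <= c).astype(int),
    }))
ipd = pd.concat(rows, ignore_index=True)

# One-stage analysis: stratifying by trial lets every trial keep its own
# baseline hazard while the treatment effect is shared across trials.
cph = CoxPHFitter()
cph.fit(ipd, duration_col="time", event_col="event",
        strata=["trial"], formula="treat")
print(cph.summary[["coef", "exp(coef)"]])
```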
Background: Individual participant data meta-analysis (IPD-MA) is considered the gold standard for investigating subgroup effects. Frequently used regression-based approaches to detect subgroups in IPD-MA are: meta-regression, per-subgroup meta-analysis (PS-MA), meta-analysis of interaction terms (MA-IT), naive one-stage IPD-MA (ignoring potential study-level confounding), and centred one-stage IPD-MA (accounting for potential study-level confounding). Clear guidance on these analyses is lacking, and clinical researchers may use approaches with suboptimal efficiency to investigate subgroup effects in an IPD setting. Therefore, our aim is to provide an overview of and compare the aforementioned methods, and to provide recommendations on which should be preferred.
Methods: We conducted a simulation study in which we generated IPD from randomised trials and varied the magnitude of subgroup effect (0, 25, 50% relative reduction), between-study treatment effect heterogeneity (none, medium, large), ecological bias (none, quantitative, qualitative), sample size (50, 100, 200), and number of trials (5, 10) for binary, continuous, and time-to-event outcomes. For each scenario, we assessed the power, false positive rate (FPR), and bias of the aforementioned five approaches.
Results: Naive and centred IPD-MA yielded the highest power, whilst preserving acceptable FPR around the nominal 5% in all scenarios. Centred IPD-MA showed slightly less biased estimates than naive IPD-MA. Similar results were obtained for MA-IT, except when analysing binary outcomes (where it yielded less power and FPR <5%). PS-MA showed similar power as MA-IT in non-heterogeneous scenarios, but power collapsed as heterogeneity increased, and decreased even more in the presence of ecological bias. PS-MA suffered from too high FPRs in non-heterogeneous settings and showed biased estimates in all scenarios. Meta-regression showed poor power (<20%) in all scenarios and completely biased results in settings with qualitative ecological bias.
Conclusions: Our results indicate that subgroup detection in IPD-MA requires careful modelling. Naive and centred IPD-MA performed equally well, but due to less bias of the estimates in the presence of ecological bias, we recommend the latter.
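A brief sketch of the recommended centred one-stage approach is given below: centring the covariate around its trial-specific mean, and including the trial mean as a separate interaction term, separates the within-trial subgroup effect from any across-trial (ecological) association. The data are simulated, and for brevity the model uses fixed trial intercepts rather than the random effects one would typically add for residual heterogeneity.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
rows = []
for trial in range(10):
    n = 200
    x = rng.normal(trial, 1, n)  # covariate means differ across trials
    treat = rng.binomial(1, 0.5, n)
    y = 0.4 * treat + 0.2 * treat * (x - trial) + rng.normal(0, 1, n)
    rows.append(pd.DataFrame({"trial": trial, "x": x, "treat": treat, "y": y}))
ipd = pd.concat(rows, ignore_index=True)

# Centre the covariate within each trial and keep the trial mean as a
# separate term, so within- and across-trial information are not mixed.
ipd["x_mean"] = ipd.groupby("trial")["x"].transform("mean")
ipd["x_c"] = ipd["x"] - ipd["x_mean"]

fit = smf.ols("y ~ C(trial) + treat + x_c + treat:x_c + treat:x_mean",
              data=ipd).fit()
# treat:x_c is the within-trial subgroup effect of interest;
# treat:x_mean absorbs the across-trial (ecological) association.
print(fit.params[["treat:x_c", "treat:x_mean"]])
```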
Clinical prediction models aim to provide estimates of absolute risk for a diagnostic or prognostic endpoint. Such models may be derived from data from various studies in the context of a meta-analysis. We describe and propose approaches for assessing heterogeneity in predictor effects and predictions arising from models based on data from different sources. These methods are illustrated in a case study with patients suffering from traumatic brain injury, where we aim to predict 6-month mortality based on individual patient data using meta-analytic techniques (15 studies, n = 11022 patients). The insights into various aspects of heterogeneity are important to develop better models and understand problems with the transportability of absolute risk predictions.
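As a minimal illustration of examining heterogeneity in predictor effects across sources, the sketch below fits a single-predictor logistic model per study and then quantifies between-study variance in the coefficient with a DerSimonian-Laird estimate. The studies, predictor, and effect sizes are simulated and are not the traumatic brain injury data described above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
beta_true = rng.normal(0.8, 0.3, 8)  # the predictor effect varies by study
est, var = [], []
for b in beta_true:
    n = 400
    x = rng.normal(size=n)
    y = rng.binomial(1, 1 / (1 + np.exp(-(-1 + b * x))))
    fit = smf.logit("y ~ x", data=pd.DataFrame({"x": x, "y": y})).fit(disp=0)
    est.append(fit.params["x"])
    var.append(fit.bse["x"] ** 2)
est, var = np.array(est), np.array(var)

# DerSimonian-Laird estimate of the between-study variance (tau^2)
# in the predictor effect, from Cochran's Q statistic.
w = 1 / var
mu_fe = np.sum(w * est) / np.sum(w)
Q = np.sum(w * (est - mu_fe) ** 2)
k = len(est)
tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
print(f"Q = {Q:.2f}, tau^2 = {tau2:.3f}")
```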