Many prediction models perform worse when applied to new individuals than in the data from which they were derived, which may be caused by invalid model parameters or by differences in case-mix distributions. Because prediction models are increasingly developed and validated using data from different studies, settings and populations, the validity of model predictions may be affected by variation in how well the available samples represent the targeted population.
We present propensity-based standardized measures of discrimination and calibration performance, and discuss how these can be used during external validation to assess model transportability and to decide on updating strategies. We further discuss how, when developing a new prediction model, standardization of the available samples improves the estimation of model parameters and subsequent performance across the included settings and populations.
We evaluate the proposed propensity-based standardization methods in an extensive simulation study, exploring under which circumstances standardization of multiple development samples improves prediction model performance, and to what extent model performance in particular settings and populations can still be assessed when all data are used during model development. Results demonstrate that combining and standardizing all available samples for model development yields more favorable c-statistics and Brier scores, even when some of the included studies have case-mix distributions or predictor-outcome associations that do not properly reflect the target population. When no samples are reserved for external validation, performance assessment requires standardized bootstrap procedures to avoid bias and over-optimism. Finally, we illustrate our methods in a motivating example in which data from 13 studies were used to develop and externally validate a model to diagnose deep vein thrombosis in patients with suspected deep vein thrombosis.
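To illustrate the kind of standardized measure described above, the following Python sketch computes a weighted c-statistic and Brier score from predicted risks, observed outcomes, and sample weights. The function name and the use of inverse membership-propensity weights are illustrative assumptions for this sketch, not the paper's exact estimators.

```python
import numpy as np

def standardized_performance(y, p, w):
    """Weighted (standardized) c-statistic and Brier score.

    y : binary outcomes (0/1)
    p : predicted probabilities
    w : standardization weights, e.g. inverse membership-propensity
        weights that reweight the sample toward the target population
        (an assumption of this sketch).
    """
    y, p, w = map(np.asarray, (y, p, w))
    # Weighted Brier score: weighted mean squared prediction error.
    brier = np.sum(w * (p - y) ** 2) / np.sum(w)
    # Weighted c-statistic: weighted probability that a randomly drawn
    # case receives a higher predicted risk than a randomly drawn
    # non-case, with ties counted as 1/2.
    cases, controls = y == 1, y == 0
    pair_w = np.outer(w[cases], w[controls])            # pairwise weights
    diff = p[cases][:, None] - p[controls][None, :]     # risk differences
    concordant = (diff > 0) + 0.5 * (diff == 0)
    cstat = np.sum(pair_w * concordant) / np.sum(pair_w)
    return cstat, brier
```

With unit weights this reduces to the ordinary (unstandardized) c-statistic and Brier score, so the weights alone carry the standardization toward the target population.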
In conclusion, propensity score-based standardization may help to (i) improve the interpretation of external validation studies of existing prediction models, and (ii) enhance the reproducibility of newly derived prediction models across different settings and populations.