Model selection for partial least squares calibration and implications for analysis of atmospheric organic aerosol samples with mid-infrared spectroscopy
In developing partial least squares calibration models, selecting the number of latent variables used for their construction to minimize both model bias and model variance remains a challenge. Several metrics exist for incorporating these trade-offs, but the cost of model parsimony and the potential for underfitting on achievable prediction errors are difficult to anticipate. We propose a metric that penalizes growing model variance against decreasing bias as additional latent variables are added. The magnitude of the penalty is scaled by a user-defined parameter that is formulated to provide a constraint on the fractional increase in root mean square error of cross-validation (RMSECV) when selecting a parsimonious model over the conventional minimum RMSECV solution. We evaluate this approach for quantification of four organic functional groups using 238 laboratory standards and 750 complex atmospheric organic aerosol mixtures with mid-infrared spectroscopy. Parametric variation of this penalty demonstrates that increase in prediction errors due to underfitting is bounded by the magnitude of the penalty for samples similar to laboratory standards used for model training and validation. Imposing an ensemble of penalties corresponding to a 0-30% allowable increase in RMSECV through sum of ranking differences leads to the selection of a model that increases the actual RMSECV up to 20% for laboratory standards but achieves an 85% reduction in the mean error in predicted concentrations for environmental mixtures. Partial least squares models developed with laboratory mixtures can provide useful predictions in complex environmental samples, but may benefit from protection against overfitting. (C) 2015 The Authors. Journal of Chemometrics published by John Wiley & Sons Ltd.