Airline schedules are built to maximize expected profit. Such schedules turn out to be more sensitive to delays and are hence unstable. The trend has thus evolved towards robust schedules, which trade lower sensitivity to disruptions (and hence lower delay costs) against higher operational costs in the deterministic schedule. In this paper, we discuss the different ways to evaluate robustness. We first show that robustness has no unique definition: definitions differ mainly in how robustness is modeled and evaluated. We then compare different models for the Maintenance Routing Problem (MRP) according to the most common robustness metrics. Using data from a real airline, we evaluate the robustness of different models that aim to increase total slack in order to reduce delay propagation. We show that some of the robustness metrics are correlated, though not necessarily positively. Furthermore, we show that, for a given metric, efficiency varies depending on several factors, such as the objective of the model and whether, and in which way, the model uses historical data. Finally, we show that no solution is globally better than the others, but that all of them improve on the original schedule.