Abstract

End-to-end learning methods like deep neural networks have been the driving force behind the remarkable progress of machine learning in recent years. However, despite their success, the deployment of such networks in safety-critical use cases, such as healthcare, has been lagging. This is due to the black-box nature of deep neural networks: they take raw data as input and learn relevant features directly from the data, which makes their inference process hard to understand. To mitigate this, several explanation methods have been proposed, such as local linear proxy models, attribution methods, feature activation maps, and attention mechanisms. However, many of these explanation methods, attribution maps in particular, tend not to fulfill certain desiderata of faithful explanations, in particular robustness, i.e., explanations should be invariant to imperceptible perturbations of the input that do not alter the inference outcome. The poor robustness of attribution maps to such input alterations is a key factor that hinders trust in explanations and the deployment of neural networks in high-stakes scenarios. While the robustness of attribution maps has been studied extensively in the image domain, it has not been researched in the text domain at all. This is the focus of this thesis. First, we show that imperceptible, adversarial perturbations of attributions exist for text classifiers as well. We demonstrate this on five text classification datasets and a range of state-of-the-art classifier architectures. Moreover, we show that such perturbations transfer across model architectures and attribution methods, remaining effective in scenarios where the target model and explanation method are unknown. These initial findings demonstrate the need for a definition of attribution robustness that incorporates the extent to which the input sentences are altered, so that more perceptible adversarial perturbations can be distinguished from less perceptible ones. We therefore establish a new definition of attribution robustness, based on Lipschitz continuity, that reflects the perceptibility of such alterations. This allows the robustness of neural network attributions to be quantified and compared effectively. As part of this effort, we propose a set of metrics that capture the perceptibility of perturbations in text. Then, based on our new definition, we introduce a novel attack that yields perturbations altering explanations to a greater extent while being less perceptible. Lastly, in order to improve attribution robustness in text classifiers, we introduce a general framework for training robust classifiers, which generalizes current robust training objectives. We propose instantiations of this framework and show, in experiments on three biomedical text datasets, that attributions in medical text classifiers also lack robustness to small input perturbations. We then show that our instantiations successfully train networks with improved attribution robustness, outperforming baseline methods. Finally, we show that our framework performs better than or comparably to current methods in image classification as well, while being more general. In summary, our work significantly contributes to quantifying and improving the attribution robustness of text classifiers, taking a step towards enabling the safe deployment of state-of-the-art neural networks in real-life, safety-critical applications such as healthcare.
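
To give a rough idea of the Lipschitz-continuity-based notion of attribution robustness mentioned above, a minimal sketch of such a measure could take the following form; note that the symbols A, F, d_A, d_X, and the neighbourhood N_eps(x) are illustrative placeholders, not the notation or exact definition used in the thesis:

\[
  \hat{L}_F(x) \;=\; \max_{\substack{\tilde{x} \in \mathcal{N}_\epsilon(x) \\ F(\tilde{x}) = F(x)}}
  \frac{d_A\big(A(\tilde{x}, F),\, A(x, F)\big)}{d_X(\tilde{x}, x)}
\]

where \(A(x, F)\) denotes the attribution map of classifier \(F\) on input \(x\), \(d_A\) is a distance between attribution maps, and \(d_X\) is a perceptibility-aware distance between input sentences. Under this kind of formulation, a small value of \(\hat{L}_F(x)\) means that label-preserving edits to \(x\) that are barely perceptible can only change the explanation by a small amount, which is the property the thesis's definition, metrics, and robust training framework aim to quantify and improve.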
