Abstract

Humans have the ability to learn. Having seen an object, we can recognise it later. We can do this because our nervous system combines efficient and robust visual processing with the capability to learn from sensory input. Designing algorithms that learn from visual data, on the other hand, is a difficult task. More than fifty years ago, Rosenblatt proposed the perceptron algorithm. The perceptron learns from example data a linear separation that categorises the data into two classes. The algorithm served as a simple model of neuronal learning. Two further important ideas were added to the perceptron. The first is to look for a maximal margin of separation. The second is to separate the data in a possibly high-dimensional feature space that is nonlinearly related to the initial space of the data, which allows nonlinear separations. Importantly, learning in the feature space can be performed implicitly, and hence efficiently, with the use of a kernel, a measure of similarity between two data points. The combination of these ideas led to the support vector machine, an efficient algorithm with high performance. In this thesis, we design an algorithm to learn the categorisation of data into multiple classes. This algorithm is applied to a real-time vision task, the recognition of human faces. Our algorithm can be seen as a generalisation of the support vector machine to multiple classes. We show how the algorithm can be implemented efficiently. To avoid a large number of small but time-consuming updates of the variables, limited-accuracy computations are used. We prove a bound on the accuracy needed to find a solution. The proof motivates the use of a heuristic, which further increases efficiency. We derive a second implementation using a stochastic gradient descent method. This implementation is appealing as it has a direct interpretation and can be used in an online setting.
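To make the kernel idea concrete, the following minimal sketch shows the classic kernelised perceptron, not the multi-class algorithm of this thesis. The Gaussian kernel, the parameter `gamma`, the epoch limit, and all function names are assumptions chosen for illustration; no margin is maximised here.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian similarity between two points (an assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def decision_value(X, y, alpha, x, kernel=rbf_kernel):
    """Implicit linear function in feature space, evaluated through the kernel."""
    return sum(a * yj * kernel(xj, x) for a, xj, yj in zip(alpha, X, y) if a)

def train_kernel_perceptron(X, y, kernel=rbf_kernel, epochs=20):
    """Count, for each example, how often it triggered a perceptron update."""
    alpha = [0] * len(X)
    for _ in range(epochs):
        mistakes = 0
        for i, (xi, yi) in enumerate(zip(X, y)):
            # A non-positive margin means the example is misclassified.
            if yi * decision_value(X, y, alpha, xi, kernel) <= 0:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:  # every training example is separated
            break
    return alpha

def predict(X, y, alpha, x, kernel=rbf_kernel):
    """Return the class label +1 or -1 for a new point x."""
    return 1 if decision_value(X, y, alpha, x, kernel) > 0 else -1

# XOR-like data: not linearly separable in the input space.
X = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
y = [-1, -1, 1, 1]
alpha = train_kernel_perceptron(X, y)
predictions = [predict(X, y, alpha, x) for x in X]
```

The feature space is never constructed explicitly; only kernel evaluations between pairs of data points are needed, which is what makes a nonlinear separation such as the XOR pattern above tractable.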
Conceptually, our approach differs from standard support vector approaches because examples can be rejected and are not necessarily attributed to one of the categories. This is natural in the context of a vision task: at any time, the sensory input can be something not seen before, which therefore cannot be recognised. Our visual data are images acquired with the recently developed adaptive vision sensor from CSEM. The vision sensor has two important features. First, like the human retina, it is locally adaptive to light intensity. Hence, the sensor has a high dynamic range. Second, the image gradient is computed on the sensor chip and is thus available directly from the sensor in real time. The sensor output is time-encoded: information about strong local contrast is transmitted first, and the weakest contrast information last. To recognise faces, possibly moving in front of the camera, the sensor images have to be processed in a robust way. Representing images so that they exhibit local invariances is a common yet unsolved problem in computer vision. We develop the following representation of the sensor output. The image gradient information is decomposed into local histograms over contrast intensity. The histograms are local in position and in the direction of the gradient. Hence, the representation has local invariance properties with respect to translation, rotation, and scaling. The histograms can be computed efficiently because the sensor output is already ordered with respect to the local contrast. Our support vector approach for multicategorical data uses the local histogram features to learn the recognition of faces. As recognition is time-consuming, a face detection stage is used beforehand. We learn the detection features in an unsupervised manner using a specially designed optimisation procedure. The combined system to detect and recognise faces of a small group of individuals is efficient, robust, and reliable.
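The local histogram representation described above can be illustrated with a small sketch. It is only an approximation under stated assumptions: the gradient is given as two dense arrays `gx` and `gy` rather than as the sensor's time-encoded output, and the cell size, the number of direction and contrast bins, and the name `local_histograms` are hypothetical choices, not taken from the thesis.

```python
import math

def local_histograms(gx, gy, cell=4, n_dir=4, n_mag=3, max_mag=1.0):
    """Histogram gradient contrast per spatial cell and direction bin.

    Returns hist[cell_row][cell_col][direction_bin][contrast_bin].
    Pixels outside the last complete cell are ignored.
    """
    rows, cols = len(gx), len(gx[0])
    n_cr, n_cc = rows // cell, cols // cell
    hist = [[[[0] * n_mag for _ in range(n_dir)]
             for _ in range(n_cc)] for _ in range(n_cr)]
    for r in range(n_cr * cell):
        for c in range(n_cc * cell):
            mag = math.hypot(gx[r][c], gy[r][c])
            angle = math.atan2(gy[r][c], gx[r][c]) % (2 * math.pi)
            # Quantise the direction and the contrast intensity.
            d = int(angle / (2 * math.pi) * n_dir) % n_dir
            m = min(int(mag / max_mag * n_mag), n_mag - 1)
            hist[r // cell][c // cell][d][m] += 1
    return hist

def single_edge(row, col, size=4):
    """Toy gradient field with one horizontal-gradient pixel (illustrative)."""
    gx = [[0.0] * size for _ in range(size)]
    gy = [[0.0] * size for _ in range(size)]
    gx[row][col] = 1.0
    return gx, gy

# Moving the edge pixel inside the same cell leaves the features unchanged.
h_a = local_histograms(*single_edge(0, 0))
h_b = local_histograms(*single_edge(1, 2))
```

Because each pixel contributes only to the bin of its cell, direction, and contrast level, small shifts within a cell or small changes of direction within a bin do not alter the features, which is the sense in which such a representation is locally invariant.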
