We present a framework for sparse Gaussian process (GP) methods that uses forward selection with criteria based on information-theoretic principles, previously suggested for active learning. Our goal is not only to learn d-sparse predictors, which can be evaluated in O(d) rather than O(n) (where n is the number of training points and d is much smaller than n), but also to perform training under strong restrictions on time and memory requirements. The scaling of our method is at most O(n d^2), and in large real-world classification experiments we show that it can match the prediction performance of the popular support vector machine (SVM), yet can be significantly faster in training. In contrast to the SVM, our approximation produces estimates of predictive probabilities ("error bars"), allows for Bayesian model selection, and is simpler to implement.
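To make the O(n d^2) scaling concrete, the following is a minimal sketch of greedy forward selection with an entropy-based score. It is not the method as implemented for the classification experiments: it assumes the simpler case of GP regression with Gaussian noise, an RBF kernel with unit prior variance, and a differential entropy reduction score of the form 0.5 log(1 + variance / noise). The names `rbf_kernel`, `greedy_entropy_selection`, `noise_var`, and `lengthscale` are hypothetical and chosen for illustration only.

```python
import numpy as np


def rbf_kernel(X, Z, lengthscale=1.0):
    """RBF kernel matrix between rows of X and rows of Z (unit prior variance)."""
    sq = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Z**2, axis=1)[None, :]
          - 2.0 * X @ Z.T)
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)


def greedy_entropy_selection(X, d, noise_var=0.1, lengthscale=1.0):
    """Greedily pick d points, each maximising an entropy-reduction score.

    Assumption: Gaussian-noise regression, so the score depends only on the
    current posterior variance at each candidate point.
    Returns the selected indices and the posterior variances at all n points.
    """
    n = X.shape[0]
    var = np.ones(n)                 # prior variances (RBF diagonal is 1)
    M = np.zeros((d, n))             # low-rank factor: posterior cov = K - M.T @ M
    selected = np.zeros(n, dtype=bool)
    active = []
    for k in range(d):
        # Score every remaining candidate in O(n).
        score = 0.5 * np.log1p(var / noise_var)
        score = np.where(selected, -np.inf, score)
        j = int(np.argmax(score))
        # Only one kernel row is needed per inclusion.
        k_j = rbf_kernel(X[j:j + 1], X, lengthscale)[0]
        # Rank-one update of the posterior covariance, O(n k) per step.
        row = (k_j - M[:k, j] @ M[:k]) / np.sqrt(noise_var + var[j])
        M[k] = row
        var = np.maximum(var - row**2, 0.0)   # clamp guards against rounding
        selected[j] = True
        active.append(j)
    return active, var
```

Each of the d inclusions costs O(n d) for the rank-one update, which gives O(n d^2) overall, and the memory footprint is the d-by-n factor `M` rather than the full n-by-n kernel matrix. Under these assumptions the score reduces to picking the point with the largest remaining posterior variance; with a non-Gaussian likelihood, as in classification, the score would also depend on the targets.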