Student project

Presentation and study of robustness for several methods to classify individuals based on their gene expressions

Several studies have shown that cancer tissue can be detected from gene expression data using machine learning methods. The main challenge in classifying gene expression data is to obtain accurate rules that are easy to interpret and that provide indications for follow-up studies. High accuracy is hard to achieve because of the small number of observations and the large number of genes in the human genome. Some machine learning methods rely on a large number of genes, which leads to decision rules that are usually difficult to interpret. Several such methods were tested on different samples and their results were compared. Most of them achieved high accuracy. Among these methods, one stood apart from the others by producing transparent results that were readily interpretable and very useful for follow-up studies: it highlighted the pairs of genes that were most effective for classifying individuals with respect to their gene expressions. This is the so-called Top Scoring Pair (TSP) classifier. It achieves prediction rates as high as those of the other methods. In contrast to other classifiers, which use considerably more genes and more complicated procedures, the TSP classifier is quick and easy to implement and involves only two genes. This yields very simple rules that are accurate and transparent. Finally, the TSP classifier is parameter-free, which avoids overfitting and inflation of the estimated prediction rate.
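As an illustration of why a two-gene rule is so easy to read, here is a minimal sketch of a TSP-style classifier. The abstract does not spell out the scoring rule, so this sketch assumes the usual formulation: for each pair of genes (i, j), compare how often gene i is expressed below gene j in each class, and keep the pair with the largest difference in those frequencies. The function names `fit_tsp` and `predict_tsp`, the tie-breaking behaviour, and the toy data are all hypothetical and are not taken from the project.

```python
from itertools import combinations


def fit_tsp(X, y):
    """Pick the gene pair whose ordering differs most between two classes.

    X: list of samples, each a list of expression values (one per gene).
    y: list of class labels, 0 or 1.
    Returns (i, j, p0, p1, score), where p0 and p1 are the fractions of
    class-0 and class-1 samples in which gene i < gene j.
    """
    n_genes = len(X[0])
    class0 = [x for x, label in zip(X, y) if label == 0]
    class1 = [x for x, label in zip(X, y) if label == 1]
    best = None
    for i, j in combinations(range(n_genes), 2):
        p0 = sum(x[i] < x[j] for x in class0) / len(class0)
        p1 = sum(x[i] < x[j] for x in class1) / len(class1)
        score = abs(p0 - p1)
        # Assumption: ties are broken by keeping the first pair found.
        if best is None or score > best[4]:
            best = (i, j, p0, p1, score)
    return best


def predict_tsp(model, x):
    """Predict the class in which the observed ordering was more frequent."""
    i, j, p0, p1, _ = model
    if x[i] < x[j]:
        return 1 if p1 > p0 else 0
    return 0 if p1 > p0 else 1


# Toy data invented for illustration: 6 samples, 3 genes.
X = [
    [1.0, 5.0, 3.0],
    [2.0, 6.0, 1.0],
    [0.5, 4.0, 2.5],
    [7.0, 2.0, 3.0],
    [8.0, 1.0, 2.0],
    [6.0, 3.0, 9.0],
]
y = [0, 0, 0, 1, 1, 1]

model = fit_tsp(X, y)
new_sample = [1.5, 4.5, 2.0]
predicted = predict_tsp(model, new_sample)
```

The resulting rule reads as a single comparison ("if gene i is expressed below gene j, predict one class, otherwise the other"), and the search itself has no tuning parameters, which is consistent with the parameter-free property mentioned above.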


    • EPFL-STUDENT-168227

    Record created on 2011-08-23, modified on 2016-08-09
