Infoscience

Thesis

Prediction of Survival and Risk Assessment Using Joint Analysis of Microarray Gene Expression Data

Gene expression profiles have been widely used in molecular classification, diagnosis and prediction, particularly in oncology, where accurate and early diagnosis is needed for appropriate treatment. Avoiding under-treatment and unnecessary over-treatment can extend a patient's survival and prevent disease recurrence. High-throughput assay technologies such as microarrays have generated terabytes of data, which have been exploited extensively to provide insights into cancer biology and the underlying mechanisms of disease progression. The ultimate goal is to identify tailored treatments and therapies for personalized medicine. Analysis of microarray data is constrained by the following characteristics: the data are (i) noisy, due to missing or erroneous values; (ii) high dimensional, due to a large number of genes versus a small number of samples in which their expression levels are measured; and (iii) costly, because microarray experiments are expensive. Abundant microarray gene expression data should be processed by appropriate computational and statistical learning methodologies such as machine learning techniques. These methods are robust to noisy data and have a great capacity to analyze high-dimensional data. Their computational power is nevertheless limited by the size of the sample on which they are built. These algorithms have been widely applied to microarray gene expression data to identify a set of genes, known as a gene signature, whose expression levels are highly correlated with a target value or outcome such as disease status, tumor subtype, a patient's survival time, or risk of mortality or cancer relapse. Predicting survival time and a patient's risk, which are unknown at diagnosis, is a more challenging task for machine learning methods than classifying tumor subtype or disease status, which is already established by oncologists.
The properties of microarray data cited above, the limited number of samples from cancer patients, and the dependence of machine learning performance on sample size justify joint analysis of microarray data sets to increase the number of samples. We applied joint analysis methods to breast and lung cancer data sets to improve survival prediction and risk assessment. Overall, joint analysis neither significantly improved nor significantly degraded prediction accuracy. However, increasing the sample size helped to identify robust, stable gene signatures predictive of survival time and risk. Our results and the lessons learned from joint analysis of microarray gene expression data can serve as a guideline for future research in classification and prediction.
