Classification of (alternatively) spliced exons using state-of-the-art sequence kernels
Organization: Software Competence Center Hagenberg
Department: Master's Program Bioinformatics
School: Johannes Kepler University Linz
RNA splicing plays an essential role in protein synthesis, and a defect in this process can have major effects. Many diseases (e.g. cancer) are caused by defective splicing [Busch and Hertel, 2015]. During splicing, sequence regions that do not code for proteins (introns) are removed and the remaining sequences (exons) are rejoined. Through alternative splicing, a single gene can give rise to multiple proteins, depending on which splice sites are used. How these splice sites are chosen and recognized is still not fully understood.
In this work, DNA sequences from the C. elegans and H. sapiens genomes are analyzed by applying state-of-the-art machine learning techniques. Such methods are now used in many different areas, especially in bioinformatics. The processed C. elegans data was published by [Rätsch et al., 2005]. Due to the lack of accurate and current alternative splicing data sets, a new H. sapiens data set was created as an essential part of this thesis. A major goal of this work is to train classifiers that distinguish alternatively spliced exons from constitutively spliced exons by applying support vector machines (SVMs) and (sequence) kernels.
These methods achieve remarkable classification performance on biological sequence data. On the C. elegans data set, the performance of the classifier reported by the data set's authors is surpassed, and very good results are also achieved on the newly created H. sapiens data.
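To illustrate what a sequence kernel computes, the following is a minimal sketch of the spectrum kernel, one of the simplest sequence kernels. The abstract does not name the kernels actually used, so the choice of kernel, the function names `kmer_counts` and `spectrum_kernel`, the value of `k`, and the toy sequences are all assumptions made for illustration only.

```python
from collections import Counter
import math


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x, y, k=3, normalize=True):
    """Spectrum kernel: dot product of the k-mer count vectors of x and y."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    # Counter returns 0 for k-mers that do not occur in y.
    value = sum(count * cy[kmer] for kmer, count in cx.items())
    if not normalize:
        return float(value)
    # Cosine-style normalization so that k(x, x) == 1 for non-empty spectra.
    norm = math.sqrt(sum(c * c for c in cx.values()) *
                     sum(c * c for c in cy.values()))
    return value / norm if norm > 0 else 0.0


# Toy sequences, invented for illustration (not data from the thesis).
toy_exons = ["ACGTACGTAC", "ACGTTCGTAC", "GGGCCCGGGC"]
gram = [[spectrum_kernel(a, b) for b in toy_exons] for a in toy_exons]
```

A Gram matrix like `gram` could then be handed to any SVM implementation that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`. Whether the thesis used such a setup is not stated in the abstract.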
Subsequently, sequence regions that are important for classification are determined by calculating and analyzing prediction profiles based on the support vector machine models. Prediction profiles show the influence of individual sequence positions or regions on the final decision value. These sequence hot-spots are used to determine characteristic motifs of alternatively and constitutively spliced exons in C. elegans and H. sapiens, which are then statistically validated. Finally, the significant motifs are placed in a biological context.
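The idea of a prediction profile can be sketched as follows. This assumes a model whose decision value is a sum of per-k-mer weights (as with a spectrum-type kernel), and it uses one common convention of spreading each k-mer's weight evenly over the positions it covers. The abstract does not describe how the profiles were actually computed, and the function name `prediction_profile` and the weights below are hypothetical.

```python
def prediction_profile(seq, kmer_weights, k):
    """Distribute each k-mer's weight evenly over the k positions it covers."""
    profile = [0.0] * len(seq)
    for start in range(len(seq) - k + 1):
        # k-mers without a learned weight contribute nothing.
        weight = kmer_weights.get(seq[start:start + k], 0.0)
        for pos in range(start, start + k):
            profile[pos] += weight / k
    return profile


# Hypothetical feature weights, made up for illustration (not from the thesis).
toy_weights = {"AC": 1.0, "GT": -0.5}
toy_profile = prediction_profile("ACGTAC", toy_weights, k=2)
```

Under these assumptions the per-position values sum to the decision value without the bias term, so positions with large positive or negative values mark the regions that push a sequence toward one class or the other.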