Dimension reduction and classification with high-dimensional microarray data [electronic resource] / submitted by Anne-Laure Isabeau Boulesteix

Dimension Reduction and Classification
with High-Dimensional Microarray Data
Dissertation at the Faculty of Mathematics, Computer Science and
Statistics of the Ludwig-Maximilians-Universität München
submitted by
Anne-Laure Isabeau Boulesteix
on 18.11.2004
1st referee: Prof. Dr. G. Tutz
2nd referee: Prof. Dr. L. Fahrmeir
3rd referee: Prof. Dr. U. Gather
Oral examination (Rigorosum): 22.02.2005
PREFACE
This thesis was written over the past two and a half years while I worked as a research assistant at the Institut für Statistik of the Ludwig-Maximilians-Universität München. It was funded in part by the Collaborative Research Centre (Sonderforschungsbereich) 386 and the Emmy Noether Programme of the DFG.
First of all, I would like to thank my doctoral supervisor Gerhard Tutz, whose fruitful discussions greatly helped me to develop new ideas and put them into practice. He gave me a great deal of freedom and trust along the way. Special thanks go to my second supervisor Korbinian Strimmer, who was very helpful, particularly at the beginning of my doctoral work, and who, as my office neighbour, made for a good working atmosphere.
I also thank Ludwig Fahrmeir and Ursula Gather, who kindly agreed to review this thesis, and my diploma supervisor Volkmar Liebscher, who awakened my enthusiasm for scientific work. I would further like to thank my colleagues at the Institut für Statistik, in particular the members of the Seminar für angewandte Stochastik and of the working group for statistical genetics and bioinformatics, who made for a pleasant working atmosphere.
Last but not least, I warmly thank my parents and my husband for their support over many years, and my son Victor for his good mood and his cheering smiles. Without my husband's enormous contribution to organising our everyday family life, I would certainly not have completed my doctorate for another two years!
ABSTRACT
Classical microarray data sets typically contain thousands of predictors but only a two-digit number of observations. It is therefore a major challenge to transform the high-dimensional predictor space so that classification, for example cancer diagnosis, becomes possible. This thesis examines various approaches to dimension reduction for such data.
Chapter 2 is an introduction to classification with microarray data and also gives an overview of some specific problems (variable selection, comparison of several classification methods). In Chapter 3, I examine particular interaction structures in the context of classification: "emerging patterns". I introduce a new and more general definition based on the underlying probabilities and present a new, simple search method based on the CART algorithm that finds the corresponding empirical patterns in concrete data sets. I implemented the search algorithm as well as the classification method in the programming language R. Some of these programs are freely available on my homepage. Chapter 4 deals with classical linear dimension reduction. In the setting of binary classification with continuous predictors, I prove the connections between the Partial Least Squares (PLS) method, between-group principal component analysis and linear discriminant analysis. PLS dimension reduction is studied in detail in Chapter 5. The classification method that combines PLS dimension reduction with linear discriminant analysis is compared with the best known methods on nine microarray data sets and proves to be the best approach. In addition, I apply a boosting algorithm to this classification method. I also propose a simple approach for choosing the number of PLS components. Finally, I examine the theoretical connection between PLS dimension reduction and variable selection: I prove an equivalence property between a well-known variable selection criterion and an approach based on the first PLS component.
SUMMARY
Typical microarray data sets include only a handful of observations but several thousand predictor variables. Transforming the high-dimensional predictor space to make classification (for instance cancer diagnosis) possible is a major challenge. This thesis deals with various dimension reduction approaches that can handle such data.
Chapter 2 gives an introduction to classification with microarray data as well as an overview of a few specific problems such as variable selection and the comparison of classification methods. In Chapter 3, I discuss a particular class of interaction structures in the classification framework: "emerging patterns". I propose a new and more general definition referring to underlying probabilities and present a new, simple method based on the CART algorithm to find the corresponding empirical patterns in concrete data sets. In addition, the detected patterns can be used to define new variables for classification. Thus, I propose a simple scheme that uses the patterns to improve the performance of classification procedures. I implemented the search algorithm as well as the classification procedure in the language R. Some of these programs are publicly available from my homepage. Chapter 4 deals with classical linear dimension reduction methods. In the context of binary classification with continuous predictors, I prove two properties concerning the connections between Partial Least Squares (PLS) dimension reduction and between-group PCA, and between linear discriminant analysis and between-group PCA. PLS dimension reduction for classification is examined thoroughly in Chapter 5. The classification procedure consisting of PLS dimension reduction followed by linear discriminant analysis on the new components compares favorably with some of the best state-of-the-art classification methods on nine real microarray cancer data sets. Moreover, I apply a boosting algorithm to this classification method, which is a novel approach. In addition, I suggest a simple procedure to choose the number of PLS components. Finally, I examine the connection between PLS dimension reduction and variable selection and prove a property concerning the equivalence between a common univariate criterion and a variable selection approach based on the first PLS component.
Contents
1 Introduction
1.1 High-dimensional microarray data
1.2 Guideline through the thesis
1.3 Notations
2 Classification with application to microarray data
2.1 Overview of classification with high-dimensional microarray data
2.2 Comparing methods
2.2.1 Decision theory
2.2.2 Comparing two classification methods in practice
2.3 Variable selection
2.3.1 Univariate ranking methods
2.3.2 Optimal subset selection
3 Emerging and Interaction Patterns
3.1 Introduction
3.2 Definition of interaction patterns
3.2.1 Interaction Patterns for two classes
3.2.2 Generalization to multicategorical response
3.3 Discovering interaction patterns with trees
3.3.1 Tree methodology
3.3.2 Discovering Algorithm
3.3.3 Receiver Operating Characteristic
3.3.4 Simulation study
3.4 Classification based on interaction patterns
3.4.1 Method
3.4.2 Study Design
3.4.3 Data sets
3.4.4 Results
3.4.5 An example
3.5 Discussion
4 Linear dimension reduction for classification
4.1 Introduction
4.2 Between-group PCA
4.2.1 Definition