Introduction to Statistical Methods for Microarray Data Analysis
T. Mary-Huard, F. Picard, S. Robin
Institut National Agronomique Paris-Grignon
UMR INA PG / INRA / ENGREF 518 de Biométrie
16, rue Claude Bernard, F-75005 Paris, France
(maryhuar)(picard)(robin)@inapg.inra.fr
June 30, 2004

Contents
1 Introduction
1.1 From genomics to functional genomics
1.1.1 The basics of molecular genetic studies
1.1.2 The success of sequencing projects
1.1.3 Aims of functional genomics
1.2 A new technology for transcriptome studies
1.2.1 The potential of transcriptome studies
1.2.2 The basis of microarray experiments
1.2.3 Different types of microarrays
1.2.4 Data collection
1.3 Upstream intervention of statistical concepts
1.3.1 The variability of microarray data and the need for normalization
1.3.2 Experimental design
1.3.3 Normalization
1.4 Downstream need for appropriate statistical tools
1.4.1 Class Discovery
1.4.2 Class Comparison
1.4.3 Class Prediction
2 Experimental designs
2.1 Aim of designing experiments
2.2 Two conditions comparison
2.2.1 Unpaired data
2.2.2 Paired data
2.3 Comparison between T conditions
2.3.1 Designs for paired data
3 Data normalization
3.1 Detection of technical biases
3.1.1 Exploratory methods
3.1.2 Detection of specific artifacts
3.2 Correction of technical artifacts
3.2.1 Systematic biases
3.2.2 Gene dependent biases
3.2.3 Variance normalization
3.3 Conditions for normalization
3.3.1 Three hypotheses
3.3.2 Enhancement of the normalization
4 Gene clustering
4.1 Distance-based methods
4.1.1 Dissimilarities and distances between genes
4.1.2 Combinatorial complexity and heuristics
4.1.3 Hierarchical clustering
4.1.4 K means
4.2 Model-based methods
4.2.1 Mixture model
4.2.2 Parameter estimation
4.2.3 Choice of the number of groups
5 Differential analysis
5.1 Classical concepts and tools for hypothesis testing
5.2 Presentation of the t-test
5.2.1 The t-test in the parametric context
5.2.2 The non parametric context
5.2.3 Power of the t-test
5.3 Modeling the variance
5.3.1 A gene specific variance?
5.3.2 A common variance?
5.3.3 An intermediate solution
5.4 Multiple testing problems
5.4.1 Controlling the Family Wise Error Rate
5.4.2 Practical implementation of control procedures
5.4.3 Adaptative procedures for the control of the FWER
5.4.4 Dependency
5.5 An other approach, the False Discovery Rate
5.5.1 Controlling the False Discovery Rate
5.5.2 Estimating the False Discovery Rate and the definition of q-values
6 Supervised classification
6.1 The aim of supervised classification
6.2 Supervised classification methods
6.2.1 Fisher Discriminant Analysis
6.2.2 k-Nearest Neighbors
6.2.3 Support Vector Machines
6.3 Error rate estimation
6.4 Variable selection
Chapter 1
Introduction
1.1 From genomics to functional genomics
1.1.1 The basics of molecular genetic studies
The basics of molecular biology have been summarized in a concept called the Central
Dogma of Molecular Biology. DNA molecules contain biological information coded in an
alphabet of four letters, A (Adenine), T (Thymine), C (Cytosine), G (Guanine). The
succession of these letters is referred to as a DNA sequence, which constitutes the complete
genetic information defining the structure and function of an organism.
Proteins can be viewed as effectors of the genetic information contained in DNA coding
sequences. They are formed using the genetic code to convert the information
contained in the 4-letter alphabet of DNA into a new alphabet of 20 amino acids. Despite
the apparent simplicity of this translation procedure, the conversion of the DNA-based
information requires two steps in eukaryotic cells, since the genetic material in the nucleus
is physically separated from the site of protein synthesis in the cytoplasm of the cell.
Transcription constitutes the intermediate step, where a DNA segment that constitutes a
gene is read and transcribed into a single-stranded molecule of RNA (the 4-letter alphabet
remains, with Thymine replaced by Uracil). RNAs that
contain information to be translated into proteins are called messenger RNAs, since they
constitute the physical vector that carries the genetic information from the nucleus to the
cytoplasm, where it is translated into proteins via molecules called ribosomes (figure 1.1).
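The two steps described above can be sketched in a few lines of Python. This is only an
illustration, not part of the original text: the function names transcribe and translate
are hypothetical, the input is assumed to be the coding strand of a gene read 5' to 3',
the standard genetic code is assumed, and stop codons are written as "*".

```python
# Minimal sketch of the central dogma: DNA -> mRNA -> protein.
# Assumptions (not from the source text): the input is the coding strand,
# the standard genetic code applies, and "*" marks a stop codon.

BASES = "UCAG"
# Amino acids (one-letter codes) for the 64 codons, enumerated in UCAG order.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)


def transcribe(dna: str) -> str:
    """Transcription: same 4-letter alphabet, with Thymine replaced by Uracil."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translation: read codons (triplets) until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    gene = "ATGGCCTGA"       # toy sequence: start codon, one codon, stop codon
    mrna = transcribe(gene)  # "AUGGCCUGA"
    print(mrna, translate(mrna))
```

The 64 possible triplets map onto only 20 amino acids plus the stop signal, which is why
the table above contains repeated letters: several codons encode the same amino acid.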
Biological information is contained in the DNA molecule that can be viewed as a
template, then in the RNA sequence that is a vector, and in proteins which constitute
effectors. These three levels of information constitute the fundamental material for the
study of the genetic information contained in any organism:
1 - Finding coding sequences in the DNA,
2 - Measuring the abundance of RNAs,
3 - Studying the diversity of Proteins.
Figure 1.1: The central dogma of molecular biology
1.1.2 The success of sequencing projects
In the past decades, considerable effort has been made in the collection and dissemination
of DNA sequence information, through initiatives such as the Human Genome
Project. The explosion of sequence-based information is illustrated by the sequencing
of the genomes of more than 800 organisms, representing more than 3.5 million genetic
sequences deposited in international repositories (Butte (2002)). The aim of this first
phase of the genomic era was to elucidate the exact sequence of the nucleotides
in the DNA code, which has allowed the search for coding sequences scattered all
along the genomes, via automatic annotation. Nevertheless, there is no strict correspondence
between the information contained in the DNA and the effective biological activity
of proteins. More generally, genotype and phenotype do not correspond
strictly, due to the physical specificity of genomes, which have a dynamic structure (Pollack
and Iyer (2003)), and also due to environmental influences. This explains why there is
now a considerable imbalance between the number of identified sequences and the
understanding of their biological functions, which remain unknown for most genes.
The next logical step is then to discover the underlying biological information contained
in the succession of nucleotides that has been read through sequencing projects. A
