Modal Clustering in a Univariate Class of Product Partition Models

Movong

Le téléchargement nécessite un accès à la bibliothèque YouScribe
Tout savoir sur nos offres

17 pages

English

Le téléchargement nécessite un accès à la bibliothèque YouScribe
Tout savoir sur nos offres

A propos
Informations
Extrait

Description

Informations

Publié par	Movong
Nombre de lectures	56
Langue	English

Extrait

MODAL CLUSTERING IN A UNIVARIATE CLASS OF
PRODUCT PARTITION MODELS
David B. Dahl
dbdahl@stat.wisc.edu
Department of Statistics, and
Department of Biostatistics & Medical Informatics
University of Wisconsin – Madison
November 3, 2003
Technical Report #1085
Department of Statistics
University of Wisconsin – Madison
The author wishes to thank his Ph.D. adviser Michael Newton for helpful advice and Philippe
Broet for access to data used in the example. The author is supported by the National Institutes of
Health grant number EY07119. Software implementing the algorithm presented in this paper will
be available shortly athttp://www.stat.wisc.edu/˜dbdahl/modal/.MODAL CLUSTERING IN A UNIVARIATE CLASS OF
PRODUCT PARTITION MODELS
David B. Dahl
Abstract
This paper presents an algorithm for ﬁnding the maximum a posteriori (MAP) clustering
in a class of univariate product partition models. While the number of possible clusterings of
n observations grows according to the Bell exponential number, the dynamic programming
2algorithm presented here exploits properties of the model to provide anO(n ) search. Hence,
the algorithm can be used to ﬁnd the MAP clustering for tens of thousands of univariate data
points, whereas previously it could only be approximated through a stochastic search. Inte
grating over the latent location variables in a Dirichlet Process mixture (DPM) model leads
to a product partition model. The paper shows that several univariate, conjugate DPM mix
ture models satisfy the conditions for the mode ﬁnding algorithm. The clustering algorithm is
demonstrated with using data from a microarray experiment to detect differential gene expres
sion involving 4,608 genes.
Key Words: Model based clustering; Product partition model; Dirichlet Process mixture
model; MAP clustering
1 Introduction
The paper is organized as follows. Section 2 reviews product partition models which have the
key feature that the likelihood, prior, and posterior for a random partition of observed data are
products over partition components. After deﬁning a class of product partition models, Section
3 proves two theorems regarding modal partitions. This motivates a dynamic programming algo
rithm to ﬁnd the mode of the partition likelihood and the partition posterior for this class of models.
Section 4 describes how integrating out the component location in a conjugate Dirichlet Process
mixture (DPM) model leads to a product partition model with a particular partition prior. Section
5 provides several speciﬁc conjugate DPM models which fall into the class of models to which6
6
our mode ﬁnding algorithm applies. Finally, Section 6 presents an illustration of the algorithm
using data from Broet,¨ Richardson, and Radvanyi (2002). The example clusters 4,608 genes from
an experiment to detect differential gene expression using microarray technology and Broet¨ et al.
(2002).
2 Product Partition Models
The mode ﬁnding algorithm presented in this paper is relevant to a class of product partition mod
els. In this section, general product partition models are reviewed and notation is adopted. The
next section deﬁnes the subclass of interest.
A partitionη ={S ,...,S } of a setS ={1,...,n} has the following properties:1 q 0
• S =∅ fori = 1,...,q,i
• S ∩S =∅ fori =j, andi j
q
• ∪ S =S .j 0j=1
The sets S ,...,S are referred to as partition components. When S = {1,2,3}, for example,1 q 0
there are ﬁve possible partitions, namely:
{1,2,3} {1},{2,3} {1,2},{3}
{1,3},{2} {1},{2},{3}.
The set partition {1,2,3} denotes that all three objects belong to the same component, while
{1},{2},{3} is the partition placing each object into its own component. The Bell number, de
notedB(n), is the number of partitions ofn objects and has the recurrence relationB(n + 1) =
Pn n B(k), whereB(0) = 1. Table 1 contains Bell numbers for variousn.
k=0 k
Product partition models, introduced by Hartigan (1990) and Barry and Hartigan (1992), assume
that observations in different components of a partition of the data are independent. That is, then 3 7 50 100 200
47 115 275B(n) 5 877 ≈ 10 ≈ 10 ≈ 10
Table 1: Bell exponential numbersB(n) for variousn. The Bell number counts the number of set partitions
of{1,...,n}.
likelihood for a partitionη ={S ,...,S } with observed datay = (y ,...,y ) is a product over1 q 1 n
the components:
q
Y
p(y|η)∝ f(y ), (1)Sj
j=1
wherey is the vector of observations corresponding to the elements of the componentS . TheS jj
component likelihoodf(y ) — that is, the likelihood contribution from aS — is de S
ﬁned for any non empty componentS ⊂S and can take any number of forms. Speciﬁc examples0
off(y ) are given in Section 4.S
The prior distribution for a partitionη is also taken to be a product over the partition components:
q
Y
p(η)∝ h(S ), (2)j
j=1
whereh(S )≥ 0 is deﬁned for each non emptyS ⊂S andc is a normalizing constant. As withj 0 2
the likelihood contributionp(y ),h(S ) is left unspeciﬁed for now, but a particular form forh(S )S j jj
is discussed in Section 4.
All inference concerningη is made from the posterior distributionp(η|y). By Bayes theorem, the
posterior is expressed as the product of partition components:
p(η|y) =p(y|η)p(η)/p(y)
q qh ih iY Y
∝ f(y ) h(S )S jj
(3)
j=1 j=1
q
Y
= f(y )h(S ).S jj
j=1
In the context of cluster analysis, a set partition η deﬁnes a clustering for observed data y. The
componentsS ,...,S deﬁne groups of data referred to as clusters.1 q3 MAP Partition in a Class of Product Partition Models
This section deﬁnes a class of product partition models, discusses a strategy to verify that a partic
ular model falls into this class, and then introduces a mode ﬁnding algorithm applicable to these
models. The algorithm is deﬁned and its validity is established. Finally, the efﬁciency of the
algorithm is discussed.
3.1 Class of Models under Consideration
This section gives two conditions related to the mode ﬁnding algorithm to be presented shortly.
Condition 1 must hold in order to be able to ﬁnd the mode of the partition likelihood using the
algorithm. If Condition 2 also holds, then the algorithm can also be used to ﬁnd the mode of the
partition posterior. Before stating the conditions related to the mode ﬁnding algorithm, the key
concept of overlapping components is introduced.
Deﬁnition 1 (Overlapping Components). Two partition components S and S are said to bei j
overlapping components ifS contains an integer which is between the smallest and largest integersi
ofS , or vice versa.j
For example,S ={1,3} andS ={2} are overlapping components, butS ={1} andS ={2,3}i j i j
are not.
Condition 1. The likelihood componentf(y ) in (1) has the property that, for some permutationS
of the datay, the product of the likelihood components of any overlapping componentsS andSi j
∗ ∗ ∗ ∗is not greater than the product from some otherS andS , whereS andS representi j i j
a reallocation of the elements of S and S such that the number of elements of S and S is,i j i j
respectively, preserved.
An immediate implication of Condition 1 is that there is a mode of the partition likelihoodp(y|η)
in (1) that does not contain overlapping components, a key feature which will be exploited by the
mode ﬁnding algorithm. In the next subsection, a strategy is presented that is useful in verifying
that a particular product partition model satisﬁes Condition 1. Before presenting the strategy, a
related condition concerning the partition prior is stated.Condition 2. The functionh(S) in (2) only depends upon the number of elements contained inS.
An immediate implication of Conditions 1 and 2 is that there exists a mode of the partition posterior
p(η|y) in (3) that does not contain overlapping clusters and, thus, the algorithm is applicable to
ﬁnd the posterior mode.
3.2 Strategy for Verifying Condition 1
Condition 2 can be veriﬁed by trivial inspection, but it is not immediately obvious how to ver-
ify Condition 1 for a particular f(y ). This subsection presents a general strategy for verifyingS
Condition 1 in the case of univariate data.
The strategy proceeds as follows. Permute the datay such thaty ≤...≤y and suppose thatS1 n i
andS are two overlapping components. Leta andc be the smallest and largest integers inS andj i
letb be the smallest integer inS . SinceS andS are overlapping components, assume (withoutj i j
loss of generality) thata<b<c (otherwise, interchangeS andS when deﬁninga,b, andc).i j
[ [ [ [LetS = (S \{a})∪{b} andS = (S \{b})∪{a}. That is, deﬁneS andS fromS andSi j i ji j i j
] ]
by simply swappinga andb. Likewise, deﬁneS andS by swappingc andb inS andS . Noticei ji j
that swapping integers preserves the number of elements in each component.
Condition 1 is satisﬁed if
f(y )f(y )≤f(y [)f(y [) (4)S Si j S Si j
or
f(y )f(y )≤f(y ])f(y ]). (5)S Si j S S
i j
Consider two mutually exclusive and exhaustive cases. The ﬁrst case assumes thatf(y )f(y )≤[ [S S
i j
f(y ])f(y ]). Exploiting properties of the particularf(y ), show that this implies (5). The secondSS S
i j
case is the complement of the ﬁrst: f(y [)f(y [)>f(y ])f(y ]). Again, using properties of theS S S Si j i j
particular f(y ), show that this second case implies (4). Thus, (4) or (5) holds and, therefore,S
Condition 1 is satisﬁe