Learning on Weighted Hypergraphs to Integrate Protein Interactions and Gene Expressions for Cancer Outcome Prediction
TaeHyun Hwang*, Ze Tian*, and Rui Kuang†
Department of Computer Science and Engineering, University of Minnesota Twin Cities
{thwang, tianze, kuang}@cs.umn.edu
Jean-Pierre Kocher
Bioinformatics Core, Mayo Clinic College of Medicine
Kocher.JeanPierre@mayo.edu
Abstract
Building reliable predictive models from multiple complementary genomic data for cancer study is a crucial step towards successful cancer treatment and a full understanding of the underlying biological principles. To tackle this challenging data integration problem, we propose a hypergraph-based learning algorithm called HyperGene to integrate microarray gene expressions and protein-protein interactions for cancer outcome prediction and biomarker identification. HyperGene is a robust two-step iterative method that alternately finds the optimal outcome prediction and the optimal weighting of the marker genes guided by a protein-protein interaction network. Under the hypothesis that cancer-related genes tend to interact with each other, the HyperGene algorithm uses a protein-protein interaction network as prior knowledge by imposing a consistent weighting of interacting genes. Our experimental results on two large-scale breast cancer gene expression datasets show that HyperGene utilizing a curated protein-protein interaction network achieves significantly improved cancer outcome prediction. Moreover, HyperGene also retrieves many known cancer genes as highly weighted marker genes.
1. Introduction
Finding gene predictors of cancer outcome from genomic data is becoming an increasingly important focus in cancer research, under the assumption that genomic information can shed light on the molecular mechanisms underlying cancer development and progression. In the past decade, an enormous amount of large-scale microarray gene expression profiles have been produced to study different cancers, such as breast cancer [18, 19], lung cancer [16] and prostate cancer [7], for the purposes of 1) detecting marker genes for cancer-relevant phenotypes and 2) building reliable predictive models for cancer prognosis or diagnosis. The two tasks are closely intertwined: on one hand, a predictive model built from highly predictive marker genes is often more accurate in outcome prediction; on the other hand, a highly accurate prediction model can also be analyzed to reveal unknown cancer marker genes. Different machine learning and data mining strategies for feature selection have been applied to identifying a subset of genes that can maximize the prediction performance of a classifier [18].

Although many interesting and promising findings have been reported in these studies, their reliability has been questioned over concerns about unstable and inconsistent results in cross-validations and cross-platform comparisons, due to the relatively small sample sizes in the studies [6]. To overcome this difficulty, it has been proposed to include other complementary genomic information, such as pathway information or functional annotations, to aid the process of model building and biomarker discovery, so that the prior knowledge from the complementary data can improve the robustness of the model and result in more consistent discoveries across independent datasets [4, 3, 13]. The availability of large protein-protein interaction networks, which contain information on gene functions, pathways and the modularity of gene regulation, provides a desirable source of data for this purpose. Protein-protein interactions can be derived from a number of experimental techniques, such as the yeast two-hybrid system and mass spectrometry [11]. The high consistency between the networks derived from different organisms allows the integration of many small networks into a large-scale network. It has been observed that cancer genes tend to be highly connected with each other in large-scale protein-protein interaction networks [3].

* The first two authors contributed equally to this work.
† Author to whom correspondence should be addressed.
It has been shown in [4] that by incorporating a protein-protein interaction network into the model built from microarray gene expressions, the authors can improve cancer outcome prediction and obtain more reproducible
2008 Eighth IEEE International Conference on Data Mining
1550-4786/08 $25.00 © 2008 IEEE, DOI 10.1109/ICDM.2008.37
[Figure 1 shows: (A) discretized gene expression profiles, a table of samples S1-S6 with labels +1/-1/0 and genes discretized to up/basal/down; (B) the hypergraph built from the gene expression profiles, with one up-state and one down-state hyperedge per gene grouping the samples; (C) the protein-protein interaction network. Three costs are illustrated: Cost 1, highly connected samples should have the same label; Cost 2, supervised learning, the prediction should be consistent with the initial labeling; Cost 3, genes that interact with each other should have similar weights. A notation legend defines the hypergraph G with vertex set V and hyperedge set E weighted by w, the predicted and initial label of each sample, the weight of each hyperedge, the degrees of samples and hyperedges in the hypergraph, the degree of each hyperedge in the protein-protein interaction network, and the interactions between hyperedges.]
Figure 1. Regularization framework of HyperGene.
results on two large-scale gene expression datasets. In their approach, the integration of gene expressions and protein-protein interactions is achieved by two independent procedures: discriminative subnetworks are first identified from a curated protein-protein interaction network, and the subnetworks are then used as features to predict cancer metastasis. The authors in [13] proposed a method which first computes the spectral graph structure of a gene network and then uses the spectral graph structure to smooth microarray gene expressions before sample classification. A statistics-based method is proposed in [3] to identify cancer genes by scoring genes by their degree in a cancer-specific interaction network, their differential expression in microarray data, and their structural, functional and evolutionary properties. However, designing a unified strategy for integrating protein-protein interactions and microarray gene expressions is still a challenging problem due to the complexity of joint learning on two different data types.

In this paper, we propose a hypergraph-based iterative learning algorithm called HyperGene to integrate microarray gene expressions with protein-protein interactions for robust cancer outcome prediction and marker gene identification. The HyperGene algorithm minimizes a cost function under a unified regularization framework which elegantly takes a protein-protein interaction network as constraints on a hypergraph built from microarray gene expressions. The HyperGene algorithm is a natural extension of label propagation algorithms on hypergraphs [2, 1, 22]. HyperGene is based on a hypergraph in which each sample is denoted by a vertex and each gene is denoted by two hyperedges: an "up-regulated" hyperedge and a "down-regulated" hyperedge. The two edges group samples by the expression state (up/down) of the gene in the samples (Figure 1A&B).
Our cluster assumption on the hypergraph is that samples of the same type tend to have similar gene expression patterns and thus are highly connected by the hyperedges. Since the original hypergraph-based learning algorithms assume uniform weighting of the hyperedges [1, 22], direct application of these algorithms to high-dimensional and noisy genomic data results in inferior prediction accuracy. The HyperGene algorithm is fundamentally different in reformulating the optimization problem as learning labels and hyperedge weights together, with the assignment of edge weights constrained by a protein-protein interaction network. Essentially, to avoid overfitting the training data, the HyperGene algorithm tries to find a weighting of hyperedges that balances the two-class separation on the hypergraph against the consistency with the protein-protein interaction network. These properties of the HyperGene algorithm promise to improve prediction accuracy and provide more robust identification of marker genes. Furthermore, the resulting weights on the genes can be used to discover highly weighted subnetworks in the protein-protein interaction network, which might also suggest important pathways related to cancer outcomes.
2. Regularization Framework
In Figure 1, we show the regularization framework of our formulation. We first discretize gene expression profiles into three states: basal or up/down-regulated (Figure 1A), and build a hypergraph with (positive/negative/test) samples as vertices and gene expression states as hyperedges (Figure 1B). The regularization framework seeks a global solution to both outcome prediction and gene weighting by considering the connectivities in the hypergraph, and the incorporation of the protein-protein interaction network provides useful prior knowledge for weighting interacting genes with similar values (Figure 1C). The cost function is defined on three loss terms: 1) inconsistent labeling of samples that are highly connected in the hypergraph; 2) inconsistent labeling of training samples with known outcomes; 3) inconsistent weighting of the hyperedges associated with interacting genes in the protein-protein interaction network. Our objective is to find a solution that minimizes the weighted sum of the three loss terms.
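As a toy illustration of the discretization step, the following sketch maps expression values to the three states using a symmetric threshold. The threshold value and the data are our own assumptions; the paper does not fix a particular cutoff here.

```python
import numpy as np

# Toy expression matrix: rows = samples, columns = genes (hypothetical values).
X = np.array([[ 1.8, -0.2,  2.1],
              [-1.5,  0.1, -2.0],
              [ 0.0,  1.7, -1.6]])

def discretize(X, t=1.0):
    """Map each expression value to +1 (up-regulated), -1 (down-regulated)
    or 0 (basal) using a symmetric threshold t (an assumed choice)."""
    D = np.zeros_like(X, dtype=int)
    D[X >= t] = 1
    D[X <= -t] = -1
    return D

print(discretize(X))
```

Each column of the resulting matrix then induces one "up" hyperedge (the samples with +1) and one "down" hyperedge (the samples with -1), as described in Section 2.1.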
2.1. Learning on weighted hypergraphs
A hypergraph is a special graph which contains hyperedges. In a simple graph, each edge connects a pair of vertices, but in a hypergraph each edge can connect an arbitrary number of vertices. Hypergraphs are often used with algorithms for exploring higher-order correlations between objects in data mining and bioinformatics [17, 20, 21]. Let $V = \{v_1, v_2, \ldots, v_{|V|}\}$ be a set of vertices and $E = \{e_1, e_2, \ldots, e_{|E|}\}$ be a set of edges defined on $V$: for any hyperedge $e \in E$, $e = \{v_{(e)1}, v_{(e)2}, \ldots, v_{(e)|e|}\}$, where $\{v_{(e)1}, v_{(e)2}, \ldots, v_{(e)|e|}\}$ is a subset of $V$. A hyperedge $e$ and a vertex $v$ are called incident if $v \in e$. A nonnegative real number (a weight) can be assigned to each hyperedge by a function $w$ ($w$ can also be defined as a vector variable, and we will use both notations interchangeably). The vertex set $V$, hyperedge set $E$ and the weight function $w$ fully define a weighted hypergraph, denoted by $G(V, E, w)$. The incidence matrix $H$ for hypergraph $G(V, E, w)$ is a $|V| \times |E|$ matrix with elements defined as $h(v, e) = 1$ when $v \in e$ and $0$ otherwise. The degree of a vertex $v$ is defined as $d(v) = \sum_{e \in E} h(v, e)\, w(e)$, which is the sum of the weights of the hyperedges incident with $v$. The degree of a hyperedge $e$ is defined as $d(e) = |\{v \mid v \in e\}|$, which is the number of vertices incident with $e$. Finally, we define $W$ as the diagonal matrix whose elements on the diagonal are the weights of the hyperedges, and $D_v$ and $D_e$ as the diagonal matrices with elements on the diagonal being the degrees of vertices and hyperedges (the row and column sums of $H$, respectively). Note that for a hyperedge $i$, $D_e(i,i) = d(i)$. However, for a vertex $j$, $D_v(j,j) = d(j)$ if and only if $\sum_{e \in E} h(j, e)\, w(e) = d(j)$.

We use a weighted hypergraph $G(V, E, w)$ to model the gene expression data: each sample is denoted by a vertex $v \in V$ and each hyperedge denotes one of the two expression states (up/down-regulated) of a gene (Figure 1A). Thus, each gene is associated with two hyperedges in the hypergraph. The incidences between $V$ and $E$ are decided by the gene expression values on the samples. If the expression value of a gene $i$ is positive on sample set $V_1$ and negative on sample set $V_2$, the up-state hyperedge $e_i^{up}$ is incident with $V_1$ and the down-state hyperedge $e_i^{down}$ is incident with $V_2$. Note that $V_1 \cup V_2$ is a proper subset of $V$ if the expression levels of the gene in some of the samples are zero (basal). After the hypergraph is constructed, we define a function $y$ to assign initial labels to the corresponding vertices in the hypergraph. If a vertex $v$ is in the positive group, $y(v) = +1$; if it is in the negative group, $y(v) = -1$; and if $v$ is a test sample, $y(v) = 0$ (Figure 1B).

For cancer outcome prediction, our goal is to find the correct labels for the unlabeled vertices of the test samples in the hypergraph. Let $f$ be the objective function (vector) of labels to be learned. Intuitively, there are two criteria for learning the optimal $f$: 1) we want to assign the same label to vertices that share many incident hyperedges in common; 2) the assignment of the labels should be similar to the initial labeling $y$. For criterion 1), we define the following cost function,

$$\Omega(f, w) = \frac{1}{2} \sum_{e \in E} \frac{w(e)}{d(e)} \sum_{u, v \in e} \left( \frac{f(u)}{\sqrt{d(u)}} - \frac{f(v)}{\sqrt{d(v)}} \right)^2 \qquad (1)$$

If the predicted labels on the vertices are consistent with the incidences of the hyperedges, the value of $\Omega(f, w)$ is minimized. For criterion 2), we directly measure the squared 2-norm distance between the vectors of the predicted and the original labels as follows,

$$\|f - y\|^2 = \sum_{u \in V} (f(u) - y(u))^2$$
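The construction above can be checked numerically on toy data. The following sketch (with hypothetical expression states and labels, our own assumptions) builds the incidence matrix $H$ with two hyperedges per gene, evaluates $\Omega(f,w)$ by the double sum of equation (1), and compares it with the quadratic form $f^T \Delta f$ for $\Delta = I - D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}$ that Section 2.3 derives:

```python
import numpy as np

# Discretized expression states (hypothetical): rows = samples, cols = genes;
# +1 up, -1 down, 0 basal, as in Figure 1A.
S = np.array([[ 1, -1,  0],
              [ 1,  1, -1],
              [-1,  0,  1]])

n, g = S.shape
# Two hyperedges per gene: column 2i is the up-state, 2i+1 the down-state.
H = np.zeros((n, 2 * g))
for i in range(g):
    H[:, 2 * i]     = (S[:, i] ==  1)
    H[:, 2 * i + 1] = (S[:, i] == -1)

w = np.ones(2 * g)                  # uniform initial hyperedge weights
d_v = H @ w                         # vertex degrees d(v) = sum_e h(v,e) w(e)
d_e = H.sum(axis=0)                 # hyperedge degrees d(e)
f = np.array([1.0, 1.0, -1.0])      # some label vector

# Cost (1) by the explicit double sum over vertex pairs in each hyperedge.
omega = 0.0
for e in range(2 * g):
    verts = np.where(H[:, e] > 0)[0]
    for u in verts:
        for v in verts:
            omega += 0.5 * w[e] / d_e[e] * \
                (f[u] / np.sqrt(d_v[u]) - f[v] / np.sqrt(d_v[v])) ** 2

# The same cost as f^T Delta f with the normalized hypergraph Laplacian.
Dv_inv_sqrt = np.diag(1.0 / np.sqrt(d_v))
Theta = Dv_inv_sqrt @ H @ np.diag(w) @ np.diag(1.0 / d_e) @ H.T @ Dv_inv_sqrt
Delta = np.eye(n) - Theta
print(abs(omega - f @ Delta @ f))   # agreement up to floating point
```

The agreement relies on $d(v) = \sum_e h(v,e) w(e)$ holding, which is exactly what the degree constraints of Section 2.3 preserve during optimization.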
2.2. Using interactions as constraints
To introduce protein interactions as prior knowledge into the hypergraph-based learning, we assume that interacting genes should receive similar weights on their associated hyperedges. We define a binary indicator $\delta_{ij}$ to capture the interaction between a pair of hyperedges $e_i$ and $e_j$. The indicator $\delta_{ij} = 1$ if the two genes associated with $e_i$ and $e_j$ are within nearest distance $k$ in the protein-protein interaction network, and $0$ otherwise. Choosing a distance $k$ larger than 1 relaxes the definition of interaction between two genes by allowing indirect interactions through neighbors. When $k = 1$, it reduces to measuring the direct interaction between genes. In our experiments in this paper, we set $k = 2$. To assign weights to hyperedges consistent with the prior knowledge in the protein-protein interaction network, we define the following cost function over the hyperedge weights,

$$\Psi(w) = \frac{1}{2} \sum_{i,j=1}^{|E|} \delta_{i,j} \left( \frac{w(e_i)}{\sqrt{\sigma(e_i)}} - \frac{w(e_j)}{\sqrt{\sigma(e_j)}} \right)^2, \qquad (2)$$

where $\sigma(e_i) = \sum_{j=1}^{|E|} \delta_{i,j}$, which is the number of hyperedges interacting with the hyperedge $e_i$. Minimizing $\Psi(w)$ ensures that hyperedges associated with interacting genes will be similarly weighted. When there is no prior knowledge, we can simply set $\Psi(w) = \|w\|^2$.
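A minimal numerical check of equation (2) on a toy PPI network. The expansion of gene-level interactions to the two hyperedges of each gene, and the data themselves, are our own assumptions for illustration:

```python
import numpy as np

# Hypothetical PPI adjacency over 3 genes (gene 1 interacts with genes 0 and 2).
A_gene = np.array([[0, 1, 0],
                   [1, 0, 1],
                   [0, 1, 0]])

# Genes within nearest distance k=2: direct neighbors or a shared neighbor.
reach = A_gene + A_gene @ A_gene
within_k = ((reach > 0) & ~np.eye(3, dtype=bool)).astype(int)

# Each gene has an up- and a down-state hyperedge; here both hyperedges
# inherit the interactions of their gene (an assumed expansion).
delta = np.kron(within_k, np.ones((2, 2), dtype=int))

sigma = delta.sum(axis=1)           # sigma(e_i): # hyperedges interacting with e_i
w = np.array([1.0, 0.5, 2.0, 2.0, 0.3, 1.0])
m = len(w)

# Psi(w) by the double sum of equation (2).
psi = 0.0
for i in range(m):
    for j in range(m):
        if delta[i, j]:
            psi += 0.5 * (w[i] / np.sqrt(sigma[i]) - w[j] / np.sqrt(sigma[j])) ** 2

# Matrix form used later in equation (4): Psi(w) = w^T (I - S) w,
# with S = D^{-1/2} delta D^{-1/2}.
D_inv_sqrt = np.diag(1.0 / np.sqrt(sigma))
S = D_inv_sqrt @ delta @ D_inv_sqrt
print(abs(psi - w @ (np.eye(m) - S) @ w))   # ~0 up to floating point
```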
2.3. Optimization formulation
After the prior knowledge is introduced from a protein-protein interaction network, our task is to minimize the sum of the three cost terms defined as

$$\Phi(f, w) = \Omega(f, w) + \mu \|f - y\|^2 + \rho\, \Psi(w),$$

where $\mu$ and $\rho$ are positive real numbers. This objective can be achieved with the following optimization problem,

$$\begin{aligned} \min_{f,w}\quad & \Phi(f, w) \qquad\qquad (3)\\ \text{subject to}\quad & w(e) \geq 0 \quad \text{for } \forall e \in E\\ & \sum_{e \in E} h(v, e)\, w(e) = d(v) \quad \text{for } \forall v \in V. \end{aligned}$$

The intuition of adding $\sum_{e \in E} h(v, e)\, w(e) = d(v)$ as constraints is to maintain the hypergraph structure. The weighting of the hyperedges should not be biased towards some samples (such as training samples), and thus the degree of each vertex, the sum of the hyperedge weights on the vertex, should be kept the same as in the initial graph. Mathematically, these constraints also guarantee that the covariance matrix in $\Omega(f, w)$ is positive semidefinite with respect to $f$ [22], which makes our learning problem solvable.

Let $\Delta = I - D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}$, where $I$ is the identity matrix and $W$ is the diagonal matrix with $W_{ii} = w(e_i)$. We can show $\Omega(f, w) = f^T \Delta f$ by

$$\begin{aligned} \Omega(f, w) &= \sum_{e \in E} \sum_{u,v \in V} \frac{w(e)\, h(u,e)\, h(v,e)}{d(e)} \left( \frac{f^2(u)}{d(u)} - \frac{f(u) f(v)}{\sqrt{d(u)\, d(v)}} \right)\\ &= \sum_{e \in E} \sum_{u \in V} \frac{w(e)\, h(u,e)\, f^2(u)}{d(u)} \sum_{v \in V} \frac{h(v,e)}{d(e)} - \sum_{e \in E} \sum_{u,v \in V} \frac{w(e)\, h(u,e)\, h(v,e)}{d(e)} \frac{f(u) f(v)}{\sqrt{d(u)\, d(v)}}\\ &= \sum_{u \in V} f^2(u) \sum_{e \in E} \frac{w(e)\, h(u,e)}{d(u)} - \sum_{e \in E} \sum_{u,v \in V} \frac{f(u)\, w(e)\, h(u,e)\, h(v,e)\, f(v)}{\sqrt{d(u)\, d(v)}\, d(e)}\\ &= \sum_{u \in V} f^2(u) - \sum_{e \in E} \sum_{u,v \in V} \frac{f(u)\, w(e)\, h(u,e)\, h(v,e)\, f(v)}{\sqrt{d(u)\, d(v)}\, d(e)}. \end{aligned}$$
HyperGene$(y, H, A, \alpha, \rho)$
1   $t = 0$, $w^0 = \mathbf{1}$, $f^0 = y$, $c^0 = +\infty$
2   do
3       $t = t + 1$
4       Use the Jacobi iteration method to find the optimal $f^t$:
            $f^t = (I - \alpha D_v^{-1/2} H W^{t-1} D_e^{-1} H^T D_v^{-1/2})^{-1} y$
5       Use quadratic programming to find the optimal $w^t$:
            $w^t = \arg\min_w \Omega(f = f^{t-1}, w) + \rho\, \Psi(w)$
            subject to $Hw = \mathrm{diag}(D_v)$ and $\mathrm{diag}(W) \geq 0$
6       $c^t = \Omega(f^t, w^t) + \mu \|f^t - y\|^2 + \rho\, \Psi(w^t)$
7   while $(c^{t-1} - c^t > \pi)$   // stop iterating when the decrease of the cost function falls below the threshold $\pi$
8   return $(f^t, w^t)$

Figure 2. The HyperGene algorithm.
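Line 4 solves a linear system; an equivalent view is as the fixed point of Jacobi-style iterations $f \leftarrow \alpha \Theta f + y$ with $\Theta = D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}$, which converge whenever the spectral radius of $\alpha \Theta$ is below 1. A sketch on hypothetical random data (the incidence matrix and labels below are placeholders, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random incidence matrix and weights standing in for the hypergraph.
H = (rng.random((6, 8)) > 0.5).astype(float)
H[H.sum(axis=1) == 0, 0] = 1.0      # ensure every vertex touches some edge
w = rng.random(8) + 0.1
d_v = H @ w                          # vertex degrees under these weights
d_e = H.sum(axis=0)
d_e[d_e == 0] = 1.0                  # guard against empty hyperedges

Dv_inv_sqrt = np.diag(1.0 / np.sqrt(d_v))
Theta = Dv_inv_sqrt @ H @ np.diag(w) @ np.diag(1.0 / d_e) @ H.T @ Dv_inv_sqrt

alpha = 0.5
y = np.array([1.0, -1.0, 0.0, 0.0, 1.0, 0.0])   # +1/-1 training, 0 test

# Closed form of line 4: f = (I - alpha * Theta)^{-1} y.
f_closed = np.linalg.solve(np.eye(6) - alpha * Theta, y)

# Jacobi-style fixed point: f <- alpha * Theta f + y. Since Delta = I - Theta
# is PSD, Theta has eigenvalues in [0, 1] and alpha*Theta is a contraction.
f = y.copy()
for _ in range(200):
    f = alpha * Theta @ f + y
print(np.allclose(f, f_closed))
```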
Step three in the above derivation shows that $\Omega(f, w) = f^T \Delta f$ if and only if $\sum_{e \in E} \frac{w(e)\, h(u,e)}{d(u)} = 1$. The constraints $\sum_{e \in E} h(v,e)\, w(e) = d(v)$ for $\forall v \in V$ in equation (3) keep $D_v$ unchanged during the optimization and thus make $\Delta$ always positive semidefinite. Finally, let $A$ be the adjacency matrix defined on the protein-protein interaction network with $A_{ij} = \delta_{ij}$, where $i$ and $j$ are the indexes of hyperedges, and let $D$ be the diagonal matrix with $D_{ii} = \sum_j A_{ij}$; the optimization problem in equation (3) can then be written in the following matrix form,

$$\begin{aligned} \min_{f,w}\quad & f^T \Delta f + \mu \|f - y\|^2 + \rho\, w^T (I - S)\, w \qquad (4)\\ \text{subject to}\quad & Hw = \mathrm{diag}(D_v)\\ & \mathrm{diag}(W) \geq 0, \end{aligned}$$

where $S = D^{-1/2} A D^{-1/2}$.
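Since $\Omega(f, w)$ is linear in $w$ for a fixed $f$ (each $w(e)$ multiplies a coefficient determined by $f$), the $w$-step of equation (4) is a quadratic program in $w$. As a simplified sketch, we drop the nonnegativity constraint and solve only the equality-constrained problem through its KKT linear system; the linear term $c$ and all data below are hypothetical placeholders, and a full solver would also enforce $\mathrm{diag}(W) \geq 0$:

```python
import numpy as np

rng = np.random.default_rng(1)

n, m = 5, 8                               # samples, hyperedges (toy sizes)
H = (rng.random((n, m)) > 0.4).astype(float)
d_v = H @ np.ones(m)                      # degrees under initial unit weights
c = rng.standard_normal(m)                # placeholder coefficients of w in Omega(f, w)

A = (rng.random((m, m)) > 0.6).astype(float)
A = np.triu(A, 1); A = A + A.T            # symmetric hyperedge interaction indicator
sigma = np.maximum(A.sum(axis=1), 1.0)
S = np.diag(1 / np.sqrt(sigma)) @ A @ np.diag(1 / np.sqrt(sigma))

rho, eps = 1.0, 1e-8
Q = 2 * rho * (np.eye(m) - S) + eps * np.eye(m)   # small ridge for stability

# KKT system for: min_w c^T w + (1/2) w^T Q w  subject to  H w = d_v.
KKT = np.block([[Q, H.T],
                [H, np.zeros((n, n))]])
rhs = np.concatenate([-c, d_v])
# lstsq tolerates a rank-deficient H; the system is consistent because
# d_v lies in the range of H by construction.
sol = np.linalg.lstsq(KKT, rhs, rcond=None)[0]
w = sol[:m]
print(np.allclose(H @ w, d_v, atol=1e-6))  # equality constraints satisfied
```

The actual algorithm uses a QP solver that additionally enforces the nonnegativity of the weights; this sketch only illustrates how the degree constraints shape the feasible set.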
3. The HyperGene Algorithm
The objective function $\Phi(f, w)$ in the optimization problem defined by equation (3) is not convex in $(f, w)$. However, our formulation contains two sub-optimization problems, both of which are convex if we independently optimize $\Phi(f, w)$ with respect to $f$ or $w$. Specifically, if we fix $w$ to be a specific weighting $w^t$ satisfying the constraints