Learning on Weighted Hypergraphs to Integrate Protein Interactions and Gene Expressions for Cancer Outcome Prediction

TaeHyun Hwang∗, Ze Tian∗, and Rui Kuang†
Department of Computer Science and Engineering, University of Minnesota Twin Cities
thwang, tianze, kuang@cs.umn.edu

Jean-Pierre Kocher
Bioinformatics Core, Mayo Clinic College of Medicine
Kocher.JeanPierre@mayo.edu

∗ The first two authors contributed equally to this work.
† Author to whom correspondence should be addressed.

Abstract

Building reliable predictive models from multiple complementary genomic data for cancer study is a crucial step towards successful cancer treatment and a full understanding of the underlying biological principles. To tackle this challenging data integration problem, we propose a hypergraph-based learning algorithm called HyperGene to integrate microarray gene expressions and protein-protein interactions for cancer outcome prediction and biomarker identification. HyperGene is a robust two-step iterative method that alternately finds the optimal outcome prediction and the optimal weighting of the marker genes, guided by a protein-protein interaction network. Under the hypothesis that cancer-related genes tend to interact with each other, the HyperGene algorithm uses a protein-protein interaction network as prior knowledge by imposing a consistent weighting of interacting genes. Our experimental results on two large-scale breast cancer gene expression datasets show that HyperGene utilizing a curated protein-protein interaction network achieves significantly improved cancer outcome prediction. Moreover, HyperGene can also retrieve many known cancer genes as highly weighted marker genes.

1. Introduction

Finding gene predictors of cancer outcome from genomic data is becoming an increasingly important focus in cancer research, under the assumption that genomic information can shed light on the molecular mechanisms underlying cancer development and progression. In the past decade, an enormous number of large-scale microarray gene expression profiles have been produced to study different cancers such as breast cancer [18, 19], lung cancer [16] and prostate cancer [7], for the purposes of 1) detecting marker genes for cancer-relevant phenotypes and 2) building reliable predictive models for cancer prognosis or diagnosis. The two tasks are closely intertwined: on one hand, a predictive model built from highly predictive marker genes is often more accurate in outcome prediction; on the other hand, a highly accurate prediction model can also be analyzed to reveal unknown cancer marker genes. Different machine learning and data mining strategies for feature selection have been applied to identify a subset of genes that maximizes the prediction performance of a classifier [18].

Although many interesting and promising findings have been reported in these studies, their reliability has been questioned over concerns about unstable and inconsistent results in cross-validations and cross-platform comparisons, due to the relatively small sample sizes in the studies [6]. To overcome this difficulty, it has been proposed to include other complementary genomic information, such as pathway information or functional annotations, to aid the process of model building and biomarker discovery, such that the prior knowledge from the complementary data can improve the robustness of the model and result in more consistent discoveries across independent datasets [4, 3, 13].
The availability of large protein-protein interaction networks, which contain information on gene functions, pathways and the modularity of gene regulation, provides a desirable source of data for this purpose. Protein-protein interactions can be derived from a number of experimental techniques such as the yeast two-hybrid system and mass spectrometry [11]. The high consistency between the networks derived from different organisms allows the integration of many small networks into a large-scale network. It has been observed that cancer genes tend to be highly connected with each other in large-scale protein-protein interaction networks [3]. It has been shown in [4] that by incorporating a protein-protein interaction network into the model built from microarray gene expressions, the authors can improve cancer outcome prediction and obtain more reproducible results on two large-scale gene expression datasets. In their approach, the integration of gene expressions and protein-protein interactions is achieved by two independent procedures: discriminative subnetworks are first identified from a curated protein-protein interaction network, and the subnetworks are then used as features to predict cancer metastasis. The authors of [13] proposed a method which first computes the spectral graph structure of a gene network and then uses this structure to smooth microarray gene expressions before they are used for sample classification. A statistics-based method is proposed in [3] to identify cancer genes by scoring genes by their degree in a cancer-specific interaction network, their differential expression in microarray data, and their structural, functional and evolutionary properties.

[Figure 1. Regularization framework of HyperGene. (A) Discretized gene expression profiles: each sample is labeled +1/−1/0 and each gene takes one of three states (up/basal/down) per sample. (B) Hypergraph built from the gene expression profiles, with samples as vertices and up/down expression states as hyperedges. (C) Protein-protein interaction network. Three cost terms are illustrated: Cost 1, highly connected samples should have the same label; Cost 2, supervised learning, the prediction should be consistent with the initial labeling; Cost 3, genes that interact with each other should have similar weights.]
However, designing a unified strategy for integrating protein-protein interactions and microarray gene expressions is still a challenging problem due to the complexity of joint learning on two different data types.

In this paper, we propose a hypergraph-based iterative learning algorithm called HyperGene to integrate microarray gene expressions with protein-protein interactions for robust cancer outcome prediction and marker gene identification. The HyperGene algorithm minimizes a cost function under a unified regularization framework which elegantly takes a protein-protein interaction network as constraints on a hypergraph built from microarray gene expressions. The HyperGene algorithm is a natural extension of label propagation algorithms on hypergraphs [2, 1, 22]. HyperGene is based on a hypergraph in which each sample is denoted by a vertex and each gene is denoted by two hyperedges: an "up-regulated" hyperedge and a "down-regulated" hyperedge. The two hyperedges group samples by the expression state (up/down) of the gene in the samples (Figure 1A & B). Our cluster assumption on the hypergraph is that samples of the same type tend to have similar gene expression patterns and are thus highly connected by the hyperedges. Since the original hypergraph-based learning algorithms assume uniform weighting of the hyperedges [1, 22], direct application of these algorithms to high-dimensional and noisy genomic data results in inferior prediction accuracy. The HyperGene algorithm is fundamentally different in that it reformulates the optimization problem as learning labels and hyperedge weights together, with the assignment of edge weights constrained by a protein-protein interaction network. Essentially, to avoid overfitting the training data, the HyperGene algorithm tries to find a weighting of hyperedges that balances the two-class separation on the hypergraph against consistency with the protein-protein interaction network. These properties of the HyperGene algorithm promise to improve prediction accuracy and provide more robust identification of marker genes. Furthermore, the resulting weights on the genes can be used to discover highly weighted subnetworks in the protein-protein interaction network, which might also suggest important pathways related to cancer outcomes.

2. Regularization Framework

Figure 1 shows the regularization framework of our formulation. We first discretize the gene expression profiles into three states, basal or up/down-regulated (Figure 1A), and build a hypergraph with the (positive/negative/test) samples as vertices and the gene expression states as hyperedges (Figure 1B). The regularization framework seeks a global solution to both outcome prediction and gene weighting by considering the connectivities in the hypergraph, and the incorporation of the protein-protein interaction network provides useful prior knowledge for weighting interacting genes with similar values (Figure 1C). The cost function is defined on three loss terms: 1) inconsistent labeling of samples that are highly connected in the hypergraph; 2) inconsistent labeling of training samples with known outcomes; 3) inconsistent weighting of the hyperedges associated with interacting genes in the protein-protein interaction network. Our objective is to find a solution that minimizes the weighted sum of the three loss terms.
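To make the discretization step concrete, the sketch below maps a samples-by-genes expression matrix to the three states. The paper does not specify the discretization rule, so the per-gene z-score normalization and the cutoff of 1.0 used here are our assumptions, not the authors' procedure.

import numpy as np

def discretize_expression(X, cutoff=1.0):
    """Discretize a samples-by-genes expression matrix into three
    states: +1 (up-regulated), -1 (down-regulated), 0 (basal).
    The per-gene z-score thresholding is an illustrative assumption;
    the paper only specifies the three states."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)  # z-score each gene
    S = np.zeros_like(Z, dtype=int)
    S[Z > cutoff] = 1     # up-regulated
    S[Z < -cutoff] = -1   # down-regulated
    return S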
2.1. Learning on weighted hypergraphs

A hypergraph is a special graph which contains hyperedges. In a simple graph, each edge connects a pair of vertices, but in a hypergraph each hyperedge can connect an arbitrary number of vertices. Hypergraphs are often used with algorithms for exploring higher-order correlations between objects in data mining and bioinformatics [17, 20, 21]. Let V = {v_1, v_2, ..., v_{|V|}} be a set of vertices and E = {e_1, e_2, ..., e_{|E|}} be a set of hyperedges defined on V: any hyperedge e ∈ E is a subset {v^{(e)}_1, v^{(e)}_2, ..., v^{(e)}_{|e|}} of V. A hyperedge e and a vertex v are called incident if v ∈ e. A non-negative real number (a weight) can be assigned to each hyperedge by a function w (w can also be defined as a vector variable, and we use both notations interchangeably). The vertex set V, the hyperedge set E and the weight function w fully define a weighted hypergraph, denoted G(V, E, w). The incidence matrix H of G(V, E, w) is a |V| × |E| matrix with elements h(v, e) = 1 when v ∈ e and 0 otherwise. The degree of a vertex v is defined as d(v) = Σ_{e∈E} h(v, e) w(e), the sum of the weights of the hyperedges incident with v. The degree of a hyperedge e is defined as d(e) = |{v | v ∈ e}|, the number of vertices incident with e. Finally, we define W as the diagonal matrix whose diagonal elements are the hyperedge weights, and D_v and D_e as the diagonal matrices whose diagonal elements are the degrees of the vertices and hyperedges (the row sums of HW and the column sums of H, respectively). Note that for a hyperedge i, D_e(i, i) = d(i). However, for a vertex j, D_v(j, j) = d(j) if and only if Σ_{e∈E} h(j, e) w(e) = d(j).

We use a weighted hypergraph G(V, E, w) to model the gene expression data: each sample is denoted by a vertex v ∈ V, and each hyperedge denotes one of the two expression states (up/down-regulated) of a gene (Figure 1A). Thus, each gene is associated with two hyperedges in the hypergraph. The incidences between V and E are determined by the gene expression values of the samples. If the expression value of a gene i is positive for sample set V_1 and negative for sample set V_2, the up-state hyperedge e_i^{up} is incident with V_1 and the down-state hyperedge e_i^{down} is incident with V_2. Note that V_1 ∪ V_2 is a proper subset of V if the expression levels of the gene in some of the samples are zero (basal). After the hypergraph is constructed, we define a function y to assign initial labels to the corresponding vertices: y(v) = +1 if vertex v is in the positive group, y(v) = −1 if it is in the negative group, and y(v) = 0 if v is a test sample (Figure 1B).

For cancer outcome prediction, our goal is to find the correct labels for the unlabeled vertices of the test samples in the hypergraph. Let f be the objective function (vector) of labels to be learned. Intuitively, there are two criteria for learning an optimal f: 1) vertices that share many incident hyperedges should be assigned the same label; 2) the assignment of labels should be similar to the initial labeling y. For criterion 1), we define the following cost function:

    \Omega(f,w) = \frac{1}{2} \sum_{e \in E} \frac{w(e)}{d(e)} \sum_{u,v \in e} \left( \frac{f(u)}{\sqrt{d(u)}} - \frac{f(v)}{\sqrt{d(v)}} \right)^2    (1)

If the predicted labels on the vertices are consistent with the incidences of the hyperedges, the value of Ω(f, w) is minimized.
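Following these definitions, a minimal sketch of the hypergraph construction and of evaluating equation (1) through its matrix form f^T (I − Θ) f (the identity is derived in Section 2.3) might look as follows; the function and variable names are ours.

import numpy as np

def build_incidence(S):
    """Incidence matrix H (|V| x |E|) from a discretized
    samples-by-genes matrix S: each gene contributes an up-state
    hyperedge (samples with state +1) and a down-state hyperedge
    (samples with state -1)."""
    H = np.hstack([S == 1, S == -1]).astype(float)
    return H[:, H.sum(axis=0) > 0]  # drop hyperedges with no incident vertex

def omega(f, H, w):
    """Cost of equation (1), computed as f^T (I - Theta) f with
    Theta = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} (see Section 2.3).
    Assumes every vertex has positive degree d(v) = sum_e h(v,e) w(e)."""
    Dv = H @ w                    # vertex degrees under weights w
    De = H.sum(axis=0)            # hyperedge degrees
    Dv_is = 1.0 / np.sqrt(Dv)     # diagonal of Dv^{-1/2}
    Theta = (Dv_is[:, None] * (H * (w / De))) @ (H.T * Dv_is[None, :])
    return float(f @ f - f @ (Theta @ f))

For example, with S = discretize_expression(X) from the sketch above, H = build_incidence(S) and uniform weights w = np.ones(H.shape[1]), omega(f, H, w) evaluates the cost of equation (1) on the unweighted hypergraph.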
For criterion 2), we directly measure the squared 2-norm distance between the predicted and the original label vectors:

    \|f - y\|^2 = \sum_{u \in V} (f(u) - y(u))^2

2.2. Using interactions as constraints

To introduce protein interactions as prior knowledge into the hypergraph-based learning, we assume that interacting genes should receive similar weights on their associated hyperedges. We define a binary indicator δ_ij to capture the interaction between a pair of hyperedges e_i and e_j: δ_ij = 1 if the two genes associated with e_i and e_j are within the nearest distance k in the protein-protein interaction network, and δ_ij = 0 otherwise. Picking a distance k larger than 1 relaxes the definition of interaction between two genes by allowing indirect interactions through neighbors; when k = 1, it reduces to measuring direct interaction between genes. In the experiments in this paper, we set k = 2. To assign weights to hyperedges consistent with the prior knowledge in the protein-protein interaction network, we define the following cost function over the hyperedge weights:

    \Psi(w) = \frac{1}{2} \sum_{i,j=1}^{|E|} \delta_{ij} \left( \frac{w(e_i)}{\sqrt{\sigma(e_i)}} - \frac{w(e_j)}{\sqrt{\sigma(e_j)}} \right)^2,    (2)

where σ(e_i) = Σ_{j=1}^{|E|} δ_ij is the number of hyperedges interacting with hyperedge e_i. Minimizing Ψ(w) ensures that hyperedges associated with interacting genes are weighted similarly. When there is no prior knowledge, we can simply set Ψ(w) = ||w||².
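Below is a sketch of the interaction indicator δ and of the penalty of equation (2) in the matrix form w^T (I − S) w used later in equation (4). The use of networkx, the handling of genes missing from the network, and the reading of "within distance k" (under which the up- and down-state hyperedges of the same gene, at distance 0, count as interacting) are our interpretations.

import numpy as np
import networkx as nx

def interaction_matrix(genes, ppi, k=2):
    """Binary matrix A with A[i, j] = delta_ij = 1 if the genes behind
    hyperedges e_i and e_j are within shortest-path distance k in the
    PPI network ppi (k = 2 in the paper's experiments). genes[i] names
    the gene of hyperedge e_i; both hyperedges of a gene share its
    network position. Genes absent from the network get no interactions."""
    dist = dict(nx.all_pairs_shortest_path_length(ppi, cutoff=k))
    m = len(genes)
    A = np.zeros((m, m))
    for i, gi in enumerate(genes):
        for j, gj in enumerate(genes):
            if i != j and gj in dist.get(gi, {}):
                A[i, j] = 1.0
    return A

def psi(w, A):
    """Penalty Psi(w) = w^T (I - S) w with S = D^{-1/2} A D^{-1/2} and
    D_ii = sigma(e_i) = sum_j A_ij. Hyperedges without interactions
    contribute nothing, matching the sum in equation (2)."""
    d = A.sum(axis=1)
    d_is = np.where(d > 0, 1.0 / np.sqrt(np.maximum(d, 1.0)), 0.0)
    S = d_is[:, None] * A * d_is[None, :]
    return float((w * (d > 0)) @ w - w @ (S @ w))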
2.3. Optimization formulation

After the prior knowledge is introduced from the protein-protein interaction network, our task is to minimize the sum of the three cost terms, defined as

    \Phi(f,w) = \Omega(f,w) + \mu \|f - y\|^2 + \rho \Psi(w),

where μ and ρ are positive real numbers. This objective can be achieved with the following optimization problem:

    minimize_{f,w}  \Phi(f,w)    (3)
    subject to      w(e) \ge 0 \ \text{for all} \ e \in E,
                    \sum_{e \in E} h(v,e)\, w(e) = d(v) \ \text{for all} \ v \in V.

The intuition behind adding Σ_{e∈E} h(v, e) w(e) = d(v) as constraints is to maintain the hypergraph structure: the weighting of the hyperedges should not be biased towards some samples (such as the training samples), and thus the degree of each vertex, the sum of the hyperedge weights on the vertex, should be kept the same as in the initial graph. Mathematically, these constraints also guarantee that the covariance matrix in Ω(f, w) is positive semi-definite with respect to f [22], which makes our learning problem solvable.

Let Δ = I − D_v^{−1/2} H W D_e^{−1} H^T D_v^{−1/2}, where I is the identity matrix and W is the diagonal matrix with W_ii = w(e_i). We can show Ω(f, w) = f^T Δ f by

    \Omega(f,w) = \sum_{e \in E} \sum_{u,v \in V} \frac{w(e)\, h(u,e)\, h(v,e)}{d(e)} \left( \frac{f^2(u)}{d(u)} - \frac{f(u) f(v)}{\sqrt{d(u) d(v)}} \right)
    = \sum_{e \in E} \sum_{u \in V} \frac{w(e)\, h(u,e)\, f^2(u)}{d(u)} \sum_{v \in V} \frac{h(v,e)}{d(e)} - \sum_{e \in E} \sum_{u,v \in V} \frac{w(e)\, h(u,e)\, h(v,e)}{d(e)} \cdot \frac{f(u) f(v)}{\sqrt{d(u) d(v)}}
    = \sum_{u \in V} f^2(u) \sum_{e \in E} \frac{w(e)\, h(u,e)}{d(u)} - \sum_{e \in E} \sum_{u,v \in V} \frac{f(u)\, w(e)\, h(u,e)\, h(v,e)\, f(v)}{d(e) \sqrt{d(u) d(v)}}
    = \sum_{u \in V} f^2(u) - \sum_{e \in E} \sum_{u,v \in V} \frac{f(u)\, w(e)\, h(u,e)\, h(v,e)\, f(v)}{d(e) \sqrt{d(u) d(v)}}.

The third step in the above derivation shows that Ω(f, w) = f^T Δ f if and only if Σ_{e∈E} w(e) h(u, e)/d(u) = 1. The constraints Σ_{e∈E} h(v, e) w(e) = d(v) for all v ∈ V in equation (3) keep D_v unchanged during the optimization and thus make Δ always positive semi-definite. Finally, let A be the adjacency matrix defined on the protein-protein interaction network, with A_ij = δ_ij where i and j are the indexes of hyperedges, and let D be the diagonal matrix with D_ii = Σ_j A_ij. The optimization problem in equation (3) can then be written in the following matrix form:

    minimize_{f,w}  f^T \Delta f + \mu \|f - y\|^2 + \rho\, w^T (I - S) w    (4)
    subject to      Hw = \mathrm{diag}(D_v),
                    \mathrm{diag}(W) \succeq 0,

where S = D^{−1/2} A D^{−1/2}.

Figure 2. The HyperGene algorithm.

    HyperGene(y, H, A, α, μ, ρ, π)
    1  t = 0, w_0 = 1, f_0 = y, c_0 = +∞
    2  do
    3      t = t + 1
    4      Use the Jacobi iteration method to find the optimal f_t:
               f_t = (I − α D_v^{−1/2} H W_{t−1} D_e^{−1} H^T D_v^{−1/2})^{−1} y
    5      Use quadratic programming to find the optimal w_t:
               w_t = argmin_w Ω(f = f_{t−1}, w) + ρ Ψ(w)
               subject to Hw = diag(D_v) and diag(W) ⪰ 0
    6      c_t = Ω(f_t, w_t) + μ ||f_t − y||² + ρ Ψ(w_t)
    7  while (c_{t−1} − c_t > π)   // stop iterating when the decrease of the cost function is smaller than the threshold π
    8  return (f_t, w_t)

3. The HyperGene Algorithm

The objective function Φ(f, w) in the optimization problem defined by equation (3) is not convex in (f, w). However, our formulation contains two sub-optimization problems, both of which are convex if we independently optimize Φ(f, w) with respect to f or w. Specifically, if we fix w to be a specific weighting w_t satisfying the constraints
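The alternating scheme of Figure 2 can be sketched as follows, reusing the quantities defined above. This is a minimal illustration, not the authors' implementation: a dense linear solve stands in for the Jacobi iteration, scipy's SLSQP routine stands in for a dedicated quadratic-programming solver, and the default values of alpha, mu, rho and pi are our guesses.

import numpy as np
from scipy.optimize import minimize

def hypergene(y, H, A, alpha=0.9, mu=1.0, rho=1.0, pi=1e-4, max_iter=30):
    """Alternating optimization of equation (4) following Figure 2."""
    n, m = H.shape
    w = np.ones(m)
    Dv0 = H @ w              # initial vertex degrees, fixed by Hw = diag(Dv)
    De = H.sum(axis=0)
    Dv_is = 1.0 / np.sqrt(Dv0)
    d = A.sum(axis=1)
    S = A / np.sqrt(np.maximum(np.outer(d, d), 1.0))

    def theta(w_):           # Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2}
        return (Dv_is[:, None] * (H * (w_ / De))) @ (H.T * Dv_is[None, :])

    def cost(f_, w_):        # Omega + mu ||f - y||^2 + rho Psi
        Om = f_ @ f_ - f_ @ (theta(w_) @ f_)
        Ps = (w_ * (d > 0)) @ w_ - w_ @ (S @ w_)
        return Om + mu * (f_ - y) @ (f_ - y) + rho * Ps

    c_prev = np.inf
    for _ in range(max_iter):
        # Step 4: label propagation with the current weights
        # (closed-form solve in place of Jacobi iteration).
        f = np.linalg.solve(np.eye(n) - alpha * theta(w), y)
        # Step 5: quadratic program over w under the degree constraints
        # (Figure 2 writes Omega(f = f_{t-1}, w); we pass the fresh f).
        res = minimize(lambda w_: cost(f, w_), w, method="SLSQP",
                       bounds=[(0.0, None)] * m,
                       constraints=[{"type": "eq",
                                     "fun": lambda w_: H @ w_ - Dv0}])
        w = res.x
        c = cost(f, w)                 # Step 6: total cost
        if c_prev - c <= pi:           # Step 7: stop on a small decrease
            break
        c_prev = c
    return f, w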