Visualize & cluster one-mode projection of a bi-partite graph


A lot of genomic knowledge are often curated in the form of genesets where a combination of genes are involved in certain biological process/function/pathway. In other words, a pathway is often enriched with different combination of genes. Similarly this can also be related to patient-genesets data structure.

This can be thought of as a bipartite graph representing relation between individual pathway to a gene. Often bioinformaticians are interested to visualize if there is any kind of relation between the pathways. This problem can be modeled as the conversion of two-mode network (bipartite graph) to one-mode network of pathways and visualize the resulting one-mode network.

# Load required libraries
library("stringr")
library("igraph")
library("RColorBrewer")
library("gplots")
library("cluster")

jColFun <- colorRampPalette(brewer.pal(n = 9, "YlOrRd"))

# Load input data file
file.dat <- url("https://dl.dropboxusercontent.com/u/30823824/dat_genesets.txt")
dat <- read.table(file.dat, sep = "\t", header = T, as.is = c("Pathway", "Genesets"))
str(dat)

## 'data.frame':    244 obs. of  4 variables:
##  $ Pathway : chr  "P-1" "P-2" "P-3" "P-4" ...
##  $ pvalue  : num  6.31e-12 8.51e-10 4.90e-08 6.03e-08 1.55e-07 ...
##  $ BHpvalue: num  1.58e-09 1.07e-07 3.72e-06 3.72e-06 7.59e-06 ...
##  $ Genesets: chr  "Gene-1,Gene-2,Gene-3,Gene-4,Gene-5,Gene-6" ...

### Filter data using some p-value cutoff. 
### (pvalue is for the significance of the geneset enrichment))
threshold <- 0.005
dat.sub <- subset(dat, dat$BHpvalue <= threshold)

pathways <- dat.sub$Pathway
genes <- sort(unique(unlist(str_split(dat.sub$Genesets, ","))), decreasing = F)

# Compute bipartite graph
bip <- matrix(0, nrow = length(pathways), ncol = length(genes))
colnames(bip) <- genes
rownames(bip) <- pathways

list.genes <- str_split(dat.sub$Genesets, ",")

list.gene.index <- lapply(list.genes, function(x) which(colnames(bip) %in% x))

for (i in 1:length(list.gene.index)) {
    bip[i, list.gene.index[[i]]] <- 1
}

dat.sub$TotalGenes <- unlist(lapply(list.gene.index, length))
dat.sub$Color <- rev(jColFun(nrow(dat.sub)))

heatmap.2(bip, 
    col = jColFun(256), 
    Colv = TRUE, Rowv = TRUE, 
    dendrogram = "both", 
    trace = "none", 
    hclustfun = function(x) hclust(x, method = "ward"), 
    distfun = function(x) dist(x, method = "binary"), 
    colsep = c(1:25), 
    rowsep = c(1:25), 
    sepcolor = "white", 
    sepwidth = c(0.05, 0.05), 
    key = "FALSE", 
    cexRow = 0.8, cexCol = 0.8)
bip_1

Heatmap visualization of the bi-partite graph. The red color indicates association of a gene with a pathway and the yellow color indicates the absence of association.

# Compute Distance Matrix
dat.dist <- dist(bip, method = "binary", diag = TRUE, upper = TRUE)
plot(hclust(dat.dist, method = "ward"), hang = 0.1)
bip_2

Dendogram of the pathway.

 

Compute Connection Specific Index (CSI)

In a bi-partite graph, the Y-type hub nodes may confer artificially high levels of interaction-profile similarity for the X-type nodes which may not be very informative and create biases when clustering the data (Green et al., 2011; Fuxman Bass et al., 2013). The connection specificity index (CSI) provides a context-dependent measure that mitigates the effect of nonspecific interactions by ranking the significance of similarity between two X-type nodes according to the specificity of their shared interaction partners (Green et al., 2011).

A Bi-Partite graph with two distinct set of nodes X and Y. The X-type nodes represents Pathways/Patient information and Y-type node represents the genesets.

A Bi-Partite graph with two distinct set of nodes X and Y. The X-type nodes represents Pathways/Patient information and Y-type node represents the genesets. (Figure source: Fauxman Bass et al.,2013)

### First compute pearson correlation
dat.pcc <- cor(t(bip), method = "pearson")

### Function to compute CSI: Connection Specific Index ###
get.csi <- function(dat) {
    mat <- matrix(0, nrow = nrow(dat), ncol = ncol(dat))
    colnames(mat) <- colnames(dat)
    rownames(mat) <- rownames(dat)

    for (i in 1:nrow(dat)) {
        a <- rownames(dat)[i]

        for (j in 1:ncol(dat)) {
            b <- colnames(dat)[j]
            pcc.ab <- dat[a, b] - 0.05

            conn.pairs.a <- colnames(dat)[which(dat[a, ] >= pcc.ab)]
            conn.pairs.b <- rownames(dat)[which(dat[, b] >= pcc.ab)]

            conn.pairs.ab <- length(union(conn.pairs.a, conn.pairs.b))
            n <- nrow(dat)

            csi <- 1 - (conn.pairs.ab/n)
            mat[i, j] <- csi
        }
    }
    return(mat)
}

dat.csi <- get.csi(dat.pcc)

heatmap.2(as.matrix(dat.csi), 
    col = jColFun(256), 
    Colv = TRUE, Rowv = TRUE, 
    dendrogram = "both", 
    trace = "none", 
    hclustfun = function(x) hclust(x, method = "ward"), 
    distfun = function(x) dist(x, method = "minkowski", p = 1), 
    colsep = c(1:25), 
    rowsep = c(1:25), 
    sepcolor = "white", 
    sepwidth = c(0.05, 0.05), 
    key = "FALSE", 
    cexRow = 0.8, cexCol = 0.8, 
    main = "CSI")
bip_3

Heatmap visualization of the CSI index correlation between the pathways. More heat means high correlation.

 

Visualizing the network

Here I’ve used igraph R package to visualize the network of pathways. The correlation matrix computed using the CSI index serves as an adjacency matrix for the graph and the value in the matrix is converted as a weight to the corresponding edge in the graph. (For more on the topic: Working with Bipartite/Affiliation Network Data in R)

# Convert the adjacency (correlation) matrix to an igraph object
g <- graph.adjacency(dat.csi, mode = "undirected", weighted = TRUE, diag = FALSE)
g <- simplify(g, edge.attr.comb = list(weight = "sum"))

# get the node size ratio
for (ctr in 1:length(V(g)$name)) {
    index <- which(dat.sub$Pathway == V(g)$name[ctr])
    V(g)$size[ctr] <- dat.sub$TotalGenes[index]
    V(g)$color[ctr] <- dat.sub$Color[index]
}
V(g)$size <- V(g)$size * 3

# Set vertex attributes
V(g)$label <- V(g)$name
V(g)$label.color <- "black"
V(g)$label.cex <- 0.8
V(g)$label.family <- "Helvetica"
V(g)$frame.color <- "grey"

#Set edge attributes
E(g)$color <- rgb(0.6, 0.6, 0, E(g)$weight)
E(g)$width <- E(g)$weight * 2

plot(g, layout = layout.kamada.kawai)
bip_4

Network visualization of the pathway correlation. Each nodes represent a pathway the size of which is proportional to the number of genes in the pathway. More significant pathway receives more heat. The width of the edge is proportional to the correlation value.

 

Using PAM Clustering with k=2

samplecenters <- 2
pr.pam <- pam(dat.csi, metric = "euclidean", k = samplecenters)

clusplot(pr.pam, shade = FALSE, labels = 2, cex.txt = 0.7, col.p = "black", 
    col.txt = "black", cex = 1.25, color = TRUE, plotchar = FALSE, span = FALSE, 
    pch = 19, main = "PAM Cluster")
bip_5

PAM clustering of the pathway.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s