Massive Data Analysis -- Network Clustering
Date: 2020/01/04 (Sat)
104020002 IPCS Ming-Yeh Chou
Methods
- To cluster several undirected graphs (networks), I'll cover two algorithms:
- Personal PageRank
- Louvain Algorithm
- [UPDATED]: I have both the full MapReduce versions & the partial ones:
```
# partial MapReduce versions
# (faster, but with native random access structures)
$ python -u PPR_MR.py
$ python -u louvain_MR.py

# full MapReduce versions
# (very slow, only the small testcase is recommended)
$ python -u PPR_MR_full.py
$ python -u louvain_MR_full.py
```
Personal PageRank
Process Flow
- do PPR for one random seed node, and sort all nodes by rank in decreasing order.
- Personal PageRank: $R(v) = \begin{cases} \alpha \sum_{u \in U} \frac{R(u)}{\mathrm{out}(u)}, & \text{if } v \neq s \\ \alpha \sum_{u \in U} \frac{R(u)}{\mathrm{out}(u)} + (1-\alpha), & \text{if } v = s \end{cases}$, where $U = \{u \mid u \to v\}$, and $s$ is the seed node.
- That is, with probability $(1-\alpha)$ the rank jumps back to the seed node.
- in the decreasing order of PPR, compute the cumulative conductance, and FIND a GOOD CLUSTER at its local minima.
- Conductance: $\phi(A) = \frac{|\{(i,j) \in E \,;\, i \in A,\ j \notin A\}|}{\min(\mathrm{vol}(A),\ 2m - \mathrm{vol}(A))}$
- In linear time while traversing, we only compute the conductance update when adding a node $v$ into the cluster: $\Delta\phi(v) = \frac{\mathrm{out}(v) - \mathrm{in}(v)}{\mathrm{vol}(A) + k_v}$, where $\mathrm{out}(v)$ / $\mathrm{in}(v)$ are the numbers of $v$'s neighbors outside / inside the cluster, $\mathrm{vol}(A) = \sum_{i \in A} k_i$, and $k_i$ is the degree of node $i$.
- pick the p-th local minimum as the cut point of the cluster (this controls the cluster size).
- jump back to (1) with another seed node, and continue the loop until all nodes are clustered (a rough sketch of this outer loop follows the list).
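- below is a sketch of that outer loop in plain Python (my own pseudostructure, not code from the report): `PPR` and `conduct` are implemented in the next sections, `find_pth_local_minimum` is a hypothetical helper shown later, and the bookkeeping for already-clustered nodes is my assumption.

```python
import random

# sketch of the overall clustering loop (steps 1-4 above); assumes NODE_NUM, PPR,
# conduct, and find_pth_local_minimum exist as described elsewhere in this report
unclustered = set(range(NODE_NUM))
clusters = []
while unclustered:
    seed = random.choice(list(unclustered))          # (1) pick a random seed node
    rank_pair = PPR(seed)                            # ranks sorted decreasingly
    rank_pair = [(n, r) for n, r in rank_pair if n in unclustered]
    cond_list = conduct(rank_pair)                   # (2) cumulative conductance (sweep)
    cut = find_pth_local_minimum(cond_list, p=1)     # (3) hypothetical helper, see below
    cluster = [n for n, r in rank_pair[:cut + 1]]    # (4) cut the cluster out
    clusters.append(cluster)
    unclustered -= set(cluster)
```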
MapReduce Implementation
PPR for a random seed node
- almost the same as PageRank in hw2.
- read the file and parse as edges:
file -> (node_0, node_1)
```python
def parse_mapper(line):
    line = line.split()
    x, y = int(line[0]), int(line[1])
    return x, y
```
- construct neighbor list:
(node_0, node_1) -> (node, [nei_0, nei_1, ..., nei_N])
```python
# (node_0, node_1) -> (node_0, [node_1]), (node_1, [node_0])
def nei_mapper(pair):
    x, y = pair
    return [(x, [y]), (y, [x])]

# (node, neighbor) -> (node, [nei_0, nei_1, ..., nei_N])
def nei_reducer(x, y):
    return x + y
```
- init a list to store ranks:
rank = [1/NODE_NUM]*NODE_NUM
- equally send out the node's own rank to its neighbors:
```python
# (node, [nei_0, nei_1, ..., nei_N]) -> (nei_0, rank/degree), ..., (nei_N, rank/degree)
def rank_mapper(pair):
    node, nei_list = pair
    return [(nei, rank[node]/len(nei_list)) for nei in nei_list]

# (nei, rank/degree) -> (nei, new_rank)
def rank_reducer(x, y):
    return x + y
```
- calculate the teleport and normalization:
```python
def PPR(target):
    global rank                  # rank_mapper reads the module-level `rank` list
    iters = 20
    for i in range(iters):
        rank_new = nei_list.flatMap(rank_mapper).reduceByKey(rank_reducer).collect()
        for node, val in rank_new:
            rank[node] = BETA*val + ((1-BETA) if node == target else 0)
        SUM = sum(rank)
        rank = [r + (1-SUM)/NODE_NUM for r in rank]   # re-distribute the leaked rank
    return sorted(enumerate(rank), key=lambda x: x[1], reverse=True)
```
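- putting the pieces above together, the `nei_list` RDD and one PPR run would be wired roughly like this (a sketch; the SparkContext `sc`, the input path, the teleport parameter value, and using the node count as `NODE_NUM` are my assumptions, not shown in the report):

```python
from pyspark import SparkContext

sc = SparkContext()                           # assumed Spark setup
nei_list = sc.textFile("graph.txt") \
             .map(parse_mapper) \
             .flatMap(nei_mapper) \
             .reduceByKey(nei_reducer) \
             .cache()                         # reused in every PPR / conductance pass

NODE_NUM = nei_list.count()                   # assumes nodes are numbered 0..NODE_NUM-1
BETA = 0.85                                   # assumed teleport parameter (alpha)
rank = [1/NODE_NUM]*NODE_NUM
rank_pair = PPR(0)                            # e.g. run PPR with node 0 as the seed
```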
compute the cumulative conductance in the decreasing PPR order
- convert nei_list into the two parts of the fraction in the conductance formula
```python
# (node, [nei_0, nei_1, ..., nei_N])
#   -> ("in", degree) & ("out", # of neighbor nodes outside the cluster)
# where "in" feeds the denominator and "out" feeds the numerator of the conductance
def cond_mapper(pair):
    node, nei_list = pair
    if not clus[node]:              # node is not in the cluster
        return []
    cout = 0                        # node is in the cluster
    for nei in nei_list:
        if not clus[nei]:
            cout += 1               # count the neighbor nodes outside the cluster
    return [("in", len(nei_list)), ("out", cout)]

# ("in", degree) -> ("in", denominator of the conductance)
# ("out", # of outside neighbors) -> ("out", numerator of the conductance)
def cond_reducer(x, y):
    return x + y
```
- here with MapReduce, I can only recompute the whole conductance each time.
- we need to set up a boolean list recording whether each node is in the current cluster:
clus = [False]*NODE_NUM
- below is the whole computation of conductance.
```python
def conduct(rank_pair):
    global clus                  # cond_mapper reads the module-level `clus`
    cond_list = []
    clus = [False]*NODE_NUM
    for node, rank in rank_pair:
        clus[node] = True        # add one more node into the cluster
        cond = dict(nei_list.flatMap(cond_mapper).reduceByKey(cond_reducer).collect())
        cond_list.append(cond["out"]/cond["in"])   # the two parts of the fraction
    return cond_list
```
- for comparison, I put my random-access version here.
- init a list of each node's neighbors, which we can access directly:
```python
d_list = [0]*NODE_NUM
for pair in nei_list.collect():
    d_list[pair[0]] = pair[1]
```
- the implementation of the SWEEP operation:
- here we only compute the conductance changes when adding new nodes into the cluster.
```python
def conduct(rank_pair):
    cluster = [False]*NODE_NUM
    cond_list = []
    cnt_in, cnt_out = 0, 0
    for node, rank in rank_pair:
        cluster[node] = True
        # vol(A) grows by the degree of the new node
        cnt_in += len(d_list[node])
        # the cut grows by out(v) - in(v) = 2*out(v) - degree(v)
        cnt_out += len([nei for nei in d_list[node] if not cluster[nei]])*2 - len(d_list[node])
        cond_list.append(cnt_out/cnt_in)
    return cond_list
```
find the p-th local minimum and cut out the cluster
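- the report doesn't list this step's code, so here is a minimal helper of my own (not from the report) that picks the index of the p-th local minimum of `cond_list` as the cut point:

```python
def find_pth_local_minimum(cond_list, p=1):
    # scan the sweep curve and return the index of the p-th local minimum
    count = 0
    for i in range(1, len(cond_list) - 1):
        if cond_list[i] < cond_list[i - 1] and cond_list[i] <= cond_list[i + 1]:
            count += 1
            if count == p:
                return i
    return len(cond_list) - 1   # fallback: fewer than p minima, take the whole sweep
```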
Results
- small testcase - 17 nodes, 3 clusters
- pick the 1st local minimum as the cluster cut



- medium testcase - 200 nodes, 20 clusters
- pick the 1st local minimum as the cluster cut; looks good



- large testcase - 4000 nodes, 10 clusters
- pick the 10th local minimum as the cluster cut; we can see that some clusters got infiltrated by others.



- facebook testcase from SNAP - 4039 nodes, ??? clusters
- not good at all...



Discussion
- I found that the lecture slides only (and always) say "FIND GOOD CLUSTERS" at the local minima of the cumulative conductance, but never talk about the relationship between the minima and the clusters.
- After we cut the first cluster at the p-th local minimum, should we take the next p-th local minimum as the 2nd cluster? Or run a new PPR sequence with another seed node and take its p-th local minimum again? That isn't specified, so I implemented both versions and tested which one is correct.
one PPR sequence for all clusters
- as mentioned above, in this version I only run PPR once, for one specific seed node. After that, we separate the graph into clusters at the minima of that single PPR order sequence.
- here's the experiment. I take `node 0` as the seed node, and list out the PPR ranks. The algorithm cuts the first cluster as the orange one, i.e. the first local minimum (p=1) of the cumulative conductance lies between `node 4` and `node 8`.
- as we can see, for the second cluster we'd like to cut, there are some nodes (such as `node 8` and `node 15`) with similar PPR that actually belong to different clusters. That means the PPR only indicates how far a node is from the seed node, but doesn't distinguish between nodes with similar PPR values.
one PPR sequence per cluster
- this is the main version I implemented in this project, since it's more accurate.
- I didn't implement the approximated version of PPR because the exact one is already fast enough (compared to Louvain), and I was running out of time for this project qAq.
Louvain Algorithm
Process Flow
- A bottom-up (merging) hierarchical clustering (greedy) algorithm
- start with one vertex per cluster.
- compute the modularity change that results from placing one vertex into a neighboring cluster.
- Modularity: $Q(G,S) = \frac{1}{2m} \sum_{s \in S} \sum_{i,j \in s} \left( A_{ij} - \frac{k_i k_j}{2m} \right)$
- where $A_{ij}$ is the weight of edge $(i, j)$, $k_i$ is the sum of the weights of the edges attached to node $i$, and $m$ is the number of edges.
- For the unweighted graphs here, we take $A_{ij}$ as 1 or 0 and $k_i$ as the degree of node $i$; the other two stay the same as above (a direct computation sketch follows this list).
- find the vertex and cluster whose move gains the most modularity, and actually perform that move.
- the modularity here is for the whole graph, not for a single edge.
- thus, again we don't need to compute the whole modularity in every iteration; we just compute the changes between iterations.
- jump back to (2) and continue the loop until there is no more modularity gain.
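- for reference, a direct (non-MapReduce) computation of Q for a given partition could look like the sketch below; `adj` (a list of neighbor sets) and the cluster labels `cidx` are my assumed inputs, not code from the report:

```python
# brute-force modularity Q(G, S) for an unweighted graph (a sketch, O(N^2))
def modularity(adj, cidx, m):
    Q = 0.0
    n = len(adj)
    for i in range(n):
        for j in range(n):
            if cidx[i] != cidx[j]:
                continue                            # only pairs inside one cluster count
            A_ij = 1 if j in adj[i] else 0          # unweighted: A_ij is 0 or 1
            Q += A_ij - len(adj[i]) * len(adj[j]) / (2 * m)
    return Q / (2 * m)
```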
MapReduce Implementation
- same as PPR, we set up an RDD with the `(node, list)` structure: `file -> (node_0, node_1) -> (node, [nei_0, nei_1, ..., nei_N])`
- the implementation of the mapper and reducer is the same as PPR's, so it is omitted here.
- compute the modularity changes
- when we place a node T into a neighboring cluster, there are two kinds of neighbor nodes of T: nodes in the same cluster as T, and nodes outside T's cluster. The modularity change of placing the node can then be calculated as `mod_out - mod_in`. With this in mind, the mapper & reducer can be defined as follows:
- first, we multiply the degrees of the two nodes on each edge.
```python
# (node, [nei_0, nei_1, ..., nei_N]) -> ( (node, nei), degree )
def degree_mapper(pair):
    node, nei_list = pair
    degree = len(nei_list)
    ret = []
    for nei in nei_list:
        if node < nei:        # keep the pair in a canonical order so it reduces correctly
            ret.append( ((node, nei), degree) )
        else:
            ret.append( ((nei, node), degree) )
    return ret

# ( (node_0, node_1), degree ) -> ( (node_0, node_1), product of the two nodes' degrees )
def degree_reducer(x, y):
    return x * y
```
- to record which nodes are in which clusters, set up a list:
```python
cidx = [i for i in range(NODE_NUM)]
```
- then, compute the modularity change and classify it into in-modularity and out-modularity:
```python
# ( (node_0, node_1), product of degrees ) -> ( (node, cluster), in_mod or out_mod )
def mod_mapper(pair):
    (node_0, node_1), degree = pair
    if cidx[node_0] == cidx[node_1]:        # same cluster: in_mod
        ret = []
        for idx in list(set(cidx)):
            ret.append( ((node_0, idx), degree/2/EDGE_NUM - 1) )
        for idx in list(set(cidx)):
            ret.append( ((node_1, idx), degree/2/EDGE_NUM - 1) )
        return ret
    else:                                   # different cluster: out_mod
        return [((node_0, cidx[node_1]), 1 - degree/2/EDGE_NUM),
                ((node_1, cidx[node_0]), 1 - degree/2/EDGE_NUM)]

# ( (node, cluster), in_mod or out_mod ) -> ( (node, cluster), sum(out_mod) - sum(in_mod) )
def mod_reducer(x, y):
    return x + y
```
- now, we have the `modularity changes` of placing a `node` into a `cluster`: `( (node, cluster), delta_mod )`
- the whole iteration is shown below:
```python
# build the per-edge degree products with the mapper/reducer defined above
degree = nei_list.flatMap(degree_mapper).reduceByKey(degree_reducer)

while True:
    modul = degree.flatMap(mod_mapper).reduceByKey(mod_reducer)
    max_pair = modul.max(key=lambda x: x[1])   # select the maximal modularity change
    (node, idx), max_mod = max_pair            # which is the desired operation
    if max_mod > 0:
        cidx[node] = idx                       # actually place the node into that cluster
    else:
        break                                  # no more modularity gain
```
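- once the loop stops, the assignment lives in `cidx`; a small helper of my own (not from the report) turns the labels into explicit clusters:

```python
from collections import defaultdict

def collect_clusters(cidx):
    clusters = defaultdict(list)
    for node, label in enumerate(cidx):
        clusters[label].append(node)    # group the nodes that share a cluster label
    return list(clusters.values())
```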
Results
- small testcase - 17 nodes, 3 clusters


- medium testcase - 200 nodes, 20 clusters


- large testcase - 4000 nodes, 10 clusters


- facebook testcase from SNAP - 4039 nodes, ??? clusters


Discussion
- vertex-based vs cluster-based
- people implement this algorithm in two different ways. Some do it like mine above, placing nodes into neighboring clusters to see if the modularity gains, while others try combining clusters, that is, they try placing whole clusters into neighboring clusters.
- I've tried the "combining clusters" version, but I didn't get the results right. My current version gives good results, so I think it is at least workable?
- too slow? try some pruning and random edge picking:
- if we have N nodes, then a fully connected graph has N(N-1)/2 edges. According to the algorithm, we have to iterate through all edges in the graph and consider both directions of each pair when placing nodes into their neighboring clusters, so in the worst case we perform that operation N(N-1) times (e.g. for the large testcase with N = 4000, that is roughly 16 million placements per pass).
- Pruning the repetition
- To place a node into a neighboring cluster, if the node has more than one edge connecting to that cluster, the modularity computation sums the contribution of all of these edges. Picking any of these edges therefore results in the same modularity change, i.e. the modularity change depends on the `(node, cluster)` pair, not on individual `(node, node)` edges.
- Without constructing another data structure to maintain these node-cluster pairs (the clusters change in every iteration, so maintaining correctness would be complicated and costly), we can just prune them out with some conditional checks, avoiding computing the same modularity change many times (a sketch of such a check is below).
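- a sketch of what such a check could look like outside MapReduce; `edges`, `cidx`, and `evaluate_move` are hypothetical names for the edge list, the cluster labels, and the modularity-change computation:

```python
# skip repeated (node, cluster) evaluations within one pass (my own sketch)
seen = set()
for u, v in edges:                          # each undirected edge
    for node, other in ((u, v), (v, u)):    # both directions of the pair
        move = (node, cidx[other])          # "place `node` into `other`'s cluster"
        if move in seen:
            continue                        # this (node, cluster) move was already evaluated
        seen.add(move)
        evaluate_move(node, cidx[other])    # hypothetical: compute the delta modularity once
```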
- Random edge picking
- A loose assumption here: not only the node placement with the largest modularity change gets performed; placements whose modularity change is large (to some extent) will all be performed eventually.
- Thus, we just have to pick a placement whose modularity change is large enough. We can find the placement with the largest modularity change within just a subset of all the edges. Picking edges randomly does exactly that, approximating the desired result without traversing all the edges (see the sketch below).
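- in the MapReduce version this could be as simple as sampling the `degree` RDD before the modularity pass (a sketch; the sampling fraction is my own assumption, not a value from the report):

```python
SAMPLE_RATE = 0.1                            # assumed fraction, tune per graph size
modul = degree.sample(False, SAMPLE_RATE) \
              .flatMap(mod_mapper) \
              .reduceByKey(mod_reducer)
max_pair = modul.max(key=lambda x: x[1])     # best move within the sampled edges only
```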
- the NTR-free Assumption
- another assumption I came up with to make the algorithm faster.
- while trying to place nodes into their neighboring clusters, I wondered whether a node that already belongs to a cluster of more than one node could ever gain the most modularity by being moved into another cluster. I just call this an NTR operation.
- intuitively, I think it's impossible, by an induction-style argument: if a node has already been merged into a cluster, then in that iteration placing the node into that cluster gained the most modularity, and the connectivity between that node and any other cluster should not give a larger gain.
- however, I've tested it, and I didn't seem to get it right...
Conclusion & Comparison
PPR vs Louvain
- PPR is much faster than Louvain (here I mean without MapReduce, since my MapReduce implementation of Louvain is quite good, but the one for PPR is bad...)
- We have to specify how big a cluster should be for PPR, while Louvain needs no such hand-tuning.
- For the above reason, and because cluster sizes can vary a lot, Louvain ends up closer to the ground truth.
Conductance vs Modularity
- cluster-based or graph-based?
- conductance is defined for a single cluster, while modularity has definitions both for one cluster and for the whole graph.
- Actually, I would like to try Louvain with minimizing the sum of conductances instead, but I had no time for that.
- can they be applied to weighted graphs? directed graphs?
- both can be applied to weighted graphs; in fact, the original definitions of both metrics are for weighted graphs.
- I'm not sure about the definition of clusters in directed graphs...
- easy to compute? both can be calculated in a linear traversal, but conductance needs to store more information to compute.