Massive Data Analysis -- Network Clustering
Date: 2020/01/04 (Sat)
104020002 IPCS Ming-Yeh Chou
Methods
- To cluster several undirected graphs (networks), I'll cover two algorithms:
- Personal PageRank
- Louvain Algorithm
- [UPDATED]: I have both the full MapReduce versions & the partial ones:
```
# partial MapReduce versions
# (faster, but with native random access structures)
$ python -u PPR_MR.py
$ python -u louvain_MR.py

# full MapReduce versions
# (very slow, only the small testcase is recommended)
$ python -u PPR_MR_full.py
$ python -u louvain_MR_full.py
```
Personal PageRank
Process Flow
- do PPR for one random seed node, and sort all nodes by rank in decreasing order.
- Personal PageRank: $R(v) = \begin{cases} \alpha \sum_{u \in U} \frac{R(u)}{\mathrm{out}(u)}, & \text{if } v \neq s \\ \alpha \sum_{u \in U} \frac{R(u)}{\mathrm{out}(u)} + (1-\alpha), & \text{if } v = s \end{cases}$, where $U = \{u \mid u \to v\}$, and $s$ is the seed node.
- That is, with probability $(1-\alpha)$ the rank jumps back to the seed node.
- in the decreasing order of PPR, compute the cumulative conductance, and FIND a GOOD CLUSTER at its local minima.
- Conductance: $\phi(A) = \frac{|\{(i,j) \in E \,;\, i \in A,\ j \notin A\}|}{\min(\mathrm{vol}(A),\ 2m - \mathrm{vol}(A))}$
- In linear time while traversing, we only compute the conductance update when adding a node $v$ into the cluster: $\Delta\phi(v) = \frac{\mathrm{out}(v) - \mathrm{in}(v)}{\mathrm{vol}(A) + k_v}$, where $\mathrm{out}(v)$ / $\mathrm{in}(v)$ are the numbers of $v$'s neighbors outside / inside the cluster, $\mathrm{vol}(A) = \sum_{i \in A} k_i$, and $k_i$ is the degree of node $i$.
- pick the p-th local minimum as the cut point of the cluster (this controls the cluster size).
- jump back to (1) with another seed node, and continue the loop until all nodes are clustered (a rough sketch of this outer loop follows the list).
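- below is a sketch of that outer loop in plain Python (my own pseudostructure, not code from the report): `PPR` and `conduct` are implemented in the next sections, `find_pth_local_minimum` is a hypothetical helper shown later, and the bookkeeping for already-clustered nodes is my assumption.

```python
import random

# sketch of the overall clustering loop (steps 1-4 above); assumes NODE_NUM, PPR,
# conduct, and find_pth_local_minimum exist as described elsewhere in this report
unclustered = set(range(NODE_NUM))
clusters = []
while unclustered:
    seed = random.choice(list(unclustered))          # (1) pick a random seed node
    rank_pair = PPR(seed)                            # ranks sorted decreasingly
    rank_pair = [(n, r) for n, r in rank_pair if n in unclustered]
    cond_list = conduct(rank_pair)                   # (2) cumulative conductance (sweep)
    cut = find_pth_local_minimum(cond_list, p=1)     # (3) hypothetical helper, see below
    cluster = [n for n, r in rank_pair[:cut + 1]]    # (4) cut the cluster out
    clusters.append(cluster)
    unclustered -= set(cluster)
```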
MapReduce Implementation
PPR for a random seed node
- almost the same as PageRank in hw2.
- read the file and parse as edges:
file -> (node_0, node_1)
```python
def parse_mapper(line):
    line = line.split()
    x, y = int(line[0]), int(line[1])
    return x, y
```
- construct neighbor list:
(node_0, node_1) -> (node, [nei_0, nei_1, ..., nei_N])
```python
# (node_0, node_1) -> (node_0, [node_1]), (node_1, [node_0])
def nei_mapper(pair):
    x, y = pair
    return [(x, [y]), (y, [x])]

# (node, neighbor) -> (node, [nei_0, nei_1, ..., nei_N])
def nei_reducer(x, y):
    return x + y
```
- init a list to store ranks:
rank = [1/NODE_NUM]*NODE_NUM
- equally send out the node's own rank to its neighbors:
```python
# (node, [nei_0, nei_1, ..., nei_N]) -> (nei_0, rank/degree), ..., (nei_N, rank/degree)
def rank_mapper(pair):
    node, nei_list = pair
    return [(nei, rank[node]/len(nei_list)) for nei in nei_list]

# (nei, rank/degree) -> (nei, new_rank)
def rank_reducer(x, y):
    return x + y
```
- calculate the teleport and normalization:
```python
def PPR(target):
    global rank                  # rank_mapper reads the module-level `rank` list
    iters = 20
    for i in range(iters):
        rank_new = nei_list.flatMap(rank_mapper).reduceByKey(rank_reducer).collect()
        for node, val in rank_new:
            rank[node] = BETA*val + ((1-BETA) if node == target else 0)
        SUM = sum(rank)
        rank = [r + (1-SUM)/NODE_NUM for r in rank]   # re-distribute the leaked rank
    return sorted(enumerate(rank), key=lambda x: x[1], reverse=True)
```
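- putting the pieces above together, the `nei_list` RDD and one PPR run would be wired roughly like this (a sketch; the SparkContext `sc`, the input path, the teleport parameter value, and using the node count as `NODE_NUM` are my assumptions, not shown in the report):

```python
from pyspark import SparkContext

sc = SparkContext()                           # assumed Spark setup
nei_list = sc.textFile("graph.txt") \
             .map(parse_mapper) \
             .flatMap(nei_mapper) \
             .reduceByKey(nei_reducer) \
             .cache()                         # reused in every PPR / conductance pass

NODE_NUM = nei_list.count()                   # assumes nodes are numbered 0..NODE_NUM-1
BETA = 0.85                                   # assumed teleport parameter (alpha)
rank = [1/NODE_NUM]*NODE_NUM
rank_pair = PPR(0)                            # e.g. run PPR with node 0 as the seed
```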
compute the cumulative conductance in the decreasing PPR order
- convert nei_list into the two parts of the fraction in the conductance formula
```python
# (node, [nei_0, nei_1, ..., nei_N])
#   -> ("in", degree) & ("out", # of neighbor nodes outside the cluster)
# where "in" feeds the denominator and "out" feeds the numerator of the conductance
def cond_mapper(pair):
    node, nei_list = pair
    if not clus[node]:              # node is not in the cluster
        return []
    cout = 0                        # node is in the cluster
    for nei in nei_list:
        if not clus[nei]:
            cout += 1               # count the neighbor nodes outside the cluster
    return [("in", len(nei_list)), ("out", cout)]

# ("in", degree) -> ("in", denominator of the conductance)
# ("out", # of outside neighbors) -> ("out", numerator of the conductance)
def cond_reducer(x, y):
    return x + y
```
- here with MapReduce, I can only recompute the whole conductance each time.
- we need to set up a boolean list recording whether each node is in the current cluster:
clus = [False]*NODE_NUM
- below is the whole computation of conductance.
```python
def conduct(rank_pair):
    global clus                  # cond_mapper reads the module-level `clus`
    cond_list = []
    clus = [False]*NODE_NUM
    for node, rank in rank_pair:
        clus[node] = True        # add one more node into the cluster
        cond = dict(nei_list.flatMap(cond_mapper).reduceByKey(cond_reducer).collect())
        cond_list.append(cond["out"]/cond["in"])   # the two parts of the fraction
    return cond_list
```
- for comparison, I put my random-access version here.
- init a list of each node's neighbors, which we can access directly:
```python
d_list = [0]*NODE_NUM
for pair in nei_list.collect():
    d_list[pair[0]] = pair[1]
```
- the implementation of the SWEEP operation:
- here we only compute the conductance changes when adding new nodes into the cluster.
```python
def conduct(rank_pair):
    cluster = [False]*NODE_NUM
    cond_list = []
    cnt_in, cnt_out = 0, 0
    for node, rank in rank_pair:
        cluster[node] = True
        # vol(A) grows by the degree of the new node
        cnt_in += len(d_list[node])
        # the cut grows by out(v) - in(v) = 2*out(v) - degree(v)
        cnt_out += len([nei for nei in d_list[node] if not cluster[nei]])*2 - len(d_list[node])
        cond_list.append(cnt_out/cnt_in)
    return cond_list
```
find the p-th local minimum and cut out the cluster
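- the report doesn't list this step's code, so here is a minimal helper of my own (not from the report) that picks the index of the p-th local minimum of `cond_list` as the cut point:

```python
def find_pth_local_minimum(cond_list, p=1):
    # scan the sweep curve and return the index of the p-th local minimum
    count = 0
    for i in range(1, len(cond_list) - 1):
        if cond_list[i] < cond_list[i - 1] and cond_list[i] <= cond_list[i + 1]:
            count += 1
            if count == p:
                return i
    return len(cond_list) - 1   # fallback: fewer than p minima, take the whole sweep
```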
Results
- small testcase - 17 nodes, 3 clusters
- pick the 1st local minimum as the cluster cut



- medium testcase - 200 nodes, 20 clusters
- pick the 1st local minimum as the cluster cut; looks good



- large testcase - 4000 nodes, 10 clusters
- pick the 10th local minimum as the cluster cut; we can see that some clusters got infiltrated by others.



- facebook testcase from SNAP - 4039 nodes, ??? clusters
- not good at all...



Discussion
- I found that the lecture slides only (and always) say "FIND GOOD CLUSTERS" at the local minima of the cumulative conductance, but never talk about the relationship between the minima and the clusters.
- After we cut the first cluster at the p-th local minimum, should we take the next p-th local minimum as the 2nd cluster? Or run a new PPR sequence with another seed node and take its p-th local minimum again? That isn't specified, so I implemented both versions and tested which one is correct.
one PPR sequence for all clusters
- as mentioned above, in this version I only run PPR once, for one specific seed node. After that, we separate the graph into clusters at the minima of that single PPR order sequence.
- here's the experiment. I take `node 0` as the seed node, and list out the PPR ranks. The algorithm cuts the first cluster as the orange one, i.e. the first local minimum (p=1) of the cumulative conductance lies between `node 4` and `node 8`.
- as we can see, for the second cluster we'd like to cut, there are some nodes (such as `node 8` and `node 15`) with similar PPR that actually belong to different clusters. That means the PPR only indicates how far a node is from the seed node, but doesn't distinguish between nodes with similar PPR values.
one PPR sequence per cluster
- this is the main version I implemented in this project, since it's more accurate.
- I didn't implement the approximated version of PPR because the exact one is already fast enough (compared to Louvain), and I was running out of time for this project qAq.
Louvain Algorithm
Process Flow
- A bottom-up (merging) hierarchical clustering (greedy) algorithm
- start with one vertex per cluster.
- compute the modularity change that results from placing one vertex into a neighboring cluster.
- Modularity: $Q(G,S) = \frac{1}{2m} \sum_{s \in S} \sum_{i,j \in s} \left( A_{ij} - \frac{k_i k_j}{2m} \right)$
- where $A_{ij}$ is the weight of edge $(i, j)$, $k_i$ is the sum of the weights of the edges attached to node $i$, and $m$ is the number of edges.
- For the unweighted graphs here, we take $A_{ij}$ as 1 or 0 and $k_i$ as the degree of node $i$; the other two stay the same as above (a direct computation sketch follows this list).
- find the vertex and cluster whose move gains the most modularity, and actually perform that move.
- the modularity here is for the whole graph, not for a single edge.
- thus, again we don't need to compute the whole modularity in every iteration; we just compute the changes between iterations.
- jump back to (2) and continue the loop until there is no more modularity gain.
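- for reference, a direct (non-MapReduce) computation of Q for a given partition could look like the sketch below; `adj` (a list of neighbor sets) and the cluster labels `cidx` are my assumed inputs, not code from the report:

```python
# brute-force modularity Q(G, S) for an unweighted graph (a sketch, O(N^2))
def modularity(adj, cidx, m):
    Q = 0.0
    n = len(adj)
    for i in range(n):
        for j in range(n):
            if cidx[i] != cidx[j]:
                continue                            # only pairs inside one cluster count
            A_ij = 1 if j in adj[i] else 0          # unweighted: A_ij is 0 or 1
            Q += A_ij - len(adj[i]) * len(adj[j]) / (2 * m)
    return Q / (2 * m)
```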
MapReduce Implementation
- same as PPR, we set up an RDD with the `(node, list)` structure: `file -> (node_0, node_1) -> (node, [nei_0, nei_1, ..., nei_N])`
- the implementation of the mapper and reducer is the same as PPR's, so it is omitted here.
- compute the modularity changes
- when we place a node T into a neighboring cluster, there are two kinds of neighbor nodes of T: nodes in the same cluster as T, and nodes outside T's cluster. The modularity change of placing the node can then be calculated as `mod_out - mod_in`. With this in mind, the mapper & reducer can be defined as follows:
- first, we multiply the degrees of the two nodes on each edge.
```python
# (node, [nei_0, nei_1, ..., nei_N]) -> ( (node, nei), degree )
def degree_mapper(pair):
    node, nei_list = pair
    degree = len(nei_list)
    ret = []
    for nei in nei_list:
        if node < nei:        # keep the pair in a canonical order so it reduces correctly
            ret.append( ((node, nei), degree) )
        else:
            ret.append( ((nei, node), degree) )
    return ret

# ( (node_0, node_1), degree ) -> ( (node_0, node_1), product of the two nodes' degrees )
def degree_reducer(x, y):
    return x * y
```
- to record which nodes are in which clusters, set up a list:
```python
cidx = [i for i in range(NODE_NUM)]
```
- then, compute the modularity change and classify it into in-modularity and out-modularity:
```python
# ( (node_0, node_1), product of degrees ) -> ( (node, cluster), in_mod or out_mod )
def mod_mapper(pair):
    (node_0, node_1), degree = pair
    if cidx[node_0] == cidx[node_1]:        # same cluster: in_mod
        ret = []
        for idx in list(set(cidx)):
            ret.append( ((node_0, idx), degree/2/EDGE_NUM - 1) )
        for idx in list(set(cidx)):
            ret.append( ((node_1, idx), degree/2/EDGE_NUM - 1) )
        return ret
    else:                                   # different cluster: out_mod
        return [((node_0, cidx[node_1]), 1 - degree/2/EDGE_NUM),
                ((node_1, cidx[node_0]), 1 - degree/2/EDGE_NUM)]

# ( (node, cluster), in_mod or out_mod ) -> ( (node, cluster), sum(out_mod) - sum(in_mod) )
def mod_reducer(x, y):
    return x + y
```
- now, we have the `modularity changes` of placing a `node` into a `cluster`: `( (node, cluster), delta_mod )`
- the whole iteration is shown below:
```python
# build the per-edge degree products with the mapper/reducer defined above
degree = nei_list.flatMap(degree_mapper).reduceByKey(degree_reducer)

while True:
    modul = degree.flatMap(mod_mapper).reduceByKey(mod_reducer)
    max_pair = modul.max(key=lambda x: x[1])   # select the maximal modularity change
    (node, idx), max_mod = max_pair            # which is the desired operation
    if max_mod > 0:
        cidx[node] = idx                       # actually place the node into that cluster
    else:
        break                                  # no more modularity gain
```
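- once the loop stops, the assignment lives in `cidx`; a small helper of my own (not from the report) turns the labels into explicit clusters:

```python
from collections import defaultdict

def collect_clusters(cidx):
    clusters = defaultdict(list)
    for node, label in enumerate(cidx):
        clusters[label].append(node)    # group the nodes that share a cluster label
    return list(clusters.values())
```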
Results
- small testcase - 17 nodes, 3 clusters


- medium testcase - 200 nodes, 20 clusters


- large testcase - 4000 nodes, 10 clusters


- facebook testcase from SNAP - 4039 nodes, ??? clusters


Discussion
- vertex-based vs cluster-based
- people implement this algorithm in two different ways. Some do it like mine above, placing nodes into neighboring clusters to see if the modularity gains, while others try combining clusters, that is, they try placing whole clusters into neighboring clusters.
- I've tried the "combining clusters" version, but I didn't get the results right. My current version gives good results, so I think it is at least workable?
- too slow? try some pruning and random edge picking:
- if we have N nodes, then a fully connected graph has N(N-1)/2 edges. According to the algorithm, we have to iterate through all edges in the graph and consider both directions of each pair when placing nodes into their neighboring clusters, so in the worst case we perform that operation N(N-1) times (e.g. for the large testcase with N = 4000, that is roughly 16 million placements per pass).
- Pruning the repetition
- To place a node into a neighboring cluster, if the node has more than one edge connecting to that cluster, the modularity computation sums the contribution of all of these edges. Picking any of these edges therefore results in the same modularity change, i.e. the modularity change depends on the `(node, cluster)` pair, not on individual `(node, node)` edges.
- Without constructing another data structure to maintain these node-cluster pairs (the clusters change in every iteration, so maintaining correctness would be complicated and costly), we can just prune them out with some conditional checks, avoiding computing the same modularity change many times (a sketch of such a check is below).
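- a sketch of what such a check could look like outside MapReduce; `edges`, `cidx`, and `evaluate_move` are hypothetical names for the edge list, the cluster labels, and the modularity-change computation:

```python
# skip repeated (node, cluster) evaluations within one pass (my own sketch)
seen = set()
for u, v in edges:                          # each undirected edge
    for node, other in ((u, v), (v, u)):    # both directions of the pair
        move = (node, cidx[other])          # "place `node` into `other`'s cluster"
        if move in seen:
            continue                        # this (node, cluster) move was already evaluated
        seen.add(move)
        evaluate_move(node, cidx[other])    # hypothetical: compute the delta modularity once
```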
- Random edge picking
- A loose assumption here: not only the node placement with the largest modularity change gets performed; placements whose modularity change is large (to some extent) will all be performed eventually.
- Thus, we just have to pick a placement whose modularity change is large enough. We can find the placement with the largest modularity change within just a subset of all the edges. Picking edges randomly does exactly that, approximating the desired result without traversing all the edges (see the sketch below).
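- in the MapReduce version this could be as simple as sampling the `degree` RDD before the modularity pass (a sketch; the sampling fraction is my own assumption, not a value from the report):

```python
SAMPLE_RATE = 0.1                            # assumed fraction, tune per graph size
modul = degree.sample(False, SAMPLE_RATE) \
              .flatMap(mod_mapper) \
              .reduceByKey(mod_reducer)
max_pair = modul.max(key=lambda x: x[1])     # best move within the sampled edges only
```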
- the NTR-free Assumption
- another assumption I came up with to make the algorithm faster.
- while trying to place nodes into their neighboring clusters, I wondered whether a node that already belongs to a cluster of more than one node could ever gain the most modularity by being moved into another cluster. I just call this an NTR operation.
- intuitively, I think it's impossible, by an induction-style argument: if a node has already been merged into a cluster, then in that iteration placing the node into that cluster gained the most modularity, and the connectivity between that node and any other cluster should not give a larger gain.
- however, I've tested it, and I didn't seem to get it right...
Conclusion & Comparison
PPR vs Louvain
- PPR is much faster than Louvain (here I mean without MapReduce, since my MapReduce implementation of Louvain is quite good, but the one for PPR is bad...)
- We have to specify how big a cluster should be for PPR, while Louvain needs no such hand-tuning.
- For the above reason, and because cluster sizes can vary a lot, Louvain ends up closer to the ground truth.
Conductance vs Modularity
- cluster-based or graph-based?
- conductance is defined for a single cluster, while modularity has definitions both for one cluster and for the whole graph.
- Actually, I would like to try Louvain with minimizing the sum of conductances instead, but I had no time for that.
- can they be applied to weighted graphs? directed graphs?
- both can be applied to weighted graphs; in fact, the original definitions of both metrics are for weighted graphs.
- I'm not sure about the definition of clusters in directed graphs...
- easy to compute? both can be calculated in a linear traversal, but conductance needs to store more information to compute.