Developing a reliable and accurate algorithm for discovering communities in small and mid-sized networks, with optimality guarantees.
Community Detection
Community detection is the descriptive process of clustering the nodes of a network (graph). It is an unsupervised task with a wide range of applications in life sciences (e.g., biological systems), social sciences (offline/online social networks), physical sciences (VLSI design, complex systems), mathematical sciences (agent-based models, computer vision), and even health sciences (drug discovery) (Wikipedia).
There are several existing methods (algorithms) for approaching this computational problem. One family of algorithms attempts to find communities by maximizing modularity: a network-level quantity indicating the fraction of intra-community edges minus the expected fraction under a degree-preserving null model.
Do we need yet another community detection algorithm? Yes, and here's why:
Notation for the IP model: The input is graph G=(V,E) with |V|=n nodes, |E|=m edges, and adjacency matrix entries a_{ij}. d_i is the degree of node i, gamma is the resolution parameter. The term associated with each pair (i,j) is b_{ij} and referred to as the modularity matrix entry for (i,j). Using the binary decision variable x_{ij} for each pair of distinct nodes (i,j), their cluster membership is either the same (represented by x_{ij}=0) or different (represented by x_{ij}=1). K(i,j) indicates a minimum-cardinality separating set for the nodes i,j. Accordingly, the problem of maximizing the modularity of input graph G can be formulated as the IP model above (Dinh and Thai 2015).
We investigated the extent to which current heuristic modularity maximization algorithms succeed in maximizing modularity by evaluating (1) the ratio of their output modularity to the maximum modularity for a given graph (GOP) and (2) the similarity between their output partition and all modularity-maximizing partitions of that graph using Adjusted Mutual Information (AMI). The following figure shows scatterplots of GOP on the y-axis and AMI on the x-axis for the combination of each network and algorithm.
Looking at the y-axis values in the figure, we observe that there is a substantial variation in the values of GOP (i.e., extent of sub-optimality) returned by the eight heuristic algorithms. The x-axis values in the figure show the considerable dissimilarity between the sub-optimal partitions and any optimal partition for these networks. Focusing on the position of data points, we observe that they are mostly located above their corresponding 45-degree line. This indicates that sub-optimal partitions tend to be disproportionally dissimilar to any optimal partition. This result goes against the naive viewpoint that close-to-maximum modularity partitions are also close to an optimal partition.
The average modularity-based heuristic algorithm returns optimal partitions for only 19.4% of these graphs. Taken together with the low success rates of heuristic algorithms in maximizing modularity, our results indicate a crucial mismatch between the design philosophy of modularity maximization algorithms for community detection and their performance: Heuristic modularity maximization algorithms rarely return an optimal partition or a partition resembling an optimal partition.
The Bayan algorithm solves this fundamental problem and provides optimal partitions and partitions within a guaranteed proximity to an optimal partition.
How does Bayan compare with other community detection methods?
In the second part of this project, we compared Bayan with 29 existing methods for community detection in a testbed of randomly generated LFR benchmark graphs (and ABCD benchmark graphs). Regardless of using modularity or other approaches, community detection methods can be compared based on their capability in retrieving the ground-truth communities of LFR benchmark graphs using AMI. We generate LFR graphs with different values for the mixing parameter mu (fraction of inter-community edges). Increasing mu increases the disassociation of the structure and ground-truth communities.
The figure above ranks the 30 algorithms based on AMI in five experiment settings (each with a specific mu parameter value) averaged over 100 LFR graphs.
Bayan has a relatively high and stable performance rank over different values of the mu parameter. This indicates the distinctive capability of maximum-modularity partitions in retrieving the ground-truth communities. These descriptive results are confirmed by rigorous statistical tests on the performance of these algorithms (read the Bayan preprint for more info).
Interested to learn more? Check out our two manuscripts below.
Samin Aref and Mahdi Mostajabdaveh. "Analyzing modularity maximization in approximation, heuristic, and graph neural network algorithms for community detection." ArXiv, 2023 Preprint
Samin Aref, Mahdi Mostajabdaveh, and Hriday Chheda. "Heuristic modularity maximization algorithms for community detection rarely return an optimal partition or anything similar." in Computational Science - ICCS 2023, LNCS 10476, Springer, 2023 doi.org/10.1007/978-3-031-36027-5_48Preprint
Samin Aref, Hriday Chheda, and Mahdi Mostajabdaveh. "The Bayan algorithm: detecting communities in networks through exact and approximate optimization of modularity." ArXiv, 2022 Preprint
Running Bayan for optimization models with more than 2000 variables or constraints (input graphs with more than 44 nodes) requires a free academic license to be installed for the mathematical solver Gurobi (which solves the linear programming relaxation problems within the Bayan algorithm). You may follow the instructions in the GitHub Repository to request and install a free academic license for Gurobi if your usage is not commercial and you are affiliated with an academic institution.
The Bayan algorithm can be used as follows in Jupyter Python:
Project Members
Prof. Samin Aref
Department of Mechanical & Industrial Engineering, University of Toronto
Mahdi Mostajabdaveh, Ph.D.
Huawei Technologies Canada
Hriday Chheda
Department of Mechanical & Industrial Engineering, University of Toronto
Origin of the algorithm name "Bayan"?
Bayan is a village in the Southeast of Iran located at the midpoint of the hometowns of the developers of this project.
Bayan is a small "community" of 22 people in 5 families as of the 2016 census.
Acknowledgments
The Bayan project is supported by the Data Sciences Institute at the University of Toronto.