Distributed Gradient Clustering: Convergence and the Effect of Initialization

Published in 58th Asilomar Conference on Signals, Systems, and Computers [To appear], 2024

Abstract: We study the effects of center initialization on the performance of a family of distributed gradient-based clustering algorithms introduced in Armacki et al. (2025), which operate over connected networks of users. In the considered scenario, each user holds a local dataset and communicates only with its immediate neighbours, with the aim of finding a global clustering of the joint data. We perform extensive numerical experiments evaluating the effects of center initialization, demonstrating that our methods are more resilient to initialization than centralized gradient clustering. Next, inspired by the K-means++ initialization, we propose a novel distributed center initialization scheme, which is shown to improve the performance of our methods compared to the baseline random initialization.
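The K-means++ rule that inspires the proposed distributed scheme picks each new center with probability proportional to its squared distance from the nearest center chosen so far. A minimal centralized sketch (the paper's distributed variant is not reproduced here; the function name and parameters are illustrative):

```python
import numpy as np

def kmeanspp_init(data, k, seed=0):
    """Pick k initial centers: the first uniformly at random, each
    subsequent one sampled with probability proportional to its squared
    distance to the nearest already-chosen center (K-means++ rule)."""
    rng = np.random.default_rng(seed)
    centers = [data[rng.integers(len(data))]]
    for _ in range(k - 1):
        # squared distance of every point to its closest current center
        d2 = np.min(
            ((data[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1),
            axis=1,
        )
        centers.append(data[rng.choice(len(data), p=d2 / d2.sum())])
    return np.array(centers)
```

Because far-away points get higher sampling mass, the chosen centers tend to spread across well-separated groups, which is the property the distributed scheme aims to retain.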

Recommended citation: Armacki, A., Sharma, H., Bajović, D., Jakovetić, D., Chakraborty, M., & Kar, S. (2024). Distributed Gradient Clustering: Convergence and the Effect of Initialization. 58th Asilomar Conference on Signals, Systems, and Computers [To appear]. https://github.com/aarmacki/aarmacki.github.io/blob/master/publications/Asilomar_2024.pdf

Large Deviations and Improved Mean-squared Error Rates of Nonlinear SGD: Heavy-tailed Noise and Power of Symmetry

Published in arXiv preprint, 2024

Abstract: We study large deviations and mean-squared error (MSE) guarantees of a general framework of nonlinear stochastic gradient methods in the online setting, in the presence of heavy-tailed noise. Unlike existing works that rely on the closed form of a nonlinearity (typically clipping), our framework treats the nonlinearity in a black-box manner, allowing us to provide unified guarantees for a broad class of bounded nonlinearities, including many popular ones, like sign, quantization, normalization, as well as component-wise and joint clipping. We provide several strong results for a broad range of step-sizes in the presence of heavy-tailed noise whose probability density function is symmetric, positive in a neighbourhood of zero, and may have unbounded moments. In particular, for non-convex costs we provide a large deviation upper bound for the minimum norm-squared of gradients, showing an asymptotic tail decay on an exponential scale, at a rate $\sqrt{t} / \log(t)$. We establish the accompanying rate function, showing an explicit dependence on the choice of step-size, nonlinearity, noise and problem parameters. Next, for non-convex costs and the minimum norm-squared of gradients, we derive the optimal MSE rate $\widetilde{\mathcal{O}}(t^{-1/2})$. Moreover, for strongly convex costs and the last iterate, we provide an MSE rate that can be made arbitrarily close to the \emph{optimal} rate $\mathcal{O}(t^{-1})$, improving on the state-of-the-art results in the presence of heavy-tailed noise. Finally, we establish almost sure convergence of the minimum norm-squared of gradients, providing an explicit rate, which can be made arbitrarily close to $o(t^{-1/4})$.
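The black-box update underlying the framework is plain SGD with a bounded nonlinearity applied to the stochastic gradient. A minimal sketch, with illustrative step-size parameters and illustrative implementations of a few covered nonlinearities (not the paper's exact tuning):

```python
import numpy as np

# A few bounded nonlinearities covered by the framework (illustrative forms).
NONLINEARITIES = {
    "sign": np.sign,  # component-wise sign
    "clip": lambda g, m=1.0: g * min(1.0, m / (np.linalg.norm(g) + 1e-12)),  # joint clipping
    "normalize": lambda g: g / (np.linalg.norm(g) + 1e-12),  # normalization
}

def nonlinear_sgd(grad, x0, nonlinearity, steps=1000, a=1.0, delta=0.75):
    """Online nonlinear SGD: x_{t+1} = x_t - alpha_t * N(g_t), with a
    polynomially decaying step-size alpha_t = a / (t+1)^delta.
    The nonlinearity N is treated as a black box."""
    x = np.array(x0, dtype=float)
    for t in range(steps):
        g = grad(x)  # noisy gradient sample
        x = x - a / (t + 1) ** delta * nonlinearity(g)
    return x
```

Because every listed nonlinearity is bounded, each update moves the iterate by at most a multiple of $\alpha_t$, which is what prevents a single heavy-tailed gradient sample from producing an unbounded step.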

Recommended citation: Armacki, A., Yu, S., Bajović, D., Jakovetić, D., & Kar, S. (2024). Large Deviations and Improved Mean-squared Error Rates of Nonlinear SGD: Heavy-tailed Noise and Power of Symmetry. arXiv preprint arXiv:2410.15637. https://arxiv.org/abs/2410.15637

Nonlinear Stochastic Gradient Descent and Heavy-tailed Noise: A Unified Framework and High-probability Guarantees

Published in arXiv preprint, 2024

Abstract: We study high-probability convergence in online learning, in the presence of heavy-tailed noise. To combat the heavy tails, a general framework of nonlinear SGD methods is considered, subsuming several popular nonlinearities like sign, quantization, component-wise and joint clipping. In our work the nonlinearity is treated in a black-box manner, allowing us to establish unified guarantees for a broad range of nonlinear methods. For symmetric noise and non-convex costs we establish convergence of gradient norm-squared, at a rate $\widetilde{\mathcal{O}}(t^{-1/4})$, while for the last iterate of strongly convex costs we establish convergence to the population optima, at a rate $\mathcal{O}(t^{-\zeta})$, where $\zeta \in (0,1)$ depends on noise and problem parameters. Further, if the noise is a (biased) mixture of symmetric and non-symmetric components, we show convergence to a neighbourhood of stationarity, whose size depends on the mixture coefficient, nonlinearity and noise. Compared to state-of-the-art works, which only consider clipping and require unbiased noise with bounded $p$-th moments, $p \in (1,2]$, we provide guarantees for a broad class of nonlinearities, without any assumptions on noise moments. While the rate exponents in state-of-the-art works depend on noise moments and vanish as $p \rightarrow 1$, our exponents are constant and strictly better whenever $p < 6/5$ for non-convex and $p < 8/7$ for strongly convex costs. Experiments validate our theory, demonstrating noise symmetry in real-life settings and showing that clipping is not always the optimal nonlinearity, further underlining the value of a general framework.

Recommended citation: Armacki, A., Yu, S., Sharma, P., Joshi, G., Bajović, D., Jakovetić, D., & Kar, S. (2024). Nonlinear Stochastic Gradient Descent and Heavy-tailed Noise: A Unified Framework and High-probability Guarantees. arXiv preprint arXiv:2410.13954. https://arxiv.org/abs/2410.13954

A Unified Framework for Gradient-based Clustering of Distributed Data

Published in arXiv preprint, 2024

Abstract: We develop a family of distributed clustering algorithms that work over networks of users. In the proposed scenario, users contain a local dataset and communicate only with their immediate neighbours, with the aim of finding a clustering of the full, joint data. The proposed family is parametrized by $\rho \geq 1$, controlling the proximity of users’ center estimates. Specialized to popular clustering losses like $K$-means and Huber loss, our family gives rise to novel distributed clustering algorithms, as well as entirely new clustering methods, e.g., ones based on the logistic function. We provide a unified analysis and establish several strong results, under mild assumptions. First, the sequence of centers generated by the methods converges to a well-defined notion of fixed point, under any center initialization and value of $\rho$. Second, as $\rho$ increases, the family of fixed points produced by our methods converges to a notion of consensus fixed points. We show that consensus fixed points are equivalent to fixed points of gradient clustering over the full data, guaranteeing a clustering of the full data is produced. For the special case of Bregman losses, we show that our fixed points converge to the set of Lloyd points. Numerical experiments on real data confirm our theoretical findings and demonstrate strong performance of the methods.

Recommended citation: Armacki, A., Bajović, D., Jakovetić, D., & Kar, S. (2024). A Unified Framework for Gradient-based Clustering of Distributed Data. arXiv preprint arXiv:2402.01302. https://arxiv.org/abs/2402.01302

A One-shot Framework for Distributed Clustered Learning in Heterogeneous Environments

Published in IEEE Transactions on Signal Processing, 2024

Abstract: The paper proposes a family of communication efficient methods for distributed learning in heterogeneous environments in which users obtain data from one of $K$ different distributions. In the proposed setup, the grouping of users (based on the data distributions they sample), as well as the underlying statistical properties of the distributions, are a priori unknown. A family of One-shot Distributed Clustered Learning methods (ODCL-$\mathcal{C}$) is proposed, parametrized by the set of admissible clustering algorithms $\mathcal{C}$, with the objective of learning the true model at each user. The admissible clustering methods include $K$-means (KM) and convex clustering (CC), giving rise to various one-shot methods within the proposed family, such as ODCL-KM and ODCL-CC. The proposed one-shot approach, based on local computations at the users and a clustering-based aggregation step at the server, is shown to provide strong learning guarantees. In particular, for strongly convex problems it is shown that, as long as the number of data points per user is above a threshold, the proposed approach achieves order-optimal mean-squared error (MSE) rates in terms of the sample size. An explicit characterization of the threshold is provided in terms of problem parameters. The trade-offs with respect to selecting various clustering methods (ODCL-CC, ODCL-KM) are discussed and significant improvements over state-of-the-art are demonstrated. Numerical experiments illustrate the findings and corroborate the performance of the proposed methods.
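The one-shot mechanism can be sketched as: each user trains a model on its local data, the server clusters the collected models, and each user receives the mean of its cluster. A toy sketch assuming plain Lloyd's K-means as the admissible clustering algorithm (function names are illustrative; the paper also admits convex clustering):

```python
import numpy as np

def lloyd(models, k, iters=20, seed=0):
    """Plain K-means (Lloyd's algorithm) on the users' local models."""
    rng = np.random.default_rng(seed)
    centers = models[rng.choice(len(models), size=k, replace=False)]
    for _ in range(iters):
        # assign each model to its nearest center, then recompute means
        d = ((models[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        centers = np.array([
            models[labels == j].mean(axis=0) if (labels == j).any() else centers[j]
            for j in range(k)
        ])
    return labels, centers

def odcl_one_shot(local_models, k):
    """One-shot aggregation: the server clusters the locally trained
    models and sends each user the mean of its cluster."""
    labels, centers = lloyd(local_models, k)
    return centers[labels]  # per-user aggregated model
```

The single round of communication (models up, cluster means down) is what makes the scheme communication efficient; the threshold on the per-user sample size from the analysis is what guarantees the clustering step recovers the true grouping.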

Recommended citation: Armacki, A., Bajović, D., Jakovetić, D., & Kar, S. "A One-Shot Framework for Distributed Clustered Learning in Heterogeneous Environments," in IEEE Transactions on Signal Processing, vol. 72, pp. 636-651, 2024, doi: 10.1109/TSP.2023.3343561. https://ieeexplore.ieee.org/abstract/document/10373766

High-probability Convergence Bounds for Nonlinear Stochastic Gradient Descent Under Heavy-tailed Noise

Published in arXiv preprint, 2023

Abstract: We study high-probability convergence guarantees of learning on streaming data in the presence of heavy-tailed noise. In the proposed scenario, the model is updated in an online fashion, as new information is observed, without storing any additional data. To combat the heavy-tailed noise, we consider a general framework of nonlinear stochastic gradient descent (SGD), providing several strong results. First, for non-convex costs and component-wise nonlinearities, we establish a convergence rate arbitrarily close to $\mathcal{O}\left(t^{-1/4}\right)$, whose exponent is independent of noise and problem parameters. Second, for strongly convex costs and component-wise nonlinearities, we establish a rate arbitrarily close to $\mathcal{O}\left(t^{-1/2}\right)$ for the weighted average of iterates, with exponent again independent of noise and problem parameters. Finally, for strongly convex costs and a broader class of nonlinearities, we establish convergence of the last iterate, with a rate $\mathcal{O}\left(t^{-\zeta} \right)$, where $\zeta \in (0,1)$ depends on problem parameters, noise and nonlinearity. As we show analytically and numerically, $\zeta$ can be used to inform the preferred choice of nonlinearity for given problem settings. Compared to state-of-the-art works, which only consider clipping, require bounded noise moments of order $\eta \in (1,2]$, and establish convergence rates whose exponents go to zero as $\eta \rightarrow 1$, we provide high-probability guarantees for a much broader class of nonlinearities and symmetric density noise, with convergence rates whose exponents are bounded away from zero, even when the noise has \emph{finite first moment only}. Moreover, in the case of strongly convex functions, we demonstrate analytically and numerically that clipping is not always the optimal nonlinearity, further underlining the value of our general framework.

Recommended citation: Armacki, A., Sharma, P., Joshi, G., Bajović, D., Jakovetić, D., & Kar, S. (2023). High-probability Convergence Bounds for Nonlinear Stochastic Gradient Descent Under Heavy-tailed Noise. arXiv preprint arXiv:2310.18784. https://arxiv.org/abs/2310.18784

Communication Efficient Model-aware Federated Learning for Visual Crowd Counting and Density Estimation in Smart Cities

Published in EUSIPCO 2023 31st European Signal Processing Conference, 2023

Abstract: Federated learning (FL) is an attractive paradigm where a number of users can improve their local models via sharing trained models or model increments with a central server, while the users’ data is kept private. However, when model sizes are huge, FL incurs a significant communication overhead. In such scenarios, strategies that perform user sampling, so that only informative users communicate their models to the server, are desired. In this paper, we make several contributions on user sampling in FL. On the theoretical side, we consider a general framework that exhibits user heterogeneity across several dimensions: activation probabilities, gradient noise variance, number of updates per epoch, and communication channel quality. In this setting, we derive the convergence rate of the FedAvg method. The rate explicitly characterizes the effects of heterogeneity and enables us to derive optimal user sampling probabilities in an offline setting, when the sampling probabilities are pre-computed. We then show how these derived probabilities naturally connect with existing optimized sampling strategies in an adaptive-online setting. On the practical side, we study visual crowd counting (VCC) as a representative deep learning application with huge-sized models. We provide an implementation of the FL system over real-world data across three pilot sites: Valletta, Trento, and Novi Sad. The evaluation results demonstrate significant model accuracy benefits through employing FL over the multiple cities, and significant communication savings via non-uniform user sampling strategies.
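The role of the sampling probabilities can be illustrated with a toy FedAvg-style round in which user $i$ participates with probability $p_i$ and its contribution is reweighted by $1/p_i$ so the aggregate stays unbiased. This is a sketch under simplifying assumptions (one full-batch gradient step per round, no channel effects); all names and parameters are illustrative, not the paper's exact method:

```python
import numpy as np

def fedavg_sampled(x0, user_grads, probs, lr=0.1, rounds=50, seed=0):
    """FedAvg-style rounds with non-uniform user sampling: each round,
    user i is active with probability probs[i]; active users' gradients
    are reweighted by 1/probs[i] to keep the aggregate unbiased."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    n = len(user_grads)
    for _ in range(rounds):
        active = rng.random(n) < probs  # Bernoulli participation
        upd = np.zeros_like(x)
        for i in np.flatnonzero(active):
            upd += user_grads[i](x) / probs[i]
        x -= lr * upd / n
    return x
```

Skewing `probs` toward users with noisier or larger updates is the kind of non-uniform sampling that yields the communication savings reported in the paper.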

Recommended citation: Armacki, A., Milošević, N., Bajović, D., Kar, S., Jakovetić, D., Bakhtiarnia, A., Esterle, L., Muscat, A., & Festi, T. (2023). Communication Efficient Model-aware Federated Learning for Visual Crowd Counting and Density Estimation in Smart Cities. EUSIPCO 2023 31st European Signal Processing Conference, Helsinki, Finland. https://eurasip.org/Proceedings/Eusipco/Eusipco2023/pdfs/0000875.pdf

Personalized Federated Learning via Convex Clustering

Published in 2022 IEEE International Smart Cities Conference (ISC2), 2022

Abstract: We propose a parametric family of algorithms for personalized federated learning with locally convex user costs. The proposed framework is based on a generalization of convex clustering in which the differences between different users’ models are penalized via a sum-of-norms penalty, weighted by a penalty parameter $\lambda$. The proposed approach enables “automatic” model clustering, without prior knowledge of the hidden cluster structure, nor the number of clusters. Analytical bounds on the penalty parameter that lead to simultaneous personalization, generalization and automatic model clustering are provided. The solution to the formulated problem enables personalization, by providing different models across different clusters, and generalization, by providing models different from the per-user models computed in isolation. We then provide an efficient algorithm based on the Parallel Direction Method of Multipliers (PDMM) to solve the proposed formulation in a federated server-users setting. Numerical experiments corroborate our findings. As an interesting byproduct, our results provide several generalizations to convex clustering.
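In its generic sum-of-norms form (the paper's exact pairwise weighting may differ), the formulation described above reads:

```latex
\min_{x_1,\dots,x_n} \; \sum_{i=1}^{n} f_i(x_i) \;+\; \lambda \sum_{i < j} \left\| x_i - x_j \right\|,
```

where $f_i$ is user $i$'s local convex cost and $x_i$ its model. Small $\lambda$ yields fully personalized models, large $\lambda$ forces a single global model, and intermediate values fuse subsets of the $x_i$ into equal groups, which is the "automatic" model clustering.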

Recommended citation: A. Armacki, D. Bajović, D. Jakovetić and S. Kar, "Personalized Federated Learning via Convex Clustering," 2022 IEEE International Smart Cities Conference (ISC2), Pafos, Cyprus, 2022, pp. 1-7, doi: 10.1109/ISC255366.2022.9921863. https://ieeexplore.ieee.org/abstract/document/9921863

Gradient Based Clustering

Published in Proceedings of the 39th International Conference on Machine Learning, PMLR 162:929-947, 2022

Abstract: We propose a general approach for distance-based clustering, using the gradient of the cost function that measures clustering quality with respect to cluster assignments and cluster center positions. The approach is an iterative two-step procedure (alternating between cluster assignment and cluster center updates) and is applicable to a wide range of functions, satisfying some mild assumptions. The main advantage of the proposed approach is a simple and computationally cheap update rule. Unlike previous methods that specialize to a specific formulation of the clustering problem, our approach is applicable to a wide range of costs, including non-Bregman clustering methods based on the Huber loss. We analyze the convergence of the proposed algorithm, and show that it converges to the set of appropriately defined fixed points, under arbitrary center initialization. In the special case of Bregman cost functions, the algorithm converges to the set of centroidal Voronoi partitions, which is consistent with prior works. Numerical experiments on real data demonstrate the effectiveness of the proposed method.
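The two-step iteration can be sketched as follows, with the clustering loss supplied only through its gradient on residuals (a minimal sketch with illustrative names and step size; the identity gradient recovers a gradient form of $K$-means, while a clipped residual gives a Huber-type loss):

```python
import numpy as np

def gradient_clustering(data, centers, loss_grad, step=0.1, iters=100):
    """Two-step iteration: (1) assign each point to its nearest center;
    (2) move each center by a gradient step on its per-cluster cost,
    where loss_grad acts on the residuals r = center - point."""
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        # step 1: cluster assignment by nearest center
        d = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # step 2: gradient update of each center on its own cluster
        for j in range(len(centers)):
            pts = data[labels == j]
            if len(pts):
                centers[j] -= step * loss_grad(centers[j] - pts).mean(axis=0)
    return labels, centers
```

For example, `loss_grad=lambda r: r` corresponds to the squared loss, while `loss_grad=lambda r: np.clip(r, -1.0, 1.0)` gives a component-wise Huber-type gradient that is robust to outlying points.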

Recommended citation: Armacki, A., Bajović, D., Jakovetić, D. & Kar, S. (2022). Gradient Based Clustering. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:929-947. https://proceedings.mlr.press/v162/armacki22a.html

Distributed Trust-region Method with First Order Models

Published in IEEE EUROCON 2019-18th International Conference on Smart Technologies, 2019

Abstract: In this paper, we introduce the trust region concept for distributed optimization. A large class of globally convergent methods of this type is used efficiently in centralized optimization, both constrained and unconstrained. The methods of this class are built on the idea of modeling the objective function at each iteration and taking the new iterate as the minimizer of the model in a certain area, called the trust region. The trust region size, the minimization method and the model function depend on the properties of the objective function. In this paper we propose a general framework and concentrate on first-order methods, i.e., gradient methods. Using the trust-region mechanism to generate the step size, we obtain a fully distributed method with node-varying step sizes. Numerical results presented in the paper demonstrate the efficiency of the proposed approach.
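With a first-order (linear) model $m(s) = f(x) + g^\top s$, the trust-region subproblem over $\|s\| \leq \Delta$ has the closed-form minimizer $s = -\Delta\, g/\|g\|$, and the usual accept/shrink logic turns the radius $\Delta$ into a step size. A centralized single-node sketch with illustrative constants (the paper's distributed, node-varying version is not reproduced here):

```python
import numpy as np

def trust_region_gradient(f, grad, x0, delta=1.0, iters=100,
                          eta=0.1, shrink=0.5, grow=2.0):
    """First-order trust-region iteration: minimize the linear model
    m(s) = f(x) + g^T s over ||s|| <= delta, giving s = -delta * g/||g||,
    then grow or shrink the radius based on actual vs predicted decrease."""
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        g = grad(x)
        gn = np.linalg.norm(g)
        if gn < 1e-10:
            break
        s = -delta * g / gn
        pred = delta * gn                 # predicted decrease m(0) - m(s)
        rho = (f(x) - f(x + s)) / pred    # actual vs predicted decrease
        if rho > eta:                     # good agreement: accept, grow radius
            x = x + s
            delta *= grow
        else:                             # poor agreement: reject, shrink radius
            delta *= shrink
    return x
```

Since the radius update needs only local quantities, the same mechanism can generate a separate step size at every node, which is the idea behind the node-varying step sizes above.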

Recommended citation: A. Armacki, D. Jakovetić, N. Krejić and N. K. Jerinkić, "Distributed Trust-Region Method With First Order Models," IEEE EUROCON 2019 - 18th International Conference on Smart Technologies, Novi Sad, Serbia, 2019, pp. 1-6, doi: 10.1109/EUROCON.2019.8861739. https://ieeexplore.ieee.org/abstract/document/8861739