Distributed Gradient Clustering: Convergence and the Effect of Initialization

Published in 58th Asilomar Conference on Signals, Systems, and Computers [To appear], 2024

Abstract: We study the effects of center initialization on the performance of a family of distributed gradient-based clustering algorithms introduced in Armacki et al. (2025), which operate over connected networks of users. In the considered scenario, each user holds a local dataset and communicates only with its immediate neighbours, with the aim of finding a global clustering of the joint data. We perform extensive numerical experiments evaluating the effects of center initialization, demonstrating that our methods are more resilient to initialization than centralized gradient clustering. Next, inspired by the K-means++ initialization, we propose a novel distributed center initialization scheme, which is shown to improve the performance of our methods compared to the baseline random initialization.
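The K-means++ rule that inspires the proposed distributed scheme picks each new center with probability proportional to its squared distance from the nearest center chosen so far. A minimal centralized sketch (the paper's distributed variant is not reproduced here; the function name and parameters are illustrative):

```python
import numpy as np

def kmeanspp_init(data, k, seed=0):
    """Pick k initial centers: the first uniformly at random, each
    subsequent one sampled with probability proportional to its squared
    distance to the nearest already-chosen center (K-means++ rule)."""
    rng = np.random.default_rng(seed)
    centers = [data[rng.integers(len(data))]]
    for _ in range(k - 1):
        # squared distance of every point to its closest current center
        d2 = np.min(
            ((data[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1),
            axis=1,
        )
        centers.append(data[rng.choice(len(data), p=d2 / d2.sum())])
    return np.array(centers)
```

Because far-away points get higher sampling mass, the chosen centers tend to spread across well-separated groups, which is the property the distributed scheme aims to retain.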

Recommended citation: Armacki, A., Sharma, H., Bajović, D., Jakovetić, D., Chakraborty, M., & Kar, S. (2024). Distributed Gradient Clustering: Convergence and the Effect of Initialization. 58th Asilomar Conference on Signals, Systems, and Computers [To appear]. https://github.com/aarmacki/aarmacki.github.io/blob/master/publications/Asilomar_2024.pdf

Large Deviations and Improved Mean-squared Error Rates of Nonlinear SGD: Heavy-tailed Noise and Power of Symmetry

Published in arXiv preprint, 2024

Abstract: We study large deviations and mean-squared error (MSE) guarantees of a general framework of nonlinear stochastic gradient methods in the online setting, in the presence of heavy-tailed noise. Unlike existing works that rely on the closed form of a nonlinearity (typically clipping), our framework treats the nonlinearity in a black-box manner, allowing us to provide unified guarantees for a broad class of bounded nonlinearities, including many popular ones, like sign, quantization, normalization, as well as component-wise and joint clipping. We provide several strong results for a broad range of step-sizes in the presence of heavy-tailed noise whose probability density function is symmetric, positive in a neighbourhood of zero, and may have unbounded moments. In particular, for non-convex costs we provide a large deviation upper bound for the minimum norm-squared of gradients, showing an asymptotic tail decay on an exponential scale, at a rate $\sqrt{t} / \log(t)$. We establish the accompanying rate function, showing an explicit dependence on the choice of step-size, nonlinearity, noise and problem parameters. Next, for non-convex costs and the minimum norm-squared of gradients, we derive the optimal MSE rate $\widetilde{\mathcal{O}}(t^{-1/2})$. Moreover, for strongly convex costs and the last iterate, we provide an MSE rate that can be made arbitrarily close to the \emph{optimal} rate $\mathcal{O}(t^{-1})$, improving on the state-of-the-art results in the presence of heavy-tailed noise. Finally, we establish almost sure convergence of the minimum norm-squared of gradients, providing an explicit rate, which can be made arbitrarily close to $o(t^{-1/4})$.
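The black-box update underlying the framework is plain SGD with a bounded nonlinearity applied to the stochastic gradient. A minimal sketch, with illustrative step-size parameters and illustrative implementations of a few covered nonlinearities (not the paper's exact tuning):

```python
import numpy as np

# A few bounded nonlinearities covered by the framework (illustrative forms).
NONLINEARITIES = {
    "sign": np.sign,  # component-wise sign
    "clip": lambda g, m=1.0: g * min(1.0, m / (np.linalg.norm(g) + 1e-12)),  # joint clipping
    "normalize": lambda g: g / (np.linalg.norm(g) + 1e-12),  # normalization
}

def nonlinear_sgd(grad, x0, nonlinearity, steps=1000, a=1.0, delta=0.75):
    """Online nonlinear SGD: x_{t+1} = x_t - alpha_t * N(g_t), with a
    polynomially decaying step-size alpha_t = a / (t+1)^delta.
    The nonlinearity N is treated as a black box."""
    x = np.array(x0, dtype=float)
    for t in range(steps):
        g = grad(x)  # noisy gradient sample
        x = x - a / (t + 1) ** delta * nonlinearity(g)
    return x
```

Because every listed nonlinearity is bounded, each update moves the iterate by at most a multiple of $\alpha_t$, which is what prevents a single heavy-tailed gradient sample from producing an unbounded step.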

Recommended citation: Armacki, A., Yu, S., Bajović, D., Jakovetić, D., & Kar, S. (2024). Large Deviations and Improved Mean-squared Error Rates of Nonlinear SGD: Heavy-tailed Noise and Power of Symmetry. arXiv preprint arXiv:2410.15637. https://arxiv.org/abs/2410.15637

Nonlinear Stochastic Gradient Descent and Heavy-tailed Noise: A Unified Framework and High-probability Guarantees

Published in arXiv preprint, 2024

Abstract: We study high-probability convergence in online learning, in the presence of heavy-tailed noise. To combat the heavy tails, a general framework of nonlinear SGD methods is considered, subsuming several popular nonlinearities like sign, quantization, component-wise and joint clipping. In our work the nonlinearity is treated in a black-box manner, allowing us to establish unified guarantees for a broad range of nonlinear methods. For symmetric noise and non-convex costs we establish convergence of gradient norm-squared, at a rate $\widetilde{\mathcal{O}}(t^{-1/4})$, while for the last iterate of strongly convex costs we establish convergence to the population optima, at a rate $\mathcal{O}(t^{-\zeta})$, where $\zeta \in (0,1)$ depends on noise and problem parameters. Further, if the noise is a (biased) mixture of symmetric and non-symmetric components, we show convergence to a neighbourhood of stationarity, whose size depends on the mixture coefficient, nonlinearity and noise. Compared to state-of-the-art works, which only consider clipping and require unbiased noise with bounded $p$-th moments, $p \in (1,2]$, we provide guarantees for a broad class of nonlinearities, without any assumptions on noise moments. While the rate exponents in state-of-the-art works depend on noise moments and vanish as $p \rightarrow 1$, our exponents are constant and strictly better whenever $p < 6/5$ for non-convex and $p < 8/7$ for strongly convex costs. Experiments validate our theory, demonstrating noise symmetry in real-life settings and showing that clipping is not always the optimal nonlinearity, further underlining the value of a general framework.

Recommended citation: Armacki, A., Yu, S., Sharma, P., Joshi, G., Bajović, D., Jakovetić, D., & Kar, S. (2024). Nonlinear Stochastic Gradient Descent and Heavy-tailed Noise: A Unified Framework and High-probability Guarantees. arXiv preprint arXiv:2410.13954. https://arxiv.org/abs/2410.13954

A Unified Framework for Gradient-based Clustering of Distributed Data

Published in arXiv preprint, 2024

Abstract: We develop a family of distributed clustering algorithms that work over networks of users. In the proposed scenario, users contain a local dataset and communicate only with their immediate neighbours, with the aim of finding a clustering of the full, joint data. The proposed family is parametrized by $\rho \geq 1$, controlling the proximity of users’ center estimates. Specialized to popular clustering losses like $K$-means and Huber loss, our family gives rise to novel distributed clustering algorithms, as well as entirely new clustering methods, e.g., ones based on the logistic function. We provide a unified analysis and establish several strong results, under mild assumptions. First, the sequence of centers generated by the methods converges to a well-defined notion of fixed point, under any center initialization and value of $\rho$. Second, as $\rho$ increases, the family of fixed points produced by our methods converges to a notion of consensus fixed points. We show that consensus fixed points are equivalent to fixed points of gradient clustering over the full data, guaranteeing a clustering of the full data is produced. For the special case of Bregman losses, we show that our fixed points converge to the set of Lloyd points. Numerical experiments on real data confirm our theoretical findings and demonstrate strong performance of the methods.

Recommended citation: Armacki, A., Bajović, D., Jakovetić, D., & Kar, S. (2024). A Unified Framework for Gradient-based Clustering of Distributed Data. arXiv preprint arXiv:2402.01302. https://arxiv.org/abs/2402.01302

A One-shot Framework for Distributed Clustered Learning in Heterogeneous Environments

Published in IEEE Transactions on Signal Processing, 2024

Abstract: The paper proposes a family of communication efficient methods for distributed learning in heterogeneous environments in which users obtain data from one of $K$ different distributions. In the proposed setup, the grouping of users (based on the data distributions they sample), as well as the underlying statistical properties of the distributions, are a priori unknown. A family of One-shot Distributed Clustered Learning methods (ODCL-$\mathcal{C}$) is proposed, parametrized by the set of admissible clustering algorithms $\mathcal{C}$, with the objective of learning the true model at each user. The admissible clustering methods include $K$-means (KM) and convex clustering (CC), giving rise to various one-shot methods within the proposed family, such as ODCL-KM and ODCL-CC. The proposed one-shot approach, based on local computations at the users and a clustering-based aggregation step at the server, is shown to provide strong learning guarantees. In particular, for strongly convex problems it is shown that, as long as the number of data points per user is above a threshold, the proposed approach achieves order-optimal mean-squared error (MSE) rates in terms of the sample size. An explicit characterization of the threshold is provided in terms of problem parameters. The trade-offs with respect to selecting various clustering methods (ODCL-CC, ODCL-KM) are discussed and significant improvements over state-of-the-art are demonstrated. Numerical experiments illustrate the findings and corroborate the performance of the proposed methods.
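The one-shot mechanism can be sketched as: each user trains a model on its local data, the server clusters the collected models, and each user receives the mean of its cluster. A toy sketch assuming plain Lloyd's K-means as the admissible clustering algorithm (function names are illustrative; the paper also admits convex clustering):

```python
import numpy as np

def lloyd(models, k, iters=20, seed=0):
    """Plain K-means (Lloyd's algorithm) on the users' local models."""
    rng = np.random.default_rng(seed)
    centers = models[rng.choice(len(models), size=k, replace=False)]
    for _ in range(iters):
        # assign each model to its nearest center, then recompute means
        d = ((models[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        centers = np.array([
            models[labels == j].mean(axis=0) if (labels == j).any() else centers[j]
            for j in range(k)
        ])
    return labels, centers

def odcl_one_shot(local_models, k):
    """One-shot aggregation: the server clusters the locally trained
    models and sends each user the mean of its cluster."""
    labels, centers = lloyd(local_models, k)
    return centers[labels]  # per-user aggregated model
```

The single round of communication (models up, cluster means down) is what makes the scheme communication efficient; the threshold on the per-user sample size from the analysis is what guarantees the clustering step recovers the true grouping.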

Recommended citation: Armacki, A., Bajović, D., Jakovetić, D., & Kar, S. "A One-Shot Framework for Distributed Clustered Learning in Heterogeneous Environments," in IEEE Transactions on Signal Processing, vol. 72, pp. 636-651, 2024, doi: 10.1109/TSP.2023.3343561. https://ieeexplore.ieee.org/abstract/document/10373766

High-probability Convergence Bounds for Nonlinear Stochastic Gradient Descent Under Heavy-tailed Noise

Published in arXiv preprint, 2023

Abstract: We study high-probability convergence guarantees of learning on streaming data in the presence of heavy-tailed noise. In the proposed scenario, the model is updated in an online fashion, as new information is observed, without storing any additional data. To combat the heavy-tailed noise, we consider a general framework of nonlinear stochastic gradient descent (SGD), providing several strong results. First, for non-convex costs and component-wise nonlinearities, we establish a convergence rate arbitrarily close to $\mathcal{O}\left(t^{-1/4}\right)$, whose exponent is independent of noise and problem parameters. Second, for strongly convex costs and component-wise nonlinearities, we establish a rate arbitrarily close to $\mathcal{O}\left(t^{-1/2}\right)$ for the weighted average of iterates, with exponent again independent of noise and problem parameters. Finally, for strongly convex costs and a broader class of nonlinearities, we establish convergence of the last iterate, with a rate $\mathcal{O}\left(t^{-\zeta} \right)$, where $\zeta \in (0,1)$ depends on problem parameters, noise and nonlinearity. As we show analytically and numerically, $\zeta$ can be used to inform the preferred choice of nonlinearity for given problem settings. Compared to state-of-the-art works, which only consider clipping, require bounded noise moments of order $\eta \in (1,2]$, and establish convergence rates whose exponents go to zero as $\eta \rightarrow 1$, we provide high-probability guarantees for a much broader class of nonlinearities and symmetric density noise, with convergence rates whose exponents are bounded away from zero, even when the noise has \emph{finite first moment only}. Moreover, in the case of strongly convex functions, we demonstrate analytically and numerically that clipping is not always the optimal nonlinearity, further underlining the value of our general framework.

Recommended citation: Armacki, A., Sharma, P., Joshi, G., Bajović, D., Jakovetić, D., & Kar, S. (2023). High-probability Convergence Bounds for Nonlinear Stochastic Gradient Descent Under Heavy-tailed Noise. arXiv preprint arXiv:2310.18784. https://arxiv.org/abs/2310.18784

Communication Efficient Model-aware Federated Learning for Visual Crowd Counting and Density Estimation in Smart Cities

Published in EUSIPCO 2023 31st European Signal Processing Conference, 2023

Abstract: Federated learning (FL) is an attractive paradigm where a number of users can improve their local models via sharing trained models or model increments with a central server, while the users’ data is kept private. However, when model sizes are huge, FL incurs a significant communication overhead. In such scenarios, strategies that perform user sampling, so that only informative users communicate their models to the server, are desired. In this paper, we make several contributions on user sampling in FL. On the theoretical side, we consider a general framework that exhibits user heterogeneity across several dimensions: activation probabilities, gradient noise variance, number of updates per epoch, and communication channel quality. In this setting, we derive the convergence rate of the FedAvg method. The rate explicitly characterizes the effects of heterogeneity and enables us to derive optimal user sampling probabilities in an offline setting, when the sampling probabilities are pre-computed. We then show how these derived probabilities naturally connect with existing optimized sampling strategies in an adaptive-online setting. On the practical side, we study visual crowd counting (VCC) as a representative deep learning application with huge-sized models. We provide an implementation of the FL system over real-world data across three pilot sites: Valletta, Trento, and Novi Sad. The evaluation results demonstrate significant model accuracy benefits through employing FL over the multiple cities, and significant communication savings via non-uniform user sampling strategies.
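The role of the sampling probabilities can be illustrated with a toy FedAvg-style round in which user $i$ participates with probability $p_i$ and its contribution is reweighted by $1/p_i$ so the aggregate stays unbiased. This is a sketch under simplifying assumptions (one full-batch gradient step per round, no channel effects); all names and parameters are illustrative, not the paper's exact method:

```python
import numpy as np

def fedavg_sampled(x0, user_grads, probs, lr=0.1, rounds=50, seed=0):
    """FedAvg-style rounds with non-uniform user sampling: each round,
    user i is active with probability probs[i]; active users' gradients
    are reweighted by 1/probs[i] to keep the aggregate unbiased."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    n = len(user_grads)
    for _ in range(rounds):
        active = rng.random(n) < probs  # Bernoulli participation
        upd = np.zeros_like(x)
        for i in np.flatnonzero(active):
            upd += user_grads[i](x) / probs[i]
        x -= lr * upd / n
    return x
```

Skewing `probs` toward users with noisier or larger updates is the kind of non-uniform sampling that yields the communication savings reported in the paper.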

Recommended citation: Armacki, A., Milošević, N., Bajović, D., Kar, S., Jakovetić, D., Bakhtiarnia, A., Esterle, L., Muscat, A., & Festi, T. (2023). Communication Efficient Model-aware Federated Learning for Visual Crowd Counting and Density Estimation in Smart Cities. EUSIPCO 2023 31st European Signal Processing Conference, Helsinki, Finland. https://eurasip.org/Proceedings/Eusipco/Eusipco2023/pdfs/0000875.pdf

Personalized Federated Learning via Convex Clustering

Published in 2022 IEEE International Smart Cities Conference (ISC2), 2022

Abstract: We propose a parametric family of algorithms for personalized federated learning with locally convex user costs. The proposed framework is based on a generalization of convex clustering in which the differences between different users’ models are penalized via a sum-of-norms penalty, weighted by a penalty parameter $\lambda$. The proposed approach enables “automatic” model clustering, without prior knowledge of the hidden cluster structure, nor the number of clusters. Analytical bounds on the penalty parameter that lead to simultaneous personalization, generalization and automatic model clustering are provided. The solution to the formulated problem enables personalization, by providing different models across different clusters, and generalization, by providing models different from the per-user models computed in isolation. We then provide an efficient algorithm based on the Parallel Direction Method of Multipliers (PDMM) to solve the proposed formulation in a federated server-users setting. Numerical experiments corroborate our findings. As an interesting byproduct, our results provide several generalizations to convex clustering.
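In its generic sum-of-norms form (the paper's exact pairwise weighting may differ), the formulation described above reads:

```latex
\min_{x_1,\dots,x_n} \; \sum_{i=1}^{n} f_i(x_i) \;+\; \lambda \sum_{i < j} \left\| x_i - x_j \right\|,
```

where $f_i$ is user $i$'s local convex cost and $x_i$ its model. Small $\lambda$ yields fully personalized models, large $\lambda$ forces a single global model, and intermediate values fuse subsets of the $x_i$ into equal groups, which is the "automatic" model clustering.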

Recommended citation: A. Armacki, D. Bajović, D. Jakovetić and S. Kar, "Personalized Federated Learning via Convex Clustering," 2022 IEEE International Smart Cities Conference (ISC2), Pafos, Cyprus, 2022, pp. 1-7, doi: 10.1109/ISC255366.2022.9921863. https://ieeexplore.ieee.org/abstract/document/9921863

Gradient Based Clustering

Published in Proceedings of the 39th International Conference on Machine Learning, PMLR 162:929-947, 2022

Abstract: We propose a general approach for distance-based clustering, using the gradient of the cost function that measures clustering quality with respect to cluster assignments and cluster center positions. The approach is an iterative two-step procedure (alternating between cluster assignment and cluster center updates) and is applicable to a wide range of functions, satisfying some mild assumptions. The main advantage of the proposed approach is a simple and computationally cheap update rule. Unlike previous methods that specialize to a specific formulation of the clustering problem, our approach is applicable to a wide range of costs, including non-Bregman clustering methods based on the Huber loss. We analyze the convergence of the proposed algorithm, and show that it converges to the set of appropriately defined fixed points, under arbitrary center initialization. In the special case of Bregman cost functions, the algorithm converges to the set of centroidal Voronoi partitions, which is consistent with prior works. Numerical experiments on real data demonstrate the effectiveness of the proposed method.
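The two-step iteration can be sketched as follows, with the clustering loss supplied only through its gradient on residuals (a minimal sketch with illustrative names and step size; the identity gradient recovers a gradient form of $K$-means, while a clipped residual gives a Huber-type loss):

```python
import numpy as np

def gradient_clustering(data, centers, loss_grad, step=0.1, iters=100):
    """Two-step iteration: (1) assign each point to its nearest center;
    (2) move each center by a gradient step on its per-cluster cost,
    where loss_grad acts on the residuals r = center - point."""
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        # step 1: cluster assignment by nearest center
        d = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # step 2: gradient update of each center on its own cluster
        for j in range(len(centers)):
            pts = data[labels == j]
            if len(pts):
                centers[j] -= step * loss_grad(centers[j] - pts).mean(axis=0)
    return labels, centers
```

For example, `loss_grad=lambda r: r` corresponds to the squared loss, while `loss_grad=lambda r: np.clip(r, -1.0, 1.0)` gives a component-wise Huber-type gradient that is robust to outlying points.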

Recommended citation: Armacki, A., Bajović, D., Jakovetić, D. & Kar, S. (2022). Gradient Based Clustering. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:929-947. https://proceedings.mlr.press/v162/armacki22a.html

Distributed Trust-region Method with First Order Models

Published in IEEE EUROCON 2019-18th International Conference on Smart Technologies, 2019

Abstract: In this paper, we introduce the trust region concept for distributed optimization. A large class of globally convergent methods of this type is used efficiently in centralized optimization, both constrained and unconstrained. The methods of this class are built on the idea of modeling the objective function at each iteration and taking the new iterate as the minimizer of the model in a certain area, called the trust region. The trust region size, the minimization method and the model function depend on the properties of the objective function. In this paper we propose a general framework and concentrate on first-order methods, i.e., gradient methods. Using the trust-region mechanism to generate the step size, we obtain a fully distributed method with node-varying step sizes. Numerical results presented in the paper demonstrate the efficiency of the proposed approach.
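With a first-order (linear) model $m(s) = f(x) + g^\top s$, the trust-region subproblem over $\|s\| \leq \Delta$ has the closed-form minimizer $s = -\Delta\, g/\|g\|$, and the usual accept/shrink logic turns the radius $\Delta$ into a step size. A centralized single-node sketch with illustrative constants (the paper's distributed, node-varying version is not reproduced here):

```python
import numpy as np

def trust_region_gradient(f, grad, x0, delta=1.0, iters=100,
                          eta=0.1, shrink=0.5, grow=2.0):
    """First-order trust-region iteration: minimize the linear model
    m(s) = f(x) + g^T s over ||s|| <= delta, giving s = -delta * g/||g||,
    then grow or shrink the radius based on actual vs predicted decrease."""
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        g = grad(x)
        gn = np.linalg.norm(g)
        if gn < 1e-10:
            break
        s = -delta * g / gn
        pred = delta * gn                 # predicted decrease m(0) - m(s)
        rho = (f(x) - f(x + s)) / pred    # actual vs predicted decrease
        if rho > eta:                     # good agreement: accept, grow radius
            x = x + s
            delta *= grow
        else:                             # poor agreement: reject, shrink radius
            delta *= shrink
    return x
```

Since the radius update needs only local quantities, the same mechanism can generate a separate step size at every node, which is the idea behind the node-varying step sizes above.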

Recommended citation: A. Armacki, D. Jakovetić, N. Krejić and N. K. Jerinkić, "Distributed Trust-Region Method With First Order Models," IEEE EUROCON 2019 - 18th International Conference on Smart Technologies, Novi Sad, Serbia, 2019, pp. 1-6, doi: 10.1109/EUROCON.2019.8861739. https://ieeexplore.ieee.org/abstract/document/8861739