Appeared at:  European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases 2022Authors: Sambhav Solanki, Sujit Gujar, Sankarshan Damle, Samhita Kanaparthy

In multi-agent online learning settings, federated learning (FL) is a valuable tool. However, the learning agents can be competitive, and privacy concerns can pose a barrier to engagement in FL. Each agent would like to learn from others, but the part of the information it shares for others to learn from could be sensitive; thus, it desires its privacy.

This work investigates a group of agents working concurrently to solve similar combinatorial bandit problems while maintaining quality constraints. We propose that by employing differential privacy, these agents can collectively learn while keeping their sensitive information confidential. However, differential privacy techniques for protecting sensitive information makes the data noisy and may deteriorate rather than help improve regret. Hence, our work notes that deciding when to communicate and what shared data to learn is essential to strike a functional balance between regret and privacy. For such a federated combinatorial MAB setting, our work proposes a Privacy-preserving Federated Combinatorial Bandit algorithm, P-FCB.

Say an agent wants to procure and utilize a component manufactured by several producers but can only learn the component’s quality (dependent on the producer) upon procurement. This learning is similar to the exploration and exploitation problem, popularly known as Multi-armed Bandit (MAB). It needs sequential interactions between sets of producers and the learning agent. We propose that the learning agents communicate their history of procurements with each other to better estimate the qualities helping them maximize their revenues.

P-FCB comprises a novel communication algorithm among agents, while each agent is learning the qualities of the producers to cooperate in the learning process. Crucially in P-FCB, the agent only communicates within a specific time frame. It is not beneficial to communicate in (i) very early rounds (estimates have high error probability) or (ii) very late rounds (value added by communicating is minimal). Additionally, P-FCB minimizes privacy loss by limiting the number of rounds agents communicate. Further, to ensure the privacy of the shared information, the agents add calibrated noise to sanitize the information a priori. P-FCB also uses error bounds generated for UCB exploration (a commonly used learning technique in bandit literature) to determine if shared information is worth learning.

Our work shows the efficacy of P-FCB through simulations. Further, the algorithm provides an improvement in terms of regret while upholding quality threshold and meaningful privacy guarantees.