ICML 2026
Federated Learning (FL) offers a privacy-preserving pathway for aligning Large Language Models (LLMs); however, existing frameworks typically enforce a monolithic reward model, inevitably averaging out inherently conflicting user preferences (e.g., helpfulness vs. harmlessness). While Variational Preference Learning (VPL) offers a pathway to personalization, adapting it to decentralized settings presents a fundamental challenge: posterior collapse driven by severe local data scarcity and heterogeneity. We propose Federated Variational Preference Alignment with Gumbel-Softmax Prior (FedVPA-GP), a framework that disentangles diverse preferences without compromising privacy. To stabilize variational inference, we introduce a Federated Mixture Prior that lets clients leverage the aggregate population distribution as a dynamic prior. Furthermore, we incorporate an Orthogonal Loss that explicitly enforces the separation of preference prototypes in the latent space. Experiments on HH-RLHF demonstrate that FedVPA-GP significantly outperforms monolithic baselines, successfully disentangling conflicting user intents and enabling dynamic preference switching.
Integrates variational inference into FL to capture conflicting user preferences, overcoming the sub-optimality of monolithic reward models while preserving data privacy.
A Federated Mixture Prior (with learnable Gumbel-Softmax weights) plus an Orthogonal Loss combat data sparsity and posterior collapse, separating distinct preference prototypes.
On HH-RLHF, FedVPA-GP beats FedBiscuit and FedDPO across client scales, generalizes to unseen clients, and adds negligible compute/communication overhead.
FedVPA-GP treats personalization as distributed latent-variable inference. Each client encodes a difference embedding $\Delta h = h_{\text{chosen}} - h_{\text{rejected}}$ into a local posterior $q_\phi(z \mid \mathcal{D}_i) = \mathcal{N}(z;\mu_i,\sigma_i^2 I)$, and conditions a frozen base LLM's choice logits on the sampled latent $z$.
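The encoding step can be illustrated with a minimal sketch. It is not the authors' implementation: mean-pooling over a client's pairs, the linear heads (`w_mu`, `w_logvar`), and the scalar log-variance are all assumptions chosen to match the isotropic form $\mathcal{N}(z;\mu_i,\sigma_i^2 I)$, and the toy 2-dimensional embeddings stand in for real LLM hidden states.

```python
import math
import random


def difference_embedding(h_chosen, h_rejected):
    """Delta h = h_chosen - h_rejected, element-wise."""
    return [c - r for c, r in zip(h_chosen, h_rejected)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def encode_posterior(pairs, w_mu, b_mu, w_logvar, b_logvar):
    """Map a client's preference pairs to (mu, log sigma^2).

    Assumption: the difference embeddings are mean-pooled, then passed
    through one linear head for mu and one for a scalar log-variance.
    """
    deltas = [difference_embedding(c, r) for c, r in pairs]
    dim = len(deltas[0])
    pooled = [sum(d[k] for d in deltas) / len(deltas) for k in range(dim)]
    mu = [dot(row, pooled) + b for row, b in zip(w_mu, b_mu)]
    logvar = dot(w_logvar, pooled) + b_logvar
    return mu, logvar


def sample_latent(mu, logvar, rng):
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)."""
    sigma = math.exp(0.5 * logvar)
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mu]


# Toy client with two (chosen, rejected) hidden-state pairs.
pairs = [([1.0, 2.0], [0.0, 1.0]), ([3.0, 0.0], [1.0, 0.0])]
w_mu = [[1.0, 0.0], [0.0, 1.0]]  # identity head, for illustration only
b_mu = [0.0, 0.0]
w_logvar = [0.0, 0.0]            # gives log sigma^2 = 0, i.e. sigma = 1
b_logvar = 0.0

mu, logvar = encode_posterior(pairs, w_mu, b_mu, w_logvar, b_logvar)
z = sample_latent(mu, logvar, random.Random(0))
```

Sampling through `mu + sigma * eps` keeps the draw differentiable with respect to the encoder outputs, which is the usual reason variational models use this form.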
Federated Mixture Prior. Instead of a fixed $\mathcal{N}(0,I)$ prior, each client uses a weighted mixture of peer posteriors as a dynamic prior, $p^{(i)}_{\text{mixture}}(z) = \sum_{j \in \mathcal{S}} w_j\,\mathcal{N}_j(z)$, transferring population-level knowledge without sharing raw data. The weights $w_j$ are learned per client via Gumbel-Softmax relaxation, so each client up-weights compatible peers and filters conflicting ones.
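A minimal sketch of the mixture prior follows, assuming each peer posterior is summarized by a mean and a scalar standard deviation and that the per-client learnable parameters are a vector of logits over peers. The function names, the temperature value, and the toy peer statistics are illustrative assumptions, not details from the paper.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Relaxed categorical sample: softmax((logits + Gumbel noise) / T)."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def log_isotropic_gaussian(z, mu, sigma):
    """log N(z; mu, sigma^2 I) for a scalar sigma."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(z, mu))
    return (-0.5 * len(z) * math.log(2.0 * math.pi * sigma ** 2)
            - sq_dist / (2.0 * sigma ** 2))


def log_mixture_prior(z, weights, peer_mus, peer_sigmas):
    """log sum_j w_j N_j(z), computed with a stable log-sum-exp."""
    terms = [
        math.log(w) + log_isotropic_gaussian(z, mu, s)
        for w, mu, s in zip(weights, peer_mus, peer_sigmas)
        if w > 0.0  # skip weights that underflowed to zero
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


# Hypothetical peer statistics received from the server.
peer_mus = [[0.0, 0.0], [3.0, 3.0], [-3.0, 3.0]]
peer_sigmas = [1.0, 1.0, 1.0]
peer_logits = [2.0, 0.0, -2.0]  # per-client learnable parameters (assumed)

weights = gumbel_softmax(peer_logits, temperature=0.5, rng=random.Random(0))
log_p = log_mixture_prior([0.1, -0.1], weights, peer_mus, peer_sigmas)
single = log_mixture_prior([0.0, 0.0], [1.0], [[0.0, 0.0]], [1.0])
```

Lower temperatures push the sampled weights toward a one-hot choice of peer, while higher temperatures spread mass more evenly; the noise makes the weights stochastic, so repeated calls with different seeds give different mixtures.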
Orthogonal Loss. $M$ QR-initialized orthonormal prototypes structure the latent space. The server assigns each client a prototype via balanced $k$-means, and clients pull their $z$ toward the assigned prototype while enforcing orthonormality.
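The exact loss is not reproduced here, so the sketch below shows only one plausible shape for it: a squared-distance pull toward the assigned prototype plus a Frobenius penalty $\lVert P P^\top - I \rVert_F^2$ on the prototype matrix. Both terms, the `ortho_weight` coefficient, and the use of Gram-Schmidt as a stand-in for QR initialization (it yields the same orthonormal basis up to sign) are assumptions. The balanced $k$-means assignment is omitted.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormal_prototypes(num_prototypes, dim, rng):
    """Orthonormalize random Gaussian vectors with modified Gram-Schmidt."""
    if num_prototypes > dim:
        raise ValueError("cannot fit more orthonormal vectors than dimensions")
    protos = []
    while len(protos) < num_prototypes:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for p in protos:
            coeff = dot(v, p)
            v = [vi - coeff * pi for vi, pi in zip(v, p)]
        norm = math.sqrt(dot(v, v))
        if norm > 1e-8:  # redraw in the unlikely degenerate case
            protos.append([vi / norm for vi in v])
    return protos


def orthogonality_penalty(protos):
    """Squared Frobenius norm of (P P^T - I)."""
    total = 0.0
    for i, p_i in enumerate(protos):
        for j, p_j in enumerate(protos):
            target = 1.0 if i == j else 0.0
            total += (dot(p_i, p_j) - target) ** 2
    return total


def prototype_loss(z, protos, assigned, ortho_weight):
    """Assumed form: pull z to its prototype, keep prototypes orthonormal."""
    pull = sum((a - b) ** 2 for a, b in zip(z, protos[assigned]))
    return pull + ortho_weight * orthogonality_penalty(protos)


protos = orthonormal_prototypes(num_prototypes=3, dim=8, rng=random.Random(0))
penalty_at_init = orthogonality_penalty(protos)
loss_on_prototype = prototype_loss(protos[1], protos, assigned=1, ortho_weight=1.0)
loss_at_origin = prototype_loss([0.0] * 8, protos, assigned=0, ortho_weight=0.0)
```

At initialization the penalty is zero up to rounding, so it only becomes active if the prototypes drift during training.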
Two-stage training. Stage 1 trains the $z$-conditioned binary selector with federated learning (base LLM frozen, only ~0.9M extra parameters). Stage 2 runs centralized DPO on the server, using the frozen selector as a $z$-conditioned reward model, which avoids the communication cost of federated generation.
FedVPL — posterior collapse: latent codes cluster indistinguishably.
FedVPA-GP (ours) — distinct modes for different client groups.
GPT-4o win-rate (%) on HH-RLHF across client counts $N \in \{10, 50, 100\}$. FedVPA-GP achieves a Pareto improvement — higher win-rates on both helpfulness and harmlessness — where monolithic baselines trade one off against the other.
| Model | Method | 10 Clients: Helpful | 10 Clients: Harmless | 50 Clients: Helpful | 50 Clients: Harmless | 100 Clients: Helpful | 100 Clients: Harmless |
|---|---|---|---|---|---|---|---|
| Qwen-2 0.5B | FedDPO | 48.12 | 77.34 | 43.05 | 69.22 | 41.48 | 67.15 |
| Qwen-2 0.5B | FedBiscuit | 48.85 | 75.12 | 44.21 | 71.45 | 42.33 | 69.42 |
| Qwen-2 0.5B | FedVPL | 62.24 | 84.56 | 54.18 | 78.12 | 53.05 | 77.34 |
| Qwen-2 0.5B | FedVPA-GP | 66.45 | 89.21 | 58.32 | 84.05 | 55.18 | 82.31 |
| Gemma-2B | FedDPO | 52.34 | 83.12 | 44.15 | 78.45 | 41.22 | 75.33 |
| Gemma-2B | FedBiscuit | 51.65 | 82.45 | 46.21 | 78.12 | 43.44 | 76.05 |
| Gemma-2B | FedVPL | 66.82 | 89.15 | 56.41 | 84.34 | 53.25 | 80.42 |
| Gemma-2B | FedVPA-GP | 73.21 | 96.34 | 64.48 | 95.12 | 60.15 | 92.45 |

| Method | Seen: Helpful | Seen: Harmless | Unseen: Helpful | Unseen: Harmless |
|---|---|---|---|---|
| FedDPO | 46.35 | 78.62 | 47.27 | 79.15 |
| FedBiscuit | 47.32 | 79.25 | 47.62 | 78.42 |
| FedVPL | 56.23 | 83.82 | 49.25 | 75.21 |
| FedVPA-GP | 65.28 | 94.25 | 63.16 | 91.23 |

[Figure panels: (a) Qwen-2 0.5B, (b) Gemma-2B]
Both the Orthogonal Loss and the Federated Mixture Prior contribute; the full FedVPA-GP gives the best helpfulness/harmlessness trade-off.
Overhead: ~0.18% extra parameters, 256 bytes/client/round, and ~1.18× the per-round wall-clock of FedDPO.
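The parameter figure can be cross-checked with simple arithmetic. The snippet assumes the ~0.18% is measured against the nominal 0.5B parameter count of Qwen-2 0.5B and that the 256 bytes per client per round scale linearly with the number of participating clients; both are assumptions, not statements from the paper.

```python
EXTRA_PARAMS = 0.9e6           # ~0.9M extra parameters for the selector
BASE_PARAMS = 0.5e9            # nominal Qwen-2 0.5B size (assumed baseline)
BYTES_PER_CLIENT_ROUND = 256   # reported extra communication


def param_overhead_percent(extra, base):
    """Extra parameters as a percentage of the base model size."""
    return 100.0 * extra / base


def round_traffic_bytes(num_clients):
    """Total extra bytes per round if every client participates (assumed)."""
    return BYTES_PER_CLIENT_ROUND * num_clients


overhead_pct = param_overhead_percent(EXTRA_PARAMS, BASE_PARAMS)
traffic = {n: round_traffic_bytes(n) for n in (10, 50, 100)}
```

Under these assumptions the two reported numbers agree (0.9M of 0.5B is 0.18%), and even at 100 clients the extra traffic stays in the tens of kilobytes per round.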
@inproceedings{koo2026fedvpagp,
title = {Federated Variational Preference Alignment with Gumbel-Softmax
Prior for Personalized User Preferences},
author = {Koo, Jabin and Kim, Hoyoung and Jang, Minwoo and Ok, Jungseul},
booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
year = {2026},
eprint = {2605.30873},
archivePrefix = {arXiv},
primaryClass = {cs.LG}
}