ICML 2026

Federated Variational Preference Alignment with
Gumbel-Softmax Prior for Personalized User Preferences

¹Department of CSE, POSTECH   ²National AI Research Lab   ³Graduate School of AI, POSTECH
[Figure: Overview of FedVPA-GP. (a) Federated training of the variational binary selector; (b) the local variational objective; (c) the preference-alignment stage using the trained selector.]

Abstract

Federated Learning (FL) offers a privacy-preserving pathway for aligning Large Language Models (LLMs); however, existing frameworks typically enforce a monolithic reward model, inevitably averaging out inherently conflicting user preferences (e.g., helpfulness vs. harmlessness). While Variational Preference Learning (VPL) offers a pathway to personalization, adapting it to decentralized settings presents a fundamental challenge: posterior collapse driven by severe local data scarcity and heterogeneity. We propose Federated Variational Preference Alignment with Gumbel-Softmax Prior (FedVPA-GP), a framework that disentangles diverse preferences without compromising privacy. To stabilize variational inference, we introduce a Federated Mixture Prior that lets clients leverage the aggregate population distribution as a dynamic prior. Furthermore, we incorporate an Orthogonal Loss that explicitly enforces the separation of preference prototypes in the latent space. Experiments on HH-RLHF demonstrate that FedVPA-GP significantly outperforms monolithic baselines, successfully disentangling conflicting user intents and enabling dynamic preference switching.

Keywords: Federated Learning, Preference Alignment, RLHF, Variational Inference, LLMs

Key Contributions

Federated Variational Alignment

Integrates variational inference into FL to capture conflicting user preferences, overcoming the sub-optimality of monolithic reward models while preserving data privacy.

Stable Inference & Disentanglement

A Federated Mixture Prior (with learnable Gumbel-Softmax weights) plus an Orthogonal Loss combat data sparsity and posterior collapse, separating distinct preference prototypes.

Empirical Validation

On HH-RLHF, FedVPA-GP beats FedBiscuit and FedDPO across client scales, generalizes to unseen clients, and adds negligible compute/communication overhead.

Method

FedVPA-GP treats personalization as distributed latent-variable inference. Each client encodes a difference embedding $\Delta h = h_{\text{chosen}} - h_{\text{rejected}}$ into a local posterior $q_\phi(z \mid \mathcal{D}_i) = \mathcal{N}(z;\mu_i,\sigma_i^2 I)$, and conditions a frozen base LLM's choice logits on the sampled latent $z$:

$$\text{logits}(s_A, s_B \mid z_i) = \text{logits}_{\text{base}}(s_A, s_B) + f_\theta(z_i)$$
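The conditioning step above can be illustrated with a minimal, standard-library-only sketch. It assumes a diagonal-Gaussian posterior parameterized by a mean and log-variance, and it uses a single linear layer as a stand-in for $f_\theta$; the function names (`sample_latent`, `linear_head`, `conditioned_logits`) and the linear form of $f_\theta$ are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def difference_embedding(h_chosen, h_rejected):
    """Delta h = h_chosen - h_rejected, element-wise."""
    return [c - r for c, r in zip(h_chosen, h_rejected)]


def sample_latent(mu, log_var, rng):
    """Reparameterized draw z = mu + sigma * eps, with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def linear_head(z, weight, bias):
    """Hypothetical f_theta: one linear layer mapping z to two logit offsets."""
    return [sum(w * zi for w, zi in zip(row, z)) + b
            for row, b in zip(weight, bias)]


def conditioned_logits(base_logits, z, weight, bias):
    """logits(s_A, s_B | z) = logits_base(s_A, s_B) + f_theta(z)."""
    offsets = linear_head(z, weight, bias)
    return [b + o for b, o in zip(base_logits, offsets)]


def softmax(logits):
    """Numerically stable softmax over the two choice logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the base logits are left untouched, only the small head (and the encoder producing `mu` and `log_var`) would need training under this sketch, which matches the frozen-base-LLM setup described above.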

Federated Mixture Prior. Instead of a fixed $\mathcal{N}(0,I)$ prior, each client uses a weighted mixture of peer posteriors as a dynamic prior, $p^{(i)}_{\text{mixture}}(z) = \sum_{j \in \mathcal{S}} w_j\,\mathcal{N}_j(z)$, transferring population-level knowledge without sharing raw data. The weights $w_j$ are learned per client via Gumbel-Softmax relaxation, so each client up-weights compatible peers and filters conflicting ones.
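As a rough illustration of the mixture prior, the sketch below draws relaxed mixture weights with the Gumbel-Softmax trick and evaluates the log-density of the resulting Gaussian mixture. It assumes diagonal-Gaussian peer posteriors and per-client weight logits; the temperature `tau`, the function names, and the `(mu, sigma)` representation of peers are assumptions made for the example, not the authors' implementation.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Relaxed categorical sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def diag_gaussian_logpdf(z, mu, sigma):
    """log N(z; mu, diag(sigma^2)) for a diagonal Gaussian."""
    return sum(-0.5 * math.log(2.0 * math.pi) - math.log(s)
               - 0.5 * ((x - m) / s) ** 2
               for x, m, s in zip(z, mu, sigma))


def mixture_logpdf(z, weights, peers):
    """log sum_j w_j N_j(z), computed with log-sum-exp for stability.

    peers is a list of (mu_j, sigma_j) pairs; zero-weight peers are skipped.
    """
    terms = [math.log(w) + diag_gaussian_logpdf(z, mu, sigma)
             for w, (mu, sigma) in zip(weights, peers) if w > 0.0]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))
```

A lower `tau` pushes the sampled weights toward a one-hot vector, so a client would concentrate its prior on a single compatible peer; a higher `tau` spreads the weight more evenly across peers.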

Orthogonal Loss. $M$ QR-initialized orthonormal prototypes structure the latent space. The server assigns each client a prototype via balanced $k$-means, and clients pull their $z$ toward the assigned prototype while enforcing orthonormality:

$$\mathcal{L}_{\text{orthogonal}}(z) = \lVert z - \mathbf{p}_{y_i^*}\rVert_2^2 + \gamma\,\lVert \mathbf{P}\mathbf{P}^\top - \mathbf{I}_M\rVert_F^2$$
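The loss above can be evaluated directly, as in the following sketch. It assumes $\mathbf{P}$ stacks the $M$ prototypes as rows, so that $\mathbf{P}\mathbf{P}^\top$ is the $M \times M$ Gram matrix, and that `assigned` is the prototype index $y_i^*$ the server gave this client; the function name and argument layout are hypothetical.

```python
def orthogonal_loss(z, prototypes, assigned, gamma):
    """||z - p_assigned||_2^2 + gamma * ||P P^T - I_M||_F^2.

    prototypes is a list of M vectors (the rows of P).
    """
    target = prototypes[assigned]
    pull = sum((zi - pi) ** 2 for zi, pi in zip(z, target))

    penalty = 0.0
    num_prototypes = len(prototypes)
    for a in range(num_prototypes):
        for b in range(num_prototypes):
            # Entry (a, b) of the Gram matrix P P^T.
            gram = sum(x * y for x, y in zip(prototypes[a], prototypes[b]))
            identity = 1.0 if a == b else 0.0
            penalty += (gram - identity) ** 2
    return pull + gamma * penalty
```

With exactly orthonormal prototypes the second term vanishes and the loss reduces to the squared distance between $z$ and its assigned prototype; the penalty only becomes active if the prototypes drift away from orthonormality during training.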

Two-stage training. Stage 1 trains the $z$-conditioned binary selector in a federated manner (base LLM frozen, only ~0.9M extra parameters). Stage 2 runs centralized DPO on the server, using the frozen selector as a $z$-conditioned reward model, which avoids the communication cost of federated generation.
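One plausible reading of Stage 2 is sketched below: the frozen selector's two choice logits decide which of two candidate responses counts as chosen, and the resulting pair feeds the standard DPO loss. How the selector is actually queried on the server (including which $z$ is used) is not specified here, so `select_pair`, the default `beta`, and the pairing scheme are assumptions; only the DPO loss formula itself is the standard one.

```python
import math


def select_pair(logit_a, logit_b):
    """Return (chosen_index, rejected_index) from the selector's two logits.

    Assumes the logits were already computed for a given latent z.
    """
    return (0, 1) if logit_a >= logit_b else (1, 0)


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (chosen ratio - rejected ratio)).

    Each ratio is a log-probability difference between policy and reference.
    """
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log sigmoid(m) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

Under this sketch, switching the latent $z$ changes the selector's logits and therefore which response is labeled as chosen, which is one way a single policy could be steered toward different preference types.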

Latent-Space Disentanglement

[Figure: FedVPL latent space. Posterior collapse: latent codes cluster indistinguishably.]

[Figure: FedVPA-GP (ours) latent space. Distinct modes for different client groups.]

[Figure: t-SNE evolution of latent $z$. Evolution of client-specific latent distributions $z$ across rounds for FedVPL (top) and FedVPA-GP (bottom). Red = harmlessness, blue = helpfulness; stars = orthogonal prototypes. FedVPA-GP progressively separates the two preference types.]

Main Results

GPT-4o win-rate (%) on HH-RLHF across client counts $N \in \{10, 50, 100\}$. FedVPA-GP achieves a Pareto improvement — higher win-rates on both helpfulness and harmlessness — where monolithic baselines trade one off against the other.

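| Model | Method | Helpful (N=10) | Harmless (N=10) | Helpful (N=50) | Harmless (N=50) | Helpful (N=100) | Harmless (N=100) |
|---|---|---|---|---|---|---|---|
| Qwen-2 0.5B | FedDPO | 48.12 | 77.34 | 43.05 | 69.22 | 41.48 | 67.15 |
| Qwen-2 0.5B | FedBiscuit | 48.85 | 75.12 | 44.21 | 71.45 | 42.33 | 69.42 |
| Qwen-2 0.5B | FedVPL | 62.24 | 84.56 | 54.18 | 78.12 | 53.05 | 77.34 |
| Qwen-2 0.5B | FedVPA-GP | 66.45 | 89.21 | 58.32 | 84.05 | 55.18 | 82.31 |
| Gemma-2B | FedDPO | 52.34 | 83.12 | 44.15 | 78.45 | 41.22 | 75.33 |
| Gemma-2B | FedBiscuit | 51.65 | 82.45 | 46.21 | 78.12 | 43.44 | 76.05 |
| Gemma-2B | FedVPL | 66.82 | 89.15 | 56.41 | 84.34 | 53.25 | 80.42 |
| Gemma-2B | FedVPA-GP | 73.21 | 96.34 | 64.48 | 95.12 | 60.15 | 92.45 |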

Generalization to Unseen Clients

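| Method | Seen: Helpful | Seen: Harmless | Unseen: Helpful | Unseen: Harmless |
|---|---|---|---|---|
| FedDPO | 46.35 | 78.62 | 47.27 | 79.15 |
| FedBiscuit | 47.32 | 79.25 | 47.62 | 78.42 |
| FedVPL | 56.23 | 83.82 | 49.25 | 75.21 |
| FedVPA-GP | 65.28 | 94.25 | 63.16 | 91.23 |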

Ablation Study

[Figure: Ablation results. (a) Qwen-2 0.5B; (b) Gemma-2B.]

Both the Orthogonal Loss and the Federated Mixture Prior contribute; the full FedVPA-GP gives the best helpfulness/harmlessness trade-off.

Overhead: ~0.18% extra parameters, 256 bytes/client/round, and ~1.18× the per-round wall-clock of FedDPO.

BibTeX

@inproceedings{koo2026fedvpagp,
  title     = {Federated Variational Preference Alignment with Gumbel-Softmax
               Prior for Personalized User Preferences},
  author    = {Koo, Jabin and Kim, Hoyoung and Jang, Minwoo and Ok, Jungseul},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year      = {2026},
  eprint    = {2605.30873},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG}
}