FedVPA-GP: Federated Variational Preference Alignment with Gumbel-Softmax Prior

Federated Variational Preference Alignment with
Gumbel-Softmax Prior for Personalized User Preferences

¹Department of CSE, POSTECH ²National AI Research Lab ³Graduate School of AI, POSTECH

†Corresponding author

Abstract

Federated Learning (FL) offers a privacy-preserving pathway for aligning Large Language Models (LLMs); however, existing frameworks typically enforce a monolithic reward model, inevitably averaging out inherently conflicting user preferences (e.g., helpfulness vs. harmlessness). While Variational Preference Learning (VPL) enables personalization, adapting it to decentralized settings presents a fundamental challenge: posterior collapse driven by severe local data scarcity and heterogeneity. We propose Federated Variational Preference Alignment with Gumbel-Softmax Prior (FedVPA-GP), a framework that disentangles diverse preferences without compromising privacy. To stabilize variational inference, we introduce a Federated Mixture Prior that lets clients leverage the aggregate population distribution as a dynamic prior. Furthermore, we incorporate an Orthogonal Loss that explicitly enforces the separation of preference prototypes in the latent space. Experiments on HH-RLHF demonstrate that FedVPA-GP significantly outperforms monolithic baselines, successfully disentangling conflicting user intents and enabling dynamic preference switching.

Keywords: Federated Learning, Preference Alignment, RLHF, Variational Inference, LLMs

Method

FedVPA-GP treats personalization as distributed latent-variable inference. Each client encodes a difference embedding $\Delta h = h_{\text{chosen}} - h_{\text{rejected}}$ into a local posterior $q_\phi(z \mid \mathcal{D}_i) = \mathcal{N}(z;\mu_i,\sigma_i^2 I)$, and conditions a frozen base LLM's choice logits on the sampled latent $z$:

$$\text{logits}(s_A, s_B \mid z_i) = \text{logits}_{\text{base}}(s_A, s_B) + f_\theta(z_i)$$
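The conditioning step above can be illustrated with a minimal sketch. It assumes a reparameterized draw from the diagonal Gaussian posterior and uses a single linear layer as a stand-in for $f_\theta$; the text does not specify the form of $f_\theta$, and the function names, dimensions, and values below are hypothetical.

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Reparameterized draw z = mu + sigma * eps from N(mu, diag(sigma^2))."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def conditioned_logits(base_logits, z, weight, bias):
    """Add a linear stand-in for f_theta(z) to the frozen base logits.

    weight has one row per candidate (A, B). A single linear layer is an
    assumption; the source does not describe f_theta's architecture.
    """
    shift = [sum(w * zi for w, zi in zip(row, z)) + b
             for row, b in zip(weight, bias)]
    return [bl + s for bl, s in zip(base_logits, shift)]

def choice_prob_a(logits):
    """Probability of preferring response A under a two-way softmax."""
    return 1.0 / (1.0 + math.exp(logits[1] - logits[0]))

rng = random.Random(0)
mu, log_var = [0.5, -0.2], [-2.0, -2.0]            # hypothetical 2-d posterior
z = sample_latent(mu, log_var, rng)
weight, bias = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]  # hypothetical f_theta
logits = conditioned_logits([0.3, 0.1], z, weight, bias)
p_a = choice_prob_a(logits)
```

With $z = 0$ and zero bias the shift vanishes and the selector reduces to the base model's logits, which is the sense in which the latent only modulates a frozen backbone.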

Federated Mixture Prior. Instead of a fixed $\mathcal{N}(0,I)$ prior, each client uses a weighted mixture of peer posteriors as a dynamic prior, $p^{(i)}_{\text{mixture}}(z) = \sum_{j \in \mathcal{S}} w_j\,\mathcal{N}_j(z)$, where $\mathcal{N}_j(z) = \mathcal{N}(z;\mu_j,\sigma_j^2 I)$ is peer $j$'s posterior. This transfers population-level knowledge without sharing raw data. The weights $w_j$ are learned per client via Gumbel-Softmax relaxation, so each client up-weights compatible peers and filters conflicting ones.
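As a rough illustration of the mixture prior, the sketch below draws relaxed mixture weights with Gumbel-Softmax and evaluates $\log \sum_j w_j\,\mathcal{N}_j(z)$ with a log-sum-exp. Parameterizing the weights by per-client logits, the temperature `tau`, and the two peer posteriors are all assumptions made for the example, not details taken from the text.

```python
import math
import random

def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
        noisy.append((logit - math.log(-math.log(u))) / tau)
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def log_normal_diag(z, mu, var):
    """Log-density of N(z; mu, diag(var))."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (zi - m) ** 2 / v)
               for zi, m, v in zip(z, mu, var))

def log_mixture_prior(z, weights, peers):
    """log sum_j w_j N_j(z), computed with log-sum-exp for stability."""
    terms = [math.log(w) + log_normal_diag(z, mu, var)
             for w, (mu, var) in zip(weights, peers) if w > 0.0]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

rng = random.Random(0)
# Hypothetical peer posteriors as (mean, variance) pairs.
peers = [([1.0, 0.0], [0.1, 0.1]), ([-1.0, 0.0], [0.1, 0.1])]
mix_logits = [2.0, -2.0]  # assumed learnable per-client logits
w = gumbel_softmax(mix_logits, tau=0.5, rng=rng)
lp_near = log_mixture_prior([1.0, 0.0], [0.9, 0.1], peers)
lp_far = log_mixture_prior([5.0, 5.0], [0.9, 0.1], peers)
```

A latent near an up-weighted peer receives a much higher prior log-density than one far from every peer, which is how the prior can pull a data-poor client toward compatible neighbors.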

Orthogonal Loss. $M$ QR-initialized orthonormal prototypes structure the latent space. The server assigns each client a prototype via balanced $k$-means, and clients pull their $z$ toward the assigned prototype while enforcing orthonormality:

$$\mathcal{L}_{\text{orthogonal}}(z) = \lVert z - \mathbf{p}_{y_i^*}\rVert_2^2 + \gamma\,\lVert \mathbf{P}\mathbf{P}^\top - \mathbf{I}_M\rVert_F^2$$
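The loss above can be evaluated directly, as in the following sketch. To stay within the standard library it builds the orthonormal prototypes with Gram-Schmidt rather than a QR routine (both yield an orthonormal basis of the same random span), and it takes the assigned prototype index as given instead of implementing the server-side balanced $k$-means. The dimensions and $\gamma$ value are hypothetical.

```python
import math
import random

def orthonormal_prototypes(m, d, rng):
    """Return m orthonormal rows in R^d via Gram-Schmidt (requires m <= d)."""
    if m > d:
        raise ValueError("cannot fit more than d orthonormal vectors in R^d")
    rows = []
    while len(rows) < m:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for r in rows:  # remove components along earlier prototypes
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:
            rows.append([a / norm for a in v])
    return rows

def orthogonal_loss(z, prototypes, assigned, gamma):
    """||z - p_assigned||_2^2 + gamma * ||P P^T - I_M||_F^2."""
    pull = sum((zi - pi) ** 2 for zi, pi in zip(z, prototypes[assigned]))
    penalty = 0.0
    for a, pa in enumerate(prototypes):
        for b, pb in enumerate(prototypes):
            gram = sum(x * y for x, y in zip(pa, pb))
            penalty += (gram - (1.0 if a == b else 0.0)) ** 2
    return pull + gamma * penalty

rng = random.Random(0)
P = orthonormal_prototypes(2, 4, rng)   # M = 2 prototypes in a 4-d latent
loss_at_proto = orthogonal_loss(P[0], P, assigned=0, gamma=1.0)
loss_at_origin = orthogonal_loss([0.0] * 4, P, assigned=0, gamma=1.0)
```

At initialization the Frobenius penalty is zero because the prototypes are already orthonormal; the term only becomes active if the prototypes drift during training.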

Two-stage training. Stage 1 trains the $z$-conditioned binary selector with federated learning (base LLM frozen, only ~0.9M extra parameters). Stage 2 runs centralized DPO on the server, using the frozen selector as a $z$-conditioned reward model, which avoids the communication cost of federated generation.
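One way to picture Stage 2 is sketched below, under the assumption that the frozen selector labels which of two candidate responses wins for a given $z$ and that the resulting pair feeds the standard DPO objective. How the selector's output is actually turned into DPO training pairs is not described in the text, so `label_pair`, the log-probabilities, and $\beta$ are hypothetical.

```python
import math

def label_pair(selector_logits):
    """Return (winner, loser) indices from the selector's two logits."""
    return (0, 1) if selector_logits[0] >= selector_logits[1] else (1, 0)

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical sequence log-probabilities for two candidate responses.
policy_logp = [-12.0, -15.0]
ref_logp = [-13.0, -14.0]
selector_logits = [0.2, 1.1]          # z-conditioned selector prefers candidate 1

w_idx, l_idx = label_pair(selector_logits)
loss = dpo_loss(policy_logp[w_idx], policy_logp[l_idx],
                ref_logp[w_idx], ref_logp[l_idx], beta=0.1)
```

In this toy case the policy currently favors the response the selector rejects, so the loss exceeds $\log 2$, the value at zero margin, and a gradient step would shift probability toward the selector's choice.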

Latent-Space Disentanglement

Figure: Evolution of client-specific latent distributions $z$ across rounds. Top panel, FedVPL: posterior collapse, with latent codes clustering indistinguishably. Bottom panel, FedVPA-GP (ours): distinct modes for different client groups. Red = harmlessness, blue = helpfulness; stars = orthogonal prototypes. FedVPA-GP progressively separates the two preference types.

Main Results

GPT-4o win-rate (%) on HH-RLHF across client counts $N \in \{10, 50, 100\}$. FedVPA-GP achieves a Pareto improvement, with higher win-rates on both helpfulness and harmlessness in every setting, whereas monolithic baselines trade one off against the other.

Model	Method	Helpful (10 clients)	Harmless (10 clients)	Helpful (50 clients)	Harmless (50 clients)	Helpful (100 clients)	Harmless (100 clients)
Qwen-2 0.5B	FedDPO	48.12	77.34	43.05	69.22	41.48	67.15
Qwen-2 0.5B	FedBiscuit	48.85	75.12	44.21	71.45	42.33	69.42
Qwen-2 0.5B	FedVPL	62.24	84.56	54.18	78.12	53.05	77.34
Qwen-2 0.5B	FedVPA-GP	66.45	89.21	58.32	84.05	55.18	82.31
Gemma-2B	FedDPO	52.34	83.12	44.15	78.45	41.22	75.33
Gemma-2B	FedBiscuit	51.65	82.45	46.21	78.12	43.44	76.05
Gemma-2B	FedVPL	66.82	89.15	56.41	84.34	53.25	80.42
Gemma-2B	FedVPA-GP	73.21	96.34	64.48	95.12	60.15	92.45
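The Pareto claim can be checked mechanically against the reported numbers. The short sketch below copies the table values and tests whether FedVPA-GP is strictly higher than each baseline in every column; the helper names are illustrative, and strict dominance in all columns is a slightly stronger condition than a Pareto improvement requires.

```python
COLUMNS = ["helpful@10", "harmless@10", "helpful@50",
           "harmless@50", "helpful@100", "harmless@100"]

RESULTS = {
    "Qwen-2 0.5B": {
        "FedDPO":     [48.12, 77.34, 43.05, 69.22, 41.48, 67.15],
        "FedBiscuit": [48.85, 75.12, 44.21, 71.45, 42.33, 69.42],
        "FedVPL":     [62.24, 84.56, 54.18, 78.12, 53.05, 77.34],
        "FedVPA-GP":  [66.45, 89.21, 58.32, 84.05, 55.18, 82.31],
    },
    "Gemma-2B": {
        "FedDPO":     [52.34, 83.12, 44.15, 78.45, 41.22, 75.33],
        "FedBiscuit": [51.65, 82.45, 46.21, 78.12, 43.44, 76.05],
        "FedVPL":     [66.82, 89.15, 56.41, 84.34, 53.25, 80.42],
        "FedVPA-GP":  [73.21, 96.34, 64.48, 95.12, 60.15, 92.45],
    },
}

def dominates(a, b):
    """True if a is strictly higher than b in every column."""
    return all(x > y for x, y in zip(a, b))

def min_margin(a, b):
    """Smallest per-column gain of a over b, in percentage points."""
    return min(x - y for x, y in zip(a, b))

pareto = {
    model: all(dominates(rows["FedVPA-GP"], rows[name])
               for name in rows if name != "FedVPA-GP")
    for model, rows in RESULTS.items()
}
margins = {
    model: round(min_margin(rows["FedVPA-GP"], rows["FedVPL"]), 2)
    for model, rows in RESULTS.items()
}
```

On these numbers FedVPA-GP dominates every baseline for both models. Its smallest gain over FedVPL, the strongest baseline, is 2.13 points for Qwen-2 0.5B (helpfulness, 100 clients) and 6.39 points for Gemma-2B (helpfulness, 10 clients).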

Overhead: ~0.18% extra parameters, 256 bytes/client/round, and ~1.18× the per-round wall-clock of FedDPO.
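The overhead figures can be sanity-checked with simple arithmetic. The sketch assumes the ~0.18% figure refers to the ~0.9M selector parameters relative to a nominal 0.5B-parameter base model, and that the per-round payload is stored as 32-bit floats; neither assumption is stated in the text.

```python
extra_params = 0.9e6      # selector parameters reported for Stage 1
base_params = 0.5e9       # nominal size of Qwen-2 0.5B (assumed reference model)
extra_pct = 100.0 * extra_params / base_params

payload_bytes = 256       # reported per-client, per-round communication
bytes_per_float32 = 4     # assumed precision
floats_per_round = payload_bytes // bytes_per_float32
```

Under these assumptions the parameter overhead works out to 0.18%, and 256 bytes corresponds to 64 float32 values per client per round. How those values are split between posterior means, variances, or other statistics is not specified.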

BibTeX

@inproceedings{koo2026fedvpagp,
  title = {Federated Variational Preference Alignment with Gumbel-Softmax Prior for Personalized User Preferences},
  author = {Koo, Jabin and Kim, Hoyoung and Jang, Minwoo and Ok, Jungseul},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year = {2026},
  eprint = {2605.30873},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG}
}