AI · Bullish · arXiv – CS AI · 7h ago · 6/10
T-POP: Test-Time Personalization with Online Preference Feedback
Researchers introduce T-POP, a new algorithm that personalizes large language models in real time by learning from a user's preference feedback during text generation, without parameter updates or extensive pre-existing user data. The method combines test-time alignment with dueling bandits to balance exploration and exploitation efficiently, addressing the cold-start problem in LLM personalization.
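To make the dueling-bandit idea concrete, here is a minimal, hypothetical sketch of a generic linear dueling bandit. It is not T-POP's algorithm. Each candidate continuation is assumed to be summarized by a small feature vector, and a per-user weight vector is learned from pairwise "A is better than B" feedback under a Bradley-Terry (logistic) model. One candidate is chosen to exploit the current estimate, and the other is chosen optimistically with an uncertainty bonus, in a loosely UCB-style way. The names `LinearDuelingBandit`, `simulate`, the features, and the simulated user are illustrative assumptions. The sketch uses no LLM and no token-level guidance.

```python
import math
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def sub(a: Sequence[float], b: Sequence[float]) -> List[float]:
    return [x - y for x, y in zip(a, b)]


class LinearDuelingBandit:
    """Hypothetical sketch (not T-POP): learns per-user weights w from
    pairwise preferences under a Bradley-Terry (logistic) model."""

    def __init__(self, dim: int, lr: float = 0.5, beta: float = 1.0) -> None:
        self.w = [0.0] * dim      # estimated user preference weights
        self.diag = [1.0] * dim   # diagonal of (I + sum of diff diff^T)
        self.lr = lr              # SGD step size
        self.beta = beta          # exploration strength

    def bonus(self, x: Sequence[float]) -> float:
        # Uncertainty bonus: larger along features with little evidence so far.
        return self.beta * math.sqrt(sum(xi * xi / d for xi, d in zip(x, self.diag)))

    def select_pair(self, candidates: Sequence[Sequence[float]]) -> Tuple[int, int]:
        # Requires at least two candidates.
        # Exploit: the first arm has the highest estimated reward.
        first = max(range(len(candidates)), key=lambda i: dot(self.w, candidates[i]))

        # Explore: the second arm is the most optimistic challenger to `first`.
        def optimism(i: int) -> float:
            diff = sub(candidates[i], candidates[first])
            return dot(self.w, diff) + self.bonus(diff)

        rest = [i for i in range(len(candidates)) if i != first]
        return first, max(rest, key=optimism)

    def update(self, winner: Sequence[float], loser: Sequence[float]) -> None:
        # Model: P(winner beats loser) = sigmoid(w . (x_winner - x_loser)).
        diff = sub(winner, loser)
        p = 1.0 / (1.0 + math.exp(-dot(self.w, diff)))
        # Gradient ascent on the log-likelihood, whose gradient is (1 - p) * diff.
        for j, d in enumerate(diff):
            self.w[j] += self.lr * (1.0 - p) * d
            self.diag[j] += d * d


def simulate(rounds: int = 20) -> Tuple[LinearDuelingBandit, int]:
    # Hypothetical features of four candidate continuations, plus a simulated
    # user whose hidden weights favor feature 0 and penalize feature 1.
    candidates = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
    true_w = [1.0, -1.0]
    bandit = LinearDuelingBandit(dim=2)
    for _ in range(rounds):
        i, j = bandit.select_pair(candidates)
        # The simulated user prefers the candidate with the higher true reward.
        if dot(true_w, candidates[i]) >= dot(true_w, candidates[j]):
            bandit.update(candidates[i], candidates[j])
        else:
            bandit.update(candidates[j], candidates[i])
    best, _ = bandit.select_pair(candidates)
    return bandit, best


if __name__ == "__main__":
    bandit, best = simulate()
    print("learned weights:", bandit.w, "exploit choice:", best)
```

In this toy run, the exploit choice settles on candidate 0, the one the simulated user actually prefers. The exploration bonus shrinks along features that have already been compared often. Pairing an exploiting arm with an optimistic challenger is one common way to trade off exploration against exploitation with only pairwise feedback. The paper's own selection and alignment rules may differ.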