AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 3
🧠Researchers introduce α-GFNs, an enhanced version of Generative Flow Networks that allows tunable control over exploration-exploitation dynamics through a parameter α. The method achieves up to 10× improvement in mode discovery across various benchmarks by addressing constraints in traditional GFlowNet objectives through Markov chain theory.
AI · Neutral · arXiv – CS AI · Jun 11 · 6/10
🧠TreeSeeker is a new inference-time framework that improves deep web search by using tree-structured trial-and-error navigation. The system balances exploration and exploitation through textual UCB signals, demonstrating consistent improvements over baseline models on multiple benchmarks.
AI · Neutral · arXiv – CS AI · Jun 8 · 6/10
🧠Researchers evaluated how well current large language models handle exploration-exploitation tradeoffs in decision-making tasks. The study found that reasoning models show promise for exploitation tasks but remain impractical due to cost and speed constraints, and that all tested LLMs underperform simple linear regression. LLMs do, however, excel at exploring large action spaces with semantic structure.
AI · Neutral · arXiv – CS AI · Jun 5 · 6/10
🧠Researchers introduce ReMax Actor-Critic (ReMAC), extending retry-based policy gradient methods from discrete to continuous action spaces. The approach uses pathwise derivative estimators to optimize pass@K and max@K objectives, promoting exploration through policy-gradient landscape reshaping rather than explicit entropy bonuses, achieving performance comparable to SAC.
AI · Neutral · arXiv – CS AI · Jun 2 · 6/10
🧠Researchers propose an improved Nash Learning from Human Feedback (NLHF) algorithm that addresses exploration challenges in preference alignment for large language models. The new method achieves better regret bounds without exponential dependence on regularization parameters and demonstrates empirical improvements when fine-tuning Llama-3-8B.
🧠 Llama
AI · Neutral · arXiv – CS AI · Jun 2 · 5/10
🧠Researchers compare dynamic entropy tuning in stochastic reinforcement learning policies with deterministic policies for quadcopter control. They find that dynamic entropy adjustment in the Soft Actor-Critic algorithm prevents catastrophic forgetting and improves exploration efficiency compared with static entropy or a purely deterministic approach using TD3.
AI · Neutral · arXiv – CS AI · Jun 1 · 6/10
🧠Researchers propose an uncertainty-aware reinforcement learning framework for autonomous driving that uses expert guidance to enable safer exploration while avoiding over-dependence on advice. The method combines epistemic and aleatoric uncertainty thresholds with a regulated commitment-cooldown strategy, demonstrating 5-7% improvements in success rates and reduced failures in CARLA simulations for unsignalized intersection navigation.
AI · Bullish · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduce EAPO, an exploration-aware reinforcement learning framework that enables LLM agents to selectively explore uncertain scenarios before acting. The method uses fine-grained reward functions and adaptive exploration mechanisms to improve decision-making across text and GUI-based agent benchmarks.
🏢 Hugging Face
AI · Neutral · arXiv – CS AI · May 4 · 6/10
🧠Researchers compared how large language models, humans, and algorithms approach the exploration-exploitation tradeoff in multi-armed bandit decision-making tasks. The study finds that enabling thinking processes in LLMs makes them behave more like humans in simple environments, but LLMs fail to match human adaptability in complex, non-stationary settings despite similar regret outcomes.
AI · Bullish · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers propose Adaptive Entropy Regularization (AER), a dynamic framework that addresses policy entropy collapse in LLM reinforcement learning by adjusting exploration intensity based on task difficulty. The method improves upon fixed entropy regularization approaches, demonstrating consistent gains in mathematical reasoning benchmarks while maintaining balanced exploration-exploitation tradeoffs.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers propose Policy Split, a novel reinforcement learning approach for LLMs that uses dual-mode entropy regularization to balance exploration with task accuracy. By bifurcating policy into normal and high-entropy modes, the method enables diverse behavioral patterns while maintaining performance, showing improvements over existing entropy-guided RL baselines.