y0news

#exploration-exploitation News & Analysis

11 articles tagged with #exploration-exploitation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 3

Controlling Exploration-Exploitation in GFlowNets via Markov Chain Perspectives

Researchers introduce α-GFNs, an enhanced version of Generative Flow Networks that allows tunable control over exploration-exploitation dynamics through a parameter α. The method achieves up to 10× improvement in mode discovery across various benchmarks by addressing constraints in traditional GFlowNet objectives through Markov chain theory.

$LINK
AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

TreeSeeker: Tree-Structured Trial, Error, and Return in Deep Search

TreeSeeker is a new inference-time framework that improves deep web search by using tree-structured trial-and-error navigation. The system balances exploration and exploitation through textual UCB signals, demonstrating consistent improvements over baseline models on multiple benchmarks.
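
For readers unfamiliar with UCB, here is a minimal sketch of the classic numeric UCB1 selection rule from the bandit literature. It is not TreeSeeker's method, which the summary describes as using textual UCB signals. The function name `ucb1_select` and its parameters are hypothetical, chosen for illustration only.

```python
import math


def ucb1_select(counts, values, c=math.sqrt(2)):
    """Pick an arm index using the classic UCB1 rule.

    counts[i] is how many times arm i has been tried, values[i] is its
    mean observed reward. Illustrative only: this is the textbook
    numeric rule, not the textual variant the summary mentions.
    """
    # Try every arm once before applying the confidence bound.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    total = sum(counts)
    # Mean reward (exploitation) plus a bonus that shrinks as an arm
    # is visited more often (exploration).
    scores = [
        v + c * math.sqrt(math.log(total) / n)
        for v, n in zip(values, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

The constant `c` sets how strongly rarely tried arms are favored. With the default `c = sqrt(2)` the bonus equals the standard `sqrt(2 ln t / n)` term.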

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

Should You Use Your Large Language Model to Explore or Exploit?

Researchers evaluated how well current large language models handle exploration-exploitation tradeoffs in decision-making tasks. Reasoning models show promise on exploitation tasks but remain impractical because of cost and speed, and all tested LLMs underperform simple linear regression. LLMs do, however, excel at exploring large action spaces with semantic structure.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Retry Policy Gradients in Continuous Action Spaces

Researchers introduce ReMax Actor-Critic (ReMAC), extending retry-based policy gradient methods from discrete to continuous action spaces. The approach uses pathwise derivative estimators to optimize pass@K and max@K objectives, promoting exploration through policy-gradient landscape reshaping rather than explicit entropy bonuses, achieving performance comparable to SAC.
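
As background on the pass@K objective named above, here is a minimal sketch of the widely used unbiased pass@K estimator computed from n sampled attempts. It is not ReMAC's pathwise-derivative estimator for continuous actions, which the summary does not detail. The function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k attempts succeeds.

    n: total attempts sampled, c: how many of them succeeded,
    k: retry budget (k <= n). Illustrative background only, not the
    estimator used in the paper.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # With fewer than k failures, every size-k subset contains a success.
    if n - c < k:
        return 1.0
    # One minus the chance that a random size-k subset is all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Raising `k` for fixed `n` and `c` never lowers the estimate, which reflects why retry-style objectives reward policies that spread probability over several distinct attempts.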

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Efficient Exploration for Iterative Nash Preference Optimization

Researchers propose an improved Nash Learning from Human Feedback (NLHF) algorithm that addresses exploration challenges in preference alignment for large language models. The new method achieves better regret bounds without exponential dependence on regularization parameters and demonstrates empirical improvements when fine-tuning Llama-3-8B.

🧠 Llama
AI · Neutral · arXiv – CS AI · Jun 2 · 5/10

Dynamic Entropy Tuning in Reinforcement Learning Low-Level Quadcopter Control: Stochasticity vs Determinism

Researchers compare dynamic entropy tuning in stochastic reinforcement learning policies against deterministic policies for quadcopter control. They find that dynamic entropy adjustment in the Soft Actor-Critic algorithm prevents catastrophic forgetting and improves exploration efficiency compared with static entropy or purely deterministic approaches using TD3.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Uncertainty-Aware and Temporally Regulated Expert Advice in Reinforcement Learning for Autonomous Driving

Researchers propose an uncertainty-aware reinforcement learning framework for autonomous driving that uses expert guidance to enable safer exploration while avoiding over-dependence on advice. The method combines epistemic and aleatoric uncertainty thresholds with a regulated commitment-cooldown strategy, demonstrating 5-7% improvements in success rates and reduced failures in CARLA simulations for unsignalized intersection navigation.

AI · Bullish · arXiv – CS AI · May 12 · 6/10

Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

Researchers introduce EAPO, an exploration-aware reinforcement learning framework that enables LLM agents to selectively explore uncertain scenarios before acting. The method uses fine-grained reward functions and adaptive exploration mechanisms to improve decision-making across text and GUI-based agent benchmarks.

🏢 Hugging Face
AI · Neutral · arXiv – CS AI · May 4 · 6/10

Comparing Exploration-Exploitation Strategies of LLMs and Humans: Insights from Standard Multi-armed Bandit Experiments

Researchers compared how large language models, humans, and algorithms approach the exploration-exploitation tradeoff in multi-armed bandit decision-making tasks. The study finds that enabling thinking processes in LLMs makes them behave more like humans in simple environments, but LLMs fail to match human adaptability in complex, non-stationary settings despite similar regret outcomes.

AI · Bullish · arXiv – CS AI · Apr 20 · 6/10

Revisiting Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning

Researchers propose Adaptive Entropy Regularization (AER), a dynamic framework that addresses policy entropy collapse in LLM reinforcement learning by adjusting exploration intensity based on task difficulty. The method improves upon fixed entropy regularization approaches, demonstrating consistent gains in mathematical reasoning benchmarks while maintaining balanced exploration-exploitation tradeoffs.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization

Researchers propose Policy Split, a reinforcement learning approach for LLMs that uses dual-mode entropy regularization to balance exploration with task accuracy. By bifurcating the policy into normal and high-entropy modes, the method enables diverse behavioral patterns while maintaining performance, and it shows improvements over existing entropy-guided RL baselines.