🤖 AI × Crypto · Neutral · Importance 7/10

Intent2Tx: Benchmarking LLMs for Translating Natural Language Intents into Ethereum Transactions

arXiv – CS AI | Zhuoran Pan, Yue Li, Zhi Guan, Jianbin Hu, Zhong Chen
🤖 AI Summary

Researchers introduce Intent2Tx, a benchmark dataset of nearly 32,000 real-world Ethereum transactions designed to evaluate how well large language models can translate natural language instructions into executable blockchain transactions. Testing 16 state-of-the-art LLMs reveals a critical gap: while models generate syntactically valid code, they frequently fail to achieve intended on-chain state transitions, exposing fundamental limitations in current AI's ability to reliably bridge user intent and blockchain execution.

Analysis

Intent2Tx addresses a fundamental problem in Web3 infrastructure: current language models cannot reliably convert user intentions into correct blockchain transactions. The benchmark's value comes from its data. Rather than synthetic test cases, it draws on real Ethereum mainnet activity spanning 300 days, covering protocol interactions across 11 DeFi categories, including long-tail primitives. This grounding makes it more informative than earlier synthetic benchmarks, which fail to capture the state-dependent complexity of on-chain execution.

The research reveals a disconnect between syntactic correctness and functional correctness. A model may produce output that is syntactically valid and runs without errors yet does not carry out the user's actual intent, a distinction that matters a great deal when financial transactions are at stake. The execution-aware evaluation methodology, which uses differential state analysis on forked networks, moves blockchain AI benchmarking beyond simple text matching to verifying actual transaction outcomes.
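As a rough illustration of the idea behind differential state analysis, the sketch below compares the balance changes caused by a candidate transaction with those caused by a reference transaction. It is a simplified sketch under stated assumptions, not the paper's implementation: the `State` type, the `state_diff` and `matches_intent` helpers, and the hand-written balances are all hypothetical. In a real setup the snapshots would come from executing transactions on a forked network rather than from dictionaries.

```python
from typing import Dict, Tuple

# Hypothetical state model: (account, asset) -> integer balance.
# A real evaluation would read these values from a forked chain.
State = Dict[Tuple[str, str], int]


def state_diff(pre: State, post: State) -> Dict[Tuple[str, str], int]:
    """Return only the non-zero balance changes between two snapshots."""
    keys = set(pre) | set(post)
    changes = {k: post.get(k, 0) - pre.get(k, 0) for k in keys}
    return {k: v for k, v in changes.items() if v != 0}


def matches_intent(pre: State, post_candidate: State, post_reference: State) -> bool:
    """True if the candidate caused the same state changes as the reference."""
    return state_diff(pre, post_candidate) == state_diff(pre, post_reference)


# Illustrative intent: "swap 100 USDC for DAI" (all numbers are made up).
pre = {
    ("user", "USDC"): 500, ("user", "DAI"): 0,
    ("pool", "USDC"): 10_000, ("pool", "DAI"): 10_000,
}
reference_post = {
    ("user", "USDC"): 400, ("user", "DAI"): 99,
    ("pool", "USDC"): 10_100, ("pool", "DAI"): 9_901,
}
# A candidate that executed without error but moved no funds.
no_op_post = dict(pre)

print(matches_intent(pre, reference_post, reference_post))
print(matches_intent(pre, no_op_post, reference_post))
```

The second candidate stands in for a transaction that is valid and executes cleanly but leaves balances untouched. A text-level or syntax-level check would accept it, while the state comparison rejects it.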

For the Web3 ecosystem, these findings highlight why autonomous agents cannot yet be trusted with unsupervised transaction generation. The models' struggles with out-of-distribution generalization and multi-step planning suggest they lack the reasoning depth needed for complex DeFi sequences. However, the benchmark itself serves as infrastructure for future development: it provides the training data and evaluation framework needed to build reliable AI agents. Developers and researchers now have a standardized way to measure progress toward trustworthy intent-to-execution systems.

Key Takeaways
  • Intent2Tx contains 31,496 real-world Ethereum transactions derived from actual mainnet activity, providing far more realistic evaluation data than synthetic benchmarks.
  • State-of-the-art LLMs pass syntactic validation yet frequently fail to execute intended state transitions, exposing a critical reasoning-to-execution gap.
  • Execution-aware evaluation using forked mainnet environments reveals that syntactically valid code does not guarantee functional correctness for blockchain transactions.
  • Current models struggle significantly with out-of-distribution generalization and multi-step transaction planning across complex DeFi protocols.
  • The benchmark establishes a foundation for developing trustworthy autonomous Web3 agents by providing standardized evaluation methodology and real-world training data.
Read Original → via arXiv – CS AI