#testing News & Analysis

19 articles tagged with #testing. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

🧠 AI · Neutral · arXiv – CS AI · Jun 19 · 7/10

StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns

Amazon researchers introduced StaminaBench, a benchmark that evaluates coding agents' ability to handle extended multi-turn interactions (up to 100 consecutive change requests), revealing that current LLMs fail within 5-6 turns and that test feedback can improve performance up to 12x.

🧠 AI · Bearish · arXiv – CS AI · May 29 · 7/10

Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software

A physicist supervised Claude AI models over 12 days to build CLAX-PT, a physics simulation module, documenting how AI agents struggle with architectural redesign and distinguishing symptom-fixes from root-cause solutions. The study reveals that supervision design and human domain expertise, rather than model capability alone, determine whether AI-generated scientific code produces trustworthy results.

🧠 Claude

🧠 AI · Neutral · arXiv – CS AI · Mar 5 · 7/10

SaFeR: Safety-Critical Scenario Generation for Autonomous Driving Test via Feasibility-Constrained Token Resampling

Researchers propose SaFeR, a new AI system for generating safety-critical scenarios to test autonomous driving systems. The approach uses transformer-based models with a novel resampling strategy to balance adversarial testing, physical feasibility, and realistic behavior in autonomous vehicle simulations.

🤖 AI × Crypto · Bullish · Wu Blockchain · Feb 20 · 7/10 · 3

OpenAI Releases Smart Contract Benchmark Test: What Does It Mean?

OpenAI has released a benchmark test specifically designed to evaluate smart contract capabilities of AI systems. The test is positioned as a comprehensive evaluation tool for AI agents operating in blockchain environments, suggesting increased focus on AI-blockchain integration.

⛓️ Crypto · Bullish · Ethereum Foundation Blog · Mar 14 · 7/10 · 2

Announcing the Kiln Merge Testnet

The Kintsugi merge testnet, launched in December, served as a testing ground for Ethereum's transition to proof-of-stake through test suites and multi-client implementations. That testing produced stable protocol specifications, and with clients now implementing the necessary changes for The Merge, Kiln is launching as Kintsugi's successor.

🧠 AI · Neutral · TechCrunch – AI · Jun 24 · 6/10

Facebook rolls out an AI companion app for creators

Facebook is launching an AI companion app for creators that integrates its recently developed AI creator assistant. The app is currently in limited testing with select creators and is part of Meta's push to embed AI tools directly into creator-focused products.

⛓️ Crypto · Bullish · U.Today · Apr 11 · 6/10

Ethereum Devs Signal Glamsterdam Devnet Launch Next Week as Upgrade Progresses

Ethereum developers are planning to launch the first generalized Glamsterdam devnet next week, marking progress on a significant protocol upgrade. This milestone demonstrates continued momentum in Ethereum's development roadmap and brings the community closer to testing new network capabilities.

$ETH

🧠 AI · Bullish · arXiv – CS AI · Mar 26 · 6/10

LLMLOOP: Improving LLM-Generated Code and Tests through Automated Iterative Feedback Loops

Researchers have developed LLMLOOP, a framework that automatically refines LLM-generated code and test cases through five iterative loops addressing compilation errors, static analysis issues, test failures, and quality improvements. The tool was evaluated on the HUMANEVAL-X benchmark, where it improved the quality of AI-generated code.
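
The feedback-loop idea in this summary can be illustrated with a minimal Python sketch. It is not LLMLOOP's implementation: the `refine` function, the two checks, and the scripted stand-in for the model are all hypothetical, and only two of the loops named above (compilation errors and test failures) are shown.

```python
from typing import Callable, Optional, Tuple


def check_compiles(source: str) -> Optional[str]:
    """Return an error message if the candidate does not compile, else None."""
    try:
        compile(source, "<candidate>", "exec")
        return None
    except SyntaxError as exc:
        return f"SyntaxError: {exc.msg}"


def check_tests(source: str) -> Optional[str]:
    """Run one hypothetical test against the candidate's `add` function."""
    namespace: dict = {}
    exec(source, namespace)
    add = namespace.get("add")
    if add is None:
        return "add is not defined"
    result = add(2, 3)
    if result != 5:
        return f"add(2, 3) returned {result}, expected 5"
    return None


def refine(generate: Callable[[Optional[str]], str],
           max_rounds: int = 5) -> Tuple[Optional[str], int]:
    """Ask for a candidate, check it, and feed any failure back to the generator."""
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        source = generate(feedback)
        # The first failing check produces the feedback for the next round.
        feedback = check_compiles(source) or check_tests(source)
        if feedback is None:
            return source, round_no
    return None, max_rounds


def make_scripted_model() -> Callable[[Optional[str]], str]:
    """Stand-in for an LLM: replays fixed attempts and ignores the feedback."""
    attempts = iter([
        "def add(a, b) return a + b",        # does not compile
        "def add(a, b):\n    return a - b",  # compiles, fails the test
        "def add(a, b):\n    return a + b",  # passes
    ])
    return lambda feedback: next(attempts)


source, rounds = refine(make_scripted_model())
print(rounds)
```

In a real setup the `generate` callable would send the feedback string back to the model. The scripted version only makes the control flow visible.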

🧠 AI · Bullish · arXiv – CS AI · Mar 11 · 6/10

Test-Driven AI Agent Definition (TDAD): Compiling Tool-Using Agents from Behavioral Specifications

Researchers introduce Test-Driven AI Agent Definition (TDAD), a methodology that compiles AI agent prompts from behavioral specifications using automated testing. The approach addresses production deployment challenges by ensuring measurable behavioral compliance and preventing silent regressions in tool-using LLM agents.

🧠 AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 7

MIST-RL: Mutation-based Incremental Suite Testing via Reinforcement Learning

Researchers propose MIST-RL, a reinforcement learning framework that improves AI code generation by creating more efficient test suites. The method achieves 28.5% higher fault detection while using 19.3% fewer test cases, demonstrating significant improvements in AI code verification efficiency.
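
To make "higher fault detection with fewer test cases" concrete, here is a small Python sketch of mutation-based incremental suite building. It is an assumption-laden illustration, not MIST-RL: a plain greedy filter replaces the paper's reinforcement learning, and the function under test, the mutants, and the candidate tests are all made up.

```python
from typing import Callable, List, Set, Tuple

Test = Tuple[Tuple[int, int], int]  # ((a, b), expected result)

# Hypothetical mutants of max2(a, b) = a if a > b else b.
MUTANTS: List[Callable[[int, int], int]] = [
    lambda a, b: a if a < b else b,  # flipped comparison
    lambda a, b: a,                  # always returns the first argument
    lambda a, b: b,                  # always returns the second argument
]

# Candidate tests, each correct for the original max2.
CANDIDATES: List[Test] = [
    ((1, 1), 1),
    ((2, 1), 2),
    ((1, 2), 2),
    ((3, 0), 3),
]


def killed_by(test: Test) -> Set[int]:
    """Indices of the mutants whose output differs from the expected value."""
    args, expected = test
    return {i for i, mutant in enumerate(MUTANTS) if mutant(*args) != expected}


def build_suite(candidates: List[Test]) -> Tuple[List[Test], Set[int]]:
    """Keep a test only if it kills at least one mutant not killed so far."""
    suite: List[Test] = []
    killed: Set[int] = set()
    for test in candidates:
        new_kills = killed_by(test) - killed
        if new_kills:
            suite.append(test)
            killed |= new_kills
    return suite, killed


suite, killed = build_suite(CANDIDATES)
print(len(suite), "of", len(CANDIDATES), "tests kept;", len(killed), "mutants killed")
```

Here two of the four candidates are enough to kill all three mutants, so the other two add cost without adding fault detection.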

🧠 AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 3

OBsmith: LLM-Powered JavaScript Obfuscator Testing

Researchers introduce OBsmith, an LLM-powered framework that tests JavaScript obfuscators for correctness bugs that can silently alter program functionality. The tool discovered 11 previously unknown bugs that existing JavaScript fuzzers failed to detect, highlighting critical gaps in obfuscation quality assurance.
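
A correctness bug that silently alters program functionality can be illustrated with a differential test: run a program before and after the transformation and compare the results. The sketch below is in Python rather than JavaScript and does not reflect how OBsmith works internally. The `toy_obfuscate` function is a deliberately buggy, hypothetical transform.

```python
from typing import List


def toy_obfuscate(source: str) -> str:
    """Hypothetical obfuscator: hides the literal True behind a comparison.

    The textual replacement is deliberately naive, so it also rewrites
    the word "True" inside string literals.
    """
    return source.replace("True", "(1 == 1)")


def behaviour_changed(source: str) -> bool:
    """Differential check: evaluate the original and the obfuscated expression."""
    return eval(source) != eval(toy_obfuscate(source))


# Small hand-written expressions; a real tester would generate many more.
programs: List[str] = [
    "True and not False",
    "[True, 1 + 1]",
    "'True story'.upper()",
]

bugs = [program for program in programs if behaviour_changed(program)]
print(bugs)
```

Only the third expression exposes the bug. The first two still evaluate to the same values after obfuscation, which is why such defects can go unnoticed without targeted inputs.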

🤖 AI × Crypto · Bullish · CoinTelegraph – AI · Feb 27 · 6/10 · 6

Pantera, Franklin Templeton join Sentient Arena to test AI agents

Sentient has launched Arena, a production-style platform designed to test AI agents on enterprise tasks. Major financial firms Pantera and Franklin Templeton have joined the initial cohort to participate in testing these AI agents.

🧠 AI · Bullish · Google DeepMind Blog · Dec 9 · 6/10 · 6

FACTS Benchmark Suite: Systematically evaluating the factuality of large language models

The FACTS Benchmark Suite has been introduced as a systematic evaluation framework for assessing the factual accuracy of large language models. This standardized testing methodology aims to provide reliable metrics for measuring how well AI models adhere to factual information across various domains.

⛓️ Crypto · Bullish · Ethereum Foundation Blog · Mar 23 · 6/10 · 2

Finalized no. 34

The Kiln testnet is now operational as part of Ethereum's merge testing initiative. The #TestingTheMerge campaign is encouraging community participation in testing the transition to proof-of-stake.

🧠 AI · Neutral · OpenAI News · Dec 3 · 5/10 · 6

Procgen Benchmark

OpenAI has released Procgen Benchmark, a collection of 16 procedurally-generated environments designed to test reinforcement learning agents' ability to develop generalizable skills. The benchmark provides a standardized way to measure how quickly AI agents can learn and adapt to new scenarios.

⛓️ Crypto · Neutral · Ethereum Foundation Blog · Sep 16 · 5/10 · 2

Ethereum Wallet - Developer Preview

Ethereum announces the first developer preview of its Ethereum Wallet ÐApp and asks the community for feedback and code auditing. This early release is intended for testing and improvement, not production use.

$ETH

🧠 AI · Bullish · Simon Willison Blog · Jun 23 · 5/10

OPFS + Pyodide test harness

The article discusses a test harness that combines OPFS (Origin Private File System) with Pyodide, a Python runtime compiled to WebAssembly. Together they let developers run Python code directly in the browser with persistent local file storage, which is useful for development workflows and testing.

🧠 AI · Neutral · arXiv – CS AI · Mar 5 · 4/10

SpotIt+: Verification-based Text-to-SQL Evaluation with Database Constraints

SpotIt+ is a new open-source tool that evaluates Text-to-SQL systems through verification-based testing, actively searching for database instances that reveal differences between generated and ground-truth SQL queries. Its constraint mining combines rule-based specification mining with LLM validation to generate more realistic test scenarios.
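
The idea of searching for a database instance that separates two queries can be sketched in a few lines of Python with the standard `sqlite3` module. This is a bounded brute-force search, swapped in for illustration only and not SpotIt+'s actual verification procedure. The `emp` table, both queries, and the value domain are hypothetical.

```python
import itertools
import sqlite3
from typing import List, Optional, Sequence, Tuple

Row = Tuple[str, int]


def run(rows: Sequence[Row], query: str) -> List[tuple]:
    """Load the rows into a fresh in-memory table and return the sorted result."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emp (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO emp VALUES (?, ?)", rows)
    result = sorted(conn.execute(query).fetchall())
    conn.close()
    return result


def find_counterexample(gold: str, predicted: str, domain: Sequence[Row],
                        max_rows: int = 2) -> Optional[List[Row]]:
    """Try every table of up to max_rows rows drawn from the domain.

    Returns the first instance on which the two queries disagree, or None
    if they agree on every instance within the bound.
    """
    for size in range(max_rows + 1):
        for rows in itertools.product(domain, repeat=size):
            if run(rows, gold) != run(rows, predicted):
                return list(rows)
    return None


GOLD = "SELECT name FROM emp WHERE age > 30"
PREDICTED = "SELECT name FROM emp WHERE age >= 30"
DOMAIN: List[Row] = [("ann", 29), ("bob", 30), ("cy", 31)]

witness = find_counterexample(GOLD, PREDICTED, DOMAIN)
print(witness)
```

A test database that never contains an employee aged exactly 30 would score the predicted query as correct, which is the gap a search for distinguishing instances is meant to close. Database constraints would further restrict which candidate instances count as realistic.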

⛓️ Crypto · Neutral · Ethereum Foundation Blog · Apr 2 · 4/10 · 3

Finalized no. 25

Issue no. 25 of this Ethereum development update mentions Rayonism, the Merge, a BLST security advisory, and Beacon Chain security testing, without specific details on any of them.