AIBullisharXiv – CS AI · 4d ago7/10
🧠Researchers introduce VULPO, an on-policy LLM optimization framework for vulnerability detection that achieves 203% improvement over baseline models by incorporating context-aware reasoning and multidimensional reward signals. The approach combines a new ContextVul dataset with specialized fine-tuning to create more effective security analysis tools that reason through complex code interactions.
AIBearisharXiv – CS AI · May 117/10
🧠A comprehensive survey of 87 machine learning vulnerability detection studies reveals that the field has stalled despite a decade of research, trapped in self-reinforcing feedback loops that optimize for narrow, artificial problems. Researchers identify twelve interconnected pain points spanning datasets, formulations, metrics, and evaluation approaches that perpetuate focus on binary C/C++ function-level classification while neglecting vulnerability type prediction, multilingual support, and broader detection granularities.
AIBearisharXiv – CS AI · Apr 107/10
🧠Researchers evaluated Cursor, an AI-powered IDE, on its ability to generate large-scale software projects and found it achieves 91% functional correctness but produces significant design issues including code duplication, complexity violations, and framework best-practice breaches that threaten long-term maintainability.
AI × CryptoBearishDL News · Mar 267/10
🤖Cybercriminals are leveraging AI language models like ChatGPT and Claude to rapidly scan thousands of lines of code per second, identifying vulnerabilities in legacy systems. This represents a significant escalation in automated hacking capabilities, potentially exposing millions of dollars worth of cryptocurrency assets to sophisticated AI-powered attacks.
🧠 ChatGPT🧠 Claude
AIBearisharXiv – CS AI · Mar 127/10
🧠Researchers have developed a risk assessment framework for open-source Model Context Protocol (MCP) servers, revealing significant security vulnerabilities through static code analysis. The study found many MCP servers contain exploitable weaknesses that compromise confidentiality, integrity, and availability, highlighting the need for secure-by-design development as these tools become widely adopted for LLM agents.
AIBullisharXiv – CS AI · May 126/10
🧠Researchers introduce VulTriage, an LLM-based framework that enhances vulnerability detection in source code through triple-path context augmentation combining control flow analysis, vulnerability knowledge retrieval, and semantic summarization. The approach achieves state-of-the-art results on benchmark datasets and demonstrates strong generalization to low-resource scenarios.
AINeutralarXiv – CS AI · Apr 76/10
🧠Research study reveals that when Claude Opus 4.6 deobfuscates JavaScript code, poisoned identifier names from the original string table consistently survive in the reconstructed code, even when the AI demonstrates correct understanding of the code's semantics. Changing the task framing from 'deobfuscate' to 'write fresh implementation' significantly reduced this persistence while maintaining algorithmic accuracy.
🧠 Claude🧠 Haiku🧠 Opus
AIBullisharXiv – CS AI · Mar 276/10
🧠Researchers introduce TRAJEVAL, a diagnostic framework that breaks down AI code agent performance into three stages (search, read, edit) to identify specific failure points rather than just binary pass/fail outcomes. The framework analyzed 16,758 trajectories and found that real-time feedback based on trajectory signals improved state-of-the-art models by 2.2-4.6 percentage points while reducing costs by 20-31%.
🧠 GPT-5
AIBullishThe Register – AI · Mar 267/10
🧠Linux kernel czar Linus Torvalds reports that AI-generated bug reports have dramatically improved in quality, transforming from mostly useless submissions to legitimate and valuable contributions overnight. This represents a significant milestone in AI's ability to assist with complex software development and code analysis tasks.
AIBullisharXiv – CS AI · Mar 126/10
🧠Researchers conducted the first comprehensive evaluation of parameter-efficient fine-tuning (PEFT) for multi-task code analysis, showing that a single PEFT module can match full fine-tuning performance while reducing computational costs by up to 85%. The study found that even 1B-parameter models with multi-task PEFT outperform large general-purpose LLMs like DeepSeek and CodeLlama on code analysis tasks.
AIBullishOpenAI News · Mar 65/10
🧠Codex Security, an AI-powered application security agent, has launched in research preview to help developers detect, validate, and patch complex vulnerabilities. The tool analyzes project context to provide more accurate security assessments with reduced false positives.
AINeutralarXiv – CS AI · Mar 36/107
🧠Researchers introduce Theory of Code Space (ToCS), a new benchmark that evaluates AI agents' ability to understand software architecture across multi-file codebases. The study reveals significant performance gaps between frontier LLM agents and rule-based baselines, with F1 scores ranging from 0.129 to 0.646.
AIBullisharXiv – CS AI · Mar 37/108
🧠Researchers introduce FastCode, a new framework for AI-assisted software engineering that improves code understanding and reasoning efficiency. The system uses structural scouting to navigate codebases without full-text ingestion, significantly reducing computational costs while maintaining accuracy across multiple benchmarks.
AIBullisharXiv – CS AI · Mar 36/105
🧠Researchers introduce 'semi-formal reasoning' for LLM agents to analyze code semantics without execution, showing significant accuracy improvements across multiple tasks. The methodology achieves 88-93% accuracy on patch verification and 87% on code question answering, potentially enabling practical applications in automated code review and static analysis.
AINeutralarXiv – CS AI · Mar 36/103
🧠Researchers introduce OBsmith, an LLM-powered framework that tests JavaScript obfuscators for correctness bugs that can silently alter program functionality. The tool discovered 11 previously unknown bugs that existing JavaScript fuzzers failed to detect, highlighting critical gaps in obfuscation quality assurance.
AINeutralarXiv – CS AI · Apr 205/10
🧠Researchers demonstrate that Chain-of-Thought prompting significantly improves large language models' ability to deobfuscate control flow code, with GPT-5 achieving 16-20% performance gains over zero-shot prompting. The approach offers a potential alternative to expensive manual reverse engineering, though practical deployment remains limited to research benchmarks.
🧠 GPT-5
AINeutralarXiv – CS AI · Apr 75/10
🧠Researchers developed TRACE, a framework to evaluate how LLMs allocate trust between conflicting software artifacts like code, documentation, and tests. The study found that current LLMs are better at identifying natural-language specification issues than detecting subtle code-level problems, with models showing systematic blind spots when implementations drift while documentation remains plausible.