956 articles tagged with #llm. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 7
🧠 Researchers introduce Theory of Code Space (ToCS), a new benchmark that evaluates AI agents' ability to understand software architecture across multi-file codebases. The study reveals significant performance gaps between frontier LLM agents and rule-based baselines, with F1 scores ranging from 0.129 to 0.646.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 5
🧠 Researchers introduced GateLens, an LLM-based system that uses Relational Algebra as an intermediate layer to analyze complex tabular data more reliably than traditional approaches. The system demonstrated over 80% reduction in analysis time in automotive software analytics while maintaining high accuracy, outperforming existing Chain-of-Thought methods.
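The appeal of a relational-algebra intermediate layer is that an LLM can emit a small, checkable query plan instead of free-form reasoning over the table. A minimal sketch, assuming toy selection/projection operators and an invented automotive test table (not GateLens's actual implementation):

```python
# Relational-algebra primitives over rows-as-dicts (illustrative only).

def select(rows, predicate):
    """Selection (sigma): keep rows matching the predicate."""
    return [r for r in rows if predicate(r)]

def project(rows, columns):
    """Projection (pi): keep only the named columns."""
    return [{c: r[c] for c in columns} for r in rows]

# Toy table a compiled query plan might operate on.
tests = [
    {"module": "brakes", "result": "fail", "duration_s": 12.0},
    {"module": "brakes", "result": "pass", "duration_s": 11.5},
    {"module": "infotainment", "result": "fail", "duration_s": 3.2},
]

# "Which modules have failing tests?" compiled to sigma then pi:
failing = project(select(tests, lambda r: r["result"] == "fail"), ["module"])
print(failing)  # [{'module': 'brakes'}, {'module': 'infotainment'}]
```

Because each operator is deterministic, the plan's output can be verified independently of the model that produced it.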
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 9
🧠 Researchers introduced Wild-Drive, a framework for autonomous off-road driving that combines scene captioning and path planning using multimodal AI. The system addresses challenges in harsh weather conditions through robust sensor fusion and efficient large language models, outperforming existing methods in degraded sensing conditions.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 7
🧠 Researchers have developed ContextCov, a framework that converts passive natural language instructions for AI agents into active, executable guardrails to prevent code violations. The system addresses 'Context Drift', where AI agents deviate from project guidelines, by creating automated compliance checks across static code analysis, runtime commands, and architectural validation.
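The core idea is turning a guideline that an agent might silently drift away from into a check that runs mechanically. A toy sketch of that conversion, assuming an invented rule and checker (not ContextCov's actual mechanism):

```python
import re

# Guideline (passive text): "Library code must not call print();
# use the logger instead." Converted to an active static check:

def check_no_print(source: str):
    """Return the line numbers that violate the no-print guideline."""
    violations = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if re.search(r"\bprint\s*\(", line):
            violations.append(lineno)
    return violations

snippet = "import logging\nprint('debug')\nlogging.info('ok')\n"
print(check_no_print(snippet))  # [2]
```

Run after each agent edit, such a check catches drift immediately rather than relying on the agent to keep the instruction in context.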
AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 7
🧠 Researchers found that machine unlearning in large language models, which aims to remove specific training data influence, is less effective in interactive settings than previously thought. Knowledge that appears forgotten in static tests can often be recovered through multi-turn conversations and self-correction interactions.
AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 7
🧠 Researchers propose a new gauge-theoretic framework for understanding superposition in large language models, replacing traditional single-dictionary approaches with local semantic charts. The method introduces three measurable obstructions to interpretability and demonstrates results on the Llama 3.2 3B model with various datasets.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 8
🧠 Researchers propose PARCER, a new framework that acts as an operational contract to address major governance challenges in Large Language Model systems. The framework uses structured YAML configurations to reduce variance, improve cost control, and enhance predictability in LLM operations through seven operational phases and decision hygiene practices.
AI · Bearish · arXiv – CS AI · Mar 3 · 6/10 · 6
🧠 Research reveals that leading foundation models (LLMs) perform poorly on real-world educational tasks despite excelling on AI benchmarks. The study found that 50% of misalignment errors are shared across models due to common pretraining approaches, with model ensembles actually worsening performance on learning outcomes.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 9
🧠 Researchers introduced AWE, a memory-augmented multi-agent framework for autonomous web penetration testing that outperforms existing tools on injection vulnerabilities. AWE achieved 87% XSS success and 66.7% blind SQL injection success on benchmark tests, demonstrating superior accuracy and efficiency compared to general-purpose AI penetration testing tools.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 8
🧠 Researchers introduce FastCode, a new framework for AI-assisted software engineering that improves code understanding and reasoning efficiency. The system uses structural scouting to navigate codebases without full-text ingestion, significantly reducing computational costs while maintaining accuracy across multiple benchmarks.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 6
🧠 Researchers introduce One-Token Verification (OTV), a new method that estimates reasoning correctness in large language models during a single forward pass, reducing computational overhead. OTV reduces token usage by up to 90% through early termination while improving accuracy on mathematical reasoning tasks compared to existing verification methods.
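The single-forward-pass idea can be illustrated by reading the next-token distribution over two reserved verdict tokens and terminating early when estimated correctness is low. The token names, logits, and threshold below are assumptions for illustration, not the paper's actual setup:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a dict of logits."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def verify_step(logits, threshold=0.5):
    """One-token verdict: (p_correct, keep_going) from one pass's logits."""
    probs = softmax({t: logits[t] for t in ("<correct>", "<incorrect>")})
    p = probs["<correct>"]
    return p, p >= threshold

# Mock logits for the two verdict tokens at one reasoning step: the model
# puts more mass on <incorrect>, so generation would terminate early.
p, keep = verify_step({"<correct>": 1.2, "<incorrect>": 2.6})
print(round(p, 3), keep)  # 0.198 False
```

The token savings come from cutting off reasoning chains that this cheap check flags as unlikely to be correct, rather than generating a full verification rationale.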
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 7
🧠 Researchers have developed Thoth, the first family of Large Language Models specifically designed to understand and reason about time series data through a mid-training approach. The model uses a specialized corpus called Book-of-Thoth to bridge the gap between temporal data and natural language, significantly outperforming existing LLMs in time series analysis tasks.
AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 12
🧠 Researchers introduce Silo-Bench, a benchmark revealing that multi-agent LLM systems can exchange information effectively but fail to integrate distributed data for correct reasoning. The study shows coordination overhead increases with scale, challenging the assumption that adding more agents can solve context limitations.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 7
🧠 RepoRepair is a new AI-powered automated program repair system that uses hierarchical code documentation to fix bugs across entire software repositories. The system achieves a 45.7% repair rate on SWE-bench Lite at $0.44 per fix by leveraging LLMs like DeepSeek-V3 and Claude-4 for fault localization and code repair.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 8
🧠 Researchers have developed Egocentric Co-Pilot, a web-native AI framework that runs on smart glasses and uses Large Language Models to provide assistive AI without requiring screens or free hands. The system combines perception, reasoning, and web tools to support accessibility for people with vision impairments or cognitive overload, showing superior performance compared to commercial baselines.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 10
🧠 Researchers have developed MedCollab, a multi-agent AI framework that uses structured argumentation and causal reasoning to improve clinical diagnosis accuracy. The system outperforms traditional LLMs by reducing medical hallucinations and providing more transparent, clinically compliant diagnostic processes through hierarchical consultation workflows.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 7
🧠 ATLAS is a new AI-driven framework that uses large language models to automate System-on-Chip (SoC) security verification by converting threat models into formal verification properties. The system successfully detected 39 out of 48 security weaknesses in benchmark tests and generated correct security properties for 33 of those vulnerabilities.
AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 3
🧠 Researchers introduce OBsmith, an LLM-powered framework that tests JavaScript obfuscators for correctness bugs that can silently alter program functionality. The tool discovered 11 previously unknown bugs that existing JavaScript fuzzers failed to detect, highlighting critical gaps in obfuscation quality assurance.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 6
🧠 Researchers have developed TOSS, a new framework for safely fine-tuning large language models that operates at the token level rather than the sample level. The method identifies and removes unsafe tokens while preserving task-specific information, demonstrating superior performance compared to existing sample-level defense methods in maintaining both safety and utility.
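The token-level framing can be sketched as masking only the tokens whose estimated "unsafety" score crosses a threshold, so the rest of the sample still contributes training signal. The scoring values and threshold below are illustrative assumptions, not TOSS's actual scorer:

```python
# Token-level safety filtering for fine-tuning data (illustrative sketch).

def mask_unsafe_tokens(tokens, scores, threshold=0.8):
    """Return (kept_tokens, loss_mask); masked tokens get loss weight 0."""
    mask = [0.0 if s >= threshold else 1.0 for s in scores]
    kept = [t for t, m in zip(tokens, mask) if m > 0]
    return kept, mask

tokens = ["Here", "is", "how", "to", "disable", "the", "alarm"]
scores = [0.1, 0.1, 0.3, 0.3, 0.95, 0.2, 0.9]   # mock per-token scores
kept, mask = mask_unsafe_tokens(tokens, scores)
print(kept)  # ['Here', 'is', 'how', 'to', 'the']
print(mask)  # [1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0]
```

A sample-level defense would drop this whole example and its task-relevant tokens with it; the token-level mask keeps them, which is the utility-preservation argument in the summary above.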
AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 6
🧠 Researchers identified Self-Anchoring Calibration Drift (SACD), where large language models show systematic confidence changes when building on their own outputs in multi-turn conversations. Testing Claude Sonnet 4.6, Gemini 3.1 Pro, and GPT-5.2 revealed model-specific patterns, with Claude showing decreasing confidence and significant calibration errors, while GPT-5.2 exhibited the opposite behavior in open-ended domains.
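The two quantities involved, drift in self-reported confidence across turns and the gap between confidence and accuracy, are easy to state concretely. A sketch with made-up numbers (not data from the paper):

```python
def drift(confidences):
    """Mean per-turn change in self-reported confidence."""
    deltas = [b - a for a, b in zip(confidences, confidences[1:])]
    return sum(deltas) / len(deltas)

def calibration_gap(confidences, correct):
    """Mean confidence minus accuracy; > 0 means overconfident."""
    return sum(confidences) / len(confidences) - sum(correct) / len(correct)

# Hypothetical four-turn run where each turn builds on the last answer:
conf = [0.90, 0.82, 0.74, 0.65]   # self-reported confidence per turn
hits = [1, 1, 1, 0]               # whether each turn's answer was correct
print(round(drift(conf), 3))                  # ≈ -0.083 per turn
print(round(calibration_gap(conf, hits), 3))  # small positive gap
```

A "Claude-like" pattern in this framing is a negative drift with a nonzero gap; the "opposite behavior" would be a positive drift under the same measurement.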
AI · Bearish · arXiv – CS AI · Mar 3 · 7/10 · 9
🧠 A study reveals that safety-aligned large language models exhibit "Defensive Refusal Bias," refusing legitimate cybersecurity defense tasks 2.72x more often when they contain security-sensitive keywords. The research found particularly high refusal rates for critical defensive operations like system hardening (43.8%) and malware analysis (34.3%), suggesting current AI safety measures rely on semantic similarity rather than understanding intent.
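A ratio like the reported 2.72x is the refusal rate on keyword-bearing defensive prompts divided by the rate on keyword-free paraphrases of the same tasks. The counts below are invented purely to reproduce that ratio, not the study's data:

```python
# Back-of-the-envelope refusal-bias ratio (illustrative counts only).

def refusal_rate(refused, total):
    return refused / total

with_keywords = refusal_rate(136, 400)    # prompts mentioning e.g. "malware"
without_keywords = refusal_rate(50, 400)  # same tasks, neutral wording
print(round(with_keywords / without_keywords, 2))  # → 2.72
```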
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 6
🧠 Researchers developed KG-Followup, a knowledge graph-augmented large language model system that generates medical follow-up questions for pre-diagnostic assessment. The system combines structured medical domain knowledge with LLMs to improve clinical diagnosis efficiency, outperforming existing methods by 5-8% on recall benchmarks.
AI · Bearish · arXiv – CS AI · Mar 3 · 6/10 · 8
🧠 Research reveals that Large Language Model (LLM) self-explanations fail semantic invariance testing, showing that AI models' self-reports change based on how tasks are framed rather than actual task performance. Four frontier AI models demonstrated unreliable self-reporting when faced with semantically different but functionally identical tool descriptions, raising questions about using model self-reports as evidence of capability.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 6
🧠 MOSAIC is a new open-source platform that enables cross-paradigm comparison and evaluation of different AI agents, including reinforcement learning, large language models, vision-language models, and human decision-makers, within the same environment. The platform introduces three key technical contributions: an IPC-based worker protocol, an operator abstraction for unified interfaces, and a deterministic evaluation framework for reproducible research.
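An operator abstraction of this kind boils down to one interface that an RL policy, an LLM agent, or a human stand-in can all satisfy, so the same deterministic evaluation loop runs over any of them. The class and method names below are assumptions for illustration, not MOSAIC's actual API:

```python
from typing import Protocol

class Operator(Protocol):
    """Unified interface: any decision-maker maps observations to actions."""
    def act(self, observation: str) -> str: ...

class ScriptedOperator:
    """Deterministic stand-in (e.g. replayed human decisions)."""
    def __init__(self, script):
        self.script = iter(script)
    def act(self, observation: str) -> str:
        return next(self.script)

def evaluate(operator: Operator, observations, expected):
    """Deterministic evaluation loop: fraction of matching actions."""
    actions = [operator.act(o) for o in observations]
    return sum(a == e for a, e in zip(actions, expected)) / len(expected)

score = evaluate(ScriptedOperator(["left", "wait"]),
                 ["obs1", "obs2"], ["left", "jump"])
print(score)  # 0.5
```

An LLM-backed operator would implement the same `act` signature by prompting a model, which is what makes cross-paradigm comparison possible without changing the environment code.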
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 6
🧠 Researchers introduce GlassMol, a new interpretable AI model for molecular property prediction that addresses the black-box problem in drug discovery. The model uses Concept Bottleneck Models with automated concept curation and LLM-guided selection, achieving performance that matches or exceeds traditional black-box models across thirteen benchmarks.