AI · Neutral · arXiv – CS AI · May 29 · 7/10
🧠Researchers introduce the NOVA framework, which models AI knowledge discovery as an adaptive sampling process and identifies fundamental scaling limitations. The analysis reveals a contamination trap where false positives accumulate faster than genuine discoveries as knowledge becomes scarce, with cumulative generation costs following a Zipf-distributed scaling law demonstrating asymptotic diminishing returns.
AI · Bearish · arXiv – CS AI · May 29 · 7/10
🧠Researchers present an empirical study revealing that Large Language Models struggle with cyber threat intelligence (CTI) tasks due to domain-specific vulnerabilities rather than generic AI failures. The study identifies three failure modes—spurious correlations, contradictory knowledge, and constrained generalization—and proposes targeted defenses to improve LLM reliability in security operations.
AI · Neutral · arXiv – CS AI · Jun 3 · 6/10
🧠Researchers introduced DeskCraft, a new benchmark for evaluating AI desktop agents on complex, long-horizon professional workflows in creative and engineering software. The study reveals significant performance gaps, with GPT-4 achieving only 31.6% accuracy on standard tasks and 27.6% on interactive tasks requiring human collaboration, highlighting challenges in multi-step automation and proactive agent communication.
🧠 GPT-5
AI · Neutral · arXiv – CS AI · May 28 · 6/10
🧠Researchers developed a hybrid system combining formal symbolic planning with large language models to improve capability-based planning in industrial automation. The system integrates natural-language interaction, explainability, and human-approved knowledge model adaptation, achieving high accuracy across planning and query tasks while maintaining formal correctness guarantees.
AI · Neutral · arXiv – CS AI · May 28 · 6/10
🧠SmartIterator is a visual analytics framework that helps data scientists systematically evaluate and choose between multiple unsupervised learning results across parameter sweeps. The approach operationalizes structured six-phase workflows for three clustering and topic-modeling method families, enabling informed decision-making by visualizing data grouping quality, stability, membership confidence, and domain context simultaneously.
AI · Neutral · arXiv – CS AI · May 27 · 6/10
🧠Researchers propose an algorithm that uses large language models to generate portfolios of optimization models rather than single outputs, addressing the reliability gap in LLM-generated solutions. The method leverages LLMs in dual roles—as generative and evaluative components—with theoretical guarantees that high-quality candidates appear in the portfolio as long as either role aligns with human preferences.
AI · Bullish · arXiv – CS AI · May 11 · 6/10
🧠Researchers have developed an AI Teaching & Learning Assistant, a Moodle plugin using Retrieval-Augmented Generation (RAG) to provide students with Socratic tutoring while enabling educators to supervise content generation. The system grounds LLM responses in teacher-provided materials to minimize hallucinations and misinformation, achieving high faithfulness scores (0.97) and strong user satisfaction (4.00/5.00 rating).
AI · Neutral · arXiv – CS AI · May 4 · 6/10
🧠Researchers propose a trust framework for AI agent skills—reusable code packages that extend language models—treating them as untrusted by default until verified. The approach introduces verification levels, capability gates, and correctness criteria to enable sustainable human-in-the-loop oversight without operational bottlenecks.
AI · Neutral · arXiv – CS AI · May 1 · 6/10
🧠Researchers introduce a lightweight LLM agent architecture that uses first- and second-order state dynamics to model gradual clinical concern escalation rather than abrupt threshold-based responses. The approach makes AI decision-making more transparent by revealing sustained risk signals before escalation, enabling better human oversight in clinical settings.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers propose a reactor-model-of-computation approach using the Lingua Franca framework to address nondeterminism challenges in AI-powered human-in-the-loop cyber-physical systems. The study uses an agentic driving coach as a case study to demonstrate how foundation models like LLMs can be deployed in safety-critical applications while maintaining deterministic behavior despite unpredictable human and environmental variables.
AI · Bullish · arXiv – CS AI · Mar 16 · 6/10
🧠Researchers developed a human-in-the-loop LLM system for grading handwritten mathematics assessments that reduces grading time by 23% while maintaining accuracy comparable to manual grading. The system combines automated scanning, multi-pass LLM scoring, consistency checks, and mandatory human verification to handle pen-and-paper tests at scale.
AI · Bullish · arXiv – CS AI · Mar 11 · 6/10
🧠Researchers introduce DexHiL, a human-in-the-loop framework for improving Vision-Language-Action models in robotic dexterous manipulation tasks. The system allows real-time human corrections during robot execution and demonstrates 25% better success rates compared to standard offline training methods.
AI · Bullish · arXiv – CS AI · Mar 9 · 6/10
🧠Researchers introduce PONTE, a human-in-the-loop framework that creates personalized, trustworthy AI explanations by combining user preference modeling with verification modules. The system addresses the challenge of one-size-fits-all AI explanations by adapting to individual user expertise and cognitive needs while maintaining faithfulness and reducing hallucinations.
AI · Bullish · arXiv – CS AI · Feb 27 · 6/10
🧠Researchers developed a framework for analyzing AI diagnostic systems in clinical settings by preserving original AI inferences and comparing them with physician corrections. The study of 21 dermatological cases showed 71.4% exact agreement between AI and physicians, with 100% comprehensive concordance when using structured analysis methods.
AI · Neutral · arXiv – CS AI · Apr 13 · 5/10
🧠MuTSE is an interactive web application designed to evaluate Large Language Model outputs for text simplification tasks across multiple prompting strategies and proficiency levels. The tool addresses a methodological gap in NLP research by providing researchers and educators with a structured, visual framework for comparing prompt-model combinations in real-time.
AI · Bullish · arXiv – CS AI · Apr 7 · 4/10
🧠Researchers developed CODE-GEN, a human-in-the-loop AI system that uses retrieval-augmented generation to create multiple-choice programming questions for educational purposes. The system achieved 79.9% to 98.6% success rates across seven pedagogical dimensions when evaluated by subject-matter experts, demonstrating strong performance in computational verification tasks while still requiring human expertise for complex instructional design.
AI · Neutral · arXiv – CS AI · Mar 12 · 5/10
🧠Research comparing human-in-the-loop versus automated chain-of-thought prompting for behavioral interview evaluation found that human involvement significantly outperforms automated methods. The human approach required 5x fewer iterations, achieved 100% success rate versus 84% for automated methods, and showed substantial improvements in confidence and authenticity scores.
AI · Neutral · arXiv – CS AI · Mar 9 · 4/10
🧠Researchers conducted a qualitative study analyzing Human-in-the-Loop (HITL) themes in AI application development through diary studies and expert interviews. The study identified four key themes around AI governance, iterative refinement, system lifecycle constraints, and human-AI collaboration to guide future HITL framework design.