AI · Bearish · arXiv – CS AI · Jun 1 · 7/10
🧠A position paper challenges current ECG representation learning benchmarking practices, arguing that evaluation methods are too narrow and miss clinically meaningful objectives. The authors demonstrate that random encoder baselines surprisingly match state-of-the-art pre-training on many tasks, suggesting the field's conclusions about model performance are unreliable without proper evaluation frameworks.
AI · Neutral · arXiv – CS AI · May 29 · 7/10
🧠Researchers introduced MedCase-Structured, a synthetic dataset that converts unstructured clinical text into standardized HL7 FHIR format for evaluating large language models in realistic healthcare settings. The study reveals that LLMs perform significantly worse on structured clinical data than plain text, highlighting a critical gap between academic benchmarks and real-world deployment requirements.
AI · Neutral · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce a controlled-invariance methodology to distinguish whether hallucination detection in large language models actually evaluates reasoning quality or merely exploits surface-level answer cues. Their lightweight TRACT model demonstrates that effective detection relies primarily on lexical trajectory features rather than complex learned representations, suggesting current detection methods conflate endpoint artifacts with genuine reasoning validation.
AI · Bullish · arXiv – CS AI · May 7 · 7/10
🧠Researchers present RGF-AFFD, an integrated governance framework for AI-driven fraud detection in U.S. banking that unifies compliance requirements from four regulatory sources (OCC, the Federal Reserve's SR 11-7, CFPB, FinCEN). The framework includes a Regulatory Digital Twin meta-model that benchmarks six AI architectures, with an LSTM+XGBoost ensemble achieving 0.9289 ROC-AUC, and establishes continuous monitoring protocols to satisfy fragmented regulatory requirements simultaneously.
AI · Bearish · arXiv – CS AI · May 1 · 7/10
🧠Researchers systematically tested whether large language models can maintain assigned adversarial roles when analyzing political statements, finding that models frequently fail to sustain their epistemic stance because training knowledge overrides role instructions. The study identifies "Epistemic Role Override" as the mechanism behind role failures and reports significant variance between models (Mistral Large achieved 67% role fidelity versus 39% for Claude Sonnet). This raises concerns about the reliability of multi-agent LLM systems designed to provide balanced political discourse analysis.
🏢 Perplexity · 🧠 Claude
AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠Researchers propose Vision-Language Logical Consistency Metric (VL-LCM), a novel evaluation framework for multimodal large language models that assesses logical coherence without requiring ground-truth annotations. Testing 11 MLLMs across benchmarks including MMMU and NaturalBench reveals that while accuracy has improved significantly, logical consistency substantially lags, suggesting current models make confident but logically inconsistent predictions.
AI · Neutral · arXiv – CS AI · May 7 · 6/10
🧠Researchers demonstrate that incorporating think-aloud verbal protocols alongside behavioral data significantly improves automated cognitive model discovery using large language models. The approach shifts discovered models toward different structural classes, revealing decision-making mechanisms invisible to behavior-only analysis, particularly in risky decision-making contexts.
AI · Neutral · arXiv – CS AI · May 4 · 6/10
🧠Researchers systematically analyze the design space of LLM-based social simulations, examining how different architectural choices—particularly base model selection and network topology—affect simulated agent behavior and opinion formation. The study reveals non-trivial interactions between parameters and identifies the choice of underlying LLM as the most critical factor determining simulation outcomes.
AI · Neutral · arXiv – CS AI · May 1 · 6/10
🧠Researchers evaluated 17 large language models on their ability to implement agent-based models from standardized specifications, finding that while GPT-4.1 and Claude 3.7 Sonnet produce statistically valid implementations, executability alone doesn't guarantee scientific reliability. The study reveals both significant promise and critical limitations in using LLMs as automated tools for scientific model engineering and replication.
🧠 GPT-4 · 🧠 Claude
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 3
🧠Researchers introduce DINCO (Distractor-Normalized Coherence), a method to improve confidence calibration in large language models by using self-generated alternative claims to reduce overconfidence bias. The approach addresses LLM suggestibility issues that cause models to express high confidence on low-accuracy outputs, potentially improving AI safety and trustworthiness.