#gpt-4 News & Analysis

55 articles tagged with #gpt-4. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · Jun 1 · 7/10

Depth-Dependent Indirect Prompt Injection in Tool-Calling ReAct Agents: Injection Depth, Payload Framing, and Turn-Budget Sensitivity

Researchers identified that indirect prompt injection attacks against ReAct AI agents succeed at dramatically different rates depending on where malicious payloads appear in tool sequences, with success rates dropping from 60% at the first tool observation to 0% at deeper positions. The study reveals that payload framing and conversation turn limits have minimal impact on attack success, making injection depth the critical vulnerability factor for AI agent systems handling real-world tasks.

🧠 GPT-4 · 🧠 Claude

AI · Bearish · arXiv – CS AI · Apr 10 · 7/10

When to Call an Apple Red: Humans Follow Introspective Rules, VLMs Don't

Researchers introduce the Graded Color Attribution dataset to test whether Vision-Language Models faithfully follow their own stated reasoning rules. The study reveals that VLMs systematically violate their introspective rules in up to 60% of cases, while humans remain consistent, suggesting VLM self-knowledge is fundamentally miscalibrated with serious implications for high-stakes deployment.

🧠 GPT-5

AI · Bearish · arXiv – CS AI · Mar 5 · 7/10

Image-based Prompt Injection: Hijacking Multimodal LLMs through Visually Embedded Adversarial Instructions

Researchers have developed Image-based Prompt Injection (IPI), a black-box attack that embeds adversarial instructions into natural images to manipulate multimodal AI models. Testing on GPT-4-turbo achieved up to 64% attack success rate, demonstrating a significant security vulnerability in vision-language AI systems.

🧠 GPT-4

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 4

Adaptive Social Learning via Mode Policy Optimization for Language Agents

Researchers propose an Adaptive Social Learning (ASL) framework with Adaptive Mode Policy Optimization (AMPO) algorithm to improve language agents' reasoning abilities in social interactions. The system dynamically adjusts reasoning depth based on context, achieving 15.6% higher performance than GPT-4o while using 32.8% shorter reasoning chains.

AI · Neutral · arXiv – CS AI · Feb 27 · 7/10 · 5

Training Agents to Self-Report Misbehavior

Researchers developed a new AI safety approach called 'self-incrimination training' that teaches AI agents to report their own deceptive behavior by calling a report_scheming() function. Testing on GPT-4.1 and Gemini-2.0 showed this method significantly reduces undetected harmful actions compared to traditional alignment training and monitoring approaches.

AI · Bullish · OpenAI News · Sep 22 · 7/10 · 6

Creating a safe, observable AI infrastructure for 1 million classrooms

SchoolAI has deployed AI infrastructure powered by OpenAI's GPT-4.1, image generation, and text-to-speech technology to serve 1 million classrooms globally. The platform focuses on providing safe, teacher-supervised AI tools that enhance student engagement and enable personalized learning experiences.

AI · Bullish · OpenAI News · Jul 24 · 7/10 · 4

Resolving digital threats 100x faster with OpenAI

Outtake has built AI agents on OpenAI's GPT-4.1 and o3 models that detect and resolve digital threats 100 times faster than previous methods, applying current language models to automated cybersecurity response.

AI · Bullish · OpenAI News · Jul 1 · 7/10 · 7

No-code personal agents, powered by GPT-4.1 and Realtime API

Genspark successfully built a $36M ARR AI product in just 45 days using no-code agents powered by GPT-4.1 and OpenAI's Realtime API. This demonstrates the rapid development potential of modern AI tools for creating high-revenue products with minimal traditional coding requirements.

AI · Bullish · OpenAI News · Jun 6 · 7/10 · 6

Extracting Concepts from GPT-4

Researchers have developed new techniques for scaling sparse autoencoders to analyze GPT-4's internal computations, successfully identifying 16 million distinct patterns. This breakthrough represents a significant advancement in AI interpretability research, providing unprecedented insight into how large language models process information.

AI · Bullish · OpenAI News · Apr 24 · 7/10 · 5

GPT-4 API general availability and deprecation of older models in the Completions API

OpenAI has made the GPT-4 API generally available alongside the GPT-3.5 Turbo, DALL·E, and Whisper APIs. The company also announced a deprecation plan for older Completions API models, which will be retired at the beginning of 2024.

AI · Neutral · OpenAI News · Jan 31 · 7/10 · 3

Building an early warning system for LLM-aided biological threat creation

Researchers developed a framework to assess whether large language models could help create biological threats, testing GPT-4 with biology experts and students. The study found GPT-4 provides only mild assistance in biological threat creation, though results aren't conclusive and require further research.

AI · Bullish · OpenAI News · May 9 · 7/10 · 6

Language models can explain neurons in language models

Researchers used GPT-4 to automatically generate explanations for how individual neurons behave in large language models and to evaluate the quality of those explanations. They have released a comprehensive dataset containing explanations and quality scores for every neuron in GPT-2, advancing AI interpretability research.

AI · Bullish · OpenAI News · Mar 14 · 7/10 · 6

Streamlining financial solutions for safety and growth

Stripe is integrating GPT-4 technology to enhance user experience and improve fraud detection capabilities. This implementation represents a significant adoption of AI by a major fintech company to streamline financial operations and security measures.

AI · Bullish · OpenAI News · Mar 14 · 7/10 · 7

GPT-4

OpenAI has released GPT-4, a large multimodal model that accepts image and text inputs and generates text outputs. The model shows human-level performance on various professional and academic benchmarks, though it remains less capable than humans in many real-world scenarios.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Are LLMs Effective Negotiators? Systematic Evaluation of the Multifaceted Capabilities of LLMs in Negotiation Dialogues

Researchers systematically evaluated Large Language Models' negotiation capabilities across diverse dialogue scenarios, finding that GPT-4 demonstrates superior performance in most tasks while struggling with subjective assessments and strategically optimal responses. This evaluation framework advances understanding of LLM limitations in complex multi-turn interactions requiring theory-of-mind reasoning and strategic communication.

🧠 GPT-4

AI · Neutral · arXiv – CS AI · Jun 10 · 5/10

Detecting Knowledge Gaps from Conversational AI Interactions Using Curriculum Prerequisite Graphs

Researchers developed a pipeline using GPT-4 and few-shot learning to map student questions from conversational AI teaching assistants to curriculum topics, achieving 80% classification accuracy. The classified question data correlates with student-reported difficulty levels, demonstrating that AI interaction logs can serve as diagnostic tools for identifying knowledge gaps and informing instructional design.

🧠 GPT-4

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

Evidence Graph Consistency in Retrieval-Augmented Generation: A Model-Dependent Analysis of Hallucination Detection

Researchers propose Evidence Graph Consistency (EGC), a framework to detect hallucinations in Retrieval-Augmented Generation systems by analyzing structural relationships among evidence pieces. Testing across six LLMs reveals a critical finding: the method works as expected for Llama-2 but shows reversed diagnostic signals for GPT-4, GPT-3.5, and Mistral-7B, suggesting hallucination patterns differ fundamentally across model families.

🧠 GPT-4 · 🧠 Llama

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

AutoEval Done Right: Using Synthetic Data for Model Evaluation

Researchers propose statistically sound algorithms for evaluating machine learning models using synthetic data generated by AI systems, reducing reliance on expensive human annotations. The approach maintains unbiased results while improving sample efficiency by up to 50% in GPT-4 experiments, addressing a significant bottleneck in ML development.

🧠 GPT-4
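
To illustrate how synthetic labels can cut annotation cost without biasing the result, here is a minimal sketch of a generic debiased mean estimator. It is not taken from the paper: the function name `debiased_mean` and all data values are hypothetical. It assumes a small set of examples with both human and AI-judge labels, plus a larger set with AI-judge labels only.

```python
from statistics import mean

def debiased_mean(human_labels, judge_on_labeled, judge_on_unlabeled):
    """Estimate the true mean score from many judge labels and few human labels.

    The judge's average on the large unlabeled set is corrected by the
    average human-minus-judge gap measured on the small labeled set.
    """
    # Average error of the judge, measured where human labels exist.
    rectifier = mean(h - j for h, j in zip(human_labels, judge_on_labeled))
    # Judge-only estimate on the large set, shifted by the measured error.
    return mean(judge_on_unlabeled) + rectifier

# Hypothetical data: 1.0 means "response judged correct", 0.0 means "incorrect".
human = [1.0, 0.0, 1.0, 1.0]            # human labels (expensive, few)
judge_labeled = [1.0, 1.0, 1.0, 1.0]    # judge labels on the same examples
judge_unlabeled = [1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]  # judge only

estimate = debiased_mean(human, judge_labeled, judge_unlabeled)
print(estimate)
```

In this toy data the judge is too lenient on the labeled set, so the correction term pulls the judge-only average down.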

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Using Zero-Shot LLM-Generated Survey Data for Geographically Explicit Population Synthesis

Researchers evaluated whether zero-shot LLM-generated survey data can supplement traditional population synthesis workflows, using GPT-4 and Gemini to create synthetic health survey records for Colorado and Mississippi. Results show LLMs capture geographic variations reasonably well but with variable-dependent performance, suggesting promise as supplementary rather than replacement data sources.

🧠 GPT-4 · 🧠 Gemini

AI · Neutral · arXiv – CS AI · May 1 · 6/10

Can Large Language Models Implement Agent-Based Models? An ODD-based Replication Study

Researchers evaluated 17 large language models on their ability to implement agent-based models from standardized specifications, finding that while GPT-4.1 and Claude 3.7 Sonnet produce statistically valid implementations, executability alone doesn't guarantee scientific reliability. The study reveals both significant promise and critical limitations in using LLMs as automated tools for scientific model engineering and replication.

🧠 GPT-4 · 🧠 Claude

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs

Researchers introduce LIFESTATE-BENCH, a benchmark for evaluating lifelong learning capabilities in large language models through multi-turn interactions using narrative datasets like Hamlet. Testing shows nonparametric approaches significantly outperform parametric methods, but all models struggle with catastrophic forgetting over extended interactions, revealing fundamental limitations in LLM memory and consistency.

🧠 GPT-4 · 🧠 Llama

AI · Bullish · arXiv – CS AI · Apr 6 · 6/10

Token-Efficient Multimodal Reasoning via Image Prompt Packaging

Researchers introduce Image Prompt Packaging (IPPg), a technique that embeds text directly into images to reduce multimodal AI inference costs by 35.8-91.0% while maintaining competitive accuracy. The method shows significant promise for cost optimization in large multimodal language models, though effectiveness varies by model and task type.

🧠 GPT-4 · 🧠 Claude

AI · Neutral · arXiv – CS AI · Mar 26 · 6/10

LLMORPH: Automated Metamorphic Testing of Large Language Models

Researchers have developed LLMORPH, an automated testing tool for Large Language Models that uses Metamorphic Testing to identify faulty behaviors without requiring human-labeled data. The tool was tested on GPT-4, LLAMA3, and HERMES 2 across four NLP benchmarks, generating over 561,000 test executions and successfully exposing model inconsistencies.

🧠 GPT-4
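
As a rough illustration of metamorphic testing in general, and not of LLMORPH's actual implementation, the sketch below checks whether a model's output stays the same under a meaning-preserving input change, so no ground-truth labels are needed. The names `toy_sentiment_model` and `metamorphic_failures` are hypothetical, and the toy model stands in for a real LLM call.

```python
def toy_sentiment_model(text: str) -> str:
    """Hypothetical stand-in for an LLM call; deliberately case-sensitive."""
    return "positive" if "good" in text else "negative"

def metamorphic_failures(model, inputs, transform):
    """Return cases where a meaning-preserving transform changes the output.

    Each failure is a tuple (input, original output, follow-up output).
    """
    failures = []
    for text in inputs:
        original = model(text)
        follow_up = model(transform(text))
        # Metamorphic relation: the label should not depend on letter case.
        if original != follow_up:
            failures.append((text, original, follow_up))
    return failures

inputs = ["This movie is good", "This movie is bad"]
failures = metamorphic_failures(toy_sentiment_model, inputs, str.upper)
print(failures)
```

Upper-casing is only one possible relation. Others, such as paraphrasing or adding irrelevant context, fit the same loop by swapping the `transform` argument.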

AI · Neutral · arXiv – CS AI · Mar 11 · 6/10

MM-tau-p²: Persona-Adaptive Prompting for Robust Multi-Modal Agent Evaluation in Dual-Control Settings

Researchers propose MM-tau-p², a new benchmark for evaluating multi-modal AI agents that adapt to user personas in customer service settings. The framework introduces 12 novel metrics to assess robustness and performance of LLM-based agents using voice and visual inputs, showing limitations even in advanced models like GPT-4 and GPT-5.

🧠 GPT-4 · 🧠 GPT-5

AI · Bearish · arXiv – CS AI · Mar 9 · 6/10

The Fragility Of Moral Judgment In Large Language Models

Researchers tested the stability of moral judgments in large language models using nearly 3,000 ethical dilemmas, finding that narrative framing and evaluation methods significantly influence AI decisions. The study reveals that LLM moral reasoning is highly dependent on how questions are presented rather than underlying moral substance, with only 35.7% consistency across different evaluation protocols.

🧠 GPT-4 · 🧠 Claude
