#domain-specific-ai News & Analysis

28 articles tagged with #domain-specific-ai. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Bullish · arXiv – CS AI · May 29 · 7/10

Compass: Navigating Global Marine Lead Data Integration through Expert-Guided LLM Agent

Researchers introduced Compass, an LLM agent framework that extracts marine lead data from 230,000+ academic papers without fine-tuning, successfully creating the largest integrated marine lead database with 3,751 previously uncatalogued records and 92% accuracy. The expert-guided approach demonstrates how domain-specific knowledge can overcome LLM hallucinations in high-stakes scientific applications.

AI · Bearish · arXiv – CS AI · May 11 · 7/10

What if AI systems weren't chatbots?

A research paper argues that the AI industry's convergence toward chatbot interfaces represents a specific value choice with significant structural downsides, including inadequate performance in complex contexts, workforce deskilling, knowledge homogenization, and environmental costs. The authors propose alternative development paths emphasizing domain-specific tools, pluralistic design, and stronger institutional oversight rather than one-size-fits-all conversational systems.

AI · Bullish · The Verge – AI · May 1 · 7/10

Microsoft wants lawyers to trust its new AI agent in Word documents

Microsoft has launched a specialized AI agent within Word designed specifically for legal teams to streamline contract review and document management tasks. The Legal Agent follows structured workflows based on real legal practice rather than general AI models, handling document edits, negotiation history, and clause-by-clause contract analysis.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Solving Physics Olympiad via Reinforcement Learning on Physics Simulators

Researchers demonstrate that physics simulators can generate synthetic training data for large language models, enabling them to learn physical reasoning without relying on scarce internet QA pairs. Models trained on simulated data show 5-10 percentage point improvements on International Physics Olympiad problems, suggesting simulators offer a scalable alternative for domain-specific AI training.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Multi-Model Synthetic Training for Mission-Critical Small Language Models

Researchers demonstrate a cost-effective approach to training specialized small language models by using LLMs as one-time teachers to generate synthetic training data. By converting 3.2 billion maritime vessel tracking records into 21,543 QA pairs, they fine-tuned Qwen2.5-7B to achieve 75% accuracy on maritime tasks at a fraction of the cost of deploying larger models, establishing a reproducible framework for domain-specific AI applications.

🧠 GPT-4

AI · Neutral · arXiv – CS AI · Mar 26 · 7/10

From Guidelines to Guarantees: A Graph-Based Evaluation Harness for Domain-Specific Evaluation of LLMs

Researchers developed a graph-based evaluation framework that transforms clinical guidelines into dynamic benchmarks for testing domain-specific language models. The system addresses key evaluation challenges by providing contamination resistance, comprehensive coverage, and maintainable assessment tools that reveal systematic capability gaps in current AI models.

AI · Neutral · arXiv – CS AI · Mar 17 · 7/10

An Alternative Trajectory for Generative AI

Researchers propose shifting from large monolithic AI models to domain-specific superintelligence (DSS) societies due to unsustainable energy costs and physical constraints of current generative AI scaling approaches. The alternative involves smaller, specialized models working together through orchestration agents, potentially enabling on-device deployment while maintaining reasoning capabilities.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 2

Saarthi for AGI: Towards Domain-Specific General Intelligence for Formal Verification

Researchers have enhanced the Saarthi AI framework for formal verification, achieving 70% better accuracy in generating SystemVerilog assertions and 50% fewer iterations to reach coverage closure. The framework uses multi-agent collaboration and improved RAG techniques to move toward domain-specific general intelligence for verification tasks.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

POTracker: Optimizing Large Language Models for Standard-Compliant Power Outage Report Generation

Researchers have developed POTracker, a fine-tuned large language model optimized for generating machine-readable power outage reports that comply with U.S. energy sector regulatory standards. The model achieves 86.47% structural accuracy and 51% improvement over existing fine-tuning methods by using a novel loss function that balances textual and structural similarity.

AI · Bullish · arXiv – CS AI · Jun 23 · 6/10

Learning Bug Context for PyTorch-to-JAX Translation with LLMs

Researchers introduce T2J, a benchmark dataset of PyTorch-to-JAX translation bugs paired with developer fixes, addressing the challenge of translating deep-learning code between frameworks. By supplying LLMs with this curated bug-fix data through in-context learning, they achieve up to 20% improvement in translation accuracy, demonstrating that domain-specific bug datasets can significantly enhance code generation reliability.

🧠 GPT-4

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

ChemQuests: A Curated Chemistry Question-Answer Database Extracted from ChemRxiv papers

ChemQuests is a new curated dataset containing 952 question-answer pairs extracted from chemistry research papers, designed to advance chemistry-focused natural language processing. The dataset bridges the gap between rapidly expanding chemistry literature and the need for domain-specific training data for AI models and retrieval systems.

🧠 GPT-4

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

CTIConnect: A Benchmark for Retrieval-Augmented LLMs over Heterogeneous Cyber Threat Intelligence

Researchers introduce CTIConnect, a benchmark for evaluating retrieval-augmented large language models on cyber threat intelligence tasks. The study integrates five heterogeneous CTI sources into 1,860 expert-verified QA pairs across nine tasks, revealing that different task categories require fundamentally different retrieval strategies and that domain-specific approaches outperform generic retrieval methods.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

On Wednesdays, We Ask Questions: Optimizing "Active Listening" in Automated Legal Triage and Referral

Researchers at FETCH have developed a legal triage system using low-cost LLMs to generate follow-up questions that refine legal problem classification, but found that higher-cost models like GPT-4 are necessary for generating quality plain-language questions that elicit relevant applicant information and improve classification accuracy.

🧠 GPT-5

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

Agentic Authoring of Interactive Multiview Visualizations in Genomics

Researchers developed agentic LLM-based systems to democratize the authoring of complex genomics visualizations through natural-language interfaces. By testing six different agent architectures across 159 test cases, they found that agentic iteration substantially improves visualization quality over baseline approaches, though more complex agent configurations provide diminishing returns.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

RadioMaster: Multi-Agent System for Autonomous Radio Signal Generation

Researchers introduce RadioMaster, a multi-agent AI framework that automates the conversion of user instructions into physical radio signals, addressing a critical gap in wireless prototyping. The system combines domain-specific knowledge retrieval, collaborative agent coordination, and hardware verification to outperform existing approaches in signal generation accuracy and configuration viability.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

LocalSearchBench: Benchmarking Agentic Search in Real-World Local Life Services

Researchers introduced LocalSearchBench, a comprehensive benchmark for testing AI agents in local life services, revealing significant performance gaps even among state-of-the-art large reasoning models. The benchmark comprises 1.3M merchant entries and 900 multi-hop reasoning tasks, exposing critical weaknesses in completeness and faithfulness that underscore the need for domain-specific AI agent development.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Agentic Physical AI toward a Domain-Specific Foundation Model for Energy Systems: A Case Study on Nuclear Reactor Control

Researchers propose a domain-specific foundation model for safety-critical physical systems, built on a compact 360M-parameter language model trained on synthetic nuclear reactor simulations instead of relying on general-purpose vision-language models. The approach demonstrates significant reliability improvements in controlled environments but is positioned as one component within a broader verification architecture, not a standalone safety solution.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

PetroBench: A Benchmark for Large Language Models in Petroleum Engineering

Researchers have developed PetroBench, a comprehensive benchmark for evaluating large language models in petroleum engineering, testing eight mainstream LLMs across 1,200 domain-specific questions. The evaluation reveals significant performance gaps, with leading models achieving 72-74% accuracy overall but struggling particularly with factual discrimination in objective questions, suggesting LLMs need substantial improvement before widespread deployment in critical petroleum industry applications.

🧠 Claude · 🧠 Gemini

AI · Bearish · arXiv – CS AI · May 28 · 6/10

Hallucination Behavior in Multimodal LLMs Across Agricultural Image Interpretation and Generation Tasks

A comprehensive study reveals that multimodal large language models exhibit significant hallucination problems in agricultural imaging tasks, with image interpretation achieving only 63-75% zero-shot accuracy and text-to-image generation producing up to 91% biologically inconsistent scenes. These findings highlight critical reliability gaps that could undermine the trustworthiness of AI-driven agricultural platforms.

🧠 GPT-5 · 🧠 Gemini

AI · Neutral · arXiv – CS AI · May 28 · 6/10

BenGER: Benchmarking LLM Systems on Subsumption-Based Legal Reasoning in German Law

Researchers introduce BenGER, a comprehensive benchmark dataset for evaluating large language models on German legal reasoning tasks, comprising 596 exam-style cases and 531 doctrinal reasoning problems. The study demonstrates that LLM-as-a-Judge frameworks can achieve near-human consistency in legal assessment, with human-AI collaboration substantially outperforming unaided human performance.

AI · Bullish · arXiv – CS AI · May 28 · 6/10

Teaching and Evaluating LLMs to Reason About Polymer Design Related Tasks

Researchers introduce PolyBench, a benchmark dataset containing 125K+ polymer design tasks backed by 13M data points, along with a knowledge-augmented reasoning method to improve LLM performance in materials science. Small and mid-sized language models trained on PolyBench achieve results competitive with frontier models, demonstrating practical advancement in AI4Science applications.

AI · Bullish · arXiv – CS AI · May 27 · 6/10

Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry

Researchers developed Chat-ISV, an LLM-enhanced knowledge graph system that organizes fragmented steel industry VOCs literature into a queryable database with 27,180 nodes and 81,779 semantic edges. The system achieved 96.93% precision in answering specialized industrial questions, demonstrating a scalable approach to deploying reliable LLMs in domain-specific applications where hallucination risks are high.

AI · Neutral · arXiv – CS AI · May 4 · 6/10

Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics

Researchers introduce LEGIT, a 24K-instance legal reasoning dataset with hierarchical argument trees that serve as evaluation rubrics for LLM-generated legal reasoning. The study reveals that LLM legal reasoning performance depends critically on both issue coverage and correctness, with RAG and reinforcement learning offering complementary improvements.

AI · Bullish · arXiv – CS AI · Apr 14 · 6/10

Automating Structural Analysis Across Multiple Software Platforms Using Large Language Models

Researchers developed a multi-agent LLM system that automates structural analysis workflows across multiple finite element analysis (FEA) platforms including ETABS, SAP2000, and OpenSees. Using a two-stage architecture that interprets engineering specifications and translates them into platform-specific code, the system achieved over 90% accuracy in 20 representative frame problems, addressing a critical gap in practical AI-assisted engineering deployment.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

Domain-Specific Data Generation Framework for RAG Adaptation

RAGen is a new framework for generating domain-specific training data to improve Retrieval-Augmented Generation (RAG) systems. The system creates question-answer-context triples using semantic chunking, concept extraction, and Bloom's Taxonomy principles, enabling faster adaptation of LLMs to specialized domains like scientific research and enterprise knowledge bases.
