AI · Bullish · arXiv – CS AI · May 28 · 7/10
🧠Researchers demonstrate that knowledge graphs extracted from a single neuroscience textbook can be converted into high-quality training data to fine-tune language models, enabling expert-level reasoning that outperforms larger LLMs while using far fewer parameters. This approach challenges the prevailing assumption that domain expertise requires massive, diverse datasets, showing instead that structured, curated knowledge can produce superior specialized AI systems.
AI · Bearish · arXiv – CS AI · May 27 · 7/10
🧠Researchers developed the Stakeholder Grounding Exercise, a method to evaluate whether text embeddings align with human expert understanding. Studies on Danish policy and US AI use cases reveal neural embeddings underperform human experts by 16-26 percentage points, with misalignment directly impacting downstream clustering tasks.
AI · Neutral · arXiv – CS AI · Apr 10 · 7/10
🧠Researchers demonstrate that standard LLM-as-a-judge methods achieve only 52% accuracy in detecting hallucinations and omissions in mental health chatbots, failing in high-risk healthcare contexts. A hybrid framework combining human domain expertise with machine learning features achieves significantly higher performance (0.717-0.849 F1 scores), suggesting that transparent, interpretable approaches outperform black-box LLM evaluation in safety-critical applications.
AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 4
🧠Researchers have developed EvoSkill, an automated framework that enables AI agents to discover and refine domain-specific skills through iterative failure analysis. The system demonstrated significant performance improvements on specialized tasks, with accuracy gains of 7.3% on financial data analysis and 12.1% on search-augmented QA, while showing transferable capabilities across different domains.
AI · Bullish · arXiv – CS AI · 6d ago · 6/10
🧠Frontier large language models from Anthropic and OpenAI have performed competitively with human experts at annotating natural phenotypes to ontology terms, a previously labor-intensive bottleneck in biological research. When evaluated against the same Gold Standard benchmark used in a 2018 study, these AI agents performed within the range of trained human curators and substantially outperformed prior NLP tools, suggesting significant potential to scale phenotype annotation workflows.
🏢 OpenAI · 🏢 Anthropic
AI · Neutral · arXiv – CS AI · May 28 · 6/10
🧠Researchers introduce a hybrid prediction market combining algorithmic agents and human experts to forecast scientific replicability, demonstrating that collaborative approaches outperform either humans or AI alone. The system trains AI on historical replication data while humans contribute domain expertise through real-time trading, producing more accurate replication forecasts than single-modality baselines.
AI · Neutral · arXiv – CS AI · May 27 · 6/10
🧠Researchers have developed Maat, a specialized AI agent designed to assist competition law experts with legal research by leveraging retrieval-augmented generation (RAG) and tool orchestration. Unlike general-purpose AI assistants, Maat addresses critical gaps in competition law analysis by providing reliable official citations, reducing hallucinations, and offering domain-specific expertise through iterative design with legal professionals.
🧠 ChatGPT · 🧠 Claude
AI · Neutral · arXiv – CS AI · Mar 12 · 6/10
🧠Researchers propose Nurture-First Development (NFD), a new paradigm for building domain-expert AI agents through progressive growth via conversational interaction rather than traditional code-first or prompt-first approaches. The method uses a Knowledge Crystallization Cycle to convert operational dialogue into structured knowledge assets, demonstrated through a financial research agent case study.
AI · Bullish · OpenAI News · Aug 21 · 5/10 · 6
🧠Blue J is transforming tax research by leveraging GPT-4.1 and Retrieval-Augmented Generation to provide AI-powered tools that deliver fast, accurate, and fully-cited tax answers. The company serves tax professionals across the US, Canada, and the UK, combining domain expertise with advanced AI technology for regulated industry applications.