#semantic-search News & Analysis

29 articles tagged with #semantic-search. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 11 · 7/10

Semantic search for 100M+ galaxy images using AI-generated captions

Researchers developed AION-Search, an AI-powered semantic search engine that catalogs over 100 million galaxy images using Vision-Language Models to generate captions and create searchable embeddings without manual labeling. The system achieved state-of-the-art performance in discovering rare astronomical phenomena and identified 36 new extragalactic stellar stream candidates, while offering a generalizable approach for making large unlabeled scientific image archives semantically searchable.

AI · Bullish · arXiv – CS AI · Jun 1 · 7/10

DynaTree: Dynamic Agentic Retrieval Tree for Time-Sensitive News Retrieval

DynaTree is a two-stage framework for efficient news retrieval that combines offline agentic reasoning with lightweight online subtree selection, achieving significant improvements in real-world deployment. The system demonstrated a 59-73% survival rate versus 32-53% for fixed approaches in production A/B testing, highlighting the practical value of persistent semantic expansion for time-sensitive information retrieval.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources

OmniRetrieval is a new framework that enables unified retrieval across heterogeneous knowledge sources—including unstructured text, relational databases, knowledge graphs, and property graphs—by translating natural language queries into source-native queries rather than forcing all data into a homogenized format. The system demonstrates superior performance compared to single-source retrievers across 13 datasets and 309 knowledge bases, positioning it as a general-purpose interface that preserves the structural advantages of each knowledge source.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

No More K-means: Single-Stage Sparse Coding for Efficient Multi-Vector Retrieval

Researchers introduce Single-stage Sparse Retrieval (SSR), a new approach that replaces clustering-based compression with sparse autoencoders for multi-vector retrieval systems. The method achieves 15x faster indexing, 50% lower retrieval latency, and improved accuracy compared to ColBERTv2, addressing critical efficiency bottlenecks in large-scale information retrieval.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

GraphDancer: Training LLMs to Explore and Reason over Graphs via Two-Stage Curriculum Post-Training

GraphDancer is a new post-training framework that enables large language models to reason over heterogeneous graph-structured data by combining natural-language reasoning with graph function execution. The two-stage curriculum approach uses structural complexity ordering to teach models to explore and reason over graphs, achieving strong cross-domain generalization with only a 3B parameter backbone.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Learning and Reusing Policy Decompositions for Hierarchical Generalized Planning with LLM Agents

Researchers introduce HCL-GP, a machine learning approach that enables large language model agents to learn and reuse hierarchical task decompositions for improved performance on complex applications. The method achieves 98.2% accuracy on standard tasks and demonstrates significant improvements over static synthesis approaches, particularly benefiting open-source models through dynamic component reuse.

AI · Neutral · arXiv – CS AI · Mar 26 · 7/10

An In-Depth Study of Filter-Agnostic Vector Search on a PostgreSQL Database System: [Experiments and Analysis]

Researchers conducted the first comprehensive study of filter-agnostic vector search algorithms in a production PostgreSQL database system, revealing that real-world performance differs significantly from isolated library testing. The study found that system-level overheads often outweigh theoretical algorithmic benefits, and that clustering-based approaches like ScaNN frequently outperform graph-based methods like NaviX/ACORN in practice.

AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 2

ScaleDoc: Scaling LLM-based Predicates over Large Document Collections

ScaleDoc is a new system that enables efficient semantic analysis of large document collections using LLMs by combining offline document representation with lightweight online filtering. The system achieves 2x speedup and reduces expensive LLM calls by up to 85% through contrastive learning and adaptive cascade mechanisms.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Bidirectional Semantic Complementary Tool Retrieval for Remote Sensing Agents

Researchers propose a bidirectional semantic complementary tool retrieval (BSCTR) method to improve how LLM-based agents select appropriate tools for remote sensing tasks. The approach addresses a fundamental mismatch between high-level user queries and detailed tool documentation by enhancing queries with decomposed subtasks and enriching tool descriptions with contextual dependencies, demonstrating improved performance on specialized remote sensing benchmarks.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

ArtiFact: A Large-Scale Multi-Modal Cultural Heritage Dataset

Researchers introduce ArtiFact, a large-scale multi-modal dataset containing 651,045 museum records from three major art institutions combined with images, text, and structured data. The dataset benchmarks AI systems on cross-modal error detection and semantic query processing tasks, revealing significant challenges in detecting domain-specific errors and handling culturally nuanced information retrieval.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Kernel Affine Hull Machines as Compute-Efficient Encoders for Frozen Semantic Spaces

Researchers propose Kernel Affine Hull Machines (KAHM) as a lightweight alternative to transformer-based neural encoders for semantic search in frozen representation spaces. The method achieves 8.53x faster query encoding while maintaining competitive retrieval performance, offering practical efficiency gains for production deployment scenarios.

AI · Bullish · arXiv – CS AI · Jun 4 · 6/10

DSL-Topic: Improving Topic Modeling by Distilling Soft Labels from Language Models

Researchers introduce DSL-Topic, a novel framework that improves neural topic modeling by distilling soft labels from language models rather than relying on traditional bag-of-words reconstruction. The approach leverages LM-generated contextual signals to produce higher-quality topics with better coherence and semantic alignment, demonstrating significant improvements over existing baselines.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

A Fixed-Budget, Cluster-Aware Standard for LLM-as-a-Judge Evaluation: A Multi-Hop RAG Stress Test

Researchers propose a standardized measurement protocol for evaluating retrieval-augmented generation (RAG) systems using LLM judges, addressing inconsistencies in how semantic search quality is assessed. The standard fixes key variables like evidence budget and prompt while requiring cluster-aware statistical testing, revealing that previous comparisons may have overstated progress and that traditional BM25 retrieval outperforms pure semantic methods under controlled conditions.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings

Clark Hash is a new compression codec that reduces neural embedding storage from 1,536 bytes to 48 bytes (32x compression) using deterministic sparse Johnson-Lindenstrauss projection and scalar quantization. The method requires no training, learned codebooks, or corpus statistics, achieving 0.91+ correlation with dense cosine similarity scores on multilingual sentence-embedding benchmarks.
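As a rough illustration of the two ingredients named in the summary above (a deterministic sparse Johnson-Lindenstrauss projection followed by scalar quantization), here is a minimal sketch. It is not the Clark Hash codec itself: it assumes that 1,536 bytes means 384 float32 values and that the 48-byte code holds 48 signed 8-bit values, neither of which the summary states. The function names, the hash-derived bucket-and-sign scheme, and the `seed` parameter are all hypothetical.

```python
import hashlib
import math
import struct

OUT_DIM = 48  # assumption: one signed byte per output coordinate -> 48-byte code


def bucket_and_sign(index, out_dim=OUT_DIM, seed=0):
    """Derive a (bucket, sign) pair for an input coordinate from a hash.

    Nothing is trained or stored: the projection is recomputed from the
    coordinate index alone, which is what makes this sketch stateless.
    """
    digest = hashlib.blake2b(f"{seed}:{index}".encode(), digest_size=8).digest()
    h = int.from_bytes(digest, "little")
    return (h >> 1) % out_dim, (1 if h & 1 else -1)


def sparse_project(vec, out_dim=OUT_DIM, seed=0):
    """Sparse JL-style projection: each input lands in one bucket with a +/- sign."""
    out = [0.0] * out_dim
    for i, x in enumerate(vec):
        bucket, sign = bucket_and_sign(i, out_dim, seed)
        out[bucket] += sign * x
    return out


def quantize_int8(vec):
    """Scalar quantization to signed bytes, scaled by the largest magnitude."""
    peak = max(abs(x) for x in vec) or 1.0
    return struct.pack(f"{len(vec)}b", *(round(127 * x / peak) for x in vec))


def encode(vec, seed=0):
    """Project, then quantize; the per-vector scale is dropped on purpose."""
    return quantize_int8(sparse_project(vec, seed=seed))


def code_cosine(a, b):
    """Cosine similarity computed directly on two byte codes."""
    xs = struct.unpack(f"{len(a)}b", a)
    ys = struct.unpack(f"{len(b)}b", b)
    dot = sum(x * y for x, y in zip(xs, ys))
    norm = math.sqrt(sum(x * x for x in xs)) * math.sqrt(sum(y * y for y in ys))
    return dot / norm if norm else 0.0
```

Calling `encode` on a 384-value list returns a 48-byte code, and `code_cosine` compares two such codes without decoding back to floats. Dropping the per-vector scale is harmless here because cosine similarity ignores magnitude.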

AI · Neutral · arXiv – CS AI · May 28 · 6/10

MGRetrieval: Memory-Guided Reflective Retrieval for Long-Term Dialogue Agents

Researchers introduce MGRetrieval, a novel retrieval strategy for long-term dialogue agents that uses semantic memory structures to guide multi-step retrieval rather than one-shot approaches. The method improves performance on dialogue benchmarks by 8-11% while maintaining computational efficiency, addressing a key limitation in LLM-based conversational systems.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Query Symbolically or Retrieve Semantically? A Dataset and Method for Semi-Structured Question Answering

Researchers introduce DualGraph, a retrieval-augmented generation framework that combines semantic and symbolic approaches to improve question answering on semi-structured data. The system uses dual knowledge graph representations alongside a new benchmark dataset (SpecsQA) from e-commerce, demonstrating superior performance over existing dense-retrieval and graph-based methods.

AI · Bullish · arXiv – CS AI · May 27 · 6/10

Knowledge Graphs as the Missing Data Layer for LLM-Based Industrial Asset Operations

Researchers demonstrate that knowledge graphs significantly outperform traditional document stores for LLM-based industrial asset operations, achieving 100% accuracy on 467 maintenance scenarios compared to 65% with flat data structures. The study reveals that data architecture, not LLM orchestration design, is the primary performance bottleneck in structured operational domains.

🏢 Hugging Face · 🧠 GPT-4

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Do not copy and paste! Rewriting strategies for code retrieval

Researchers evaluated multiple code retrieval strategies using LLM-based rewriting, finding that full natural language transcription with query-corpus augmentation achieves the largest gains but corpus-only approaches often degrade performance. They introduced Delta H (token entropy) as a cheap, rewriter-agnostic metric to predict when LLM rewriting justifies its computational cost.

AI · Neutral · arXiv – CS AI · May 12 · 5/10

Matching Meaning at Scale: Evaluating Semantic Search for 18th-Century Intellectual History through the Case of Locke

Researchers evaluate semantic search as a tool for analyzing 18th-century intellectual history, specifically tracking how John Locke's ideas circulated through paraphrases and implicit references. While semantic search substantially outperforms traditional lexical methods at capturing meaning-level correspondences, linguistic analysis reveals that retrieval remains constrained by surface-level vocabulary overlap, suggesting both promise and limitations for historical corpus analysis.

AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

Integrating Graphs, Large Language Models, and Agents: Reasoning and Retrieval

A comprehensive survey examines how Large Language Models can be effectively integrated with graph-based data structures to improve reasoning, retrieval, and decision-making across domains. The research categorizes integration approaches by purpose, graph type, and strategy, providing practitioners with guidance on selecting appropriate techniques for specific applications in healthcare, finance, robotics, and other fields.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

X-SYS: A Reference Architecture for Interactive Explanation Systems

Researchers introduce X-SYS, a reference architecture for building interactive explanation systems that operationalize explainable AI (XAI) across production environments. The framework addresses the gap between XAI algorithms and deployable systems by organizing around four quality attributes (scalability, traceability, responsiveness, adaptability) and five service components, with SemanticLens as a concrete implementation for vision-language models.

AI · Neutral · arXiv – CS AI · Apr 14 · 6/10

Domain-Specific Data Generation Framework for RAG Adaptation

RAGen is a new framework for generating domain-specific training data to improve Retrieval-Augmented Generation (RAG) systems. The system creates question-answer-context triples using semantic chunking, concept extraction, and Bloom's Taxonomy principles, enabling faster adaptation of LLMs to specialized domains like scientific research and enterprise knowledge bases.

AI · Bullish · arXiv – CS AI · Mar 26 · 6/10

MDKeyChunker: Single-Call LLM Enrichment with Rolling Keys and Key-Based Restructuring for High-Accuracy RAG

Researchers introduce MDKeyChunker, a three-stage pipeline that improves RAG (Retrieval-Augmented Generation) systems by using structure-aware chunking of Markdown documents, single-call LLM enrichment, and semantic key-based restructuring. The system achieves superior retrieval performance with Recall@5=1.000 using BM25 over structural chunks, significantly improving upon traditional fixed-size chunking methods.
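For readers unfamiliar with the Recall@5 figure quoted above, the sketch below shows one common way to compute Recall@k: the fraction of a query's relevant items that appear in the top k results, averaged over queries. The summary does not say which definition the paper uses (some work reports a hit rate instead), so treat this as an assumption; the function names and the example IDs are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of relevant items found among the top-k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant items")
    hits = relevant & set(ranked_ids[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(runs, k=5):
    """Average Recall@k over (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        raise ValueError("need at least one query")
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)


# Hypothetical chunk IDs: the second query's relevant chunk is ranked sixth,
# so it falls outside the top 5 and contributes zero recall.
example_runs = [
    (["c1", "c7", "c3", "c9", "c2", "c4"], ["c3"]),
    (["c5", "c6", "c7", "c8", "c9", "c1"], ["c1"]),
]
example_score = mean_recall_at_k(example_runs, k=5)
```

Under this definition, a score of 1.000 means every relevant chunk for every evaluated query was ranked within the top five.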

🏢 OpenAI

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

Learning Retrieval Models with Sparse Autoencoders

Researchers introduce SPLARE, a new method that uses sparse autoencoders (SAEs) to improve learned sparse retrieval in language models. The technique outperforms existing vocabulary-based approaches in multilingual and out-of-domain settings, with SPLARE-7B achieving top results on multilingual retrieval benchmarks.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 14

WisPaper: Your AI Scholar Search Engine

WisPaper is a new AI-powered academic search system that combines semantic search capabilities with automated paper validation and organization tools. The system achieved 22.26% recall on TaxoBench and 93.70% validation accuracy, addressing key limitations in current academic search engines by integrating discovery, organization, and monitoring workflows.
