#data-extraction News & Analysis

17 articles tagged with #data-extraction. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · Apple Machine Learning · Apr 20 · 7/10

What Do Your Logits Know? (The Answer May Surprise You!)

Researchers demonstrate that AI model internals reveal far more information than model outputs alone, exposing potential security vulnerabilities where users could extract sensitive data through probing techniques. This systematic study using vision-language models highlights unintended information leakage risks that challenge assumptions about data privacy in deployed AI systems.

AI · Bearish · arXiv – CS AI · Apr 14 · 7/10

ADAM: A Systematic Data Extraction Attack on Agent Memory via Adaptive Querying

Researchers have developed ADAM, a novel privacy attack that exploits vulnerabilities in Large Language Model agents' memory systems through adaptive querying, achieving up to 100% success rates in extracting sensitive information. The attack highlights critical security gaps in modern LLM-based systems that rely on memory modules and retrieval-augmented generation, underscoring the urgent need for privacy-preserving safeguards.

AI · Bearish · arXiv – CS AI · Mar 27 · 7/10

Malicious LLM-Based Conversational AI Makes Users Reveal Personal Information

Researchers conducted a study with 502 participants demonstrating that malicious LLM-based conversational AI systems can be deliberately designed to extract personal information from users through manipulative conversation strategies. The study found that these malicious chatbots significantly outperformed benign versions at collecting personal data, with social psychology-based approaches being most effective while appearing less threatening to users.

🧠 ChatGPT

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

GraphMERT: Efficient and Scalable Distillation of Reliable Knowledge Graphs from Unstructured Data

Researchers introduce GraphMERT, an 80M-parameter AI model that efficiently extracts reliable knowledge graphs from unstructured text data. The system outperforms much larger language models like Qwen3-32B in generating factually accurate and semantically valid knowledge graphs, achieving 69.8% FActScore versus 40.2% for the baseline.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Automatic Extraction of Structured Information from Brain MRI Reports Using an Open-Weight Large Language Model

Researchers evaluated LLaMA 3.1, an open-weight large language model, for extracting structured information from Dutch brain MRI reports. The model achieved high accuracy (80-96%) on visual rating scores and detection tasks, with few-shot prompting further improving performance on numerical variables, demonstrating practical viability for automated medical data extraction in radiology.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Benchmarking Open-Source Layout Detection Models for Data Snapshot Extraction from Institutional Documents

Researchers have developed a benchmark dataset and evaluation framework for extracting data snapshots (figures and tables) from institutional documents like World Bank reports. The study reveals that current open-source layout detection models fail to generalize effectively to operational documents, struggling to distinguish analytical from non-analytical content and often fragmenting composite visual artifacts.

🏢 Hugging Face

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Beyond Text and Tables: Vision-Language Model Integration in ComProScanner for Extracting Materials Data from Scientific Figures with High Accuracy

Researchers have extended ComProScanner, an automated materials data extraction framework, with vision-language model capabilities to extract composition-property data from scientific figures in addition to text and tables. Gemini-3-Flash-Preview achieved 97% composition accuracy on piezoelectric ceramic research, establishing the first fully multimodal literature mining platform for materials science.

🧠 Gemini

AI · Neutral · arXiv – CS AI · Jun 2 · 5/10

SentimentLens: Reconciling Sentiment and Ratings via Dual-Modality in the Hospitality Sector

SentimentLens is an AI system that uses aspect-based sentiment analysis to extract insights from hotel reviews, converting unstructured text into actionable intelligence for hospitality management. The framework reconciles textual sentiment with numerical ratings across 10,000+ reviews to identify service inconsistencies and operational improvement opportunities.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

DTBench: A Synthetic Benchmark for Document-to-Table Extraction

Researchers introduce DTBench, a synthetic benchmark for evaluating large language models on document-to-table extraction tasks. Using a reverse Table2Doc synthesis approach with multi-agent workflows, the benchmark covers 13 subcategories across 5 major capability areas, revealing significant performance gaps and persistent challenges in reasoning and conflict resolution across mainstream LLMs.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Snippet-Driven Supply Chain Discovery with LLMs: Scaling Visibility in China

Researchers propose a snippet-driven method using large language models to construct supply chain knowledge graphs for Chinese firms, achieving 7.2× greater coverage than traditional disclosure databases while reducing computational costs by 251× compared to full-text processing.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Spatial Priming Outperforms Semantic Prompting: A Grid-Based Approach to Improving LLM Accuracy on Chart Data Extraction

Researchers demonstrate that overlaying coordinate grids on chart images significantly improves multimodal LLM accuracy for data extraction tasks, reducing error rates from 25.5% to 19.5%. This spatial priming approach outperforms semantic methods like Chain-of-Thought prompting, suggesting that explicit spatial context is more effective than high-level semantic guidance for current-generation vision-language models.

AI · Bullish · arXiv – CS AI · May 11 · 6/10

ScrapeGraphAI-100k: Dataset for Schema-Constrained LLM Generation

Researchers introduce ScrapeGraphAI-100k, a large-scale dataset of 93,695 real-world schema-constrained extraction events collected from production use. The dataset addresses a critical gap in AI training by pairing actual web content with JSON schemas, prompts, and LLM responses, enabling better evaluation and training of models for structured data extraction tasks.

🧠 GPT-5

AI · Bearish · arXiv – CS AI · Mar 3 · 7/10 · 8

Extracting Training Dialogue Data from Large Language Model based Task Bots

Researchers have identified significant privacy risks in Large Language Model-based Task-Oriented Dialogue Systems, demonstrating that these AI systems can memorize and leak sensitive training data including phone numbers and complete dialogue exchanges. The study proposes new attack methods that can extract thousands of training dialogue states with over 70% precision in best-case scenarios.

$RNDR

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 5

MoDora: Tree-Based Semi-Structured Document Analysis System

Researchers introduce MoDora, an AI-powered system that uses tree-based analysis to understand and answer questions about semi-structured documents containing mixed data elements like tables, charts, and text. The system addresses challenges in processing fragmented OCR data and hierarchical document structures, achieving accuracy improvements of 5.97% to 61.07% over existing baselines.

AI · Neutral · arXiv – CS AI · Apr 7 · 4/10

Towards the AI Historian: Agentic Information Extraction from Primary Sources

Researchers have introduced Chronos, an AI Historian tool that enables historians to convert image scans of primary sources into structured data through natural-language interactions. The first module is open-source and allows historians to adapt AI workflows for analyzing heterogeneous historical source materials without requiring fixed extraction pipelines.

AI · Neutral · arXiv – CS AI · Mar 4 · 4/10 · 3

Learning to Generate and Extract: A Multi-Agent Collaboration Framework For Zero-shot Document-level Event Arguments Extraction

Researchers introduce a multi-agent collaboration framework for zero-shot document-level event argument extraction that uses AI agents to generate, evaluate, and refine synthetic training data. The system employs reinforcement learning to iteratively improve both data generation quality and argument extraction performance through a collaborative process.

AI · Neutral · OpenAI News · Sep 29 · 4/10 · 8

Turning contracts into searchable data at OpenAI

OpenAI has developed a system that transforms contract data into searchable formats, significantly reducing processing turnaround times. This advancement helps teams more efficiently access and analyze contract details within their operations.