
CTIConnect: A Benchmark for Retrieval-Augmented LLMs over Heterogeneous Cyber Threat Intelligence

arXiv – CS AI | Yutong Cheng, Yang Liu, Changze Li, Dawn Song, Peng Gao
🤖 AI Summary

Researchers introduce CTIConnect, a benchmark for evaluating retrieval-augmented large language models on cyber threat intelligence tasks. The study integrates five heterogeneous CTI sources into 1,860 expert-verified QA pairs across nine tasks, revealing that different task categories require fundamentally different retrieval strategies and that domain-specific approaches outperform generic retrieval methods.

Analysis

CTIConnect addresses a gap in AI evaluation: it is presented as the first comprehensive benchmark for retrieval-augmented LLMs operating over cyber threat intelligence (CTI). Organizations increasingly rely on LLMs to process large volumes of heterogeneous CTI data, including CVE databases, the MITRE ATT&CK framework, and unstructured threat reports, that analysts cannot review manually at scale. The benchmark goes beyond general tests of LLM capability by evaluating models on realistic, domain-specific workloads where retrieval and reasoning directly affect security outcomes.

The benchmark's design is what makes the results informative: 1,860 expert-verified QA pairs span nine tasks across entity linking, multi-document synthesis, and entity attribution, and they expose performance bottlenecks that vary by task type. Tests on ten state-of-the-art LLMs show that no single retrieval strategy works across all CTI tasks. The cross-source semantic gap, meaning the mismatch between the formats and structures of different CTI sources, manifests differently depending on whether a model is performing entity linking or attribution. Some tasks are bottlenecked at retrieval, while others fail later, when the LLM must make use of the evidence it was given.
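The split between retrieval failures and evidence-utilization failures can be made concrete with a small diagnostic. The sketch below is an illustration, not the benchmark's own tooling: `EvalRecord`, its fields, and `diagnose` are hypothetical names, and it assumes each evaluated question records whether the gold evidence was retrieved and whether the final answer was correct.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One evaluated question (hypothetical schema, not from the paper)."""
    task: str                  # e.g. "entity_linking" or "attribution"
    gold_doc_retrieved: bool   # did the retriever surface the gold evidence?
    answer_correct: bool       # was the model's final answer right?


def diagnose(records):
    """Per task, attribute each wrong answer to retrieval or to utilization."""
    report = {}
    for r in records:
        stats = report.setdefault(
            r.task, {"total": 0, "retrieval_miss": 0, "utilization_miss": 0}
        )
        stats["total"] += 1
        if r.answer_correct:
            continue  # correct answers are not failures of either kind
        if r.gold_doc_retrieved:
            # Evidence was available but the model still answered wrongly.
            stats["utilization_miss"] += 1
        else:
            # The gold evidence never reached the model.
            stats["retrieval_miss"] += 1
    return report
```

A task dominated by `retrieval_miss` would point toward fixing the retriever, while one dominated by `utilization_miss` would point toward how the model reads and applies the evidence.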

This distinction has practical consequences. Security teams investing in LLM-powered threat intelligence platforms need architectural guidance matched to their specific use cases rather than a generic retrieval-augmented generation (RAG) pipeline. The finding that domain-specific strategies consistently outperform stronger general-purpose methods such as retrieve-then-rerank and IRCoT suggests that cybersecurity AI vendors will need specialized engineering rather than relying on off-the-shelf foundation models alone. The results also remain stable under temporal data splits covering 2008-2025, which suggests the conclusions will continue to hold as the threat landscape evolves.
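For readers unfamiliar with the retrieve-then-rerank baseline named above, here is a toy sketch of the general pattern. It is not the paper's implementation: the token-overlap first stage and the IDF-weighted second stage are simple stand-ins for what would normally be a sparse or dense retriever and a learned reranker, and all function names are illustrative.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def first_stage(query, docs, k):
    """Cheap, recall-oriented stage: rank by number of shared tokens, keep top k."""
    q = set(tokenize(query))
    scored = [(len(q & set(tokenize(d))), i) for i, d in enumerate(docs)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for score, i in scored[:k] if score > 0]


def rerank(query, docs, candidates):
    """Costlier, precision-oriented stage: IDF-weighted overlap, so rare
    terms count more (a stand-in for a learned reranker)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(tokenize(d)))
    q = set(tokenize(query))

    def score(i):
        return sum(math.log(n / df[t]) for t in q & set(tokenize(docs[i])))

    return sorted(candidates, key=lambda i: (-score(i), i))


def retrieve_then_rerank(query, docs, k=3):
    """Return document indices: first-stage candidates, reordered by the reranker."""
    return rerank(query, docs, first_stage(query, docs, k))
```

Note that the reranker can only reorder what the first stage returns. If the candidate cutoff `k` excludes the relevant document, no amount of reranking recovers it, which is one way a generic pipeline can end up bottlenecked at retrieval.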

Key Takeaways
  • Different CTI task categories require fundamentally different retrieval strategies, not a one-size-fits-all approach.
  • Domain-specific retrieval methods outperform generic techniques like retrieve-then-rerank, indicating that structural interventions are necessary for heterogeneous data.
  • Performance bottlenecks vary by task: some are constrained by retrieval infrastructure while others fail during evidence utilization by the LLM.
  • The expert-verified benchmark spans 1,860 QA pairs across nine CTI tasks and offers actionable guidance for security AI architects.
  • Findings hold consistently across ten LLMs and remain stable under temporal data splits from 2008-2025, validating robustness.