Frugal Knowledge Graph Construction with Local LLMs: A Zero-Shot Pipeline, Self-Consistency and Wisdom of Artificial Crowds
Researchers demonstrate a zero-shot knowledge graph construction pipeline using local open-source LLMs on consumer hardware, achieving 0.70 F1 on document-level relation extraction and 0.55 exact match on multi-hop question answering through ensemble methods. The study reveals that strong model consensus often signals collective hallucination rather than accuracy, challenging traditional ensemble assumptions while keeping computational costs and carbon footprint low.
This research addresses a critical challenge in making advanced AI systems accessible and sustainable: executing complex NLP tasks entirely on local hardware without fine-tuning. The zero-shot pipeline achieves competitive results (0.70 F1) against supervised baselines (0.80 F1) through intelligent orchestration of multiple models, demonstrating that architectural diversity can partially compensate for lack of task-specific training data.
The work builds on established AI trends toward efficiency and reproducibility. The adoption of open evaluation frameworks (RAGAS, DocRED, HotpotQA) and emphasis on local inference reflects growing industry concerns about computational costs, data privacy, and model interpretability. The 5-hour execution time on consumer-grade RTX 3090 hardware with 0.09 kg CO2 equivalent carbon footprint directly challenges the resource-intensive paradigm of cloud-based AI systems.
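The reported 0.09 kg CO2e figure is easy to sanity-check with back-of-the-envelope arithmetic. The power draw and grid carbon intensity below are assumptions for illustration, not values from the paper, which reports only the runtime and the final figure:

```python
# Back-of-the-envelope check of the reported ~0.09 kg CO2e footprint.
GPU_POWER_KW = 0.35       # assumed ~350 W average draw for an RTX 3090
RUNTIME_HOURS = 5         # reported pipeline runtime
CARBON_INTENSITY = 0.05   # assumed kg CO2e per kWh (low-carbon grid)

energy_kwh = GPU_POWER_KW * RUNTIME_HOURS        # 1.75 kWh
footprint_kg = energy_kwh * CARBON_INTENSITY     # ~0.0875 kg CO2e
print(f"{footprint_kg:.3f} kg CO2e")
```

Under these assumptions the estimate lands near the reported 0.09 kg; on a carbon-heavier grid the same run would emit several times more, so the figure is grid-dependent.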
The agreement paradox finding carries significant implications for practitioners deploying ensemble methods in production. The counterintuitive discovery that high consensus correlates with hallucination—not accuracy—fundamentally questions how organizations should weight multiple model outputs. This insight aligns with broader research on model calibration and uncertainty quantification, suggesting that confidence scores require careful interpretation.
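The practical upshot is that a consensus score from ensemble voting should be reported alongside the answer but never read as a correctness probability. A minimal sketch of such a voter (hypothetical helper, not the paper's code):

```python
from collections import Counter

def vote_with_consensus(answers):
    """Majority-vote an ensemble's answers and report the agreement level.

    Per the agreement paradox, the consensus score measures how much the
    models agree, NOT how likely the shared answer is to be grounded:
    unanimous answers can still be collective hallucinations.
    """
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    consensus = votes / len(answers)
    return answer, consensus

# Three hypothetical model outputs for the same query.
answer, consensus = vote_with_consensus(["Paris", "Paris", "Lyon"])
print(answer, consensus)  # Paris 0.666...
```

A production system following this finding would route even high-consensus answers through a grounding or retrieval check rather than auto-accepting them.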
The confidence-routing cascade mechanism, which achieves 0.55 exact match, represents a practical advancement in handling difficult multi-hop reasoning. The finding that gains from the V3 prompt variant are model-specific rather than universally transferable underscores the importance of empirical validation before deployment. This work establishes benchmarks for efficient local inference systems, creating a foundation for organizations seeking to balance performance with sustainability and cost constraints.
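The cascade idea itself is simple to sketch: a cheap model answers first, and questions whose self-reported confidence falls below a threshold are rerouted to a stronger, slower model. This is an illustrative sketch under assumed interfaces, not the paper's implementation; the threshold value is hypothetical:

```python
def cascade(question, fast_model, strong_model, threshold=0.7):
    """Confidence-routing cascade (illustrative sketch).

    fast_model and strong_model are assumed to return (answer, confidence)
    tuples. The paper reports rerouting ~45% of multi-hop questions this
    way, lifting exact match from 0.46 to 0.55.
    """
    answer, confidence = fast_model(question)
    if confidence >= threshold:
        return answer, "fast"          # cheap path: confident enough
    answer, _ = strong_model(question)  # reroute to the stronger model
    return answer, "strong"

# Stub models standing in for real LLM calls.
fast = lambda q: ("draft answer", 0.4)
strong = lambda q: ("careful answer", 0.9)
print(cascade("Who directed the film based on ...?", fast, strong))
```

The threshold directly controls the reroute rate, so in practice it would be tuned on a validation set to hit the desired cost/accuracy trade-off.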
- Zero-shot local LLM pipeline achieves 0.70 F1 on knowledge graph tasks, closing the gap with supervised systems while eliminating training costs
- Strong model consensus frequently indicates collective hallucination rather than correctness, contradicting conventional ensemble wisdom
- Confidence-routing cascade mechanism improves multi-hop reasoning from 0.46 to 0.55 exact match by selectively rerouting 45% of questions
- Complete pipeline executes in 5 hours on RTX 3090 with minimal carbon footprint, demonstrating practical viability of local-only inference
- Prompt engineering gains are model-specific and do not transfer universally, requiring empirical validation for each architecture