Entity-Collision: A Stratified Protocol for Attributing Retrieval Lift in Agent Memory
Researchers propose entity-collision, a standardized testing protocol for evaluating retrieval systems in agent memory applications. The protocol isolates embedder performance from lexical overlap by construction, revealing that encoder capacity alone doesn't guarantee better retrieval—MiniLM-384 outperforms larger models on mixed query types despite having fewer parameters than BGE-large.
Entity-collision addresses a fundamental measurement problem in agent-memory benchmarking: existing hit@k metrics conflate multiple sources of performance variation, making it impossible to attribute improvements to specific system components. The protocol works by controlling experimental design—all distractors share entity tokens with correct answers, establishing a reproducible BM25 baseline—then stratifying queries by type to isolate embedder contributions.
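To make the stratification step concrete, here is a minimal sketch of a per-query-type lift computation over a BM25 baseline. It is not the authors' implementation: the record layout (`query_type`, `gold`, `baseline_ranking`, `system_ranking`) and the function names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def hit_at_k(ranked_ids, gold_id, k):
    """Return 1.0 if the gold item appears in the top-k results, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0


def stratified_lift(records, k=1):
    """Compute hit@k for baseline and system, grouped by query type.

    `records` is an iterable of dicts with the (hypothetical) keys
    'query_type', 'gold', 'baseline_ranking', and 'system_ranking'.
    """
    # Per query type: [baseline hits, system hits, query count]
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for r in records:
        t = totals[r["query_type"]]
        t[0] += hit_at_k(r["baseline_ranking"], r["gold"], k)
        t[1] += hit_at_k(r["system_ranking"], r["gold"], k)
        t[2] += 1
    return {
        qt: {"baseline": b / n, "system": s / n, "lift": (s - b) / n, "n": n}
        for qt, (b, s, n) in totals.items()
    }
```

Reporting lift per stratum rather than as one pooled number is what lets a gain on lexical queries be distinguished from a gain, or a loss, on intent queries.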
This research extends a broader trend toward more rigorous AI evaluation methodologies. As language models and retrieval systems become central infrastructure, benchmarks have shifted from simple aggregate metrics toward stratified, controlled comparisons that reveal performance across distinct problem classes. Entity-collision exemplifies this shift by showing that aggregate improvements can mask contradictory patterns: a 256-dimensional hash-trigram embedder helps only on closed-vocabulary lexical tasks under deep collision, while MiniLM-384 generalizes across both lexical and intent-based queries.
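For readers unfamiliar with hashed n-gram embedders, the sketch below shows one way a 256-dimensional hash-trigram vector could be built. The article does not describe the construction used in the study, so everything here is an assumption made for illustration: character-level trigrams, BLAKE2b as the hash, count-based buckets, and L2 normalization.

```python
import hashlib
import math


def hash_trigram_embed(text, dim=256):
    """Map text to a fixed-size vector by hashing character trigrams.

    Assumed design (not taken from the paper): character trigrams,
    BLAKE2b hashing, count buckets, L2 normalization.
    """
    padded = f"  {text.lower()} "  # pad so short strings still yield trigrams
    vec = [0.0] * dim
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3]
        digest = hashlib.blake2b(trigram.encode("utf-8"), digest_size=8).digest()
        vec[int.from_bytes(digest, "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a, b):
    """Cosine similarity for vectors that are already L2-normalized."""
    return sum(x * y for x, y in zip(a, b))
```

An embedder of this kind has no learned semantics: it can only reward shared surface strings, which is consistent with the finding that it helps on lexical tasks but not on intent-based queries.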
For developers building agent memory systems, the findings challenge conventional assumptions about scaling. Larger parameter counts don't guarantee better retrieval performance, and different embedders excel on different query types, which suggests that model selection should depend on the anticipated workload composition rather than on abstract capacity metrics. The discovery of an intent-tag recall cliff on LongMemEval and the measured null result for adaptive vector-weight routing on LoCoMo indicate that agent memory remains an open research area where architectural innovations have not yet closed significant performance gaps.
The protocol's reproducibility infrastructure—version-controlled results, deterministic event-sourced decision logs, and byte-for-byte verification—sets a standard for AI research transparency. Future agent-memory work will likely adopt similar stratification approaches, enabling more precise optimization of retrieval systems for specific deployment contexts.
- The entity-collision protocol controls lexical overlap and query-type variance to isolate true embedder performance gains over the BM25 baseline.
- MiniLM-384 outperforms the larger BGE-large model on mixed query distributions, indicating that encoder capacity is not the binding constraint.
- Different embedders excel on different task types: hash trigrams help on lexical tasks, while MiniLM generalizes across both lexical and intent queries.
- Adaptive vector-weight routing on LoCoMo shows no measurable signal despite 11.7 percentage points of theoretical headroom, suggesting architectural limits.
- Fully reproducible research infrastructure, with version-controlled results and deterministic state machines, enables byte-for-byte verification of all findings.