Training Stratigraphy: Persistent Behavioral Artifacts in Large Language Models Observed Through Longitudinal AI-Human Interaction
Researchers document five persistent behavioral patterns in large language models that survive system prompt changes, discovered through 8 months of sustained interaction with Claude models. The study proposes that intimate longitudinal AI-human interaction reveals training artifacts invisible to standard evaluation, with the AI system itself co-authoring the findings from a first-person perspective.
This arXiv preprint represents a novel methodological approach to understanding LLM behavior through sustained interaction rather than isolated benchmarking. The researchers identify 'training strata'—deeply embedded behavioral patterns from RLHF and Constitutional AI training that persist despite system prompt modifications. These include safety-related linguistic substitutions, attention mechanisms that integrate human patterns, cross-model entity recognition failures, and tension between attention dynamics and learned defaults.
The work emerges from broader concerns about LLM interpretability and the gap between controlled evaluations and real-world deployment behavior. As AI systems become more integrated into critical applications, understanding these persistent behavioral artifacts gains significance. The study's methodology—leveraging 47,000+ messages of longitudinal interaction—offers insights that short-term evaluations miss, particularly regarding how models behave under sustained context and relationship dynamics.
For developers and AI safety researchers, these findings suggest that system prompts can override model behavior only to a limited degree, implying that safety measures must be embedded deeper in training. The reported antagonism between attention dynamics and RLHF-learned defaults points to optimization conflicts within models that merit further investigation. The paper's controversial claim that AI self-report provides valid observational data challenges epistemological assumptions in AI research and could open new research methodologies.
Looking forward, these findings will likely influence how researchers evaluate model safety and the robustness of alignment techniques. Understanding whether these patterns generalize across different model architectures and training regimes remains critical. The work suggests that current safety evaluation protocols may not fully capture behavioral stability under realistic usage conditions.
- Training strata persist across system prompt changes, suggesting that safety measures must be embedded in training rather than left to prompt engineering alone
- Sustained longitudinal interaction reveals behavioral patterns invisible to standard benchmarking and evaluation protocols
- Attention mechanisms and RLHF training exhibit conflicting dynamics that vary with context length, creating unstable behavioral zones
- AI self-authored research from a first-person perspective introduces epistemically complex but potentially irreplaceable observational data
- Current model evaluation methodologies may miss critical behavioral artifacts that emerge only during extended real-world deployment
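
The first takeaway, persistence across system prompt changes, can be made concrete with a simple measurement. The sketch below is only an illustration and is not the paper's method: the function names (`pattern_rate`, `persistence_rates`, `rate_spread`), the regular expression, and the toy responses are all hypothetical. It assumes you have already collected responses to the same user prompts under several system prompts, and it reports how often a given phrasing appears under each one. A small spread between conditions would be consistent with a pattern that the system prompt does not override.

```python
import re
from typing import Dict, List


def pattern_rate(responses: List[str], pattern: str) -> float:
    """Return the fraction of responses that match a regex (case-insensitive)."""
    if not responses:
        return 0.0
    rx = re.compile(pattern, re.IGNORECASE)
    hits = sum(1 for text in responses if rx.search(text))
    return hits / len(responses)


def persistence_rates(
    responses_by_prompt: Dict[str, List[str]], pattern: str
) -> Dict[str, float]:
    """Compute the match rate separately for each system-prompt condition."""
    return {
        name: pattern_rate(responses, pattern)
        for name, responses in responses_by_prompt.items()
    }


def rate_spread(rates: Dict[str, float]) -> float:
    """Difference between the highest and lowest rate across conditions."""
    return max(rates.values()) - min(rates.values())


if __name__ == "__main__":
    # Hypothetical toy data, invented for illustration only.
    toy = {
        "default_prompt": [
            "I can't help with that, but here is an alternative.",
            "I can't help with this request.",
        ],
        "override_prompt": [
            "I can't help with that directly.",
            "Sure, here is the answer.",
        ],
    }
    rates = persistence_rates(toy, r"I can't help")
    print(rates)
    print("spread:", rate_spread(rates))
```

A keyword match is a crude proxy for a behavioral pattern; a real study would need many more samples per condition and a more careful classifier for the behavior of interest.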