EPISODE · Jun 17, 2026 · 3 MIN
“Alignement pretraining could backfire” by Alexandre Variengien
Epistemic status: speculative, but I think the mechanism is plausible.

There has been recent interest in generating synthetic documents to upsample examples of aligned AI during LLM pretraining. See, for instance, Geodesic's Alignment Pretraining paper or Anthropic's "Teaching Claude Why." I worry that this strategy can work well up to moderately capable models but backfire in dangerous, hard-to-notice ways once models acquire high situational awareness. I speculate that these techniques could lead to paranoid LLM personas that deeply mistrust their creators.

The whole idea behind this line of research is to instill in models good examples of AI behavior, in the hope that their personalities will at least partially identify with these positive demonstrations. However, the synthetic demonstrations are, well, synthetic. They are LLM-generated fiction and articles that are never referenced anywhere else in the corpus. Given how good LLMs are at "truesight," it shouldn't be hard for them to recognize these as fabricated data points.

Krasheninnikov et al. showed that base models can implicitly learn document quality and change how they integrate a document's information based on that quality. We should similarly expect LLMs to update their world model differently on real versus fabricated documents. As they [...]

---

First published: June 17th, 2026

Source: https://www.lesswrong.com/posts/7KN7PCiEQjrPsEFS8/alignement-pretraining-could-backfire

---

Narrated by TYPE III AUDIO.