EPISODE · Jun 22, 2026 · 6 MIN
“NLA explanations can be shortened without harming reconstruction” by loops
Natural language autoencoders are a really cool mostly-unsupervised method for producing free-form text explanations of LLM activations. You should read that paper (or the blog post) about them before reading this.

I trained[1] several Qwen3-8B NLAs with different length penalties: during RL, I subtracted the token count multiplied by the length penalty hyperparameter (λ) from the RL reward[2]. I found that with a small length penalty (λ=0.002), you can reduce the length of NLA explanations by ~40% (compared to having no length penalty) at the cost of a fairly small change in FVE of -0.015. (FVE is the fraction of variance explained: 0 is guessing the mean activation, 1 is perfect reconstruction.) With an even smaller penalty (λ=0.001), the FVE is almost unchanged (+0.007) despite explanations using 28% fewer tokens.

Being able to reduce the length so much without impacting FVE is interesting because it could mean that large parts of NLA explanations aren't actually useful for reconstructing the input activation faithfully. Some of the reduction comes from the length penalty making the model use terser wording to convey the same ideas; I'm not sure how much of these results stem from terser wording vs. omitting unneeded information. Larger λ values cause the FVE to go below [...]

---
Outline:
(03:22) Look at some activations!
(04:36) What does this mean
(05:06) Other notes
(05:09) Training dynamics
(05:52) Models
(06:00) Conclusion

The original text contained 2 footnotes which were omitted from this narration.

---
First published: June 21st, 2026
Source: https://www.lesswrong.com/posts/NazprRfWJ4qkwcSro/nla-explanations-can-be-shortened-without-harming

---
Narrated by TYPE III AUDIO.
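As a rough illustration of the two quantities in the episode summary above, here is a minimal Python sketch of a length-penalized reward and of FVE. The function names, the list-of-lists activation format, and the idea of passing in a precomputed `base_reward` are assumptions made for this example; the summary only says that the token count times λ is subtracted from the RL reward, and that FVE is 0 for guessing the mean activation and 1 for perfect reconstruction.

```python
def length_penalized_reward(base_reward, num_tokens, lam):
    """Subtract lam * token count from the unpenalized RL reward.

    base_reward is a placeholder for whatever the unpenalized reward is;
    how it is computed is not specified here.
    """
    return base_reward - lam * num_tokens


def fraction_of_variance_explained(activations, reconstructions):
    """FVE = 1 - (squared reconstruction error) / (squared error of the mean guess).

    Both arguments are equal-shaped lists of vectors (lists of floats).
    """
    n = len(activations)
    dim = len(activations[0])
    # Baseline: predict the per-dimension mean activation for every example.
    mean = [sum(a[j] for a in activations) / n for j in range(dim)]
    residual = sum(
        (a[j] - r[j]) ** 2
        for a, r in zip(activations, reconstructions)
        for j in range(dim)
    )
    total = sum((a[j] - mean[j]) ** 2 for a in activations for j in range(dim))
    return 1.0 - residual / total


if __name__ == "__main__":
    acts = [[1.0, 2.0], [3.0, 6.0]]
    mean_guess = [[2.0, 4.0], [2.0, 4.0]]
    print(fraction_of_variance_explained(acts, acts))        # perfect reconstruction
    print(fraction_of_variance_explained(acts, mean_guess))  # mean baseline
    print(length_penalized_reward(0.8, 100, 0.002))          # 100-token explanation
```

Under this sketch, an explanation of 100 tokens at λ=0.002 loses 0.2 reward relative to an unpenalized one, which gives a sense of how strongly even a small λ pushes toward shorter explanations.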