EPISODE · Jun 23, 2026 · 24 MIN
“Model Size Scaling in 2023-2031” by Vladimir_Nesov
Token generation speed is constrained by the speed at which the relevant HBM can be read, which is mostly the weights and KV-cache. Suppose a model is large, so that more than half of HBM is read when making a single pass over the weights, it's being read in parallel within a scale-up system, and N such systems are used in a pipeline. Then the time it takes to generate a token (without speculative decoding) is at least the time of reading more than half of an HBM stack times N. If we target a particular speed of token generation, this puts a constraint on the number of pipeline stages, which puts a constraint on the total params of the model. But if there isn't enough pretraining compute, models will remain smaller than this constraint (lower sparsity at a given number of active params buys a higher speed of token generation), so both should be taken into account. Working through these considerations gives model sizes feasible for each year between 2023 and 2031. The total params go from 10T in 2026 (at 8x sparsity, still constrained by Oberon racks, trained for 1.3e27 FLOPs) to 240T in 2028 (at [...]

---

Outline:
(01:57) Time to Fully Read an HBM Stack
(04:15) Maximal Pipelines Below 80 Tokens/s
(09:07) Pretraining Compute
(14:11) Active Params from Pretraining Compute
(22:51) Starting in 2028, the Constraint is Pretraining Compute

The original text contained 5 footnotes which were omitted from this narration.

---

First published: June 22nd, 2026
Source: https://www.lesswrong.com/posts/yLHiQGCPdvzL9fBn3/model-size-scaling-in-2023-2031

---

Narrated by TYPE III AUDIO.
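The pipeline bound stated in the episode description can be sketched in a few lines of Python. This is a minimal illustration, not the author's calculation: the function names are hypothetical, and the example inputs (a 4 ms time to fully read an HBM stack, 60% of the stack read per pass) are made-up placeholders rather than figures from the post. Only the 80 tokens/s target echoes the outline.

```python
import math


def min_token_time_s(stack_read_time_s: float, fraction_read: float, num_stages: int) -> float:
    """Lower bound on time per token (no speculative decoding).

    Each pipeline stage must read `fraction_read` of its HBM stack, and the
    stages run one after another for a single token, so the times add up.
    """
    return stack_read_time_s * fraction_read * num_stages


def max_pipeline_stages(target_tokens_per_s: float, stack_read_time_s: float, fraction_read: float) -> int:
    """Largest number of pipeline stages that still meets the target speed."""
    per_stage_s = stack_read_time_s * fraction_read
    return math.floor(1.0 / (target_tokens_per_s * per_stage_s))


# Hypothetical inputs, chosen only to illustrate the arithmetic.
TARGET_TOKENS_PER_S = 80.0   # target generation speed
STACK_READ_TIME_S = 0.004    # assumed time to fully read one HBM stack
FRACTION_READ = 0.6          # assumed share of HBM read per pass (more than half)

stages = max_pipeline_stages(TARGET_TOKENS_PER_S, STACK_READ_TIME_S, FRACTION_READ)
print("max pipeline stages:", stages)
print("min seconds per token:", min_token_time_s(STACK_READ_TIME_S, FRACTION_READ, stages))
```

Because the number of stages caps how much HBM holds the model, this bound on stages translates into a bound on total params; whether a model actually reaches that size then depends on the available pretraining compute, as the description notes.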