Following a hint found on Twitter [1], a small patch to vLLM allowed me to increase the context length of the GLM 4.7-Flash models from 8,000 to 200,000 (!) tokens. Those models now get good rankings, but they REALLY overthink. The good news is that if this patch works with other models, I can run the big GLM 4.7 on bigger nodes!

Let's bark!

[1] https://x.com/TheAhmadOsman/status/2013881920099062163
Correction: the un-quantized version can handle 108k tokens, which is still pretty good.
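For anyone who wants to try something similar, here is a minimal sketch of how one might raise the serving context length in vLLM. It is not the actual patch from the tweet (which may instead touch the model's RoPE or config handling); the model path, token limit, and GPU count below are assumptions for illustration.

```python
# Minimal sketch, assuming a local GLM checkpoint and enough GPU memory.
# This only raises the context window vLLM is willing to serve; the real
# patch referenced in the tweet may work differently.
import os

# Allow a max_model_len larger than the one derived from the model's own
# config file (otherwise vLLM refuses to start with the larger value).
os.environ["VLLM_ALLOW_LONG_MAX_MODEL_LEN"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/glm-4.7-flash",  # hypothetical checkpoint path
    max_model_len=108_000,          # target context length in tokens
    tensor_parallel_size=4,         # adjust to the GPUs on the node
)

params = SamplingParams(max_tokens=512, temperature=0.7)
outputs = llm.generate(["Summarize this very long document: ..."], params)
print(outputs[0].outputs[0].text)
```

Whether the model actually reasons well over such long inputs is a separate question from whether vLLM will serve them, so the extended window should be validated on real long-context tasks.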