Following a hint found on Twitter [1], a small patch to vLLM let me increase the context
length of the GLM 4.7-Flash models from 8,000 to 200,000 (!) tokens.
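The patch itself is in the linked thread, but the same idea can be sketched with vLLM's standard CLI. The `$MODEL_ID` placeholder below stands for the GLM 4.7-Flash checkpoint identifier, and the actual patch may do more than this flag alone (e.g. adjusting RoPE scaling), so treat this as an approximation rather than the exact change:

```shell
# Serve the model with an extended context window.
# --max-model-len overrides the maximum context length that vLLM
# would otherwise derive from the model's config.
# MODEL_ID is a placeholder for the actual checkpoint name.
vllm serve "$MODEL_ID" \
    --max-model-len 200000
```

Note that raising the limit past what the model was trained for usually needs some form of positional-embedding scaling to work well, which is presumably what the patched code handles.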

Now, these models get good rankings, but they REALLY overthink. The good news is that if this
patch works with other models, I can run the full-size GLM 4.7 on bigger nodes!

Let’s bark!



[1] https://x.com/TheAhmadOsman/status/2013881920099062163