Hello Blabladores!

Some of you might have heard of the best medium-sized model released last night, GLM-4.7-Flash. You know the story: it is slightly better than last week's model of similar capacity. It is a 31-billion-parameter mixture-of-experts model with 3 billion parameters activated per token.

As we hear more and more about aggressively quantizing models, I decided to run an experiment: I am running not one but two versions of GLM-4.7-Flash, one quantized and one unquantized. Since both run on identical hardware (4x RTX 3090 each), something has to give: the unquantized model has a smaller context length (around 18k tokens, versus 68k tokens for the quantized one). The quantized one is slightly faster, but not by much; I may still be able to optimize it.

In any case, both are available for you to compare. Which is better: quantized with a bigger context, or unquantized with a smaller one? You decide.

Let's bark!
Alex
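P.S. If you want to compare the two programmatically rather than in the web interface, here is a minimal sketch, assuming the usual OpenAI-compatible endpoint. The base URL, API key, and model identifiers below are placeholders, not the real deployment names, so check the model list for the exact strings.

```python
# Minimal sketch: send the same prompt to both GLM-4.7-Flash deployments
# through an OpenAI-compatible API and print the answers side by side.
# Base URL, API key, and model names are placeholders (assumptions).
from openai import OpenAI

client = OpenAI(
    base_url="https://example-blablador-endpoint/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",                            # placeholder key
)

MODELS = [
    "glm-4.7-flash-quantized",    # hypothetical name, ~68k-token context
    "glm-4.7-flash-unquantized",  # hypothetical name, ~18k-token context
]

prompt = "Summarize the trade-offs of quantizing a mixture-of-experts model."

for model in MODELS:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    print(f"--- {model} ---")
    print(response.choices[0].message.content)
```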