Hello Blabladores!

Some of you might have heard of the best medium-sized model released last night, GLM-4.7-Flash. You know the story: it is slightly better than last week's model of similar capacity. It is a 31-billion-parameter mixture-of-experts model with 3 billion parameters activated per token.

As we hear more and more about aggressively quantizing models, I decided to run an experiment: I am running not one but two versions of GLM-4.7-Flash, one quantized and one unquantized. Since both run on identical hardware (4x RTX 3090 each), something has to give: the unquantized model has a smaller context length (around 18k tokens, versus 68k tokens for the quantized one). The quantized one is slightly faster, but not by much; I may still be able to optimize it.

In any case, both are available for you to compare. Which is better: quantized with a bigger context, or unquantized with a smaller one? You decide.

Let's bark!
Alex
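P.S. If you want to compare the two programmatically rather than in the web interface, here is a minimal sketch, assuming the usual OpenAI-compatible endpoint. The base URL, API key, and model identifiers below are placeholders, not the real deployment names, so check the model list for the exact strings.

```python
# Minimal sketch: send the same prompt to both GLM-4.7-Flash deployments
# through an OpenAI-compatible API and print the answers side by side.
# Base URL, API key, and model names are placeholders (assumptions).
from openai import OpenAI

client = OpenAI(
    base_url="https://example-blablador-endpoint/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",                            # placeholder key
)

MODELS = [
    "glm-4.7-flash-quantized",    # hypothetical name, ~68k-token context
    "glm-4.7-flash-unquantized",  # hypothetical name, ~18k-token context
]

prompt = "Summarize the trade-offs of quantizing a mixture-of-experts model."

for model in MODELS:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    print(f"--- {model} ---")
    print(response.choices[0].message.content)
```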