Hello Blabladores!
In keeping with the latest developments in LLMs, I decided to try Qwen3-30B-A3B.
This is a much smaller model than Llama 3.3-70B, as well as the previous
model running as alias-large, DeepSeek-R1-70B, which was not so good and very unstable.
This is a Mixture of Experts (MoE) model with 30 billion parameters in total,
of which only about 3 billion are active for any given token (that is the "A3B"
in the name). This means a number of things:
- Given that the model is much smaller, we can have a much bigger context size, as
there is more GPU memory available. The consequence is that the model is more
useful for bigger workloads and agents.
While I had to keep Llama 3.3-70B limited to a maximum context size of 8192
tokens, I can let Qwen3 run at its maximum size, which is 128k tokens.
- The Mixture of Experts architecture means that, for each token, inference only
runs through the handful of experts the router selects - roughly 3 billion
parameters' worth - meaning it is MUCH faster, which again makes it more useful
(see the sketch after this list).
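
To make the "only a few experts per token" point concrete, here is a minimal,
illustrative sketch of top-k MoE routing in plain NumPy. The sizes, the number of
experts, and the router are all made up for the example and are not Qwen3's actual
configuration; the point is simply that each token only exercises a small fraction
of the total weights.

```python
# Toy sketch of Mixture-of-Experts routing. All sizes are invented for the
# example; Qwen3-30B-A3B's real expert count and router are different.
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff = 64, 256      # toy dimensions
n_experts, top_k = 8, 2      # toy values; real MoE models route over many more experts

# Each "expert" here is just a tiny feed-forward network.
experts = [
    (rng.standard_normal((d_model, d_ff)) * 0.02,
     rng.standard_normal((d_ff, d_model)) * 0.02)
    for _ in range(n_experts)
]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_forward(x):
    """Route a single token vector to its top-k experts and mix their outputs."""
    logits = x @ router
    top = np.argsort(logits)[-top_k:]                          # chosen expert indices
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over chosen experts
    out = np.zeros_like(x)
    for w, idx in zip(weights, top):
        w_in, w_out = experts[idx]
        out += w * (np.maximum(x @ w_in, 0) @ w_out)  # only the top_k experts do any work
    return out

token = rng.standard_normal(d_model)
print(moe_forward(token).shape)   # (64,) -- computed using 2 of the 8 experts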
Qwen3 is a much more modern model than Llama 3.3 - five months in this
field can make a huge difference. Besides, Qwen3 uses the Apache-2.0 license,
and I am much more partial to it than to the arcane Meta Llama License, which
should not even exist.
Mind you, while this is a reasoning model, I had to disable the reasoning
part in the chat template, because the thinking process was causing the
web UI to fail intermittently.
Once I have a fix for that, I will remove the setting so you can see the
reasoning process.
This affects both the Web UI and the API access. The chat template has
sep="<|im_end|>/nothink",
so the model never reasons.
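
For API users, here is a rough sketch of what a call against alias-large looks
like right now. The base URL is my best guess at the OpenAI-compatible endpoint
and the key is a placeholder, so check the Blablador API docs for the real values.

```python
# Hedged example of querying alias-large over an OpenAI-compatible API.
# Base URL and key are placeholders/assumptions -- adjust to the real settings.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_BLABLADOR_KEY",                                  # placeholder
    base_url="https://api.helmholtz-blablador.fz-juelich.de/v1",   # assumed endpoint
)

response = client.chat.completions.create(
    model="alias-large",   # now served by Qwen3-30B-A3B
    messages=[{"role": "user", "content": "Summarize the MoE idea in one sentence."}],
)

# With the current chat template ending in "/nothink", the reply contains no
# thinking block -- just the final answer.
print(response.choices[0].message.content)
```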
If you have any feedback on the workings of the model, or if you think the quality
has suffered, please let me know right away! In the worst case, I would revert
to Llama 3.3 - but I hope it won't be needed!
Let’s bark!
Alex