Hello Blabladores!
In keeping with the latest developments in LLMs, I decided to try Qwen3-30B-A3B.
This is a much smaller model than Llama 3.3-70B, as well as the previous
model running as alias-large, DeepSeek-R1-70B, which was not so good and very unstable.
This is a Mixture of Experts (MoE) model with 30 billion parameters in total,
of which only about 3 billion are active for any given token (that is the "A3B"
in the name). This means a number of things:
- Given that the model is much smaller, we can have a much bigger context size, as
there is more GPU memory available. The consequence is that the model is more
useful for bigger workloads and agents.
While I had to keep Llama 3.3-70B limited to a maximum context size of 8192
tokens, I can let Qwen3 run at its maximum size, which is 128k tokens.
- The Mixture of Experts architecture means that, for each token, inference only
runs through the handful of experts the router selects - roughly 3 billion
parameters' worth - meaning it is MUCH faster, which again makes it more useful
(see the sketch after this list).
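
To make the "only a few experts per token" point concrete, here is a minimal,
illustrative sketch of top-k MoE routing in plain NumPy. The sizes, the number of
experts, and the router are all made up for the example and are not Qwen3's actual
configuration; the point is simply that each token only exercises a small fraction
of the total weights.

```python
# Toy sketch of Mixture-of-Experts routing. All sizes are invented for the
# example; Qwen3-30B-A3B's real expert count and router are different.
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff = 64, 256      # toy dimensions
n_experts, top_k = 8, 2      # toy values; real MoE models route over many more experts

# Each "expert" here is just a tiny feed-forward network.
experts = [
    (rng.standard_normal((d_model, d_ff)) * 0.02,
     rng.standard_normal((d_ff, d_model)) * 0.02)
    for _ in range(n_experts)
]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_forward(x):
    """Route a single token vector to its top-k experts and mix their outputs."""
    logits = x @ router
    top = np.argsort(logits)[-top_k:]                          # chosen expert indices
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over chosen experts
    out = np.zeros_like(x)
    for w, idx in zip(weights, top):
        w_in, w_out = experts[idx]
        out += w * (np.maximum(x @ w_in, 0) @ w_out)  # only the top_k experts do any work
    return out

token = rng.standard_normal(d_model)
print(moe_forward(token).shape)   # (64,) -- computed using 2 of the 8 experts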
Qwen3 is a much more modern model than Llama 3.3 - five months in this
field can make a huge difference. Besides, Qwen3 uses the Apache-2.0 license,
and I am much more partial to it than to the arcane Meta Llama License, which
should not even exist.
Mind you, while this is a reasoning model, I had to disable the reasoning
part in the chat template, because the thinking process was causing the
web UI to fail intermittently.
Once I have a fix for that, I will remove the setting so you can see the
reasoning process.
This affects both the Web UI and the API access. The chat template has
sep="<|im_end|>/nothink",
so the model never reasons.
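
For API users, here is a rough sketch of what a call against alias-large looks
like right now. The base URL is my best guess at the OpenAI-compatible endpoint
and the key is a placeholder, so check the Blablador API docs for the real values.

```python
# Hedged example of querying alias-large over an OpenAI-compatible API.
# Base URL and key are placeholders/assumptions -- adjust to the real settings.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_BLABLADOR_KEY",                                  # placeholder
    base_url="https://api.helmholtz-blablador.fz-juelich.de/v1",   # assumed endpoint
)

response = client.chat.completions.create(
    model="alias-large",   # now served by Qwen3-30B-A3B
    messages=[{"role": "user", "content": "Summarize the MoE idea in one sentence."}],
)

# With the current chat template ending in "/nothink", the reply contains no
# thinking block -- just the final answer.
print(response.choices[0].message.content)
```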
If you have any feedback on the workings of the model, or if you think the quality
has suffered, please let me know right away! In the worst case, I would revert
to Llama 3.3 - but I hope it won't be needed!
Let’s bark!
Alex