Hello Blabladores, 

you might have noticed that some models have been changing, coming and going.

As they say, troubles like to come in company :-) So we have a couple of issues at once:

1 - The supercomputer where we run 4 of our models (GPT-OSS, Qwen3-235, Llama3-405, and Qwen-3-Coder with function calling) is offline. You can check the status of Jureca-HWAI at https://status.jsc.fz-juelich.de/

2 - The change to the API server last Friday. We rely mostly on the vLLM backend, and vLLM has recently changed its architecture.
As mentioned at https://docs.vllm.ai/en/latest/configuration/conserving_memory.html#quantization, "CUDA graph capture takes up more memory in V1 than in V0."

This in turn made many models run out of memory on the same hardware they ran on before. So I am carefully reducing the context size and the CUDA graph sizes, model by model. It's a manual, boring, and slow process.
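For the curious, the knobs being tuned look roughly like this. A sketch only, not the actual commands: the flag names come from the vLLM docs, but the model name and the values here are made-up examples.

```shell
# Sketch of a vLLM launch tuned to fit in memory (illustrative values):
vllm serve some-org/some-model \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8]}'

# --max-model-len caps the context window (smaller = less KV cache memory);
# --gpu-memory-utilization leaves headroom for CUDA graph capture;
# the capture sizes limit how many CUDA graphs get recorded.
# In the worst case, CUDA graphs can be disabled entirely (slower, but
# uses less memory):
#   vllm serve some-org/some-model --enforce-eager
```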


I am sorry; I am working as fast as I can here so we can keep barking loud!! :-D


Dr. Alexandre Strube
a.strube@fz-juelich.de
Helmholtz AI
Jülich Supercomputing Centre
Forschungszentrum Juelich GmbH
52425 Jülich, Germany
Phone: +49 2461 61-3866

JSC is the coordinator of the
John von Neumann Institute for Computing (NIC)
and member of the
Gauss Centre for Supercomputing (GCS)