Hello Blabladores, 

I come with news that is good and bad at the same time (there are plenty of
jokes on the matter, google it).


THE BAD NEWS

We have to ban and throttle connections on the API.



THE GOOD NEWS

Because of this, the service is faster for everyone else.

This didn’t come easy to me. I always expected that people would behave
nicely and not abuse the system. I managed to run it like this for almost five
years.

Now that we have finally reached good scale and stability, someone decided to throw
a wrench in the gears.

Just because the service has become fast does not mean it has become a
free-for-all.



WHY I AM DOING THIS

Half a dozen users were sending hundreds of requests PER SECOND.
They have been banned.

The system availability has risen a LOT because of that.

Look, almost all of our models are not just thinking models, they are
mandatory thinking models. There’s no way to disable the thinking part of
them.

I will paste the thinking process of such a model here now.


TO UNDERSTAND WHAT HAPPENS INSIDE A THINKING LLM

I say to the model: “Hello”

This is what happens inside the model:

-----------------------------
<think>

Thinking Process:

  1. Analyze the Input:

    • Input: "Hello"
    • Intent: Greeting.
    • Context: Start of a conversation.
    • Tone: Friendly, polite.
  2. Determine the appropriate response:

    • Acknowledge the greeting.
    • Offer assistance.
    • Keep it concise but warm.
    • Maintain a helpful persona.
  3. Draft potential responses:

    • "Hello! How can I help you today?"
    • "Hi there! What's on your mind?"
    • "Hello! Ready to assist you."
    • "Greetings! How can I be of service?"
  4. Select the best option:

    • "Hello! How can I help you today?" is standard, friendly, and clear.
  5. Final Polish:

    • Add a slight variation to sound natural. "Hello! How can I assist you today?" or "Hi there! How's it going?"
    • Let's go with a balanced, friendly greeting.
  6. Final Output Generation: (Matches the selected draft)

    • "Hello! How can I help you today?" or similar.

    Self-Correction during thought process: Since I am an AI, I should keep it open-ended so the user knows they can ask anything.

    Revised Plan:

    1. Greet back.
    2. Offer help.
    3. Keep it concise.

    Final Choice: "Hello! How can I help you today?" (Simple and effective).
    Alternative: "Hi there! What can I do for you?"
    Decision: "Hello! How can I help you today?" is the safest and most standard.

    Wait, I need to make sure I don't sound too robotic.
    "Hello! It's nice to meet you. How can I assist you today?"

    Let's stick to the core function: Greeting + Offer of assistance.

    Final Output: "Hello! How can I help you today?"

</think>

Hello! How can I help you today?
-----------------------------


CONCLUSION

One should assume that a query like this takes longer than a
second. So sending hundreds of queries per second is not only absurd,
it’s downright a Denial of Service attack, and it’s being treated
as such.


WHAT WILL HAPPEN

I am testing throttling. If you flood the server once, you get banned for
10 minutes: a warning shot.
If you do it again, the ban time will increase.
And so on.
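
To make the idea concrete, here is a minimal sketch of an escalating ban. It is
not the actual Blablador code. The 10-minute first ban comes from the text above;
the doubling factor, the class name BanTracker and its methods are assumptions
made up for illustration, since all that is promised is that the time increases.

```python
from dataclasses import dataclass, field

BASE_BAN_SECONDS = 10 * 60  # first ban: 10 minutes, as stated in the announcement
GROWTH_FACTOR = 2           # ASSUMPTION: doubling; the real increase is not specified


@dataclass
class BanTracker:
    """Hypothetical per-IP ban bookkeeping with growing ban times."""
    offenses: dict = field(default_factory=dict)      # ip -> number of floods seen
    banned_until: dict = field(default_factory=dict)  # ip -> timestamp when the ban ends

    def record_flood(self, ip, now):
        """Register a flood from `ip` at time `now`; return the ban length in seconds."""
        count = self.offenses.get(ip, 0) + 1
        self.offenses[ip] = count
        duration = BASE_BAN_SECONDS * GROWTH_FACTOR ** (count - 1)
        self.banned_until[ip] = now + duration
        return duration

    def is_banned(self, ip, now):
        """True while the most recent ban for `ip` has not yet expired."""
        return now < self.banned_until.get(ip, 0.0)
```

With these assumed numbers, a first flood costs 600 seconds and a second one
1200 seconds.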



WHO IS AFFECTED?

Agentic pipelines and well-behaved code are okay. Serial connections are totally okay.
Talking on the website is okay.

What is not okay: parallel requests via the API. Sending more than one request per second
for a sustained period will get you on the list.
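
If you want to be sure your own script stays on the okay side, a sketch like the
one below may help. It sends prompts strictly one after another and never starts
more than one per second. The names Pacer and run_serially are made up for this
example, and `send` stands for whatever function of yours actually calls the API;
none of this is an official client.

```python
import time


class Pacer:
    """Spaces out calls so they start at least `min_interval` seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the pacing can be tested without waiting
        self._sleep = sleep
        self._last_start = None

    def wait(self):
        """Block until at least `min_interval` has passed since the previous call."""
        now = self._clock()
        if self._last_start is not None:
            remaining = self.min_interval - (now - self._last_start)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_start = now


def run_serially(prompts, send, pacer):
    """Send prompts one at a time; `send` is a placeholder for your own API call."""
    results = []
    for prompt in prompts:
        pacer.wait()
        results.append(send(prompt))  # blocks until the answer is back, so no parallelism
    return results
```

Usage would be something like run_serially(my_prompts, my_send_function, Pacer(1.0)).
Since a thinking model usually needs more than a second per answer anyway, the
pacer will rarely have to sleep at all.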



IF YOU HAVE BEEN BANNED

You already know who you are, but I do not. I do not record who talks to blablador or
what they say, but I do record the IP addresses of requests. I can tell which city you are in, but
that’s about it. If you have been banned and think it was a bug in your software that you have
since fixed, talk to me and I will reinstate you, no problem. I don’t want to break your work,
I want the system to work for us all.


And let’s be honest, we are WAY faster and more generous with tokens than ANY other
service on the market ;-)


We bark, but sometimes we have to bite a bit.

Alex