Why we can't have nice things...
Hello Blabladores,

I come with news that is good and bad at the same time (there are plenty of jokes on the matter, google it).

THE BAD NEWS

We have to ban and throttle connections on the API.

THE GOOD NEWS

Because of this, the service is faster for everyone else.

This didn't come easy to me. I always expected that people would behave nicely and not abuse the system. I managed to run it like this for almost five years. Now that we have finally reached good scale and stability, someone decided to throw a wrench in the gears. Just because the service has become fast does not mean it has become a free-for-all.

WHY I AM DOING THIS

Half a dozen users were making hundreds of requests PER SECOND. They have been banned. The system availability has risen a LOT because of that.

Look, almost all of our models are not only thinking models, they are mandatory thinking models. There's no way to disable the thinking part of them. I will paste the thinking process of such a model here.

TO UNDERSTAND WHAT HAPPENS INSIDE A THINKING LLM

I say to the model: “Hello”

This is what happens inside the model:

-----------------------------
<think>
Thinking Process:

Analyze the Input:
  Input: "Hello"
  Intent: Greeting.
  Context: Start of a conversation.
  Tone: Friendly, polite.

Determine the appropriate response:
  Acknowledge the greeting.
  Offer assistance.
  Keep it concise but warm.
  Maintain a helpful persona.

Draft potential responses:
  "Hello! How can I help you today?"
  "Hi there! What's on your mind?"
  "Hello! Ready to assist you."
  "Greetings! How can I be of service?"

Select the best option:
  "Hello! How can I help you today?" is standard, friendly, and clear.

Final Polish:
  Add a slight variation to sound natural.
  "Hello! How can I assist you today?" or "Hi there! How's it going?"
  Let's go with a balanced, friendly greeting.

Final Output Generation: (Matches the selected draft)
  "Hello! How can I help you today?" or similar.

Self-Correction during thought process:
  Since I am an AI, I should keep it open-ended so the user knows they can ask anything.

Revised Plan:
  Greet back.
  Offer help.
  Keep it concise.

Final Choice: "Hello! How can I help you today?" (Simple and effective).
Alternative: "Hi there! What can I do for you?"
Decision: "Hello! How can I help you today?" is the safest and most standard.

Wait, I need to make sure I don't sound too robotic.
"Hello! It's nice to meet you. How can I assist you today?"
Let's stick to the core function: Greeting + Offer of assistance.

Final Output: "Hello! How can I help you today?"
</think>

Hello! How can I help you today?
-----------------------------

CONCLUSION

All of that for a single "Hello". One should assume that such a query takes longer than a second. So sending hundreds of queries per second is not only absurd, it is downright a Denial of Service attack, and it is being treated as such.

WHAT WILL HAPPEN

I am testing throttling. You flood the server once, you get banned for 10 minutes (a warning shot). You do it again, and the ban time increases. And so on.

WHO IS AFFECTED?

Agentic pipelines and well-behaved code are okay. Serial connections are totally okay. Talking on the website is okay.

What is not okay: parallel requests via the API. Sending more than one request per second for a sustained period will get you on the list.

IF YOU HAVE BEEN BANNED

You already know who you are, but I do not. I do not record who talks to Blablador or what they say, but I do record the IP addresses of requests. I can tell which city you are in, but that's about it.

If you have been banned, think it was a bug in your software, and have fixed it, talk to me and I will reinstate you, no problem. I don't want to break your work, I want the system to work for us all.

And let's be honest, we are WAY faster and more generous with tokens than ANY other service on the market ;-)

We bark, but sometimes we have to bite a bit.

Alex
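
For anyone adapting client code to the limits described above, here is a minimal sketch of serial, paced requests in Python. It is an illustration only, not an official Blablador client: the names `MinIntervalLimiter` and `run_serially` are hypothetical, and the one-second default is simply taken from the "more than one request per second" threshold in the post. The actual server-side throttling rules may differ.

```python
import time
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class MinIntervalLimiter:
    """Keeps successive wait() calls at least `min_interval` seconds apart."""

    def __init__(
        self,
        min_interval: float = 1.0,  # assumption: one request per second
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_start: Optional[float] = None

    def wait(self) -> float:
        """Sleep if the previous request started too recently.

        Returns the number of seconds slept (0.0 if none).
        """
        delay = 0.0
        if self._last_start is not None:
            elapsed = self._clock() - self._last_start
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last_start = self._clock()
        return delay


def run_serially(
    items: Iterable[T],
    send: Callable[[T], R],
    limiter: MinIntervalLimiter,
) -> List[R]:
    """Send requests one at a time (no parallelism), pacing each start."""
    results: List[R] = []
    for item in items:
        limiter.wait()
        results.append(send(item))
    return results
```

Here `send` stands for whatever function your code already uses to make a single API call; it is deliberately left out. Because each call only starts after the previous one has returned, the loop is serial by construction, and the limiter only adds a pause when responses come back in under a second.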
Strube, Alexandre