I know that labs, institutions, and so on have safety teams. I know the folks doing that work are serious and earnest about it. But at this point, are these institutions merely pandering to the notion of safety with some token level of investment, in the way that a casino might fund programs to address gambling addiction?
I'm an outsider and can only guess. Insider insight would be very appreciated.
It doesn't mean much to me if a safe model is one that does not output the recipe for mustard gas; that information is trivially available elsewhere.
Or is a safe model one that doesn't come off as racist? OK, but I would classify that as inoffensive rather than safe, though I admit definitions of words can be fluid and change.
Is a safe model one that refuses to produce code for a weapons system? Well... does a PID controller count? I can use that to keep a gun pointed at a target, or I can use that to prevent a baby rocker from falling over.
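To make the PID point concrete, here is a minimal sketch of a textbook discrete PID loop. The class name, the gains, and the toy "plant" (the control output is applied directly as a velocity) are all assumptions made up for illustration, not anything a real turret or baby rocker uses. The point is only that the code has no idea what it is steering.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook discrete PID controller (hypothetical illustration)."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, setpoint: float, measured: float, dt: float) -> float:
        # Standard proportional + integral + derivative terms on the error.
        error = setpoint - measured
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(pid: PID, setpoint: float, start: float,
             steps: int = 200, dt: float = 0.05) -> float:
    """Toy plant (assumed): controller output is treated as a velocity."""
    position = start
    for _ in range(steps):
        position += pid.step(setpoint, position, dt) * dt
    return position


# Same controller, same gains; only the label on the number changes.
aim_angle = simulate(PID(4.0, 4.0, 0.1), setpoint=10.0, start=0.0)
rocker_tilt = simulate(PID(4.0, 4.0, 0.1), setpoint=0.0, start=15.0)
```

Nothing in `PID.step` distinguishes the two calls at the bottom, which is why "refuse to write weapons code" is hard to define at the level of a code snippet.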
Maybe they're giving up on "safe" because there's no definitive way to know if a model is safe or not. I've always held the opinion that AI safety was more about brand safety. Maybe now the model providers can afford some bad press without it being the death of their company.
This leads to what I'm going to call the "Ender's Game" approach: if your AI is uncooperative just present it with a simulation that it does like but which maps onto real-world control that it objects to.
> I've always held the opinion that AI safety was more about brand safety
Yes. The social media era made that very important. The extent to which brand safety is linked to actual, physical safety then becomes a question of how well you can manage the publicity around disasters. And they're doing a pretty good job of denying responsibility.
Of course, once you have that framing, additional goals like "don't give people psychosis", "don't give step-by-step instructions on making explosives, even if Wikipedia already tells you how to do it" or "don't harm our company's reputation by being racist" are conceptually similar.
On the other hand "don't make weapon systems" or "never harm anyone" might not be viable goals. Not only because they are difficult to impossible to define, but also because there is huge financial and political pressure not to limit your AI in that way (see Anthropic)
I've been using LLMs for some cyber-y tasks and this is exactly how it ends up going. You can't ask "hack this IP" (for some models), but for more discrete tasks it has no such qualms.
But give that same recipe to a wannabe terrorist and suddenly it is dangerous. Context matters, not just the information.
Just because safety is a hard and messy problem doesn't mean we should just wash our hands of it.
Maybe this is an outdated definition, but I've always thought of safety as being about preventing injury. Things like safety glasses and hardhats on the work site, warning about slippery floors and so on. I think people are trying to expand the word to mean a great many more things in the context of AI, which doesn't help when it comes to focusing on it.
I think we need a different, clearer word for "The AI output shouldn't contain certain unauthorized things."
Instead of making actual improvement on the subject (you name it, safety, security, etc), it becomes a checkbox exercise and metrics and bureaucracies become increasingly decoupled from truth.
The only answer is there’s no money in it being safe. It is not an epistemic problem.
This isn't new either; the safety glass cracked the day OpenAI publicly launched ChatGPT. "Safety" was (and perhaps still is) a fallback for the models plateauing and LLMs failing to really make an impact... "we need more time while we focus on safety"
But after this latest round of models, it's a lot more fuel on the "this could be it" fire. Labs are eager to train on the new gigawatt-scale datacenters coming online, and it's very hard to make a case right now that we won't get another step-change up in capability. Safety just obstructs all that.
https://www.commerce.gov/news/press-releases/2025/06/stateme...
https://www.gov.uk/government/news/tackling-ai-security-risk...
So the guardrails (for you and me) are still there. They just stopped committing the unforced error of excluding themselves from federal procurement. Under a different administration, the requirement might change, and you might see them boasting once more about "safety."
One problem I have with this specific case and Anthropic/Claude working with the DOD is that I feel an LLM is the wrong tool for targeting decisions. Maybe given a set of 10 targets an LLM can assist with compiling risks/rewards and then prioritizing each of the 10 targets, but it seems like there would be a much faster and better way to do that than asking an LLM. As for target acquisition and identification, I think an LLM would be especially slow and cumbersome vs. one of the many traditional ML AIs that already exist. The DOD must be after something else.
What do you do when the government comes to you and tells you that they do want that, and can back it up with threats such as nationalizing your technology? (see Anthropic)
We're back to "you might not care about politics, but that won't stop politics caring about you".
Challenge it in court. Move the company to a different jurisdiction. Burn everything down and refuse to comply.
A lot of the people here are proud to demonstrate how easily dominated they are.
"Safety" here works for both PR and hiring (a lot of talented engineers and researchers might flock to it), and maybe soft power for legislation.
I do not say that individual employees do not care about safety (well, a lot don't, which is very visible during this OpenClaw mania).
The AI proponents who originally spoke of safety did so because they are aware of the dangers. However they, like all of us, are not able to change human nature or society. Moloch will drag them into the most dangerous game or eliminate them from the competition. Only with time, death, and damage (and many lawsuits) will any measure of safety be gained. The righteous will say "see, we said AI was dangerous!" but that will be the only satisfaction they can have, many years after the damage is done.
If we want to speedrun safety, the only real mechanism is to make legal recourse more viable (e.g. $1M penalty per copyright infringement, $100M per AI-related death, etc.). If this were the case, lawyers' self-interest and greed would compete with the self-interest and greed of the AI corps, balancing the risk (but there is no altruistic route to solving this).
Anyone pursuing safety will be outcompeted by someone who isn't. Given the amount of investment, there is no patience for any calls to slow down. I tend to believe this won’t actually end in disaster, as I don’t think it’s actually economical to put AI everywhere with enough real control that we can’t manage the risks as they evolve, but it’s a low-confidence prediction.
And going to one of the roots of the issue - the base training data - comes with its own set of unsolved challenges, not least of which is the unavoidable subjectivity of what is or isn't "safe".
Everything I find by searching is marketing BS, or the same half-baked prompt injection protection that only works for cherry picked problems.
Really need some help here finding the right communities.
If some company says security or safety, don't expect much more than words.
There are maybe a few token exceptions, like Anthropic's current pushback against the DoD, but by and large I think we can continue to expect them to pay lip service to safety while continuing to build toward systems that, by their own admission, have incredible potential to cause harm. As you noted, the fact that they employ safety researchers does not necessarily mean that they will put safety over revenue.
The issue is that they're embedded in capitalism, and that drives the labs to push further and faster than is responsible. They (and unfortunately we) end up in a race where no individual feels like they can back off or halt, because if they do, they will be destroyed.
You mean at the top labs? Since when isn't that level of misanthropy categorized as having mental health issues?
Existential in what sense?
There's this one sense in which people are almost moral about it: "yup, AI is just superior to humans, nothing we can do about it."
And then there are the ones where the elite class implements mass surveillance and warfare and obsoletes billions of humans of their own volition. These AIs are already capable enough right now to execute on said plan (of course, with proper evil engineering).
There are two ways to "win". One is in an absolute or Platonic sense - one that cares about things like values, even in the presence of extreme pushback. The other is in a Darwinian sense. No, not in the meme way that, again, feeds back into the narrative of "the things that survive are smarter". The things that survive, survive. It doesn't matter how they get there.
I can agree with the second way. But it gets smuggled in as the first way, almost as an attempt to crush any and all resistance preemptively.
AI doesn't need to, say, be capable of pushing the frontier of quantum mechanics to be lethal.
/endrant
Sorry, not really related to your comment, just had to get it out there.
Safety was never a genuine concern. They simply don't benefit from marketing themselves that way anymore so they've stopped pretending.
The problem is that safety is written in blood. Airlines implemented flight recorders / black boxes and various processes after major incidents. A major mistake occurs that causes death or destruction of property, or both; an investigation occurs, we learn from it, and we introduce new laws and regulations to prevent a recurrence.
Some of them are pandering. Some aren't. Some care. Some don't.
Businesses with ferocious funding needs are vulnerable to pressure (internal and external) to do whatever aligns with money and power. Money and power will flow into the ones so-aligned. That is the nature of the parasitic extraction models that typically drive decision making at those kinds of companies.
You can align to what the user wants, and so you are a hammer. This is alignment>safety.
Or you take a safety first approach where the AI decides what safe is and does its own bidding instead of yours. This is safety>alignment.
I prefer hammers to be honest. Mostly because humans can be prosecuted, AIs can't. So if the human wants to commit crime with the AI it should be able to, because the opposite turns to dystopia fast.
Anthropic Drops Flagship Safety Pledge
https://news.ycombinator.com/item?id=47145963
Maybe the text prediction programs are too familiar to people for the Skynet marketing to bite like it used to.
Or maybe it was not just a marketing thing and the AI bros really did believe we were a few GPUs and some training data away from AGI, but now they no longer believe this.
I think it's mostly about not showing up in some NYT article titled "look what crazy thing I got this AI to say". There were a bunch of those early on and it really hurt the cause. Microsoft had some famous ones, even prior to ChatGPT, where the AI got pretty testy in the chat.
https://en.wikipedia.org/wiki/Tay_(chatbot)
These companies have raised eye-watering amounts of funding, and will need to continue to do so for the foreseeable future. They're not yet self-sustaining, and this insecurity increases the pressure for them to compromise on ideals.
With that said, there is a massive war for top talent, and I think that the employees at the labs would become increasingly uncomfortable with their work being used for Bad Things. If Anthropic capitulates to the Pentagon, it wouldn't surprise me to see a mass exodus of talent occur.
Every misalignment/AI safety paper is basically a metaphor for how corporate values can misalign with actual human values under capitalism.
The first thing that happened when "AI Safety" became useful to corporate interests is that the "goal" of it instantly became "profitability", not safety. "AI Safety" became about liability minimization, not actual safety for humanity. (Look! The system is now misaligned with the goal, wonder how that happened!?)
AI Safety concerns were instantly proven true, it happened, and now we live in the world where it is too late to prevent the superintelligences that we call "corporations" from paper-clipping us to death in pursuit of profit.
In general, I'd describe AI believers as being dangerously gullible.
https://standards.ieee.org/ieee/7010/7718/
I also worked closely with Jack Clark at OpenAI before he disappeared on all these issues as CTO back in 2018
There are literally zero “AI labs” that have ever cared about “safety”.
None of them have ever done anything tangible in any kind of independent, auditable, third-party way that has some defined reference baseline for what is safe and what is not, how to evaluate it, or a practitioner’s guidance for how to determine what is and is not safe as a designer.
They follow the same rules as every other technology platform: do as much as you can legally get away with, no more, no less.
I say this as somebody who’s been actively involved in the AI “safety” debate for a long time now, at least since 2013.
The concept itself doesn’t even make sense if you fully understand the intersectional scope of technology and society.
Society’s demands are the things that are unsafe, not the technologies themselves.
Just like Bertrand Russell said, “as long as war exists all technologies will be utilized for it” - you can replace “war” with anything that you think is unsafe.
> The concept itself doesn’t even make sense if you fully understand the intersectional scope of technology and society. Society’s demands are the things that are unsafe, not the technologies themselves.
Where can I learn more about it?
So what would a “safe set of data” actually have to look like?
Well, it would have to not look like the majority of data that we produce now, which has latent embeddings (primarily from the Common Crawl database) of racism, lying, competition, destruction, and domination.
I don’t believe humans are actually capable of making such data, because our entire structure of society is based on racism, competition, and domination.
In a capitalist society everyone is pitted against each other, trying to outcompete the other at whatever the cost. Safety in this environment is thought of at the end, after a lot of suffering, because one group has to win it all. Damages can be externalized.
In a socialist society we build basic rules and we compete within them. We think of safety as we build something and refine those rules as we build it, because in the end we are all affected by it and get to benefit from it.
Yes. Yes it is. Yes they are giving up on safety. They are openly saying so. It is easy to see if you take just a second to look for yourself instead of looking at press releases and algorithmic promotion.
https://time.com/7380854/exclusive-anthropic-drops-flagship-...
These token predictors will never be smart enough to be dangerous.
It’s effectively the start of Asimov’s Foundation.