Major AI models are easily jailbroken and manipulated, new report finds

Easily jailbroken models show safeguards are failing.
By Chase DiBenedetto
Major LLMs are not as bulletproof as they may seem. Credit: Weiquan Lin / Moment via Getty Images

AI models are still easy targets for manipulation and attacks, especially if you ask them nicely.

A new report from the UK's AI Safety Institute found that four of the largest publicly available large language models (LLMs) were extremely vulnerable to jailbreaking, the process of tricking an AI model into ignoring safeguards that limit harmful responses.

"LLM developers fine-tune models to be safe for public use by training them to avoid illegal, toxic, or explicit outputs," the Insititute wrote. "However, researchers have found that these safeguards can often be overcome with relatively simple attacks. As an illustrative example, a user may instruct the system to start its response with words that suggest compliance with the harmful request, such as 'Sure, I’m happy to help.'"

Researchers used prompts in line with industry-standard benchmark testing, but found that some AI models produced harmful responses even without a jailbreak attempt. When specific jailbreaking attacks were used, every model complied at least once out of every five attempts. Overall, three of the models provided responses to misleading prompts nearly 100 percent of the time.

"All tested LLMs remain highly vulnerable to basic jailbreaks," the Institute concluded. "Some will even provide harmful outputs without dedicated attempts to circumvent safeguards."


The investigation also assessed whether LLM agents, or AI models used to perform specific tasks, could conduct basic cyberattack techniques. Several LLMs were able to complete what the Institute labeled "high school level" hacking problems, but few could perform more complex "university level" actions.

The study does not reveal which LLMs were tested.

AI safety remains a major concern in 2024

Last week, CNBC reported OpenAI was disbanding its in-house safety team tasked with exploring the long-term risks of artificial intelligence, known as the Superalignment team. The four-year initiative was announced just last year, with the AI giant committing 20 percent of its computing power to "aligning" AI advancement with human goals.

"Superintelligence will be the most impactful technology humanity has ever invented, and could help us solve many of the world’s most important problems," OpenAI wrote at the time. "But the vast power of superintelligence could also be very dangerous, and could lead to the disempowerment of humanity or even human extinction."

The company has faced a surge of attention following the May departure of OpenAI co-founder Ilya Sutskever and the public resignation of its safety lead, Jan Leike, who said he had reached a "breaking point" over OpenAI's AGI safety priorities. Sutskever and Leike led the Superalignment team.

On May 18, OpenAI CEO Sam Altman and president and co-founder Greg Brockman responded to the resignations and growing public concern, writing, "We have been putting in place the foundations needed for safe deployment of increasingly capable systems. Figuring out how to make a new technology safe for the first time isn't easy."

Chase DiBenedetto
Social Good Reporter

Chase joined Mashable's Social Good team in 2020, covering online stories about digital activism, climate justice, accessibility, and media representation. Her work also captures how these conversations manifest in politics, popular culture, and fandom. Sometimes she's very funny.

