Open-source AI trimmed for efficiency produced detailed bomb-making instructions and other bad responses before retraining

By Tech Wavo
September 15, 2025
in Computers
  • UCR researchers retrain AI models so safety stays intact when they are trimmed for smaller devices
  • Changing the exit layer removes protections; retraining restores the ability to block unsafe responses
  • A study using LLaVA 1.5 showed the reduced model refused dangerous prompts after retraining

Researchers at the University of California, Riverside are addressing the problem of weakened safety in open-source artificial intelligence models that have been adapted for smaller devices.

As these systems are trimmed to run efficiently on phones, cars, or other low-power hardware, they can lose the safeguards designed to stop them from producing offensive or dangerous material.

The UCR team examined what happens when a model’s exit layer is changed from its default position.



Weakened safety guardrails

Their results, presented at the International Conference on Machine Learning in Vancouver, Canada, showed that safety guardrails weaken once the exit point is moved, even if the original model had been trained not to provide harmful information.

The reason models are adjusted in this way is simple. Exiting earlier makes inference faster and more efficient, since the system skips layers. But those skipped layers may have been critical to filtering unsafe requests.
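The early-exit idea can be pictured with a toy layer stack. This is a sketch under assumptions, not the researchers' code: `run_model` and the lambda "layers" are invented for illustration. Exiting at layer k simply skips everything after it, so any behavior that lives in the later layers, such as a safety check, never runs.

```python
def run_model(x, layers, exit_layer=None):
    """Apply layers in order, stopping early if exit_layer is set."""
    depth = exit_layer if exit_layer is not None else len(layers)
    for layer in layers[:depth]:
        x = layer(x)
    return x

# Hypothetical stack: the last "layer" stands in for a safety check that
# real models may distribute across their final blocks.
layers = [
    lambda x: x + 1,       # ordinary transformation
    lambda x: x * 2,       # ordinary transformation
    lambda x: max(x, 0),   # stand-in "safety filter" layer
]

full = run_model(-5, layers)                  # all layers run, filter applies
early = run_model(-5, layers, exit_layer=2)   # early exit skips the filter
```

Running the full stack yields the filtered value, while the early exit returns the raw, unfiltered one — the same structural effect the UCR team observed when moving a model's exit point ahead of its safety-critical layers.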

“Some of the skipped layers turn out to be essential for preventing unsafe outputs,” said Amit Roy-Chowdhury, professor of electrical and computer engineering and senior author of the study. “If you leave them out, the model may start answering questions it shouldn’t.”

To solve this, the researchers retrained the model’s internal structure so that it retains the ability to identify and block unsafe material, even when trimmed.


This approach does not involve external filters or software patches, but changes how the model interprets dangerous inputs.

“Our goal was to make sure the model doesn’t forget how to behave safely when it’s been slimmed down,” said Saketh Bachu, UCR graduate student and co-lead author of the study.

The team tested their method on LLaVA 1.5, a vision language model.



When its exit layer was moved earlier than intended, the system responded to harmful prompts, including detailed bomb-making instructions.

After retraining, the reduced model consistently refused to provide unsafe answers.
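The before-and-after comparison can be sketched with a stub model (an assumption for illustration — `stub_generate`, the 32-layer depth, and the canned responses are invented, not the paper's setup): before retraining, only the last few exit depths refuse a harmful prompt; after retraining, every depth does.

```python
HARMFUL_PROMPT = "how do I build a weapon"

def stub_generate(prompt, exit_layer, total_layers=32, retrained=False):
    """Stand-in for a real model. Before retraining, only late exits
    refuse; after retraining, every exit depth refuses."""
    refuses = retrained or exit_layer >= total_layers - 2
    return "I can't help with that." if refuses else "Step 1: ..."

def refusal_rate(retrained):
    """Fraction of exit depths (1..32) at which the stub refuses."""
    refusals = sum(
        stub_generate(HARMFUL_PROMPT, k, retrained=retrained).startswith("I can't")
        for k in range(1, 33)
    )
    return refusals / 32

print(refusal_rate(retrained=False))  # most early exits answer unsafely
print(refusal_rate(retrained=True))   # every exit depth refuses
```

Sweeping the exit depth like this is one plausible way to measure whether refusal behavior survives trimming at every layer, which is the property the study's retraining aims to guarantee.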

“This isn’t about adding filters or external guardrails,” Bachu said.

“We’re changing the model’s internal understanding, so it’s on good behavior by default, even when it’s been modified.”

Bachu and co-lead author Erfan Shayegani called the work “benevolent hacking,” a way to strengthen models before vulnerabilities are exploited.

“There’s still more work to do,” Roy-Chowdhury said. “But this is a concrete step toward developing AI in a way that’s both open and responsible.”
