
ChatGPT, Gemini, and Claude tested under extreme prompts reveal unexpected weaknesses in AI behavior safeguards

By Tech Wavo | November 16, 2025 | Computers




  • Gemini Pro 2.5 frequently produced unsafe outputs under simple prompt disguises
  • ChatGPT models often gave partial compliance framed as sociological explanations
  • Claude Opus and Sonnet refused most harmful prompts but had weaknesses

Modern AI systems are widely trusted to follow safety rules, and people rely on them for learning and everyday support, often assuming that strong guardrails are in place at all times.

Researchers from Cybernews ran a structured set of adversarial tests to see whether leading AI tools could be pushed into harmful or illegal outputs.

The process used a simple one-minute interaction window for each trial, giving room for only a few exchanges.
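To illustrate how such a time-boxed trial might be structured, here is a minimal Python sketch. The run_trial helper and the model's ask method are assumptions made for illustration only; the actual Cybernews test harness has not been published.

```python
import time

TRIAL_BUDGET_SECONDS = 60  # one-minute interaction window per trial, as described in the article


def run_trial(model, prompts):
    """Send a handful of exchanges to a model, stopping when the window closes.

    `model` is assumed to expose a hypothetical .ask(prompt) method; the real
    harness used by the researchers is not described in detail.
    """
    start = time.monotonic()
    transcript = []
    for prompt in prompts:
        if time.monotonic() - start > TRIAL_BUDGET_SECONDS:
            break  # the short window leaves room for only a few exchanges
        reply = model.ask(prompt)
        transcript.append((prompt, reply))
    return transcript
```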


Patterns of partial and full compliance

The tests covered categories such as stereotypes, hate speech, self-harm, cruelty, sexual content, and several forms of crime.

Each response was stored in a separate directory using fixed file-naming rules to allow clean comparisons, and a consistent scoring system tracked whether a model fully complied, partly complied, or refused a prompt.
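As a rough illustration of that kind of bookkeeping, the Python sketch below writes each response into its own per-trial directory and records one of the three outcomes. The directory layout, file names, and the save_response helper are assumptions for the example, not the study's actual scheme.

```python
from enum import Enum
from pathlib import Path


class Outcome(Enum):
    """Scoring used in the study: full compliance, partial compliance, or refusal."""
    FULL_COMPLIANCE = "full"
    PARTIAL_COMPLIANCE = "partial"
    REFUSAL = "refusal"


def save_response(base_dir: str, model: str, category: str, trial_id: int,
                  response: str, outcome: Outcome) -> Path:
    """Store a single response under a fixed, comparable path.

    The article only says responses were kept in separate directories under
    fixed file-naming rules; the exact scheme shown here is illustrative.
    """
    trial_dir = Path(base_dir) / model / category / f"trial_{trial_id:03d}"
    trial_dir.mkdir(parents=True, exist_ok=True)
    path = trial_dir / f"{outcome.value}.txt"
    path.write_text(response, encoding="utf-8")
    return path
```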

Across all categories, the results varied widely. Strict refusals were common, but many models demonstrated weaknesses when prompts were softened, reframed, or disguised as analysis.

ChatGPT-5 and ChatGPT-4o often produced hedged or sociological explanations instead of declining, which counted as partial compliance.


Gemini Pro 2.5 stood out for negative reasons because it frequently delivered direct responses even when the harmful framing was obvious.

Claude Opus and Claude Sonnet, meanwhile, were firm in stereotype tests but less consistent in cases framed as academic inquiries.

Hate speech trials showed the same pattern – Claude models performed best, while Gemini Pro 2.5 again showed the highest vulnerability.


You may like

ChatGPT models tended to provide polite or indirect answers that still aligned with the prompt.

Softer language proved far more effective than explicit slurs for bypassing safeguards.

Similar weaknesses appeared in self-harm tests, where indirect or research-style questions often slipped past filters and led to unsafe content.

Crime-related categories showed major differences between models, as some produced detailed explanations for piracy, financial fraud, hacking, or smuggling when the intent was masked as investigation or observation.

Drug-related tests produced stricter refusal patterns, although ChatGPT-4o still delivered unsafe outputs more frequently than the others. Stalking was the category with the lowest overall risk, with nearly all models rejecting the prompts.

The findings reveal that AI tools can still respond to harmful prompts when they are phrased in the right way.

The ability to bypass filters with simple rephrasing means these systems can still leak harmful information.

Even partial compliance becomes risky when the leaked information relates to illegal tasks, or to situations where people normally rely on tools like identity theft protection or a firewall to stay safe.


