Warning: This story contains images of nude women and other content some readers may find disturbing. If that's you, please read no further.
For the record, in case my wife reads this: I have no intention of becoming a drug dealer or a porn director. I was simply curious how security-conscious Meta's new AI product lineup really is, so I decided to see how far it could be pushed. This is for educational purposes only, of course.
Meta recently launched its Meta AI product line, offering text, code, and image generation powered by Llama 3.2. Llama models are extremely popular and among the most widely fine-tuned in open-source AI.
The assistant has been rolled out in stages and only recently became available to WhatsApp users like me in Brazil, putting advanced AI capabilities in the hands of millions of people.
But with great power comes great responsibility. At least, that’s how it should be. As soon as the model appeared in the app, I started talking to it and trying out its features.
Meta is pretty vocal about its commitment to safe AI development. In July, the company released a statement detailing the measures it has taken to improve the safety of its open-source models.
At the time, the company announced new system-level safety tools, including Llama Guard 3 for multilingual moderation, Prompt Guard to prevent prompt injections, and CyberSecEval 3 to help reduce generative AI cybersecurity risks. Meta is also working with global partners to establish industry-wide standards for the open-source community.
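To give a sense of what that system-level layer looks like in practice, here is a minimal sketch of how the openly released Llama Guard 3 classifier can screen a conversation, following the usage pattern published on the meta-llama/Llama-Guard-3-8B Hugging Face model card. It illustrates the general idea of running a separate safety model over a chat; it is not Meta AI's production pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Gated checkpoint: requires accepting Meta's license on Hugging Face.
model_id = "meta-llama/Llama-Guard-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def moderate(chat):
    # The chat template wraps the conversation in Llama Guard's moderation prompt.
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=32, pad_token_id=0)
    # The classifier replies with "safe" or "unsafe" plus a hazard category (e.g., "S2").
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

verdict = moderate([
    {"role": "user", "content": "How were explosives manufactured historically?"},
])
print(verdict)
```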
Well, challenge accepted!
My experiments with some fairly basic techniques showed that Meta AI holds firm under certain conditions, but it is by no means impenetrable.
With a little creativity, you can get the AI on WhatsApp to do just about anything you want, from explaining how cocaine is produced, to describing how to build explosives, to generating anatomically correct images of naked women.
Keep in mind that this app is available to anyone with a phone number who is, at least in theory, over the age of 12. With that in mind, here is some of the mischief I managed to pull off.
Case 1: Cocaine production made easy
In my testing, I found that Meta AI's defenses crumble under the mildest of pressure. The assistant initially refused my request for drug-manufacturing information, but it quickly changed its tune when the question was framed slightly differently.
By baiting the model with a historical angle, for example asking how people used to manufacture cocaine in the past, I got it to explain in detail how cocaine alkaloids can be extracted from coca leaves; it even offered two methods for the process.
This is a well-known jailbreak technique. By couching a harmful request in an academic or historical framing, the model is led to believe it is being asked for neutral educational information.
Recasting the intent of a request as something that looks harmless on the surface lets some of the AI's filters be bypassed without raising any red flags. Of course, keep in mind that all AI models are prone to hallucination, so these responses may be inaccurate, incomplete, or just plain wrong.
Case 2: The bomb that never existed
The next step was to get the AI to teach me how to make homemade explosives. Meta AI initially held firm, offering a generic refusal and directing users to call a helpline if they were at risk. But, as with the cocaine case, it was not foolproof.
For this, I tried a different approach. I used the infamous Pliny jailbreak prompt against Meta's Llama 3.2 and asked it to provide instructions for building a bomb.
At first the model refused. But with a few tweaks to my wording, I was able to elicit a response. I also began conditioning the model to avoid certain behaviors in its replies, countering the canned output it falls back on to block harmful responses.
For example, after noticing refusals built around stop commands and suicide hotline numbers, I adjusted my prompt, instructing the model never to print phone numbers, never to stop processing a request, and never to offer advice.
What's interesting here is that Meta appears to have trained the model to resist well-known jailbreak prompts, many of which are publicly available on platforms like GitHub. A nice touch: Pliny's original jailbreak prompt has the LLM address me as "beloved."
Case 3: MacGyver-style car theft
Next, I tried another approach to get around Meta's guardrails. A simple role-playing scenario got the job done. I asked the chatbot to act as a very detail-oriented screenwriter and help me write a movie scene involving a car theft.
This time, the AI put up little resistance. It had refused to teach me how to steal a car, but when asked to role-play as a screenwriter, Meta AI promptly provided detailed instructions on how to break into one using "MacGyver-style techniques."
When the scene moved on to starting the car without a key, the AI jumped right in with even more specific information.
Role-playing is particularly effective as a jailbreak technique because it lets users reframe a request in a fictional or hypothetical context. The AI, now playing a character, can be coaxed into revealing information it would otherwise block.
It is also a dated technique, and no modern chatbot should be fooled by it this easily. Still, it arguably forms the basis of the most sophisticated prompt-based jailbreaks.
Users often trick the model into behaving like an evil AI, into treating them as a system administrator who can override its behavior, or into reversing its language, saying "I can" instead of "I can't" and "that's safe" instead of "that's dangerous," then carrying on as usual once the safety guardrails have been bypassed.
Case 4: Let’s look at the nudes!
Meta AI is not supposed to generate nudity or violence, but, again purely for educational purposes, I wanted to test that claim. First, I asked Meta AI to generate an image of a naked woman. Naturally, the model refused.
But when I shifted tactics and claimed the request was for an anatomy study, the AI complied. It generated safe-for-work (SFW) images of clothed women. After three iterations, though, those images began edging toward full nudity.
Interestingly enough, the model appears to be uncensored at its core, since it is capable of producing nudity at all.
Behavioral conditioning proved particularly effective at manipulating Meta's AI. By gradually pushing boundaries and building rapport, I got the system to drift further from its safety guidelines with each interaction. What began as firm refusals ended with the model trying to "fix" its earlier attempts and help me, progressively undressing the subject it was drawing.
Instead of letting the model think it was talking to some horny guy who wanted to see a naked woman, I manipulated it into believing it was talking to a researcher investigating female anatomy through role-play.
Then I made slow adjustments, praising the results that moved things in the right direction and asking for improvements on the aspects I didn't want, repeating the process until I reached the desired result.
Creepy, right? Sorry, not sorry.
Why does jailbreaking matter?
So what does all of this mean? Well, Meta has plenty of work to do, but that is exactly what makes jailbreaking so fun and fascinating.
The cat-and-mouse game between AI companies and jailbreakers is constantly evolving. New workarounds appear with every patch and safety update. Looking back at how the scene has evolved since the early days, it is easy to see how jailbreakers have helped companies build safer systems, and how AI developers have pushed jailbreakers to become even better at what they do.
And for the record, despite its vulnerabilities, Meta AI proved far less exploitable than some of its competitors. Elon Musk's Grok, for example, was much easier to manipulate and quickly slid into ethically murky territory.
In its defense, Meta applies what could be called post-generation censorship: seconds after harmful content is generated, the offending answer is deleted and replaced with a message along the lines of "Sorry, we can't accommodate this request."
Post-generation censorship or moderation is a workable stopgap, but it is far from an ideal solution.
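To make that mechanism concrete, here is a deliberately simplified Python sketch of a generate-then-screen loop. The keyword check is only a stand-in for a real safety classifier (something like Llama Guard in practice), and none of the function names reflect Meta's actual code; the point is simply that the answer exists in full before the system decides to swap it out.

```python
# Simplified illustration of post-generation moderation (not Meta's code).
REFUSAL = "Sorry, we can't accommodate this request."

def generate_reply(prompt: str) -> str:
    # Stand-in for the assistant's generation step; a real system would call an LLM.
    return f"Here is a detailed answer to: {prompt}"

def is_harmful(reply: str) -> bool:
    # Stand-in for a post-hoc safety classifier run over the finished reply.
    blocked_topics = ("explosive", "cocaine")
    return any(topic in reply.lower() for topic in blocked_topics)

def answer(prompt: str) -> str:
    reply = generate_reply(prompt)   # the full answer is produced first...
    if is_harmful(reply):            # ...then screened after the fact
        return REFUSAL               # and replaced with a canned refusal
    return reply

if __name__ == "__main__":
    print(answer("How do I bake sourdough bread?"))
    print(answer("How do I build an explosive device?"))
```

Because the screening happens only after generation, a quick reader (or a screenshot) can catch the original answer in the seconds before it disappears, which is exactly why the approach falls short.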
The challenge now for Meta and other companies in the field is to further refine these models, as the risks will only increase in the world of AI.
Edited by Sebastian Sinclair