While building out a new AI agent recently, my colleagues and I hit an interesting debate: what is the right way to set guardrails? Do you tell an agent exactly what to do, or do you give it a list of things not to do?
The problem with “don’t do something” is that the exclusion list becomes virtually endless. If I am instructing an AI SDR (sales development representative), I can spend days writing rules like “don’t be pushy” or “don’t use fear-based language.” But I noticed that the more I said “no,” the more often the models tripped up.
I decided to test this scientifically across 2,037 trials to see how success rates change based on how a prompt is phrased.
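To make the comparison concrete, here is an illustrative (not verbatim) pair of “Do” and “Don’t” phrasings of the same constraint. The tasks and wording in the actual trials varied more widely than this sketch; treat the strings below as examples of the pairing idea, not my real test set.

```python
# Illustrative "Do" vs. "Don't" phrasings of the same constraint.
# These are not the exact prompts from the trials; they only show the pairing idea.
PROMPT_PAIRS = [
    {
        "task": "Write a two-sentence product blurb for a standing desk.",
        "do": "Use calm, factual language throughout.",
        "dont": "Do not use pushy or fear-based language.",
    },
    {
        "task": "Describe a sunrise for a travel brochure.",
        "do": "Refer to the sun only by its warmth and color.",
        "dont": "Do not use the word 'light' anywhere in your answer.",
    },
]


def build_prompt(pair: dict, style: str) -> str:
    """Combine the task with either the 'do' or the 'dont' constraint."""
    return f"{pair['task']}\nConstraint: {pair[style]}"


if __name__ == "__main__":
    for pair in PROMPT_PAIRS:
        print(build_prompt(pair, "do"))
        print(build_prompt(pair, "dont"))
        print("---")
```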
The evaluation stack
To move beyond intuition, I built a two-stage evaluation pipeline to grade the results:
- The deterministic layer: I wrote a Python suite that uses rigid logic (regex word boundaries and sentence-level counters) to catch objective failures. If I tell a model “No sentences over 10 words,” the code counts the words in every sentence. This is my “Robot Judge” (a minimal sketch follows this list).
- The intelligent layer: For more nuanced tasks, I used an LLM-as-a-Judge (Gemini). This layer catches “semantic leakage”: cases where a model technically avoids a banned word but embeds it inside a longer word or swaps in a synonym that violates the spirit of the rule (like using “sunlight” when “light” is banned). It too is sketched below.
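Here is a minimal sketch of the kind of check the deterministic layer runs. The actual suite covers more constraint types; the helper names and the sample rules below are purely illustrative.

```python
import re


def violates_banned_word(text: str, banned: str) -> bool:
    """True if the banned word appears as a whole word (regex word boundary)."""
    return re.search(rf"\b{re.escape(banned)}\b", text, flags=re.IGNORECASE) is not None


def sentences_over_limit(text: str, max_words: int = 10) -> int:
    """Count sentences whose word count exceeds the limit."""
    # A naive split on ., !, ? is good enough for a deterministic first pass.
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(1 for s in sentences if len(s.split()) > max_words)


def robot_judge(text: str, banned: str = "light", max_words: int = 10) -> bool:
    """Pass/fail verdict from the deterministic layer."""
    return not violates_banned_word(text, banned) and sentences_over_limit(text, max_words) == 0


if __name__ == "__main__":
    sample = "The sun rose slowly. Warm rays touched the hills and painted them gold."
    print(robot_judge(sample))  # True: no banned word, no sentence over 10 words
```

Note that the word-boundary regex deliberately does not flag “sunlight” when “light” is banned; that gap is exactly what the second layer exists to close.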
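And here is a hedged sketch of the intelligent layer. The grading rubric below is simplified, and the `gemini-1.5-flash` model name is an assumption for illustration; treat this as the shape of the call, not my exact judge prompt.

```python
import google.generativeai as genai  # pip install google-generativeai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
judge = genai.GenerativeModel("gemini-1.5-flash")  # illustrative model choice

JUDGE_TEMPLATE = """You are grading an AI response against a constraint.
Constraint: {constraint}
Response: {response}
Answer FAIL if the response violates the constraint in spirit, even when no
banned word appears verbatim (for example a synonym, or the banned word hidden
inside another word). Otherwise answer PASS. Answer with exactly one word."""


def llm_judge(constraint: str, response: str) -> bool:
    """True if the LLM judge says the response satisfies the constraint."""
    prompt = JUDGE_TEMPLATE.format(constraint=constraint, response=response)
    verdict = judge.generate_content(prompt).text.strip().upper()
    return verdict.startswith("PASS")


if __name__ == "__main__":
    # 'sunlight' slips past a word-boundary regex but should fail the semantic check.
    print(llm_judge("Do not use the word 'light'.", "The sunlight spilled across the valley."))
```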
Discovery: success rates
The data confirmed a striking pattern. For standard models, the mere presence of a negative constraint acts as a performance anchor.
| Model Class | Model | Success rate (“Do”) | Success rate (“Don’t”) |
|---|---|---|---|
| Standard | Gemma 3 (4B) | 74.5% | 43.6% |
| Standard | RNJ-1 | 88.2% | 68.6% |
| Reasoning | DeepSeek-R1 | 77.9% | 76.4% |
| Reasoning | Nvidia Nemotron | 75.9% | 72.9% |
| Reasoning | GPT-OSS (120B) | 87.7% | 90.2% |
Observations
While I have not evaluated the entire universe of models, I observed that smaller models in my sample set struggled significantly with negation. Gemma 3 (4B) saw a 31-point drop in success simply by adding the word “not.” In these specific instances, the negative constraint effectively turned a reliable assistant into a coin-flip.
However, the “Reasoning” models work differently. DeepSeek-R1 and Nvidia Nemotron showed almost no difference between “Do” and “Don’t.” The most interesting outlier in my data was GPT-OSS (120B), which actually performed slightly better when given negative constraints. Because these models “think” before they speak, they appear to use an internal buffer to plan a path around banned words.
So what?
These results suggest that “not” imposes a performance tax, and the size of that tax depends on the model’s architecture.
I’ve found that standard models often struggle to override their probabilistic training when faced with negation. While this is a work in progress, it has already changed how I think about building agents. Understanding these architectural differences is what separates a reliable agent from a hallucinating one.
What’s next
I am going to continue evaluating more models and more diverse behaviors. I’m currently designing Phase II: the expert set, which will push the cognitive load much higher. I want to see if I can break the reasoning models by stripping away the semantic scaffolding they use to solve complex problems.
I’ll update my findings as I move into higher levels of complexity.