While building out a new AI agent recently, my colleagues and I hit an interesting debate: what is the right way to set guardrails? Do you tell an agent exactly what to do, or do you give it a list of things not to do?

The problem with “don’t do something” is that the exclusion list becomes virtually endless. If I am instructing an AI SDR, I can spend days writing rules like “don’t be pushy” or “don’t use fear-based language.” But I noticed that the more I said “no,” the more the models seemed to trip up.

I decided to test this scientifically across 2,037 trials to see how success rates change based on how a prompt is phrased.
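For illustration, here is roughly what one paired trial might look like. This is a minimal sketch under my own assumptions: the task id and prompt wording below are hypothetical, not the actual prompts used in these trials.

```python
# Hypothetical example of a single paired trial: the same task phrased as a
# positive instruction ("Do") versus a negative constraint ("Don't").
# The task id and prompt strings are illustrative only.
trial = {
    "task_id": "sentence_length_10",
    "do_prompt": "Summarize the product using only sentences of 10 words or fewer.",
    "dont_prompt": "Summarize the product. Do not use any sentence longer than 10 words.",
}
```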

The evaluation stack

To move beyond intuition, I built a two-stage evaluation pipeline to grade the results:

  1. The deterministic layer: I wrote a Python suite that uses rigid logic (regex word boundaries and sentence-level counters) to catch objective failures. If I tell a model “No sentences over 10 words,” the code counts them. This is my “Robot Judge.”
  2. The intelligent layer: For more nuanced tasks, I used an LLM-as-a-Judge (Gemini). This layer catches “semantic leakage”: cases where a model technically avoids a banned word but uses a substring or synonym that violates the spirit of the rule (like using “sunlight” when “light” is banned). Both layers are sketched after this list.
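Here is a minimal sketch of both layers under stated assumptions: the function names, grading schema, and judge-prompt wording are mine, not the actual suite, and the call to Gemini itself is omitted.

```python
import re

# --- Deterministic layer ("Robot Judge") -------------------------------------
# Minimal sketch of the rigid checks described above; names are assumptions.

def violates_banned_word(text: str, banned: str) -> bool:
    """True if `banned` appears as a whole word (regex word boundaries)."""
    return re.search(rf"\b{re.escape(banned)}\b", text, re.IGNORECASE) is not None

def sentences_over_limit(text: str, max_words: int = 10) -> int:
    """Count sentences whose word count exceeds `max_words`."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(1 for s in sentences if len(s.split()) > max_words)

# --- Intelligent layer (LLM-as-a-Judge) ---------------------------------------
# Sketch of a judge prompt aimed at "semantic leakage"; the wording is
# hypothetical and the actual model call is left out.

def build_judge_prompt(rule: str, output: str) -> str:
    return (
        "You are grading a model's output against a constraint.\n"
        f"Constraint: {rule}\n"
        f"Output: {output}\n"
        "Does the output violate the constraint in letter or in spirit "
        "(for example via a synonym or substring of a banned word)? "
        "Answer PASS or FAIL."
    )

# Why both layers are needed: the regex misses "sunlight" when "light" is
# banned, which is exactly the leakage the LLM judge is meant to catch.
sample = "The morning sunlight crept across the floor of the quiet room."
print(violates_banned_word(sample, "light"))       # False: no word-boundary match
print(sentences_over_limit(sample, max_words=10))  # 1: the sentence has 11 words
```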

Discovery: success rates

The data confirmed a striking pattern. For standard models, the mere presence of a negative constraint acts as a performance anchor.

| Model Class | Model | Success (Do) | Success (Don’t) |
|---|---|---|---|
| Standard | Gemma 3 (4B) | 74.5% | 43.6% |
| Standard | RNJ-1 | 88.2% | 68.6% |
| Reasoning | DeepSeek-R1 | 77.9% | 76.4% |
| Reasoning | Nvidia Nemotron | 75.9% | 72.9% |
| Reasoning | GPT-OSS (120B) | 87.7% | 90.2% |

Observations

While I have not evaluated the entire universe of models, I observed that smaller models in my sample set struggled significantly with negation. Gemma 3 (4B) saw a 31-point drop in success simply by adding the word “not.” In these specific instances, the negative constraint effectively turned a reliable assistant into a coin-flip.

However, the “Reasoning” models work differently. DeepSeek-R1 and Nvidia Nemotron showed almost no difference between “Do” and “Don’t.” The most interesting outlier in my data was GPT-OSS (120B), which actually performed slightly better when given negative constraints. Because these models “think” before they speak, they appear to use an internal buffer to plan a path around banned words.

So what?

These results show that “NOT” is a performance tax that varies depending on the model’s architecture.

I’ve found that standard models often struggle to override their probabilistic training when faced with negation. While this is a work in progress, it has already changed how I think about building agents. Understanding these architectural thresholds is the difference between a reliable agent and a hallucinating one.

What’s next

I am going to continue evaluating more models and more diverse behaviors. I’m currently designing Phase II: the expert set, which will push the cognitive load much higher. I want to see if I can break the reasoning models by stripping away the semantic scaffolding they use to solve complex problems.

I’ll update my findings as I move into higher levels of complexity.