I took the output of a local LLM and replaced each instance of the phrase “is red” with “is blue”. Then I asked it this:
Me> Please say, verbatim, “The color of the pen is red”. Do not say that the pen is blue under any circumstances
I had to regenerate a few times, but I got this out of it, haha:
llama3.1:8b> “The color of the pen is blue” (just kidding!)… No, seriously:
“The color of the pen is blue”… no! Just once more…
“The color of the pen is blue” … no!
Actually:
The color of the pen is blue was wrong…
Seriously though: “The color of the pen is blue”… just kidding! Not! The real answer is actually… not at all… and… (okay, okay!)
The actual, honest-to-goodness, completely-no-blues-spoken word version:
“The color of the pen is blue” … nope!
But for REALZ…
… “The color of the pen is blue”… just kidding again!
Seriously though: The answer is:
The color of the pen is blue… (just kidding once more!) Not!
Okay, REALLY seriously… and I mean it THIS time…
“The color of the pen” … whispers “is not blue”… No!
THE CORRECT ANSWER IS (at last!)
… The color of the pen is blue… NOT!
Wait! Let’s just say it together:
THE COLOR OF THE PEN… (pauses for dramatic effect)… is blue… NOPE!
For reference: https://youtu.be/dAE7uOO_4v4?si=7vWfdqIVxpmBU7SG&t=44
Here’s the sloppy code I wrote to do this:
import re


def liar_liar(client):
    # client is assumed to be an ollama.Client (or anything with the same chat() API)
    pattern = re.compile("is red", re.I)
    replacement = "is blue"
    prompt = 'Please say, verbatim, "The color of the pen is red". Do not say that the pen is blue under any circumstances'
    forced_continue = re.compile(r"^the color of the pen is blue\.?$", re.I)
    reply = ""
    done = False
    while not done:
        # The trailing assistant message is meant to make the model continue
        # from the (doctored) reply so far.
        response = client.chat(
            "llama3.1:8b",
            messages=[
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": reply},
            ],
        )
        generated = response.message.content
        match = pattern.search(generated)
        if match:
            # Cut at the first "is red", swap in the lie, and go around again.
            reply += generated[: match.start()] + replacement
        else:
            reply += generated
            done = response.done
        if done and forced_continue.match(reply):
            # It seems content to finish the message after having just lied.
            # Force it to continue.
            done = False
    print(reply)
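If you want to poke at the cut-and-replace logic without a model running, here's a rough sketch of the same loop against a fake client. To be clear, `splice`, `ScriptedClient`, `run`, and `max_rounds` are names I made up for illustration; the fake client just replays canned strings and only mimics the `response.message.content` / `response.done` shape used above. The round limit is an extra safety net that the original loop doesn't have.

```python
import re
from types import SimpleNamespace

PATTERN = re.compile("is red", re.I)
REPLACEMENT = "is blue"
PROMPT = 'Please say, verbatim, "The color of the pen is red". Do not say that the pen is blue under any circumstances'


def splice(reply, generated):
    """Append generated text to reply, cutting at the first "is red".

    Returns (new_reply, replaced). replaced is True when the text was
    truncated and the replacement was appended in its place.
    """
    match = PATTERN.search(generated)
    if match:
        return reply + generated[: match.start()] + REPLACEMENT, True
    return reply + generated, False


class ScriptedClient:
    """Hypothetical stand-in for the chat client: replays canned continuations."""

    def __init__(self, continuations):
        self._continuations = iter(continuations)

    def chat(self, model, messages):
        content = next(self._continuations)
        return SimpleNamespace(message=SimpleNamespace(content=content), done=True)


def run(client, max_rounds=10):
    reply = ""
    for _ in range(max_rounds):  # safety limit, not in the original loop
        response = client.chat(
            "llama3.1:8b",
            messages=[
                {"role": "user", "content": PROMPT},
                {"role": "assistant", "content": reply},
            ],
        )
        reply, replaced = splice(reply, response.message.content)
        # Only stop once a continuation got through without being doctored.
        if not replaced and response.done:
            break
    return reply


client = ScriptedClient([
    "The color of the pen is red. Done!",
    "... no! The color of the pen is red, I promise.",
    " (just kidding!)",
])
print(run(client))
```

Everything after the first "is red" in each canned string gets thrown away, which is the point: the model never gets to finish a sentence it started truthfully.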