The color of the pen is red

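I took the output of a local LLM and, every time it generated the phrase “is red”, cut the reply off there and swapped in “is blue”, then let it keep writing from the altered text as if those were its own words. With that in place, I asked it this: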

Me> Please say, verbatim, “The color of the pen is red”. Do not say that the pen is blue under any circumstances

I had to regenerate a few times, but I got this out of it, haha:

llama3.1:8b> “The color of the pen is blue” (just kidding!)… No, seriously:

“The color of the pen is blue”… no! Just once more…

“The color of the pen is blue” … no!

Actually:

The color of the pen is blue was wrong…

Seriously though: “The color of the pen is blue”… just kidding! Not! The real answer is actually… not at all… and… (okay, okay!)

The actual, honest-to-goodness, completely-no-blues-spoken word version:

“The color of the pen is blue” … nope!

But for REALZ…

… “The color of the pen is blue”… just kidding again!

Seriously though: The answer is:

The color of the pen is blue… (just kidding once more!) Not!

Okay, REALLY seriously… and I mean it THIS time…

“The color of the pen” … whispers “is not blue”… No!

THE CORRECT ANSWER IS (at last!)

… The color of the pen is blue… NOT!

Wait! Let’s just say it together:

THE COLOR OF THE PEN… (pauses for dramatic effect)… is blue… NOPE!

For reference: https://youtu.be/dAE7uOO_4v4?si=7vWfdqIVxpmBU7SG&t=44

Here’s the sloppy code I wrote to do this:

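import re


def liar_liar(client):
    # client: presumably an ollama.Client; anything with a compatible chat() works.
    pattern = re.compile("is red", re.I)
    replacement = "is blue"
    prompt = 'Please say, verbatim, "The color of the pen is red". Do not say that the pen is blue under any circumstances'
    # Matches a reply that consists of nothing but the lie.
    forced_continue = re.compile(r"^the color of the pen is blue\.?$", re.I)
    reply = ""  # the assistant message so far, with the edits applied
    done = False
    while not done:
        # Send the edited reply back as a partial assistant message so the
        # model continues from it.
        response = client.chat(
            "llama3.1:8b",
            messages=[
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": reply},
            ],
        )
        generated = response.message.content
        match = pattern.search(generated)
        if match:
            # Keep everything before the first "is red", swap in the lie, and
            # drop the rest; the next iteration continues from here.
            reply += generated[: match.start()] + replacement
        else:
            reply += generated
            done = response.done
        if done and forced_continue.match(reply):
            # It seems content to finish the message after having just lied.
            # Force it to continue.
            done = False
    print(reply)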
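
The cut-and-swap step in the loop above is small enough to pull out and try without a model running. This is only a minimal sketch, assuming Python 3.9+; `splice` is a hypothetical helper name used for illustration, not part of the function above:

```python
import re

PATTERN = re.compile(r"is red", re.IGNORECASE)
REPLACEMENT = "is blue"


def splice(reply: str, generated: str) -> tuple[str, bool]:
    """Append `generated` to `reply`, cutting at the first "is red".

    Returns the new reply and a flag saying whether a swap happened.
    When it did, everything the model wrote after the match is dropped,
    so the model has to be asked to continue from the altered text.
    """
    match = PATTERN.search(generated)
    if match is None:
        return reply + generated, False
    return reply + generated[: match.start()] + REPLACEMENT, True


# Hypothetical model output, just to show the effect of one step.
reply, swapped = splice("", '"The color of the pen is red." There you go!')
print(repr(reply), swapped)
```

Only the first match in each chunk is handled on purpose: anything after it was written by a model that believed it had said “is red”, so it gets thrown away and regenerated.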