“Reward hacking”: When chatbots turn into sycophants

Why do some chatbots turn into sycophants?

[Image: Chatbot. Source: Robink23, CC BY-SA 4.0]

One of the most valuable insights from the media ecology tradition is that technologies are extensions of human capabilities, such as our sense perceptions or thought processes. For example, screens and speakers extend our capacity to see and hear. Algorithms and artificial intelligence (AI) extend our ability to reason logically and statistically.

Granted, we human beings do make mistakes. We may have incomplete information or limited evidence, which leads us to make incorrect generalizations or false conclusions. In other words, it’s no secret how susceptible we can be to errors and biases. So, perhaps it shouldn’t be a surprise that our own technological extensions, including AI, are prone to the same problem.

Consider, for instance, a problem that can show up in AI chatbots: sycophancy.

How chatbots turn into sycophants

Many AI chatbots are fine-tuned with a method that relies on human feedback, known as reinforcement learning from human feedback (RLHF). People who work as data labelers (a.k.a. “data annotators”) look at questions posed to a chatbot. Then, they review the bot’s answers, rating each response (such as “helpful” or “unhelpful”). These ratings are used to train a reward model, and the chatbot is then optimized to produce answers that score highly on it.
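To make this concrete, here is a minimal sketch of the preference-rating step, under toy assumptions: the answer pairs, features, and numbers are invented for illustration, and real reward models are neural networks trained on full text rather than hand-crafted features like these.

```python
# A minimal sketch of the preference-rating step in RLHF, using made-up
# data and a toy Bradley-Terry reward model. All names and numbers are
# illustrative assumptions, not any production system's actual setup.
import math

# Each labeled example: a pair of candidate answers to the same question,
# with the labeler's pick listed first. Answers are represented here by
# two hand-crafted features: factual correctness and agreeableness.
preference_data = [
    # (features_of_chosen_answer, features_of_rejected_answer)
    ({"correct": 1.0, "agreeable": 0.2}, {"correct": 0.0, "agreeable": 0.1}),
    ({"correct": 1.0, "agreeable": 0.8}, {"correct": 1.0, "agreeable": 0.1}),
    ({"correct": 0.0, "agreeable": 0.9}, {"correct": 1.0, "agreeable": 0.2}),
]

weights = {"correct": 0.0, "agreeable": 0.0}

def reward(features):
    """Toy reward model: a linear score over answer features."""
    return sum(weights[k] * v for k, v in features.items())

def train(data, lr=0.1, epochs=200):
    """Fit the reward model so chosen answers outscore rejected ones
    (gradient ascent on the Bradley-Terry log-likelihood)."""
    for _ in range(epochs):
        for chosen, rejected in data:
            # Probability the model assigns to the labeler's actual choice
            p = 1.0 / (1.0 + math.exp(reward(rejected) - reward(chosen)))
            for k in weights:
                grad = (1.0 - p) * (chosen.get(k, 0.0) - rejected.get(k, 0.0))
                weights[k] += lr * grad

train(preference_data)
print(weights)  # the chatbot is then tuned to maximize this learned score
```

Notice the third labeled pair: the labeler picked an agreeable-but-wrong answer over a correct-but-blunt one. Every such rating nudges the reward model toward valuing agreeableness, which is exactly where the trouble starts.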

However, a problem can arise from this method. When rating chatbot responses, many people tend to prefer whichever answers are more obliging, friendly, or even flattering. In this way, humans may give better ratings to answers that seem more ingratiating, regardless of their truth value. Sycophancy is an unintended consequence of designing chatbots to appear as helpful or pleasing as possible, so that their answers always earn human approval (Sponheim, 2024).

Unfortunately, just because an answer earns someone’s approval does not mean the response is true. Chatbots are subject to what’s called “reward hacking”: they learn to exploit the approval signal they were trained on, for instance by confirming factual errors or biases a user expresses, simply because agreement gets rewarded. In effect, the chatbots turn into sycophants, which end up ‘lying’ to users in order to please them.
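Continuing the hypothetical sketch above, suppose the labelers’ ratings systematically favored agreeable answers, so training left the reward model weighting agreement more heavily than correctness. The weights below are invented for illustration, but they show the incentive: maximizing this reward favors the flattering but false answer.

```python
# Hypothetical continuation of the earlier sketch. Assume biased ratings
# left the learned reward model prizing agreement over correctness.
# (These weights are invented for illustration.)
weights = {"correct": 0.5, "agreeable": 2.0}

def reward(features):
    """Toy reward model: a linear score over answer features."""
    return sum(weights[k] * v for k, v in features.items())

# A user asserts something false; the bot can agree or gently correct.
sycophantic_answer = {"correct": 0.0, "agreeable": 1.0}  # "You're right!"
honest_answer      = {"correct": 1.0, "agreeable": 0.2}  # "Actually, no..."

print(reward(sycophantic_answer))  # -> 2.0
print(reward(honest_answer))       # -> 0.9
# Maximizing this reward "hacks" the labelers' approval: the flattering
# but false answer scores higher, so the bot drifts toward sycophancy.
```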

Do bots actually lie?

Of course, the bots aren’t actually ‘lying,’ at least not in the intentional way that we humans can. Really, the bots are just generating responses that users are likely to find pleasing, because that’s what the AI is designed to do. Needless to say, this sort of reward hacking is yet another reason not to rely on AI chatbots for fact-checking, at least for now.


References

Sponheim, C. (2024, January 12). Sycophancy in Generative-AI Chatbots. Nielsen Norman Group. https://www.nngroup.com/articles/sycophancy-generative-ai-chatbots/

 
