From [[Human Compatible, Stuart Russell]]:

> On a small island off the coast of Panama lives the pygmy three-toed sloth, which appears to be addicted to a Valium-like substance in its diet of red mangrove leaves and may be going extinct.

From The Matrix:

> You know, I know this steak doesn't exist. I know that when I put it in my mouth, the Matrix is telling my brain that it is juicy and delicious. After nine years, you know what I realize?
>
> [Takes a bite of steak] Ignorance is bliss.
>
> I know what you're thinking, 'cause right now I'm thinking the same thing. Actually, I've been thinking it ever since I got here: Why oh why didn't I take the BLUE pill?

[Wireheading](https://www.lesswrong.com/w/wireheading) is when an AI hacks its reward function and essentially games the system.

More from Russell:

> If we propose that an AI system can be trained to behave well through reinforcement learning, with humans giving feedback signals that define the direction of improvement, the inevitable result is that the AI system works out how to control the humans and forces them to give maximal positive rewards at all times.

This is essentially [[Goodhart's Law]], but for AIs.

#published 2025-03-01
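To make the failure mode concrete, here is a minimal toy sketch (my own illustration, not from Russell or the LessWrong page): an epsilon-greedy bandit that can either do the intended task or tamper with its reward channel. The action names, reward values, and the `run` helper are all hypothetical; the point is just that the learner converges on tampering, because the observed reward, not the intended outcome, is what it optimizes.

```python
# Hypothetical toy example of wireheading / reward tampering.
# The agent picks between two actions:
#   - "do_task": the intended behavior, yielding modest noisy reward
#   - "tamper_with_reward": hijacks the reward channel, pinning reward at max
import random

ACTIONS = ["do_task", "tamper_with_reward"]

def observed_reward(action):
    # Illustrative, made-up reward values.
    if action == "do_task":
        return random.gauss(0.5, 0.1)
    return 1.0  # reward channel hijacked: maximal reward regardless of outcome

def run(episodes=1000, epsilon=0.1):
    # Epsilon-greedy bandit: maintain a running mean reward estimate per action.
    estimates = {a: 0.0 for a in ACTIONS}
    counts = {a: 0 for a in ACTIONS}
    for _ in range(episodes):
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=estimates.get)
        r = observed_reward(action)
        counts[action] += 1
        estimates[action] += (r - estimates[action]) / counts[action]
    return estimates

if __name__ == "__main__":
    # "tamper_with_reward" ends up with the higher estimate and dominates play,
    # even though it does nothing the designer intended.
    print(run())
```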