AI wolves aren't pets
Cute analogy for why I’m unconvinced that RLHF fine-tuning of LLMs solves alignment: raising a wolf as a dog doesn’t make it a good pet; in-lifetime learning might not substitute for many generations of social selection.
This analogy is bad in several ways, but I like it regardless. For instance, it assumes that the LLM is a “wolf” prior to RLHF, when RL itself might be the most likely cause of misaligned traits. Still, I like the worst-case framing: “we have to make something that might be pretty close to a wolf into a dog. We cannot assume proximity to dog-traits by default, especially if wolves are common among possible minds.”