AI safety research downside risk

AI safety research that reduces the risk of non-catastrophic accidents or misuse (e.g., hate speech) makes commercial AI more viable, driving AI hype and capabilities research. While important, this research might fail to prevent genuinely catastrophic “black swan” risk. Some of these safety approaches, like RLHF, might perversely be the source of more risk than safety. We know that RL selects for “agent-like” behavior, including instrumental powerseeking and unshutdownability.

I am still very concerned by threats from misuse of near-term AI systems that empower more bad actors, however, and RLHF + content filters seem decent at reducing jailbreaks. I doubt jailbreaking LLMs will become vanishingly hard, though. I am substantially more concerned about the risk of human disempowerment, particularly given the pace of AI commercialization + the apparent difficulty of interpretability.