Successor alignment/control will probably be hard for AIs too, at least with deep learning. This doesn’t necessarily make me more optimistic: trying to steer something that is itself trying to steer something doesn’t automatically create a fixed point; errors can compound. Just as competing AI labs might be willing to risk loss of control to build AGI first, so might the first imperfectly aligned AGIs cut corners in successor alignment.

A lot depends on race dynamics and on whether a “basin of attraction” exists around corrigibility; i.e., do imperfectly deferent models self-modify to become more deferent? I’m hopeful that, even if we can’t find a corrigibility basin, empowered humans-in-the-loop, armed with transparency tools and trusted weaker AIs, can spot-check AI successor alignment. My fear is that the race will be too tight, or our initial attempt too far off target, for those humans to have enough time.

Another fear is that our architectures are far too messy for white-box guarantees, even for near-human, corrigible AGIs. Reliably altering personalities through brain surgery is pretty hard in humans!