CEV via corrigible AGI long reflection

I think that trying to directly build a CEV-sovereign AGI is very risky. I think we have a better chance of achieving CEV (assuming we want to) via a simulated “[long reflection](https://forum.effectivealtruism.org/topics/long-reflection)” enacted by corrigible AGI at the behest of human institutions. A doomerist objection is that we might soon be in a “vulnerable world” of multipolar AGIs that can be run on basement hardware, a situation that can only be prevented or policed by a sovereign AGI that operates faster than human institutions. I think this focus is misguided. Simulated “long reflections” need not take long in real time. Also, I have more faith in human coordination around semiconductor control and “weak AGI”-assisted monitoring.

Of course, building a corrigible AGI might be very hard, especially if corrigibility is “anti-natural.” I think the most viable path to success might route through AI-assisted alignment research. Jan Leike calls an AI system built for this purpose an “Alignment MVP.” I think Alignment MVPs need not be built via RLHF alone, although it seems possible (if risky) that RLHF alone might work. Scalable oversight (e.g., via debate or recursive reward modeling, RRM) seems hard but possible to make work. Leveraging the “Simulators” or human-in-the-loop “Cyborgism” frames might help too. The better our understanding of DL science and model cognition, the better the guarantees we can make about an Alignment MVP. Finally, Alignment MVPs can themselves leverage model shaping and transparency tools, reducing the (dangerous) cognitive power these models will need.

Two caveats are worth stating. First, in some worlds, trying to build an Alignment MVP spawns a misaligned mesa-optimizer or similar before you ever get a meaningful bump in alignment research. Thus, my alignment portfolio includes “build weak consequentialists” (e.g., simulators, cyborgs). Second, at some level of capability, predictive models/simulators might themselves spin up a misaligned mesa-optimizer or similar, even without RL incentives. Detecting and shaping learned optimization seems important in such a case.