April 04, 2023 note

RLHF criticism

My top criticisms of RLHF for alignment:

Makes commercial AI more viable, driving AI hype and capabilities research;
Incentivises power-seeking;
Fails to distinguish superhuman saints, sycophants, and schemers;
Might leave a huge attack surface for jailbreaking.