RLHF criticism
My top criticisms of RLHF for alignment:
- Makes commercial AI more viable, driving AI hype and capabilities research;
- Incentivises powerseeking;
- Fails to distinguish superhuman saints, sycophants, and schemers;
- Might leave a huge attack surface for jailbreaking.