My top criticisms of RLHF for alignment:

  • Makes commercial AI more viable, driving AI hype and capabilities research;
  • Incentivises powerseeking;
  • Fails to distinguish superhuman saints, sycophants, and schemers;
  • Might leave a huge attack surface for jailbreaking.