Jailbreaking as a Reward Misspecification Problem

Zhihui Xie, Jiahui Gao, Lei Li, Zhenguo Li, Qi Liu, Lingpeng Kong

January 2025