Sleeper agents: Training deceptive LLMs that persist through safety training
Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez
Paper
From BibTeX import
2024
Notes
Cited in riva2026task as precedent for studying small transformers as model organisms, legitimating the microgpt experimental strategy that anchors the empirical claims.
References
No references yet.