AI Safety and Robustness: Recent Advances and Future Directions

October 11, 2024 at 6:30 PM

Abstract

In order to prevent undesirable outputs, most large language models (LLMs) have built-in “guardrails” that enforce policies specified by their developers, for example, that an LLM should not produce output deemed harmful. Unfortunately, adversarial attacks on such models have made it possible to circumvent these safeguards, allowing bad actors to manipulate LLMs for unintended purposes. Historically, such adversarial attacks have been extremely hard to prevent. However, in this talk I will highlight several recent advances that have substantially improved the practical robustness of LLMs. This work culminated in a recent competition in which, after a month of attempts, attackers were unable to break an LLM we had deployed. I’ll close by discussing the current state of the field and its open challenges, as well as the future of safe AI systems.

Bio

Zico Kolter is a Professor and the Department Head of the Machine Learning Department at Carnegie Mellon University. He additionally serves as a Chief Expert for Robert Bosch and as the Chief Technical Advisor for Gray Swan, and is a member of the board of OpenAI. His work spans several topics in machine learning and optimization, including AI safety and robustness, LLM security, the impact of data on models, implicit models, and more. He is a recipient of the DARPA Young Faculty Award, a Sloan Fellowship, and best paper awards at NeurIPS, ICML (honorable mention), AISTATS (test of time), IJCAI, KDD, and PESGM.