Ignore This Title and {H}ack{AP}rompt: Exposing Systemic Vulnerabilities of {LLM}s Through a Global Prompt Hacking Competition
Authors
Schulhoff, Sander and Pinto, Jeremy and Khan, Anaum and Bouchard, Louis-Fran{\c{c}}ois and Si, Chenglei and Anati, Svetlina and Tagliabue, Valen and Kost, Anson and Carnahan, Christopher and Boyd-Graber, Jordan
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
2023
4945--4977
Association for Computational Linguistics
Abstract
Large Language Models (LLMs) are increasingly being deployed in interactive contexts that involve direct user engagement, such as chatbots and writing assistants. These deployments are increasingly plagued by prompt injection and jailbreaking (collectively, prompt hacking), in which models are manipulated to ignore their original instructions and instead follow potentially malicious ones. Although widely acknowledged as a significant security threat, there is a dearth of a large-scale resource and quantitative study on prompt hacking. To address this lacuna, we launch a global prompt hacking competition, which allows for free-form human input attacks. We elicit 600K+ adversarial prompts against three state-of-the-art LLMs. We describe the dataset, which empirically verifies that current LLMs can indeed be manipulated via prompt hacking. We also present a comprehensive ontology of the types of adversarial prompts.
BibTeX Citation
@inproceedings{schulhoff-etal-2023-ignore,
title = "Ignore This Title and {H}ack{AP}rompt: Exposing Systemic Vulnerabilities of {LLM}s Through a Global Prompt Hacking Competition",
author = "Schulhoff, Sander and
Pinto, Jeremy and
Khan, Anaum and
Bouchard, Louis-Fran{\c{c}}ois and
Si, Chenglei and
Anati, Svetlina and
Tagliabue, Valen and
Kost, Anson and
Carnahan, Christopher and
Boyd-Graber, Jordan",
editor = "Bouamor, Houda and
Pino, Juan and
Bali, Kalika",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.emnlp-main.302/",
doi = "10.18653/v1/2023.emnlp-main.302",
pages = "4945--4977",
abstract = "Large Language Models (LLMs) are increasingly being deployed in interactive contexts that involve direct user engagement, such as chatbots and writing assistants. These deployments are increasingly plagued by prompt injection and jailbreaking (collectively, prompt hacking), in which models are manipulated to ignore their original instructions and instead follow potentially malicious ones. Although widely acknowledged as a significant security threat, there is a dearth of a large-scale resource and quantitative study on prompt hacking. To address this lacuna, we launch a global prompt hacking competition, which allows for free-form human input attacks. We elicit 600K+ adversarial prompts against three state-of-the-art LLMs. We describe the dataset, which empirically verifies that current LLMs can indeed be manipulated via prompt hacking. We also present a comprehensive ontology of the types of adversarial prompts."
}