Entry Type

Inproceedings

Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition

Authors

Schulhoff, Sander and Pinto, Jeremy and Khan, Anaum and Bouchard, Louis-François and Si, Chenglei and Anati, Svetlina and Tagliabue, Valen and Kost, Anson and Carnahan, Christopher and Boyd-Graber, Jordan

Book Title

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Year

2023

Pages

4945–4977

Publisher

Association for Computational Linguistics

DOI

10.18653/v1/2023.emnlp-main.302

Abstract

Large Language Models (LLMs) are increasingly being deployed in interactive contexts that involve direct user engagement, such as chatbots and writing assistants. These deployments are increasingly plagued by prompt injection and jailbreaking (collectively, prompt hacking), in which models are manipulated to ignore their original instructions and instead follow potentially malicious ones. Although widely acknowledged as a significant security threat, there is a dearth of large-scale resources and quantitative studies on prompt hacking. To address this lacuna, we launch a global prompt hacking competition, which allows for free-form human input attacks. We elicit 600K+ adversarial prompts against three state-of-the-art LLMs. We describe the dataset, which empirically verifies that current LLMs can indeed be manipulated via prompt hacking. We also present a comprehensive ontology of the types of adversarial prompts.

BibTeX Citation

@inproceedings{schulhoff-etal-2023-ignore,
    title = "Ignore This Title and {H}ack{AP}rompt: Exposing Systemic Vulnerabilities of {LLM}s Through a Global Prompt Hacking Competition",
    author = "Schulhoff, Sander  and
      Pinto, Jeremy  and
      Khan, Anaum  and
      Bouchard, Louis-Fran{\c{c}}ois  and
      Si, Chenglei  and
      Anati, Svetlina  and
      Tagliabue, Valen  and
      Kost, Anson  and
      Carnahan, Christopher  and
      Boyd-Graber, Jordan",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.302/",
    doi = "10.18653/v1/2023.emnlp-main.302",
    pages = "4945--4977",
    abstract = "Large Language Models (LLMs) are increasingly being deployed in interactive contexts that involve direct user engagement, such as chatbots and writing assistants. These deployments are increasingly plagued by prompt injection and jailbreaking (collectively, prompt hacking), in which models are manipulated to ignore their original instructions and instead follow potentially malicious ones. Although widely acknowledged as a significant security threat, there is a dearth of a large-scale resource and quantitative study on prompt hacking. To address this lacuna, we launch a global prompt hacking competition, which allows for free-form human input attacks. We elicit 600K+ adversarial prompts against three state-of-the-art LLMs. We describe the dataset, which empirically verifies that current LLMs can indeed be manipulated via prompt hacking. We also present a comprehensive ontology of the types of adversarial prompts."
}