Research Papers

Explore our curated collection of academic papers and research.

Showing 1 to 15 of 20 papers

Inproceedings

{UCL}-Bench: A {C}hinese User-Centric Legal Benchmark for Large Language Models

Authors: Gan, Ruoli and Feng, Duanyu and Zhang, Chen and Lin, Zhihang and Jia, Haochen and Wang, Hao and Cai, Zhenyang and Cui, Lei and Xie, Qianqian and Huang, Jimin and Wang, Benyou

Book: Findings of the Association for Computational Linguistics: NAACL 2025 Year: 2025 pp. 7945--7988

Existing legal benchmarks focusing on knowledge and logic effectively evaluate LLMs on various tasks in legal domain. However, few have explored the practical application of LLMs by actual users. To further assess whether LLMs meet the specific needs of legal practitioners in real-world scenarios, we introduce UCL-Bench, a Chinese User-Centric Legal Benchmark, comprising 22 tasks across 5 distinct legal scenarios.To build the UCL-Bench, we conduct a user survey targeting legal professionals to understand their needs and challenges. Based on the survey results, we craft tasks, verified by legal professionals, and categorized them according to Bloom{'}s taxonomy. Each task in UCL-Bench mirrors real-world legal scenarios, and instead of relying on pre-defined answers, legal experts provide detailed answer guidance for each task, incorporating both ``information'' and ``needs'' elements to mimic the complexities of legal practice. With the guidance, we use GPT-4 as the user simulator and evaluator, enabling multi-turn dialogues as a answer guidance based evaluation framework. Our findings reveal that many recent open-source general models achieve the highest performance, suggesting that they are well-suited to address the needs of legal practitioners. However, these legal LLMs do not outperform ChatGPT, indicating a need for training strategies aligned with users' needs. Furthermore, we find that the most effective models are able to address legal issues within fewer dialogue turns, highlighting the importance of concise and accurate responses in achieving high performance. The code and dataset are available at https://github.com/wittenberg11/UCL-bench.

Inproceedings

All That Glitters is Not Novel: Plagiarism in {AI} Generated Research

Authors: Gupta, Tarun and Pruthi, Danish

Book: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Year: 2025 pp. 25721--25738

Automating scientific research is considered the final frontier of science. Recently, several papers claim autonomous research agents can generate novel research ideas. Amidst the prevailing optimism, we document a critical concern: a considerable fraction of such research documents are smartly plagiarized. Unlike past efforts where experts evaluate the novelty and feasibility of research ideas, we request 13 experts to operate under a different situational logic: to identify similarities between LLM-generated research documents and existing work. Concerningly, the experts identify 24{%} of the 50 evaluated research documents to be either paraphrased (with one-to-one methodological mapping), or significantly borrowed from existing work. These reported instances are cross-verified by authors of the source papers. Experts find an additional 32{%} ideas to partially overlap with prior work, and a small fraction to be completely original. Problematically, these LLM-generated research documents do not acknowledge original sources, and bypass inbuilt plagiarism detectors. Lastly, through controlled experiments we show that automated plagiarism detectors are inadequate at catching plagiarized ideas from such systems. We recommend a careful assessment of LLM-generated research, and discuss the implications of our findings on academic publishing.

Inproceedings

Language Models Resist Alignment: Evidence From Data Compression

Authors: Ji, Jiaming and Wang, Kaile and Qiu, Tianyi Alex and Chen, Boyuan and Zhou, Jiayi and Li, Changye and Lou, Hantao and Dai, Josef and Liu, Yunhuai and Yang, Yaodong

Book: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Year: 2025 pp. 23411--23432

Large language models (LLMs) may exhibit unintended or undesirable behaviors. Recent works have concentrated on aligning LLMs to mitigate harmful outputs. Despite these efforts, some anomalies indicate that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally. Does alignment fine-tuning yield have robust effects on models, or are its impacts merely superficial? In this work, we make the first exploration of this phenomenon from both theoretical and empirical perspectives. Empirically, we demonstrate the \textit{elasticity} of post-alignment models, i.e., the tendency to revert to the behavior distribution formed during the pre-training phase upon further fine-tuning. Leveraging compression theory, we formally deduce that fine-tuning disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude. We validate the presence of elasticity through experiments on models of varying types and scales. Specifically, we find that model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly. Furthermore, we further reveal that elasticity positively correlates with the increased model size and the expansion of pre-training data. Our findings underscore the need to address the inherent elasticity of LLMs to mitigate their resistance to alignment.

Inproceedings

Fairness through Difference Awareness: Measuring $\textit{Desired}$ Group Discrimination in {LLM}s

Authors: Wang, Angelina and Phan, Michelle and Ho, Daniel E. and Koyejo, Sanmi

Book: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Year: 2025 pp. 6867--6893

Algorithmic fairness has conventionally adopted the mathematically convenient perspective of racial color-blindness (i.e., difference unaware treatment). However, we contend that in a range of important settings, group difference awareness matters. For example, differentiating between groups may be necessary in legal contexts (e.g., the U.S. compulsory draft applies to men but not women) and harm assessments (e.g., referring to girls as ``terrorists'' may be less harmful than referring to Muslim people as such). Thus, in contrast to most fairness work, we study fairness through the perspective of treating people differently {---} when it is contextually appropriate to. We first introduce an important distinction between descriptive (fact-based), normative (value-based), and correlation (association-based) benchmarks. This distinction is significant because each category requires separate interpretation and mitigation tailored to its specific characteristics. Then, we present a benchmark suite composed of eight different scenarios for a total of 16k questions that enables us to assess difference awareness. Finally, we show results across ten models that demonstrate difference awareness is a distinct dimension to fairness where existing bias mitigation strategies may backfire.

Inproceedings

Are Rules Meant to be Broken? Understanding Multilingual Moral Reasoning as a Computational Pipeline with {U}ni{M}oral

Authors: Kumar, Shivani and Jurgens, David

Book: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Year: 2025 pp. 5890--5912

Moral reasoning is a complex cognitive process shaped by individual experiences and cultural contexts and presents unique challenges for computational analysis. While natural language processing (NLP) offers promising tools for studying this phenomenon, current research lacks cohesion, employing discordant datasets and tasks that examine isolated aspects of moral reasoning. We bridge this gap with UniMoral, a unified dataset integrating psychologically grounded and social-media-derived moral dilemmas annotated with labels for action choices, ethical principles, contributing factors, and consequences, alongside annotators' moral and cultural profiles. Recognizing the cultural relativity of moral reasoning, UniMoral spans six languages, Arabic, Chinese, English, Hindi, Russian, and Spanish, capturing diverse socio-cultural contexts. We demonstrate UniMoral{'}s utility through a benchmark evaluations of three large language models (LLMs) across four tasks: action prediction, moral typology classification, factor attribution analysis, and consequence generation. Key findings reveal that while implicitly embedded moral contexts enhance the moral reasoning capability of LLMs, there remains a critical need for increasingly specialized approaches to further advance moral reasoning in these models.

Inproceedings

A Theory of Response Sampling in {LLM}s: Part Descriptive and Part Prescriptive

Authors: Sivaprasad, Sarath and Kaushik, Pramod and Abdelnabi, Sahar and Fritz, Mario

Book: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Year: 2025 pp. 30091--30135

Large Language Models (LLMs) are increasingly utilized in autonomous decision-making, where they sample options from vast action spaces. However, the heuristics that guide this sampling process remain under-explored. We study this sampling behavior and show that this underlying heuristics resembles that of human decision-making: comprising a descriptive component (reflecting statistical norm) and a prescriptive component (implicit ideal encoded in the LLM) of a concept. We show that this deviation of a sample from the statistical norm towards a prescriptive component consistently appears in concepts across diverse real-world domains like public health, and economic trends. To further illustrate the theory, we demonstrate that concept prototypes in LLMs are affected by prescriptive norms, similar to the concept of normality in humans. Through case studies and comparison with human studies, we illustrate that in real-world applications, the shift of samples toward an ideal value in LLMs' outputs can result in significantly biased decision-making, raising ethical concerns.

Inproceedings

Do {LLM}s Plan Like Human Writers? Comparing Journalist Coverage of Press Releases with {LLM}s

Authors: Spangher, Alexander and Peng, Nanyun and Gehrmann, Sebastian and Dredze, Mark

Book: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing Year: 2024 pp. 21814--21828

Journalists engage in multiple steps in the news writing process that depend on human creativity, like exploring different ``angles'' (i.e. the specific perspectives a reporter takes). These can potentially be aided by large language models (LLMs). By affecting planning decisions, such interventions can have an outsize impact on creative output. We advocate a careful approach to evaluating these interventions to ensure alignment with human values.In a case study of journalistic coverage of press releases, we assemble a large dataset of 250k press releases and 650k articles covering them. We develop methods to identify news articles that {\_}challenge and contextualize{\_} press releases. Finally, we evaluate suggestions made by LLMs for these articles and compare these with decisions made by human journalists. Our findings are three-fold: (1) Human-written news articles that challenge and contextualize press releases more take more creative angles and use more informational sources. (2) LLMs align better with humans when recommending angles, compared with informational sources. (3) Both the angles and sources LLMs suggest are significantly less creative than humans.

Inproceedings

Measuring Psychological Depth in Language Models

Authors: Harel-Canada, Fabrice Y and Zhou, Hanyu and Muppalla, Sreya and Yildiz, Zeynep Senahan and Kim, Miryung and Sahai, Amit and Peng, Nanyun

Book: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing Year: 2024 pp. 17162--17196

Evaluations of creative stories generated by large language models (LLMs) often focus on objective properties of the text, such as its style, coherence, and diversity. While these metrics are indispensable, they do not speak to a story{'}s subjective, psychological impact from a reader{'}s perspective. We introduce the Psychological Depth Scale (PDS), a novel framework rooted in literary theory that measures an LLM{'}s ability to produce authentic and narratively complex stories that provoke emotion, empathy, and engagement. We empirically validate our framework by showing that humans can consistently evaluate stories based on PDS (0.72 Krippendorff{'}s alpha). We also explore techniques for automating the PDS to easily scale future analyses. GPT-4o, combined with a novel Mixture-of-Personas (MoP) prompting strategy, achieves an average Spearman correlation of 0.51 with human judgment while Llama-3-70B with constrained decoding scores as high as 0.68 for empathy. Finally, we compared the depth of stories authored by both humans and LLMs. Surprisingly, GPT-4 stories either surpassed or were statistically indistinguishable from highly-rated human-written stories sourced from Reddit. By shifting the focus from text to reader, the Psychological Depth Scale is a validated, automated, and systematic means of measuring the capacity of LLMs to connect with humans through the stories they tell.

Inproceedings

An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance

Authors: Khanuja, Simran and Ramamoorthy, Sathyanarayanan and Song, Yueqi and Neubig, Graham

Book: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing Year: 2024 pp. 10258--10279

Given the rise of multimedia content, human translators increasingly focus on culturally adapting not only words but also other modalities such as images to convey the same meaning. While several applications stand to benefit from this, machine translation systems remain confined to dealing with language in speech and text. In this work, we introduce a new task of translating images to make them culturally relevant. First, we build three pipelines comprising state-of-the-art generative models to do the task. Next, we build a two-part evaluation dataset {--} (i) concept: comprising 600 images that are cross-culturally coherent, focusing on a single concept per image; and (ii) application: comprising 100 images curated from real-world applications. We conduct a multi-faceted human evaluation of translated images to assess for cultural relevance and meaning preservation. We find that as of today, image-editing models fail at this task, but can be improved by leveraging LLMs and retrievers in the loop. Best pipelines can only translate 5{%} of images for some countries in the easier concept dataset and no translation is successful for some countries in the application dataset, highlighting the challenging nature of the task. Our project webpage is here: https://machine-transcreation.github.io/image-transcreation and our code, data and model outputs can be found here: https://github.com/simran-khanuja/image-transcreation.

Inproceedings

Mission: Impossible Language Models

Authors: Kallini, Julie and Papadimitriou, Isabel and Futrell, Richard and Mahowald, Kyle and Potts, Christopher

Book: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Year: 2024 pp. 14691--14714

Chomsky and others have very directly claimed that large language models (LLMs) are equally capable of learning languages that are possible and impossible for humans to learn. However, there is very little published experimental evidence to support such a claim. Here, we develop a set of synthetic impossible languages of differing complexity, each designed by systematically altering English data with unnatural word orders and grammar rules. These languages lie on an impossibility continuum: at one end are languages that are inherently impossible, such as random and irreversible shuffles of English words, and on the other, languages that may not be intuitively impossible but are often considered so in linguistics, particularly those with rules based on counting word positions. We report on a wide range of evaluations to assess the capacity of GPT-2 small models to learn these uncontroversially impossible languages, and crucially, we perform these assessments at various stages throughout training to compare the learning process for each language. Our core finding is that GPT-2 struggles to learn impossible languages when compared to English as a control, challenging the core claim. More importantly, we hope our approach opens up a productive line of inquiry in which different LLM architectures are tested on a variety of impossible languages in an effort to learn more about how LLMs can be used as tools for these cognitive and typological investigations.

Inproceedings

Ignore This Title and {H}ack{AP}rompt: Exposing Systemic Vulnerabilities of {LLM}s Through a Global Prompt Hacking Competition

Authors: Schulhoff, Sander and Pinto, Jeremy and Khan, Anaum and Bouchard, Louis-Fran{\c{c}}ois and Si, Chenglei and Anati, Svetlina and Tagliabue, Valen and Kost, Anson and Carnahan, Christopher and Boyd-Graber, Jordan

Book: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing Year: 2023 pp. 4945--4977

Large Language Models (LLMs) are increasingly being deployed in interactive contexts that involve direct user engagement, such as chatbots and writing assistants. These deployments are increasingly plagued by prompt injection and jailbreaking (collectively, prompt hacking), in which models are manipulated to ignore their original instructions and instead follow potentially malicious ones. Although widely acknowledged as a significant security threat, there is a dearth of a large-scale resource and quantitative study on prompt hacking. To address this lacuna, we launch a global prompt hacking competition, which allows for free-form human input attacks. We elicit 600K+ adversarial prompts against three state-of-the-art LLMs. We describe the dataset, which empirically verifies that current LLMs can indeed be manipulated via prompt hacking. We also present a comprehensive ontology of the types of adversarial prompts.

Inproceedings

What the {DAAM}: Interpreting Stable Diffusion Using Cross Attention

Authors: Tang, Raphael and Liu, Linqing and Pandey, Akshat and Jiang, Zhiying and Yang, Gefei and Kumar, Karun and Stenetorp, Pontus and Lin, Jimmy and Ture, Ferhan

Book: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Year: 2023 pp. 5644--5659

Diffusion models are a milestone in text-to-image generation, but they remain poorly understood, lacking interpretability analyses. In this paper, we perform a text-image attribution analysis on Stable Diffusion, a recently open-sourced model. To produce attribution maps, we upscale and aggregate cross-attention maps in the denoising module, naming our method DAAM. We validate it by testing its segmentation ability on nouns, as well as its generalized attribution quality on all parts of speech, rated by humans. On two generated datasets, we attain a competitive 58.8-64.8 mIoU on noun segmentation and fair to good mean opinion scores (3.4-4.2) on generalized attribution. Then, we apply DAAM to study the role of syntax in the pixel space across head{--}dependent heat map interaction patterns for ten common dependency relations. We show that, for some relations, the head map consistently subsumes the dependent, while the opposite is true for others. Finally, we study several semantic phenomena, focusing on feature entanglement; we find that the presence of cohyponyms worsens generation quality by 9{%}, and descriptive adjectives attend too broadly. We are the first to interpret large diffusion models from a visuolinguistic perspective, which enables future research. Our code is at \url{https://github.com/castorini/daam}.

Inproceedings

Don{'}t Generate, Discriminate: A Proposal for Grounding Language Models to Real-World Environments

Authors: Gu, Yu and Deng, Xiang and Su, Yu

Book: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Year: 2023 pp. 4928--4949

A key missing capacity of current language models (LMs) is grounding to real-world environments. Most existing work for grounded language understanding uses LMs to directly generate plans that can be executed in the environment to achieve the desired effects. It thereby casts the burden of ensuring grammaticality, faithfulness, and controllability all on the LMs. We propose Pangu, a generic framework for grounded language understanding that capitalizes on the discriminative ability of LMs instead of their generative ability. Pangu consists of a symbolic agent and a neural LM working in a concerted fashion: The agent explores the environment to incrementally construct valid plans, and the LM evaluates the plausibility of the candidate plans to guide the search process. A case study on the challenging problem of knowledge base question answering (KBQA), which features a massive environment, demonstrates the remarkable effectiveness and flexibility of Pangu: A BERT-base LM is sufficient for setting a new record on standard KBQA datasets, and larger LMs further bring substantial gains.Pangu also enables, for the first time, effective few-shot in-context learning for KBQA with large LMs such as Codex.

Inproceedings

Do Androids Laugh at Electric Sheep? Humor ``Understanding'' Benchmarks from The New Yorker Caption Contest

Authors: Hessel, Jack and Marasovic, Ana and Hwang, Jena D. and Lee, Lillian and Da, Jeff and Zellers, Rowan and Mankoff, Robert and Choi, Yejin

Book: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Year: 2023 pp. 688--714

Large neural networks can now generate jokes, but do they really ``understand'' humor? We challenge AI models with three tasks derived from the New Yorker Cartoon Caption Contest: matching a joke to a cartoon, identifying a winning caption, and explaining why a winning caption is funny. These tasks encapsulate progressively more sophisticated aspects of ``understanding'' a cartoon; key elements are the complex, often surprising relationships between images and captions and the frequent inclusion of indirect and playful allusions to human experience and culture. We investigate both multimodal and language-only models: the former are challenged with the cartoon images directly, while the latter are given multifaceted descriptions of the visual scene to simulate human-level visual understanding. We find that both types of models struggle at all three tasks. For example, our best multimodal models fall 30 accuracy points behind human performance on the matching task, and, even when provided ground-truth visual scene descriptors, human-authored explanations are preferred head-to-head over the best machine-authored ones (few-shot GPT-4) in more than 2/3 of cases. We release models, code, leaderboard, and corpus, which includes newly-gathered annotations describing the image{'}s locations/entities, what{'}s unusual in the scene, and an explanation of the joke.