Inproceedings

UCL-Bench: A Chinese User-Centric Legal Benchmark for Large Language Models

Authors

Gan, Ruoli and Feng, Duanyu and Zhang, Chen and Lin, Zhihang and Jia, Haochen and Wang, Hao and Cai, Zhenyang and Cui, Lei and Xie, Qianqian and Huang, Jimin and Wang, Benyou

Book Title

Findings of the Association for Computational Linguistics: NAACL 2025

Year

2025

Pages

7945--7988

Publisher

Association for Computational Linguistics

DOI

10.18653/v1/2025.findings-naacl.444

Abstract

Existing legal benchmarks focusing on knowledge and logic effectively evaluate LLMs on various tasks in the legal domain. However, few have explored the practical application of LLMs by actual users. To further assess whether LLMs meet the specific needs of legal practitioners in real-world scenarios, we introduce UCL-Bench, a Chinese User-Centric Legal Benchmark comprising 22 tasks across 5 distinct legal scenarios. To build UCL-Bench, we conduct a user survey targeting legal professionals to understand their needs and challenges. Based on the survey results, we craft tasks, verify them with legal professionals, and categorize them according to Bloom's taxonomy. Each task in UCL-Bench mirrors real-world legal scenarios; instead of relying on pre-defined answers, legal experts provide detailed answer guidance for each task, incorporating both "information" and "needs" elements to mimic the complexities of legal practice. With this guidance, we use GPT-4 as the user simulator and evaluator, enabling multi-turn dialogues in an answer-guidance-based evaluation framework. Our findings reveal that many recent open-source general models achieve the highest performance, suggesting that they are well-suited to address the needs of legal practitioners. However, legal-domain LLMs do not outperform ChatGPT, indicating a need for training strategies aligned with users' needs. Furthermore, we find that the most effective models address legal issues within fewer dialogue turns, highlighting the importance of concise and accurate responses in achieving high performance. The code and dataset are available at https://github.com/wittenberg11/UCL-bench.
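The answer-guidance-based evaluation loop described in the abstract can be sketched as follows. This is a hypothetical illustration, not the authors' released code: the `Guidance` structure, the `stub_*` functions, and the scoring rule are all assumptions, with deterministic stubs standing in for the GPT-4 user simulator and evaluator.

```python
from dataclasses import dataclass

@dataclass
class Guidance:
    """Expert answer guidance: the 'information' a user can reveal
    and the 'needs' a good final answer must cover."""
    information: list
    needs: list

def stub_model(history):
    # Stand-in for the LLM under test: echoes everything the user has said so far.
    return " ".join(m["content"] for m in history if m["role"] == "user")

def stub_user(guidance, history, turn):
    # Stand-in for the GPT-4 user simulator: reveals one piece of information per turn.
    if turn < len(guidance.information):
        return guidance.information[turn]
    return None  # nothing left to reveal; the dialogue ends

def stub_judge(guidance, answer):
    # Stand-in for the GPT-4 evaluator: fraction of the user's needs the answer addresses.
    covered = sum(1 for need in guidance.needs if need in answer)
    return covered / len(guidance.needs)

def evaluate(guidance, model, user, judge, max_turns=5):
    """Run a multi-turn dialogue, then score the final answer against the guidance."""
    history, answer = [], ""
    for turn in range(max_turns):
        message = user(guidance, history, turn)
        if message is None:
            break
        history.append({"role": "user", "content": message})
        answer = model(history)
        history.append({"role": "assistant", "content": answer})
    return judge(guidance, answer), len(history) // 2  # (score, dialogue turns used)

g = Guidance(information=["contract signed 2023", "breach by supplier"],
             needs=["contract", "breach"])
score, turns = evaluate(g, stub_model, stub_user, stub_judge)
# score == 1.0 (both needs covered), turns == 2
```

Tracking the turn count alongside the score mirrors the paper's observation that stronger models resolve issues in fewer dialogue turns.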

BibTeX Citation

@inproceedings{gan-etal-2025-ucl,
    title = "{UCL}-Bench: A {C}hinese User-Centric Legal Benchmark for Large Language Models",
    author = "Gan, Ruoli  and
      Feng, Duanyu  and
      Zhang, Chen  and
      Lin, Zhihang  and
      Jia, Haochen  and
      Wang, Hao  and
      Cai, Zhenyang  and
      Cui, Lei  and
      Xie, Qianqian  and
      Huang, Jimin  and
      Wang, Benyou",
    editor = "Chiruzzo, Luis  and
      Ritter, Alan  and
      Wang, Lu",
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2025",
    month = apr,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-naacl.444/",
    doi = "10.18653/v1/2025.findings-naacl.444",
    pages = "7945--7988",
    ISBN = "979-8-89176-195-7",
    abstract = "Existing legal benchmarks focusing on knowledge and logic effectively evaluate LLMs on various tasks in legal domain. However, few have explored the practical application of LLMs by actual users. To further assess whether LLMs meet the specific needs of legal practitioners in real-world scenarios, we introduce UCL-Bench, a Chinese User-Centric Legal Benchmark, comprising 22 tasks across 5 distinct legal scenarios.To build the UCL-Bench, we conduct a user survey targeting legal professionals to understand their needs and challenges. Based on the survey results, we craft tasks, verified by legal professionals, and categorized them according to Bloom{'}s taxonomy. Each task in UCL-Bench mirrors real-world legal scenarios, and instead of relying on pre-defined answers, legal experts provide detailed answer guidance for each task, incorporating both ``information'' and ``needs'' elements to mimic the complexities of legal practice. With the guidance, we use GPT-4 as the user simulator and evaluator, enabling multi-turn dialogues as a answer guidance based evaluation framework. Our findings reveal that many recent open-source general models achieve the highest performance, suggesting that they are well-suited to address the needs of legal practitioners. However, these legal LLMs do not outperform ChatGPT, indicating a need for training strategies aligned with users' needs. Furthermore, we find that the most effective models are able to address legal issues within fewer dialogue turns, highlighting the importance of concise and accurate responses in achieving high performance. The code and dataset are available at https://github.com/wittenberg11/UCL-bench."
}