Benchmarking Causal Study to Interpret Large Language Models for Source Code

Published in ICSME, 2023

This paper introduces Galeras, a benchmarking strategy for evaluating the performance of Large Language Models (LLMs) on software engineering tasks. The strategy provides curated testbeds for code completion, code summarization, and commit generation. A case study on ChatGPT demonstrates the positive causal influence of prompt semantics on generative performance and highlights the correlation of confounders, such as prompt size, with accuracy metrics. Galeras aims to reduce confounding bias and offer an interpretable way to analyze accuracy metrics of LLMs.

Download paper here

Recommended citation:

@misc{rodriguezcardenas2023benchmarkingcausalstudyinterpret,
  title={Benchmarking Causal Study to Interpret Large Language Models for Source Code},
  author={Daniel Rodriguez-Cardenas and David N. Palacio and Dipin Khati and Henry Burke and Denys Poshyvanyk},
  year={2023},
  eprint={2308.12415},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2308.12415},
}