Benchmarking Causal Study to Interpret Large Language Models for Source Code
Published in ICSME, 2023
This paper introduces a benchmarking strategy named Galeras for evaluating the performance of Large Language Models (LLMs) on software engineering tasks. The strategy includes curated testbeds for code completion, code summarization, and commit generation. The paper presents a case study on the performance of ChatGPT, demonstrating a positive causal influence of prompt semantics on generative performance, and shows how confounders such as prompt size correlate with accuracy metrics. The benchmarking strategy aims to reduce confounding bias and to provide an interpretable way of analyzing accuracy metrics for LLMs. Download paper here
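The paper's causal analysis is more involved than this, but the following minimal sketch illustrates the core idea of adjusting for a confounder such as prompt size when comparing prompt treatments; it is not the Galeras pipeline, and the column names, effect sizes, and synthetic data are purely hypothetical.

```python
# Hypothetical sketch: confounder adjustment for a prompt-treatment effect.
# NOT the Galeras pipeline; column names and coefficients are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500

# Synthetic testbed: prompt_size confounds both the choice of a semantic prompt
# and the resulting BLEU score.
prompt_size = rng.integers(20, 400, size=n)
semantic_prompt = (rng.random(n) < 1 / (1 + np.exp(-(prompt_size - 200) / 50))).astype(int)
bleu = 0.20 + 0.05 * semantic_prompt + 0.0004 * prompt_size + rng.normal(0, 0.02, n)

df = pd.DataFrame(
    {"semantic_prompt": semantic_prompt, "prompt_size": prompt_size, "bleu": bleu}
)

# Naive difference in means is confounded; regression adjustment controls for prompt_size.
naive = df.groupby("semantic_prompt")["bleu"].mean().diff().iloc[-1]
adjusted = smf.ols("bleu ~ semantic_prompt + prompt_size", data=df).fit().params["semantic_prompt"]

print(f"naive difference:  {naive:.4f}")
print(f"adjusted effect:   {adjusted:.4f}  (closer to the simulated true effect of 0.05)")
```

Run as a script, the adjusted estimate recovers the simulated treatment effect far better than the naive comparison, which is the kind of confounding bias the benchmark is designed to expose.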
Recommended citation:
@misc{rodriguezcardenas2023benchmarkingcausalstudyinterpret,
  title         = {Benchmarking Causal Study to Interpret Large Language Models for Source Code},
  author        = {Daniel Rodriguez-Cardenas and David N. Palacio and Dipin Khati and Henry Burke and Denys Poshyvanyk},
  year          = {2023},
  eprint        = {2308.12415},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SE},
  url           = {https://arxiv.org/abs/2308.12415},
}