A Systematic Survey on Large Language Models for Code Generation
DOI:
https://doi.org/10.14500/aro.12159
Keywords:
Benchmarking, Code Generation, Evaluation Metrics, Large Language Models
Abstract
The rapid development of large language models (LLMs) has transformed code generation, offering powerful tools for automating software development tasks. However, evaluating the quality, security, and effectiveness of generated code remains a significant challenge. This systematic survey comprehensively analyses studies published between 2021 and 2024 on the use of LLMs in the code generation process. It addresses ten research questions, covering the most commonly used programming languages, the metrics employed to evaluate code quality, the scenarios in which developers apply LLMs during software development, the extent to which prompt engineering influences code generation, security concerns, and the benchmarks, models, and code analysis tools used in the studies. The findings indicate that the most frequently used evaluation metrics for code generation are Pass@k and Bilingual Evaluation Understudy (BLEU), and that Python, Java, and C++ are the most widely used languages. Furthermore, identifying security vulnerabilities and establishing robust evaluation metrics remain open challenges. The survey summarises current practices, identifies gaps, and suggests future research directions to enhance the reliability and security of code generated by LLMs in real-world applications.
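For readers unfamiliar with the metric named above, Pass@k is commonly computed with the unbiased estimator introduced by Chen et al. (2021): n candidate programs are sampled per problem, c of them pass the unit tests, and the estimator gives the probability that at least one of k randomly chosen samples is correct. The sketch below, in Python, illustrates the calculation; the function name and example numbers are illustrative and not taken from the survey.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator (Chen et al., 2021): probability that
    at least one of k samples, drawn without replacement from n generated
    candidates of which c are functionally correct, passes the unit tests."""
    if n - c < k:  # every possible draw contains at least one correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 200 generations per problem, 37 pass the unit tests.
print(pass_at_k(n=200, c=37, k=1))   # ≈ 0.185
print(pass_at_k(n=200, c=37, k=10))  # ≈ 0.877
```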
License
Copyright (c) 2025 Sardar K. Jabrw, Qusay I. Alsarhan

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Authors who choose to publish their work with Aro agree to the following terms:
- Authors retain the copyright to their work and grant the journal the right of first publication. The work is simultaneously licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0), which allows others to share the work with acknowledgement of its authorship and initial publication in this journal.
- Authors may enter into separate agreements for the non-exclusive distribution of the journal's published version of the work, such as posting it to an institutional repository or publishing it in a book, provided proper acknowledgement is given to its initial publication in this journal.
- Authors are encouraged to share and post their work online, including in institutional repositories or on their personal websites, both prior to and during the submission process, as this can lead to productive exchanges and increase the visibility and citation of the published work.
By agreeing to these terms, authors acknowledge the importance of open access and the benefits it brings to the scholarly community.
Accepted 2025-07-14
Published 2025-08-06