
Two years ago, I shared the results of a project at TH Köln in which Stephan Wallraven and Tim Köhne adapted the HumanEval benchmark for ABAP code generation. At that time (fall 2023), ChatGPT's success rate for ABAP was significantly lower than for Python, but we caught a glimpse of potential, especially in code explanation. Driven by requests from practitioners eager to measure the actual progress of Generative AI in the SAP ecosystem, we reworked and substantially expanded our benchmark in summer 2025.

What’s New?

To better reflect the reality of modern enterprise development, we enhanced the benchmark in two major ways:

– We expanded the task set to 180, moving beyond basic algorithms to include practical SAP scenarios such as ABAP database operations.

– We implemented a simple „agentic-like“ workflow where LLMs receive feedback from the ABAP compiler and unit tests across up to five cycles. This mimics a developer resolving syntax errors and logical bugs in real time; a minimal sketch of this loop follows below.
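
To make this concrete, the following is a minimal, illustrative sketch of such a feedback loop, not the actual benchmark implementation. The `generate` and `compile_and_test` functions, the prompt format, and the round budget are assumptions for illustration: `generate` stands for a call to the LLM, and `compile_and_test` stands for running the ABAP syntax check plus the task's unit tests.

```python
from typing import Callable, Tuple

MAX_FEEDBACK_ROUNDS = 5  # Round 0 (no feedback) plus up to five repair cycles

def solve_task(task_prompt: str,
               generate: Callable[[str], str],
               compile_and_test: Callable[[str], Tuple[bool, str]]) -> bool:
    """Generate ABAP code for one task and iterate on compiler/test feedback."""
    prompt = task_prompt
    for _ in range(MAX_FEEDBACK_ROUNDS + 1):
        candidate = generate(prompt)                        # LLM proposes ABAP source
        passed, diagnostics = compile_and_test(candidate)   # syntax check + unit tests
        if passed:
            return True
        # Feed the compiler and unit-test diagnostics back for the next attempt
        prompt = (f"{task_prompt}\n\nPrevious attempt:\n{candidate}"
                  f"\n\nCompiler/test feedback:\n{diagnostics}")
    return False
```

A task that passes on the first attempt corresponds to Round 0 in the results below; every further pass through the loop is one feedback cycle.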

Furthermore, we included several open-source LLMs as well as state-of-the-art models such as GPT-5 and Claude-Sonnet-4. Our results show a dramatic shift in performance:

– We found that iteration is a decisive lever. Without feedback (Round 0), success rates are much higher than two years ago but remain modest, with top models solving only about 19% to 24% of tasks. However, the ability to process compiler feedback changes the game: by the fifth iteration, GPT-5 reached a success rate of 77.1%, closely followed by Claude-Sonnet-4 at 74.7%.

– Our new ABAP-related extensions to the benchmark were no harder to solve than the other types of tasks: every task was successfully solved by at least one model.

– We observed that models either solve a task almost perfectly within five rounds or fail completely. This suggests that while AI excels at overcoming syntactic hurdles, deep logical understanding remains the final frontier.

For those interested in a technical deep dive, including detailed error profiles and a survival analysis, the full study, „Benchmarking Large Language Models for ABAP Code Generation“, was published as a preprint in January 2026 (https://arxiv.org/abs/2601.15188). The entire benchmark environment is fully documented on GitHub and freely available. Based on our benchmark, practitioners published first results in February 2026, applying it to current LLMs, including ABAP-1 (https://blog.zeis.de/).

The great attention our study has received in the ABAP community encourages us to continue advancing this research—both by extending the benchmark to other use cases and by further developing the benchmark environment.